---
# Algorithms for Big Data – Project
## Part 4: Modeling and Prediction

**Dataset:** Flight Delay Dataset (sampled and cleaned)

**Authors:**
- Henrique Niza (131898)
- Paulo Francisco Pinto (128962)
- Rute Roque (128919)

In [1]:
# 1. Spark Setup
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("FlightDelayDeploymentBatch").getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/05/24 22:34:54 WARN Utils: Your hostname, MacBook-Pro-de-admin.local, resolves to a loopback address: 127.0.2.3; using 192.168.68.50 instead (on interface en0)
25/05/24 22:34:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/24 22:34:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
# 2. Load the trained model
model_path = "../models/flight_delay_rf_model"
model = PipelineModel.load(model_path)
print(f"Model loaded from: {model_path}")

Model loaded from: ../models/flight_delay_rf_model
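
A batch job may want to fail fast when the saved model directory is missing or incomplete, before `PipelineModel.load` is called. The following is a minimal sketch, assuming the model is stored on the local filesystem (not HDFS or object storage). `looks_like_saved_pipeline` is a hypothetical helper that is not part of the notebook; it relies on Spark ML writing a saved pipeline as a directory containing `metadata` and `stages` subfolders.

```python
from pathlib import Path


def looks_like_saved_pipeline(model_dir: str) -> bool:
    """Return True if model_dir has the on-disk layout of a saved Spark ML pipeline.

    Hypothetical helper: it only checks for the 'metadata' and 'stages'
    subdirectories and only works for local filesystem paths.
    """
    root = Path(model_dir)
    return (root / "metadata").is_dir() and (root / "stages").is_dir()


# Possible usage with the path from this notebook:
# if not looks_like_saved_pipeline("../models/flight_delay_rf_model"):
#     raise FileNotFoundError("Saved pipeline not found or incomplete")
```

The check is deliberately shallow: it does not validate the stages themselves, so `PipelineModel.load` can still fail on a corrupted model.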


In [3]:
# 3. Load new data (here simulated by reusing the same cleaned data)
input_path = "../data/processed/flights_cleaned_sample.parquet"
novos_dados = spark.read.parquet(input_path)
print(f"Number of flights to score: {novos_dados.count()}")

Number of flights to score: 8500634
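
When the input really is new data rather than the training sample, it can help to confirm that the expected columns are present before scoring. This is a small sketch under stated assumptions: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the list only contains the columns this notebook selects after scoring. The feature columns the pipeline itself needs are not listed in the notebook and would have to be added.

```python
from typing import Iterable, List

# Assumption: only the columns this notebook selects after scoring.
# The model's own input feature columns are not shown in the notebook.
REQUIRED_COLUMNS = ["FlightDate", "Airline", "Origin", "Dest", "DepDelay"]


def missing_columns(
    available: Iterable[str],
    required: Iterable[str] = REQUIRED_COLUMNS,
) -> List[str]:
    """Return the required column names absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


# Possible usage with the DataFrame from this notebook
# (DataFrame.columns is a list of column-name strings):
# missing = missing_columns(novos_dados.columns)
# if missing:
#     raise ValueError(f"Input is missing columns: {missing}")
```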


In [4]:
# 4. Apply the model to generate predictions
previsoes = model.transform(novos_dados)
previsoes.select("FlightDate", "Airline", "Origin", "Dest", "DepDelay", "prediction").show(10)

+----------+-----------------+------+----+--------+----------+
|FlightDate|          Airline|Origin|Dest|DepDelay|prediction|
+----------+-----------------+------+----+--------+----------+
|2018-01-06|Endeavor Air Inc.|   CRW| ATL|     9.0|       0.0|
|2018-01-14|Endeavor Air Inc.|   ATL| XNA|    -6.0|       0.0|
|2018-01-27|Endeavor Air Inc.|   LGA| MSY|    -8.0|       0.0|
|2018-01-30|Endeavor Air Inc.|   DTW| MEM|    -4.0|       0.0|
|2018-01-28|Endeavor Air Inc.|   MSP| ROC|    -2.0|       0.0|
|2018-01-05|Endeavor Air Inc.|   DTW| PIT|    -1.0|       0.0|
|2018-01-23|Endeavor Air Inc.|   RDU| LGA|    42.0|       0.0|
|2018-01-12|Endeavor Air Inc.|   ATL| BMI|    64.0|       0.0|
|2018-01-11|Endeavor Air Inc.|   SAT| DTW|    28.0|       0.0|
|2018-01-17|Endeavor Air Inc.|   BNA| LGA|    34.0|       0.0|
+----------+-----------------+------+----+--------+----------+
only showing top 10 rows
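
All ten rows shown are predicted as `0.0`, including flights with departure delays of 28 to 64 minutes. A quick plain-Python sanity check can quantify this on the displayed sample. The sketch rests on two assumptions that the notebook does not state: that a flight counts as delayed when `DepDelay` exceeds 15 minutes, and that `prediction == 1.0` means "delayed". `DELAY_THRESHOLD_MIN` and `confusion_counts` are hypothetical names.

```python
from typing import Dict, Iterable, Tuple

# Assumption: the notebook does not state the labelling rule; 15 minutes
# is used here purely for illustration.
DELAY_THRESHOLD_MIN = 15.0

# (DepDelay, prediction) pairs copied from the ten rows displayed above.
SAMPLE_ROWS = [
    (9.0, 0.0), (-6.0, 0.0), (-8.0, 0.0), (-4.0, 0.0), (-2.0, 0.0),
    (-1.0, 0.0), (42.0, 0.0), (64.0, 0.0), (28.0, 0.0), (34.0, 0.0),
]


def confusion_counts(
    rows: Iterable[Tuple[float, float]],
    threshold: float = DELAY_THRESHOLD_MIN,
) -> Dict[str, int]:
    """Count true/false positives and negatives for a binary delay label."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for dep_delay, prediction in rows:
        actual = dep_delay > threshold   # assumed definition of "delayed"
        predicted = prediction == 1.0    # assumed meaning of the model output
        if actual and predicted:
            counts["tp"] += 1
        elif actual:
            counts["fn"] += 1
        elif predicted:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


counts = confusion_counts(SAMPLE_ROWS)
accuracy = (counts["tp"] + counts["tn"]) / len(SAMPLE_ROWS)
print(counts, f"accuracy={accuracy:.2f}")
```

Under those assumptions, the ten rows give 6 true negatives and 4 false negatives, so no delayed flight in the sample is caught. Ten rows from a single airline and month are not representative, so this only suggests that recall on the delayed class deserves a check over the full prediction set.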


In [5]:
# 5. Save predictions in Parquet format
output_path = "../data/predictions/batch_predictions.parquet"
previsoes.select("FlightDate", "Airline", "Origin", "Dest", "DepDelay", "prediction") \
         .write.mode("overwrite").parquet(output_path)
print(f"Predictions saved successfully to: {output_path}")

25/05/24 22:35:09 WARN MemoryManager: Total allocation exceeds 95,00% (1 020 054 720 bytes) of heap memory
Scaling row group sizes to 95,00% for 8 writers

Predictions saved successfully to: ../data/predictions/batch_predictions.parquet
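
Because the write uses `mode("overwrite")` on a fixed path, each batch run replaces the output of the previous one. If earlier runs should be kept, one option is to derive a run-specific output path. The helper below is a hypothetical sketch, not part of the notebook; the `run_date=` folder naming is an assumption chosen for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_output_path(base_dir: str, run_date: date) -> str:
    """Build an output path with one folder per batch run date.

    Hypothetical helper; the 'run_date=YYYY-MM-DD' layout is an assumption.
    """
    folder = f"run_date={run_date.isoformat()}"
    return str(PurePosixPath(base_dir) / folder / "batch_predictions.parquet")


# Example with an arbitrary date:
print(dated_output_path("../data/predictions", date(2025, 5, 24)))
```

The returned string could then be passed to `.parquet(...)` in place of `output_path`, so that `overwrite` only affects a rerun for the same date.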

