---
# Algorithms for Big Data – Project
## Part 3: Modeling and Prediction

**Dataset:** Flight Delay Dataset (sampled and cleaned)

**Authors:**
- Henrique Niza (131898)
- Paulo Francisco Pinto (128962)
- Rute Roque (128919)

In [1]:
# Import the Spark libraries needed for modeling and evaluation
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

In [2]:
# Create the SparkSession
# spark = SparkSession.builder.appName("FlightDelayModeling").getOrCreate()

spark = SparkSession.builder \
    .appName("FlightDelayModeling") \
    .config("spark.driver.memory", "6g") \
    .getOrCreate()




In [3]:
# Path to the cleaned data
input_path = "../data/processed/flights_cleaned_sample.parquet"
data = spark.read.parquet(input_path)

---
#### Feature Selection for Modeling

In [4]:
# Target variable: DepDel15 (1 if the departure delay is 15 minutes or more, 0 otherwise)
# Selected features (relevant numeric and categorical columns)
features = [
    "Month", "DayofMonth", "DayOfWeek", "CRSDepTime", "Distance",
    "DepTimeBlk", "Airline", "Origin", "Dest"
]

In [5]:
# Index the categorical variables
cat_features = ["DepTimeBlk", "Airline", "Origin", "Dest"]
indexers = [StringIndexer(inputCol=c, outputCol=c+"_Idx") for c in cat_features]
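
As a rough illustration of what these indexers do, the sketch below mimics in plain Python the default `StringIndexer` ordering (most frequent label gets index 0, with ties broken alphabetically in recent Spark versions) and the `handleInvalid="keep"` option, which sends labels unseen during fitting to one extra index instead of raising an error at transform time. The helpers `fit_index` and `transform_label` and the sample airport codes are hypothetical; this is not Spark's implementation.

```python
from collections import Counter

def fit_index(values):
    """Map each label to an index: most frequent first, ties alphabetical."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_label(mapping, label, handle_invalid="error"):
    """Look up a label; unseen labels either raise or go to one extra bucket."""
    if label in mapping:
        return mapping[label]
    if handle_invalid == "keep":
        # One additional index, equal to the number of labels seen when fitting
        return float(len(mapping))
    raise ValueError(f"Unseen label: {label}")

# Made-up sample of origin airports (illustrative only)
origins = ["ATL", "ORD", "ATL", "LIS", "ORD", "ATL"]
mapping = fit_index(origins)
print(mapping)
print(transform_label(mapping, "OPO", handle_invalid="keep"))
```

In the notebook's own code this would correspond to passing `handleInvalid="keep"` to each `StringIndexer`, which may matter if a rare airport appears only in the test split.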

In [6]:
# Assembler for the feature vector
input_features = ["Month", "DayofMonth", "DayOfWeek", "CRSDepTime", "Distance"] + [c+"_Idx" for c in cat_features]
assembler = VectorAssembler(inputCols=input_features, outputCol="features")

In [7]:
# Label indexer
label_indexer = StringIndexer(inputCol="DepDel15", outputCol="label")

In [8]:
# Classifier
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=20, seed=42, maxBins=512)
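
The `maxBins=512` setting is presumably there because Spark's tree learners require `maxBins` to be at least the number of categories of any categorical feature, and indexed columns such as `Origin_Idx` can have several hundred distinct values. A minimal sketch of how that lower bound could be checked follows; `required_max_bins` is a hypothetical helper and the cardinalities are made-up placeholders, not values measured on this dataset.

```python
def required_max_bins(cardinalities, default=32):
    """Smallest maxBins that covers every categorical feature (and the Spark default of 32)."""
    return max([default, *cardinalities.values()])

# Placeholder counts of distinct values per indexed column (assumed, not measured)
cardinalities = {
    "DepTimeBlk_Idx": 19,
    "Airline_Idx": 20,
    "Origin_Idx": 380,
    "Dest_Idx": 380,
}
print(required_max_bins(cardinalities))
```

On the real data, each count could be obtained with something like `data.select(c).distinct().count()` for every `c` in `cat_features`.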

In [9]:
# Pipeline
pipeline = Pipeline(stages=indexers + [label_indexer, assembler, rf])

---

In [10]:
# Train/test split
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

In [11]:
# Model training
model = pipeline.fit(train_data)

25/05/24 22:24:25 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
25/05/24 22:25:49 WARN MemoryStore: Not enough space to cache rdd_79_1 in memory! (computed 93.8 MiB so far)
25/05/24 22:25:49 WARN BlockManager: Persisting block rdd_79_1 to disk instead.

In [12]:
# Predictions and evaluation
predictions = model.transform(test_data)

In [13]:
# Accuracy evaluation
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print(f"RandomForest model accuracy: {accuracy:.4f}")

RandomForest model accuracy: 0.8274

In [14]:
# Confusion matrix
confusion = predictions.groupBy("label", "prediction").count().orderBy("label", "prediction")
confusion.show()



+-----+----------+-------+
|label|prediction|  count|
+-----+----------+-------+
|  0.0|       0.0|1406579|
|  1.0|       0.0| 293409|
+-----+----------+-------+
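
The counts above are enough to work out accuracy and per-class recall by hand. Combinations that never occur (here, any `prediction = 1.0`) do not appear in a `groupBy` count and are treated as zero. A small sketch in plain Python, using only the numbers printed above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above; missing cells are zero
tn = 1406579  # label 0, predicted 0
fn = 293409   # label 1, predicted 0
fp = 0        # label 0, predicted 1 (no such row)
tp = 0        # label 1, predicted 1 (no such row)

total = tn + fn + fp + tp
accuracy = (tp + tn) / total
majority_share = (tn + fp) / total   # share of flights with label 0
recall_delayed = tp / (tp + fn)      # fraction of label-1 flights that were caught

print(f"total rows       = {total}")
print(f"accuracy         = {accuracy:.4f}")
print(f"majority share   = {majority_share:.4f}")
print(f"recall (label 1) = {recall_delayed:.4f}")
```

Precision for label 1 is left out on purpose: with no positive predictions its denominator is zero, so it is undefined.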




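The notebook is thus ready to be continued with hyperparameter tuning, additional metrics, or other models. The confusion matrix has no rows with `prediction = 1.0`, so the model assigns every test flight to the on-time class and the 0.8274 accuracy equals the share of on-time flights in the test set (1,406,579 of 1,699,988). Metrics such as recall on the delayed class would be more informative than accuracy alone.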
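
One possible follow-up for the tuning step, given the roughly 83/17 class split visible in the confusion matrix, is to weight the minority class. The sketch below computes inverse-frequency weights in plain Python from the test-set counts, used here only as a stand-in, since in practice they would come from the training split. `balanced_weights` is a hypothetical helper, and passing such weights to `RandomForestClassifier` through its `weightCol` parameter assumes Spark 3.0 or later.

```python
def balanced_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Label counts taken from the confusion matrix (test split, as a stand-in)
counts = {0.0: 1406579, 1.0: 293409}
weights = balanced_weights(counts)
for label, w in weights.items():
    print(f"class {label}: weight {w:.3f}")
```

A weight column derived from `DepDel15` could then be added to the training DataFrame and named in `weightCol`; this has not been run against the notebook's data.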

In [15]:
# Save the trained model
model_path = "../models/flight_delay_rf_model"
model.write().overwrite().save(model_path)
print(f"Model saved successfully to: {model_path}")

Model saved successfully to: ../models/flight_delay_rf_model


In [16]:
# Model evaluation (repeats the accuracy check from cells 12 and 13)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")

accuracy = evaluator.evaluate(predictions)
print(f"Model accuracy: {accuracy:.4f}")

Model accuracy: 0.8274

In [17]:
predictions.select("features", "label", "prediction").show(10)


+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|[1.0,1.0,1.0,2200...|  1.0|       0.0|
|[1.0,1.0,1.0,1747...|  0.0|       0.0|
|[1.0,1.0,1.0,1200...|  0.0|       0.0|
|[1.0,1.0,1.0,2040...|  0.0|       0.0|
|[1.0,1.0,1.0,1000...|  0.0|       0.0|
|[1.0,1.0,1.0,1600...|  0.0|       0.0|
|[1.0,1.0,1.0,1510...|  0.0|       0.0|
|[1.0,1.0,1.0,1035...|  0.0|       0.0|
|[1.0,1.0,1.0,1126...|  0.0|       0.0|
|[1.0,1.0,1.0,2230...|  1.0|       0.0|
+--------------------+-----+----------+
only showing top 10 rows

