## Environment and Spark Configuration

In [4]:
import os
os.environ["JAVA_HOME"] = "/usr/local/openjdk-8"
os.environ["SPARK_HOME"] = "/user_data/spark-3.3.0-bin-hadoop2"

import findspark
# Reuse the SPARK_HOME path set above so both settings stay consistent
findspark.init(os.environ["SPARK_HOME"])
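
A quick check before initializing Spark can catch a wrong path early. The sketch below is a minimal illustration and not part of the original notebook; the helper name `missing_env_dirs` is hypothetical.

```python
import os

def missing_env_dirs(names, environ=os.environ):
    """Return the variables in `names` that are unset or do not point to an existing directory."""
    return [n for n in names if not os.path.isdir(environ.get(n, ""))]

# Hypothetical usage with the two variables set in the cell above
problems = missing_env_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables:", ", ".join(problems))
```

An empty result means both variables point to directories that exist on the machine running the notebook.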

## Imports

In [14]:
# Import the required libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.stat import Correlation

spark = (
    SparkSession.builder.appName("spark_flight")
    .config("spark.sql.warehouse.dir", "hdfs:///user/hive/warehouse")
    .config("spark.sql.catalogImplementation", "hive")
    .getOrCreate()
)

24/04/28 18:32:28 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


## Chosen Dataset

The chosen dataset is [Flight Status Prediction (Kaggle link)](https://www.kaggle.com/datasets/robikscube/flight-delay-dataset-20182022/data). It contains a wide range of information about flights, including cancellation and delay data.

Although data is available from 2018 onward, we selected the CSV file for 2022, which contains 1.42 GB of data.

## Reading the Dataset

In [6]:
dataframe = spark.read.csv("hdfs://spark-master:9000/datasets/flights/Combined_Flights_2022.csv", header=True, inferSchema=True)
num_linhas = dataframe.count()
print(f"Number of rows in the DataFrame: {num_linhas}")

Number of rows in the DataFrame: 4078318


## Exploratory Data Analysis (EDA)

In [7]:
# Inspect the data schema
dataframe.printSchema()

root
 |-- FlightDate: timestamp (nullable = true)
 |-- Airline: string (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Cancelled: boolean (nullable = true)
 |-- Diverted: boolean (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- DepTime: double (nullable = true)
 |-- DepDelayMinutes: double (nullable = true)
 |-- DepDelay: double (nullable = true)
 |-- ArrTime: double (nullable = true)
 |-- ArrDelayMinutes: double (nullable = true)
 |-- AirTime: double (nullable = true)
 |-- CRSElapsedTime: double (nullable = true)
 |-- ActualElapsedTime: double (nullable = true)
 |-- Distance: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Quarter: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- Marketing_Airline_Network: string (nullable = true)
 |-- Operated_or_Branded_Code_Share_Partners: string (nullable = tru

In [8]:
# Temporary: work on a 10% sample so the notebook can run; remove later
dataframe_sample = dataframe.sample(fraction=0.1, seed=3)
print(f"Number of rows in the DataFrame sample: {dataframe_sample.count()}")

# Show summary statistics for the attributes
dataframe_sample.summary().show(truncate=False, vertical=True)

Number of rows in the DataFrame sample: 408034

24/04/28 18:14:14 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
-RECORD 0--------------------------------------------------------------
 summary                                 | count                       
 Airline                                 | 408034                      
 Origin                                  | 408034                      
 Dest                                    | 408034                      
 CRSDepTime                              | 408034                      
 DepTime                                 | 395927                      
 DepDelayMinutes                         | 395914                      
 DepDelay                                | 395914                      
 ArrTime                                 | 395555                      
 ArrDelayMinutes                         | 394642                      
 AirTime                    

In [9]:
# Show the first rows of the DataFrame
dataframe.show(n=5, truncate=False, vertical=True)

-RECORD 0----------------------------------------------------------------------------
 FlightDate                              | 2022-04-04 00:00:00                       
 Airline                                 | Commutair Aka Champlain Enterprises, Inc. 
 Origin                                  | GJT                                       
 Dest                                    | DEN                                       
 Cancelled                               | false                                     
 Diverted                                | false                                     
 CRSDepTime                              | 1133                                      
 DepTime                                 | 1123.0                                    
 DepDelayMinutes                         | 0.0                                       
 DepDelay                                | -10.0                                     
 ArrTime                                 | 1228.0     

In [11]:
# Check for null values (one count per column)
Dict_Null = {c: dataframe_sample.filter(dataframe_sample[c].isNull()).count() for c in dataframe_sample.columns}
Dict_Null

{'FlightDate': 0,
 'Airline': 0,
 'Origin': 0,
 'Dest': 0,
 'Cancelled': 0,
 'Diverted': 0,
 'CRSDepTime': 0,
 'DepTime': 12107,
 'DepDelayMinutes': 12120,
 'DepDelay': 12120,
 'ArrTime': 12479,
 'ArrDelayMinutes': 13392,
 'AirTime': 13392,
 'CRSElapsedTime': 0,
 'ActualElapsedTime': 13392,
 'Distance': 0,
 'Year': 0,
 'Quarter': 0,
 'Month': 0,
 'DayofMonth': 0,
 'DayOfWeek': 0,
 'Marketing_Airline_Network': 0,
 'Operated_or_Branded_Code_Share_Partners': 0,
 'DOT_ID_Marketing_Airline': 0,
 'IATA_Code_Marketing_Airline': 0,
 'Flight_Number_Marketing_Airline': 0,
 'Operating_Airline': 0,
 'DOT_ID_Operating_Airline': 0,
 'IATA_Code_Operating_Airline': 0,
 'Tail_Number': 2770,
 'Flight_Number_Operating_Airline': 0,
 'OriginAirportID': 0,
 'OriginAirportSeqID': 0,
 'OriginCityMarketID': 0,
 'OriginCityName': 0,
 'OriginState': 0,
 'OriginStateFips': 0,
 'OriginStateName': 0,
 'OriginWac': 0,
 'DestAirportID': 0,
 'DestAirportSeqID': 0,
 'DestCityMarketID': 0,
 'DestCityName': 0,
 'DestStat

We can see that some values are null (about 3% of the 408034 sampled rows), mainly in the columns related to travel time, such as departure delay, arrival delay, and air time.
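
To put these counts in perspective, they can be expressed as a share of the sample. The snippet below is a small sketch in plain Python that uses numbers copied from the outputs above; the variable names are illustrative.

```python
# Null counts copied from the output above (sample of 408034 rows)
sample_rows = 408034
null_counts = {
    "DepTime": 12107,
    "DepDelayMinutes": 12120,
    "ArrTime": 12479,
    "ArrDelayMinutes": 13392,
    "AirTime": 13392,
    "Tail_Number": 2770,
}

# Percentage of null values per column, largest first
null_pct = {name: 100 * n / sample_rows for name, n in null_counts.items()}
for name, pct in sorted(null_pct.items(), key=lambda kv: -kv[1]):
    print(f"{name:<16} {pct:.2f}%")
```

The time-related columns are each null in roughly 3% of the sampled rows, while `Tail_Number` is null in fewer than 1%.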

## Preprocessing

In [24]:
# Drop ID and name columns from the DataFrame
colunas_para_remover = [
    'DOT_ID_Marketing_Airline', 'DOT_ID_Operating_Airline', 
    'OriginAirportID', 'OriginAirportSeqID', 'OriginCityMarketID', 
    'DestAirportID', 'DestAirportSeqID', 'DestCityMarketID',
    'IATA_Code_Marketing_Airline', 'Flight_Number_Marketing_Airline',
    'IATA_Code_Operating_Airline', 'Tail_Number',
    'Flight_Number_Operating_Airline', 'Airline',
    'Origin',
    'Dest',
    'Marketing_Airline_Network',
    'Operated_or_Branded_Code_Share_Partners',
    'Operating_Airline',
    'OriginCityName',
    'OriginState',
    'OriginStateName',
    'DestCityName',
    'DestState',
    'DestStateName',
    'DepTimeBlk',
    'ArrTimeBlk',
    'FlightDate'
]

dataframe_sem_colunas_de_id = dataframe_sample.drop(*colunas_para_remover)

# Print only the column names
print("Column names:")
for coluna in dataframe_sem_colunas_de_id.columns:
    print(coluna)

Column names:
Cancelled
Diverted
CRSDepTime
DepTime
DepDelayMinutes
DepDelay
ArrTime
ArrDelayMinutes
AirTime
CRSElapsedTime
ActualElapsedTime
Distance
Year
Quarter
Month
DayofMonth
DayOfWeek
OriginStateFips
OriginWac
DestStateFips
DestWac
DepDel15
DepartureDelayGroups
TaxiOut
WheelsOff
WheelsOn
TaxiIn
CRSArrTime
ArrDelay
ArrDel15
ArrivalDelayGroups
DistanceGroup
DivAirportLandings


In [30]:
# Drop rows that contain null values
dataframe_sem_valores_nulos = dataframe_sem_colunas_de_id.dropna()

# Pearson correlation matrix over all remaining columns
assembler = VectorAssembler(inputCols=dataframe_sem_valores_nulos.columns, outputCol="features")
dataframe_vetorizado = assembler.transform(dataframe_sem_valores_nulos)

correlation = Correlation.corr(dataframe_vetorizado, "features", method="pearson").collect()[0][0]

# Show the matrix as a DataFrame, one record per matrix row
rows = correlation.toArray().tolist()
spark.createDataFrame(rows, dataframe_sem_valores_nulos.columns).show(truncate=False, vertical=True)

24/04/28 20:14:44 WARN PearsonCorrelation: Pearson correlation matrix contains NaN values.
-RECORD 0--------------------------------------
 Cancelled            | 1.0                    
 Diverted             | NaN                    
 CRSDepTime           | NaN                    
 DepTime              | NaN                    
 DepDelayMinutes      | NaN                    
 DepDelay             | NaN                    
 ArrTime              | NaN                    
 ArrDelayMinutes      | NaN                    
 AirTime              | NaN                    
 CRSElapsedTime       | NaN                    
 ActualElapsedTime    | NaN                    
 Distance             | NaN                    
 Year                 | NaN                    
 Quarter              | NaN                    
 Month                | NaN                    
 DayofMonth           | NaN                    
 DayOfWeek            | NaN                    
 OriginStateFips      | NaN                  
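
A likely cause of the NaN entries, stated here as an assumption rather than something the notebook verifies: Pearson correlation divides by each column's standard deviation, so a column that is constant after `dropna()` produces an undefined result. `Cancelled` is a plausible candidate, since cancelled flights lack departure and arrival times, and so is `Year`, since the file covers only 2022. The pure-Python sketch below illustrates the effect; `pearson` is an illustrative helper, not a Spark API, and the sample values are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # 0/0: correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

cancelled = [0, 0, 0, 0]                  # constant column, as assumed after dropna()
dep_delay = [-10.0, 5.0, 32.0, 0.0]       # made-up values for illustration
distance = [212.0, 480.0, 1050.0, 300.0]  # made-up values for illustration

print(pearson(cancelled, dep_delay))  # undefined because one column is constant
print(pearson(dep_delay, distance))   # a finite value between -1 and 1
```

If this is indeed the cause, dropping constant columns before assembling the feature vector should remove the NaN values.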

## Data Preprocessing

In [10]:
# Add the binary Severity4 column (1 when Severity equals 4, otherwise 0)
df_filtered = df_filtered.withColumn('Severity4', col('Severity').cast('int'))
df_filtered = df_filtered.withColumn('Severity4', (col('Severity4') == 4).cast('int'))
df_filtered = df_filtered.drop('Severity')
df_filtered.groupBy('Severity4').count().show()



+---------+-------+
|Severity4|  count|
+---------+-------+
|        1|  94421|
|        0|3264408|
+---------+-------+

In [11]:
# Build a feature vector for the model
target = 'Severity4'
feature_columns = df_filtered.columns
feature_columns.remove(target)

vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df_vector = vector_assembler.transform(df_filtered)

In [12]:
# Split the dataset into training and test sets
train_data, test_data = df_vector.randomSplit([0.8, 0.2], seed=42)

## Model

### Random Forest

In [13]:
# Create and train a classification model (Random Forest)
rf_classifier = RandomForestClassifier(labelCol=target, featuresCol="features", numTrees=10)
pipeline = Pipeline(stages=[rf_classifier])
model = pipeline.fit(train_data)

24/02/16 00:07:12 WARN MemoryStore: Not enough space to cache rdd_64_16 in memory! (computed 19.7 MiB so far)
24/02/16 00:07:12 WARN BlockManager: Persisting block rdd_64_16 to disk instead.
24/02/16 00:07:12 WARN MemoryStore: Not enough space to cache rdd_64_17 in memory! (computed 29.6 MiB so far)
24/02/16 00:07:12 WARN BlockManager: Persisting block rdd_64_17 to disk instead.
24/02/16 00:07:15 WARN MemoryStore: Not enough space to cache rdd_64_19 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:15 WARN BlockManager: Persisting block rdd_64_19 to disk instead.
24/02/16 00:07:18 WARN MemoryStore: Not enough space to cache rdd_64_21 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:18 WARN BlockManager: Persisting block rdd_64_21 to disk instead.
24/02/16 00:07:22 WARN MemoryStore: Not enough space to cache rdd_64_20 in memory! (computed 29.6 MiB so far)
24/02/16 00:07:22 WARN BlockManager: Persisting block rdd_64_20 to disk instead.
24/02/16 00:07:24 WARN MemoryStore: Not enough space to cache rdd_64_12 in memory! (computed 19.7 MiB so far)
24/02/16 00:07:24 WARN MemoryStore: Not enough space to cache rdd_64_11 in memory! (computed 19.7 MiB so far)
24/02/16 00:07:24 WARN MemoryStore: Not enough space to cache rdd_64_13 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:24 WARN MemoryStore: Not enough space to cache rdd_64_10 in memory! (computed 29.6 MiB so far)
24/02/16 00:07:25 WARN MemoryStore: Not enough space to cache rdd_64_12 in memory! (computed 3.7 MiB so far)
24/02/16 00:07:25 WARN MemoryStore: Not enough space to cache rdd_64_10 in memory! (computed 8.3 MiB so far)
24/02/16 00:07:25 WARN MemoryStore: Not enough space to cache rdd_64_13 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:25 WARN MemoryStore: Not enough space to cache rdd_64_11 in memory! (computed 19.7 MiB so far)
24/02/16 00:07:27 WARN MemoryStore: Not enough space to cache rdd_64_10 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:27 WARN MemoryStore: Not enough space to cache rdd_64_13 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:27 WARN MemoryStore: Not enough space to cache rdd_64_12 in memory! (computed 8.3 MiB so far)
24/02/16 00:07:27 WARN MemoryStore: Not enough space to cache rdd_64_11 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:29 WARN MemoryStore: Not enough space to cache rdd_64_12 in memory! (computed 8.3 MiB so far)
24/02/16 00:07:29 WARN MemoryStore: Not enough space to cache rdd_64_13 in memory! (computed 3.7 MiB so far)
24/02/16 00:07:29 WARN MemoryStore: Not enough space to cache rdd_64_10 in memory! (computed 13.1 MiB so far)
24/02/16 00:07:29 WARN MemoryStore: Not enough space to cache rdd_64_11 in memory! (computed 19.7 MiB so far)

In [14]:
# Make predictions on the test set
predictions = model.transform(test_data)

In [15]:
# Evaluate the model's performance
evaluator = MulticlassClassificationEvaluator(labelCol="Severity4", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy:", accuracy)



Accuracy: 0.972008995725541
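
Accuracy alone can be misleading when the classes are this imbalanced. As a rough check, the snippet below computes the accuracy of always predicting the majority class, using the `Severity4` counts printed earlier. Those counts cover the full DataFrame rather than the 20% test split, so the comparison is approximate.

```python
# Class counts copied from the groupBy('Severity4') output above
class_counts = {0: 3264408, 1: 94421}

total = sum(class_counts.values())
majority_baseline = max(class_counts.values()) / total

model_accuracy = 0.972008995725541  # value reported by the evaluator above

print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"Model accuracy:          {model_accuracy:.4f}")
print(f"Gain over baseline:      {model_accuracy - majority_baseline:.4f}")
```

The baseline comes out at roughly 0.972, so the reported accuracy is close to what a trivial classifier would achieve. A metric such as `metricName="f1"` in `MulticlassClassificationEvaluator` may give a more informative picture of performance on the minority class.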


                                                                                

In [None]:
# Save the DataFrame in Parquet format
parquet_output_path="hdfs://spark-master:9000/datasets/accidents_output"
df_filtered.write.parquet(parquet_output_path)