---
# Algorithms for Big Data – Project
## Part 5: Streaming Delay Prediction with Kafka

**Dataset:** Simulated real-time predictions

**Authors:**
- Henrique Niza (131898)
- Paulo Francisco Pinto (128962)
- Rute Roque (128919)

In [7]:
# 1. Spark Streaming Setup
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

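# The Kafka source is not bundled with Spark; it is fetched here via spark.jars.packages.
# Assumption: Spark 3.5.0 built for Scala 2.12. Adjust the coordinate to match your installation.
spark = SparkSession.builder \
    .appName("FlightDelayStreaming") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0") \
    .getOrCreate()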

spark.sparkContext.setLogLevel("WARN")

In [8]:
# 2. Schema definition for the Kafka data (simplified example)
schema = StructType([
    StructField("FlightDate", StringType(), True),
    StructField("Airline", StringType(), True),
    StructField("Origin", StringType(), True),
    StructField("Dest", StringType(), True),
    StructField("DepDelay", DoubleType(), True),
    StructField("DepDel15", DoubleType(), True)
])
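
For reference, a message on the topic would need to be a JSON object whose keys match the field names in the schema above. The sketch below builds one such payload using only the standard library. The sample values (airline code, airports, delay) are invented for illustration and do not come from the dataset, and the reading of `DepDel15` as a 15-minute delay flag is an assumption.

```python
import json

# Hypothetical flight record; keys mirror the StructField names in the schema above.
sample_flight = {
    "FlightDate": "2024-01-15",
    "Airline": "TP",
    "Origin": "LIS",
    "Dest": "OPO",
    "DepDelay": 22.0,
    "DepDel15": 1.0,  # assumed: 1.0 when the departure delay is 15 minutes or more
}

# Kafka message values are bytes, so the record is serialized to UTF-8 JSON.
payload = json.dumps(sample_flight).encode("utf-8")

# Round trip: decoding the payload recovers the original record.
decoded = json.loads(payload.decode("utf-8"))
print(decoded["Origin"], decoded["DepDelay"])
```

Any producer that writes payloads of this shape to the topic should be readable with the schema defined in this cell.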

In [9]:
# 3. Read the Kafka stream (assuming Kafka at localhost:9092 and a topic named "flights")
kafka_stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "flights") \
    .option("startingOffsets", "latest") \
    .load()

AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of Structured Streaming + Kafka Integration Guide.
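
This error usually indicates that the Kafka connector JAR is not on Spark's classpath, since it ships separately from the core distribution. The connector's Maven coordinate has to match both the Scala build and the Spark version in use. The small helper below shows how the coordinate is assembled; `kafka_connector_coordinate` is a hypothetical name, not part of PySpark, and the versions in the usage line are assumptions.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the Structured Streaming Kafka connector."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"

# Assumed versions; in a live session, spark.version reports the Spark version.
coordinate = kafka_connector_coordinate("3.5.0")
print(coordinate)
```

The resulting string can be passed to the `spark.jars.packages` setting when the session is built, or to `spark-submit --packages`. The setting only takes effect when a new session is started, so an already running kernel would likely need a restart.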

In [None]:
# 4. Parse the Kafka message value (bytes) as JSON into typed columns
df_parsed = kafka_stream.selectExpr("CAST(value AS STRING) as json") \
    .select(from_json(col("json"), schema).alias("data")).select("data.*")
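
One consequence of this parsing step is that malformed messages do not raise an error: they produce rows whose columns are all null, and a missing key produces null in that column. The plain-Python sketch below only emulates that behaviour for illustration. `parse_like_from_json` is a hypothetical helper and does not reproduce Spark's type coercion.

```python
import json

FIELDS = ["FlightDate", "Airline", "Origin", "Dest", "DepDelay", "DepDel15"]

def parse_like_from_json(raw: bytes) -> dict:
    """Plain-Python stand-in for the parsing cell: bad or missing input becomes None."""
    try:
        record = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        record = {}
    if not isinstance(record, dict):
        record = {}
    # Keep only the schema fields; absent keys map to None.
    return {name: record.get(name) for name in FIELDS}

good = parse_like_from_json(b'{"Airline": "TP", "DepDelay": 5.0}')
bad = parse_like_from_json(b"not json")
print(good)
print(bad)
```

In the streaming job, such rows could be dropped before scoring, for example with `df_parsed.dropna()`. Whether that is appropriate depends on how the model's pipeline handles nulls.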

In [None]:
# 5. Load the trained model
model_path = "../models/flight_delay_rf_model"
model = PipelineModel.load(model_path)

In [None]:
# 6. Apply the model to the stream
df_previsao = model.transform(df_parsed)

In [None]:
# 7. Write to the console for debugging (in a real environment, write to an appropriate sink)
query = df_previsao.select("FlightDate", "Airline", "Origin", "Dest", "DepDelay", "prediction") \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .option("truncate", False) \
    .start()

query.awaitTermination()