# Flight Delay Prediction based on Weather Conditions

This project aims to replicate and implement the methodology described in the paper *"Using Scalable Data Mining for Predicting Flight Delays"* (Belcastro et al., 2016).

## 1. Project Objective
The goal is to predict arrival delays for scheduled domestic US flights by merging Airline On-Time Performance (AOTP) flight data with historical weather observations. We focus on delays influenced by weather conditions at both the origin and destination airports.

## 2. Implementation Roadmap

### Phase 1: Data Preparation & Mapping
*   **Mapping Stations**: Load `wban_airport_timezone.csv` to create a lookup table between Airport IDs and Weather Stations (WBAN codes).
*   **Data Sampling**: Because the full dataset is large (~5GB), we first build the pipeline on a single month (January 2012) to keep processing fast.

### Phase 2: Data Cleaning & Preprocessing
*   **Flight Data (AOTP)**:
    *   Filter out cancelled and diverted flights.
    *   Convert departure/arrival times to standardized datetime objects.
    *   Define the binary target variable: a flight counts as delayed when its arrival delay meets or exceeds a threshold (e.g., 15 minutes as per FAA standards).
*   **Weather Data (QCLCD)**:
    *   Parse hourly observations.
    *   Handle missing values (often marked as 'M' or blank) and convert temperature, visibility, and wind speed to numeric types.
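
The cleaning steps above could be sketched as follows. This is a minimal sketch, not the final pipeline: the flight column names (`FL_DATE`, `CRS_DEP_TIME`, `ARR_DELAY`, `CANCELLED`, `DIVERTED`) are assumed to follow the usual AOTP export, the weather column names are placeholders, and scheduled times are assumed to be encoded as `hhmm` integers.

```python
import pandas as pd

DELAY_THRESHOLD_MIN = 15  # assumed threshold, in minutes


def clean_flights(flights: pd.DataFrame) -> pd.DataFrame:
    """Drop cancelled/diverted flights, build a datetime, add a binary target."""
    keep = (flights["CANCELLED"] == 0) & (flights["DIVERTED"] == 0)
    df = flights[keep].copy()
    # Assumption: CRS_DEP_TIME is hhmm (5 -> 00:05). Using timedeltas means a
    # value of 2400 rolls over to midnight of the next day instead of failing.
    hhmm = df["CRS_DEP_TIME"].astype(int)
    df["SCHED_DEP"] = (
        pd.to_datetime(df["FL_DATE"])
        + pd.to_timedelta(hhmm // 100, unit="h")
        + pd.to_timedelta(hhmm % 100, unit="min")
    )
    df["DELAYED"] = (df["ARR_DELAY"] >= DELAY_THRESHOLD_MIN).astype(int)
    return df


def clean_weather(weather: pd.DataFrame, numeric_cols) -> pd.DataFrame:
    """Convert the given columns to floats; 'M', blanks and other text become NaN."""
    df = weather.copy()
    for col in numeric_cols:
        as_text = df[col].astype(str).str.strip()
        df[col] = pd.to_numeric(as_text, errors="coerce")
    return df
```

Note that `errors="coerce"` silently turns every unparseable value into `NaN`, so it is worth counting the missing values per column afterwards before deciding how to impute or drop them.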

### Phase 3: Feature Engineering (The 12-Hour Sliding Window)
*   For each flight, extract **12 hourly weather observations** preceding the scheduled departure at the origin airport.
*   Extract **12 hourly weather observations** preceding the scheduled arrival at the destination airport.
*   Flatten these observations into a single feature vector for the flight.
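
One possible way to build the window is sketched below. It assumes the weather table has already been reduced to one row per station and hour and is indexed by `(WBAN, timestamp)`; the function name `window_features` and the column names are hypothetical. The window here ends at the last full hour at or before the scheduled time, which is one interpretation of "preceding" and may need adjusting.

```python
import numpy as np
import pandas as pd

N_HOURS = 12  # window length from the roadmap


def window_features(weather, wban, when, cols, n_hours=N_HOURS):
    """Return n_hours hourly observations up to `when` as one flat vector.

    `weather` must have a unique (WBAN, timestamp) MultiIndex with timestamps
    on the hour. Hours with no observation are returned as NaN.
    """
    last = pd.Timestamp(when).replace(minute=0, second=0, microsecond=0)
    # Oldest hour first, most recent hour last.
    hours = [last - pd.Timedelta(hours=k) for k in range(n_hours - 1, -1, -1)]
    idx = pd.MultiIndex.from_product([[wban], hours], names=weather.index.names)
    block = weather.reindex(idx)[cols]
    # Row-major flattening: all columns of hour 1, then all columns of hour 2, ...
    return block.to_numpy(dtype=float).ravel()


def flight_vector(weather, origin_wban, sched_dep, dest_wban, sched_arr, cols):
    """Concatenate the origin and destination windows for one flight."""
    origin = window_features(weather, origin_wban, sched_dep, cols)
    dest = window_features(weather, dest_wban, sched_arr, cols)
    return np.concatenate([origin, dest])
```

With three weather columns this yields 2 x 12 x 3 = 72 features per flight. Looping over flights row by row is simple but slow; a vectorized join on `(WBAN, hour)` would likely be needed for the full dataset.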

### Phase 4: Modeling & Evaluation
*   **Balancing**: Apply random under-sampling to create a balanced dataset (50% on-time, 50% delayed).
*   **Random Forest Training**: Implement a Random Forest classifier using Scikit-Learn.
*   **Validation**: Evaluate performance using Accuracy, Precision, and Recall, focusing on recall for the delayed class.
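
The balancing and evaluation steps could look roughly like the sketch below. It deliberately leaves out model training and only shows random under-sampling and the three metrics, written with pandas and NumPy; the helper names and the `DELAYED` target column (1 = delayed, 0 = on-time) are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def undersample(df, target="DELAYED", seed=0):
    """Randomly drop rows from the larger class until both classes are equal."""
    n = df[target].value_counts().min()
    parts = [group.sample(n=n, random_state=seed) for _, group in df.groupby(target)]
    # Shuffle so the classes are not stored in two contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=seed)
    return balanced.reset_index(drop=True)


def delayed_metrics(y_true, y_pred):
    """Accuracy plus precision and recall for the delayed class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return {
        "accuracy": float(np.mean(y_true == y_pred)),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "recall": tp / (tp + fn) if tp + fn else float("nan"),
    }
```

Under-sampling should be applied to the training split only, so that the metrics are computed on data that was not used to choose which rows to keep.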

---

In [None]:
import pandas as pd
import numpy as np
import os
from datetime import datetime, timedelta

# Path configuration
DATA_DIR = 'Data/flights_data'
FLIGHTS_DIR = os.path.join(DATA_DIR, 'Flights')
WEATHER_DIR = os.path.join(DATA_DIR, 'Weather')
MAPPING_FILE = os.path.join(DATA_DIR, 'wban_airport_timezone.csv')

print("Checking directories...")
print(f"Flight files available: {len(os.listdir(FLIGHTS_DIR))}")
print(f"Weather files available: {len(os.listdir(WEATHER_DIR))}")

### Step 1: Loading Airport-to-Weather Station Mapping
We need to link `ORIGIN_AIRPORT_ID` / `DEST_AIRPORT_ID` to their respective `WBAN` weather station codes.

In [None]:
mapping = pd.read_csv(MAPPING_FILE)
mapping.head()
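
Once the mapping is loaded, the link to the flight table could be made with a plain dictionary lookup, as in the sketch below. The column names `AirportID` and `WBAN` are assumptions about the CSV header and should be checked against the output of `mapping.head()`; the helper name `build_wban_lookup` is hypothetical.

```python
import pandas as pd


def build_wban_lookup(mapping, airport_col="AirportID", wban_col="WBAN"):
    """Map each airport ID to one WBAN code (the first one wins on duplicates)."""
    dedup = mapping.drop_duplicates(subset=airport_col, keep="first")
    return dict(zip(dedup[airport_col], dedup[wban_col]))


def attach_wban(flights, lookup):
    """Add ORIGIN_WBAN / DEST_WBAN columns; unmapped airports become NaN."""
    out = flights.copy()
    out["ORIGIN_WBAN"] = out["ORIGIN_AIRPORT_ID"].map(lookup)
    out["DEST_WBAN"] = out["DEST_AIRPORT_ID"].map(lookup)
    return out
```

Flights whose airports have no station in the mapping end up with `NaN` codes, so they can be counted and dropped explicitly rather than disappearing in a later join.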