# Actual Arrival Delay Prediction

## Pre-processing and Training Data Development

The purpose of this project is to build a model that predicts the likelihood and duration of flight delays with a specified level of accuracy, helping travelers meet specific arrival time requirements.

The source data comes from the US Domestic Flights Delay (2013–2018) dataset, which includes scheduled and actual departure and arrival times. Collected by the U.S. Office of Airline Information, Bureau of Transportation Statistics (BTS), the dataset covers flights between 2014 and 2018 and provides details such as date, time, origin, destination, airline, distance, and delay status. (Source: [Kaggle](https://www.kaggle.com/datasets/gabrielluizone/us-domestic-flights-delay-prediction-2013-2018))

During the Data Wrangling stage of the project, the raw data was collected, evaluated, and cleaned. The resulting dataset was stored in the pickle data format.

In the Exploratory Data Analysis (EDA) stage, the following factors were explored and tested:
*	Airports of departure and arrival
*	Airlines operating the flight
*	Month of the flight
*	Day of the week of the flight
*	Time of day (categorized into time blocks) for departure and arrival

All these variables are categorical by nature (even though the month and day of the week are stored as integers) and need to be encoded for modeling purposes.

### Loading packages and data

In [1]:
import pandas as pd
import numpy as np
import pickle
# from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

input_file_name = '../data/processed/processed_2014_2018_final.pickle'
with open(input_file_name, 'rb') as in_file:
    flights = pickle.load(in_file)

### Creating dummy features for categorical variables

In [2]:
flights[['Origin', 'Dest']].nunique()

Origin    368
Dest      368
dtype: int64

As we can see, there are 368 airports for both Origin and Destination. Converting them into dummy features would result in more than 730 features, which is excessive for the model, especially since some of these airports are used infrequently. It is reasonable to apply a threshold that separates major airports from minor ones. I propose using the cumulative share of total flights for this purpose: airports are ranked by number of flights, and those within the top cumulative share are kept. The code below computes the share from departures (the `Origin` column) and applies the resulting list of major airports to both Origin and Destination.

In [3]:
# Calculation of cumulative share of the domestic flights
number_of_flights = flights.groupby('Origin', observed=False)[['CRSDepDT']].count().sort_values(by='CRSDepDT', ascending=False)
number_of_flights['Share'] = number_of_flights['CRSDepDT'] / np.sum(number_of_flights['CRSDepDT'])
number_of_flights['Cum_share'] = number_of_flights['Share'].cumsum()

# Applying the THRESHOLD to identify the major airports responsible for the specified share of domestic flights
THRESHOLD = 0.95
main_airports = list(number_of_flights[number_of_flights['Cum_share'] <= THRESHOLD].index)
print('The total number of airports that cumulatively account for {:.0%} of domestic flights is {:,d}'.format(THRESHOLD, len(main_airports)))

The total number of airports that cumulatively account for 95% of domestic flights is 129
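
As a hedged variation on the cell above, the ranking could count both departures and arrivals, so that the threshold reflects all traffic at an airport. The sketch below is illustrative only: the function name `major_airports` and the toy data are assumptions, not part of the project code.

```python
import pandas as pd

def major_airports(flights: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return airports whose cumulative share of movements is within the threshold."""
    # Count each flight once at its origin and once at its destination
    movements = pd.concat([flights['Origin'], flights['Dest']]).astype(str).value_counts()
    # value_counts sorts in descending order, so cumsum gives the cumulative share
    cum_share = (movements / movements.sum()).cumsum()
    return list(cum_share[cum_share <= threshold].index)

# Toy data (hypothetical, not taken from the project dataset)
toy = pd.DataFrame({
    'Origin': ['ATL'] * 6 + ['ORD'] * 3 + ['XNA'],
    'Dest':   ['ORD'] * 5 + ['XNA'] + ['ATL'] * 3 + ['ATL'],
})
toy_major = major_airports(toy, threshold=0.95)
print(toy_major)
```

Because the filter uses `<=`, the airport whose flights push the cumulative share past the threshold is itself excluded, which matches the behavior of the original cell.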


Marking minor airports, which together account for roughly the remaining 5% of domestic flights, as ‘OTHER’ significantly reduces the number of dummy variables for the Origin and Destination airports: from 734 (2 × 367) to 258 (2 × 129), given that one baseline category per variable is dropped during encoding.

In [4]:
# Converting the categories of minor Origin and Destination airports to the 'OTHER' category
flights['Origin_'] = pd.Categorical(np.where(~flights['Origin'].isin(main_airports), 'OTHER', flights['Origin']))
flights['Dest_'] = pd.Categorical(np.where(~flights['Dest'].isin(main_airports), 'OTHER', flights['Dest']))

In [5]:
print(flights[['Dest_', 'Origin_']].nunique())

Dest_      130
Origin_    130
dtype: int64


We can see that each column now contains 130 categories (129 major airports plus ‘OTHER’). Because the first category of each variable is dropped during encoding, they will be converted into 258 dummy variables in the next step.

In [6]:
# List of predictive categorical factors
factors = ['Month', 'Weekday', 'Reporting_Airline', 'Origin_', 'Dest_', 'DepTimeBlk', 'ArrTimeBlk']

# Creating two target variables: Cancelled flights and Actual Arrival Delay
Cancelled = flights['Cancelled']
# float16 saves memory but represents integers exactly only up to 2048
ActArrDelay = flights['ActArrDelay'].astype('float16')

# Truncating the dataset to reduce memory usage;
# .copy() avoids a SettingWithCopyWarning on the assignment below
flights = flights[factors].copy()

# Converting 'Month' and 'Weekday' to 'category' data type for memory efficiency
flights[['Month', 'Weekday']] = flights[['Month', 'Weekday']].astype('category')
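
To illustrate why the `category` conversion saves memory, here is a small sketch on synthetic data. It assumes the month is stored as `int64`, which may not match the actual dtype in the processed dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 'Month' column, assumed here to be int64
month = pd.Series(rng.integers(1, 13, size=100_000), dtype='int64')
month_cat = month.astype('category')

# A categorical stores one small integer code per row plus the list of categories
print('int64 bytes:   ', month.memory_usage(index=False, deep=True))
print('category bytes:', month_cat.memory_usage(index=False, deep=True))
```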

In [7]:
# Converting all categorical data to dummy variables in a single call;
# the first category of each factor is dropped to avoid multicollinearity.
model_data = pd.get_dummies(flights, columns=factors, drop_first=True)
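
The effect of `drop_first=True` is easiest to see on a toy frame. The sketch below also shows two `pd.get_dummies` options, `dtype` and `sparse`, that might help with memory on a dataset of this size. Whether they are needed here is an assumption; the toy data is hypothetical.

```python
import pandas as pd

# Toy frame with two categorical factors of three levels each
toy = pd.DataFrame({
    'Month': pd.Categorical([1, 2, 3, 1]),
    'Origin_': pd.Categorical(['ATL', 'ORD', 'OTHER', 'ATL']),
})
cols = ['Month', 'Origin_']

# Dense encoding with a compact integer dtype; the first level of each
# factor (Month 1, Origin_ ATL) becomes the implicit baseline
dense = pd.get_dummies(toy, columns=cols, drop_first=True, dtype='uint8')
print(dense)

# Sparse encoding stores only the non-default entries of each dummy column
sparse = pd.get_dummies(toy, columns=cols, drop_first=True, sparse=True)
print(sparse.dtypes)
```

Rows that belong to the baseline level of every factor are encoded as all zeros, which is what removes the exact linear dependence among the dummy columns.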

### Creating Training and Testing Datasets

I plan to create two prediction models: one to predict the likelihood of flight cancellation and another to estimate the arrival delay, both using features such as flight date, time, airline, and airport. To accomplish this, the feature dataset and the two target variables (flight cancellation and actual arrival delay) will be split into training and testing sets in a 70/30 ratio.

In [8]:
X_train, X_test, y_arr_train, y_arr_test, y_cncl_train, y_cncl_test = train_test_split(
    model_data, ActArrDelay, Cancelled, test_size=0.3, random_state=1812)

In [9]:
print('Training features shape: {}'.format(X_train.shape))
print('Test features shape: {}'.format(X_test.shape))
print('Target variable (actual arrival delay) training shape: {}'.format(y_arr_train.shape))
print('Target variable (actual arrival delay) test shape: {}'.format(y_arr_test.shape))
print('Target variable (cancellation) training shape: {}'.format(y_cncl_train.shape))
print('Target variable (cancellation) test shape: {}'.format(y_cncl_test.shape))

Training features shape: (20910038, 330)
Test features shape: (8961446, 330)
Target variable (actual arrival delay) training shape: (20910038,)
Target variable (actual arrival delay) test shape: (8961446,)
Target variable (cancellation) training shape: (20910038,)
Target variable (cancellation) test shape: (8961446,)
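
If cancellations turn out to be rare, a stratified split is one option for keeping the cancellation rate equal in both sets. This is a different technique from the plain random split used above, shown here on synthetic data. The 2% cancellation rate and the feature names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins for the feature matrix and the two targets
X = pd.DataFrame({'f1': rng.integers(0, 2, n), 'f2': rng.integers(0, 2, n)})
cancelled = pd.Series(np.r_[np.ones(20), np.zeros(n - 20)].astype(int))  # 2% positives
delay = pd.Series(rng.normal(5, 30, n))

# stratify=cancelled preserves the share of cancelled flights in both subsets
X_tr, X_te, d_tr, d_te, c_tr, c_te = train_test_split(
    X, delay, cancelled, test_size=0.3, random_state=1812, stratify=cancelled)

print('Cancellation rate, train: {:.3f}'.format(c_tr.mean()))
print('Cancellation rate, test:  {:.3f}'.format(c_te.mean()))
```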


### Standardization of numeric features

The only numeric variable in this dataset is the Actual Arrival Delay, which is the target rather than a predictive feature. All predictive features are 0/1 dummy variables that already share the same scale, so standardization is not needed.

## Conclusions

1.	__Data Preprocessing and Feature Engineering:__  
*	By applying a cumulative share threshold, minor airports were grouped under the ‘OTHER’ category, reducing the number of airport dummy variables from 734 to 258, which makes the model more efficient.
*	The categorical features (month, weekday, origin and destination airports, airline, and time blocks) were encoded into dummy variables. Dropping the first category of each feature avoids the exact linear dependence among dummy columns that would otherwise introduce multicollinearity.
*	Two target variables were identified: flight cancellation (binary) and actual arrival delay (numeric). These two target types require different modeling approaches, but both are prepared within the same framework.
*	Standardization is typically used when the predictive features have different scales. Since all predictive features are 0/1 dummy variables and the only numeric variable is the target (actual arrival delay), standardization was deemed unnecessary.

2.	__Training and Testing Split:__ 
*	The dataset was split into training and testing sets in a 70/30 ratio, a common practice that leaves most of the data for training while holding out enough for evaluation. This split allows the model’s performance to be assessed on unseen data.

3.	__Modeling Potential:__  
*	The framework set up here allows for the development of predictive models that estimate flight delays and cancellation likelihood from the airline, the origin and destination airports, the time of day, the day of the week, and the month.
*	Future steps would involve training, testing, and evaluating the models, using classification for cancellations and regression for delay prediction.

4.	__Next Steps:__  
*	The next steps in this project include applying machine learning algorithms to train the model, evaluating its performance with appropriate metrics (e.g., accuracy, precision, recall for classification; RMSE for regression), and refining the model based on the results.
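
The metrics named above can be sketched with plain NumPy. The helper names `classification_metrics` and `rmse` and the toy predictions are hypothetical, meant only to show what each metric measures.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)    # cancelled and predicted cancelled
    fp = np.sum(~y_true & y_pred)   # not cancelled but predicted cancelled
    fn = np.sum(y_true & ~y_pred)   # cancelled but missed
    accuracy = float(np.mean(y_true == y_pred))
    precision = float(tp / (tp + fp)) if (tp + fp) else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) else 0.0
    return accuracy, precision, recall

def rmse(y_true, y_pred):
    """Root mean squared error for a numeric target."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy cancellation labels: 2 cancelled flights out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
print(classification_metrics(y_true, y_pred))

# A model that never predicts a cancellation has the same accuracy but zero recall
print(classification_metrics(y_true, [0] * 10))

# Toy arrival delays in minutes
print(rmse([10, -5, 30], [12, -5, 26]))
```

The second call shows why accuracy alone can mislead when one class is rare, which is the reason precision and recall are listed alongside it.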