# Flight Delay Prediction ✈️🕗

This notebook covers the steps taken to obtain the data, explore it, and preprocess it so that a machine learning model can interpret it more easily.


## Step 0: Defining the problem statement

**Objective:**  
To determine whether a flight will be delayed and, if so, by how much (in minutes).  

**Speculated Solution:**  
Build a classification model to predict whether a flight will be delayed, and a regression model to predict the length of the delay.

## Step 1: Getting the data

The data used for this problem is available on [Kaggle](https://www.kaggle.com/datasets/yuanyuwendymu/airline-delay-and-cancellation-data-2009-2018).



- Use Kaggle's API to download the data into the Colab Environment
- Get the utility functions that may help later.
- Configure data files to read using Python

The data contains multiple features for each year from 2009 to 2018.

Glossary of the features:

| Name      | Meaning |
| :----------------:        |    :-------------------:   |
| FL_DATE      | Date of the Flight       |
| OP_CARRIER   | Airline Identifier        |
| OP_CARRIER_FL_NUM   | Flight Number        |
| ORIGIN   | Starting Airport Code        |
| DEST   | Destination Airport Code        |
| CRS_DEP_TIME   | Planned Departure Time        |
| DEP_TIME   | Actual Departure Time        |
| DEP_DELAY   | Total Delay on Departure in minutes        |
| TAXI_OUT    | Time duration elapsed between departure from the origin airport gate and wheels off        |
| WHEELS_OFF    | Time point that the aircraft's wheels leave the ground        |
| WHEELS_ON    | Time point that the aircraft's wheels touch on the ground        |
| TAXI_IN    | Time duration elapsed between wheels-on and gate arrival at the destination airport        |
| CRS_ARR_TIME    | Planned arrival time       |
| ARR_TIME     | Actual Arrival Time       |
| ARR_DELAY     | Total Delay on Arrival in minutes       |
| CANCELLED     | Flight Cancelled (1 = cancelled)       |
| CANCELLATION_CODE     | Reason for cancellation of the flight (`A - Airline/Carrier; B - Weather; C - National Air System; D - Security`)   |
| DIVERTED     | Aircraft landed at a different airport than the one scheduled   |
| CRS_ELAPSED_TIME     | Planned time amount needed for the flight trip   |
| ACTUAL_ELAPSED_TIME     | `AIR_TIME`+ `TAXI_IN` + `TAXI_OUT`   |
| AIR_TIME     | The time duration between wheels_off and wheels_on time   |
| DISTANCE     | Distance between two airports   |
| CARRIER_DELAY     | Delay caused by the airline in minutes   |
| WEATHER_DELAY     | Delay caused by weather   |
| NAS_DELAY     | Delay caused by air system   |
| SECURITY_DELAY      | Delay caused by security reasons   |
| LATE_AIRCRAFT_DELAY      | Delay caused by the late arrival of a previous aircraft  |
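
The glossary defines `ACTUAL_ELAPSED_TIME` as the sum of `AIR_TIME`, `TAXI_IN` and `TAXI_OUT`. A minimal sketch of how that identity could be checked is shown below. The rows are made-up toy values, not records from the dataset, and `ELAPSED_OK` is a hypothetical column name used only for illustration.

```python
import pandas as pd

# Toy rows with made-up values (assumption: not taken from the real dataset).
toy = pd.DataFrame({
    "TAXI_OUT": [15.0, 20.0],
    "AIR_TIME": [100.0, 45.0],
    "TAXI_IN": [5.0, 7.0],
    "ACTUAL_ELAPSED_TIME": [120.0, 72.0],
})

# Recompute the elapsed time from its three components.
recomputed = toy["TAXI_OUT"] + toy["AIR_TIME"] + toy["TAXI_IN"]

# True where the recorded value matches the glossary definition.
# ELAPSED_OK is a hypothetical helper column, not part of the dataset.
toy["ELAPSED_OK"] = recomputed.eq(toy["ACTUAL_ELAPSED_TIME"])
print(toy)
```

On the real data, rows where the check fails would be worth inspecting before trusting any of the four columns.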



In [None]:
# Getting the helper functions script
!wget https://raw.githubusercontent.com/ishandandekar/Airline-delay-prediction/main/src/utils/utils.py

# Install the kaggle library
!pip install -q kaggle

# Upload the Kaggle API keys
from google.colab import files
files.upload()

# Create the config folder (-p avoids an error if it already exists)
!mkdir -p ~/.kaggle

# Copy the json file to the folder
!cp kaggle.json ~/.kaggle

# Change permissions for keys to work with the Kaggle API
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
!kaggle datasets download -d yuanyuwendymu/airline-delay-and-cancellation-data-2009-2018 --quiet

# Creating a directory to store all kinds of data
!mkdir -p data

# Unzip data
from utils import unzip_data
unzip_data('airline-delay-and-cancellation-data-2009-2018.zip', data_dir="data/raw")

## Step 2: Know more about the data
- Load in the data using Pandas
- Optimize data for faster reading
- Get the statistics about the data
- Fix missing/incorrect values
- Analyze features
- Summarize observations

In [None]:
from glob import glob

from tqdm.notebook import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

csv_files = glob("data/raw/*.csv")
dfs = []

for f in tqdm(csv_files):
  df = pd.read_csv(f)
  dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
print(f"Shape of the data: {df.shape}")

In [None]:
# Getting the memory and data type information of the data
df.info()

In [None]:
# Converting repetitive string columns to the category dtype to reduce memory usage
categorical_cols = ['OP_CARRIER', 'ORIGIN', 'DEST']

for col in categorical_cols:
  df[col] = df[col].astype("category")

# Checking the memory usage again
df.info(verbose=False)

In [None]:
# Checking null values
df.isnull().sum() * 100 / len(df)

Per Wikipedia: https://www.wikiwand.com/en/Flight_cancellation_and_delay

```
Delays are divided into three categories, namely "on time or small delay" (up to 15 minutes delay), "Medium delay" (15 – 45 minutes delay) and "Large delay" (≥ 45 minutes delay).
```
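
The three categories quoted above could be derived from a delay column with `pd.cut`. This is only a sketch: the sample delays are made up, and the handling of the exact 15 and 45 minute boundaries is an assumption, since the quoted ranges overlap at their edges.

```python
import numpy as np
import pandas as pd

# Made-up arrival delays in minutes (assumption: illustrative values only).
delays = pd.Series([-5.0, 10.0, 15.0, 30.0, 45.0, 120.0], name="ARR_DELAY")

# Assumed boundaries: [-inf, 15) small, [15, 45) medium, [45, inf) large.
bins = [-np.inf, 15, 45, np.inf]
labels = ["on time or small delay", "medium delay", "large delay"]

# right=False makes each bin include its left edge and exclude its right edge.
delay_category = pd.cut(delays, bins=bins, labels=labels, right=False)
print(delay_category)
```

Changing `right=False` to `right=True` would move delays of exactly 15 and 45 minutes into the lower category instead.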

--------
Code snippet to use for parsing `HHMM` time strings:
```python
import pandas as pd

t = "1318"
pd.to_datetime(t, format="%H%M")
# Timestamp('1900-01-01 13:18:00')
```
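
The time columns may be read from the CSV files as numbers rather than strings (for example `1318.0`, or `5.0` for 00:05). That is an assumption about the data, not something verified here. Under that assumption, a hypothetical helper `hhmm_to_timestamp` could zero-pad each value before handing it to `pd.to_datetime`:

```python
import pandas as pd

def hhmm_to_timestamp(value):
    """Parse an HHMM-coded number such as 1318.0 into a Timestamp.

    Hypothetical helper for illustration. Missing values and codes that
    are not valid clock times (e.g. 2400) come back as NaT.
    """
    if pd.isna(value):
        return pd.NaT
    text = f"{int(value):04d}"  # 5.0 -> "0005"
    return pd.to_datetime(text, format="%H%M", errors="coerce")

print(hhmm_to_timestamp(1318.0))
print(hhmm_to_timestamp(5.0))
print(hhmm_to_timestamp(2400.0))
```

Applying a Python function row by row is slow on a dataset this large, so this is better suited to spot checks than to converting whole columns.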

In [None]:
# Adding column to check whether it was delayed or not
df['FLIGHT_STATUS'] = df['ARR_DELAY'] > 0

> To do: check whether the recorded delays are correct by subtracting the scheduled time columns from the actual time columns.
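
One way the subtraction check in the note above might look is sketched here for the departure columns. The rows are made-up toy values, `hhmm_to_minutes` and `DELAY_OK` are hypothetical names, and the wrap-around step assumes no flight departs more than 12 hours early or late.

```python
import pandas as pd

def hhmm_to_minutes(col: pd.Series) -> pd.Series:
    """Convert HHMM-coded numbers (e.g. 1318.0) to minutes after midnight."""
    return (col // 100) * 60 + (col % 100)

# Toy rows with made-up values (assumption: not taken from the real dataset).
toy = pd.DataFrame({
    "CRS_DEP_TIME": [1300.0, 2350.0, 600.0],
    "DEP_TIME": [1318.0, 10.0, 555.0],
    "DEP_DELAY": [18.0, 20.0, -5.0],
})

# Actual minus planned departure, in minutes.
diff = hhmm_to_minutes(toy["DEP_TIME"]) - hhmm_to_minutes(toy["CRS_DEP_TIME"])

# Wrap into [-720, 720) so a departure just after midnight compares correctly.
# Assumption: true delays stay within 12 hours in either direction.
diff = (diff + 720) % 1440 - 720

# DELAY_OK is a hypothetical helper column marking rows that agree.
toy["DELAY_OK"] = diff.eq(toy["DEP_DELAY"])
print(toy)
```

The same pattern could be applied to `ARR_TIME`, `CRS_ARR_TIME` and `ARR_DELAY`, although arrival times may also be affected by time zone differences between airports.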