# Flight Delay Prediction ✈️🕗

This notebook covers the steps taken to obtain the data, explore it, and preprocess it so that a machine learning model can interpret it more easily.


## Step 0: Defining the problem statement

**Objective:**  
To determine whether a flight is going to be delayed and, if so, by how much (in minutes).  

**Proposed Solution:**  
Build a classification model to predict whether a flight will be delayed, and a regression model to predict the length of the delay.

## Step 1: Getting the data

The data used for this problem is available on [Kaggle](https://www.kaggle.com/datasets/yuanyuwendymu/airline-delay-and-cancellation-data-2009-2018).



- Use Kaggle's API to download the data into the Colab Environment
- Get the utility functions that may help later.
- Configure data files to read using Python

The data contains multiple features for each year from 2009 to 2018.

Glossary of the features:

| Name      | Meaning |
| :----------------:        |    :-------------------:   |
| FL_DATE      | Date of the Flight       |
| OP_CARRIER   | Airline Identifier        |
| OP_CARRIER_FL_NUM   | Flight Number        |
| ORIGIN   | Starting Airport Code        |
| DEST   | Destination Airport Code        |
| CRS_DEP_TIME   | Planned Departure Time        |
| DEP_TIME   | Actual Departure Time        |
| DEP_DELAY   | Total Delay on Departure in minutes        |
| TAXI_OUT    | Time duration elapsed between departure from the origin airport gate and wheels off        |
| WHEELS_OFF    | Time point that the aircraft's wheels leave the ground        |
| WHEELS_ON    | Time point that the aircraft's wheels touch the ground        |
| TAXI_IN    | Time duration elapsed between wheels-on and gate arrival at the destination airport        |
| CRS_ARR_TIME    | Planned arrival time       |
| ARR_TIME     | Actual Arrival Time       |
| ARR_DELAY     | Total Delay on Arrival in minutes       |
| CANCELLED     | Flight Cancelled (1 = cancelled)       |
| CANCELLATION_CODE     | Reason for cancellation of the flight (`A - Airline/Carrier; B - Weather; C - National Air System; D - Security`)   |
| DIVERTED     | Aircraft landed at a different airport than the one scheduled   |
| CRS_ELAPSED_TIME     | Planned time amount needed for the flight trip   |
| ACTUAL_ELAPSED_TIME     | `AIR_TIME`+ `TAXI_IN` + `TAXI_OUT`   |
| AIR_TIME     | The time duration between wheels_off and wheels_on time   |
| DISTANCE     | Distance between two airports   |
| CARRIER_DELAY     | Delay caused by the airline in minutes   |
| WEATHER_DELAY     | Delay caused by weather   |
| NAS_DELAY     | Delay caused by air system   |
| SECURITY_DELAY      | Delay caused by security reasons   |
| LATE_AIRCRAFT_DELAY      | Delay caused by the late arrival of the aircraft from a previous flight  |
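
The relationship given for `ACTUAL_ELAPSED_TIME` in the glossary can be sanity-checked directly. The following is a minimal sketch on a hand-made frame; the values are illustrative and not taken from the dataset, and `ELAPSED_MISMATCH` is a hypothetical helper column.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the glossary
toy = pd.DataFrame({
    "TAXI_OUT": [12.0, 20.0],
    "AIR_TIME": [95.0, 140.0],
    "TAXI_IN": [6.0, 9.0],
    "ACTUAL_ELAPSED_TIME": [113.0, 170.0],
})

# Recompute the elapsed time from its three components
recomputed = toy["TAXI_OUT"] + toy["AIR_TIME"] + toy["TAXI_IN"]

# Flag rows where the recorded value disagrees with the sum
toy["ELAPSED_MISMATCH"] = recomputed != toy["ACTUAL_ELAPSED_TIME"]
print(toy)
```

On the real data, rows flagged this way would be candidates for closer inspection rather than automatic removal.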



In [None]:
# Getting the helper functions script
!wget https://raw.githubusercontent.com/ishandandekar/Airline-delay-prediction/main/src/utils/utils.py

# Install the kaggle library
!pip install -q kaggle

# Upload the Kaggle API keys
from google.colab import files
files.upload()

# -p avoids an error if the folder already exists (e.g. when the cell is re-run)
!mkdir -p ~/.kaggle

# Copy the json file to the folder
!cp kaggle.json ~/.kaggle

# Change permissions for keys to work with the Kaggle API
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset
!kaggle datasets download -d yuanyuwendymu/airline-delay-and-cancellation-data-2009-2018 --quiet

# Creating a directory to store all kinds of data
!mkdir -p data

# Unzip data
from utils import unzip_data
unzip_data('airline-delay-and-cancellation-data-2009-2018.zip', data_dir="data/raw")

## Step 2: Know more about the data
- Load in the data using Pandas
- Optimize data for faster reading
- Get the statistics about the data
- Fix missing/incorrect values
- Analyze features
- Summarize observations
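
One way to approach the "optimize data for faster reading" step is to declare column types when the file is read, instead of converting them afterwards. The sketch below uses a tiny in-memory CSV with made-up values as a stand-in for a yearly file; the dtype choices are assumptions, not the notebook's final configuration.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for one of the yearly CSV files (values are made up)
sample_csv = io.StringIO(
    "FL_DATE,OP_CARRIER,ORIGIN,DEST,DEP_DELAY,CANCELLED\n"
    "2009-01-01,XE,DCA,EWR,-2.0,0.0\n"
    "2009-01-01,XE,EWR,IAD,,1.0\n"
)

# Declaring dtypes up front avoids loading these columns as object/float64 first
dtypes = {
    "OP_CARRIER": "category",
    "ORIGIN": "category",
    "DEST": "category",
    "DEP_DELAY": "float32",  # float keeps the NaN of cancelled flights
}

sample = pd.read_csv(sample_csv, dtype=dtypes, parse_dates=["FL_DATE"])
sample["CANCELLED"] = sample["CANCELLED"].astype(bool)

print(sample.dtypes)
print(f"{sample.memory_usage(deep=True).sum() / 1e6:.6f} MB")
```

Categorical columns pay off most when a column has few distinct values relative to its length, which is likely the case for carrier and airport codes.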

In [None]:
from glob import glob
from datetime import datetime

from tqdm.notebook import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

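def validate_int2str(col):
    """Convert an HHMM-encoded value (e.g. 1455.0) to a time string; return NaN if invalid."""
    try:
        # Missing values stay missing; 0 (midnight) is a valid time and is kept
        if pd.isna(col):
            return np.nan
        # Zero-pad so that e.g. 5 (00:05) becomes "0005"
        col = str(int(float(col))).zfill(4)
        # Note: '%I:%M' is a 12-hour format without AM/PM, so 01:30 and 13:30 give the same string
        return datetime.strptime(col, '%H%M').time().strftime("%I:%M")
    except (ValueError, TypeError):
        # Unparseable values (e.g. 2400 or free text) are treated as missing
        return np.nan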

def optimize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
  df = df.copy(deep=True)

  print(f"Before memory optimization: {df.memory_usage(deep=True).sum() * 0.000001} MBs")

  print("Dropping columns...")
  df = df.drop('Unnamed: 27', axis=1)
  # df = df.drop('CARRIER_DELAY', axis=1)
  # df = df.drop('WEATHER_DELAY', axis=1)
  # df = df.drop('NAS_DELAY', axis=1)
  # df = df.drop('SECURITY_DELAY', axis=1)
  # df = df.drop('CANCELLATION_CODE', axis=1)

  print("Changing data types...")
  df['FL_DATE'] = pd.to_datetime(df['FL_DATE'], yearfirst=True)
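  # DEP_TIME is HHMM-encoded (up to 2400) and delays can exceed 127 minutes, so int8 would overflow;
  # float32 also keeps the NaN values of cancelled flights
  df['DEP_TIME'] = df['DEP_TIME'].astype('float32', errors='ignore')
  df['DEP_DELAY'] = df['DEP_DELAY'].astype('float32', errors='ignore')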
  df['OP_CARRIER_FL_NUM'] = df['OP_CARRIER_FL_NUM'].astype('category', errors='ignore')
  df['OP_CARRIER'] = df['OP_CARRIER'].astype('category', errors='ignore')
  df['ORIGIN'] = df['ORIGIN'].astype('category', errors='ignore')
  df['DEST'] = df['DEST'].astype('category', errors='ignore')
  df['CANCELLED'] = df['CANCELLED'].astype('bool', errors='ignore')
  df['DIVERTED'] = df['DIVERTED'].astype('bool', errors='ignore')
  df['CANCELLATION_CODE'] = df['CANCELLATION_CODE'].astype('category', errors='ignore')

  print("Parsing time columns...")
  # Only HHMM-encoded clock times are parsed; the *_DELAY columns are durations in minutes and stay numeric
  cols_ = ['CRS_DEP_TIME', 'DEP_TIME', 'CRS_ARR_TIME', 'ARR_TIME', 'WHEELS_OFF', 'WHEELS_ON']
  for col_ in cols_:
    df[col_] = df[col_].apply(lambda x: validate_int2str(x))

  # df['CRS_DEP_TIME'] = df['CRS_DEP_TIME']
  # df['DEP_TIME'] = df['DEP_TIME'].apply(lambda x: validate_int2str(x))
  # df['CRS_ARR_TIME'] = df['CRS_ARR_TIME'].apply(lambda x: validate_int2str(x))
  # df['ARR_TIME'] = df['ARR_TIME'].apply(lambda x: validate_int2str(x))
  # df['NAS_DELAY'] = df['NAS_DELAY'].apply(lambda x: validate_int2str(x))
  # df['SECURITY_DELAY'] = df['SECURITY_DELAY'].apply(lambda x: validate_int2str(x))
  # df['CARRIER_DELAY'] = df['CARRIER_DELAY'].apply(lambda x: validate_int2str(x))
  # df['LATE_AIRCRAFT_DELAY'] = df['LATE_AIRCRAFT_DELAY'].apply(lambda x: validate_int2str(x))
  # df['WHEELS_OFF'] = df['WHEELS_OFF'].apply(lambda x: validate_int2str(x))
  # df['WHEELS_ON'] = df['WHEELS_ON'].apply(lambda x: validate_int2str(x))

  print(f"After memory optimization: {df.memory_usage(deep=True).sum() * 0.000001} MBs")

  return df

# Load and optimize every yearly file, then combine them into one DataFrame
csv_files = glob("data/raw/*.csv")
dfs = []

for f in tqdm(csv_files):
  df = pd.read_csv(f)
  df = optimize_dataframe(df)
  dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
print(f"Shape of the data: {df.shape}")

In [None]:
# Getting the memory and data type information of the data
df.info()

In [None]:
# Checking the first 5 rows of data
df.head()

----
TEST

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df2009 = pd.read_csv('/content/data/raw/2009.csv')
df2009.info()

In [None]:
df2009_optimized = optimize_dataframe(df2009)
df2009_optimized.info()

In [None]:
df2009['CRS_DEP_TIME'].head(3)

In [None]:
df2009['CRS_DEP_TIME2'] = df2009['CRS_DEP_TIME'].apply(lambda x: validate_int2str(x))

In [None]:
df2009['CRS_DEP_TIME3'] = pd.to_datetime(df2009['CRS_DEP_TIME2'], format='%I:%M')

In [None]:
df2009['CRS_DEP_TIME3'].dt.time.head(3)

In [None]:
df2009_optimized['CRS_DEP_TIME'] = df2009_optimized['CRS_DEP_TIME'].apply(lambda x: validate_int2str(x))

In [None]:
df2009_optimized['CRS_DEP_TIME'].isna().sum()*100/len(df2009_optimized)

In [None]:
df2009.columns

In [None]:
df2009.info()

In [None]:
df2009.head(2)

In [None]:
df2009.shape

In [None]:
df_without_null = df2009.dropna(subset=['DEP_TIME', 'DEP_DELAY', 'TAXI_OUT', 'WHEELS_OFF', 'WHEELS_ON', 'TAXI_IN', 'CRS_ARR_TIME', 'ARR_TIME', 'ARR_DELAY'])

In [None]:
df_without_null.head(2)

In [None]:
df2009['WHEELS_OFF'].head(10)

In [None]:
# Checking null values
df.isnull().sum() * 100 / len(df)

Per Wikipedia: https://www.wikiwand.com/en/Flight_cancellation_and_delay

```
Delays are divided into three categories, namely "on time or small delay" (up to 15 minutes delay), "Medium delay" (15 – 45 minutes delay) and "Large delay" (≥ 45 minutes delay).
```
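
The three categories quoted above can be derived from a delay column with `pd.cut`. This is a minimal sketch on made-up delays; treating exactly 15 minutes as "medium" and 45 minutes or more as "large" is an assumption about the boundaries, and `DELAY_CATEGORY` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up arrival delays in minutes (negative means an early arrival)
delays = pd.Series([-5.0, 10.0, 15.0, 30.0, 45.0, 120.0, np.nan], name="ARR_DELAY")

# right=False makes the intervals [-inf, 15), [15, 45), [45, inf)
categories = pd.cut(
    delays,
    bins=[-np.inf, 15, 45, np.inf],
    labels=["on time or small delay", "medium delay", "large delay"],
    right=False,
)

# Missing delays (e.g. cancelled flights) stay NaN instead of being forced into a bin
print(pd.concat([delays, categories.rename("DELAY_CATEGORY")], axis=1))
```

A categorical label like this could serve as a multi-class target, as an alternative to a simple delayed / not delayed flag.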

In [None]:
# Adding column to check whether it was delayed or not
df['FLIGHT_STATUS'] = df['ARR_DELAY'] > 0

> To do: verify that the recorded delay is correct by subtracting the scheduled time from the actual time.
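
A possible way to carry out this check is sketched below for the departure columns. It assumes the raw HHMM encoding (e.g. `1455` for 14:55) and made-up rows; `hhmm_to_minutes` and `DELAY_MATCHES` are hypothetical names introduced for illustration. Because the times carry no date, delays of 12 hours or more cannot be recovered this way.

```python
import pandas as pd

def hhmm_to_minutes(hhmm: pd.Series) -> pd.Series:
    """Convert HHMM-encoded times (e.g. 1455) to minutes since midnight."""
    return (hhmm // 100) * 60 + (hhmm % 100)

# Made-up rows; the last flight is scheduled at 23:50 and leaves at 00:20 the next day
check = pd.DataFrame({
    "CRS_DEP_TIME": [1100.0, 1455.0, 2350.0],
    "DEP_TIME": [1058.0, 1520.0, 20.0],
    "DEP_DELAY": [-2.0, 25.0, 30.0],
})

diff = hhmm_to_minutes(check["DEP_TIME"]) - hhmm_to_minutes(check["CRS_DEP_TIME"])

# Wrap differences into [-720, 720) so that departures past midnight are handled
diff = (diff + 720) % 1440 - 720

# True where the recorded delay agrees with the recomputed one
check["DELAY_MATCHES"] = diff == check["DEP_DELAY"]
print(check)
```

The same pattern would apply to `ARR_TIME`, `CRS_ARR_TIME`, and `ARR_DELAY`, keeping in mind that arrival and departure airports may be in different time zones.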