## Flight delays and cancellations
- The purpose is to apply what was learned about pandas to answer questions about the 2015 Flight Delays and Cancellations dataset

### Import libraries

In [1]:
import os
import dask.dataframe as dd
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

True

### Declare constants

In [2]:
flights_des = os.getenv('FLIGHTS_OUT')
air_des = os.getenv('AIRLINES_OUT')
chunk_size = int(25e6)  # 25 MB in bytes; an integer avoids type errors in dask versions that reject floats

In [3]:
flight_dt = {
    'YEAR': 'int32',
    'MONTH': 'int16',
    'DAY': 'int16',
    'DAY_OF_WEEK': 'int16',
    'AIRLINE': 'object',
    'FLIGHT_NUMBER': 'int32',
    'TAIL_NUMBER': 'object',
    'ORIGIN_AIRPORT': 'object',
    'DESTINATION_AIRPORT': 'object',
    'SCHEDULED_DEPARTURE': 'int32',
    'DEPARTURE_TIME': 'float64',
    'DEPARTURE_DELAY': 'float64',
    'TAXI_OUT': 'float64',
    'WHEELS_OFF': 'float64',
    'SCHEDULED_TIME': 'float64',
    'ELAPSED_TIME': 'float64',
    'AIR_TIME': 'float64',
    'DISTANCE': 'float64',
    'WHEELS_ON': 'float64',
    'TAXI_IN': 'float64',
    'SCHEDULED_ARRIVAL': 'int32',
    'ARRIVAL_TIME': 'float64',
    'ARRIVAL_DELAY': 'float64',
    'DIVERTED': 'bool',
    'CANCELLED': 'bool',
    'CANCELLATION_REASON': 'object',
    'AIR_SYSTEM_DELAY': 'float64',
    'SECURITY_DELAY': 'float64',
    'AIRLINE_DELAY': 'float64',
    'LATE_AIRCRAFT_DELAY': 'float64',
    'WEATHER_DELAY': 'float64'
}

air_dt = {
    'IATA_CODE': 'object',
    'AIRLINE': 'object'
}
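The delay columns are declared as `float64` rather than an integer type because they can be empty, and pandas represents a missing number as `NaN`, which is a float. A minimal sketch of that behaviour, using a made-up three-row CSV (the sample values and the name `sample_csv` are assumptions for illustration, not part of the dataset):

```python
import io

import pandas as pd

# Hypothetical three-row sample; the second row has no ARRIVAL_DELAY value.
sample_csv = io.StringIO(
    "FLIGHT_NUMBER,ARRIVAL_DELAY,CANCELLED\n"
    "98,-22.0,False\n"
    "2336,,True\n"
    "840,5.0,False\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={
        "FLIGHT_NUMBER": "int32",
        "ARRIVAL_DELAY": "float64",  # float so the empty cell can become NaN
        "CANCELLED": "bool",
    },
)

print(sample.dtypes)
print(sample["ARRIVAL_DELAY"].isna().sum())  # number of missing arrival delays
```

Declaring `ARRIVAL_DELAY` as a plain integer type here would fail on the empty cell, which is why the nullable columns stay as floats.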

### Download data

In [4]:
%run -i './data_downloader.py'

### Steps

1. Read `flights.csv` and `airlines.csv` as dataframes

In [5]:
flights = dd.read_csv(
    flights_des, 
    dtype = flight_dt,
    blocksize = chunk_size, # split data into 25 MB chunks
)
flights.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_TIME,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,...,408.0,-22.0,False,False,,,,,,
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,741.0,-9.0,False,False,,,,,,
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,811.0,5.0,False,False,,,,,,
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,756.0,-9.0,False,False,,,,,,
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,...,259.0,-21.0,False,False,,,,,,


In [6]:
airlines = dd.read_csv(
    air_des, 
    dtype = air_dt,  # use the airlines schema, not the flights schema
    blocksize = chunk_size,
)
airlines.head()

Unnamed: 0,IATA_CODE,AIRLINE
0,UA,United Air Lines Inc.
1,AA,American Airlines Inc.
2,US,US Airways Inc.
3,F9,Frontier Airlines Inc.
4,B6,JetBlue Airways


2. How many features (columns) and rows (records) are there in each dataframe?

In [7]:
row, col = airlines.shape
print(f'airlines dataframe has {row.compute()} rows {col} columns')
row, col = flights.shape
print(f'flights dataframe has {row.compute()} rows {col} columns')

airlines dataframe has 14 rows 2 columns
flights dataframe has 5819079 rows 31 columns


3. How many flights were cancelled?

In [8]:
flights['CANCELLED'].value_counts().compute()

False    5729195
True       89884
Name: CANCELLED, dtype: int64

**Observation**
- `89884` of the `5819079` flights were cancelled, which is about `1.54 percent`
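The percentage can be checked directly from the two counts printed above. This is only a small arithmetic sketch; the variable names are chosen for illustration.

```python
# Counts taken from the value_counts() output above.
cancelled = 89_884
not_cancelled = 5_729_195
total = cancelled + not_cancelled  # 5_819_079, matching the row count

share = cancelled / total * 100
print(f"{share:.2f} percent")  # prints "1.54 percent"
```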

4. Which airline has the most cancelled flights?

In [9]:
cancelled_by_air = pd.DataFrame(
    (
        flights[flights['CANCELLED']]
        .groupby('AIRLINE')
        .size()
        .compute()
    ),
    columns=['FLIGHT_COUNT']
)

In [10]:
(
    airlines
    .join(cancelled_by_air, on = 'IATA_CODE')
    .compute()
    .sort_values(by = 'FLIGHT_COUNT', ascending = False)
)

Unnamed: 0,IATA_CODE,AIRLINE,FLIGHT_COUNT
8,WN,Southwest Airlines Co.,16043
10,EV,Atlantic Southeast Airlines,15231
12,MQ,American Eagle Airlines Inc.,15025
1,AA,American Airlines Inc.,10919
5,OO,Skywest Airlines Inc.,9960
0,UA,United Air Lines Inc.,6573
4,B6,JetBlue Airways,4276
2,US,US Airways Inc.,4067
9,DL,Delta Air Lines Inc.,3824
7,NK,Spirit Air Lines,2004


**Observation**
- Southwest Airlines Co. has the most cancelled flights, with `16043`
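The filter, count, and join pattern used above can be tried on a small in-memory example with plain pandas. The toy data and the names `flights_toy`, `airlines_toy`, and `ranked` are assumptions for illustration. One point the sketch makes visible: an airline with no cancelled flights gets `NaN` from the join, so the count is filled with `0` before sorting.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
flights_toy = pd.DataFrame({
    "AIRLINE": ["WN", "WN", "AA", "AA", "AA", "DL"],
    "CANCELLED": [True, True, True, False, False, False],
})
airlines_toy = pd.DataFrame({
    "IATA_CODE": ["AA", "DL", "WN"],
    "AIRLINE": [
        "American Airlines Inc.",
        "Delta Air Lines Inc.",
        "Southwest Airlines Co.",
    ],
})

# Keep cancelled flights only, then count them per airline code.
cancelled_counts = (
    flights_toy[flights_toy["CANCELLED"]]
    .groupby("AIRLINE")
    .size()
    .rename("FLIGHT_COUNT")
)

# Attach the counts to the airline names; DL has no match and becomes 0.
ranked = (
    airlines_toy
    .join(cancelled_counts, on="IATA_CODE")
    .fillna({"FLIGHT_COUNT": 0})
    .astype({"FLIGHT_COUNT": "int64"})
    .sort_values("FLIGHT_COUNT", ascending=False)
)
print(ranked)
```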

5. What are the maximum, minimum, mean and standard deviation of `departure delay` and `arrival delay`?

In [11]:
flights[['DEPARTURE_DELAY', 'ARRIVAL_DELAY']].describe().compute().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
DEPARTURE_DELAY,5732926.0,9.370158,37.080942,-82.0,-4.0,1.0,18.0,1988.0
ARRIVAL_DELAY,5714008.0,4.407057,39.271297,-87.0,-10.0,0.0,19.0,1971.0


**Observation**
- The mean `ARRIVAL_DELAY` (`4.41` minutes) is about half of the mean `DEPARTURE_DELAY` (`9.37` minutes)
- `DEPARTURE_DELAY` has a standard deviation of `37.08`, and the standard deviation of `ARRIVAL_DELAY` is `2.19` higher at `39.27`
- The minimum `DEPARTURE_DELAY` (`-82`) is higher than the minimum `ARRIVAL_DELAY` (`-87`) by `5`
- The maximum `DEPARTURE_DELAY` (`1988`) is higher than the maximum `ARRIVAL_DELAY` (`1971`) by `17`
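The comparisons in these observations follow from the `describe()` output by simple arithmetic. A short sketch that recomputes them, with the statistics copied by hand from the table above (the dictionary names are illustrative):

```python
# Statistics copied from the describe() output above.
dep = {"mean": 9.370158, "std": 37.080942, "min": -82.0, "max": 1988.0}
arr = {"mean": 4.407057, "std": 39.271297, "min": -87.0, "max": 1971.0}

mean_ratio = arr["mean"] / dep["mean"]  # arrival mean relative to departure mean
std_gap = arr["std"] - dep["std"]       # how much wider the arrival spread is
min_gap = dep["min"] - arr["min"]       # difference between the two minimums
max_gap = dep["max"] - arr["max"]       # difference between the two maximums

print(f"mean ratio: {mean_ratio:.2f}")  # prints "mean ratio: 0.47"
print(f"std gap: {std_gap:.2f}")        # prints "std gap: 2.19"
print(f"min gap: {min_gap:.0f}, max gap: {max_gap:.0f}")  # prints "min gap: 5, max gap: 17"
```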

6. Print the dataframe after the following steps
    - Filtering out rows where `departure_delay` or `arrival_delay` is NA
    - Joining the airlines data to the flights data using `airline` as the key

In [12]:
dropped_na = flights.dropna(subset = ['DEPARTURE_DELAY', 'ARRIVAL_DELAY'])
joined = dropped_na.join(airlines.set_index('IATA_CODE'), on = 'AIRLINE', rsuffix = '_air')

joined.head()

Unnamed: 0,YEAR,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,...,ARRIVAL_DELAY,DIVERTED,CANCELLED,CANCELLATION_REASON,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,AIRLINE_air
0,2015,1,1,4,AS,98,N407AS,ANC,SEA,5,...,-22.0,False,False,,,,,,,Alaska Airlines Inc.
1,2015,1,1,4,AA,2336,N3KUAA,LAX,PBI,10,...,-9.0,False,False,,,,,,,American Airlines Inc.
2,2015,1,1,4,US,840,N171US,SFO,CLT,20,...,5.0,False,False,,,,,,,US Airways Inc.
3,2015,1,1,4,AA,258,N3HYAA,LAX,MIA,20,...,-9.0,False,False,,,,,,,American Airlines Inc.
4,2015,1,1,4,AS,135,N527AS,SEA,ANC,25,...,-21.0,False,False,,,,,,,Alaska Airlines Inc.
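To see what the two steps do in isolation, here is a minimal pandas sketch on made-up rows (the toy frames and their values are assumptions for illustration). `dropna(subset=...)` removes a row when any of the listed columns is missing, and `rsuffix` renames the overlapping `AIRLINE` column from the right-hand frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only the first row has both delays recorded.
flights_toy = pd.DataFrame({
    "AIRLINE": ["AS", "AA", "US"],
    "DEPARTURE_DELAY": [-11.0, np.nan, 2.0],
    "ARRIVAL_DELAY": [-22.0, np.nan, np.nan],
})
airlines_toy = pd.DataFrame({
    "IATA_CODE": ["AS", "AA", "US"],
    "AIRLINE": ["Alaska Airlines Inc.", "American Airlines Inc.", "US Airways Inc."],
})

# Drop rows where either delay column is missing.
kept = flights_toy.dropna(subset=["DEPARTURE_DELAY", "ARRIVAL_DELAY"])

# Look up the airline name by code; the right-hand AIRLINE becomes AIRLINE_air.
joined_toy = kept.join(
    airlines_toy.set_index("IATA_CODE"), on="AIRLINE", rsuffix="_air"
)
print(joined_toy)
```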


7. Which airline has the highest average `departure_delay`, and how long is it?

In [13]:
(
    joined.groupby(['AIRLINE', 'AIRLINE_air'])['DEPARTURE_DELAY']
    .mean()
    .compute()
    .sort_values(ascending = False)
)

AIRLINE  AIRLINE_air                 
NK       Spirit Air Lines                15.883101
UA       United Air Lines Inc.           14.333056
F9       Frontier Airlines Inc.          13.303352
B6       JetBlue Airways                 11.442467
WN       Southwest Airlines Co.          10.517183
MQ       American Eagle Airlines Inc.     9.967187
VX       Virgin America                   8.993486
AA       American Airlines Inc.           8.826106
EV       Atlantic Southeast Airlines      8.615598
OO       Skywest Airlines Inc.            7.736083
DL       Delta Air Lines Inc.             7.313300
US       US Airways Inc.                  6.081000
AS       Alaska Airlines Inc.             1.718926
HA       Hawaiian Airlines Inc.           0.469918
Name: DEPARTURE_DELAY, dtype: float64

**Observation**
- Spirit Air Lines (NK) has the highest average departure delay, at `15.8831` minutes

8. Which month has the highest number of flights, and how many?

In [14]:
flights.groupby('MONTH').size().compute()

MONTH
1     469968
2     429191
3     504312
4     485151
5     496993
6     503897
7     520718
8     510536
9     464946
10    486165
11    467972
12    479230
dtype: int64

**Observation**
- July had the highest number of flights, with `520718`

9. Create a new column: if the flight has a positive `departure delay`, the value of the new column is `Delay`; otherwise it is `Not Delay`

In [15]:
flights_no_dept_na = flights.dropna(subset=['DEPARTURE_DELAY'])

flights_no_dept_na['DEPARTURE_DELAY_STATUS'] = flights_no_dept_na.apply(
    lambda x: 'Delay' if x['DEPARTURE_DELAY'] > 0 else 'Not Delay',
    meta = (None, 'object'),
    axis=1
)
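A row-wise `apply` calls the lambda once per row, which tends to be slow on millions of records. As a possible alternative, the same labelling can be written as a single vectorized expression. The sketch below uses plain pandas and NumPy on a toy frame (the data and the name `delays` are assumptions); it has not been run against the Dask dataframe, where something like `map_partitions` would probably be needed to apply it per partition.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data covering early, on-time, and late departures.
delays = pd.DataFrame({"DEPARTURE_DELAY": [-11.0, 0.0, 2.0, 45.0]})

# Label every row at once: strictly positive delay means "Delay".
delays["DEPARTURE_DELAY_STATUS"] = np.where(
    delays["DEPARTURE_DELAY"] > 0, "Delay", "Not Delay"
)
print(delays)
```

Note that a comparison with `NaN` evaluates to `False`, so rows with a missing delay would be labelled `Not Delay` unless they are dropped first, as the cell above does.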

10. How many flights are delayed, and how many are not?

In [16]:
flights_no_dept_na['DEPARTURE_DELAY_STATUS'].value_counts().compute()

Not Delay    3607308
Delay        2125618
Name: DEPARTURE_DELAY_STATUS, dtype: int64

**Observation**
- `2125618` flights were delayed on departure, which is about `37.08 percent` of the `5732926` flights with a recorded departure delay, while the remaining `3607308` flights left on time or early
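The percentage depends on which denominator is used, because flights without a recorded `DEPARTURE_DELAY` were dropped before counting. A small arithmetic sketch with the counts printed above (variable names are illustrative):

```python
# Counts taken from the outputs above.
delayed = 2_125_618
not_delayed = 3_607_308
with_recorded_delay = delayed + not_delayed  # 5_732_926 rows after dropna
all_flights = 5_819_079                      # row count of the full dataframe

share_of_recorded = delayed / with_recorded_delay * 100
share_of_all = delayed / all_flights * 100

print(f"{share_of_recorded:.2f} percent of flights with a recorded delay")  # 37.08
print(f"{share_of_all:.2f} percent of all flights")                         # 36.53
```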

11. Create an insight of your own from the data
    - Which month has the highest number of flights with a departure delay?

In [17]:
(
    flights_no_dept_na[flights_no_dept_na['DEPARTURE_DELAY_STATUS'] == 'Delay']
    .groupby('MONTH')
    .size()
    .compute()
)

MONTH
1     176627
2     173442
3     193817
4     167314
5     178856
6     215381
7     209619
8     190840
9     132591
10    145102
11    152690
12    189339
dtype: int64

**Observation**
- June had the highest number of flights delayed on departure, with `215381`
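Raw counts favour busy months, so it can also help to look at the delayed share per month. The sketch below divides the delayed counts by the monthly flight totals from step 8, both copied by hand from the outputs above. The totals include cancelled flights, so the rates are slightly lower than they would be over departed flights only.

```python
# Monthly totals from step 8 and delayed counts from step 11.
flights_per_month = {
    1: 469968, 2: 429191, 3: 504312, 4: 485151, 5: 496993, 6: 503897,
    7: 520718, 8: 510536, 9: 464946, 10: 486165, 11: 467972, 12: 479230,
}
delayed_per_month = {
    1: 176627, 2: 173442, 3: 193817, 4: 167314, 5: 178856, 6: 215381,
    7: 209619, 8: 190840, 9: 132591, 10: 145102, 11: 152690, 12: 189339,
}

# Share of each month's flights that left late.
delay_rate = {
    month: delayed_per_month[month] / flights_per_month[month]
    for month in flights_per_month
}

worst_month = max(delay_rate, key=delay_rate.get)
best_month = min(delay_rate, key=delay_rate.get)
print(worst_month, f"{delay_rate[worst_month]:.1%}")  # prints "6 42.7%"
print(best_month, f"{delay_rate[best_month]:.1%}")    # prints "9 28.5%"
```

By this measure June is still the worst month, so the conclusion does not depend on June simply having many flights.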