<span style="font-size:42px"><b>Practice Case 03</b></span><br><br>
<span style="font-size:36px">Foundation of Machine Learning</span>

Copyright 2019 Gunawan Lumban Gaol

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Tasks
1. Provide travel recommendations to passengers (best time to take a flight or best airline) so they can avoid delays. Each recommendation has to be supported by at least one graph.
2. Create a model to estimate the delay duration (linear regression).
3. Create a model to predict delays > 60 min (logistic regression and another supervised model of your choice).
4. Did you do any feature engineering on the dataset? If yes, give the reason for each feature you created.
5. Using those models, predict the delays > 60 min that will happen in December.

# Import Packages

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from configparser import ConfigParser

from edapy.edapy import transformation
from edapy.edapy import plotting

In [3]:
config = ConfigParser()
config.read('./config.ini')

['./config.ini']

# Import Data

The dataset contains historical flight data in Malaysia from Oct 2018 to Nov 2018.
Each record holds flight information such as:
1. Date of flight
2. Date of arrival
3. Departure Delay
4. Tail Number
5. Airline Name
6. Departure Location
7. Arrival Location
8. Flight Number
9. Delay

In [4]:
data_train = pd.read_csv('csv/training_dataset.csv')
data_test = pd.read_csv('csv/test_dataset.csv')

In [5]:
print(data_train.shape)
print(data_test.shape)

(111068, 26)
(41557, 26)


In [6]:
data_train.head()

Unnamed: 0,id,number,airline,airline_name,scheduled_departure_time,scheduled_arrival_time,departure_airport_city,departure_airport_code,departure_airport_country,departure_airport_gate,...,arrival_airport_country,arrival_airport_gate,arrival_airport_name,arrival_airport_region,arrival_airport_terminal,arrival_airport_timezone,flight_equipment_iata,flight_equipment_name,flight_duration,delay
0,1,AK6430,AK,AirAsia,2018-10-05 22:00:00,2018-10-05 23:05:00,Kuala Lumpur,KUL,MY,J15,...,MY,2,Sultan Ismail Petra Airport,Asia,,Asia/Kuala_Lumpur,32S,Airbus A318 / A319 / A320 / A321,58m,2
1,2,ID*7164,ID*,Batik Air,2018-10-05 22:00:00,2018-10-05 23:55:00,Kuala Lumpur,KUL,MY,C33,...,ID,5,Soekarno-Hatta International Airport,Asia,2,Asia/Jakarta,32S,Airbus A318 / A319 / A320 / A321,1h 55m,8
2,3,MXD9116,MXD,Malindo Air,2018-10-05 22:00:00,2018-10-05 23:55:00,Kuala Lumpur,KUL,MY,C33,...,ID,5,Soekarno-Hatta International Airport,Asia,2,Asia/Jakarta,32S,Airbus A318 / A319 / A320 / A321,1h 55m,8
3,4,AK5198,AK,AirAsia,2018-10-05 22:05:00,2018-10-06 01:00:00,Kuala Lumpur,KUL,MY,J9,...,MY,INT,Sandakan Airport,Asia,,Asia/Kuala_Lumpur,32S,Airbus A318 / A319 / A320 / A321,2h 54m,0
4,5,AK516,AK,AirAsia,2018-10-05 22:10:00,2018-10-06 01:25:00,Kuala Lumpur,KUL,MY,P1,...,VN,,Noi Bai International Airport,Asia,T2,Asia/Ho_Chi_Minh,32S,Airbus A318 / A319 / A320 / A321,3h 17m,0


In [7]:
data_train.columns

Index(['id', 'number', 'airline', 'airline_name', 'scheduled_departure_time',
       'scheduled_arrival_time', 'departure_airport_city',
       'departure_airport_code', 'departure_airport_country',
       'departure_airport_gate', 'departure_airport_name',
       'departure_airport_region', 'departure_airport_terminal',
       'departure_airport_timezone', 'arrival_airport_city',
       'arrival_airport_code', 'arrival_airport_country',
       'arrival_airport_gate', 'arrival_airport_name',
       'arrival_airport_region', 'arrival_airport_terminal',
       'arrival_airport_timezone', 'flight_equipment_iata',
       'flight_equipment_name', 'flight_duration', 'delay'],
      dtype='object')

Combine the train and test datasets for EDA purposes.

In [9]:
data = pd.concat([data_train, data_test])
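
If the two files need to be separated again after EDA, one option is to tag each row with its origin before concatenating. The following is a minimal sketch on toy stand-ins for `data_train` and `data_test`; the `source` column name is a hypothetical label and is not part of the original dataset.

```python
import pandas as pd

# Toy stand-ins for data_train / data_test (the real frames come from the CSV files).
toy_train = pd.DataFrame({"id": [1, 2, 3], "delay": [2, 8, 0]})
toy_test = pd.DataFrame({"id": [1, 2], "delay": [15, 70]})

# Tag each row with its origin; "source" is a hypothetical column name.
combined = pd.concat(
    [toy_train.assign(source="train"), toy_test.assign(source="test")],
    ignore_index=True,  # avoid repeated index labels coming from the two files
)

# The original split can be recovered later from the tag.
train_again = combined[combined["source"] == "train"].drop(columns="source")
```

Passing `ignore_index=True` also gives the combined frame a unique index, which plain concatenation of two separately indexed frames does not.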

# Data Preprocessing

* Check for missing and null data; remove them if necessary
* Check for infinite values; remove them or convert them to null if necessary
* Check for duplicated data; remove them if necessary
* Split numerical and categorical columns

Before proceeding, set `col_ID` and `col_target` from the data.

In [10]:
col_ID = ''
col_target = 'delay'

## Infinite Values Data

In [11]:
data = data.replace([np.inf, -np.inf], np.nan)
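
Before replacing infinite values, it can be useful to see how many there are per column. The sketch below uses a toy frame rather than the real data and restricts the check to numeric columns, since `np.isinf` is not defined for strings.

```python
import numpy as np
import pandas as pd

# Toy frame with two infinite values in a numeric column.
toy = pd.DataFrame({
    "delay": [2.0, np.inf, -np.inf, 8.0],
    "airline": ["AK", "AK", "MXD", "ID*"],
})

# Count infinite values per numeric column.
numeric = toy.select_dtypes(include=[np.number])
inf_counts = np.isinf(numeric).sum()

# Same replacement as in the cell above, applied to the toy frame.
cleaned = toy.replace([np.inf, -np.inf], np.nan)
```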

## Duplicated Data

In [12]:
# Exclude the target by name; set(col_target) would split the string into single characters.
data.duplicated(subset=[c for c in data.columns if c != col_target]).sum()

0

## Missing & Null Data

In [13]:
missing_df = data.isnull().sum(axis=0).reset_index()
missing_df.columns = ['variable', 'missing values']
missing_df['filling factor (%)']=(data.shape[0]-missing_df['missing values'])/data.shape[0]*100
missing_df.sort_values('filling factor (%)').reset_index(drop = True)

Unnamed: 0,variable,missing values,filling factor (%)
0,arrival_airport_gate,137275,10.05733
1,arrival_airport_terminal,79410,47.970516
2,departure_airport_gate,60619,60.282391
3,departure_airport_terminal,56588,62.923505
4,flight_equipment_name,22,99.985586
5,flight_equipment_iata,22,99.985586
6,scheduled_arrival_time,7,99.995414
7,flight_duration,5,99.996724
8,arrival_airport_timezone,0,100.0
9,arrival_airport_region,0,100.0


The following columns have at least 30% missing values:
* `arrival_airport_terminal`
* `arrival_airport_gate`
* `departure_airport_gate`
* `departure_airport_terminal`

These attributes are dropped for now and can be explored later if there is enough time.

In [20]:
data = data.drop(['arrival_airport_terminal', 'arrival_airport_gate', 'departure_airport_gate', 'departure_airport_terminal'], axis=1)
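
The same selection can be derived from the missing-value ratio instead of a hard-coded list. This is a minimal sketch on a toy frame; `missing_threshold` is an assumed name for the 30% cut-off mentioned above.

```python
import numpy as np
import pandas as pd

# Toy frame: two sparse columns and one complete column.
toy = pd.DataFrame({
    "arrival_airport_gate": [np.nan, np.nan, np.nan, "5"],
    "departure_airport_gate": ["J15", np.nan, "C33", np.nan],
    "flight_duration": ["58m", "1h 55m", "2h 54m", "3h 17m"],
})

missing_threshold = 0.30  # assumed cut-off: drop columns with >= 30% missing values

# Fraction of missing values per column.
missing_ratio = toy.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio >= missing_threshold].index.tolist()

reduced = toy.drop(columns=cols_to_drop)
```

Deriving the list this way keeps the drop step in sync with the data if the threshold or the input files change.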

## Split Numerical & Categorical Data

* Create numerical & categorical column list.

In [22]:
transformation.convert_to_categorical(data, 200)

Column airline casted to categorical
Column airline_name casted to categorical
Column departure_airport_city casted to categorical
Column departure_airport_code casted to categorical
Column departure_airport_country casted to categorical
Column departure_airport_name casted to categorical
Column departure_airport_region casted to categorical
Column departure_airport_timezone casted to categorical
Column arrival_airport_city casted to categorical
Column arrival_airport_code casted to categorical
Column arrival_airport_country casted to categorical
Column arrival_airport_name casted to categorical
Column arrival_airport_region casted to categorical
Column arrival_airport_timezone casted to categorical
Column flight_equipment_iata casted to categorical
Column flight_equipment_name casted to categorical


In [24]:
if (data[col_target].nunique() == 2):
    cols_num = list(set(data.select_dtypes(include=[np.number]).columns) - set([col_ID]))
    cols_cat = list(set(data.select_dtypes(exclude=[np.number]).columns) - set([col_target]))
else:
    cols_num = list(set(data.select_dtypes(include=[np.number]).columns) - set([col_ID, col_target]))
    cols_cat = list(set(data.select_dtypes(exclude=[np.number]).columns))

# Task 1: Travel Recommendation

This task recommends the best airline or the best time to fly in order to avoid delays. Before proceeding with the analysis, it helps to know the common reasons behind flight delays; [claimcompass.eu](https://www.claimcompass.eu/blog/why-is-my-flight-delayed/) lists the probable ones.

Among the 15 most common reasons listed there, it is worth noting that a delay can be caused by a `Knock-on effect`, in which a flight is delayed because of the late arrival of its aircraft. Such delays cannot be predicted from the attributes available in the data, so they may distort the predictions made later. For this analysis, it is assumed that no delay in the data is caused by this effect.
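
One possible starting point for the recommendation is to summarise delays by scheduled departure hour and by airline. The sketch below uses a toy frame with the dataset's `airline_name`, `scheduled_departure_time` and `delay` columns; `departure_hour` and `long_delay` are hypothetical helper columns, and the values are illustrative only, not results from the real data.

```python
import pandas as pd

# Toy rows using column names from the dataset; values are illustrative only.
toy = pd.DataFrame({
    "airline_name": ["AirAsia", "AirAsia", "Malindo Air", "Malindo Air"],
    "scheduled_departure_time": [
        "2018-10-05 22:00:00", "2018-10-06 07:05:00",
        "2018-10-05 22:00:00", "2018-10-06 07:30:00",
    ],
    "delay": [2, 75, 8, 0],
})

# Parse the timestamp and derive the helper columns (hypothetical names).
toy["scheduled_departure_time"] = pd.to_datetime(toy["scheduled_departure_time"])
toy["departure_hour"] = toy["scheduled_departure_time"].dt.hour
toy["long_delay"] = toy["delay"] > 60

# Mean delay and number of flights per departure hour.
by_hour = toy.groupby("departure_hour")["delay"].agg(["mean", "count"])

# Share of flights delayed by more than 60 minutes per airline.
by_airline = toy.groupby("airline_name")["long_delay"].mean()
```

Either summary can then be drawn as a bar chart, for example with `by_hour["mean"].plot(kind="bar")`, to provide the graph that supports each recommendation.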

# Task 2: Model Estimating Delay Duration

# Task 3: Model Predict Delay > 60

# Task 4: Predict Delays > 60 That Will Happen in December