# <center><span style="color:blue">AIRLINE ARRIVALS 2008</span></center>

###### Use this dataset of airline arrival information to predict how late flights will be. A flight only counts as late if it is more than 30 minutes late.
    1. The project should follow the same guidelines as previous projects.
    2. Apply Naïve Bayes, Logistic Regression, Decision Tree, Random Forest, Gradient Boosting and SVM models.
    3. Apply PCA, SelectKBest and RFE for feature selection.
    4. Use GridSearchCV to obtain the best parameters for the models.
    5. Compare performance among the models and write up an analysis of why each model is good or bad from an algorithmic standpoint (explain why the algorithm does or does not suit the structure of the dataset, and whether anything can be done to improve the model).
    6. Include the conclusions.

    Each entry of this file corresponds to a flight; 7,009,728 flights were recorded in 2008. These flights are described by 29 variables, which are listed below.

#### Name	Description

    * 1 Year	1987-2008

    * 2	Month	1-12

    * 3	DayofMonth	1-31

    * 4	DayOfWeek	1 (Monday) - 7 (Sunday)

    * 5	DepTime	actual departure time (local, hhmm)

    * 6	CRSDepTime	scheduled departure time (local, hhmm)

    * 7	ArrTime	actual arrival time (local, hhmm)

    * 8	CRSArrTime	scheduled arrival time (local, hhmm)

    * 9	UniqueCarrier	unique carrier code

    * 10	FlightNum	flight number

    * 11	TailNum	plane tail number

    * 12	ActualElapsedTime	in minutes

    * 13	CRSElapsedTime	in minutes

    * 14	AirTime	in minutes

    * 15	ArrDelay	arrival delay, in minutes

    * 16	DepDelay	departure delay, in minutes

    * 17	Origin	origin IATA airport code

    * 18	Dest	destination IATA airport code

    * 19	Distance	in miles

    * 20	TaxiIn	taxi in time, in minutes

    * 21	TaxiOut	taxi out time in minutes
    
    * 22	Cancelled	was the flight cancelled?

    * 23	CancellationCode	reason for cancellation (A = carrier, B = weather, C = NAS, D = security)

    * 24	Diverted	1 = yes, 0 = no

    * 25	CarrierDelay	in minutes

    * 26	WeatherDelay	in minutes

    * 27	NASDelay	in minutes

    * 28	SecurityDelay	in minutes

    * 29	LateAircraftDelay	in minutes
http://stat-computing.org/dataexpo/2009/the-data.html
    

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import warnings
warnings.simplefilter('ignore')
%matplotlib inline

In [2]:
data= pd.read_csv('Airline_Arrivals_Cleaned1.csv',sep='\t')

In [3]:
data.columns

Index(['Unnamed: 0', 'DepTime', 'CRSDepTime', 'ArrTime', 'CRSArrTime',
       'UniqueCarrier', 'FlightNum', 'TailNum', 'ActualElapsedTime',
       'CRSElapsedTime', 'AirTime', 'ArrDelay', 'DepDelay', 'Origin', 'Dest',
       'Distance', 'Cancelled', 'CancellationCode', 'Diverted', 'CarrierDelay',
       'WeatherDelay', 'NASDelay', 'SecurityDelay', 'LateAircraftDelay', 'Day',
       'Date'],
      dtype='object')

In [4]:
data.head(3)

Unnamed: 0.1,Unnamed: 0,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,...,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay,Day,Date
0,0,20:03:00,2008-01-03 19:55:00,22:11:00,22:25:00,WN,335,N712SW,128.0,150.0,...,0,,0,,,,,,3,2008-01-03 00:00:00
1,1,07:54:00,2008-01-03 07:35:00,10:02:00,10:00:00,WN,3231,N772SW,128.0,145.0,...,0,,0,,,,,,3,2008-01-03 00:00:00
2,2,06:28:00,2008-01-03 06:20:00,08:04:00,07:50:00,WN,448,N428WN,96.0,90.0,...,0,,0,,,,,,3,2008-01-03 00:00:00


# Data Cleaning

In [5]:
# Keep only the columns needed for modelling; copy so later edits do not touch `data`.
df = data[['ArrDelay','ActualElapsedTime','CRSArrTime','ArrTime','DepTime','CRSDepTime','Origin','DepDelay','Cancelled','Dest','Date','Distance','FlightNum','UniqueCarrier','Diverted']].copy()

In [6]:
df.head(3)

Unnamed: 0,ArrDelay,ActualElapsedTime,CRSArrTime,ArrTime,DepTime,CRSDepTime,Origin,DepDelay,Cancelled,Dest,Date,Distance,FlightNum,UniqueCarrier,Diverted
0,-14.0,128.0,22:25:00,22:11:00,20:03:00,2008-01-03 19:55:00,IAD,8.0,0,TPA,2008-01-03 00:00:00,810,335,WN,0
1,2.0,128.0,10:00:00,10:02:00,07:54:00,2008-01-03 07:35:00,IAD,19.0,0,TPA,2008-01-03 00:00:00,810,3231,WN,0
2,14.0,96.0,07:50:00,08:04:00,06:28:00,2008-01-03 06:20:00,IND,8.0,0,BWI,2008-01-03 00:00:00,515,448,WN,0


In [7]:
missing_df = df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['variables','missing values']
missing_df['filling factor (%)'] = (df.shape[0]-missing_df['missing values'])/df.shape[0]*100
missing_df.sort_values('filling factor (%)').reset_index(drop=True)

Unnamed: 0,variables,missing values,filling factor (%)
0,ArrDelay,154699,97.793081
1,ActualElapsedTime,154699,97.793081
2,ArrTime,151649,97.836592
3,DepTime,136246,98.05633
4,DepDelay,136246,98.05633
5,CRSArrTime,0,100.0
6,CRSDepTime,0,100.0
7,Origin,0,100.0
8,Cancelled,0,100.0
9,Dest,0,100.0


        The filling factor of every variable is quite good (> 97%). Since the goal of this work is not to establish the state of the art in predicting flight delays, I do not try to impute the missing values and simply remove the entries that contain them:

In [8]:
df = df.dropna()  # reassign instead of modifying a column subset of `data` in place

        Check again

In [9]:
missing_df = df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['variables','missing values']
missing_df['filling factor (%)'] = (df.shape[0]-missing_df['missing values'])/df.shape[0]*100
missing_df.sort_values('filling factor (%)').reset_index(drop=True)

Unnamed: 0,variables,missing values,filling factor (%)
0,ArrDelay,0,100.0
1,ActualElapsedTime,0,100.0
2,CRSArrTime,0,100.0
3,ArrTime,0,100.0
4,DepTime,0,100.0
5,CRSDepTime,0,100.0
6,Origin,0,100.0
7,DepDelay,0,100.0
8,Cancelled,0,100.0
9,Dest,0,100.0


        Great! Now we don't have any NaN values anymore!
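With the missing values gone, the target from the task statement (late means more than 30 minutes late) can be derived from `ArrDelay`. The sketch below is only an illustration on a small made-up frame, not on the real `df`; the column names `Late` and `DepMinutes` are assumptions introduced here. Note that `ArrDelay` itself would have to be left out of the features, since the label is computed directly from it.

```python
import pandas as pd

# Toy frame standing in for the cleaned `df`; the values are made up.
toy = pd.DataFrame({
    "ArrDelay": [-14.0, 2.0, 31.0, 30.0, 95.0],
    "DepTime": ["20:03:00", "07:54:00", "06:28:00", "23:59:00", "00:05:00"],
})

LATE_THRESHOLD_MIN = 30  # task statement: late = more than 30 minutes

# Binary target: 1 if the flight arrived more than 30 minutes late, else 0.
# A delay of exactly 30 minutes is therefore not counted as late.
toy["Late"] = (toy["ArrDelay"] > LATE_THRESHOLD_MIN).astype(int)

# Hypothetical numeric feature: departure time as minutes since midnight,
# parsed from the "HH:MM:SS" strings seen in the cleaned file.
dep = pd.to_timedelta(toy["DepTime"])
toy["DepMinutes"] = (dep.dt.total_seconds() // 60).astype(int)

print(toy)
```

Turning the clock-time strings into numbers is one possible way to make them usable by the feature selectors and models further down; other encodings (hour of day, cyclic sine/cosine) would work as well.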

---------------------------------

# Apply PCA, SelectKBest and RFE for feature selections.

In [None]:
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# RFE selects features by recursively considering smaller and smaller feature sets.
# The estimator is first trained on the full set and the importance of each feature
# is read from its coef_ or feature_importances_ attribute. The least important
# features are then pruned, and the procedure is repeated on the pruned set until
# the desired number of features is reached.
# SelectKBest keeps the k highest-scoring features. Its score function takes two
# arrays X and y and returns either a pair of arrays (scores, pvalues) or a single
# array of scores. The default, f_classif, only works for classification tasks.

# Select on the original features first, then project: running PCA(n_components=2)
# first would leave only 2 columns, fewer than the k=10 that SelectKBest asks for.
k_filter = SelectKBest(f_regression, k=10)
pipe_lr = Pipeline([('SelectKBest', k_filter),
                    # RFE needs an estimator; LogisticRegression is an assumption here.
                    # With n_features_to_select unset, RFE keeps half of its inputs.
                    ('RFE', RFE(estimator=LogisticRegression(max_iter=1000))),
                    ('PCA', PCA(n_components=2, svd_solver='full'))
                   ])
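To make the difference between the three techniques concrete, the following sketch applies each one separately to a synthetic classification problem. Everything here is an assumption for illustration: the data comes from `make_classification` rather than the airline file, and the feature counts (`k=6`, `n_features_to_select=4`) and the `LogisticRegression` estimator inside RFE are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric flight features and the late/not-late label.
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # PCA and linear models are scale-sensitive

# SelectKBest: univariate filter, scores each feature against y independently.
kbest = SelectKBest(f_classif, k=6).fit(X, y)
X_kbest = kbest.transform(X)

# RFE: wrapper method, repeatedly refits the estimator and drops the weakest feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=4).fit(X, y)
X_rfe = rfe.transform(X)

# PCA: unsupervised projection; the outputs are combinations of all inputs,
# not a subset of the original columns.
pca = PCA(n_components=2, svd_solver="full").fit(X)
X_pca = pca.transform(X)

print("SelectKBest kept columns:", kbest.get_support(indices=True))
print("RFE kept columns:        ", rfe.get_support(indices=True))
print("PCA explained variance:  ", pca.explained_variance_ratio_)
```

SelectKBest and RFE return original columns, so their output stays interpretable, whereas PCA ignores `y` entirely and trades interpretability for compactness.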

# Apply models in Naïve Bayes, Logistic Regression, Decision Tree, Random Forest and SVM.

In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
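
# The task also asks for GridSearchCV to tune the models. A minimal sketch of how
# that could look for two of the imported classifiers is given below. It runs on
# synthetic, imbalanced data from make_classification, and the parameter grids,
# cv=3 and the F1 scoring are illustrative assumptions, not tuned choices for
# the airline data.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: roughly 20% positives, mimicking a minority "late" class.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# One small grid per model; larger grids multiply the number of fits quickly.
searches = {
    "logistic_regression": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1.0, 10.0]}, cv=3, scoring="f1"),
    "decision_tree": GridSearchCV(
        tree.DecisionTreeClassifier(random_state=0),
        {"max_depth": [3, 5, None]}, cv=3, scoring="f1"),
}

for name, search in searches.items():
    search.fit(X_train, y_train)  # refits the best setting on all of X_train
    print(name, search.best_params_,
          "cv F1 =", round(search.best_score_, 3),
          "test accuracy =", round(search.score(X_test, y_test), 3))
```

# On the full dataset of several million rows, an exhaustive grid over SVC or
# Random Forest would be slow, so tuning on a subsample is one option to consider.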

# Compare performances among models, write up analysis why the model is good or bad in the algorithmic approach
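
One possible way to set up the comparison is to cross-validate every model family from the task list with the same folds and metric. The sketch below does this on synthetic data, so the scores say nothing about the airline dataset. Two deliberate deviations from the imports above are assumptions of this sketch: `GaussianNB` is swapped in for `MultinomialNB`, because `MultinomialNB` rejects negative feature values and delays can be negative, and `GradientBoostingClassifier` is added because the task list names Gradient Boosting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the flight features and the late label.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

models = {
    "naive_bayes": GaussianNB(),
    # Scale-sensitive models are wrapped with a scaler so each fold is scaled
    # using its own training portion only.
    "logistic_regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC()),
}

# Same 3-fold split and metric for every model, so the numbers are comparable.
results = {name: cross_val_score(model, X, y, cv=3, scoring="f1").mean()
           for name, model in models.items()}

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} mean F1 = {score:.3f}")
```

F1 is used here instead of accuracy because a classifier that never predicts "late" can still reach high accuracy when late flights are the minority class.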

# Conclusion