# Case Study: Airline on-time performance 

## Problem Definition

Our objective is to predict whether a particular flight will be delayed, using historical data from 2007-2008.

## Hypothesis

In order to solve a problem, the first thing we have to do is define it.
In this section, we list the possible features for our problem statement.
What makes a flight delayed or cancelled?
1. Weather (Tornadoes, Blizzards, Hurricanes)
2. Late-arriving aircraft
3. Airport operations (such as security lines)
4. Problem with Aircraft (Engine failure, faulty component and so on)
5. Air-Traffic control
6. Carrier Crew problems
7. Aircraft Cleaning
8. Baggage Loading
9. Fuelling

These are a few of the features that we believe are relevant to the problem.

## Dataset

We are using the "Airline on-time performance" dataset provided by Stat-Computing.org.
For our purposes, we'll use the data from Jan 2007 to Dec 2008.

More about this dataset can be found on http://stat-computing.org/dataexpo/2009/the-data.html
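
Since the study spans two years, the yearly files have to be combined into one table before any analysis. Here is a minimal sketch, assuming the data is saved locally as one CSV per year; the `load_years` helper is hypothetical, and the demo uses tiny in-memory stand-ins with made-up rows instead of the real files.

```python
import io
import pandas as pd

def load_years(sources):
    """Read one CSV per year and stack them into a single DataFrame."""
    frames = [pd.read_csv(src) for src in sources]
    return pd.concat(frames, ignore_index=True)

# Tiny in-memory stand-ins for the yearly files (made-up rows)
csv_2007 = io.StringIO("Year,Month,UniqueCarrier,ArrDelay\n2007,1,AS,5\n2007,2,HA,-3\n")
csv_2008 = io.StringIO("Year,Month,UniqueCarrier,ArrDelay\n2008,1,F9,20\n")

flights = load_years([csv_2007, csv_2008])
print(flights.shape)  # (3, 4)
```

With the real data, the call would presumably look like `load_years(["2007.csv", "2008.csv"])`.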

## Exploratory Data Analysis

Right now, we don't really know much about our dataset, so let us explore some trends. In this section, we are going to answer a few of the questions asked in the problem statement.

Q1. Which carrier performs better?

It is safe to assume that the carrier with the lowest probability of being delayed or cancelled performs best. Let's get into some stats!
So I've cross-tabulated the unique carriers against the various kinds of delay.

<img src="files/images/weather_delay.jpg">

Alaska Airlines, Hawaiian Airlines & Frontier Airlines have pretty low delays due to weather.
NOTE: Only carrier codes were provided, so it took a little extra labour to map each code to a carrier name for better understanding.
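
The mapping and the per-carrier comparison can be sketched roughly as follows. This is only an illustration: the sample rows are made up, the lookup dictionary is a partial, hand-written assumption (covering just the three carriers named above), and the column names `UniqueCarrier` and `WeatherDelay` are taken from the dataset's published variable list.

```python
import pandas as pd

# Made-up sample rows; the real data has one row per flight
flights = pd.DataFrame({
    "UniqueCarrier": ["AS", "AS", "HA", "F9", "F9"],
    "WeatherDelay":  [0.0, 4.0, 0.0, 2.0, 6.0],
})

# Partial lookup table (carrier code -> carrier name), written by hand
carrier_names = {
    "AS": "Alaska Airlines",
    "HA": "Hawaiian Airlines",
    "F9": "Frontier Airlines",
}

# Codes missing from the dictionary would become NaN
flights["CarrierName"] = flights["UniqueCarrier"].map(carrier_names)

# Average weather delay (minutes) per carrier, lowest first
weather_by_carrier = (
    flights.groupby("CarrierName")["WeatherDelay"].mean().sort_values()
)
print(weather_by_carrier)
```

The same `groupby` can be repeated for the other delay columns to build the remaining charts.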

<img src="files/images/carrier_delay.jpg">

<img src="files/images/nas_delay.jpg">

<img src="files/images/security_delay.jpg">

<img src="files/images/aircraft_arriving_late.jpg">

<img src="files/images/cancelled.jpg">

        So to answer Q1: Alaska Airlines, Hawaiian Airlines & Frontier Airlines show the best performance.

Q2. How well does weather predict plane delays?

I dug a little deeper and compared all of these delays in one tabulation. It led me to the answer to Q2.

<img src="images/tabulation_delays.jpg">

<img src="images/tabulation_delays2.jpg">

    So to answer Q2: weather accounts for only 0.16% of plane delays, which is small compared to the other kinds of delay.
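
The text does not say exactly how the 0.16% figure was calculated. One plausible way to compare the causes is to total the delay minutes per cause and express each as a share of all attributed delay minutes. The sketch below does that on made-up numbers, so its output is not the study's result; the five column names come from the dataset's variable list.

```python
import pandas as pd

delay_cols = ["CarrierDelay", "WeatherDelay", "NASDelay",
              "SecurityDelay", "LateAircraftDelay"]

# Made-up delay minutes for four flights
flights = pd.DataFrame({
    "CarrierDelay":      [10, 0, 0, 30],
    "WeatherDelay":      [0, 0, 5, 0],
    "NASDelay":          [0, 20, 0, 5],
    "SecurityDelay":     [0, 0, 0, 0],
    "LateAircraftDelay": [0, 15, 0, 15],
})

# Total minutes per cause, then each cause as a percentage of the total
minutes_by_cause = flights[delay_cols].sum()
share_pct = 100 * minutes_by_cause / minutes_by_cause.sum()
print(share_pct.round(1))
```

An alternative definition, the share of flights with any weather delay at all, would give a different number, so the definition should be stated next to the figure.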

Q3. When is the best time of day/day of week/time of year to fly to minimise delays?  

<img src="files/images/month_delay.jpg">

<img src="files/images/day_hr_delay.jpg">

    So, it is safe to say that delays are fewer when we keep the following in mind - 
    - Friday has the highest probability of delayed flights
    - December, June and July have the highest percentage of delayed flights
    - Avoid flying during holidays & summer
    - Avoid flights that depart between 5 & 7pm
    + Fly in April, May or September
    + Fly early in the day
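
Findings like these can be derived by grouping flights by month, day of week and scheduled departure hour, and then computing the fraction of delayed flights in each group. The sketch below uses made-up rows and assumes that "delayed" means an arrival delay of more than 15 minutes; the study does not state its threshold.

```python
import pandas as pd

# Made-up flights; CRSDepTime is the scheduled departure in hhmm format
flights = pd.DataFrame({
    "Month":      [4, 4, 12, 12, 7, 7],
    "DayOfWeek":  [2, 3, 5, 5, 5, 1],      # 1 = Monday ... 7 = Sunday
    "CRSDepTime": [630, 900, 1800, 1745, 1715, 700],
    "ArrDelay":   [-5, 10, 60, 30, 25, 0],
})

DELAY_THRESHOLD = 15  # minutes; an assumption, not taken from the study
flights["Delayed"] = flights["ArrDelay"] > DELAY_THRESHOLD
flights["DepHour"] = flights["CRSDepTime"] // 100  # 1745 -> 17

# Percentage of delayed flights per month, weekday and departure hour
delay_rate = {}
for key in ["Month", "DayOfWeek", "DepHour"]:
    delay_rate[key] = flights.groupby(key)["Delayed"].mean() * 100
    print(delay_rate[key].round(1).to_string(), end="\n\n")
```

Sorting each resulting series shows the best and worst periods at a glance.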

Q4. Do older planes suffer more delays?

<img src="files/images/age_delay.jpg">

    This doesn't show any trend. No, older planes do not suffer more delays.
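
A check like this needs the manufacturing year of each aircraft, which is not part of the flight records themselves. The sketch below assumes a separate plane table keyed by tail number, with hypothetical column names `tailnum` and `year`, and uses made-up rows. It joins the two tables, derives the aircraft age, and measures the linear correlation with arrival delay.

```python
import pandas as pd

# Made-up flight records
flights = pd.DataFrame({
    "Year":     [2007, 2007, 2008, 2008],
    "TailNum":  ["N101", "N202", "N101", "N303"],
    "ArrDelay": [12, 3, 20, 7],
})

# Stand-in for a plane table (assumed columns: tailnum, year of manufacture)
planes = pd.DataFrame({
    "tailnum": ["N101", "N202", "N303"],
    "year":    [1990, 2005, 1998],
})

merged = flights.merge(planes, left_on="TailNum", right_on="tailnum", how="inner")
merged["PlaneAge"] = merged["Year"] - merged["year"]

# Pearson correlation between aircraft age and arrival delay
corr = merged["PlaneAge"].corr(merged["ArrDelay"])
print(merged[["TailNum", "PlaneAge", "ArrDelay"]])
print(round(corr, 3))
```

A correlation close to zero on the real data would support the conclusion above. Real plane records may contain missing or invalid years, which would need cleaning first.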

Q5. Can you detect cascading failures as delays in one airport create delays in others? Are there critical links in the system?

    Predicting this would require extreme computational resources, so it is beyond the scope of this study.
    However, we can give preference to the arrival timings of flights, since a delay in one flight at an airport has a cascading effect on other flights.
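
Although a full network analysis is out of scope, the basic idea of a delay carrying over from one flight to the next can be illustrated on a small scale. The sketch below is a simplified per-aircraft view, not a cascading-failure model: it follows each tail number through its legs and flags a leg that departs late right after a late inbound leg. The rows are made up and the 15-minute threshold is an assumption.

```python
import pandas as pd

# Made-up legs flown by two aircraft on one day
legs = pd.DataFrame({
    "TailNum":    ["N101", "N101", "N101", "N202", "N202"],
    "CRSDepTime": [600, 900, 1300, 700, 1100],
    "Origin":     ["SEA", "SFO", "LAX", "DEN", "ORD"],
    "Dest":       ["SFO", "LAX", "SEA", "ORD", "DEN"],
    "ArrDelay":   [45, 30, 10, 0, 5],
    "DepDelay":   [40, 35, 20, 0, 2],
})

# Order each aircraft's legs by scheduled departure
legs = legs.sort_values(["TailNum", "CRSDepTime"])

# Arrival delay of the same aircraft's previous leg (NaN for its first leg)
legs["PrevArrDelay"] = legs.groupby("TailNum")["ArrDelay"].shift(1)

THRESHOLD = 15  # minutes; an assumption
legs["Propagated"] = (legs["PrevArrDelay"] > THRESHOLD) & (legs["DepDelay"] > THRESHOLD)
print(legs[["TailNum", "Origin", "Dest", "PrevArrDelay", "DepDelay", "Propagated"]])
```

On the real data the sort would also need the flight date, and counting flagged legs per origin airport would hint at which links pass delays on most often.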

Q6. Create a model to predict flight delays

Using machine learning libraries, I'm going to create a model in Python that predicts whether a flight will be delayed or not. The training set is 70% of the 2007-08 data, and the remaining 30% is held out for testing. Let's go.

You can take a look at "modeling.py" to see the code. Since it is a huge dataset, execution will be slow on a local system. We can use either decision trees or logistic regression for classification.
The architecture of the predictive model looks something like this:

<img src="files/images/dt.jpg">

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import sklearn
import io
import requests

# ----------------------------------
#	STEP 1 - READ THE DATASET AND UNDERSTAND 

dataset = pd.read_csv("2007.csv")

print(type(dataset))

#looking into head
print(dataset.head())

#no of rows, cols
print(dataset.shape)

#column dtypes and memory usage (info() prints directly and returns None)
dataset.info()

#count non-NA values per column
print(dataset.count())

#column names
print(dataset.columns)

#lets summarize the dataset

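#get column totals (numeric columns only; the data also has text columns)
print(dataset.sum(numeric_only=True))

#get stats
print(dataset.describe())

#get mean (numeric columns only)
print(dataset.mean(numeric_only=True))

#get median (numeric columns only)
print(dataset.median(numeric_only=True))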

print("#----------------------xx-------------------------")


from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


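#drop flights with no recorded arrival delay (e.g. cancelled or diverted)
dataset = dataset.dropna(subset=['ArrDelay'])

#input features: Month, DayofMonth, DayOfWeek (columns 1-3 of the file)
X = dataset[['Month', 'DayofMonth', 'DayOfWeek']].values

#target (assumption): 1 if the flight arrived more than 15 minutes late, else 0
y = (dataset['ArrDelay'] > 15).astype(int).values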

#splitting training & testing: hold out 30% for testing (70/30 split)
validation_size = 0.30
seed = 9
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=validation_size, random_state=seed)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)


print('#----------------------xx------------------------------#')
print('---------SEC 2 MODELING--------------')

#models - LR,LDA,KNN, CART, RF, NB, SVM

num_trees = 200
max_features = 3
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('RF', RandomForestClassifier(n_estimators=num_trees, max_features=max_features)))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))


#fit models & eval

results = []
names = []
scoring = 'accuracy'

#run 10-fold cross validation for each candidate model
for name, model in models:
    #shuffle=True is required when random_state is set
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)


#lets box plot model scores

fig = pyplot.figure()
fig.suptitle('ML algo comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

#create prediction model
model = LogisticRegression()

#fit model
model.fit(X_train, Y_train)

#predict!
predictions = model.predict(X_test)

#check accuracy
print("Model --- LogisticRegression")
print("Accuracy: {} ".format(accuracy_score(Y_test, predictions) * 100))
print(classification_report(Y_test, predictions))


#plotting confusion matrix on heatmap
import seaborn as sns  #needed for sns.heatmap

cm = confusion_matrix(Y_test, predictions)
#create the figure first so the heatmap is drawn on it
plt.figure(figsize=(3,3))
sns.heatmap(cm, annot=True, fmt='d', xticklabels=['not_delayed','delayed'], yticklabels=['not_delayed','delayed'])
plt.show()

#make predictions on some new data





print('#--------------END------THANKS-------------#')

So now we are done with predictive modeling as well. Next, we are going to figure out the architecture.
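
The script leaves its "make predictions on some new data" step empty. As a rough illustration of what that step could look like, the sketch below fits a logistic regression on synthetic data and scores two hypothetical flights. It assumes the model's three inputs are Month, DayofMonth and DayOfWeek; the training labels are random, so the printed predictions carry no real meaning.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)

# Synthetic training inputs: Month (1-12), DayofMonth (1-28), DayOfWeek (1-7)
X_train = np.column_stack([
    rng.integers(1, 13, size=200),
    rng.integers(1, 29, size=200),
    rng.integers(1, 8, size=200),
])
# Synthetic labels: 1 = delayed, 0 = not delayed (random, illustration only)
y_train = rng.integers(0, 2, size=200)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Two hypothetical flights: a Friday in December and a Tuesday in April
new_flights = np.array([[12, 21, 5], [4, 15, 2]])
predictions = model.predict(new_flights)
probabilities = model.predict_proba(new_flights)[:, 1]  # P(delayed)
print(predictions, probabilities.round(2))
```

New rows must have the same columns, in the same order, as the data the model was trained on.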

## Big Data Architecture

Because the dataset is so large, it falls under the category of big data and must be treated differently.
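
Before moving to a dedicated big data stack, one simple way to treat a large file differently is to stream it in chunks so it never has to fit in memory at once. The sketch below is only an illustration of that idea and is not taken from the architecture diagram. It uses the `chunksize` option of `pandas.read_csv` on a tiny in-memory stand-in with made-up rows, and again assumes a 15-minute delay threshold.

```python
import io
import pandas as pd

# In-memory stand-in for a large yearly file (made-up rows)
raw = io.StringIO(
    "UniqueCarrier,ArrDelay\n"
    "AS,5\nAS,25\nHA,0\nF9,40\nHA,-2\nF9,10\n"
)

totals = pd.Series(dtype="float64")   # flights seen per carrier
delayed = pd.Series(dtype="float64")  # delayed flights per carrier

# chunksize makes read_csv yield DataFrames of at most that many rows
for chunk in pd.read_csv(raw, chunksize=2):
    is_delayed = chunk["ArrDelay"] > 15  # assumption: 15-minute threshold
    grouped = is_delayed.groupby(chunk["UniqueCarrier"])
    # Merge this chunk's counts into the running totals
    totals = totals.add(grouped.size(), fill_value=0)
    delayed = delayed.add(grouped.sum(), fill_value=0)

delay_rate = delayed / totals
print(delay_rate)
```

On the real files a chunk size in the hundreds of thousands of rows would be more typical. The same running-totals pattern also maps naturally onto distributed tools.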

<img src="files/images/architecture.jpg">

## Something about me

I'm a Data Scientist. I like to play with data and build predictive models. I have also worked as a Machine Learning developer, helping to develop recommendation systems, time series forecasting and marketing analytics.
The reason I stayed up two nights to finish this case study is that I've come across IGT's work and am more than excited to be a part of it.

What can I do for you?
I can help you with all things machine learning & deep learning. I believe that I'll add value to the team because I'm said to have a statistical edge and to be creative. I'm a bachelor and willing to travel at a moment's notice. 
I find this project really interesting, and it has kept me excited throughout the weekend.
I hope you find this case study valuable and insightful.

Thank you for your time.