Reference : https://www.kaggle.com/fabiendaniel/predicting-flight-delays-tutorial by Fabien Daniel (September 2017)

# Predicting flight delays [Tutorial]

In this notebook, I develop a model aimed at predicting flight delays at take-off. The purpose is not to obtain the best possible prediction but rather to emphasize the various steps needed to build such a model. Along the way, I highlight some <b>basic but important</b> concepts. Among them, I comment on the importance of splitting the dataset during the training stage and how <b>cross-validation</b> helps in determining accurate model parameters. I show how to build <b>linear</b> and <b>polynomial</b> models for <b>univariate</b> or <b>multivariate regressions</b>, and I give some insight into why <b>regularisation</b> helps us develop models that generalize well.

From a _<b>technical point of view</b>_, the main aspects of Python covered throughout the notebook are:

1. <b>visualization</b> : matplotlib, seaborn, basemap
2. <b>data manipulation</b> : pandas, numpy
3. <b>modeling</b> : sklearn, scipy
4. <b>class definition</b> : regression, figures

During the EDA, I aimed to create good-quality figures from which the information is accessible at first glance. An important part of a data scientist's job is communicating findings to people who do not necessarily have the technical background that data scientists have. Graphics are surely the most powerful tool to achieve that goal, so mastering visualization techniques seems important.

Also, as soon as an action is repeated (mostly identically) a few times, I tend to write classes or functions and eventually embed them in loops. Doing so sometimes takes longer than a simple _copy-paste-edit_ process but, on the one hand, it improves the readability of the code and, most importantly, it reduces the number of lines of code (and so the number of opportunities to introduce mistakes!). In the current notebook, I defined classes in the modeling part in order to perform regressions. I also defined a class to wrap the making of figures. This makes it possible to create stylish figures by tuning the matplotlib parameters once, and then to re-use them through that template. I feel that this is useful for creating nice-looking graphics and then using them extensively once you are satisfied with the tuning. Moreover, it helps keep your plots homogeneous.

This notebook is composed of three parts : cleaning (section 1), exploration (sections 2-5) and modeling (section 6).

_<b>Preamble</b> : overview of the dataset_

<b>1. Cleaning</b>

        1.1 Dates and times
        1.2 Filling factor
   
<b>2. Comparing airlines</b>

        2.1 Basic statistical description of airlines
        2.2 Delays distribution : establishing the ranking of airlines
        
<b>3. Delays : take-off or landing?</b>


<b>4. Relation between the origin airport and delays</b>

        4.1 Geographical area covered by airlines
        4.2 How the origin airport impacts delays
        4.3 Flights with usual delays?
    
<b>5. Temporal variability of delays</b>


<b>6. Predicting flight delays </b>

        6.1 Model no1 : one airline, one airport
            6.1.1 Pitfalls
            6.1.2 Polynomial degree : splitting the dataset
            6.1.3 Model test : prediction of end-January delays
        
        6.2 Model no2 : one airline, all airports
            6.2.1 Linear regression
            6.2.2 Polynomial regression
            6.2.3 Setting the free parameters
            6.2.4 Model test : prediction of end-January delays
        
        6.3 Model no3 : Accounting for destinations
            6.3.1 Choice of the free parameters
            6.3.2 Model test : prediction of end-January delays
    
<b>Conclusion</b>

## Preamble : Overview of the dataset
First, I load all the packages that will be needed during this project : 

In [5]:
import datetime, warnings, scipy
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import ConnectionPatch
from collections import OrderedDict
from matplotlib.gridspec import GridSpec
from mpl_toolkits.basemap import Basemap
from sklearn import metrics, linear_model
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from scipy.optimize import curve_fit
# Draw edges around bar/histogram patches and use the fivethirtyeight style
plt.rcParams["patch.force_edgecolor"] = True
plt.style.use('fivethirtyeight')
mpl.rc('patch', edgecolor = 'dimgray', linewidth = 1)
# Display only the last expression of each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr"
pd.options.display.max_columns = 50
%matplotlib inline
warnings.filterwarnings("ignore")

KeyError: 'PROJ_LIB'
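
The `KeyError: 'PROJ_LIB'` shown above is commonly reported when importing `mpl_toolkits.basemap` with some basemap releases that read the `PROJ_LIB` environment variable at import time. The snippet below is a minimal sketch of one possible workaround, not part of the original notebook: the helper `ensure_proj_lib` is a hypothetical name, and the candidate directories are assumptions based on conda-style layouts that may not match your installation. It sets `PROJ_LIB` only if the variable is unset and a candidate directory exists, then attempts the import and falls back to `Basemap = None` if it still fails.

```python
import importlib
import os
import sys


def ensure_proj_lib(candidates=None):
    """Set PROJ_LIB if it is unset and one of the candidate directories exists.

    Returns the value of PROJ_LIB after the call, or None if it is still unset.
    """
    # Respect a value the user has already configured.
    if os.environ.get("PROJ_LIB"):
        return os.environ["PROJ_LIB"]
    if candidates is None:
        # Assumed conda-style locations; adjust for your own environment.
        candidates = [
            os.path.join(sys.prefix, "share", "proj"),
            os.path.join(sys.prefix, "Library", "share", "proj"),
        ]
    for path in candidates:
        if os.path.isdir(path):
            os.environ["PROJ_LIB"] = path
            return path
    return None


ensure_proj_lib()

try:
    # Import by name so a missing or misconfigured basemap does not stop the cell.
    Basemap = importlib.import_module("mpl_toolkits.basemap").Basemap
except (ImportError, KeyError) as exc:
    Basemap = None
    print(f"basemap unavailable ({exc!r}); map figures cannot be drawn")
```

This has to run before the `from mpl_toolkits.basemap import Basemap` line of the import cell. If `Basemap` ends up as `None`, only the map-based figures are affected; the rest of the imports do not depend on it.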