# Predicting Air Delays
Notebook II: Exploratory Data Analysis (EDA)
----

Thank you for reviewing this notebook.
Its goal is to analyze and prepare the dataset for use with a machine learning algorithm.

---
#### Problem Statement: 
Both travelers and airlines find delays frustrating and costly. This project aims to predict the probability of a delay for any commercial flight in the United States.

---

#### MVP:
My product will be a small, lightweight proof-of-concept application built on the `streamlit` platform, where a user can find the probability that their desired flight will be delayed, how long the delay may be, and how much the delay will cost them in _lost time_ at the destination.

---
# EDA 

The primary challenge in this notebook is making sense of what will matter to a machine learning algorithm. By the end of this notebook we will have a final processed dataset to train our models on.

The notebook is structured as follows.<br> 
**Notebook II: EDA and Preprocessing**
>1. Setup/Imports
>2. Load data
>3. EDA
>4. Feature engineering and selection
>5. Save the final CSV and discuss next steps

The processed dataframe will be saved to a new file at the end of this notebook and loaded in the subsequent notebook, which keeps each notebook readable.

The notebooks in this project are: <br>
I. Intake and cleaning<br>
**II. EDA and preprocessing**<br>
III. Modeling and predictions<br>
IV. App<br>

---

## 1. Library imports
----
Import our analytical libraries and set the working directory for loading the previously saved dataset.


In [1]:
import os 
import glob
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
os.chdir('/Volumes/lacie/data_ingestion/capstone_hopper')

In [31]:
pd.set_option('display.max_columns', None)

## 2. Load Dataset
---


In [32]:
flights = pd.read_csv('sample_cleaned.csv')



In [33]:
flights.shape

(2837556, 26)

## 3. Exploratory Data Analysis:
---
This section of the notebook explores the data.

Our working definition of a **_delay_** is a flight arriving at the destination airport 15 minutes or more after its scheduled arrival.

This was addressed and handled in the first notebook when we were cleaning and obtaining data. 

'arr_delay' is our target variable. 
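
As a minimal sketch of how the working definition could be turned into a binary label, the snippet below flags rows whose `arr_delay` is 15 minutes or more. The frame `toy`, the column name `is_delayed`, and the constant `DELAY_THRESHOLD_MIN` are hypothetical names used only for illustration, and the values are made up.

```python
import pandas as pd

# Toy stand-in for the flights frame; values are made up for illustration.
toy = pd.DataFrame({"arr_delay": [25.0, 0.0, 21.0, 0.0]})

DELAY_THRESHOLD_MIN = 15  # working definition of a delay, in minutes

# 1 when the flight arrived 15 or more minutes late, else 0.
toy["is_delayed"] = (toy["arr_delay"] >= DELAY_THRESHOLD_MIN).astype(int)
print(toy)
```

Keeping the threshold in a named constant makes it easy to test a different cutoff later without touching the comparison itself.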


In [34]:
flights.head(3)

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,airline,tail_num,op_carrier_fl_num,origin,origin_city_name,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_time,arr_delay,cancelled,cancellation_code,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018,Sep,3,Mon,2018-09-03,Envoy Air,N807AE,3863,JFK,"New York, NY",CLE,"Cleveland, OH",2100,2125.0,25.0,2258,2323.0,25.0,0.0,,425.0,0.0,0.0,0.0,0.0,25.0
1,2016,Feb,1,Mon,2016-02-01,Delta Airlines,N944AT,1875,MCO,"Orlando, FL",RDU,"Raleigh/Durham, NC",1537,1533.0,0.0,1721,1715.0,0.0,0.0,,534.0,,,,,
2,2016,Jun,27,Mon,2016-06-27,Alaska Airlines,N560AS,381,SFO,"San Francisco, CA",PDX,"Portland, OR",1300,1255.0,0.0,1440,1501.0,21.0,0.0,,550.0,0.0,0.0,21.0,0.0,0.0


In [35]:
flights.isnull().sum()

year                         0
month                        0
day_of_month                 0
day_of_week                  0
fl_date                      0
airline                      0
tail_num                     0
op_carrier_fl_num            0
origin                       0
origin_city_name             0
dest                         0
dest_city_name               0
crs_dep_time                 0
dep_time                     0
dep_delay                    0
crs_arr_time                 0
arr_time                     0
arr_delay                    0
cancelled                    0
cancellation_code      2779032
distance                     0
carrier_delay          1453956
weather_delay          1453956
nas_delay              1453956
security_delay         1453956
late_aircraft_delay    1453956
dtype: int64
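
The five delay-cause columns report the same null count, which suggests they may be missing on the same rows. The sketch below shows one way this could be checked; `toy`, `cause_cols`, `missing_per_row`, `all_or_nothing`, and `share_missing` are hypothetical names, and the values are made up rather than taken from the dataset.

```python
import numpy as np
import pandas as pd

cause_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Toy stand-in: rows 0 and 2 have a cause breakdown, row 1 has none.
toy = pd.DataFrame({
    "arr_delay": [25.0, 0.0, 21.0],
    "carrier_delay": [0.0, np.nan, 0.0],
    "weather_delay": [0.0, np.nan, 0.0],
    "nas_delay": [0.0, np.nan, 21.0],
    "security_delay": [0.0, np.nan, 0.0],
    "late_aircraft_delay": [25.0, np.nan, 0.0],
})

# Count missing cause columns per row: only 0 or 5 if they go missing together.
missing_per_row = toy[cause_cols].isnull().sum(axis=1)
all_or_nothing = missing_per_row.isin([0, len(cause_cols)]).all()

# Share of rows with no cause breakdown at all.
share_missing = toy[cause_cols].isnull().all(axis=1).mean()
print(all_or_nothing, round(share_missing, 3))
```

If the check holds on the real frame, the five columns can be treated as a single block when deciding how to handle the missing values.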

### EDA: Delay class balance
---
First, look at the balance of classes.

In [36]:
delayed = flights[flights['arr_delay'] >= 15]
delayed.shape[0]

1383600

In [37]:
ontime = flights[flights['arr_delay']==0]
ontime.shape[0]

1395432

In [38]:
print(f'The balance of delayed to on-time flights is: {round(delayed.shape[0]/ontime.shape[0], 3)}')

The balance of delayed to on-time flights is: 0.992


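The ratio of delayed to on-time flights is close to 1, so the two classes are nearly balanced.
We have approximately 1.4 million examples each of on-time and delayed flights. The remaining 58,524 rows fall into neither class, a count that matches the number of rows with a non-null `cancellation_code`.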
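
Class balance can also be read as a share per class rather than a single ratio. The sketch below reuses the same two masks on a small made-up frame; `toy`, `labelled`, and `shares` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the flights frame; values are made up for illustration.
toy = pd.DataFrame({"arr_delay": [25.0, 0.0, 21.0, 0.0, 0.0]})

# Same masks as the cells above, applied to the toy frame.
delayed = toy[toy["arr_delay"] >= 15]
ontime = toy[toy["arr_delay"] == 0]

# Share of each class among the rows that fall into either class.
labelled = len(delayed) + len(ontime)
shares = {
    "delayed": len(delayed) / labelled,
    "ontime": len(ontime) / labelled,
}
print(shares)
```

Shares that sum to 1 are easier to compare against a model's baseline accuracy than a delayed-to-on-time ratio.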

In [39]:
flights.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2837556 entries, 0 to 2837555
Data columns (total 26 columns):
 #   Column               Dtype  
---  ------               -----  
 0   year                 int64  
 1   month                object 
 2   day_of_month         int64  
 3   day_of_week          object 
 4   fl_date              object 
 5   airline              object 
 6   tail_num             object 
 7   op_carrier_fl_num    int64  
 8   origin               object 
 9   origin_city_name     object 
 10  dest                 object 
 11  dest_city_name       object 
 12  crs_dep_time         int64  
 13  dep_time             float64
 14  dep_delay            float64
 15  crs_arr_time         int64  
 16  arr_time             float64
 17  arr_delay            float64
 18  cancelled            float64
 19  cancellation_code    object 
 20  distance             float64
 21  carrier_delay        float64
 22  weather_delay        float64
 23  nas_delay            float64
 24

In [41]:
# Treat calendar fields, identifiers, and clock-time columns as categorical (string) values.
categorical = {'year': str, 'month': str, 'day_of_month': str, 'day_of_week': str,
               'tail_num': str, 'op_carrier_fl_num': str, 'crs_dep_time': str,
               'dep_time': str, 'crs_arr_time': str, 'arr_time': str}

In [42]:
flights = flights.astype(categorical).copy()
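
One thing worth checking after a cast like this: a float column converted straight to `str` keeps its trailing `.0`, so a float time and an integer time holding the same value no longer match as strings. The sketch below illustrates this on a made-up frame and shows one possible alternative. It assumes the time columns hold HHMM clock values, as the sample rows suggest, and that the column has no missing values; `toy`, `as_str`, and `hhmm` are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    "crs_dep_time": [2100, 5],   # integer column, like crs_dep_time
    "dep_time": [2125.0, 5.0],   # float column, like dep_time
})

# A direct cast keeps the float's trailing ".0".
as_str = toy.astype({"crs_dep_time": str, "dep_time": str})

# Hypothetical alternative: cast to int first, then zero-pad to four digits.
# Assumes the column contains no missing values.
hhmm = toy["dep_time"].astype(int).astype(str).str.zfill(4)

print(as_str["dep_time"].tolist())
print(hhmm.tolist())
```

Whether the zero-padded form is preferable depends on how the time columns are used later, for example if they are binned into hours or one-hot encoded.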

In [46]:
flights.describe()

Unnamed: 0,dep_delay,arr_delay,cancelled,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
count,2837556.0,2837556.0,2837556.0,2837556.0,1383600.0,1383600.0,1383600.0,1383600.0,1383600.0
mean,30.11391,31.58051,0.02062479,834.5014,21.10365,3.694776,14.98505,0.1139303,24.91176
std,67.42929,66.76941,0.1421247,601.9292,63.08657,30.30673,33.62932,3.299001,50.2793
min,-1.0,-1.0,0.0,29.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,391.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,679.0,1.0,0.0,2.0,0.0,0.0
75%,37.0,38.0,0.0,1069.0,19.0,0.0,19.0,0.0,30.0
max,3890.0,3864.0,1.0,5812.0,3864.0,2692.0,1515.0,789.0,2206.0


## 4. Feature Engineering

In [54]:
test.isnull().sum().sum() /flights.shape[0]

0.03248031989195508