# Predicting Air Delays
Notebook II: Exploratory Data Analysis (EDA)
----

Thank you for reviewing my code notebook.
This notebook's goal is to analyze the dataset and prepare it for use with a machine learning algorithm.

---
#### Problem Statement: 
Both travelers and airlines find delays frustrating and costly. This project attempts to predict the probability of a delay for any commercial flight in the United States.

---

#### MVP:
My product will be a small, lightweight proof-of-concept application running on the `streamlit` platform, where a user can find the probability of their desired flight being delayed, how long the delay may be, and how much the delay will cost the user in _lost time_ at the destination.

---
# EDA 

The primary challenge in this notebook is making sense of what will matter to a machine learning algorithm. At the end of this notebook we will have a final processed set of data to train our models on.

The notebook is structured as follows.<br> 
**Notebook II: EDA and Preprocessing**
>1. Setup/Imports
>2. Load data
>3. EDA
>4. Feature engineering and selection. 
>5. Save the final CSV and discuss next steps. 

The processed dataframe will be saved to a new file at the end of this notebook and loaded in the subsequent notebook, which keeps each notebook readable.

The notebooks in this project are: <br>
I. Intake and cleaning<br>
**II. EDA and preprocessing**<br>
III. Modeling and predictions<br>
IV. App<br>

---

## 1. Library imports
----
Import our analytical libraries and point the notebook at the directory holding the previously saved dataset.


In [1]:
import os 
import glob
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Work from the project data directory (machine-specific path)
os.chdir('/Volumes/lacie/data_ingestion/capstone_hopper')

In [3]:
# Show every column when displaying a dataframe
pd.set_option('display.max_columns', None)

## 2. Load Dataset
---


In [4]:
flights = pd.read_csv('sample_cleaned.csv')



In [5]:
flights.shape

(2821814, 27)
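
With roughly 2.8 million rows, letting pandas infer every column type can be slow and may produce inconsistent types for text columns that are mostly empty. The following is a minimal sketch of declaring types at load time, not the loading code used in this project. It runs on a tiny in-memory stand-in for `sample_cleaned.csv`, and the choice of which columns to read as strings is an assumption based on the column names shown in this notebook.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for sample_cleaned.csv (the real file has 27 columns).
csv_text = io.StringIO(
    "fl_date,tail_num,cancellation_code,arr_delay\n"
    "2018-09-03,N807AE,,25.0\n"
    "2016-02-01,N944AT,,0.0\n"
    "2016-06-27,N560AS,B,0.0\n"
)

# Assumption: these columns hold text, so they are declared as strings up front
# instead of leaving the type to inference.
text_columns = {"tail_num": str, "cancellation_code": str}

sample = pd.read_csv(csv_text, dtype=text_columns, parse_dates=["fl_date"])

print(sample.dtypes)
print(sample.shape)
```

The same `dtype` and `parse_dates` arguments can be passed when reading the full file from disk.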

## 3. Exploratory Data Analysis: Time Columns
---
This section of the notebook explores the data.

Our working definition of a **_delay_** is a flight arriving at its destination airport 15 minutes or more after its scheduled arrival time.

This definition was applied in the first notebook, when we obtained and cleaned the data.
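
As a rough illustration of the delay definition, the sketch below turns `arr_delay` (minutes) into a binary flag. The column name `is_delayed` is hypothetical, and treating exactly 15 minutes as delayed is an assumption. The rows are illustrative values, not the project data.

```python
import pandas as pd

DELAY_THRESHOLD_MIN = 15  # threshold from the working definition of a delay

# Illustrative arrival delays in minutes (negative would mean an early arrival).
example = pd.DataFrame({"arr_delay": [25.0, 0.0, 21.0, 14.0]})

# Hypothetical flag column: 1 if the flight arrived 15+ minutes late, else 0.
example["is_delayed"] = (example["arr_delay"] >= DELAY_THRESHOLD_MIN).astype(int)

print(example)
```

A flag like this is one natural candidate for a classification target, while `arr_delay` itself could serve as a regression target for the length of the delay.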



In [6]:
flights.head(3)

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,airline,tail_num,op_carrier_fl_num,origin,origin_city_name,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_time,arr_delay,cancelled,cancellation_code,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,op_unique_carrier
0,2018,Sep,3,Mon,2018-09-03,Envoy Air,N807AE,3863,JFK,"New York, NY",CLE,"Cleveland, OH",2100,2125.0,25.0,2258,2323.0,25.0,0.0,,425.0,0.0,0.0,0.0,0.0,25.0,
1,2016,Feb,1,Mon,2016-02-01,Delta Airlines,N944AT,1875,MCO,"Orlando, FL",RDU,"Raleigh/Durham, NC",1537,1533.0,-4.0,1721,1715.0,0.0,0.0,,534.0,,,,,,
2,2016,Jun,27,Mon,2016-06-27,Alaska Airlines,N560AS,381,SFO,"San Francisco, CA",PDX,"Portland, OR",1300,1255.0,-5.0,1440,1501.0,21.0,0.0,,550.0,0.0,0.0,21.0,0.0,0.0,


In [9]:
flights.isnull().sum()

year                         0
month                        0
day_of_month                 0
day_of_week                  0
fl_date                      0
airline                  42782
tail_num                     0
op_carrier_fl_num            0
origin                       0
origin_city_name             0
dest                         0
dest_city_name               0
crs_dep_time                 0
dep_time                 41420
dep_delay                41757
crs_arr_time                 0
arr_time                 42782
arr_delay                    0
cancelled                    0
cancellation_code      2779032
distance                     0
carrier_delay          1438214
weather_delay          1438214
nas_delay              1438214
security_delay         1438214
late_aircraft_delay    1438214
op_unique_carrier      2779032
dtype: int64
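
Raw null counts are easier to compare as a share of all rows. From the counts above, the five delay-cause columns are missing in about 51% of rows (1438214 of 2821814) and `cancellation_code` in about 98.5%. A minimal sketch of computing those shares directly, using a toy dataframe rather than `flights`:

```python
import numpy as np
import pandas as pd

# Toy stand-in with the same kind of missingness pattern as the real data.
toy = pd.DataFrame({
    "arr_delay": [25.0, 0.0, 21.0, -3.0],
    "carrier_delay": [0.0, np.nan, 0.0, np.nan],
    "cancellation_code": [np.nan, np.nan, np.nan, np.nan],
})

# isnull().mean() gives the fraction of missing values per column;
# multiply by 100 for a percentage and sort with the worst columns first.
null_share = toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)

print(null_share)
```

Running the same expression on `flights` would give the per-column percentages for the full sample.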

In [12]:
# Cancelled flights with no airline recorded, as a share of all flights
flights[flights['airline'].isnull()]['cancelled'].sum() / flights.shape[0]

0.015161169375444306
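
Multiplying this ratio by the 2821814 rows gives 42782, the same as the null count for `airline`, which suggests that every row with a missing airline is a cancelled flight. One way to check whether other gaps follow the same pattern is to compare null rates between cancelled and non-cancelled flights. The sketch below does this on a toy dataframe. The choice of columns is illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in: two completed flights and two cancelled flights.
toy = pd.DataFrame({
    "cancelled": [0.0, 0.0, 1.0, 1.0],
    "dep_time": [2125.0, 1533.0, np.nan, 1010.0],
    "arr_time": [2323.0, 1715.0, np.nan, np.nan],
})

cols = ["dep_time", "arr_time"]

# Fraction of missing values per column, split by cancellation status.
null_by_status = toy[cols].isnull().groupby(toy["cancelled"]).mean()

print(null_by_status)
```

If the nulls in a column sit almost entirely in the cancelled group, dropping or separately handling cancelled flights would remove them without imputation.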

## 4. Feature Engineering

In [54]:
test.isnull().sum().sum() / flights.shape[0]

0.03248031989195508