# Predicting Air Delays
Notebook II: Exploratory Data Analysis (EDA)
----

Thank you for your review of my code notebook.
This notebook's goal is to analyze the dataset and prepare it for use with a machine learning algorithm.

---
#### Problem Statement: 
Both travelers and airlines find delays frustrating and costly. This project aims to predict the probability of a delay for any commercial flight in the United States.

---

#### MVP:
My product will be a small, lightweight proof-of-concept application running on the `streamlit` platform, where a user can find the probability of their desired flight being delayed, how long the delay may be, and how much the delay will cost them in _lost time_ at the destination.

---
# EDA 

The primary challenge in this notebook is making sense of what will matter to a machine learning algorithm. At the end of this notebook we will have a final processed set of data to train our models on.

The notebook is structured as follows.<br> 
**Notebook II: EDA and Preprocessing**
>1. Setup/Imports
>2. Load data
>3. EDA
>4. Feature engineering and selection
>5. Save the final CSV and discuss next steps

The processed dataframe will be saved to a new file at the end of this notebook and loaded in the subsequent notebook, which keeps each notebook readable.

The notebooks in this project are: <br>
I. Intake and cleaning<br>
**II. EDA and preprocessing**<br>
III. Modeling and predictions<br>
IV. App<br>

---

## 1. Library imports
----
Load our analytical libraries and load our previously saved dataset. 


In [1]:
import os 
import glob
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
os.chdir('/Volumes/lacie/data_ingestion/capstone_hopper')

In [8]:
pd.set_option('display.max_columns', None)

## 2. Load Dataset
---


In [3]:
flights = pd.read_csv('sample_cleaned.csv')

In [6]:
flights.shape

(8183263, 24)

## 3. Exploratory Analysis: Time Columns
---
This section of the notebook explores the data.


The time columns `year`, `month`, `day_of_month`, and `day_of_week` are analyzed against the distribution of `arr_delay` values.

`arr_delay` gives the difference, in minutes, between the scheduled arrival time and the actual arrival time.
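As a minimal sketch of comparing a time column against `arr_delay`, the snippet below summarizes arrival delay per month on a small made-up frame. The `toy` data and the `delay_summary` helper are illustrative assumptions, not part of this notebook; the same call could be pointed at `flights` and any of the time columns.

```python
import pandas as pd

# Toy stand-in for `flights`; the values are invented for illustration only.
toy = pd.DataFrame({
    "month": ["Jan", "Jan", "Jul", "Jul", "Jul"],
    "day_of_week": ["Mon", "Tue", "Mon", "Mon", "Sun"],
    "arr_delay": [0.0, 20.0, 45.0, 5.0, 0.0],
})

def delay_summary(df, by):
    """Hypothetical helper: mean, median, and count of arr_delay per value of `by`."""
    return df.groupby(by)["arr_delay"].agg(["mean", "median", "count"])

by_month = delay_summary(toy, "month")
print(by_month)
```

Reporting the median alongside the mean is a deliberate choice here, since delay minutes tend to be right-skewed and a few very late flights can pull the mean up.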

Our working definition of a **_delay_** is a flight arriving at the destination airport at least 15 minutes after its scheduled arrival.
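One way this definition could be turned into a binary label is sketched below on made-up values. The column name `is_delayed`, the constant `DELAY_THRESHOLD_MIN`, and the choice to treat exactly 15 minutes as delayed (`>=`) are assumptions for illustration.

```python
import pandas as pd

# Assumed cutoff in minutes, taken from the working definition of a delay.
DELAY_THRESHOLD_MIN = 15

# Toy arrival delays, including one missing value.
toy = pd.DataFrame({"arr_delay": [0.0, 14.0, 15.0, 62.0, None]})

# A missing arr_delay compares as False, so that row is labeled 0 here;
# such rows may deserve separate handling in the real data.
toy["is_delayed"] = (toy["arr_delay"] >= DELAY_THRESHOLD_MIN).astype(int)

# Share of flights labeled as delayed in the toy frame.
delay_rate = toy["is_delayed"].mean()
print(toy)
print(delay_rate)
```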

In [9]:
flights.head(3)

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,airline,tail_num,origin,origin_city_name,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_time,arr_delay,cancellation_code,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2018,Mar,26,Mon,2018-03-26,Skywest Airlines,N679SA,PAH,"Paducah, KY",ORD,"Chicago, IL",635,637.0,2.0,811,802.0,0.0,,342.0,,,,,
1,2020,May,3,Sun,2020-05-03,Delta Airlines,N981AT,ATL,"Atlanta, GA",HSV,"Huntsville, AL",945,937.0,-8.0,933,925.0,0.0,,151.0,,,,,
2,2017,Jul,3,Mon,2017-07-03,Delta Airlines,N337NW,DTW,"Detroit, MI",RDU,"Raleigh/Durham, NC",2024,2023.0,-1.0,2206,2151.0,0.0,,501.0,,,,,


In [52]:
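# Flights arriving more than 13 minutes late, keeping only the delay-cause columns.
# Note that this cutoff is looser than the 15-minute delay definition above.
test = flights[flights['arr_delay'] > 13][['year','arr_delay','carrier_delay','weather_delay','nas_delay','security_delay','late_aircraft_delay']].copy()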

In [53]:
test.isnull().sum().sum()

265795

## 4. Feature Engineering

In [54]:
# Null cells in `test` (summed over all of its columns) relative to the total number of flights
test.isnull().sum().sum() / flights.shape[0]

0.03248031989195508
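The ratio above divides a count of null cells, summed over several columns, by the number of flight rows, so it is not a per-row or per-column share. A complementary view is sketched below on made-up data: the share of nulls per delay-cause column among the filtered rows, and the number of filtered rows missing every cause. The `toy` frame and the names `cause_cols`, `late`, `null_share`, and `rows_missing_all` are assumptions for illustration.

```python
import pandas as pd

cause_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Toy stand-in for `flights`; the values are invented for illustration only.
toy = pd.DataFrame({
    "arr_delay":           [14.0, 20.0, 45.0, 0.0],
    "carrier_delay":       [None, 20.0, 0.0, None],
    "weather_delay":       [None, 0.0, 45.0, None],
    "nas_delay":           [None, 0.0, 0.0, None],
    "security_delay":      [None, 0.0, 0.0, None],
    "late_aircraft_delay": [None, 0.0, 0.0, None],
})

# Same filter as the `test` frame in this notebook.
late = toy[toy["arr_delay"] > 13]

# Fraction of null values in each cause column among the filtered rows.
null_share = late[cause_cols].isnull().mean()

# Number of filtered rows where every cause column is missing.
rows_missing_all = late[cause_cols].isnull().all(axis=1).sum()

print(null_share)
print(rows_missing_all)
```

Running the same two lines against `test` would show whether the missing causes are spread across columns or concentrated in a specific set of rows.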