# Predicting Air Delays 
----

Thank you for reviewing my code notebook.
This notebook's goal is to obtain, clean, and prepare the dataset for use with a machine learning algorithm.

---
#### Problem Statement: 
Both travelers and airlines find delays frustrating and costly. This project attempts to predict the probability of a delay for any commercial flight in the United States.

---

#### MVP:
My product will be a small, lightweight proof-of-concept application run on the `streamlit` platform, where a user can find the probability of their desired flight being delayed, how long the delay may be, and how much the delay will cost the user in _lost time_ at the destination.

---
# Intake, Cleaning, and EDA. 

The primary challenge in this notebook is managing a large dataset. 
The next challenge will be to conduct meaningful EDA across the whole dataset. 
The notebook is structured as follows. 
1. Imports and set up
2. The size and complexity issue. 
3. Cleaning steps. 
4. Feature engineering and selection. 
5. Save the final CSV and discuss next steps. 

---


## 1. Set-up
----
I will use the `os` library and shell commands run from the notebook to join large CSV tables together.
The imports below are the standard setup.

In [3]:
import os 
import glob
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

I have 68 CSV files. Each file represents one month of flight history from all U.S. airports. As a result, each CSV is approximately 150 MB in size. Loading more than a few of them into the notebook's workspace risks exceeding the memory capacity of the local machine and losing data.
 <br>
 <br>
The approach will be to manipulate each CSV and join them directly from the command line.
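
Before loading anything, the on-disk footprint of the files can be totaled without reading any CSV content into memory. The following is a minimal sketch, not part of the original workflow; the helper name `total_csv_size_mb` and its `directory` argument are hypothetical.

```python
import glob
import os


def total_csv_size_mb(directory="."):
    """Return the combined on-disk size, in megabytes, of all CSV files in `directory`."""
    paths = glob.glob(os.path.join(directory, "*.csv"))
    # os.path.getsize reads file metadata only, so no CSV content is loaded.
    total_bytes = sum(os.path.getsize(path) for path in paths)
    return total_bytes / 1_000_000
```

The on-disk total is only a rough guide: the memory a `pandas` DataFrame needs for the same data can differ substantially from the file size.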

In [4]:
# Change the working directory to the folder that holds all the files to be joined
os.chdir('/Volumes/lacie/data_ingestion/capstone_hopper')

### Data Sources
---
This project gathered delay data from the **Department of Transportation (DOT) Flight Delay reporting Database**. Unfortunately, no public API was available to access this data from the DOT or from the Federal Aviation Administration.

Since there was no way to programmatically acquire the desired amount of data, I used the basic public data library tool and downloaded one CSV per monthly period.

This produced many files that were _just **too big**_, and hence the first unanticipated technical challenge of this project: what do I do with them?

The plan: use the command line to join all the tables.
After cleaning, check how large the resulting file is.

To implement this plan, I use `glob` methods and direct command-line calls.
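
As an illustration of joining tables without loading them into memory, here is a minimal Python sketch that streams each file line by line into one output file and keeps a single header row. It is an alternative to the shell approach, not the commands this notebook runs. The function name `concat_csvs` and its arguments are hypothetical, and it assumes every matched file shares the same one-line header; a file with a different layout would need to be excluded first.

```python
import glob
import os


def concat_csvs(pattern, out_path):
    """Append every CSV matching `pattern` into `out_path`, keeping one header row."""
    out_abs = os.path.abspath(out_path)
    # Skip the output file itself in case it also matches the pattern.
    paths = [p for p in sorted(glob.glob(pattern)) if os.path.abspath(p) != out_abs]

    def with_newline(line):
        # Guard against a file whose last row lacks a trailing newline.
        return line if line.endswith("\n") else line + "\n"

    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(paths):
            with open(path, "r", newline="") as src:
                header = src.readline()
                if i == 0 and header:
                    out.write(with_newline(header))  # write the header only once
                # Copy the remaining rows one line at a time so memory use stays small.
                for line in src:
                    out.write(with_newline(line))
    return len(paths)
```

Because rows are copied one at a time, peak memory use depends on the longest line rather than on the total size of the dataset.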


---
References
[Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ)

In [7]:
# List all the files: a total of 6.958 GB of unfiltered raw data.
!ls

00_cause_of_delay.csv  30_mar_19.csv          51_jun_17.csv
10_nov_20.csv          31_feb_19.csv          52_may_17.csv
11_oct_20.csv          32_jan_19.csv          53_apr_17.csv
12_sep_20.csv          33_dec_18.csv          54_mar_17.csv
13_aug_20.csv          34_nov_18.csv          55_feb_17.csv
14_jul_20.csv          35_oct_18.csv          56_jan_17.csv
15_jun_20.csv          36_sep_18.csv          57_dec_16.csv
16_may_20.csv          37_aug_18.csv          58_nov_16.csv
17_apr_20.csv          38_jul_18.csv          59_oct_16.csv
18_mar_20.csv          39_jun_18.csv          5_apr_21.csv
19_feb_20.csv          3_jun_21.csv           60_sep_16.csv


In [8]:
# Use glob to locate all CSV file names (glob.glob already returns a list).
file_ext = '.csv'
files = glob.glob(f'*{file_ext}')

In [13]:
# Print the first five names in the list to confirm
print(files[0:5])

['00_cause_of_delay.csv', '1_aug_21.csv', '2_jul_21.csv', '3_jun_21.csv', '4_may_21.csv']
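
`glob.glob` returns names in arbitrary order, and a plain `sorted()` would place `10_nov_20.csv` before `1_aug_21.csv` because it compares character by character. If the join order matters, one option is to sort on the numeric prefix. This is a minimal sketch assuming the leading number in each file name is meant to encode the order; the helper `numeric_prefix` is hypothetical, and the sample names are taken from the listings above.

```python
import re


def numeric_prefix(filename):
    """Return the leading integer of a name like '12_sep_20.csv', or -1 if there is none."""
    match = re.match(r"(\d+)_", filename)
    return int(match.group(1)) if match else -1


names = ["10_nov_20.csv", "2_jul_21.csv", "00_cause_of_delay.csv", "1_aug_21.csv"]
ordered = sorted(names, key=numeric_prefix)
print(ordered)
# ['00_cause_of_delay.csv', '1_aug_21.csv', '2_jul_21.csv', '10_nov_20.csv']
```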
