# Predicting Air Delays 
----

Thank you for your review of my code notebook.
This notebook's goal is to obtain, modify, clean, and prepare the dataset for use with a machine learning algorithm.

---
#### Problem Statement: 
Both travelers and airlines find delays frustrating and costly. This project attempts to predict the probability of a delay for any commercial flight in the United States. 

---

#### MVP:
My product will be a small, lightweight proof-of-concept application built on the `streamlit` platform, where a user can find the probability of their desired flight being delayed, how long the delay may be, and how much the delay will cost them in _lost time_ at the destination. 

---
# Intake, Cleaning, and EDA. 

The primary challenge in this notebook is managing a large dataset. 
The next challenge will be to conduct meaningful EDA across the whole dataset. 
The notebook is structured as follows. 
1. Imports and set up
2. The size and complexity issue. 
3. Cleaning steps. 
4. Feature engineering and selection. 
5. Save the final CSV and discuss next steps. 

---


## 1. Set-up
----
I will be making use of the `os`, `glob`, and `amadeus` libraries for Python. 

`os` and `glob` will be used in conjunction with command line commands from the notebook to join the large CSV tables together.

`amadeus` provides access to the service's self-service APIs. The APIs require an API key to use. 
[**sign up here**](https://developers.amadeus.com)


In [None]:
# !pip install amadeus

In [11]:
import os 
import glob
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from amadeus import Client, ResponseError


I have 68 CSV files. Each file represents one month of flight history from all U.S. airports, so each CSV is approximately 150 MB in size. Loading more than a few of them into the notebook's workspace exceeds the memory capacity of the local machine and results in lost data. 
 <br>
 <br>
The approach will be to manipulate each of the CSVs and join them directly from the command line. 

In [2]:
# Change the working directory to the folder that holds all the files to join.
os.chdir('/Volumes/lacie/data_ingestion/capstone_hopper')

### Data Sources
---
This project gathered delay data from the **Department of Transportation (DOT) Flight Delay reporting Database**. Sadly, there was no public API available to access this data from the DOT or from the Federal Aviation Administration.

Given there was no way to programmatically acquire the desired amount of data, I used the basic public data library tool and downloaded one CSV per monthly period.  

This created a lot of files that were _just **too big**_, and hence the first unanticipated technical challenge of this project: what do I do? 

The plan: use the command line to join all the tables. 
After cleaning see how large the file is. 

To implement this plan, I will use `glob` and the command line directly. 
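
One memory-friendly way to join the tables is to stream each file to a single output file line by line, so only one line is ever held in memory. The block below is a minimal sketch, not the method used later in this notebook. The function name `concat_csv_streaming` is hypothetical, and it assumes every CSV has the same header row and column order.

```python
import glob
import os


def concat_csv_streaming(pattern, out_path):
    """Append every CSV matching `pattern` to `out_path`, keeping one header.

    Assumption: all input files share the same header and column order.
    """
    out_abs = os.path.abspath(out_path)
    # Skip the output file itself in case it matches the pattern on a re-run.
    paths = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)

    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(paths):
            with open(path, "r", newline="") as src:
                header = src.readline()
                if i == 0 and header:
                    # Write the header once, taken from the first file.
                    out.write(header if header.endswith("\n") else header + "\n")
                for line in src:
                    # Guard against a final row that lacks a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
    return paths
```

Calling `concat_csv_streaming('*.csv', 'all_flights.csv')` from the data directory would write the combined file without loading any table into memory.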


---
References<br>
[Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ)
<br>
[GLOB tutorial](https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/)

In [3]:
# List all the files: a total of 6.958 GB of unfiltered raw data.
!ls

10_nov_20.csv 23_oct_19.csv 36_sep_18.csv 49_aug_17.csv 61_aug_16.csv
11_oct_20.csv 24_sep_19.csv 37_aug_18.csv 4_may_21.csv  62_jul_16.csv
12_sep_20.csv 25_aug_19.csv 38_jul_18.csv 50_jul_17.csv 63_jun_16.csv
13_aug_20.csv 26_jul_19.csv 39_jun_18.csv 51_jun_17.csv 64_may_16.csv
14_jul_20.csv 27_jun_19.csv 3_jun_21.csv  52_may_17.csv 65_apr_16.csv
15_jun_20.csv 28_may_19.csv 40_may_18.csv 53_apr_17.csv 66_mar_16.csv
16_may_20.csv 29_apr_19.csv 41_apr_18.csv 54_mar_17.csv 67_feb_16.csv
17_apr_20.csv 2_jul_21.csv  42_mar_18.csv 55_feb_17.csv 68_jan_16.csv


In [4]:
# Use glob to locate all file names with the CSV extension.
file_ext = '.csv'
files = sorted(glob.glob(f'*{file_ext}'))

In [5]:
# Print the first five and last four file names to confirm.
print('first five files :',files[0:5],
      'last four files: ' ,files[-4:])

first five files : ['10_nov_20.csv', '11_oct_20.csv', '12_sep_20.csv', '13_aug_20.csv', '14_jul_20.csv'] last four files:  ['6_mar_21.csv', '7_feb_21.csv', '8_jan_21.csv', '9_dec_20.csv']
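
The listing starts with `10_nov_20.csv` because `sorted` compares the names as strings, so `'10'` sorts before `'2'`. Row order does not affect the combined table's contents, but if the files should be processed in order of their numeric prefix, a key function can do it. This is a small sketch. The helper name `month_index` is hypothetical, and it assumes every file name begins with an integer followed by an underscore.

```python
def month_index(file_name):
    """Return the integer prefix of a name such as '10_nov_20.csv'."""
    return int(file_name.split("_", 1)[0])


sample = ["10_nov_20.csv", "2_jul_21.csv", "68_jan_16.csv", "9_dec_20.csv"]
ordered = sorted(sample, key=month_index)
print(ordered)
# ['2_jul_21.csv', '9_dec_20.csv', '10_nov_20.csv', '68_jan_16.csv']
```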


In [6]:
# Read every CSV in a list comprehension and concatenate them all with pd.concat().
all_flights = pd.concat([pd.read_csv(file) for file in files ])




In [7]:
#export to csv
all_flights.to_csv( "all_flights.csv", index=False, encoding='utf-8-sig')

In [8]:
all_flights.shape

(34409230, 34)

The combined dataset contains 34,409,230 flights and 34 _raw_ feature columns. The process took approximately 8 minutes. 

In [10]:
all_flights.head()

| Unnamed: 0 | YEAR | MONTH | DAY_OF_MONTH | DAY_OF_WEEK | FL_DATE | OP_UNIQUE_CARRIER | TAIL_NUM | OP_CARRIER_FL_NUM | ORIGIN | ORIGIN_CITY_NAME | ... | DIVERTED | CRS_ELAPSED_TIME | FLIGHTS | DISTANCE | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY | Unnamed: 33 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 11 | 12 | 4 | 2020-11-12 | AA | N844NN | 1783 | PHL | Philadelphia, PA | ... | 0.0 | 235.0 | 1.0 | 1303.0 |  |  |  |  |  |  |
| 1 | 2020 | 11 | 13 | 5 | 2020-11-13 | AA | N339PL | 1783 | PHL | Philadelphia, PA | ... | 0.0 | 235.0 | 1.0 | 1303.0 |  |  |  |  |  |  |
| 2 | 2020 | 11 | 14 | 6 | 2020-11-14 | AA | N879NN | 1783 | PHL | Philadelphia, PA | ... | 0.0 | 235.0 | 1.0 | 1303.0 |  |  |  |  |  |  |
| 3 | 2020 | 11 | 15 | 7 | 2020-11-15 | AA | N829NN | 1783 | PHL | Philadelphia, PA | ... | 0.0 | 235.0 | 1.0 | 1303.0 |  |  |  |  |  |  |
| 4 | 2020 | 11 | 16 | 1 | 2020-11-16 | AA | N982AN | 1783 | PHL | Philadelphia, PA | ... | 0.0 | 235.0 | 1.0 | 1303.0 |  |  |  |  |  |  |


<br>

-----

<br>

#### Using the Amadeus flight price analysis API
---
Amadeus, a transportation global distribution system*, provides developers with several very useful self-service APIs to access current and historical data relating to flights and much more. 
<br><br>
For this project I wanted to provide the user with the cost of a potential delay. This data will be used in a secondary regression that estimates how much the added time will cost. 
<br><br>
To produce this price estimate I needed prices for each flight**. We, as travelers, all know that not every seat costs the same amount of money and that airlines set prices dynamically according to their own pricing strategies. 


---
References:<br>
[Flight Price Analysis API](https://developers.amadeus.com/self-service/category/air/api-doc/flight-price-analysis)
\** The total quota is 10,000 calls before having to use a production price tier. 

Notes:<br>
\* Definition of a [global distribution system (GDS)](https://en.wikipedia.org/wiki/Global_distribution_system):
> "is a computerised network system owned or operated by a company that enables transactions between travel industry service providers, mainly airlines, hotels, car rental companies, and travel agencies."

To test the API, I will submit their example code for the Flight Price Analysis endpoint (`itinerary_price_metrics`).

In [31]:
# Read the credentials from environment variables rather than hard-coding them.
amadeus = Client(
    client_id=os.environ['AMADEUS_CLIENT_ID'],
    client_secret=os.environ['AMADEUS_CLIENT_SECRET']
)

try:
    # Returns price metrics of a given itinerary
    response = amadeus.analytics.itinerary_price_metrics.get(originIataCode='LAX',
                                                             destinationIataCode='JFK',
                                                             departureDate='2019-12-31')
    print(response.status_code)
    print(response.data)
except ResponseError as error:
    raise error

200
[{'type': 'itinerary-price-metric', 'origin': {'iataCode': 'LAX'}, 'destination': {'iataCode': 'JFK'}, 'departureDate': '2019-12-31', 'transportType': 'FLIGHT', 'currencyCode': 'EUR', 'oneWay': False, 'priceMetrics': [{'amount': '63.86', 'quartileRanking': 'MINIMUM'}, {'amount': '424.67', 'quartileRanking': 'FIRST'}, {'amount': '493.59', 'quartileRanking': 'MEDIUM'}, {'amount': '552.25', 'quartileRanking': 'THIRD'}, {'amount': '582.42', 'quartileRanking': 'MAXIMUM'}]}]


If a JSON response like the one above is returned, you have successfully accessed the self-service API. 
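
For later use it may help to flatten the response into a simple lookup from quartile label to price. The sketch below works on a trimmed copy of the response printed above, so it runs without calling the API. The helper name `price_metrics_to_dict` is hypothetical, and it assumes the response list has at least one entry with the `currencyCode` and `priceMetrics` keys seen in that output.

```python
def price_metrics_to_dict(data):
    """Return (currency, {quartileRanking: amount as float}) for the first entry."""
    first = data[0]
    prices = {m["quartileRanking"]: float(m["amount"]) for m in first["priceMetrics"]}
    return first["currencyCode"], prices


# Trimmed copy of the response shown above (LAX to JFK, 2019-12-31).
sample = [{
    "currencyCode": "EUR",
    "priceMetrics": [
        {"amount": "63.86", "quartileRanking": "MINIMUM"},
        {"amount": "424.67", "quartileRanking": "FIRST"},
        {"amount": "493.59", "quartileRanking": "MEDIUM"},
        {"amount": "552.25", "quartileRanking": "THIRD"},
        {"amount": "582.42", "quartileRanking": "MAXIMUM"},
    ],
}]

currency, prices = price_metrics_to_dict(sample)
print(currency, prices["MEDIUM"])  # EUR 493.59
```

Note that the amounts arrive as strings and in the currency reported by `currencyCode`, so they need converting before they can feed a regression.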

### Data Size
---
As the shape of `all_flights` shows, running any operation on 34.4 million rows of data would be _taxing_ on most local systems. 
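
A rough back-of-the-envelope figure illustrates the scale. The calculation below assumes every cell is stored as a 64-bit number (8 bytes). That is an assumption made only for this estimate: text columns such as `TAIL_NUM` or `ORIGIN_CITY_NAME` typically take more memory in pandas, so the real footprint is likely higher.

```python
ROWS = 34_409_230      # from all_flights.shape
COLUMNS = 34           # from all_flights.shape
BYTES_PER_CELL = 8     # assumption: every cell is a 64-bit number

total_bytes = ROWS * COLUMNS * BYTES_PER_CELL
print(f"{total_bytes:,} bytes, about {total_bytes / 1e9:.2f} GB")
# 9,359,310,560 bytes, about 9.36 GB
```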