# Airline Delays - Examples

## Using `clean.py`

In [1]:
import clean
import pandas as pd



You'll first need the `ca_airports.json` (available in the repository). Specify the file path, as in the cell below.

In [2]:
cln = clean.Airline('./ca_airports.json')

Before you proceed, download the dataset and unzip the file. The columns you'll need to select from the BTS data portal are listed in the Miscellaneous section of the `README` file. Once you have the csv, make sure to note its file path.

The following methods are available once you have `Airline` instantiated:
- `process`: given the file path to a BTS csv (`df_fp`) and a sample size (`n_sample`), return a processed dataset.
    - `n_sample` defaults to 750. If `n_sample` is `None`, no sampling happens and the entire DataFrame is returned.
- `join`: given a processed DataFrame and the file path to one month of weather data, perform an inner join on the two datasets and return the resulting DataFrame. The flight data and the weather data must cover the same month, and the weather data must be pulled using the `weather.py` script.
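
The `n_sample` behaviour described above can be pictured with a short sketch. This is not the code inside `clean.py`; `sample_rows` and its `seed` parameter are hypothetical names, and the use of `DataFrame.sample` is an assumption made for illustration.

```python
import pandas as pd


def sample_rows(df, n_sample=750, seed=0):
    """Return df unchanged when n_sample is None, else a random sample.

    Hypothetical helper: illustrates the documented behaviour only,
    not the actual code inside clean.py.
    """
    if n_sample is None:
        return df
    # Cap at len(df) so small inputs do not raise a ValueError.
    return df.sample(n=min(n_sample, len(df)), random_state=seed)


flights = pd.DataFrame({"dep_delay": range(2000)})
print(len(sample_rows(flights)))                 # default sample size: 750
print(len(sample_rows(flights, n_sample=1000)))  # explicit sample size: 1000
print(len(sample_rows(flights, n_sample=None)))  # no sampling: 2000
```

Sampled rows keep their original index labels, which would explain the non-consecutive index values in the processed output.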


We'll proceed with calling `process` on flight data from February of 2021. The original csv is provided as a reference.

In [3]:
feb = pd.read_csv('./BTS_data/202102.csv')
feb

Unnamed: 0,YEAR,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,MKT_UNIQUE_CARRIER,ORIGIN_AIRPORT_ID,DEST_AIRPORT_ID,CRS_DEP_TIME,DEP_TIME,...,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 30
0,2021,2,6,6,2021-02-06,DL,14869,13891,2125,2115.0,...,0.0,111.0,95.0,558.0,,,,,,
1,2021,2,6,6,2021-02-06,DL,11292,12892,615,609.0,...,0.0,165.0,136.0,862.0,,,,,,
2,2021,2,6,6,2021-02-06,DL,13487,14771,1017,1155.0,...,0.0,263.0,238.0,1589.0,73.0,0.0,0.0,0.0,0.0,
3,2021,2,6,6,2021-02-06,DL,10397,14771,1003,954.0,...,0.0,324.0,322.0,2139.0,,,,,,
4,2021,2,6,6,2021-02-06,DL,14771,14869,600,552.0,...,0.0,114.0,104.0,599.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54601,2021,2,24,3,2021-02-24,AA,14107,14689,1858,1853.0,...,0.0,110.0,94.0,455.0,,,,,,
54602,2021,2,25,4,2021-02-25,AA,14107,14689,1858,1850.0,...,0.0,110.0,91.0,455.0,,,,,,
54603,2021,2,26,5,2021-02-26,AA,14107,14689,1858,1850.0,...,0.0,110.0,99.0,455.0,,,,,,
54604,2021,2,27,6,2021-02-27,AA,14107,14689,1858,1853.0,...,0.0,110.0,87.0,455.0,,,,,,


In [4]:
feb_processed = cln.process('./BTS_data/202102.csv', n_sample=1000)
feb_processed

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,mkt_unique_carrier,origin_airport_id,dest_airport_id,crs_dep_time,dep_time,...,actual_elapsed_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,origin_ca,is_weekday,external_cause
21615,2021,2,12,5,2021-02-12,AS,14747,14771,705,657.0,...,125.0,679.0,,,,,,0,1,0
11017,2021,2,15,1,2021-02-15,AS,12892,14771,1300,1257.0,...,84.0,337.0,,,,,,1,1,0
21856,2021,2,14,7,2021-02-14,AS,12892,14771,905,905.0,...,78.0,337.0,,,,,,1,0,0
28777,2021,2,1,1,2021-02-01,UA,14679,14771,1945,1933.0,...,91.0,447.0,,,,,,1,1,0
16477,2021,2,11,4,2021-02-11,UA,10157,14771,615,610.0,...,66.0,250.0,,,,,,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9915,2021,2,6,6,2021-02-06,UA,14869,14771,633,628.0,...,134.0,599.0,,,,,,0,0,0
27747,2021,2,6,6,2021-02-06,UA,12266,14771,700,659.0,...,253.0,1635.0,,,,,,0,0,0
28108,2021,2,4,4,2021-02-04,UA,13830,14771,1005,950.0,...,289.0,2338.0,,,,,,0,1,0
51036,2021,2,23,2,2021-02-23,WN,14679,14771,1610,1610.0,...,83.0,447.0,,,,,,1,1,0


We can now use `cln.join()` on the resulting DataFrame `feb_processed` along with February's monthly weather data (follow the code in the next section to get the weather data). 
The output of calling `join` is a joined DataFrame, which we can then save as a CSV file for further analysis or modeling.
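
To make the inner join concrete, here is a minimal sketch on toy data. It assumes the join key is the flight date (`fl_date`) matched against the weather file's `date` column, which is inferred from the columns in the output rather than from the source of `join`. The values below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the processed flights and one month of weather data.
flights = pd.DataFrame({
    "fl_date": ["2021-02-12", "2021-02-12", "2021-02-13", "2021-03-01"],
    "mkt_unique_carrier": ["AS", "UA", "WN", "AA"],
})
weather = pd.DataFrame({
    "date": ["2021-02-12", "2021-02-13"],
    "TMAX": [60.0, 58.0],
})

# Inner join: keep only flights whose date has a matching weather row.
# Assumed key columns: fl_date (flights) and date (weather).
joined = flights.merge(weather, how="inner", left_on="fl_date", right_on="date")
print(joined)
```

The March flight is dropped because the weather table has no row for that date. This is why the flight data and the weather data need to cover the same month.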

In [5]:
feb_joined = cln.join(feb_processed, './noaa_data/2021-02.csv')
feb_joined.head()

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,mkt_unique_carrier,origin_airport_id,dest_airport_id,crs_dep_time,dep_time,...,external_cause,date,TMAX,TMIN,TAVG,PRCP,AWND,WSF2,WT01,WT08
0,2021,2,12,5,2021-02-12,AS,14747,14771,705,657.0,...,0,2021-02-12,60.0,51.0,55.0,0.0,17.2,29.1,1,1
1,2021,2,12,5,2021-02-12,UA,13891,14771,1545,1538.0,...,0,2021-02-12,60.0,51.0,55.0,0.0,17.2,29.1,1,1
2,2021,2,12,5,2021-02-12,WN,14679,14771,1610,1619.0,...,0,2021-02-12,60.0,51.0,55.0,0.0,17.2,29.1,1,1
3,2021,2,12,5,2021-02-12,UA,14689,14771,1555,1544.0,...,0,2021-02-12,60.0,51.0,55.0,0.0,17.2,29.1,1,1
4,2021,2,12,5,2021-02-12,AA,13303,14771,2110,2103.0,...,0,2021-02-12,60.0,51.0,55.0,0.0,17.2,29.1,1,1


In [6]:
feb_joined.to_csv('202102-complete.csv', index=False)

----
## Getting Weather Data

The weather data for this dataset is obtained from NOAA's Climate Data Online web service. Before you begin, please obtain a [token](https://www.ncdc.noaa.gov/cdo-web/token) to access the API.

In [7]:
import weather
# import noaa_token  # uncomment if your token is stored in a local noaa_token.py

In [8]:
# My token is stored in a separate file - please change the first
# parameter noaa_token._token to your personal API token.
# The second parameter is the directory where data from the API 
# is saved - feel free to change this.
api = weather.NOAAApi(noaa_token._token, save_dir='./noaa_data/')

After you have instantiated the `NOAAApi` class, call its `get_data` method with a start date and an end date. It writes the weather data for the month of the `start` date and for every following month, up to and including the month of the `end` date. Both dates must be strings formatted as `YYYY-MM-DD`.

The retrieved data is written to the directory specified previously, with each month's data saved as `YYYY-MM.csv`.
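
The month range and file naming can be sketched as follows. `month_files` is a hypothetical helper, not part of `weather.py`; it only shows which `YYYY-MM.csv` files a given pair of dates would be expected to produce under the description above.

```python
from datetime import date


def month_files(start, end):
    """List the YYYY-MM.csv names for every month from start to end, inclusive."""
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    names = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        names.append(f"{year:04d}-{month:02d}.csv")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


print(month_files("2021-02-01", "2021-03-31"))
# ['2021-02.csv', '2021-03.csv']
```

Only the year and month of each date matter in this sketch, so `'2021-02-15'` and `'2021-02-01'` select the same starting month.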

In [9]:
# Obtain data from February and March 2021
api.get_data('2021-02-01', '2021-03-31')

# Progress will be printed as each month is completed.

February already exists in 2021-02.csv
March already exists in 2021-03.csv
Weather data successfully saved in ./noaa_data/.


In [10]:
pd.read_csv('./noaa_data/2021-02.csv').head()

Unnamed: 0,date,TMAX,TMIN,TAVG,PRCP,AWND,WSF2,WT01,WT08
0,2021-02-01,65.0,48.0,55.0,0.35,9.6,23.9,1,1
1,2021-02-02,61.0,51.0,57.0,0.18,6.5,14.1,1,1
2,2021-02-03,59.0,43.0,52.0,0.0,8.3,18.1,0,1
3,2021-02-04,62.0,45.0,53.0,0.0,6.3,16.1,0,0
4,2021-02-05,67.0,43.0,52.0,0.0,4.9,23.0,0,1


### Supplementary Details

- `get_data` only retrieves data from the weather station at San Francisco International Airport. The station cannot be changed through a parameter; it must be changed within the Python script.
- If a month's weather data already exists in the specified `save_dir`, `get_data` skips that month and only retrieves the months that are not yet saved there.
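
The skip-if-saved behaviour amounts to a file-existence check. The sketch below assumes that check is done by file name; `months_to_fetch` is a hypothetical helper and not the code in `weather.py`.

```python
import os
import tempfile


def months_to_fetch(save_dir, file_names):
    """Return the file names that are not yet present in save_dir."""
    return [name for name in file_names
            if not os.path.exists(os.path.join(save_dir, name))]


# Demo in a throwaway directory where February has already been saved.
with tempfile.TemporaryDirectory() as save_dir:
    open(os.path.join(save_dir, "2021-02.csv"), "w").close()
    pending = months_to_fetch(save_dir, ["2021-02.csv", "2021-03.csv"])

print(pending)  # ['2021-03.csv']
```

Under this assumption, deleting a month's csv from `save_dir` is the way to force that month to be retrieved again.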

------

## Extra: Putting It All Together

With the basic usage covered, the same steps can process any range of months downloaded from the BTS.

In [11]:
api.get_data('2021-06-01', '2021-08-31')

June already exists in 2021-06.csv
July already exists in 2021-07.csv
August already exists in 2021-08.csv
Weather data successfully saved in ./noaa_data/.


In [3]:
import os
cwd = os.getcwd()
concat = []  # one joined DataFrame per month
for month in os.listdir('./noaa_data'):
    # Weather files are named YYYY-MM.csv. BTS csvs are in a folder
    # "BTS_data" and each month's csv is named YYYYMM.csv
    bts_month = ''.join(month.split('-'))
    proc_month = cln.process(os.path.join(cwd, 'BTS_data', bts_month), 750)

    weather_fp = os.path.join(cwd, 'noaa_data', month)
    concat.append(cln.join(proc_month, weather_fp))
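
The loop depends on the two file-naming conventions lining up. `os.listdir` also returns every entry in the folder, in arbitrary order, so a stray non-csv entry would be passed straight to `process`. A minimal sketch of the name mapping with a csv-only filter is below; `bts_name` and the example listing are hypothetical.

```python
def bts_name(weather_name):
    """Map a weather file name (YYYY-MM.csv) to its BTS file name (YYYYMM.csv)."""
    return weather_name.replace("-", "")


# Example directory listing with one stray non-csv entry.
listing = ["2021-07.csv", ".ipynb_checkpoints", "2021-06.csv"]
months = sorted(name for name in listing if name.endswith(".csv"))
print([(name, bts_name(name)) for name in months])
# [('2021-06.csv', '202106.csv'), ('2021-07.csv', '202107.csv')]
```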

In [9]:
complete = (
    pd.concat(concat, ignore_index=True)
    .sort_values('fl_date')
    .drop(columns=['date', 'dep_delay_new', 'arr_delay_new'])
)
complete

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,mkt_unique_carrier,origin_airport_id,dest_airport_id,crs_dep_time,dep_time,...,is_weekday,external_cause,TMAX,TMIN,TAVG,PRCP,AWND,WSF2,WT01,WT08
3198,2020,6,1,1,2020-06-01,UA,14679,14771,800,754.0,...,1,0,72.0,55.0,62.0,0.0,12.1,23.9,1,1.0
3190,2020,6,1,1,2020-06-01,HA,12173,14771,1515,1524.0,...,1,0,72.0,55.0,62.0,0.0,12.1,23.9,1,1.0
3191,2020,6,1,1,2020-06-01,UA,14107,14771,1445,1444.0,...,1,0,72.0,55.0,62.0,0.0,12.1,23.9,1,1.0
3192,2020,6,1,1,2020-06-01,WN,10800,14771,1200,1155.0,...,1,0,72.0,55.0,62.0,0.0,12.1,23.9,1,1.0
3193,2020,6,1,1,2020-06-01,UA,11292,14771,1830,1818.0,...,1,0,72.0,55.0,62.0,0.0,12.1,23.9,1,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4229,2021,8,31,2,2021-08-31,UA,13930,14771,715,709.0,...,1,0,74.0,55.0,63.0,0.0,8.5,23.9,0,1.0
4228,2021,8,31,2,2021-08-31,DL,12478,14771,805,754.0,...,1,0,74.0,55.0,63.0,0.0,8.5,23.9,0,1.0
4227,2021,8,31,2,2021-08-31,UA,12982,14771,2220,2214.0,...,1,0,74.0,55.0,63.0,0.0,8.5,23.9,0,1.0
4249,2021,8,31,2,2021-08-31,UA,12889,14771,801,757.0,...,1,0,74.0,55.0,63.0,0.0,8.5,23.9,0,1.0


In [10]:
complete.columns

Index(['year', 'month', 'day_of_month', 'day_of_week', 'fl_date',
       'mkt_unique_carrier', 'origin_airport_id', 'dest_airport_id',
       'crs_dep_time', 'dep_time', 'dep_delay', 'dep_del15', 'taxi_out',
       'taxi_in', 'crs_arr_time', 'arr_time', 'arr_delay', 'arr_del15',
       'crs_elapsed_time', 'actual_elapsed_time', 'distance', 'carrier_delay',
       'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay',
       'origin_ca', 'is_weekday', 'external_cause', 'TMAX', 'TMIN', 'TAVG',
       'PRCP', 'AWND', 'WSF2', 'WT01', 'WT08'],
      dtype='object')

In [25]:
complete.to_csv('airline_delays.csv', index=False)

In [26]:
len(complete.columns)

37