# Examples

## Using `clean.py`

In [2]:
import clean
import pandas as pd

You'll first need the `ca_airports.json` file (available in the repository). Specify its file path, as in the cell below.

In [3]:
cln = clean.Airline('./ca_airports.json')

Before you proceed, download the dataset and unzip the file. The columns you'll need to select from the BTS data portal are listed in the Miscellaneous section of the `README` file. Once you have the csv, make sure to note the file path.

The following methods are available once you have `Airline` instantiated:
- `process`: given the file path to a csv (from BTS), returns a processed dataset.
    - Optional: to draw a random sample of size n, set `sample=True` and specify the value of n right after it.
- `join`: given the processed flight data and the file path to a csv holding one month of weather data, performs an inner join on the two datasets and returns the resulting DataFrame. The weather data must cover the same month as the flight data and must be pulled using the `weather.py` script.
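
For readers unfamiliar with random sampling in pandas, here is a minimal sketch of what drawing a sample of size n can look like. It does not show how `process` is implemented; the toy table, the variable names, and the use of `DataFrame.sample` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for a processed flight table (illustrative values only).
flights = pd.DataFrame({
    "fl_date": ["2021-02-06", "2021-02-06", "2021-02-07", "2021-02-08"],
    "mkt_unique_carrier": ["DL", "DL", "AA", "WN"],
    "distance": [1589.0, 2139.0, 455.0, 447.0],
})

# Draw n rows without replacement; a fixed random_state makes the draw repeatable.
n = 2
sampled = flights.sample(n=n, random_state=0)
print(sampled)
```

Fixing `random_state` is optional, but it keeps the sample identical between runs, which helps when comparing results.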


We'll proceed by calling `process` on flight data from February 2021. The original csv is shown first for reference.

In [4]:
feb = pd.read_csv('./BTS_data/202102.csv')
feb

Unnamed: 0,YEAR,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,MKT_UNIQUE_CARRIER,ORIGIN_AIRPORT_ID,DEST_AIRPORT_ID,CRS_DEP_TIME,DEP_TIME,...,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,DISTANCE,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY,Unnamed: 30
0,2021,2,6,6,2021-02-06,DL,14869,13891,2125,2115.0,...,0.0,111.0,95.0,558.0,,,,,,
1,2021,2,6,6,2021-02-06,DL,11292,12892,615,609.0,...,0.0,165.0,136.0,862.0,,,,,,
2,2021,2,6,6,2021-02-06,DL,13487,14771,1017,1155.0,...,0.0,263.0,238.0,1589.0,73.0,0.0,0.0,0.0,0.0,
3,2021,2,6,6,2021-02-06,DL,10397,14771,1003,954.0,...,0.0,324.0,322.0,2139.0,,,,,,
4,2021,2,6,6,2021-02-06,DL,14771,14869,600,552.0,...,0.0,114.0,104.0,599.0,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
54601,2021,2,24,3,2021-02-24,AA,14107,14689,1858,1853.0,...,0.0,110.0,94.0,455.0,,,,,,
54602,2021,2,25,4,2021-02-25,AA,14107,14689,1858,1850.0,...,0.0,110.0,91.0,455.0,,,,,,
54603,2021,2,26,5,2021-02-26,AA,14107,14689,1858,1850.0,...,0.0,110.0,99.0,455.0,,,,,,
54604,2021,2,27,6,2021-02-27,AA,14107,14689,1858,1853.0,...,0.0,110.0,87.0,455.0,,,,,,


In [5]:
feb_processed = cln.process('./BTS_data/202102.csv')
feb_processed

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,mkt_unique_carrier,origin_airport_id,dest_airport_id,crs_dep_time,dep_time,...,actual_elapsed_time,distance,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay,origin_ca,is_weekday,external_cause
2,2021,2,6,6,2021-02-06,DL,13487,14771,1017,1155.0,...,238.0,1589.0,73.0,0.0,0.0,0.0,0.0,0,0,1
3,2021,2,6,6,2021-02-06,DL,10397,14771,1003,954.0,...,322.0,2139.0,,,,,,0,0,0
7,2021,2,6,6,2021-02-06,DL,14869,14771,2120,2115.0,...,125.0,599.0,,,,,,0,0,0
11,2021,2,6,6,2021-02-06,DL,14869,14771,820,829.0,...,132.0,599.0,,,,,,0,0,0
15,2021,2,6,6,2021-02-06,DL,14869,14771,1142,1141.0,...,114.0,599.0,,,,,,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
53653,2021,2,26,5,2021-02-26,WN,14107,14771,1215,1244.0,...,116.0,651.0,15.0,0.0,0.0,0.0,0.0,0,1,1
53654,2021,2,26,5,2021-02-26,WN,14107,14771,1945,1954.0,...,115.0,651.0,,,,,,0,1,0
53693,2021,2,26,5,2021-02-26,WN,14679,14771,1005,1003.0,...,81.0,447.0,,,,,,1,1,0
53694,2021,2,26,5,2021-02-26,WN,14679,14771,1610,1615.0,...,84.0,447.0,,,,,,1,1,0


We can now call `cln.join()` on the resulting DataFrame `feb_processed` together with February's weather data (follow the code in the next section to get the weather data).
The output of `join` is a single joined DataFrame, which we can then save as a CSV file for further analysis or modeling.

In [9]:
feb_joined = cln.join(feb_processed, './noaa_data/2021-02.csv')
feb_joined.head()

Unnamed: 0,year,month,day_of_month,day_of_week,fl_date,mkt_unique_carrier,origin_airport_id,dest_airport_id,crs_dep_time,dep_time,...,external_cause,date,TMAX,TMIN,TAVG,PRCP,AWND,WSF2,WT01,WT08
0,2021,2,6,6,2021-02-06,DL,13487,14771,1017,1155.0,...,1,2021-02-06,67.0,44.0,53.0,0.0,5.6,21.0,1,1
1,2021,2,6,6,2021-02-06,DL,10397,14771,1003,954.0,...,0,2021-02-06,67.0,44.0,53.0,0.0,5.6,21.0,1,1
2,2021,2,6,6,2021-02-06,DL,14869,14771,2120,2115.0,...,0,2021-02-06,67.0,44.0,53.0,0.0,5.6,21.0,1,1
3,2021,2,6,6,2021-02-06,DL,14869,14771,820,829.0,...,0,2021-02-06,67.0,44.0,53.0,0.0,5.6,21.0,1,1
4,2021,2,6,6,2021-02-06,DL,14869,14771,1142,1141.0,...,0,2021-02-06,67.0,44.0,53.0,0.0,5.6,21.0,1,1


In [10]:
feb_joined.to_csv('202102-complete.csv', index=False)
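
To make the inner join concrete, the sketch below reproduces the idea with plain pandas on two toy tables. It assumes the flights are matched to the weather on the flight date (`fl_date`) and the weather `date` column, which is what the output above suggests; the actual logic inside `join` may differ.

```python
import pandas as pd

# Toy flight rows; the last one falls on a day with no weather record below.
flights = pd.DataFrame({
    "fl_date": ["2021-02-06", "2021-02-06", "2021-02-28"],
    "mkt_unique_carrier": ["DL", "DL", "AA"],
})

# Toy weather rows, one per day.
weather = pd.DataFrame({
    "date": ["2021-02-05", "2021-02-06"],
    "TMAX": [67.0, 67.0],
    "PRCP": [0.0, 0.0],
})

# An inner join keeps only the flights whose date also appears in the weather table.
joined = flights.merge(weather, how="inner", left_on="fl_date", right_on="date")
print(joined)
```

Because the join is an inner join, a flight on a day with no weather record is dropped, so the flight and weather files should cover the same month.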

----
## Getting Weather Data

The weather data for this dataset is obtained from NOAA's Climate Data Online web service. Before you begin, please obtain a [token](https://www.ncdc.noaa.gov/cdo-web/token) to access the API.

In [6]:
import weather
# import noaa_token  # uncomment if you keep your token in a local noaa_token module

In [7]:
# My token is stored in a separate file - please change the first
# parameter noaa_token._token to your personal API token.
# The second parameter is the directory where data from the API 
# is saved - feel free to change this.
api = weather.NOAAApi(noaa_token._token, save_dir='./noaa_data/')
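
If you would rather not keep the token in a separate Python file, one alternative is to read it from an environment variable. This is a minimal sketch: the helper `load_token` and the variable name `NOAA_TOKEN` are hypothetical and not part of `weather.py`.

```python
import os

def load_token(var_name="NOAA_TOKEN"):
    """Return the API token stored in the given environment variable."""
    token = os.environ.get(var_name)
    if not token:
        # Fail early with a clear message instead of sending an empty token to the API.
        raise RuntimeError(f"Set the {var_name} environment variable to your NOAA API token.")
    return token

# Hypothetical usage with the class shown above:
# api = weather.NOAAApi(load_token(), save_dir='./noaa_data/')
```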

After you have instantiated the `NOAAApi` class, you can call the `get_data` method. It takes a start date and an end date as its two arguments and writes the weather data for every month from the month of the `start` date up to and including the month of the `end` date. Both dates must be strings formatted as `YYYY-MM-DD`.

The retrieved data is written to the directory specified earlier (`save_dir`), with each month's data saved as `YYYY-MM.csv`.
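
To see which files a given date range should produce, the sketch below lists the `YYYY-MM.csv` names for every month between two dates. The helper `monthly_filenames` is hypothetical and only mirrors the naming rule described above; it is not taken from `weather.py`.

```python
from datetime import datetime

def monthly_filenames(start, end):
    """List the YYYY-MM.csv names for every month from start to end, inclusive."""
    first = datetime.strptime(start, "%Y-%m-%d")
    last = datetime.strptime(end, "%Y-%m-%d")
    if first > last:
        raise ValueError("start must not be later than end")
    names = []
    year, month = first.year, first.month
    # Step one month at a time until the month of the end date has been added.
    while (year, month) <= (last.year, last.month):
        names.append(f"{year:04d}-{month:02d}.csv")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names

print(monthly_filenames("2021-02-01", "2021-03-31"))
# ['2021-02.csv', '2021-03.csv']
```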

In [8]:
# Obtain data from February and March 2021
api.get_data('2021-02-01', '2021-03-31')

# Progress will be printed as each month is completed.

Weather for February written to ./noaa_data/2021-02.csv.
Weather for March written to ./noaa_data/2021-03.csv.
Weather data successfully saved in ./noaa_data/.


In [11]:
pd.read_csv('./noaa_data/2021-02.csv').head()

Unnamed: 0,date,TMAX,TMIN,TAVG,PRCP,AWND,WSF2,WT01,WT08
0,2021-02-01,65.0,48.0,55.0,0.35,9.6,23.9,1,1
1,2021-02-02,61.0,51.0,57.0,0.18,6.5,14.1,1,1
2,2021-02-03,59.0,43.0,52.0,0.0,8.3,18.1,0,1
3,2021-02-04,62.0,45.0,53.0,0.0,6.3,16.1,0,0
4,2021-02-05,67.0,43.0,52.0,0.0,4.9,23.0,0,1


### Supplementary Details

- `get_data` only retrieves data from a weather station at San Francisco International Airport (the station cannot be changed through any parameters).
- `get_data` collects the following measurements (in standard units):
    - `TMAX`: recorded maximum temperature
    - `TMIN`: recorded minimum temperature
    - `TAVG`: temperature average
    - `PRCP`: precipitation
    - `AWND`: average wind speed  
    - `WSF2`: fastest 2-min wind speed
    - `WT01`: fog
    - `WT08`: smog/haze
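
As a quick illustration of how these columns can be used, the sketch below counts fog, haze, and rainy days in the first five rows of the February 2021 file shown earlier. It assumes `WT01` and `WT08` are 0/1 flags, which is how they appear in that output.

```python
import pandas as pd

# First five rows of the February 2021 weather file, copied from the output above.
weather = pd.DataFrame({
    "date": ["2021-02-01", "2021-02-02", "2021-02-03", "2021-02-04", "2021-02-05"],
    "PRCP": [0.35, 0.18, 0.0, 0.0, 0.0],
    "WT01": [1, 1, 0, 0, 0],
    "WT08": [1, 1, 1, 0, 1],
})

fog_days = int(weather["WT01"].sum())          # days flagged with fog
haze_days = int(weather["WT08"].sum())         # days flagged with smog/haze
wet_days = int((weather["PRCP"] > 0).sum())    # days with any precipitation

print(fog_days, haze_days, wet_days)
# 2 4 2
```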