# Airport Data: Processing Notebook

This is an extra notebook prepared by Nicholas (Thanh Nhan) Le. It covers the following processing steps:
- **Airport names:** the airports’ full names and IATA codes, extracted from https://www.leonardsguide.com/us-airport-codes.shtml. This is so that the airport choices are more user-friendly in the UI. For example, a choice of airport can be *Birmingham International Airport (BHM)* instead of just *“BHM”*.
- **Distance data:** the distance data for each pair of airports appearing in the dataset. This is for calculating the `totalTravelDistance` feature.
- **Travel duration data:** the total duration (in days) for each pair of airports appearing in the dataset. This is for calculating the `travelDurationDays` feature.
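
As a minimal sketch of the user-friendly label described in the first bullet, the snippet below combines a full airport name with its IATA code. The helper name `format_airport_label` is hypothetical and is not part of this notebook or the UI code.

```python
def format_airport_label(name, iata_code):
    """Hypothetical helper: build a UI label such as 'Birmingham International Airport (BHM)'."""
    return f"{name} ({iata_code})"


# Example taken from the description above.
label = format_airport_label("Birmingham International Airport", "BHM")
print(label)
```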


In [1]:
import json

import pandas as pd

## Distance data

In [2]:
df = pd.read_csv('models/distance_data.csv')

In [3]:
data_dict = {}

In [4]:
# Build a nested mapping: origin -> {destination -> distance in miles}.
for _, row in df.iterrows():
    data_dict.setdefault(row['ORIGIN'], {})[row['DEST']] = float(row['DISTANCE IN MILES'])

In [5]:
data_dict

{'LAX': {'ATL': 1947.0,
  'BOS': 2611.0,
  'CLT': 2125.0,
  'DEN': 862.0,
  'DFW': 1235.0,
  'DTW': 1979.0,
  'EWR': 2454.0,
  'IAD': 2288.0,
  'JFK': 2475.0,
  'LGA': 2469.0,
  'MIA': 2342.0,
  'OAK': 337.0,
  'ORD': 1744.0,
  'PHL': 2402.0,
  'SFO': 337.0},
 'ATL': {'BOS': 946.0,
  'CLT': 226.0,
  'DEN': 1199.0,
  'DFW': 731.0,
  'DTW': 594.0,
  'EWR': 746.0,
  'IAD': 534.0,
  'JFK': 760.0,
  'LAX': 1947.0,
  'LGA': 762.0,
  'MIA': 594.0,
  'OAK': 2130.0,
  'ORD': 606.0,
  'PHL': 666.0,
  'SFO': 2139.0},
 'MIA': {'ATL': 594.0,
  'BOS': 1258.0,
  'CLT': 650.0,
  'DEN': 1709.0,
  'DFW': 1121.0,
  'DTW': 1145.0,
  'EWR': 1085.0,
  'IAD': 921.0,
  'JFK': 1089.0,
  'LAX': 2342.0,
  'LGA': 1096.0,
  'OAK': 2577.0,
  'ORD': 1197.0,
  'PHL': 1013.0,
  'SFO': 2585.0},
 'JFK': {'ATL': 760.0,
  'BOS': 187.0,
  'CLT': 541.0,
  'DEN': 1626.0,
  'DFW': 1391.0,
  'DTW': 509.0,
  'IAD': 228.0,
  'LAX': 2475.0,
  'MIA': 1089.0,
  'OAK': 2576.0,
  'ORD': 740.0,
  'PHL': 94.0,
  'SFO': 2586.0},
 'ORD':

In [6]:
# json.dump writes directly to the file and returns None, so there is nothing to assign.
with open("models/distance_data.json", 'w') as fout:
    json.dump(data_dict, fout, indent=4)
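
The following is a rough sketch of how the saved nested mapping might be read back and queried when computing `totalTravelDistance`. It round-trips a small sample through JSON text instead of touching the real file, and the helper `lookup_distance` is a hypothetical name, not something defined elsewhere in the project. The JFK entry in the output above has no EWR or LGA key, so the sketch returns `None` for a missing pair rather than raising.

```python
import json

# Small sample copied from the JFK entry in the output above.
sample = {"JFK": {"BOS": 187.0, "PHL": 94.0}}

# Round-trip through JSON text, standing in for writing and re-reading the file.
distances = json.loads(json.dumps(sample, indent=4))


def lookup_distance(distances, origin, dest):
    """Hypothetical helper: return the distance in miles, or None if the pair is absent."""
    return distances.get(origin, {}).get(dest)


print(lookup_distance(distances, "JFK", "BOS"))
print(lookup_distance(distances, "JFK", "EWR"))
```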

## Travel duration data

In [7]:
travel_durations = pd.read_csv('models/travel_duration_data.csv')

In [8]:
travel_durations_dict = {}

In [9]:
# Build a nested mapping: starting airport -> {destination airport -> duration in days}.
for _, row in travel_durations.iterrows():
    travel_durations_dict.setdefault(row['startingAirport'], {})[row['destinationAirport']] = float(row['travelDurationDay'])
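
Because the durations are stored as fractions of a day, the raw values can be hard to read. The sketch below converts one value from the output that follows (ATL to BOS) into hours and minutes. The helper `days_to_hours_minutes` is a hypothetical name introduced only for illustration.

```python
def days_to_hours_minutes(days):
    """Hypothetical helper: convert a duration in days to whole hours and minutes."""
    total_minutes = round(days * 24 * 60)
    return divmod(total_minutes, 60)


# ATL -> BOS duration in days, copied from the notebook output.
hours, minutes = days_to_hours_minutes(0.2440681428841595)
print(f"{hours} h {minutes} min")
```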

In [10]:
travel_durations_dict

{'ATL': {'BOS': 0.2440681428841595,
  'CLT': 0.1763718264914905,
  'DEN': 0.2840644687162891,
  'DFW': 0.1812342749886643,
  'DTW': 0.2337483726123518,
  'EWR': 0.1971769121605345,
  'IAD': 0.1450869070493166,
  'JFK': 0.2287455527790691,
  'LAX': 0.3418960009700192,
  'LGA': 0.1983835727488292,
  'MIA': 0.2181360602605387,
  'OAK': 0.4976261487748463,
  'ORD': 0.2393146910001916,
  'PHL': 0.209320661167545,
  'SFO': 0.4051941167321192},
 'BOS': {'ATL': 0.2505915674624991,
  'CLT': 0.2165504476163493,
  'DEN': 0.3483038936688973,
  'DFW': 0.3042229945386198,
  'DTW': 0.2549633587603147,
  'EWR': 0.1590280992575104,
  'IAD': 0.2202437516271804,
  'JFK': 0.0912767182752647,
  'LAX': 0.3885254556286728,
  'LGA': 0.1275624278788192,
  'MIA': 0.2704726987225637,
  'OAK': 0.5449044469479793,
  'ORD': 0.1984570099091971,
  'PHL': 0.1743785433415478,
  'SFO': 0.406416939712128},
 'CLT': {'ATL': 0.1792174493828796,
  'BOS': 0.1953081984747023,
  'DEN': 0.2904901364762827,
  'DFW': 0.26325663705

In [11]:
# json.dump writes directly to the file and returns None, so there is nothing to assign.
with open("models/travel_duration_data.json", 'w') as fout:
    json.dump(travel_durations_dict, fout, indent=4)

## Airport names

The airport names are taken from the following website: https://www.leonardsguide.com/us-airport-codes.shtml

In [12]:
names = pd.read_csv('models/airport_names.csv', header=None, names=["name", "iata_code"])

In [13]:
origin_names = names[names["iata_code"].isin(data_dict.keys())]

In [14]:
# Collect every IATA code that appears as a destination.
destination_codes = {code for destinations in data_dict.values() for code in destinations}

destination_names = names[names["iata_code"].isin(destination_codes)]

In [15]:
destination_names

| | name | iata_code |
|---|---|---|
| 18 | Los Angeles International Airport | LAX |
| 19 | Oakland | OAK |
| 24 | San Francisco International Airport | SFO |
| 29 | Denver International Airport | DEN |
| 34 | Washington, Dulles International Airport | IAD |
| 41 | Miami International Airport | MIA |
| 49 | Atlanta Hartsfield International Airport | ATL |
| 59 | Chicago, O'Hare International Airport Airport | ORD |
| 79 | Boston, Logan International Airport | BOS |
| 84 | Detroit Metropolitan Airport | DTW |


Airport names are the same for origins and destinations, so I will build a single name mapping from `origin_names`.
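
One way to check this kind of claim is to compare the set of origin codes with the set of destination codes in the nested distance mapping. The sketch below does so on a two-airport sample copied from the distance output; it assumes the mapping has the same shape as `data_dict` and is not a cell from the original notebook.

```python
# Two-airport sample with the same shape as data_dict (values from the distance output).
sample = {
    "LAX": {"ATL": 1947.0},
    "ATL": {"LAX": 1947.0},
}

origin_codes = set(sample)
destination_codes = {code for destinations in sample.values() for code in destinations}

# True when every origin also appears as a destination and vice versa.
same_airports = origin_codes == destination_codes
print(same_airports)
```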

In [16]:
names_dict = {}

for index, row in origin_names.sort_values(by='name').iterrows():
    names_dict[row.loc['iata_code']] = row.loc['name']

In [17]:
names_dict

{'ATL': 'Atlanta Hartsfield International Airport',
 'BOS': 'Boston, Logan International Airport',
 'CLT': 'Charlotte/Douglas International Airport',
 'ORD': "Chicago,\xa0O'Hare International Airport Airport",
 'DFW': 'Dallas/Fort Worth International Airport',
 'DEN': 'Denver International Airport',
 'DTW': 'Detroit Metropolitan Airport',
 'IAD': 'Washington,\xa0Dulles International Airport',
 'LAX': 'Los Angeles International Airport',
 'MIA': 'Miami International Airport',
 'JFK': 'New York,\xa0John F Kennedy International Airport',
 'LGA': 'New York,\xa0La Guardia Airport',
 'EWR': 'Newark International Airport',
 'OAK': 'Oakland',
 'PHL': 'Philadelphia',
 'SFO': 'San Francisco International Airport'}

In [18]:
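# Overwrite the four names that contain non-breaking spaces (\xa0) with plain-space versions.
# The JFK entry also gains a period after the middle initial.
names_dict['ORD'] = "Chicago, O'Hare International Airport Airport"
names_dict['JFK'] = 'New York, John F. Kennedy International Airport'
names_dict['LGA'] = 'New York, La Guardia Airport'
names_dict['IAD'] = 'Washington, Dulles International Airport'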
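
As an alternative to overwriting entries by hand, the non-breaking spaces could be replaced programmatically. The sketch below shows this on two entries copied from the earlier output. It is only an illustration of the idea, and it would not add the period to the JFK name.

```python
# Two raw entries copied from the names_dict output, including the \xa0 characters.
raw_names = {
    "IAD": "Washington,\xa0Dulles International Airport",
    "LGA": "New York,\xa0La Guardia Airport",
}

# Replace every non-breaking space with a regular space.
cleaned_names = {code: name.replace("\xa0", " ") for code, name in raw_names.items()}
print(cleaned_names)
```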

In [19]:
names_dict

{'ATL': 'Atlanta Hartsfield International Airport',
 'BOS': 'Boston, Logan International Airport',
 'CLT': 'Charlotte/Douglas International Airport',
 'ORD': "Chicago, O'Hare International Airport Airport",
 'DFW': 'Dallas/Fort Worth International Airport',
 'DEN': 'Denver International Airport',
 'DTW': 'Detroit Metropolitan Airport',
 'IAD': 'Washington, Dulles International Airport',
 'LAX': 'Los Angeles International Airport',
 'MIA': 'Miami International Airport',
 'JFK': 'New York, John F. Kennedy International Airport',
 'LGA': 'New York, La Guardia Airport',
 'EWR': 'Newark International Airport',
 'OAK': 'Oakland',
 'PHL': 'Philadelphia',
 'SFO': 'San Francisco International Airport'}

In [20]:
# json.dump writes directly to the file and returns None, so there is nothing to assign.
with open("models/names_data.json", 'w') as fout:
    json.dump(names_dict, fout, indent=4)