### REQUIRED DATA CITATION:

U.S. Department of Transportation, Research and Innovative Technology Administration, Bureau of Transportation Statistics, Freight Transportation: T-100 Domestic Market (U.S. Carriers), 2020 (Washington, DC: 2020)

U.S. Department of Transportation, Research and Innovative Technology Administration, Bureau of Transportation Statistics, Freight Transportation: T-100 International Market (All Carriers), 2020 (Washington, DC: 2020)

All material contained in this report is in the public domain and may be used and reprinted without special permission; citation as to source is required.

## Read in Data

### Aviation data from: https://www.transtats.bts.gov/Fields.asp

#### Auto download: 
- enabled via code below

#### Manual (this should not be required):
- Domestic: Select `US Carriers Only`. The `All Carriers` file truncates the last 3 months of domestic travel because international flights are reported with a 3-month lag.

- Domestic: `T_T100D_MARKET_US_CARRIER_ONLY`
- International: `T_T100I_MARKET_ALL_CARRIER`

- Select `Download` under `Data Tools`
- Filter to `Year` you would like
- Toggle `Select all fields`
- Select `Download`
- Unzip and move `csv` file to wrkdir
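
If the files are downloaded manually, the notebook still needs their names. Below is a minimal sketch for locating an unzipped CSV in the working directory, in the spirit of the commented-out lookup in the International section. The helper name `find_t100_csv` is an assumption for illustration and is not part of the original notebook.

```python
import glob
import os


def find_t100_csv(directory, table_suffix):
    """Return the newest CSV in `directory` whose name ends with `table_suffix`.csv.

    Hypothetical helper: the notebook itself derives the name from the download URL.
    """
    pattern = os.path.join(directory, f"*{table_suffix}.csv")
    matches = glob.glob(pattern)
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern}")
    # Downloads carry a changing numeric prefix, so pick the most recently modified match
    return max(matches, key=os.path.getmtime)
```

For example, `find_t100_csv(os.getcwd(), "T_T100D_MARKET_US_CARRIER_ONLY")` would return the domestic file if one is present.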

### Airport Directory Data download (this should not be required): 
- A cleaned, updated copy of this file is included in the repo, so a new download should not be required. In case one is needed, the manual steps are below.

https://www.faa.gov/airports/airport_safety/airportdata_5010/ 

- Used to map each airport to its county name
- Many manual updates are required; see the update notes at the end of the notebook
- To download your own copy:
  - Scroll down to `Location(s) Selection Form`
  - Select `Submit`
  - Go to downloads
- After a new download, update the airport codes as shown in the notes at the bottom

In [1]:
# ALL IMPORTS
import os
import pandas as pd
import json
import requests
import numpy as np 
import urllib.request
from zipfile import ZipFile as zp

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 2000)

In [2]:
# Year of the data to download (inserted into the download query and the timestamp)
year = "2019"

# Minimum passengers (pax) per row; applied to each raw monthly record before aggregation
min_pax = 100

In [3]:
wrkdir = os.getcwd()

### Download Data

In [4]:
headers = {
           'Content-Type': 'application/x-www-form-urlencoded',
           'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
          }

In [5]:
# f-string for 'year' selected above
raw_data_dom = f'UserTableName=T_100_Domestic_Market__U.S._Carriers&DBShortName=Air_Carriers&RawDataTable=T_T100D_MARKET_US_CARRIER_ONLY&sqlstr=+SELECT+PASSENGERS%2CFREIGHT%2CMAIL%2CDISTANCE%2CUNIQUE_CARRIER%2CAIRLINE_ID%2CUNIQUE_CARRIER_NAME%2CUNIQUE_CARRIER_ENTITY%2CREGION%2CCARRIER%2CCARRIER_NAME%2CCARRIER_GROUP%2CCARRIER_GROUP_NEW%2CORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CORIGIN%2CORIGIN_CITY_NAME%2CORIGIN_STATE_ABR%2CORIGIN_STATE_FIPS%2CORIGIN_STATE_NM%2CORIGIN_WAC%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID%2CDEST%2CDEST_CITY_NAME%2CDEST_STATE_ABR%2CDEST_STATE_FIPS%2CDEST_STATE_NM%2CDEST_WAC%2CYEAR%2CQUARTER%2CMONTH%2CDISTANCE_GROUP%2CCLASS+FROM++T_T100D_MARKET_US_CARRIER_ONLY+WHERE+YEAR%3D{year}&varlist=PASSENGERS%2CFREIGHT%2CMAIL%2CDISTANCE%2CUNIQUE_CARRIER%2CAIRLINE_ID%2CUNIQUE_CARRIER_NAME%2CUNIQUE_CARRIER_ENTITY%2CREGION%2CCARRIER%2CCARRIER_NAME%2CCARRIER_GROUP%2CCARRIER_GROUP_NEW%2CORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CORIGIN%2CORIGIN_CITY_NAME%2CORIGIN_STATE_ABR%2CORIGIN_STATE_FIPS%2CORIGIN_STATE_NM%2CORIGIN_WAC%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID%2CDEST%2CDEST_CITY_NAME%2CDEST_STATE_ABR%2CDEST_STATE_FIPS%2CDEST_STATE_NM%2CDEST_WAC%2CYEAR%2CQUARTER%2CMONTH%2CDISTANCE_GROUP%2CCLASS&grouplist=&suml=&sumRegion=&filter1=title%3D&filter2=title%3D&geo=All%A0&time=All%A0Months&timename=Month&GEOGRAPHY=All&XYEAR={year}&FREQUENCY=All&AllVars=All&VarName=PASSENGERS&VarDesc=Passengers&VarType=Num&VarName=FREIGHT&VarDesc=Freight&VarType=Num&VarName=MAIL&VarDesc=Mail&VarType=Num&VarName=DISTANCE&VarDesc=Distance&VarType=Num&VarName=UNIQUE_CARRIER&VarDesc=UniqueCarrier&VarType=Char&VarName=AIRLINE_ID&VarDesc=AirlineID&VarType=Num&VarName=UNIQUE_CARRIER_NAME&VarDesc=UniqueCarrierName&VarType=Char&VarName=UNIQUE_CARRIER_ENTITY&VarDesc=UniqCarrierEntity&VarType=Char&VarName=REGION&VarDesc=CarrierRegion&VarType=Char&VarName=CARRIER&VarDesc=Carrier&VarType=Char&VarName=CARRIER_NAME&VarDesc=CarrierName&VarType=Char&VarName=CARRIER_GROUP&VarDesc=CarrierGroup&VarType=Num&VarName=CARRIER_GROUP_NEW&VarDesc=CarrierGroupNew&VarType=Num&VarName=ORIGIN_AIRPORT_ID&VarDesc=OriginAirportID&VarType=Num&VarName=ORIGIN_AIRPORT_SEQ_ID&VarDesc=OriginAirportSeqID&VarType=Num&VarName=ORIGIN_CITY_MARKET_ID&VarDesc=OriginCityMarketID&VarType=Num&VarName=ORIGIN&VarDesc=Origin&VarType=Char&VarName=ORIGIN_CITY_NAME&VarDesc=OriginCityName&VarType=Char&VarName=ORIGIN_STATE_ABR&VarDesc=OriginState&VarType=Char&VarName=ORIGIN_STATE_FIPS&VarDesc=OriginStateFips&VarType=Char&VarName=ORIGIN_STATE_NM&VarDesc=OriginStateName&VarType=Char&VarName=ORIGIN_WAC&VarDesc=OriginWac&VarType=Num&VarName=DEST_AIRPORT_ID&VarDesc=DestAirportID&VarType=Num&VarName=DEST_AIRPORT_SEQ_ID&VarDesc=DestAirportSeqID&VarType=Num&VarName=DEST_CITY_MARKET_ID&VarDesc=DestCityMarketID&VarType=Num&VarName=DEST&VarDesc=Dest&VarType=Char&VarName=DEST_CITY_NAME&VarDesc=DestCityName&VarType=Char&VarName=DEST_STATE_ABR&VarDesc=DestState&VarType=Char&VarName=DEST_STATE_FIPS&VarDesc=DestStateFips&VarType=Char&VarName=DEST_STATE_NM&VarDesc=DestStateName&VarType=Char&VarName=DEST_WAC&VarDesc=DestWac&VarType=Num&VarName=YEAR&VarDesc=Year&VarType=Num&VarName=QUARTER&VarDesc=Quarter&VarType=Num&VarName=MONTH&VarDesc=Month&VarType=Num&VarName=DISTANCE_GROUP&VarDesc=DistanceGroup&VarType=Num&VarName=CLASS&VarDesc=Class&VarType=Char'

In [6]:
# f-string for 'year' selected above
raw_data_inter = f'UserTableName=T_100_International_Market__All_Carriers&DBShortName=Air_Carriers&RawDataTable=T_T100I_MARKET_ALL_CARRIER&sqlstr=+SELECT+PASSENGERS%2CFREIGHT%2CMAIL%2CDISTANCE%2CUNIQUE_CARRIER%2CAIRLINE_ID%2CUNIQUE_CARRIER_NAME%2CUNIQUE_CARRIER_ENTITY%2CREGION%2CCARRIER%2CCARRIER_NAME%2CCARRIER_GROUP%2CCARRIER_GROUP_NEW%2CORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CORIGIN%2CORIGIN_CITY_NAME%2CORIGIN_COUNTRY%2CORIGIN_COUNTRY_NAME%2CORIGIN_WAC%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID%2CDEST%2CDEST_CITY_NAME%2CDEST_COUNTRY%2CDEST_COUNTRY_NAME%2CDEST_WAC%2CYEAR%2CQUARTER%2CMONTH%2CDISTANCE_GROUP%2CCLASS+FROM++T_T100I_MARKET_ALL_CARRIER+WHERE+YEAR%3D{year}&varlist=PASSENGERS%2CFREIGHT%2CMAIL%2CDISTANCE%2CUNIQUE_CARRIER%2CAIRLINE_ID%2CUNIQUE_CARRIER_NAME%2CUNIQUE_CARRIER_ENTITY%2CREGION%2CCARRIER%2CCARRIER_NAME%2CCARRIER_GROUP%2CCARRIER_GROUP_NEW%2CORIGIN_AIRPORT_ID%2CORIGIN_AIRPORT_SEQ_ID%2CORIGIN_CITY_MARKET_ID%2CORIGIN%2CORIGIN_CITY_NAME%2CORIGIN_COUNTRY%2CORIGIN_COUNTRY_NAME%2CORIGIN_WAC%2CDEST_AIRPORT_ID%2CDEST_AIRPORT_SEQ_ID%2CDEST_CITY_MARKET_ID%2CDEST%2CDEST_CITY_NAME%2CDEST_COUNTRY%2CDEST_COUNTRY_NAME%2CDEST_WAC%2CYEAR%2CQUARTER%2CMONTH%2CDISTANCE_GROUP%2CCLASS&grouplist=&suml=&sumRegion=&filter1=title%3D&filter2=title%3D&geo=All%A0&time=All%A0Months&timename=Month&GEOGRAPHY=All&XYEAR={year}&FREQUENCY=All&AllVars=All&VarName=PASSENGERS&VarDesc=Passengers&VarType=Num&VarName=FREIGHT&VarDesc=Freight&VarType=Num&VarName=MAIL&VarDesc=Mail&VarType=Num&VarName=DISTANCE&VarDesc=Distance&VarType=Num&VarName=UNIQUE_CARRIER&VarDesc=UniqueCarrier&VarType=Char&VarName=AIRLINE_ID&VarDesc=AirlineID&VarType=Num&VarName=UNIQUE_CARRIER_NAME&VarDesc=UniqueCarrierName&VarType=Char&VarName=UNIQUE_CARRIER_ENTITY&VarDesc=UniqCarrierEntity&VarType=Char&VarName=REGION&VarDesc=CarrierRegion&VarType=Char&VarName=CARRIER&VarDesc=Carrier&VarType=Char&VarName=CARRIER_NAME&VarDesc=CarrierName&VarType=Char&VarName=CARRIER_GROUP&VarDesc=CarrierGroup&VarType=Num&VarName=CARRIER_GROUP_NEW&VarDesc=CarrierGroupNew&VarType=Num&VarName=ORIGIN_AIRPORT_ID&VarDesc=OriginAirportID&VarType=Num&VarName=ORIGIN_AIRPORT_SEQ_ID&VarDesc=OriginAirportSeqID&VarType=Num&VarName=ORIGIN_CITY_MARKET_ID&VarDesc=OriginCityMarketID&VarType=Num&VarName=ORIGIN&VarDesc=Origin&VarType=Char&VarName=ORIGIN_CITY_NAME&VarDesc=OriginCityName&VarType=Char&VarName=ORIGIN_COUNTRY&VarDesc=OriginCountry&VarType=Char&VarName=ORIGIN_COUNTRY_NAME&VarDesc=OriginCountryName&VarType=Char&VarName=ORIGIN_WAC&VarDesc=OriginWac&VarType=Num&VarName=DEST_AIRPORT_ID&VarDesc=DestAirportID&VarType=Num&VarName=DEST_AIRPORT_SEQ_ID&VarDesc=DestAirportSeqID&VarType=Num&VarName=DEST_CITY_MARKET_ID&VarDesc=DestCityMarketID&VarType=Num&VarName=DEST&VarDesc=Dest&VarType=Char&VarName=DEST_CITY_NAME&VarDesc=DestCityName&VarType=Char&VarName=DEST_COUNTRY&VarDesc=DestCountry&VarType=Char&VarName=DEST_COUNTRY_NAME&VarDesc=DestCountryName&VarType=Char&VarName=DEST_WAC&VarDesc=DestWac&VarType=Num&VarName=YEAR&VarDesc=Year&VarType=Num&VarName=QUARTER&VarDesc=Quarter&VarType=Num&VarName=MONTH&VarDesc=Month&VarType=Num&VarName=DISTANCE_GROUP&VarDesc=DistanceGroup&VarType=Num&VarName=CLASS&VarDesc=Class&VarType=Char'

In [7]:
# DOMESTIC Get file name of zip file to be downloaded
url_dom = 'https://www.transtats.bts.gov/DownLoad_Table.asp?Table_ID=258&Has_Group=3&Is_Zipped=0'
response = requests.post(url_dom, headers=headers,data=raw_data_dom)
url_dom = response.url

In [8]:
# International Get file name of zip file to be downloaded
url_inter = 'https://www.transtats.bts.gov/DownLoad_Table.asp?Table_ID=260&Has_Group=3&Is_Zipped=0'
response = requests.post(url_inter, headers=headers,data=raw_data_inter)
url_inter = response.url

In [9]:
# Download and unzip the zip file

# Files should look like these, but with different starting number string
#url_dom = "https://transtats.bts.gov/ftproot/TranStatsData/982825318_T_T100D_MARKET_US_CARRIER_ONLY.zip"
#url_inter = "https://transtats.bts.gov/ftproot/TranStatsData/982825318_T_T100I_MARKET_ALL_CARRIER.zip"

urls =[url_dom, url_inter]

files_to_unzip = []
for url in urls:
    
    zip_file = url.split("/")[-1].split(".")[0]
    files_to_unzip.append(zip_file)
    
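    # Read the remote zip file; the context manager closes the connection
    with urllib.request.urlopen(url) as remote:
        data = remote.read()

    # Write the bytes to a local file named after the zip (without extension)
    with open(zip_file, 'wb') as local:
        local.write(data)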

# Unzip the downloaded aviation data
for file_to_unzip in files_to_unzip:

    # specifying the zip file name 
    file_name = file_to_unzip 

    # opening the zip file in READ mode 
    #with ZipFile(file_name, 'r') as zip: 
    with zp(file_name, 'r') as zip_:
        # printing all the contents of the zip file 
        zip_.printdir() 

        # extracting all the files 
        print('Extracting all the files now...') 
        zip_.extractall() 
        print('Done!') 

File Name                                             Modified             Size
336873339_T_T100D_MARKET_US_CARRIER_ONLY.csv   2020-12-17 14:32:38     64750631
Extracting all the files now...
Done!
File Name                                             Modified             Size
336873381_T_T100I_MARKET_ALL_CARRIER.csv       2020-12-17 14:32:52     18006051
Extracting all the files now...
Done!
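
The download cell writes each zip to disk before extracting it. As an alternative, the archive can be extracted straight from the downloaded bytes. This is a minimal sketch, not the notebook's method; the helper name `extract_zip_bytes` is hypothetical.

```python
import io
import os
from zipfile import ZipFile


def extract_zip_bytes(data, dest_dir="."):
    """Extract a zip archive held in memory and return the extracted file paths."""
    with ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        archive.extractall(dest_dir)
    return [os.path.join(dest_dir, name) for name in names]
```

With the bytes read in the loop above, this could be called as `extract_zip_bytes(data, wrkdir)`, and the returned paths used directly instead of rebuilding the CSV name from the URL.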


## DOMESTIC

### Read in Data

In [10]:
# Domestic Only: Take off and land in US

dom_fn = urls[0].split("/")[-1].split(".")[0] + ".csv"
dom_dir = f"{wrkdir}/{dom_fn}"
df_domestic = pd.read_csv(dom_dir, sep=",", 
                      converters={'PASSENGERS': lambda x: int(float(x))},
                      engine='python')

#Add country name (headers from International data)
df_domestic["ORIGIN_COUNTRY_NAME"]="United States"
df_domestic["ORIGIN_COUNTRY"]="US"
df_domestic["DEST_COUNTRY_NAME"]="United States"
df_domestic["DEST_COUNTRY"]="US"

#delete phantom column:
df_domestic = df_domestic[df_domestic.columns.drop(list(df_domestic.filter(regex='Unnamed')))]

# add marker for domestic
df_domestic["dataset"] = "domestic"

### Clean up df

In [11]:
dft = df_domestic.copy()

#lower case all columns
dft.columns = [x.lower() for x in dft.columns]

# Delete zero pax
dft = dft[dft["passengers"] != 0]

# Delete rows with < min_pax
dft = dft[dft['passengers'] >= min_pax]

#Sort by month
dft = dft.sort_values('month').reset_index(drop=True)

#delete Saipan: drops rows whose origin state code is "TT" (the destination is not filtered)
dft = dft[dft["origin_state_abr"]!="TT"]

In [12]:
# Add timestamp YYYY-MM-DD
def timestamp(x, year):
    DD = "01"
    # Zero-pad the month so January becomes "01", matching YYYY-MM-DD
    MM = f'{int(x):02d}'
    ts = f'{year}-{MM}-{DD}'
    return ts
dft["timestamp"] = dft.month.apply(lambda x: timestamp(x, year))
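
The timestamp can also be built for the whole column at once. This is a minimal sketch using `pd.to_datetime` on year/month/day columns; the function name `month_to_timestamp` is an assumption, and it always zero-pads the month.

```python
import pandas as pd


def month_to_timestamp(months, year):
    """Build 'YYYY-MM-01' strings from a Series of integer month numbers."""
    parts = pd.DataFrame({"year": int(year), "month": months.astype(int), "day": 1})
    # to_datetime assembles dates from columns named year, month and day
    return pd.to_datetime(parts).dt.strftime("%Y-%m-%d")
```

In this notebook it would be called as `month_to_timestamp(dft["month"], year)`.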

In [13]:
# Split to just city name
dft["origin_city"] = dft.origin_city_name.apply(lambda x: x.split(",")[0])
dft["dest_city"] = dft.dest_city_name.apply(lambda x: x.split(",")[0])

In [14]:
# Rename columns for schema:
dft=dft.rename(columns = {'origin':'origin_airport_code', 'dest': 'dest_airport_code',
                       'origin_country_name': 'origin_admin0', 'dest_country_name': 'dest_admin0',
                       'origin_country': 'origin_iso2', 'dest_country': 'dest_iso2'})


In [15]:
df = dft.copy()

# Add admin1 "Ohio"
conv_state_fn =f"{wrkdir}/abv_to_state.txt"

df_conv_state = pd.read_csv(conv_state_fn, sep="\t")
state_conv_d = df_conv_state.set_index("Code").to_dict()
state_conv_d = state_conv_d["Description"]

df["origin_admin1"] = df.origin_state_abr.apply(lambda x: state_conv_d.get(x, "None"))
df["dest_admin1"] = df.dest_state_abr.apply(lambda x: state_conv_d.get(x, "None"))

In [16]:
# ADD COUNTY NAME

# Read in Airport Facilities Directory data to get county name
afd_fn = f"{wrkdir}/airportFD.txt"
df_afd = pd.read_csv(afd_fn, sep="\t")

#Build county NAME dictionary
df_afd_county = pd.DataFrame(df_afd, columns = ['LocationID', 'County'])
county_dict = df_afd_county.set_index('LocationID').to_dict()
county_d = county_dict["County"]

df["origin_admin2"] = df.origin_airport_code.apply(lambda x: county_d.get(x, np.nan))
df["dest_admin2"] = df.dest_airport_code.apply(lambda x: county_d.get(x, np.nan))

In [17]:
# Replace "#NAME?" with the string "None" (for Puerto Rico); "None" is converted to NaN in the final cleanup
df.replace('#NAME?',"None", inplace=True)

In [18]:
# ADD FIPS CODE

# Add county FIPS from county NAME
fips_fn = f"{wrkdir}/county_to_fips.csv"
df_fips = pd.read_csv(fips_fn , sep=",", converters={"FIPS County Code": lambda x: str(x)},engine='python')

# Lookup dict
fips_d = df_fips.groupby('State').apply(lambda x: [dict(zip(x["County Name"], x["FIPS County Code"]))]).to_dict()

In [19]:
df["munger_origin"]= df[["origin_state_abr",'origin_admin2']].values.tolist()
df["munger_dest"]= df[["dest_state_abr",'dest_admin2']].values.tolist()

In [20]:
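def fips_it(x):
    
    abbv = x[0]
    county = x[1]
    try:
        return fips_d[abbv][0][county]
    except (KeyError, IndexError, TypeError):
        # Unknown state or county (including missing values): no FIPS code
        return "None"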

df["origin_fips"] = df.munger_origin.apply(lambda x: fips_it(x))
df["dest_fips"] = df.munger_dest.apply(lambda x: fips_it(x))
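
The same lookup can be expressed as a left join instead of a nested dictionary. This is a minimal sketch on toy data, not the notebook's method. The lookup column names follow `county_to_fips.csv` as used above; the rows themselves are illustrative stand-ins.

```python
import pandas as pd

# Toy stand-ins for county_to_fips.csv and the flight table (rows are illustrative)
fips_table = pd.DataFrame({
    "State": ["NY", "FL"],
    "County Name": ["QUEENS", "MIAMI-DADE"],
    "FIPS County Code": ["36081", "12086"],
})
flights = pd.DataFrame({
    "origin_state_abr": ["NY", "FL", "AK"],
    "origin_admin2": ["QUEENS", "MIAMI-DADE", "UNKNOWN"],
})

# Rename the lookup columns so they line up with the flight table
lookup = fips_table.rename(columns={
    "State": "origin_state_abr",
    "County Name": "origin_admin2",
    "FIPS County Code": "origin_fips",
})

# A left join keeps every flight row; unmatched counties get NaN, filled here with "None"
flights = flights.merge(lookup, on=["origin_state_abr", "origin_admin2"], how="left")
flights["origin_fips"] = flights["origin_fips"].fillna("None")
```

One caveat with this approach: if the lookup table contains duplicate state/county pairs, the join duplicates flight rows, so the lookup would need `drop_duplicates` first.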

In [21]:
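# Match capitalization format
def uncap_it(x):
    # Note: str() turns a missing value (NaN) into "nan", which is returned as "Nan"
    text = str(x)

    if "-" in text:
        # Capitalize each hyphen-separated part, e.g. "MIAMI-DADE" -> "Miami-Dade"
        return "-".join(t.capitalize() for t in text.split("-"))

    # Capitalize each whitespace-separated word, e.g. "LOS ANGELES" -> "Los Angeles"
    return " ".join(t.capitalize() for t in text.split())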

# Format proper capitalization for County NAMEs
df["origin_admin2"]=df["origin_admin2"].apply(lambda x: uncap_it(x))
df["dest_admin2"]=df["dest_admin2"].apply(lambda x: uncap_it(x))

In [22]:
# Get the columns we need
df_domestic_final = df.copy()

keepers = ['timestamp',
'origin_airport_code','origin_city','origin_state_abr','origin_admin2','origin_fips','origin_admin1','origin_admin0','origin_iso2',
'dest_airport_code','dest_city','dest_state_abr','dest_admin2','dest_fips','dest_admin1','dest_admin0','dest_iso2',
'distance','passengers', 'month']
df_domestic_final = df[keepers]

## INTERNATIONAL

In [23]:
# International: One (and only one) non-US location
#for fn in os.listdir(wrkdir):
#    if "T100I_MARKET_ALL_CARRIER.csv" in fn:
#        file = fn

inter_fn =  urls[1].split("/")[-1].split(".")[0] + ".csv"
inter_dir = f"{wrkdir}/{inter_fn}"
df_international = pd.read_csv(inter_dir , 
                               sep=",", 
                               converters={"PASSENGERS": lambda x: int(float(x))},
                               engine='python')

df = df_international.copy()

### Clean up DF

In [24]:
# add marker for international flights
df["dataset"] = "inter" 

#delete phantom column:
df = df[df.columns.drop(list(df.filter(regex='Unnamed')))]

#lower case all columns
df.columns = [x.lower() for x in df.columns]

# Delete zero pax
df = df[df["passengers"] != 0]

# Delete rows with < min_pax
df = df[df['passengers'] >= min_pax]

#Sort by month
df = df.sort_values('month').reset_index(drop=True)

In [25]:
# Add timestamp YYYY-MM-DD
def timestamp(x, year):
    DD = "01"
    # Zero-pad the month so January becomes "01", matching YYYY-MM-DD
    MM = f'{int(x):02d}'
    ts = f'{year}-{MM}-{DD}'
    return ts
df["timestamp"] = df.month.apply(lambda x: timestamp(x, year))

# Split to just city name
df["origin_city"] = df.origin_city_name.apply(lambda x: x.split(",")[0])
df["dest_city"] = df.dest_city_name.apply(lambda x: x.split(",")[0])

# Rename columns for schema:
df=df.rename(columns = {'origin':'origin_airport_code', 'dest': 'dest_airport_code',
                       'origin_country_name': 'origin_admin0', 'dest_country_name': 'dest_admin0',
                       'origin_country': 'origin_iso2', 'dest_country': 'dest_iso2'})

In [26]:
# add state abr:
def abrv_it(x):
    abr = x.split(",")[1].lstrip()

    if len(abr) == 2:
        return abr
    else:
        return "None"
    
df["origin_state_abr"] = df.origin_city_name.apply(lambda x: abrv_it(x))
df["dest_state_abr"] = df.dest_city_name.apply(lambda x: abrv_it(x))
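
The city and state-abbreviation parsing can also be done on the whole column. This is a minimal sketch, assuming every name has the form `City, Suffix`; the helper name `split_city_state` is hypothetical. Unlike `abrv_it`, it splits only on the first comma.

```python
import pandas as pd


def split_city_state(names):
    """Split 'City, XX' strings into a city and a two-letter state abbreviation.

    Suffixes that are not exactly two characters (e.g. country names) become "None".
    """
    parts = names.str.split(",", n=1, expand=True)
    city = parts[0].str.strip()
    suffix = parts[1].str.strip()
    state = suffix.where(suffix.str.len() == 2, "None")
    return pd.DataFrame({"city": city, "state_abr": state})
```

For `"New York, NY"` this yields the abbreviation `NY`, while `"Madrid, Spain"` yields `"None"`, matching the behavior of `abrv_it` for those inputs.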

In [27]:
# Add admin1 (state/province)
conv_fn = f"{wrkdir}/worldcities.csv"
df_conv = pd.read_csv(conv_fn, sep=",")
df_conv["munger"] = df_conv[['iso2', 'city_ascii', 'admin_name']].values.tolist()

conv_d = df_conv.groupby('country').apply(lambda x: [dict(zip(x.city_ascii, x.admin_name))]).to_dict()


df["origin_admin1_temp"] = df.origin_state_abr.apply(lambda x: state_conv_d.get(x, "None"))
df["dest_admin1_temp"] = df.dest_state_abr.apply(lambda x: state_conv_d.get(x, "None"))

df["origin_munger"] = df[['origin_admin0', "origin_city", 'origin_admin1_temp']].values.tolist()
df["dest_munger"] = df[['dest_admin0', "dest_city", 'dest_admin1_temp']].values.tolist()

In [28]:
def admin1_origin(x):

    country = x[0]
    
    if country != "United States":

        city = x[1]
        
        if conv_d.get(country, "None") != "None":
            admin1 = conv_d[country][0].get(city)

            return admin1
        
    else:
        return x[2]

In [29]:
df['origin_admin1'] = df.origin_munger.apply(lambda x: admin1_origin(x))
df['dest_admin1'] = df.dest_munger.apply(lambda x: admin1_origin(x))

In [30]:
# Add admin2 (County Name for US)
def admin2_it(x):
    
    if x[1] == "United States":
        
        return county_d.get(x[0], "None")
    else:
        return "None"
    
df["admin2org"] = df[['origin_airport_code', 'origin_admin0']].values.tolist()
df["admin2dest"] = df[['dest_airport_code', 'dest_admin0']].values.tolist()


df["origin_admin2"] = df.admin2org.apply(lambda x: admin2_it(x))
df["dest_admin2"] = df.admin2dest.apply(lambda x: admin2_it(x))

In [31]:
# Get fips for US airports

df["munger_origin_fips"]= df[["origin_state_abr",'origin_admin2']].values.tolist()
df["munger_dest_fips"]= df[["dest_state_abr",'dest_admin2']].values.tolist()

df["origin_fips"] = df.munger_origin_fips.apply(lambda x: fips_it(x))
df["dest_fips"] = df.munger_dest_fips.apply(lambda x: fips_it(x))


# Format proper capitalization for County NAMEs
df["origin_admin2"]=df["origin_admin2"].apply(lambda x: uncap_it(x))
df["dest_admin2"]=df["dest_admin2"].apply(lambda x: uncap_it(x))


# Get the columns we need
df_inter_final = df.copy()

keepers = ['timestamp',
'origin_airport_code','origin_city','origin_state_abr','origin_admin2','origin_fips','origin_admin1','origin_admin0','origin_iso2',
'dest_airport_code','dest_city','dest_state_abr','dest_admin2','dest_fips','dest_admin1','dest_admin0','dest_iso2',
'distance', 'passengers', 'month']
df_inter_final = df_inter_final[keepers]

### Combine domestic and international dataframes

In [32]:
# Verify dfs have same columns...
a=df_domestic_final.columns.to_list()
b=df_inter_final.columns.to_list()
e= set(a)^set(b)
e

set()

In [33]:
# Combine dataframes
df_final = pd.concat([df_domestic_final, df_inter_final], ignore_index=True)

In [34]:
a = df_domestic_final.shape[0]
b = df_inter_final.shape[0]
c = df_final.shape[0]
c == (a+b)
print(f'{a} + {b} = {c} is {c == (a+b)}')

147887 + 48147 = 196034 is True


### Aggregate origin to dest flights by month (without regard to airline)

In [35]:
cols = list(df_final.columns)
cols.remove("passengers")

In [36]:
df_agg = df_final.groupby(cols, as_index=False)[['passengers']].sum()
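
One thing to watch with this `groupby`: by default pandas drops any group whose key contains a real missing value (`NaN` or `None`), as opposed to the string `"None"` used as a placeholder earlier in this notebook. This is a minimal sketch on toy rows showing the difference, assuming pandas 1.1 or later, where the `dropna` argument exists. The values are illustrative.

```python
import pandas as pd

# Toy rows: two carriers on JFK-MAD with a missing county, one row on JFK-LAX
rows = pd.DataFrame({
    "origin_airport_code": ["JFK", "JFK", "JFK"],
    "dest_airport_code": ["MAD", "MAD", "LAX"],
    "month": [12, 12, 12],
    "dest_admin2": [None, None, "Los Angeles"],
    "passengers": [18040, 6974, 500],
})
keys = [c for c in rows.columns if c != "passengers"]

# Default behaviour: groups whose key contains a missing value are dropped
default_agg = rows.groupby(keys, as_index=False)[["passengers"]].sum()

# dropna=False keeps those groups
kept_agg = rows.groupby(keys, as_index=False, dropna=False)[["passengers"]].sum()
```

Comparing `df_agg["passengers"].sum()` with `df_final["passengers"].sum()` is a quick way to check whether any rows were lost this way.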

### Aggregation Test Below (not needed in final code):

In [37]:
# Fill these out
orig = "JFK"
dest = "MAD"
mon = 12
typ = "inter" #  "inter" or "dom"...Different dfs to test domestic or international...pick one

In [38]:
if typ == "dom":
    #Domestic
    no_agg = dft[(dft["origin_airport_code"]==orig) & (dft["dest_airport_code"]==dest) & (dft["month"]==mon)]
    agg = df_agg[(df_agg["origin_airport_code"]==orig) & (df_agg["dest_airport_code"]==dest) & (df_agg["month"]==mon)]

else:    
    #INTER
    no_agg = df[(df["origin_airport_code"]==orig) & (df["dest_airport_code"]==dest) & (df["month"]==mon)]
    agg = df_agg[(df_agg["origin_airport_code"]==orig) & (df_agg["dest_airport_code"]==dest) & (df_agg["month"]==mon)]

In [39]:
no_agg

Unnamed: 0,passengers,freight,mail,distance,unique_carrier,airline_id,unique_carrier_name,unique_carrier_entity,region,carrier,carrier_name,carrier_group,carrier_group_new,origin_airport_id,origin_airport_seq_id,origin_city_market_id,origin_airport_code,origin_city_name,origin_iso2,origin_admin0,origin_wac,dest_airport_id,dest_airport_seq_id,dest_city_market_id,dest_airport_code,dest_city_name,dest_iso2,dest_admin0,dest_wac,year,quarter,month,distance_group,class,dataset,timestamp,origin_city,dest_city,origin_state_abr,dest_state_abr,origin_admin1_temp,dest_admin1_temp,origin_munger,dest_munger,origin_admin1,dest_admin1,admin2org,admin2dest,origin_admin2,dest_admin2,munger_origin_fips,munger_dest_fips,origin_fips,dest_fips
44174,18040,532851.0,0.0,3589.0,IB,19547,Iberia Air Lines Of Spain,9482B,I,IB,Iberia Air Lines Of Spain,0,0,12478,1247805,31703,JFK,"New York, NY",US,United States,22,13156,1315606,31584,MAD,"Madrid, Spain",ES,Spain,482,2019,4,12,8,F,inter,2019-12-01,New York,Madrid,NY,,New York,,"[United States, New York, New York]","[Spain, Madrid, None]",New York,Madrid,"[JFK, United States]","[MAD, Spain]",Queens,,"[NY, QUEENS]","[None, None]",36081,
44334,6974,212034.0,7667.0,3589.0,AA,19805,American Airlines Inc.,10049,A,AA,American Airlines Inc.,3,3,12478,1247805,31703,JFK,"New York, NY",US,United States,22,13156,1315606,31584,MAD,"Madrid, Spain",ES,Spain,482,2019,4,12,8,F,inter,2019-12-01,New York,Madrid,NY,,New York,,"[United States, New York, New York]","[Spain, Madrid, None]",New York,Madrid,"[JFK, United States]","[MAD, Spain]",Queens,,"[NY, QUEENS]","[None, None]",36081,
45223,4478,125807.0,49256.0,3589.0,DL,19790,Delta Air Lines Inc.,10261,A,DL,Delta Air Lines Inc.,3,3,12478,1247805,31703,JFK,"New York, NY",US,United States,22,13156,1315606,31584,MAD,"Madrid, Spain",ES,Spain,482,2019,4,12,8,F,inter,2019-12-01,New York,Madrid,NY,,New York,,"[United States, New York, New York]","[Spain, Madrid, None]",New York,Madrid,"[JFK, United States]","[MAD, Spain]",Queens,,"[NY, QUEENS]","[None, None]",36081,
45583,6510,5188.0,0.0,3589.0,UX,20119,Air Europa,9482A,I,UX,Air Europa,0,0,12478,1247805,31703,JFK,"New York, NY",US,United States,22,13156,1315606,31584,MAD,"Madrid, Spain",ES,Spain,482,2019,4,12,8,F,inter,2019-12-01,New York,Madrid,NY,,New York,,"[United States, New York, New York]","[Spain, Madrid, None]",New York,Madrid,"[JFK, United States]","[MAD, Spain]",Queens,,"[NY, QUEENS]","[None, None]",36081,
47909,4812,0.0,0.0,3589.0,DY,21579,Norwegian Air Shuttle ASA,71127,I,DY,Norwegian Air Shuttle ASA,0,0,12478,1247805,31703,JFK,"New York, NY",US,United States,22,13156,1315606,31584,MAD,"Madrid, Spain",ES,Spain,482,2019,4,12,8,F,inter,2019-12-01,New York,Madrid,NY,,New York,,"[United States, New York, New York]","[Spain, Madrid, None]",New York,Madrid,"[JFK, United States]","[MAD, Spain]",Queens,,"[NY, QUEENS]","[None, None]",36081,


In [40]:
agg

Unnamed: 0,timestamp,origin_airport_code,origin_city,origin_state_abr,origin_admin2,origin_fips,origin_admin1,origin_admin0,origin_iso2,dest_airport_code,dest_city,dest_state_abr,dest_admin2,dest_fips,dest_admin1,dest_admin0,dest_iso2,distance,month,passengers
31622,2019-12-01,JFK,New York,NY,Queens,36081,New York,United States,US,MAD,Madrid,,,,Madrid,Spain,ES,3589.0,12,40814


In [41]:
no = no_agg['passengers'].sum()
yes = agg["passengers"]

print(f'{(yes==no).iloc[0]} for {orig} to {dest} (non-aggregated {no} = aggregated {yes.iloc[0]})')

True for JFK to MAD (non-aggregated 40814 = aggregated 40814)


### Aggregation test complete; back to the main code:

In [42]:
# Final cleanup
df_agg.replace('#name?',np.nan, inplace=True)
df_agg.replace('None',np.nan, inplace=True)

df_agg = df_agg.sort_values('month').reset_index(drop=True)

In [43]:
# Write to CSV
del df_agg['month']
df_agg.to_csv(f'{wrkdir}/airport_pax_traffic_year={year}_min_pax={min_pax}.csv', index = False)
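
When the exported file is read back, the FIPS columns need to stay as text, since codes with a leading zero (for example, Alabama counties begin with `01`) would otherwise be parsed as integers. This is a minimal sketch; the helper name `read_traffic_csv` is an assumption, and the column names are those written above.

```python
import pandas as pd


def read_traffic_csv(path):
    """Read the exported CSV, keeping FIPS codes as strings and parsing timestamps."""
    return pd.read_csv(
        path,
        dtype={"origin_fips": "string", "dest_fips": "string"},
        parse_dates=["timestamp"],
    )
```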

### Airport Facilities Dir NOTES:

#### County name to FIPS:
- Delete `.` after `ST`

- Change "DADE" to "MIAMI-DADE"

#### Updates to IATA vs ICAO in airportFD.txt:
state, from, to

- AK, RBH copy of 5Z9
- AK, KLW copy of AKW
- CA, TRK, TKF
- CA, CLD, CRQ
- CA, IZA, SQA
- MT, FCA, GPI
- AZ, AZA, IWA
- AZ, SCF, SDL
- AZ, NYL, YUM
- AZ, 1Z1, DQS
- AZ, AZC, AZ7
- PR, VQS, JRV
- PA, UNV, SCE
- MA, added UBF copy of CQX
- MA, added QMN copy of 1B9
- MO, added BKG copy of BBG
- GA, added QMA, copy of RYY
- GA, added LIY copy of LHW
- NC, JQF, USA
- NC, AKH, NC1
- NC,  added NC2
- NV, BVU, BLD
- NV, HSH copy of HND
- NV, NV05,NV5
- MI, SAW, MQT
- WA, S60, KEH
- WA, added LKE same as KEH
- WA, ORS, ESD
- WA, FHR, FRD
- WA, W33, FBS
- TX, DNE copy of DFW
- FL, RQZ copy of HRT
- FL, X44, MPB
- FL, RBN has no ICAO code (Fort Jefferson Island, off Key West)
- FL, DTS, DSI
- NY, POU, DQK * added row
- NY, 0B8, FID *
- NY, VWK, 5B2 *
- ND ISN: closed October 10, 2019
- SC, HXD, HHH *
- SC, SC1 added Beaufort MCAS 
- UT, UXR copy of UT25
- NJ, added PCT copy 39N
- NJ, added NJ1 copy 19N
- NM, TSM copy SKX
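
The updates listed above were made by hand. For a fresh download, they could instead be applied in code. This is a minimal sketch on a toy directory, and it rests on two assumptions: that a `from, to` entry means renaming `LocationID` from the first code to the second, and that `X copy of Y` means adding a row for `X` with the same values as `Y`. The helper name and county values are hypothetical.

```python
import pandas as pd

# Toy airport directory with the two columns the notebook reads (placeholder counties)
directory = pd.DataFrame({
    "LocationID": ["TRK", "5Z9"],
    "County": ["COUNTY A", "COUNTY B"],
})

renames = {"TRK": "TKF"}   # assumed meaning of a "state, from, to" entry
copies = {"RBH": "5Z9"}    # assumed meaning of "RBH copy of 5Z9": new code -> existing code


def apply_code_updates(directory, renames, copies):
    """Return a copy of the directory with codes renamed and copied rows appended."""
    updated = directory.copy()
    updated["LocationID"] = updated["LocationID"].replace(renames)
    extra = []
    for new_code, source_code in copies.items():
        row = updated[updated["LocationID"] == source_code].copy()
        row["LocationID"] = new_code
        extra.append(row)
    return pd.concat([updated, *extra], ignore_index=True)
```

Keeping the two dictionaries in the repo would make the manual edits repeatable whenever the directory file is refreshed.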


#### TO `county_to_fips.csv` added:
"MIAMI-DADE","FL","10120","12086","33124","Miami-Miami Beach-Kendall, FL"

# Hand-jam mapping of ICAO to County code:

w= df_res[df_res["DEST_COUNTY"]=="None"]
w.shape

# Switch to Origin also
w["DEST_STATE_ABR"].unique()

w[w["ORIGIN_STATE_ABR"]=="FL"]["ORIGIN"].unique()

w[w["DEST_STATE_ABR"]=="TN"]

# Airports without County FIPS

df_fin[df_fin["ORIGIN_COUNTY_FIPS"]=='None'].shape

d=df_fin[(df_fin["ORIGIN_COUNTY"]=='None') & (df_fin["ORIGIN_STATE_ABR"]=="AK")]

s=d["ORIGIN"]

v=set(s)

len(v)