# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [92]:
# Do all imports and installs here
import pandas as pd
import json

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did they come from? What type of information is included?

In [83]:
# Read in the data here
df_airport = pd.read_csv("airport-codes_csv.csv")
df_immigration = pd.read_csv("immigration_data_sample.csv")
df_demographics = pd.read_csv("us-cities-demographics.csv",sep=';')
df_temperature = pd.read_csv("GlobalLandTemperaturesByState.csv")

In [7]:
df_airport.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [8]:
df_immigration.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,...,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,...,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


In [9]:
df_demographics.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [10]:
df_temperature.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,State,Country
0,1855-05-01,25.544,1.171,Acre,Brazil
1,1855-06-01,24.228,1.103,Acre,Brazil
2,1855-07-01,24.371,1.044,Acre,Brazil
3,1855-08-01,25.427,1.073,Acre,Brazil
4,1855-09-01,25.675,1.014,Acre,Brazil


In [10]:
from pyspark.sql import SparkSession

# Start Spark with the spark-sas7bdat package so the SAS file can be read
spark = (SparkSession.builder
         .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
         .enableHiveSupport()
         .getOrCreate())
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [17]:
df_sas = df_spark.limit(500).toPandas()

In [18]:
df_sas.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,...,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,...,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2
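
The `arrdate` and `depdate` columns in the previews hold numbers such as 20566.0 rather than readable dates. SAS stores dates as day counts from 1960-01-01, so, assuming these columns are SAS dates, they can be converted as in the minimal sketch below. The variable names are illustrative and are not part of the notebook.

```python
import pandas as pd

# Values copied from the arrdate column in the previews above
arrdate = pd.Series([20566.0, 20551.0, 20545.0])

# Assumption: the numbers are days elapsed since the SAS epoch, 1960-01-01
arrival = pd.to_datetime(arrdate, unit="D", origin="1960-01-01")
print(arrival.dt.strftime("%Y-%m-%d").tolist())
```

Under that assumption the converted dates fall in April 2016, which agrees with the `i94yr` and `i94mon` columns.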


In [11]:
# Write the SAS data to Parquet, then read it back from the Parquet files
df_spark.write.parquet("sas_data")
df_spark = spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data
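
The checks named above (missing values, duplicate rows) can be sketched with a small helper. This is a minimal sketch, not the project's cleaning code: `summarize_quality` is a hypothetical helper name, and the sample rows are made up to keep the example self-contained.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and null fractions (hypothetical helper)."""
    nulls = df.isnull().sum()
    return pd.DataFrame({
        "null_count": nulls,
        "null_fraction": nulls / max(len(df), 1),
    })


# Made-up sample rows, for illustration only
sample = pd.DataFrame({
    "i94addr": ["HI", None, "FL", "FL"],
    "i94port": ["HHW", "MCA", "OGG", "OGG"],
})

quality = summarize_quality(sample)
duplicate_rows = int(sample.duplicated().sum())  # rows identical to an earlier row
print(quality)
print("duplicate rows:", duplicate_rows)
```

The same helper could be applied to `df_airport`, `df_immigration`, `df_demographics`, and `df_temperature` to decide which columns need cleaning.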

### Temperature Data by State

In [11]:
# Performing cleaning tasks here
import datetime
df_temperature['dt'] = pd.to_datetime(df_temperature['dt'])
df_temperature['year'] = df_temperature['dt'].dt.year
df_temperature['month'] = df_temperature['dt'].dt.month

In [12]:
us_df_temperature = df_temperature[(df_temperature["Country"]=="United States")&(df_temperature['year'] == 2013)]

In [35]:
us_state_abbrev = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'District Of Columbia': 'DC',
    'Florida': 'FL',
    'Georgia (State)': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Northern Mariana Islands':'MP',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Palau': 'PW',
    'Pennsylvania': 'PA',
    'Puerto Rico': 'PR',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virgin Islands': 'VI',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
}

us_abbrev_state = {v: k for k, v in us_state_abbrev.items()}
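
The dictionary uses keys such as 'Georgia (State)' and 'District Of Columbia', so a state name spelled differently in the data would not match. A quick way to spot such names before mapping is sketched below. The shortened lookup and the sample series are made-up stand-ins for `us_state_abbrev` and the `State` column.

```python
import pandas as pd

# Shortened, made-up stand-ins for the real lookup and the State column
state_lookup = {"Alabama": "AL", "Georgia (State)": "GA"}
states = pd.Series(["Alabama", "Georgia (State)", "Georgia"])

# Series.map returns NaN for names that are missing from the lookup
abbrevs = states.map(state_lookup)
unmapped = sorted(states[abbrevs.isnull()].unique())
print(unmapped)
```

An empty `unmapped` list would suggest that every state name in the data has an abbreviation.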

In [33]:
# Map state names to abbreviations; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
us_df_temperature = us_df_temperature.assign(state_abbrev=us_df_temperature["State"].map(us_state_abbrev))


In [34]:
us_df_temperature.head(20)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,State,Country,year,month,state_abbrev
10688,2013-01-01,10.284,0.241,Alabama,United States,2013,1,AL
10689,2013-02-01,9.161,0.213,Alabama,United States,2013,2,AL
10690,2013-03-01,10.226,0.158,Alabama,United States,2013,3,AL
10691,2013-04-01,17.067,0.221,Alabama,United States,2013,4,AL
10692,2013-05-01,20.619,0.229,Alabama,United States,2013,5,AL
10693,2013-06-01,26.072,0.175,Alabama,United States,2013,6,AL
10694,2013-07-01,25.952,0.207,Alabama,United States,2013,7,AL
10695,2013-08-01,26.107,0.257,Alabama,United States,2013,8,AL
10696,2013-09-01,25.121,1.12,Alabama,United States,2013,9,AL
15098,2013-01-01,-15.329,0.495,Alaska,United States,2013,1,AK


In [48]:
city_codes= {'ALC'	:	'ALCAN             ',
              'ANC'	:	'ANCHORAGE         ',
              'BAR'	:	'BAKER AAF - BAKER ISLAND',
              'DAC'	:	'DALTONS CACHE     ',
              'PIZ'	:	'DEW STATION PT LAY DEW',
              'DTH'	:	'DUTCH HARBOR      ',
              'EGL'	:	'EAGLE             ',
              'FRB'	:	'FAIRBANKS         ',
              'HOM'	:	'HOMER             ',
              'HYD'	:	'HYDER             ',
              'JUN'	:	'JUNEAU            ',
              '5KE'	:	'KETCHIKAN',
              'KET'	:	'KETCHIKAN         ',
              'MOS'	:	'MOSES POINT INTERMEDIATE',
              'NIK'	:	'NIKISKI           ',
              'NOM'	:	'NOM               ',
              'PKC'	:	'POKER CREEK       ',
              'ORI'	:	'PORT LIONS SPB',
              'SKA'	:	'SKAGWAY           ',
              'SNP'	:	'ST. PAUL ISLAND',
              'TKI'	:	'TOKEEN',
              'WRA'	:	'WRANGELL          ',
              'HSV'	:	'HUNTSVILLE',
              'MOB'	:	'MOBILE            ',
              'LIA'	:	'LITTLE ROCK',
              'ROG'	:	'ROGERS ARPT',
              'DOU'	:	'DOUGLAS           ',
              'LUK'	:	'LUKEVILLE         ',
              'MAP'	:	'MARIPOSA            ',
              'NAC'	:	'NACO              ',
              'NOG'	:	'NOGALES           ',
              'PHO'	:	'PHOENIX           ',
              'POR'	:	'PORTAL',
              'SLU'	:	'SAN LUIS          ',
              'SAS'	:	'SASABE            ',
              'TUC'	:	'TUCSON            ',
              'YUI'	:	'YUMA              ',
              'AND'	:	'ANDRADE           ',
              'BUR'	:	'BURBANK',
              'CAL'	:	'CALEXICO          ',
              'CAO'	:	'CAMPO             ',
              'FRE'	:	'FRESNO            ',
              'ICP'	:	'IMPERIAL COUNTY   ',
              'LNB'	:	'LONG BEACH         ',
              'LOS'	:	'LOS ANGELES       ',
              'BFL'	:	'BAKERSFIELD',
              'OAK'	:	'OAKLAND ',
              'ONT'	:	'ONTARIO',
              'OTM'	:	'OTAY MESA          ',
              'BLT'	:	'PACIFIC, HWY. STATION ',
              'PSP'	:	'PALM SPRINGS',
              'SAC'	:	'SACRAMENTO        ',
              'SLS'	:	'SALINAS (BPS)',
              'SDP'	:	'SAN DIEGO',
              'SFR'	:	'SAN FRANCISCO     ',
              'SNJ'	:	'SAN JOSE          ',
              'SLO'	:	'SAN LUIS OBISPO   ',
              'SLI'	:	'SAN LUIS OBISPO (BPS)',
              'SPC'	:	'SAN PEDRO         ',
              'SYS'	:	'SAN YSIDRO        ',
              'SAA'	:	'SANTA ANA         ',
              'STO'	:	'STOCKTON (BPS)',
              'TEC'	:	'TECATE            ',
              'TRV'	:	'TRAVIS-AFB        ',
              'APA'	:	'ARAPAHOE COUNTY',
              'ASE'	:	'ASPEN #ARPT',
              'COS'	:	'COLORADO SPRINGS',
              'DEN'	:	'DENVER            ',
              'DRO'	:	'LA PLATA - DURANGO',
              'BDL'	:	'BRADLEY INTERNATIONAL',
              'BGC'	:	'BRIDGEPORT        ',
              'GRT'	:	'GROTON            ',
              'HAR'	:	'HARTFORD          ',
              'NWH'	:	'NEW HAVEN         ',
              'NWL'	:	'NEW LONDON        ',
              'TST'	:	'NEWINGTON DATA CENTER TEST',
              'WAS'	:	'WASHINGTON DC         ',
              'DOV'	:	'DOVER AFB',
              'DVD'	:	'DOVER-AFB         ',
              'WLL'	:	'WILMINGTON        ',
              'BOC'	:	'BOCAGRANDE        ',
              'SRQ'	:	'BRADENTON - SARASOTA',
              'CAN'	:	'CAPE CANAVERAL    ',
              'DAB'	:	'DAYTONA BEACH INTERNATIONAL',
              'FRN'	:	'FERNANDINA        ',
              'FTL'	:	'FORT LAUDERDALE   ',
              'FMY'	:	'FORT MYERS        ',
              'FPF'	:	'FORT PIERCE       ',
              'HUR'	:	'HURLBURT FIELD',
              'GNV'	:	'J R ALISON MUNI - GAINESVILLE',
              'JAC'	:	'JACKSONVILLE      ',
              'KEY'	:	'KEY WEST          ',
              'LEE'	:	'LEESBURG MUNICIPAL AIRPORT',
              'MLB'	:	'MELBOURNE',
              'MIA'	:	'MIAMI             ',
              'APF'	:	'NAPLES #ARPT',
              'OPF'	:	'OPA LOCKA',
              'ORL'	:	'ORLANDO           ',
              'PAN'	:	'PANAMA CITY       ',
              'PEN'	:	'PENSACOLA         ',
              'PCF'	:	'PORT CANAVERAL    ',
              'PEV'	:	'PORT EVERGLADES   ',
              'PSJ'	:	'PORT ST JOE       ',
              'SFB'	:	'SANFORD           ',
              'SGJ'	:	'ST AUGUSTINE ARPT',
              'SAU'	:	'ST AUGUSTINE      ',
              'FPR'	:	'ST LUCIE COUNTY',
              'SPE'	:	'ST PETERSBURG     ',
              'TAM'	:	'TAMPA             ',
              'WPB'	:	'WEST PALM BEACH   ',
              'ATL'	:	'ATLANTA           ',
              'BRU'	:	'BRUNSWICK         ',
              'AGS'	:	'BUSH FIELD - AUGUSTA',
              'SAV'	:	'SAVANNAH          ',
              'AGA'	:	'AGANA             ',
              'HHW'	:	'HONOLULU          ',
              'OGG'	:	'KAHULUI - MAUI',
              'KOA'	:	'KEAHOLE-KONA      ',
              'LIH'	:	'LIHUE             ',
              'CID'	:	'CEDAR RAPIDS/IOWA CITY',
              'DSM'	:	'DES MOINES',
              'BOI'	:	'AIR TERM. (GOWEN FLD) BOISE',
              'EPI'	:	'EASTPORT          ',
              'IDA'	:	'FANNING FIELD - IDAHO FALLS',
              'PTL'	:	'PORTHILL          ',
              'SPI'	:	'CAPITAL - SPRINGFIELD',
              'CHI'	:	'CHICAGO           ',
              'DPA'	:	'DUPAGE COUNTY',
              'PIA'	:	'GREATER PEORIA',
              'RFD'	:	'GREATER ROCKFORD',
              'UGN'	:	'MEMORIAL - WAUKEGAN',
              'GAR'	:	'GARY              ',
              'HMM'	:	'HAMMOND           ',
              'INP'	:	'INDIANAPOLIS      ',
              'MRL'	:	'MERRILLVILLE      ',
              'SBN'	:	'SOUTH BEND',
              'ICT'	:	'MID-CONTINENT - WITCHITA',
              'LEX'	:	'BLUE GRASS - LEXINGTON',
              'LOU'	:	'LOUISVILLE        ',
              'BTN'	:	'BATON ROUGE       ',
              'LKC'	:	'LAKE CHARLES      ',
              'LAK'	:	'LAKE CHARLES (BPS)',
              'MLU'	:	'MONROE',
              'MGC'	:	'MORGAN CITY       ',
              'NOL'	:	'NEW ORLEANS       ',
              'BOS'	:	'BOSTON            ',
              'GLO'	:	'GLOUCESTER        ',
              'BED'	:	'HANSCOM FIELD - BEDFORD',
              'LYN'	:	'LYNDEN            ',
              'ADW'	:	'ANDREWS AFB',
              'BAL'	:	'BALTIMORE         ',
              'MKG'	:	'MUSKEGON',
              'PAX'	:	'PATUXENT RIVER    ',
              'BGM'	:	'BANGOR            ',
              'BOO'	:	'BOOTHBAY HARBOR   ',
              'BWM'	:	'BRIDGEWATER       ',
              'BCK'	:	'BUCKPORT          ',
              'CLS'	:	'CALAIS   ',
              'CRB'	:	'CARIBOU           ',
              'COB'	:	'COBURN GORE       ',
              'EST'	:	'EASTCOURT         ',
              'EPT'	:	'EASTPORT MUNICIPAL',
              'EPM'	:	'EASTPORT          ',
              'FOR'	:	'FOREST CITY       ',
              'FTF'	:	'FORT FAIRFIELD    ',
              'FTK'	:	'FORT KENT         ',
              'HML'	:	'HAMIIN            ',
              'HTM'	:	'HOULTON           ',
              'JKM'	:	'JACKMAN           ',
              'KAL'	:	'KALISPEL          ',
              'LIM'	:	'LIMESTONE         ',
              'LUB'	:	'LUBEC             ',
              'MAD'	:	'MADAWASKA         ',
              'POM'	:	'PORTLAND          ',
              'RGM'	:	'RANGELEY (BPS)',
              'SBR'	:	'SOUTH BREWER      ',
              'SRL'	:	'ST AURELIE        ',
              'SPA'	:	'ST PAMPILE        ',
              'VNB'	:	'VAN BUREN         ',
              'VCB'	:	'VANCEBORO         ',
              'AGN'	:	'ALGONAC           ',
              'ALP'	:	'ALPENA            ',
              'BCY'	:	'BAY CITY          ',
              'DET'	:	'DETROIT           ',
              'GRP'	:	'GRAND RAPIDS',
              'GRO'	:	'GROSSE ISLE       ',
              'ISL'	:	'ISLE ROYALE       ',
              'MRC'	:	'MARINE CITY       ',
              'MRY'	:	'MARYSVILLE        ',
              'PTK'	:	'OAKLAND COUNTY - PONTIAC',
              'PHU'	:	'PORT HURON        ',
              'RBT'	:	'ROBERTS LANDING   ',
              'SAG'	:	'SAGINAW           ',
              'SSM'	:	'SAULT STE. MARIE  ',
              'SCL'	:	'ST CLAIR          ',
              'YIP'	:	'WILLOW RUN - YPSILANTI',
              'BAU'	:	'BAUDETTE          ',
              'CAR'	:	'CARIBOU MUNICIPAL AIRPORT',
              'GTF'	:	'Collapsed into INT',
              'INL'	:	'Collapsed into INT',
              'CRA'	:	'CRANE LAKE        ',
              'MIC'	:	'CRYSTAL MUNICIPAL AIRPORT',
              'DUL'	:	'DULUTH            ',
              'ELY'	:	'ELY               ',
              'GPM'	:	'GRAND PORTAGE     ',
              'SVC'	:	'GRANT COUNTY - SILVER CITY',
              'INT'	:	"INT'L FALLS      ",
              'LAN'	:	'LANCASTER         ',
              'MSP'	:	'MINN./ST PAUL     ',
              'LIN'	:	'NORTHERN SVC CENTER   ',
              'NOY'	:	'NOYES             ',
              'PIN'	:	'PINE CREEK        ',
              '48Y'	:	'PINECREEK BORDER ARPT',
              'RAN'	:	'RAINER            ',
              'RST'	:	'ROCHESTER',
              'ROS'	:	'ROSEAU            ',
              'SPM'	:	'ST PAUL           ',
              'WSB'	:	'WARROAD INTL',
              'WAR'	:	'WARROAD           ',
              'KAN'	:	'KANSAS CITY       ',
              'SGF'	:	'SPRINGFIELD-BRANSON',
              'STL'	:	'ST LOUIS          ',
              'WHI'	:	'WHITETAIL         ',
              'WHM'	:	'WILD HORSE        ',
              'GPT'	:	'BILOXI REGIONAL',
              'GTR'	:	'GOLDEN TRIANGLE LOWNDES CNTY',
              'GUL'	:	'GULFPORT          ',
              'PAS'	:	'PASCAGOULA        ',
              'JAN'	:	'THOMPSON FIELD - JACKSON',
              'BIL'	:	'BILLINGS          ',
              'BTM'	:	'BUTTE             ',
              'CHF'	:	'CHIEF MT          ',
              'CTB'	:	'CUT BANK MUNICIPAL',
              'CUT'	:	'CUT BANK          ',
              'DLB'	:	'DEL BONITA        ',
              'EUR'	:	'EUREKA (BPS)',
              'BZN'	:	'GALLATIN FIELD - BOZEMAN',
              'FCA'	:	'GLACIER NATIONAL PARK',
              'GGW'	:	'GLASGOW           ',
              'GRE'	:	'GREAT FALLS       ',
              'HVR'	:	'HAVRE             ',
              'HEL'	:	'HELENA            ',
              'LWT'	:	'LEWISTON          ',
              'MGM'	:	'MORGAN            ',
              'OPH'	:	'OPHEIM            ',
              'PIE'	:	'PIEGAN            ',
              'RAY'	:	'RAYMOND           ',
              'ROO'	:	'ROOSVILLE         ',
              'SCO'	:	'SCOBEY            ',
              'SWE'	:	'SWEETGTASS        ',
              'TRL'	:	'TRIAL CREEK       ',
              'TUR'	:	'TURNER            ',
              'WCM'	:	'WILLOW CREEK      ',
              'CLT'	:	'CHARLOTTE         ',
              'FAY'	:	'FAYETTEVILLE',
              'MRH'	:	'MOREHEAD CITY     ',
              'FOP'	:	'MORRIS FIELDS AAF',
              'GSO'	:	'PIEDMONT TRIAD INTL AIRPORT',
              'RDU'	:	'RALEIGH/DURHAM    ',
              'SSC'	:	'SHAW AFB - SUMTER',
              'WIL'	:	'WILMINGTON        ',
              'AMB'	:	'AMBROSE           ',
              'ANT'	:	'ANTLER            ',
              'CRY'	:	'CARBURY           ',
              'DNS'	:	'DUNSEITH          ',
              'FAR'	:	'FARGO             ',
              'FRT'	:	'FORTUNA           ',
              'GRF'	:	'GRAND FORKS       ',
              'HNN'	:	'HANNAH            ',
              'HNS'	:	'HANSBORO          ',
              'MAI'	:	'MAIDA             ',
              'MND'	:	'MINOT             ',
              'NEC'	:	'NECHE             ',
              'NOO'	:	'NOONAN            ',
              'NRG'	:	'NORTHGATE         ',
              'PEM'	:	'PEMBINA           ',
              'SAR'	:	'SARLES            ',
              'SHR'	:	'SHERWOOD          ',
              'SJO'	:	'ST JOHN           ',
              'WAL'	:	'WALHALLA          ',
              'WHO'	:	'WESTHOPE          ',
              'WND'	:	'WILLISTON         ',
              'OMA'	:	'OMAHA             ',
              'LEB'	:	'LEBANON           ',
              'MHT'	:	'MANCHESTER',
              'PNH'	:	'PITTSBURG         ',
              'PSM'	:	'PORTSMOUTH        ',
              'BYO'	:	'BAYONNE           ',
              'CNJ'	:	'CAMDEN            ',
              'HOB'	:	'HOBOKEN           ',
              'JER'	:	'JERSEY CITY       ',
              'WRI'	:	'MC GUIRE AFB - WRIGHTSOWN',
              'MMU'	:	'MORRISTOWN',
              'NEW'	:	'NEWARK/TETERBORO  ',
              'PER'	:	'PERTH AMBOY       ',
              'ACY'	:	'POMONA FIELD - ATLANTIC CITY',
              'ALA'	:	'ALAMAGORDO (BPS)',
              'ABQ'	:	'ALBUQUERQUE       ',
              'ANP'	:	'ANTELOPE WELLS    ',
              'CRL'	:	'CARLSBAD          ',
              'COL'	:	'COLUMBUS          ',
              'CDD'	:	'CRANE LAKE - ST. LOUIS CNTY',
              'DNM'	:	'DEMING (BPS)',
              'LAS'	:	'LAS CRUCES        ',
              'LOB'	:	'LORDSBURG (BPS)',
              'RUI'	:	'RUIDOSO',
              'STR'	:	'SANTA TERESA      ',
              'RNO'	:	'CANNON INTL - RENO/TAHOE',
              'FLX'	:	'FALLON MUNICIPAL AIRPORT',
              'LVG'	:	'LAS VEGAS         ',
              'REN'	:	'RENO              ',
              'ALB'	:	'ALBANY            ',
              'AXB'	:	'ALEXANDRIA BAY    ',
              'BUF'	:	'BUFFALO           ',
              'CNH'	:	'CANNON CORNERS',
              'CAP'	:	'CAPE VINCENT      ',
              'CHM'	:	'CHAMPLAIN         ',
              'CHT'	:	'CHATEAUGAY        ',
              'CLA'	:	'CLAYTON           ',
              'FTC'	:	'FORT COVINGTON    ',
              'LAG'	:	'LA GUARDIA        ',
              'LEW'	:	'LEWISTON          ',
              'MAS'	:	'MASSENA           ',
              'MAG'	:	'MCGUIRE AFB       ',
              'MOO'	:	'MOORES            ',
              'MRR'	:	'MORRISTOWN        ',
              'NYC'	:	'NEW YORK          ',
              'NIA'	:	'NIAGARA FALLS     ',
              'OGD'	:	'OGDENSBURG        ',
              'OSW'	:	'OSWEGO            ',
              'ELM'	:	'REGIONAL ARPT - HORSEHEAD',
              'ROC'	:	'ROCHESTER         ',
              'ROU'	:	'ROUSES POINT      ',
              'SWF'	:	'STEWART - ORANGE CNTY',
              'SYR'	:	'SYRACUSE          ',
              'THO'	:	'THOUSAND ISLAND BRIDGE',
              'TRO'	:	'TROUT RIVER       ',
              'WAT'	:	'WATERTOWN         ',
              'HPN'	:	'WESTCHESTER - WHITE PLAINS',
              'WRB'	:	'WHIRLPOOL BRIDGE',
              'YOU'	:	'YOUNGSTOWN        ',
              'AKR'	:	'AKRON             ',
              'ATB'	:	'ASHTABULA         ',
              'CIN'	:	'CINCINNATI        ',
              'CLE'	:	'CLEVELAND         ',
              'CLM'	:	'COLUMBUS          ',
              'LOR'	:	'LORAIN            ',
              'MBO'	:	'MARBLE HEADS      ',
              'SDY'	:	'SANDUSKY          ',
              'TOL'	:	'TOLEDO            ',
              'OKC'	:	'OKLAHOMA CITY     ',
              'TUL'	:	'TULSA',
              'AST'	:	'ASTORIA           ',
              'COO'	:	'COOS BAY          ',
              'HIO'	:	'HILLSBORO',
              'MED'	:	'MEDFORD           ',
              'NPT'	:	'NEWPORT           ',
              'POO'	:	'PORTLAND          ',
              'PUT'	:	'PUT-IN-BAY        ',
              'RDM'	:	'ROBERTS FIELDS - REDMOND',
              'ERI'	:	'ERIE              ',
              'MDT'	:	'HARRISBURG',
              'HSB'	:	'HARRISONBURG      ',
              'PHI'	:	'PHILADELPHIA      ',
              'PIT'	:	'PITTSBURG         ',
              'AGU'	:	'AGUADILLA         ',
              'BQN'	:	'BORINQUEN - AGUADILLO',
              'JCP'	:	'CULEBRA - BENJAMIN RIVERA',
              'ENS'	:	'ENSENADA          ',
              'FAJ'	:	'FAJARDO           ',
              'HUM'	:	'HUMACAO           ',
              'JOB'	:	'JOBOS             ',
              'MAY'	:	'MAYAGUEZ          ',
              'PON'	:	'PONCE             ',
              'PSE'	:	'PONCE-MERCEDITA',
              'SAJ'	:	'SAN JUAN          ',
              'VQS'	:	'VIEQUES-ARPT',
              'PRO'	:	'PROVIDENCE        ',
              'PVD'	:	'THEODORE FRANCIS - WARWICK',
              'CHL'	:	'CHARLESTON        ',
              'CAE'	:	'COLUMBIA #ARPT',
              'GEO'	:	'GEORGETOWN        ',
              'GSP'	:	'GREENVILLE',
              'GRR'	:	'GREER',
              'MYR'	:	'MYRTLE BEACH',
              'SPF'	:	'BLACK HILLS, SPEARFISH',
              'HON'	:	'HOWES REGIONAL ARPT - HURON',
              'SAI'	:	'SAIPAN, SPN           ',
              'TYS'	:	'MC GHEE TYSON - ALCOA',
              'MEM'	:	'MEMPHIS           ',
              'NSV'	:	'NASHVILLE         ',
              'TRI'	:	'TRI CITY ARPT',
              'ADS'	:	'ADDISON AIRPORT- ADDISON',
              'ADT'	:	'AMISTAD DAM       ',
              'ANZ'	:	'ANZALDUAS',
              'AUS'	:	'AUSTIN            ',
              'BEA'	:	'BEAUMONT          ',
              'BBP'	:	'BIG BEND PARK (BPS)',
              'SCC'	:	'BP SPEC COORD. CTR',
              'BTC'	:	'BP TACTICAL UNIT  ',
              'BOA'	:	'BRIDGE OF AMERICAS',
              'BRO'	:	'BROWNSVILLE       ',
              'CRP'	:	'CORPUS CHRISTI    ',
              'DAL'	:	'DALLAS            ',
              'DLR'	:	'DEL RIO           ',
              'DNA'	:	'DONNA',
              'EGP'	:	'EAGLE PASS        ',
              'ELP'	:	'EL PASO           ',
              'FAB'	:	'FABENS            ',
              'FAL'	:	'FALCON HEIGHTS    ',
              'FTH'	:	'FORT HANCOCK      ',
              'AFW'	:	'FORT WORTH ALLIANCE',
              'FPT'	:	'FREEPORT          ',
              'GAL'	:	'GALVESTON         ',
              'HLG'	:	'HARLINGEN         ',
              'HID'	:	'HIDALGO           ',
              'HOU'	:	'HOUSTON           ',
              'SGR'	:	'HULL FIELD, SUGAR LAND ARPT',
              'LLB'	:	'JUAREZ-LINCOLN BRIDGE',
              'LCB'	:	'LAREDO COLUMBIA BRIDGE',
              'LRN'	:	'LAREDO NORTH      ',
              'LAR'	:	'LAREDO            ',
              'LSE'	:	'LOS EBANOS        ',
              'IND'	:	'LOS INDIOS',
              'LOI'	:	'LOS INDIOS        ',
              'MRS'	:	'MARFA (BPS)',
              'MCA'	:	'MCALLEN           ',
              'MAF'	:	'ODESSA REGIONAL',
              'PDN'	:	'PASO DEL NORTE,TX     ',
              'PBB'	:	'PEACE BRIDGE      ',
              'PHR'	:	'PHARR             ',
              'PAR'	:	'PORT ARTHUR       ',
              'ISB'	:	'PORT ISABEL       ',
              'POE'	:	'PORT OF EL PASO   ',
              'PRE'	:	'PRESIDIO          ',
              'PGR'	:	'PROGRESO          ',
              'RIO'	:	'RIO GRANDE CITY   ',
              'ROM'	:	'ROMA              ',
              'SNA'	:	'SAN ANTONIO       ',
              'SNN'	:	'SANDERSON         ',
              'VIB'	:	'VETERAN INTL BRIDGE',
              'YSL'	:	'YSLETA            ',
              'CHA'	:	'CHARLOTTE AMALIE  ',
              'CHR'	:	'CHRISTIANSTED     ',
              'CRU'	:	'CRUZ BAY, ST JOHN ',
              'FRK'	:	'FREDERIKSTED      ',
              'STT'	:	'ST THOMAS         ',
              'LGU'	:	'CACHE AIRPORT - LOGAN',
              'SLC'	:	'SALT LAKE CITY    ',
              'CHO'	:	'ALBEMARLE CHARLOTTESVILLE',
              'DAA'	:	'DAVISON AAF - FAIRFAX CNTY',
              'HOP'	:	'HOPEWELL          ',
              'HEF'	:	'MANASSAS #ARPT',
              'NWN'	:	'NEWPORT           ',
              'NOR'	:	'NORFOLK           ',
              'RCM'	:	'RICHMOND          ',
              'ABS'	:	'ALBURG SPRINGS    ',
              'ABG'	:	'ALBURG            ',
              'BEB'	:	'BEEBE PLAIN       ',
              'BEE'	:	'BEECHER FALLS     ',
              'BRG'	:	'BURLINGTON        ',
              'CNA'	:	'CANAAN            ',
              'DER'	:	'DERBY LINE (I-91) ',
              'DLV'	:	'DERBY LINE (RT. 5)',
              'ERC'	:	'EAST RICHFORD     ',
              'HIG'	:	'HIGHGATE SPRINGS  ',
              'MOR'	:	'MORSES LINE       ',
              'NPV'	:	'NEWPORT           ',
              'NRT'	:	'NORTH TROY        ',
              'NRN'	:	'NORTON            ',
              'PIV'	:	'PINNACLE ROAD     ',
              'RIF'	:	'RICHFORT          ',
              'STA'	:	'ST ALBANS         ',
              'SWB'	:	'SWANTON (BP - SECTOR HQ)',
              'WBE'	:	'WEST BERKSHIRE    ',
              'ABE'	:	'ABERDEEN          ',
              'ANA'	:	'ANACORTES         ',
              'BEL'	:	'BELLINGHAM        ',
              'BLI'	:	'BELLINGHAMSHINGTON #INTL',
              'BLA'	:	'BLAINE            ',
              'BWA'	:	'BOUNDARY          ',
              'CUR'	:	'CURLEW (BPS)',
              'DVL'	:	'DANVILLE          ',
              'EVE'	:	'EVERETT           ',
              'FER'	:	'FERRY             ',
              'FRI'	:	'FRIDAY HARBOR     ',
              'FWA'	:	'FRONTIER          ',
              'KLM'	:	'KALAMA            ',
              'LAU'	:	'LAURIER           ',
              'LON'	:	'LONGVIEW          ',
              'MET'	:	'METALINE FALLS    ',
              'MWH'	:	'MOSES LAKE GRANT COUNTY ARPT',
              'NEA'	:	'NEAH BAY          ',
              'NIG'	:	'NIGHTHAWK         ',
              'OLY'	:	'OLYMPIA           ',
              'ORO'	:	'OROVILLE          ',
              'PWB'	:	'PASCO             ',
              'PIR'	:	'POINT ROBERTS     ',
              'PNG'	:	'PORT ANGELES      ',
              'PTO'	:	'PORT TOWNSEND     ',
              'SEA'	:	'SEATTLE           ',
              'SPO'	:	'SPOKANE           ',
              'SUM'	:	'SUMAS             ',
              'TAC'	:	'TACOMA            ',
              'PSC'	:	'TRI-CITIES - PASCO',
              'VAN'	:	'VANCOUVER         ',
              'AGM'	:	'ALGOMA            ',
              'BAY'	:	'BAYFIELD          ',
              'GRB'	:	'GREEN BAY         ',
              'MNW'	:	'MANITOWOC         ',
              'MIL'	:	'MILWAUKEE         ',
              'MSN'	:	'TRUAX FIELD - DANE COUNTY',
              'CHS'	:	'CHARLESTON        ',
              'CLK'	:	'CLARKSBURG        ',
              'BLF'	:	'MERCER COUNTY',
              'CSP'	:	'CASPER            ',
              'XXX': 'NOT REPORTED/UNKNOWN  ',
              '888': 'UNIDENTIFED AIR / SEAPORT',
              'UNK': 'UNKNOWN POE           ',
              'CLG': 'CALGARY, CANADA       ',
              'EDA': 'EDMONTON, CANADA      ',
              'YHC': 'HAKAI PASS, CANADA',
              'HAL': 'Halifax, NS, Canada   ',
              'MON': 'MONTREAL, CANADA      ',
              'OTT': 'OTTAWA, CANADA        ',
              'YXE': 'SASKATOON, CANADA',
              'TOR': 'TORONTO, CANADA       ',
              'VCV': 'VANCOUVER, CANADA     ',
              'VIC': 'VICTORIA, CANADA      ',
              'WIN': 'WINNIPEG, CANADA      ',
              'AMS': 'AMSTERDAM-SCHIPHOL, NETHERLANDS',
              'ARB': 'ARUBA, NETH ANTILLES  ',
              'BAN': 'BANKOK, THAILAND      ',
              'BEI': 'BEICA #ARPT, ETHIOPIA',
              'PEK': 'BEIJING CAPITAL INTL, PRC',
              'BDA': 'KINDLEY FIELD, BERMUDA',
              'BOG': 'BOGOTA, EL DORADO #ARPT, COLOMBIA',
              'EZE': 'BUENOS AIRES, MINISTRO PIST, ARGENTINA',
              'CUN': 'CANCUN, MEXICO',
              'CRQ': 'CARAVELAS, BA #ARPT, BRAZIL',
              'MVD': 'CARRASCO, URUGUAY',
              'DUB': 'DUBLIN, IRELAND       ',
              'FOU': 'FOUGAMOU #ARPT, GABON',
              'FBA': 'FREEPORT, BAHAMAS      ',
              'MTY': 'GEN M. ESCOBEDO, Monterrey, MX',
              'HMO': 'GEN PESQUEIRA GARCIA, MX',
              'GCM': 'GRAND CAYMAN, CAYMAN ISLAND',
              'GDL': 'GUADALAJARA, MIGUEL HIDAL, MX',
              'HAM': 'HAMILTON, BERMUDA     ',
              'ICN': 'INCHON, SEOUL KOREA',
              'IWA': 'INVALID - IWAKUNI, JAPAN',
              'CND': 'KOGALNICEANU, ROMANIA',
              'LAH': 'LABUHA ARPT, INDONESIA',
              'DUR': 'LOUIS BOTHA, SOUTH AFRICA',
              'MAL': 'MANGOLE ARPT, INDONESIA',
              'MDE': 'MEDELLIN, COLOMBIA',
              'MEX': 'JUAREZ INTL, MEXICO CITY, MX',
              'LHR': 'MIDDLESEX, ENGLAND',
              'NBO': 'NAIROBI, KENYA        ',
              'NAS': 'NASSAU, BAHAMAS       ',
              'NCA': 'NORTH CAICOS, TURK & CAIMAN',
              'PTY': 'OMAR TORRIJOS, PANAMA',
              'SPV': 'PAPUA, NEW GUINEA',
              'UIO': 'QUITO (MARISCAL SUCR), ECUADOR',
              'RIT': 'ROME, ITALY           ',
              'SNO': 'SAKON NAKHON #ARPT, THAILAND',
              'SLP': 'SAN LUIS POTOSI #ARPT, MEXICO',
              'SAN': 'SAN SALVADOR, EL SALVADOR',
              'SRO': 'SANTANA RAMOS #ARPT, COLOMBIA',
              'GRU': 'GUARULHOS INTL, SAO PAULO, BRAZIL',
              'SHA': 'SHANNON, IRELAND      ',
              'HIL': 'SHILLAVO, ETHIOPIA',
              'TOK': 'TOROKINA #ARPT, PAPUA, NEW GUINEA',
              'VER': 'VERACRUZ, MEXICO',
              'LGW': 'WEST SUSSEX, ENGLAND  ',
              'ZZZ': 'MEXICO Land (Banco de Mexico) ',
              'CHN': 'No PORT Code (CHN)',
              'CNC': 'CANNON CORNERS, NY',
              'MAA': 'Abu Dhabi',
              'AG0': 'MAGNOLIA',
              'BHM': 'BAR HARBOR',
              'BHX': 'BIRMINGHAM',
              'CAK': 'AKRON',
              'FOK': 'SUFFOLK COUNTY',
              'LND': 'LANDER',
              'MAR': 'MARFA',
              'MLI': 'MOLINE',
              'RIV': 'RIVERSIDE',
              'RME': 'ROME',
              'VNY': 'VAN NUYS',
              'YUM': 'YUMA',
              'W55': 'SEATTLE'
}

### Immigration Data by State with Origin

In [54]:
df_immigration_without_nulls = df_immigration[(df_immigration.i94addr.notnull())&(df_immigration.i94res.notnull())]
df_immigration_filtered = df_immigration_without_nulls[(df_immigration_without_nulls['i94addr'].isin(us_abbrev_state))&(df_immigration_without_nulls['i94port'].isin(city_codes))]


In [59]:
immigration_codes = {
    582: 'MEXICO Air Sea, and Not Reported (I-94, no land arrivals)',
    236:  'AFGHANISTAN',
    101:  'ALBANIA',
    316:  'ALGERIA',
    102:  'ANDORRA',
    324:  'ANGOLA',
    529:  'ANGUILLA',
    518:  'ANTIGUA-BARBUDA',
    687:  'ARGENTINA ',
    151:  'ARMENIA',
    532:  'ARUBA',
    438:  'AUSTRALIA',
    103:  'AUSTRIA',
    152:  'AZERBAIJAN',
    512:  'BAHAMAS',
    298:  'BAHRAIN',
    274:  'BANGLADESH',
    513:  'BARBADOS',
    104:  'BELGIUM',
    581:  'BELIZE',
    386:  'BENIN',
    509:  'BERMUDA',
    153:  'BELARUS',
    242:  'BHUTAN',
    688:  'BOLIVIA',
    717:  'BONAIRE, ST EUSTATIUS, SABA',
    164:  'BOSNIA-HERZEGOVINA',
    336:  'BOTSWANA',
    689:  'BRAZIL',
    525:  'BRITISH VIRGIN ISLANDS',
    217:  'BRUNEI',
    105:  'BULGARIA',
    393:  'BURKINA FASO',
    243:  'BURMA',
    375:  'BURUNDI',
    310:  'CAMEROON',
    326:  'CAPE VERDE',
    526:  'CAYMAN ISLANDS',
    383:  'CENTRAL AFRICAN REPUBLIC',
    384:  'CHAD',
    690:  'CHILE',
    245:  'CHINA, PRC',
    721:  'CURACAO',
    270:  'CHRISTMAS ISLAND',
    271:  'COCOS ISLANDS',
    691:  'COLOMBIA',
    317:  'COMOROS',
    385:  'CONGO',
    467:  'COOK ISLANDS',
    575:  'COSTA RICA',
    165:  'CROATIA',
    584:  'CUBA',
    218:  'CYPRUS',
    140:  'CZECH REPUBLIC',
    723:  'FAROE ISLANDS (PART OF DENMARK)',
    108:  'DENMARK',
    322:  'DJIBOUTI',
    519:  'DOMINICA',
    585:  'DOMINICAN REPUBLIC',
    240:  'EAST TIMOR',
    692:  'ECUADOR',
    368:  'EGYPT',
    576:  'EL SALVADOR',
    399:  'EQUATORIAL GUINEA',
    372:  'ERITREA',
    109:  'ESTONIA',
    369:  'ETHIOPIA',
    604:  'FALKLAND ISLANDS',
    413:  'FIJI',
    110:  'FINLAND',
    111:  'FRANCE',
    601:  'FRENCH GUIANA',
    411:  'FRENCH POLYNESIA',
    387:  'GABON',
    338:  'GAMBIA',
    758:  'GAZA STRIP',
    154:  'GEORGIA',
    112:  'GERMANY',
    339:  'GHANA',
    143:  'GIBRALTAR',
    113:  'GREECE',
    520:  'GRENADA',
    507:  'GUADELOUPE',
    577:  'GUATEMALA',
    382:  'GUINEA',
    327:  'GUINEA-BISSAU',
    603:  'GUYANA',
    586:  'HAITI',
    726:  'HEARD AND MCDONALD IS.',
    149:  'HOLY SEE/VATICAN',
    528:  'HONDURAS',
    206:  'HONG KONG',
    114:  'HUNGARY',
    115:  'ICELAND',
    213:  'INDIA',
    759:  'INDIAN OCEAN AREAS (FRENCH)',
    729:  'INDIAN OCEAN TERRITORY',
    204:  'INDONESIA',
    249:  'IRAN',
    250:  'IRAQ',
    116:  'IRELAND',
    251:  'ISRAEL',
    117:  'ITALY',
    388:  'IVORY COAST',
    514:  'JAMAICA',
    209:  'JAPAN',
    253:  'JORDAN',
    201:  'KAMPUCHEA',
    155:  'KAZAKHSTAN',
    340:  'KENYA',
    414:  'KIRIBATI',
    732:  'KOSOVO',
    272:  'KUWAIT',
    156:  'KYRGYZSTAN',
    203:  'LAOS',
    118:  'LATVIA',
    255:  'LEBANON',
    335:  'LESOTHO',
    370:  'LIBERIA',
    381:  'LIBYA',
    119:  'LIECHTENSTEIN',
    120:  'LITHUANIA',
    121:  'LUXEMBOURG',
    214:  'MACAU',
    167:  'MACEDONIA',
    320:  'MADAGASCAR',
    345:  'MALAWI',
    273:  'MALAYSIA',
    220:  'MALDIVES',
    392:  'MALI',
    145:  'MALTA',
    472:  'MARSHALL ISLANDS',
    511:  'MARTINIQUE',
    389:  'MAURITANIA',
    342:  'MAURITIUS',
    760:  'MAYOTTE (AFRICA - FRENCH)',
    473:  'MICRONESIA, FED. STATES OF',
    157:  'MOLDOVA',
    122:  'MONACO',
    299:  'MONGOLIA',
    735:  'MONTENEGRO',
    521:  'MONTSERRAT',
    332:  'MOROCCO',
    329:  'MOZAMBIQUE',
    371:  'NAMIBIA',
    440:  'NAURU',
    257:  'NEPAL',
    123:  'NETHERLANDS',
    508:  'NETHERLANDS ANTILLES',
    409:  'NEW CALEDONIA',
    464:  'NEW ZEALAND',
    579:  'NICARAGUA',
    390:  'NIGER',
    343:  'NIGERIA',
    470:  'NIUE',
    275:  'NORTH KOREA',
    124:  'NORWAY',
    256:  'OMAN',
    258:  'PAKISTAN',
    474:  'PALAU',
    743:  'PALESTINE',
    504:  'PANAMA',
    441:  'PAPUA NEW GUINEA',
    693:  'PARAGUAY',
    694:  'PERU',
    260:  'PHILIPPINES',
    416:  'PITCAIRN ISLANDS',
    107:  'POLAND',
    126:  'PORTUGAL',
    297:  'QATAR',
    748:  'REPUBLIC OF SOUTH SUDAN',
    321:  'REUNION',
    127:  'ROMANIA',
    158:  'RUSSIA',
    376:  'RWANDA',
    128:  'SAN MARINO',
    330:  'SAO TOME AND PRINCIPE',
    261:  'SAUDI ARABIA',
    391:  'SENEGAL',
    142:  'SERBIA AND MONTENEGRO',
    745:  'SERBIA',
    347:  'SEYCHELLES',
    348:  'SIERRA LEONE',
    207:  'SINGAPORE',
    141:  'SLOVAKIA',
    166:  'SLOVENIA',
    412:  'SOLOMON ISLANDS',
    397:  'SOMALIA',
    373:  'SOUTH AFRICA',
    276:  'SOUTH KOREA',
    129:  'SPAIN',
    244:  'SRI LANKA',
    346:  'ST. HELENA',
    522:  'ST. KITTS-NEVIS',
    523:  'ST. LUCIA',
    502:  'ST. PIERRE AND MIQUELON',
    524:  'ST. VINCENT-GRENADINES',
    716:  'SAINT BARTHELEMY',
    736:  'SAINT MARTIN',
    749:  'SAINT MAARTEN',
    350:  'SUDAN',
    602:  'SURINAME',
    351:  'SWAZILAND',
    130:  'SWEDEN',
    131:  'SWITZERLAND',
    262:  'SYRIA',
    268:  'TAIWAN',
    159:  'TAJIKISTAN',
    353:  'TANZANIA',
    263:  'THAILAND',
    304:  'TOGO',
    417:  'TONGA',
    516:  'TRINIDAD AND TOBAGO',
    323:  'TUNISIA',
    264:  'TURKEY',
    161:  'TURKMENISTAN',
    527:  'TURKS AND CAICOS ISLANDS',
    420:  'TUVALU',
    352:  'UGANDA',
    162:  'UKRAINE',
    296:  'UNITED ARAB EMIRATES',
    135:  'UNITED KINGDOM',
    695:  'URUGUAY',
    163:  'UZBEKISTAN',
    410:  'VANUATU',
    696:  'VENEZUELA',
    266:  'VIETNAM',
    469:  'WALLIS AND FUTUNA ISLANDS',
    757:  'WEST INDIES (FRENCH)',
    333:  'WESTERN SAHARA',
    465:  'WESTERN SAMOA',
    216:  'YEMEN',
    139:  'YUGOSLAVIA',
    301:  'ZAIRE',
    344:  'ZAMBIA',
    315:  'ZIMBABWE',
    403:  'INVALID: AMERICAN SAMOA',
    712:  'INVALID: ANTARCTICA',
    700:  'INVALID: BORN ON BOARD SHIP',
    719:  'INVALID: BOUVET ISLAND (ANTARCTICA/NORWAY TERR.)',
    574:  'INVALID: CANADA',
    720:  'INVALID: CANTON AND ENDERBURY ISLS',
    106:  'INVALID: CZECHOSLOVAKIA',
    739:  'INVALID: DRONNING MAUD LAND (ANTARCTICA-NORWAY)',
    394:  'INVALID: FRENCH SOUTHERN AND ANTARCTIC',
    501:  'INVALID: GREENLAND',
    404:  'INVALID: GUAM',
    730:  'INVALID: INTERNATIONAL WATERS',
    731:  'INVALID: JOHNSON ISLAND',
    471:  'INVALID: MARIANA ISLANDS, NORTHERN',
    737:  'INVALID: MIDWAY ISLANDS',
    753:  'INVALID: MINOR OUTLYING ISLANDS - USA',
    740:  'INVALID: NEUTRAL ZONE (S. ARABIA/IRAQ)',
    710:  'INVALID: NON-QUOTA IMMIGRANT',
    505:  'INVALID: PUERTO RICO',
    0:  'INVALID: STATELESS',
    705:  'INVALID: STATELESS',
    583:  'INVALID: UNITED STATES',
    407:  'INVALID: UNITED STATES',
    999:  'INVALID: UNKNOWN',
    239:  'INVALID: UNKNOWN COUNTRY',
    134:  'INVALID: USSR',
    506:  'INVALID: U.S. VIRGIN ISLANDS',
    755:  'INVALID: WAKE ISLAND',
    311:  'Collapsed Tanzania (should not show)',
    741:  'Collapsed Curacao (should not show)',
    54:  'No Country Code (54)',
    100:  'No Country Code (100)',
    187:  'No Country Code (187)',
    190:  'No Country Code (190)',
    200:  'No Country Code (200)',
    219:  'No Country Code (219)',
    238:  'No Country Code (238)',
    277:  'No Country Code (277)',
    293:  'No Country Code (293)',
    300:  'No Country Code (300)',
    319:  'No Country Code (319)',
    365:  'No Country Code (365)',
    395:  'No Country Code (395)',
    400:  'No Country Code (400)',
    485:  'No Country Code (485)',
    503:  'No Country Code (503)',
    589:  'No Country Code (589)',
    592:  'No Country Code (592)',
    791:  'No Country Code (791)',
    849:  'No Country Code (849)',
    914:  'No Country Code (914)',
    944:  'No Country Code (944)',
    996:  'No Country Code (996)',
}

In [72]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
df_immigration_filtered = df_immigration_filtered.copy()
# Translate codes to readable names; i94res is stored as a float, so cast it before the lookup.
# Codes missing from a lookup dictionary become NaN instead of raising KeyError.
df_immigration_filtered['origin_country'] = df_immigration_filtered['i94res'].astype(int).map(immigration_codes)
df_immigration_filtered['dest_state_name'] = df_immigration_filtered['i94addr'].map(us_abbrev_state)
df_immigration_filtered['city_port_name'] = df_immigration_filtered['i94port'].map(city_codes)
# Cast float-typed year, month, and record id columns to integers
df_immigration_filtered['i94yr'] = df_immigration_filtered['i94yr'].astype(int)
df_immigration_filtered['i94mon'] = df_immigration_filtered['i94mon'].astype(int)
df_immigration_filtered['cicid'] = df_immigration_filtered['cicid'].astype(int)


In [67]:
df_immigration_filtered

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,dtaddto,gender,insnum,airline,admnum,fltno,visatype,dest_state_name,origin_country,city_port_name
0,2027561,4084316.0,2016,4,209.0,209.0,HHW,20566.0,1.0,HI,...,07202016,F,,JL,5.658267e+10,00782,WT,Hawaii,JAPAN,HONOLULU
1,2171295,4422636.0,2016,4,582.0,582.0,MCA,20567.0,1.0,TX,...,10222016,M,,*GA,9.436200e+10,XBLNG,B2,Texas,"MEXICO Air Sea, and Not Reported (I-94, no lan...",MCALLEN
2,589494,1195600.0,2016,4,148.0,112.0,OGG,20551.0,1.0,FL,...,07052016,M,,LH,5.578047e+10,00464,WT,Florida,GERMANY,KAHULUI - MAUI
3,2631158,5291768.0,2016,4,297.0,297.0,LOS,20572.0,1.0,CA,...,10272016,M,,QR,9.478970e+10,00739,B2,California,QATAR,LOS ANGELES
4,3032257,985523.0,2016,4,111.0,111.0,CHM,20550.0,3.0,NY,...,07042016,F,,,4.232257e+10,LAND,WT,New York,FRANCE,CHAMPLAIN
5,721257,1481650.0,2016,4,577.0,577.0,ATL,20552.0,1.0,GA,...,10072016,M,,DL,7.368526e+08,910,B2,Georgia (State),GUATEMALA,ATLANTA
6,1072780,2197173.0,2016,4,245.0,245.0,SFR,20556.0,1.0,CA,...,10112016,F,,CX,7.863122e+08,870,B2,California,"CHINA, PRC",SAN FRANCISCO
7,112205,232708.0,2016,4,113.0,135.0,NYC,20546.0,1.0,NY,...,06302016,F,,BA,5.547449e+10,00117,WT,New York,UNITED KINGDOM,NEW YORK
8,2577162,5227851.0,2016,4,131.0,131.0,CHI,20572.0,1.0,IL,...,07262016,,,LX,5.941342e+10,00008,WT,Illinois,SWITZERLAND,CHICAGO
9,10930,13213.0,2016,4,116.0,116.0,LOS,20545.0,1.0,CA,...,06292016,,,AA,5.544979e+10,00109,WT,California,IRELAND,LOS ANGELES


In [68]:
# .assign returns a new DataFrame, so no SettingWithCopyWarning is raised
df_immigration_filtered = df_immigration_filtered.assign(
    year=df_immigration_filtered["i94yr"],
    month=df_immigration_filtered["i94mon"],
    state_code=df_immigration_filtered["i94addr"],
)


In [70]:
new_I94_Data = df_immigration_filtered[['cicid','year','month','origin_country','i94port','city_port_name','state_code','dest_state_name']]

In [71]:
new_I94_Data.head()

Unnamed: 0,cicid,year,month,origin_country,i94port,city_port_name,state_code,dest_state_name
0,4084316.0,2016,4,JAPAN,HHW,HONOLULU,HI,Hawaii
1,4422636.0,2016,4,"MEXICO Air Sea, and Not Reported (I-94, no lan...",MCA,MCALLEN,TX,Texas
2,1195600.0,2016,4,GERMANY,OGG,KAHULUI - MAUI,FL,Florida
3,5291768.0,2016,4,QATAR,LOS,LOS ANGELES,CA,California
4,985523.0,2016,4,FRANCE,CHM,CHAMPLAIN,NY,New York
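
With the trimmed table in place, arrivals can be counted per destination state and origin country. The sketch below is one possible way to do that, not necessarily the aggregation the final model will use. The small `sample` frame stands in for `new_I94_Data`, and the helper name `count_arrivals` and the output column `arrivals` are assumptions made for illustration.

```python
import pandas as pd


def count_arrivals(i94: pd.DataFrame) -> pd.DataFrame:
    """Count rows per (state_code, dest_state_name, origin_country)."""
    return (
        i94.groupby(["state_code", "dest_state_name", "origin_country"])
        .size()
        .reset_index(name="arrivals")
        .sort_values("arrivals", ascending=False)
        .reset_index(drop=True)
    )


# Stand-in rows shaped like new_I94_Data (values are illustrative only)
sample = pd.DataFrame({
    "cicid": [1, 2, 3],
    "state_code": ["CA", "CA", "NY"],
    "dest_state_name": ["California", "California", "New York"],
    "origin_country": ["QATAR", "QATAR", "FRANCE"],
})

arrivals_by_state = count_arrivals(sample)
print(arrivals_by_state)
```

Grouping on both `state_code` and `dest_state_name` keeps the readable name next to the code without a second lookup.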


### U.S. Demographic Data by State

### U.S. Airport Data by State 

In [None]:
# Spark version of the airport transformation (not executed in this notebook).
# `airport` is assumed to be a Spark DataFrame holding the airport codes data.
from pyspark.sql import functions as F

airport_data = (
    airport.filter(F.col("type") == "small_airport")
    .filter(F.col("iso_country") == "US")
    # iso_region looks like "US-PA"; keep only the two-letter state code
    .withColumn("iso_region", F.substring(F.col("iso_region"), 4, 2))
    .withColumn("elevation_ft", F.col("elevation_ft").cast("float"))
)

# Find average elevation per state
airport_data_elevation = airport_data.groupBy("iso_country", "iso_region").avg("elevation_ft")

# Select and rename the relevant columns, ordered by state
new_airport_data = airport_data_elevation.select(
    F.col("iso_country").alias("country"),
    F.col("iso_region").alias("state"),
    F.round(F.col("avg(elevation_ft)"), 1).alias("avg_elevation_ft"),
).orderBy("state")

In [85]:
df_airport.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [101]:
# Keep US airports of the three regular airport types; copy so later column assignments are safe
us_df_airport = df_airport[df_airport["iso_country"] == "US"]
us_df_airport = us_df_airport[us_df_airport["type"].isin(["small_airport", "medium_airport", "large_airport"])].copy()

In [102]:
us_df_airport["elevation_ft"] = us_df_airport["elevation_ft"].astype(float)
# iso_region looks like "US-PA"; the state code is the part after the hyphen
us_df_airport["state"] = us_df_airport["iso_region"].str.split("-").str[-1]
# coordinates are stored as "longitude, latitude"
us_df_airport["x_coordinate"] = us_df_airport["coordinates"].str.split(",").str[0].astype(float)
us_df_airport["y_coordinate"] = us_df_airport["coordinates"].str.split(",").str[-1].astype(float)

In [111]:
us_df_airport["country"] = us_df_airport["iso_country"]
us_df_airport = us_df_airport[["ident","type","name","elevation_ft","country","state","municipality","x_coordinate","y_coordinate"]]

In [113]:
us_df_airport

Unnamed: 0,ident,type,name,elevation_ft,country,state,municipality,x_coordinate,y_coordinate
1,00AA,small_airport,Aero B Ranch Airport,3435.0,US,KS,Leoti,-101.473911,38.704022
2,00AK,small_airport,Lowell Field,450.0,US,AK,Anchor Point,-151.695999,59.949200
3,00AL,small_airport,Epps Airpark,820.0,US,AL,Harvest,-86.770302,34.864799
5,00AS,small_airport,Fulton Airport,1100.0,US,OK,Alex,-97.818019,34.942803
6,00AZ,small_airport,Cordes Airport,3810.0,US,AZ,Cordes,-112.165001,34.305599
7,00CA,small_airport,Goldstone /Gts/ Airport,3038.0,US,CA,Barstow,-116.888000,35.350498
8,00CL,small_airport,Williams Ag Airport,87.0,US,CA,Biggs,-121.763427,39.427188
11,00FA,small_airport,Grass Patch Airport,53.0,US,FL,Bushnell,-82.219002,28.645500
13,00FL,small_airport,River Oak Airport,35.0,US,FL,Okeechobee,-80.969200,27.230900
14,00GA,small_airport,Lt World Airport,700.0,US,GA,Lithonia,-84.068298,33.767502


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model.

#### 3.2 Mapping Out Data Pipelines
List the steps needed to load the data into the chosen data model.

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
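
The checks listed above can be sketched in pandas as follows. This is a minimal illustration, not the project's final test suite: the function name `run_quality_checks` and its parameters are assumptions, and the sample frame only mimics the `cicid` and `state_code` columns of the immigration table.

```python
import pandas as pd


def run_quality_checks(df: pd.DataFrame, key: str, required: list) -> list:
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    # Count check: the table should not be empty
    if df.empty:
        failures.append("table has no rows")
    # Integrity check: the key column must be unique
    if df[key].duplicated().any():
        failures.append(f"duplicate values in key column '{key}'")
    # Completeness check: required columns must not contain nulls
    for column in required:
        if df[column].isnull().any():
            failures.append(f"null values in required column '{column}'")
    return failures


# Illustrative frame with one duplicate key and one missing state code
sample = pd.DataFrame({"cicid": [1, 2, 2], "state_code": ["CA", None, "NY"]})
problems = run_quality_checks(sample, key="cicid", required=["state_code"])
print(problems)
```

Returning messages instead of raising on the first failure lets a single run report every problem in a table.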
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
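
One lightweight option is to keep the data dictionary as a Python mapping next to the code and verify that every column is documented. The sketch below covers the columns of `new_I94_Data`. The descriptions are drafted from the transformations in this notebook, `cicid` is treated here as the record identifier, and the helper `undocumented_columns` is a hypothetical name.

```python
# Draft data dictionary for the immigration table (descriptions are assumptions
# based on how each column is derived in this notebook)
data_dictionary = {
    "cicid": "Record identifier from the immigration data, cast to int",
    "year": "Arrival year, copied from i94yr",
    "month": "Arrival month, copied from i94mon",
    "origin_country": "Country name looked up from i94res via immigration_codes",
    "i94port": "Port-of-entry code from the immigration data",
    "city_port_name": "Port name looked up from i94port via city_codes",
    "state_code": "Destination state abbreviation, copied from i94addr",
    "dest_state_name": "State name looked up from i94addr via us_abbrev_state",
}


def undocumented_columns(columns, dictionary):
    """Return the columns that have no entry in the data dictionary."""
    return [column for column in columns if column not in dictionary]


# Column list of new_I94_Data as selected earlier in the notebook
table_columns = [
    "cicid", "year", "month", "origin_country",
    "i94port", "city_port_name", "state_code", "dest_state_name",
]
missing = undocumented_columns(table_columns, data_dictionary)
print(missing)
```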

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated by 7am every day.
 * The database needed to be accessed by 100+ people.