# Project Title
### Data Engineering Capstone Project

#### Project Summary
- This project combines four data sets: immigration data, airport codes, demographics of US cities and global temperature data. The primary purpose of the combination is to create a schema that can be used to derive correlations, trends and other analytics. For example, one could examine whether the average temperature of a migrant's country of residence correlates with their choice of US state, and what the current demographic makeup of that state is.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# Do all imports and installs here
! pip install -U numpy
! pip install missingno

In [3]:
# Standard library
import os
import configparser
import datetime
import datetime as dt
import time

# Third-party
import pandas as pd
import matplotlib.pyplot as plt
import requests
from pyspark.sql import SparkSession, SQLContext, GroupedData, HiveContext, Row
from pyspark.sql import functions as F
# The star imports are kept because later cells use names such as `upper`
# and `StructType` directly. Note that they shadow some Python built-ins.
from pyspark.sql.functions import *
from pyspark.sql.functions import date_add as d_add
from pyspark.sql.types import *

# Project helpers
import tools
import create_tables as ct

# Silence urllib3 warnings raised through requests
requests.packages.urllib3.disable_warnings()

In [5]:
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .config("spark.python.worker.memory", "15g")
    .enableHiveSupport()
    .getOrCreate()
)

df = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [6]:
# Write to parquet (optional; uncomment to cache the SAS data locally)
# df.write.parquet("sas_data")
# df = spark.read.parquet("sas_data")

### Step 1: Scope the Project and Gather Data

#### Scope 
##### Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

- In this project I will gather the data from four sources. I will load this data into staging dataframes. I will clean the raw data, write it to parquet files and perform an ETL process using a Spark cluster. Then I will write the data into Fact & Dimension tables to form a star schema. The star schema can then be used by the relevant parties to perform data analytics, correlation and ad-hoc reporting in an effective and efficient manner.

#### Describe and Gather Data 
##### Describe the data sets you're using. Where did it come from? What type of information is included? 

- i94 Immigration Sample Data: Sample data of immigration records from the US National Travel and Tourism Office. This data source will serve as the Fact table in the schema. This data comes from https://travel.trade.gov/research/reports/i94/historical/2016.html.
- World Temperature Data: This dataset contains temperature data for various cities from the 1700s to 2013. Although the data is only recorded until 2013, we can use it as an approximate gauge of temperature in 2016, the year of the immigration data. This data comes from https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data.
- US City Demographic Data: Data about the demographics of US cities. This dataset includes population information for US cities, such as race, household size and gender. This data comes from https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/.
- Airport Codes: This table contains the airport codes for the airports in corresponding cities. This data comes from https://datahub.io/core/airport-codes#data.

##### TEMPERATURE DATA

In [5]:
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
temperature_df = pd.read_csv(fname)

In [79]:
temperature_df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


##### Data Dictionary

Feature                       |Description
:-----------------------------|:-----------
dt                            |Date
AverageTemperature            |Average temperature in celsius
AverageTemperatureUncertainty |95% confidence interval around average temperature
City                          |Name of city
Country                       |Name of country
Latitude                      |Latitude of city
Longitude                     |Longitude of city
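
Latitude and Longitude are stored as strings with a hemisphere suffix (for example `57.05N`, `10.33E`). If numeric coordinates are needed later, a conversion along these lines may help. `parse_coordinate` is a hypothetical helper, not part of the notebook's `tools` module, and the sign convention (south and west negative) is an assumption.

```python
def parse_coordinate(value: str) -> float:
    """Convert a string such as '57.05N' or '10.33E' to signed decimal degrees."""
    value = value.strip()
    magnitude, hemisphere = float(value[:-1]), value[-1].upper()
    if hemisphere not in "NSEW":
        raise ValueError(f"Unexpected hemisphere suffix in {value!r}")
    # Assumed convention: south and west are negative.
    return -magnitude if hemisphere in "SW" else magnitude


print(parse_coordinate("57.05N"), parse_coordinate("10.33E"))
```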

##### AIRPORT CODES

In [6]:
airport_codes = 'airport-codes_csv.csv'
airport_df = pd.read_csv(airport_codes)

In [77]:
airport_df.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"



##### Data Dictionary

Feature       |Description
:-------------|:-----------
ident         |Unique identifier
type          |Airport type
name          |Airport name
elevation_ft  |Airport elevation in feet
continent     |Continent
iso_country   |ISO code of the airport's country
iso_region    |ISO code of the airport's region
municipality  |City/municipality where the airport is located
gps_code      |Airport GPS code
iata_code     |Airport IATA code
local_code    |Airport local code
coordinates   |Airport coordinates as a "longitude, latitude" string
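
The `coordinates` column packs both values into one string. Judging by the sample rows (about -74.93, 40.07 for Bensalem, Pennsylvania), the order appears to be longitude first, then latitude. A small sketch of splitting it follows, with `split_coordinates` as a hypothetical helper name.

```python
from typing import Tuple


def split_coordinates(coordinates: str) -> Tuple[float, float]:
    """Split a 'lon, lat' string into (latitude, longitude) floats."""
    # Assumption: the source string is ordered longitude, latitude.
    longitude, latitude = (float(part) for part in coordinates.split(","))
    return latitude, longitude


lat, lon = split_coordinates("-74.93360137939453, 40.07080078125")
print(lat, lon)
```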

##### IMMIGRATION DATA

In [7]:
immigration_data = 'immigration_data_sample.csv'
immigration_df = pd.read_csv(immigration_data)

In [74]:
immigration_df.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,,G,O,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,,G,O,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,,Z,K,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


##### Data Dictionary

Feature  |Description
:--------|:-----------
cicid    |Unique ID
i94yr    |year
i94mon   |month
i94cit   |3 digit code for immigrant country of birth
i94res   |3 digit code for immigrant country of residence
i94port  |Port of admission
arrdate  |Arrival Date in the USA
i94mode  |Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)
i94addr  |USA State of arrival
depdate  |Departure Date from the USA
i94bir   |Age of Respondent in Years
i94visa  |Visa codes collapsed into three categories
count    |Field used for summary statistics
dtadfile |Character Date Field - Date added to I-94 Files
visapost |Department of State post where the visa was issued
occup    |Occupation that will be performed in the U.S.
entdepa  |Arrival Flag - admitted or paroled into the U.S.
entdepd  |Departure Flag - Departed, lost I-94 or is deceased
entdepu  |Update Flag - Either apprehended, overstayed, adjusted to perm residence
matflag  |Match flag - Match of arrival and departure records
biryear  |4 digit year of birth
dtaddto  |Character Date Field - Date to which admitted to U.S. (allowed to stay until)
gender   |Non-immigrant sex
insnum   |INS number
airline  |Airline used to arrive in U.S.
admnum   |Admission Number
fltno    |Flight number of Airline used to arrive in U.S.
visatype |Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
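
Several columns are numeric codes rather than labels. As an example, `i94mode` can be decoded with a small lookup built from the table above. The helper name is hypothetical, and treating missing values as "Not reported" is an assumption.

```python
I94_MODE = {1: "Air", 2: "Sea", 3: "Land", 9: "Not reported"}


def decode_i94mode(code) -> str:
    """Map a numeric i94mode value (possibly a float such as 1.0) to its label."""
    if code is None or code != code:  # None or NaN
        return "Not reported"  # assumption: missing is treated as not reported
    return I94_MODE.get(int(code), "Unknown")


print([decode_i94mode(c) for c in (1.0, 3.0, 9.0)])
```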

###### Immigration Country Mapping

##### US CITIES DEMOGRAPHICS

In [8]:
us_cities_demographics = 'us-cities-demographics.csv'
demographics_df = spark.read.csv(us_cities_demographics, inferSchema=True, header=True, sep=';')

In [19]:
demographics_df.limit(5).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ,White,76402


##### Data Dictionary

Feature                |Description
:----------------------|:-----------
City                   |City name
State                  |US state of the city
Median Age             |Median age of the city's population
Male Population        |Male population total
Female Population      |Female population total
Total Population       |Total population
Number of Veterans     |Number of veterans living in the city
Foreign-born           |Number of residents born outside the United States
Average Household Size |Average number of people per household
State Code             |Two-letter code of the state
Race                   |Race category
Count                  |Number of residents of that race in the city

### Step 2: Explore and Assess the Data

 - Please refer to the "Explore & Assess Data" notebook for data exploration and analysis
 

##### Data Cleaning Steps Required:
- Drop columns containing over 90% missing values
- Drop duplicate rows
- Keep only temperature records dated 2010-01-01 or later
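
The `tools` module is not shown here, so the following is only a sketch of what dropping sparse columns and duplicate rows could look like in pandas. The function names, the `threshold` parameter and the printed message are assumptions modelled on the notebook output, not the actual implementation.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Return a copy of df without columns whose share of missing values exceeds threshold."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    return df.loc[:, missing_share <= threshold].copy()


def drop_duplicates_logged(df: pd.DataFrame) -> pd.DataFrame:
    """Drop fully duplicated rows and report how many were removed."""
    deduped = df.drop_duplicates()
    print(f"{len(df) - len(deduped)} rows dropped.")
    return deduped


# Toy frame: 'occup' is entirely missing, and the last two rows are identical.
sample = pd.DataFrame({
    "cicid": [1.0, 2.0, 2.0],
    "occup": [None, None, None],
    "gender": ["F", "M", "M"],
})
cleaned = drop_duplicates_logged(drop_sparse_columns(sample))
print(list(cleaned.columns), len(cleaned))
```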


In [9]:
# Drop columns with over 90% missing values
clean_temperature = tools.eliminate_missing_data(temperature_df)

Dropping missing data...
Cleaning complete!


In [10]:
clean_temperature = tools.drop_duplicate_rows(clean_temperature)

Dropping duplicate rows...
0 rows dropped.


In [11]:
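# Keep only recent temperature records. The dataset ends in 2013, so the
# upper bound is effectively open. ISO-formatted date strings (YYYY-MM-DD)
# compare correctly as plain strings.
start_date = "2010-01-01"
end_date = "2020-01-01"

after_start_date = clean_temperature["dt"] >= start_date
before_end_date = clean_temperature["dt"] <= end_date
between_two_dates = after_start_date & before_end_date
clean_temperature = clean_temperature.loc[between_two_dates]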

In [12]:
clean_temperature.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
3194,2010-01-01,-3.799,0.24,Århus,Denmark,57.05N,10.33E
3195,2010-02-01,-2.691,0.272,Århus,Denmark,57.05N,10.33E
3196,2010-03-01,2.429,0.427,Århus,Denmark,57.05N,10.33E
3197,2010-04-01,7.123,0.234,Århus,Denmark,57.05N,10.33E
3198,2010-05-01,10.657,0.314,Århus,Denmark,57.05N,10.33E


In [12]:
# Drop columns with over 90% missing values
clean_airport_codes = tools.eliminate_missing_data(airport_df)

Dropping missing data...
Cleaning complete!


In [13]:
clean_airport_codes = tools.drop_duplicate_rows(clean_airport_codes)

Dropping duplicate rows...
0 rows dropped.


In [15]:
clean_airport_codes.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [14]:
# Drop columns with over 90% missing values
clean_immigration = tools.eliminate_missing_data(immigration_df)

Dropping missing data...
Cleaning complete!


In [15]:
clean_immigration = tools.drop_duplicate_rows(clean_immigration)

Dropping duplicate rows...
0 rows dropped.


In [18]:
clean_immigration.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,entdepa,entdepd,matflag,biryear,dtaddto,gender,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,G,O,M,1955.0,7202016,F,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,G,R,M,1990.0,10222016,M,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,G,O,M,1940.0,7052016,M,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,G,O,M,1991.0,10272016,M,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,Z,K,M,1997.0,7042016,F,,42322570000.0,LAND,WT


In [16]:
# Drop columns with over 90% missing values
clean_demographics = tools.eliminate_missing_data(demographics_df.toPandas())

Dropping missing data...
Cleaning complete!


In [17]:
clean_demographics = tools.drop_duplicate_rows(clean_demographics)

Dropping duplicate rows...
0 rows dropped.


In [21]:
clean_demographics.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
##### Map out the conceptual data model and explain why you chose that model:

In accordance with Kimball Dimensional Modelling Techniques, laid out in this document 
(http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf), 
the following modelling steps have been taken:

- 1. Select the Business Process:
    - The business process is the admission of migrants into the country by the immigration department. Each admission generates an event, which is captured as a row in the fact table.

- 2. Declare the Grain:
    - The grain identifies exactly what is represented in a single fact table row.
    - In this project, the grain is declared as a single occurrence of a migrant entering the USA.

- 3. Identify the Dimensions:
    - Dimension tables provide context around an event or business process.
    - The dimensions identified in this project are:
        - dim_migrant
        - dim_status
        - dim_visa
        - dim_temperature
        - dim_country
        - dim_state
        - dim_time
        - dim_airport
        

- 4. Identify the Facts:
    - Fact tables focus on the occurrences of a singular business process, and have a one-to-one relationship with the events described in the grain.
    - The fact table identified in this project is:
        - fact_immigration
    
For this application, I have developed a set of Fact and Dimension tables, written to parquet files with Spark, that together form a Star Schema.
This Star Schema can be used by Data Analysts and other business professionals to gain deeper insight into historical immigration figures, trends and statistics.
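
To illustrate the kind of question a star schema is meant to answer, here is a toy query that counts arrivals per state by joining the fact table to `dim_state`. It uses Python's built-in `sqlite3` purely so the example is self-contained; the project itself writes its tables to parquet and reads them with Spark. The rows are made up for illustration, and only the column names follow the tables built in Step 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_state (state_code TEXT PRIMARY KEY, state TEXT);
    CREATE TABLE fact_immigration (cicid INTEGER PRIMARY KEY, state_code TEXT, arrdate INTEGER);
""")

# Illustrative rows only, not taken from the real data.
conn.executemany("INSERT INTO dim_state VALUES (?, ?)",
                 [("CA", "California"), ("NY", "New York")])
conn.executemany("INSERT INTO fact_immigration VALUES (?, ?, ?)",
                 [(1, "CA", 20566), (2, "CA", 20567), (3, "NY", 20550)])

# One fact row per arrival, so COUNT(*) per dimension value gives arrivals per state.
arrivals_by_state = conn.execute("""
    SELECT s.state, COUNT(*) AS arrivals
    FROM fact_immigration AS f
    JOIN dim_state AS s ON s.state_code = f.state_code
    GROUP BY s.state
    ORDER BY arrivals DESC
""").fetchall()
print(arrivals_by_state)
```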

![alt_text](./Conceptual_Data_Model_5.png)

#### 3.2 Mapping Out Data Pipelines
##### List the steps necessary to pipeline the data into the chosen data model:

- 1. Load the data into staging tables
- 2. Create Dimension tables
- 3. Create Fact table
- 4. Write data into parquet files
- 5. Perform data quality checks
    

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [18]:
output_path = "tables/"

In [88]:
clean_immigration.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,entdepa,entdepd,matflag,biryear,dtaddto,gender,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,G,O,M,1955.0,7202016,F,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,G,R,M,1990.0,10222016,M,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,G,O,M,1940.0,7052016,M,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,G,O,M,1991.0,10272016,M,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,Z,K,M,1997.0,7042016,F,,42322570000.0,LAND,WT


In [19]:
# Build a Spark DataFrame from clean_immigration using an explicit schema

# create schema
immigration_schema = StructType([StructField("0", IntegerType(), True)\
                          ,StructField("cicid", FloatType(), True)\
                          ,StructField("i94yr", FloatType(), True)\
                          ,StructField("i94mon", FloatType(), True)\
                          ,StructField("i94cit", FloatType(), True)\
                          ,StructField("i94res", FloatType(), True)\
                          ,StructField("i94port", StringType(), True)\
                          ,StructField("arrdate", FloatType(), True)\
                          ,StructField("i94mode", FloatType(), True)\
                          ,StructField("i94addr", StringType(), True)\
                          ,StructField("depdate", FloatType(), True)\
                          ,StructField("i94bir", FloatType(), True)\
                          ,StructField("i94visa", FloatType(), True)\
                          ,StructField("count", FloatType(), True)\
                          ,StructField("dtadfile", StringType(), True)\
                          ,StructField("visapost", StringType(), True)\
                          ,StructField("entdepa", StringType(), True)\
                          ,StructField("entdepd", StringType(), True)\
                          ,StructField("matflag", StringType(), True)\
                          ,StructField("biryear", FloatType(), True)\
                          ,StructField("dtaddto", StringType(), True)\
                          ,StructField("gender", StringType(), True)\
                          ,StructField("airline", StringType(), True)\
                          ,StructField("admnum", FloatType(), True)\
                          ,StructField("fltno", StringType(), True)\
                          ,StructField("visatype", StringType(), True)])

immigration_spark = spark.createDataFrame(clean_immigration, schema=immigration_schema)

immigration_spark.toPandas().head()

Unnamed: 0,0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,entdepa,entdepd,matflag,biryear,dtaddto,gender,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,G,O,M,1955.0,7202016,F,JL,56582680000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,G,R,M,1990.0,10222016,M,*GA,94361990000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,G,O,M,1940.0,7052016,M,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,G,O,M,1991.0,10272016,M,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,Z,K,M,1997.0,7042016,F,,42322570000.0,LAND,WT
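
Note that `admnum` differs slightly between the `clean_immigration` preview and the Spark preview above (for example 56582670000.0 versus 56582680000.0). A likely cause is that Spark's `FloatType` is a 32-bit float, which keeps only about seven significant digits, while pandas uses 64-bit floats. The snippet below reproduces the effect with the standard library; the sample admission number is made up. If exact values matter, `DoubleType` may be a safer choice for `admnum`.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float (64-bit) through a 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


admnum = 56582674633.0            # hypothetical 11-digit admission number
print(as_float32(admnum))         # no longer equal to the original value
print(as_float32(33.8))           # small values pick up rounding noise too
```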


In [20]:
# create schema
temperature_schema = StructType([StructField("dt", StringType(), True)\
                          ,StructField("AverageTemperature", FloatType(), True)\
                          ,StructField("AverageTemperatureUncertainty", FloatType(), True)\
                          ,StructField("City", StringType(), True)\
                          ,StructField("Country", StringType(), True)\
                          ,StructField("Latitude", StringType(), True)\
                          ,StructField("Longitude", StringType(), True)])

temperature_spark = spark.createDataFrame(clean_temperature, schema=temperature_schema)

temperature_spark.toPandas().head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,2010-01-01,-3.799,0.24,Århus,Denmark,57.05N,10.33E
1,2010-02-01,-2.691,0.272,Århus,Denmark,57.05N,10.33E
2,2010-03-01,2.429,0.427,Århus,Denmark,57.05N,10.33E
3,2010-04-01,7.123,0.234,Århus,Denmark,57.05N,10.33E
4,2010-05-01,10.657,0.314,Århus,Denmark,57.05N,10.33E


In [21]:
# create schema
demographics_schema = StructType([StructField("City", StringType(), True)\
                        ,StructField("State", StringType(), True)\
                        ,StructField("Median Age", FloatType(), True)\
                        ,StructField("Male Population", FloatType(), True)\
                        ,StructField("Female Population", FloatType(), True)\
                        ,StructField("Total Population", IntegerType(), True)\
                        ,StructField("Number of Veterans", FloatType(), True)\
                        ,StructField("Foreign-born", FloatType(), True)\
                        ,StructField("Average Household Size", FloatType(), True)\
                        ,StructField("State Code", StringType(), True)\
                        ,StructField("Race", StringType(), True)\
                        ,StructField("Count", IntegerType(), True)])

demographics_spark = spark.createDataFrame(clean_demographics, schema=demographics_schema)

demographics_spark.toPandas().head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.799999,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.599998,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [22]:
# create schema
airport_codes_schema = StructType([StructField("ident", StringType(), True)\
                        ,StructField("type", StringType(), True)\
                        ,StructField("name", StringType(), True)\
                        ,StructField("elevation_ft", FloatType(), True)\
                        ,StructField("continent", StringType(), True)\
                        ,StructField("iso_country", StringType(), True)\
                        ,StructField("iso_region", StringType(), True)\
                        ,StructField("municipality", StringType(), True)\
                        ,StructField("gps_code", StringType(), True)\
                        ,StructField("iata_code", StringType(), True)\
                        ,StructField("local_code", StringType(), True)\
                        ,StructField("coordinates", StringType(), True)])

airport_codes_spark = spark.createDataFrame(clean_airport_codes, schema=airport_codes_schema)

airport_codes_spark.toPandas().head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


##### 1. Create dim_migrant

In [29]:
migrant =  ct.create_migrant_dimension(immigration_spark, output_path)

Writing table migrant to tables/migrant
Write complete!


In [20]:
migrant = spark.read.parquet("tables/migrant")
migrant.toPandas().head()

Unnamed: 0,migrant_id,birth_year,gender
0,21,1976.0,
1,38,1955.0,M
2,180,1960.0,F
3,359,1938.0,M
4,16,1995.0,F


##### 2. Create dim_status

In [33]:
status = ct.create_status_dimension(immigration_spark, output_path)

Writing table status to tables/status
Write complete!


In [34]:
status = spark.read.parquet("tables/status")
status.toPandas().head()

Unnamed: 0,status_flag_id,arrival_flag,departure_flag,match_flag
0,8,O,O,M
1,35,A,D,M
2,105,Z,O,M
3,270,T,,
4,82,Z,,


##### 3. Create dim_visa

In [35]:
visa = ct.create_visa_dimension(immigration_spark, output_path)

Writing table visa to tables/visa
Write complete!


In [36]:
visa = spark.read.parquet("tables/visa")
visa.toPandas().head()

Unnamed: 0,visa_id,i94visa,visatype,visapost
0,32,2.0,B2,CPT
1,80,2.0,B2,BGT
2,231,2.0,B2,KEV
3,443,2.0,B2,ABD
4,13,2.0,B2,MNL


##### 4. Create dim_state

In [37]:
state = ct.create_state_dimension(demographics_spark, output_path)

Writing table state to tables/state
Write complete!


In [38]:
state = spark.read.parquet("tables/state")
state.toPandas().head()

Unnamed: 0,state_code,state,median_age,total_population,male_population,female_population,foreign_born,average_household_size
0,DC,District of Columbia,33.8,3361140,1598525.0,1762615.0,475585.0,2.24
1,AR,Arkansas,32.74,2882889,1400724.0,1482165.0,307753.0,2.53
2,TN,Tennessee,34.31,10690165,5124189.0,5565976.0,900149.0,2.46
3,LA,Louisiana,34.63,6502975,3134990.0,3367985.0,417095.0,2.47
4,AZ,Arizona,35.04,22497710,11137275.0,11360435.0,3411565.0,2.77
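
One thing to watch when aggregating city rows to states: the demographics file has one row per city and race, so city-level columns such as Total Population repeat on every race row. Summing them directly counts each city once per race category, which may be why the state totals above look high (the District of Columbia total of 3,361,140 is several times larger than the District's actual population). `create_state_dimension` is not shown here, so this is only a sketch of deduplicating before summing, using made-up rows.

```python
from collections import defaultdict

# Illustrative rows: one row per (city, race), with city totals repeated on each row.
rows = [
    {"city": "Springfield", "state_code": "XX", "total_population": 100, "race": "White", "count": 60},
    {"city": "Springfield", "state_code": "XX", "total_population": 100, "race": "Asian", "count": 40},
    {"city": "Shelbyville", "state_code": "XX", "total_population": 50, "race": "White", "count": 50},
]

# Keep one record per city before summing city-level columns.
city_population = {(r["city"], r["state_code"]): r["total_population"] for r in rows}

state_population = defaultdict(int)
for (_city, state_code), population in city_population.items():
    state_population[state_code] += population

# Summing every row instead would count Springfield twice.
naive_total = sum(r["total_population"] for r in rows)
print(dict(state_population), naive_total)
```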


##### 5. Create dim_time

In [39]:
time = ct.create_time_dimension(immigration_spark, output_path)

Writing table time to tables/time
Write complete!


In [40]:
time = spark.read.parquet("tables/time")
time.toPandas().head()

Unnamed: 0,arrdate,arrival_date,day,month,year,week,weekday
0,20569.0,2016-04-25,25,4,2016,17,2
1,20559.0,2016-04-15,15,4,2016,15,6
2,20568.0,2016-04-24,24,4,2016,16,1
3,20561.0,2016-04-17,17,4,2016,15,1
4,20547.0,2016-04-03,3,4,2016,13,1
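
`arrdate` is a SAS numeric date, that is, a count of days since 1960-01-01. The sketch below converts it with the standard library and derives the same fields as the table above. The weekday numbering (1 = Sunday through 7 = Saturday) and the ISO week numbers are inferred from the sample output, which is consistent with Spark's `dayofweek` and `weekofyear`; `create_time_dimension` itself is not shown, so the helper names here are hypothetical.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(sas_days: float) -> date:
    """Convert a SAS date (days since 1960-01-01) to a datetime.date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


def time_dimension_row(sas_days: float) -> dict:
    """Build one row with the same columns as the time table preview."""
    d = sas_to_date(sas_days)
    return {
        "arrdate": sas_days,
        "arrival_date": d.isoformat(),
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "week": d.isocalendar()[1],         # ISO week number
        "weekday": d.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday
    }


print(time_dimension_row(20569.0))
```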


##### 6. Create dim_airport

In [41]:
airport_codes = ct.create_airport_dimension(airport_codes_spark, output_path)

Writing table airport to tables/airport
Write complete!


In [42]:
airport = spark.read.parquet("tables/airport")
airport.toPandas().head()

Unnamed: 0,ident,type,iata_code,name,iso_country,iso_region,municipality,gps_code,coordinates,elevation_ft
0,00W,small_airport,,Lower Granite State Airport,US,US-WA,Colfax,00W,"-117.44300079346, 46.673500061035",719.0
1,04SD,heliport,,Cheyenne River Health Center Heliport,US,US-SD,Eagle Butte,04SD,"-101.243011, 44.993124",2437.0
2,0GA3,small_airport,,Ayresouth Airport,US,US-GA,Temple,0GA3,"-85.06079864501953, 33.77009963989258",1287.0
3,0IA0,heliport,,Knoxville Area Community Hospital Heliport,US,US-IA,Knoxville,0IA0,"-93.09600067138672, 41.316898345947266",927.0
4,0IL4,heliport,,Good Samaritan Hospital Heliport,US,US-IL,Downers Grove,0IL4,"-88.00779724121094, 41.81890106201172",772.0


##### 7. Create dim_temperature

In [43]:
temperature = ct.create_temperature_dimension(temperature_spark, output_path)

Writing table temperature to tables/temperature
Write complete!


In [44]:
temperature = spark.read.parquet("tables/temperature")
temperature.toPandas().head()

Unnamed: 0,temperature_id,country,average_temperature,average_temperature_uncertainty
0,1400159338496,Dominican Republic,26.71,0.43
1,575525617664,Puerto Rico,26.29,0.3
2,1554778161152,El Salvador,25.96,0.5
3,1554778161153,Costa Rica,25.49,0.48
4,481036337152,Nicaragua,27.5,0.51


##### 8. Create dim_country

In [45]:
country_names = spark.read.parquet("./i94res_country_mapping")
country_names.toPandas().head() 

Unnamed: 0,country_code,country
0,582,"MEXICO Air Sea, and Not Reported (I-94, no lan..."
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA


In [46]:
country = ct.create_country_dimension(country_names, output_path)

Writing table country to tables/country
Write complete!


In [47]:
country = spark.read.parquet("./tables/country")
country.toPandas().head()

Unnamed: 0,country_code,country
0,582,"MEXICO Air Sea, and Not Reported (I-94, no lan..."
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA


##### 9. Create fact_immigration

In [48]:
# join country and temperature on the upper-cased country name
country_temp = country.select(["*"])\
            .join(temperature, (country.country == upper(temperature.country)), how='full')\
            .select([country.country_code, country.country, temperature.temperature_id, temperature.average_temperature, temperature.average_temperature_uncertainty])

country_temp.write.mode("overwrite").parquet("tables/country_temperature_mapping")
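
The join above matches on the upper-cased country name, so it only succeeds where the two sources spell a country identically. An entry in the I-94 mapping such as `MEXICO Air Sea, and Not Reported (I-94, no lan...` would not match a temperature row named `Mexico`. A possible normalisation step is sketched below in plain Python; the `ALIASES` table and the `normalise_country` helper are hypothetical and would need to be filled in after inspecting the unmatched rows. The temperature values are made up.

```python
ALIASES = {
    # Hypothetical manual overrides for labels that carry extra text.
    "MEXICO AIR SEA, AND NOT REPORTED": "MEXICO",
}


def normalise_country(name: str) -> str:
    """Upper-case, collapse whitespace, and apply prefix-based manual aliases."""
    cleaned = " ".join(name.upper().split())
    for prefix, canonical in ALIASES.items():
        if cleaned.startswith(prefix):
            return canonical
    return cleaned


i94_countries = {582: "MEXICO Air Sea, and Not Reported (I-94, no lan...", 236: "AFGHANISTAN"}
temperature_by_country = {"Mexico": 20.0, "Afghanistan": 14.0}  # made-up averages

temperature_lookup = {normalise_country(k): v for k, v in temperature_by_country.items()}
matched = {code: temperature_lookup.get(normalise_country(name))
           for code, name in i94_countries.items()}
print(matched)
```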

In [21]:
immigration = ct.create_immigration_fact(immigration_spark, output_path, spark)

Writing table immigration to tables/immigration
Write complete!


In [22]:
immigration = spark.read.parquet("tables/immigration")

immigration.toPandas().head()

Unnamed: 0,cicid,i94res,depdate,i94mode,i94port,i94cit,i94addr,airline,fltno,ident,country_code,temperature_id,migrant_id,status_flag_id,visa_id,state_code,arrdate
0,4084316.0,209.0,20573.0,1.0,HHW,209.0,HI,JL,00782,,209,,0,0,0,HI,20566.0
1,4422636.0,582.0,20568.0,1.0,MCA,582.0,TX,*GA,XBLNG,,582,,1,1,1,TX,20567.0
2,1195600.0,112.0,20571.0,1.0,OGG,148.0,FL,LH,00464,,112,,2,0,0,,20551.0
3,5291768.0,297.0,20581.0,1.0,LOS,297.0,CA,QR,00739,,297,,3,0,3,CA,20572.0
4,985523.0,111.0,20553.0,3.0,CHM,111.0,NY,,LAND,,111,,4,4,0,NY,20550.0


#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
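
The checks listed above can be sketched as small assertion helpers. The version below works on rows as plain dictionaries (for a Spark DataFrame, something like `[r.asDict() for r in df.collect()]`), so it runs without a cluster. The helper names and the choice of key column are assumptions, and on large tables a `df.count()`-based check would be preferable to collecting rows.

```python
from collections import Counter


def check_not_empty(rows, table_name):
    """Count check: fail if the table has no rows, otherwise return the row count."""
    if len(rows) == 0:
        raise ValueError(f"Data quality check failed: {table_name} is empty")
    return len(rows)


def check_unique_key(rows, key, table_name):
    """Integrity check: fail if the key column is null or duplicated."""
    counts = Counter(row.get(key) for row in rows)
    if None in counts:
        raise ValueError(f"{table_name}.{key} contains nulls")
    duplicates = [value for value, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"{table_name}.{key} has duplicates: {duplicates[:5]}")
    return True


# Sample rows copied from the visa table preview.
visa_rows = [
    {"visa_id": 32, "i94visa": 2.0, "visatype": "B2", "visapost": "CPT"},
    {"visa_id": 80, "i94visa": 2.0, "visatype": "B2", "visapost": "BGT"},
]
print(check_not_empty(visa_rows, "visa"), check_unique_key(visa_rows, "visa_id", "visa"))
```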
 
Run Quality Checks

##### 1. Check table columns

In [6]:
airport = spark.read.parquet("tables/airport")
airport.toPandas().head()

Unnamed: 0,ident,type,iata_code,name,iso_country,iso_region,municipality,gps_code,coordinates,elevation_ft
0,00W,small_airport,,Lower Granite State Airport,US,US-WA,Colfax,00W,"-117.44300079346, 46.673500061035",719.0
1,04SD,heliport,,Cheyenne River Health Center Heliport,US,US-SD,Eagle Butte,04SD,"-101.243011, 44.993124",2437.0
2,0GA3,small_airport,,Ayresouth Airport,US,US-GA,Temple,0GA3,"-85.06079864501953, 33.77009963989258",1287.0
3,0IA0,heliport,,Knoxville Area Community Hospital Heliport,US,US-IA,Knoxville,0IA0,"-93.09600067138672, 41.316898345947266",927.0
4,0IL4,heliport,,Good Samaritan Hospital Heliport,US,US-IL,Downers Grove,0IL4,"-88.00779724121094, 41.81890106201172",772.0


In [7]:
country = spark.read.parquet("tables/country")
country.toPandas().head()

Unnamed: 0,country_code,country
0,582,"MEXICO Air Sea, and Not Reported (I-94, no lan..."
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA


In [8]:
temperature = spark.read.parquet("tables/temperature")
temperature.toPandas().head()

Unnamed: 0,temperature_id,country,average_temperature,average_temperature_uncertainty
0,1400159338496,Dominican Republic,26.71,0.43
1,575525617664,Puerto Rico,26.29,0.3
2,1554778161152,El Salvador,25.96,0.5
3,1554778161153,Costa Rica,25.49,0.48
4,481036337152,Nicaragua,27.5,0.51


In [9]:
migrant = spark.read.parquet("tables/migrant")
migrant.toPandas().head()

Unnamed: 0,migrant_id,birth_year,gender
0,21,1976.0,
1,38,1955.0,M
2,180,1960.0,F
3,359,1938.0,M
4,16,1995.0,F


In [10]:
state = spark.read.parquet("tables/state")
state.toPandas().head()

Unnamed: 0,state_code,state,median_age,total_population,male_population,female_population,foreign_born,average_household_size
0,DC,District of Columbia,33.8,3361140,1598525.0,1762615.0,475585.0,2.24
1,AR,Arkansas,32.74,2882889,1400724.0,1482165.0,307753.0,2.53
2,TN,Tennessee,34.31,10690165,5124189.0,5565976.0,900149.0,2.46
3,LA,Louisiana,34.63,6502975,3134990.0,3367985.0,417095.0,2.47
4,AZ,Arizona,35.04,22497710,11137275.0,11360435.0,3411565.0,2.77


In [11]:
status = spark.read.parquet("tables/status")
status.toPandas().head()

Unnamed: 0,status_flag_id,arrival_flag,departure_flag,match_flag
0,8,O,O,M
1,35,A,D,M
2,105,Z,O,M
3,270,T,,
4,82,Z,,


In [12]:
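# Note: this rebinds the name `time`, shadowing the `time` module imported at the top of the notebook.
time = spark.read.parquet("tables/time")
time.toPandas().head()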

Unnamed: 0,arrdate,arrival_date,day,month,year,week,weekday
0,20569.0,2016-04-25,25,4,2016,17,2
1,20559.0,2016-04-15,15,4,2016,15,6
2,20568.0,2016-04-24,24,4,2016,16,1
3,20561.0,2016-04-17,17,4,2016,15,1
4,20547.0,2016-04-03,3,4,2016,13,1
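
The `arrdate` values above look like SAS date numbers, which count days from 1960-01-01. The sketch below, in plain Python rather than Spark, shows how one row of the time table could be derived under that assumption. The function names are hypothetical, and the weekday convention (1 = Sunday) is inferred from the rows shown, not taken from the project's ETL code.

```python
from datetime import date, timedelta

# Assumption: arrdate is a SAS date value, i.e. days elapsed since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(sas_days):
    """Convert a SAS date number (possibly a float such as 20569.0) to a date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


def time_row(sas_days):
    """Build one record with the same columns as the time table."""
    d = sas_to_date(sas_days)
    return {
        "arrdate": float(sas_days),
        "arrival_date": d.isoformat(),
        "day": d.day,
        "month": d.month,
        "year": d.year,
        "week": d.isocalendar()[1],         # ISO week number
        "weekday": d.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday
    }


print(time_row(20569.0))
```

Under these assumptions, `20569.0` maps to 2016-04-25 with week 17 and weekday 2, which agrees with the first row of the table.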


In [13]:
visa = spark.read.parquet("tables/visa")
visa.toPandas().head()

Unnamed: 0,visa_id,i94visa,visatype,visapost
0,32,2.0,B2,CPT
1,80,2.0,B2,BGT
2,231,2.0,B2,KEV
3,443,2.0,B2,ABD
4,13,2.0,B2,MNL


In [14]:
immigration = spark.read.parquet("tables/immigration")
immigration.toPandas().head()

Unnamed: 0,cicid,i94res,depdate,i94mode,i94port,i94cit,i94addr,airline,fltno,ident,country_code,temperature_id,migrant_id,status_flag_id,visa_id,state_code,arrdate
0,4084316.0,209.0,20573.0,1.0,HHW,209.0,HI,JL,00782,,209,,0,0,0,HI,20566.0
1,4422636.0,582.0,20568.0,1.0,MCA,582.0,TX,*GA,XBLNG,,582,,1,1,1,TX,20567.0
2,1195600.0,112.0,20571.0,1.0,OGG,148.0,FL,LH,00464,,112,,2,0,0,,20551.0
3,5291768.0,297.0,20581.0,1.0,LOS,297.0,CA,QR,00739,,297,,3,0,3,CA,20572.0
4,985523.0,111.0,20553.0,3.0,CHM,111.0,NY,,LAND,,111,,4,4,0,NY,20550.0
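
In the sample above, `temperature_id` is blank in all five rows and `state_code` is blank in one. A record count alone would not reveal this, so a null-rate check on the foreign-key columns may be a useful addition. The following is a minimal sketch in plain Python. The helper names are hypothetical, and the sample rows copy only three columns from the output above, treating blank cells as missing values.

```python
def is_missing(value):
    """True for None, an empty string, or NaN (NaN is never equal to itself)."""
    return value is None or value == "" or value != value


def null_fraction(rows, column):
    """Return the fraction of rows whose `column` value is missing."""
    if not rows:
        raise ValueError("no rows to check")
    missing = sum(1 for row in rows if is_missing(row.get(column)))
    return missing / len(rows)


# The five rows shown above, reduced to three columns (blank cells read as None).
sample = [
    {"cicid": 4084316.0, "temperature_id": None, "state_code": "HI"},
    {"cicid": 4422636.0, "temperature_id": None, "state_code": "TX"},
    {"cicid": 1195600.0, "temperature_id": None, "state_code": None},
    {"cicid": 5291768.0, "temperature_id": None, "state_code": "CA"},
    {"cicid": 985523.0, "temperature_id": None, "state_code": "NY"},
]

for column in ("cicid", "temperature_id", "state_code"):
    print(column, null_fraction(sample, column))
```

With a Spark DataFrame, rows in this shape could be obtained with `[r.asDict() for r in df.limit(n).collect()]`, although for a full table the check is better expressed directly in Spark.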


##### 2. Check Record Count

In [15]:
tables = {
    "airport": airport,
    "country": country,
    "temperature": temperature,
    "migrant": migrant,
    "state": state,
    "status": status,
    "time": time,
    "visa": visa,
    "immigration": immigration
}

for table_name, table in tables.items():
    tools.perform_quality_check(table, table_name)

Data quality check passed for airport with record_count: 55075 records.
Data quality check passed for country with record_count: 289 records.
Data quality check passed for temperature with record_count: 13 records.
Data quality check passed for migrant with record_count: 215 records.
Data quality check passed for state with record_count: 47 records.
Data quality check passed for status with record_count: 34 records.
Data quality check passed for time with record_count: 30 records.
Data quality check passed for visa with record_count: 149 records.
Data quality check passed for immigration with record_count: 1000 records.
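
The body of `tools.perform_quality_check` is not shown in this notebook. As a rough idea of what a record-count check of this kind might do, here is a hypothetical sketch. It is not the project's implementation. The name `check_record_count` and the `min_records` parameter are assumptions, and the only thing it requires of `table` is a `count()` method, which Spark DataFrames provide.

```python
def check_record_count(table, table_name, min_records=1):
    """Fail loudly if `table` has fewer than `min_records` rows.

    Returns the record count so callers can log or compare it.
    """
    record_count = table.count()
    if record_count < min_records:
        raise ValueError(
            f"Data quality check failed for {table_name}: "
            f"{record_count} records (expected at least {min_records})."
        )
    print(
        f"Data quality check passed for {table_name} "
        f"with record_count: {record_count} records."
    )
    return record_count


class FakeTable:
    """Stand-in with a count() method so the sketch runs without Spark."""

    def __init__(self, n):
        self._n = n

    def count(self):
        return self._n


check_record_count(FakeTable(13), "temperature")
```

Raising an exception on an empty table, instead of only printing a message, lets a scheduler or calling script stop the pipeline when a check fails.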


#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

- Please refer to Data_Dictionary.txt

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

##### 1. Clearly state the rationale for the choice of tools and technologies for the project:

- This project uses the following tools:
    - Apache Spark, for its ability to process large volumes of data through a unified analytics engine and convenient APIs
    - Pandas, for its convenient dataframe manipulation functions
    - Matplotlib, to plot data and gain further insights

##### 2. Propose how often the data should be updated and why:

- The immigration (I94) data set is updated monthly, so the tables derived from it should be refreshed monthly as well.

##### 3. Write a description of how you would approach the problem differently under the following scenarios:

##### 3.1 The data was increased by 100x:
- If the data were increased by 100x, I would move processing and storage to distributed services built for that scale, such as Amazon EMR to run the Spark jobs on a cluster and Amazon Redshift or Apache Cassandra to store the resulting tables.

##### 3.2 The data populates a dashboard that must be updated on a daily basis by 7am every day:
- If the data had to populate a dashboard daily, I would manage the ETL pipeline as a DAG in Apache Airflow, scheduled to finish before 7am. Airflow would run the pipeline on schedule, run the data quality checks as part of each run, and provide a convenient means of notification should the pipeline fail.
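
The control flow such a DAG would encode can be sketched without Airflow: run the steps in order, stop at the first failure, and send a notification. The code below is a dependency-free illustration only. `run_pipeline`, the step names and the `notify` callback are hypothetical, and in Airflow the same roles would be played by tasks, task dependencies and a failure callback.

```python
def run_pipeline(steps, notify):
    """Run (name, callable) steps in order.

    On the first exception, call `notify` with a message and stop.
    Returns (names of completed steps, success flag).
    """
    completed = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            notify(f"Pipeline failed at step '{name}': {exc}")
            return completed, False
        completed.append(name)
    return completed, True


def run_etl():
    """Placeholder for the ETL step."""


def run_quality_checks():
    """Placeholder that simulates a failing quality check."""
    raise ValueError("immigration table is empty")


messages = []
completed, success = run_pipeline(
    [("etl", run_etl), ("quality_checks", run_quality_checks)],
    notify=messages.append,
)
print(completed, success, messages)
```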

##### 3.3 The database needed to be accessed by 100+ people:
- If the database needed to be accessed by many people simultaneously, I would move the analytics database to Amazon Redshift, which can serve many concurrent queries and is easily scalable.