# Data Lake Pipeline of US-Immigration 2016

### Data Engineering Capstone Project

#### Project Summary
Spark is used to build a data lake pipeline with the datasets listed below.

* US I94 immigration
* airports
* countries
* US cities demographic data
* world temperature data

#### Project Steps
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Import Required packages.
%load_ext autoreload
%autoreload 2
from utils.cleaning import *
from utils.common import *
from utils.loading import *
from utils.modeling import *
from utils.quality_checking import *
from utils.transforming import *

In [2]:
# Get spark session.
spark = create_spark_session()

## Step 1: Scope the Project and Gather Data

### Scope 

The goal of this project is to provide an ETL pipeline for the I94 immigration dataset, enriched with several supporting datasets.

### Datasets

1. I94 Immigration Data: This data comes from the US National Tourism and Trade Office. [link](https://travel.trade.gov/research/reports/i94/historical/2016.html) **Only the i94_apr16_sub.sas7bdat will be used in this project.**
2. World Temperature Data: This dataset comes from Kaggle. [link](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)
3. U.S. City Demographic Data: This data comes from OpenDataSoft. [link](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/)
4. Airport Code Table: This is a simple table of airport codes and corresponding cities from DataHub. [link](https://datahub.io/core/airport-codes#data)
5. Countries: This dataset is extracted from `I94_SAS_Labels_Descriptions.SAS` provided by Udacity.


## Step 2: Explore and Assess the Data

1. Explore the datasets, which come in CSV, JSON and Parquet formats.
2. Clean the datasets.
3. Transform the datasets.

### Step 2.1: Explore the datasets

In [3]:
demographics = load_json(spark=spark, path='data/us-cities-demographics.json').select('fields.*')
print_data(demographics)

data count: 2891


Unnamed: 0,average_household_size,city,count,female_population,foreign_born,male_population,median_age,number_of_veterans,race,state,state_code,total_population
0,2.56,Wichita,8791,197601,40270.0,192354,34.6,23978.0,American Indian and Alaska Native,Kansas,KS,389955
1,2.67,Allen,22304,59581,19652.0,60626,33.5,5691.0,Black or African-American,Pennsylvania,PA,120207
2,2.74,Danbury,8454,41227,25675.0,43435,37.3,3752.0,Black or African-American,Connecticut,CT,84662
3,2.39,Nashville,67526,340365,88193.0,314231,34.1,27942.0,Hispanic or Latino,Tennessee,TN,654596
4,2.7,Stamford,11013,63936,44003.0,64941,35.4,2269.0,Asian,Connecticut,CT,128877
5,,San Juan,335559,186829,,155408,41.4,,Hispanic or Latino,Puerto Rico,PR,342237
6,3.28,Provo,108471,59027,10925.0,56231,23.6,2177.0,White,Utah,UT,115258
7,3.13,San Marcos,4447,47688,21558.0,45246,35.4,5189.0,Black or African-American,California,CA,92934
8,3.27,Escondido,3151,74907,46298.0,76551,33.3,8110.0,American Indian and Alaska Native,California,CA,151458
9,,Caguas,76349,42265,,34743,40.4,,Hispanic or Latino,Puerto Rico,PR,77008


In [4]:
airports = load_csv(spark=spark, path='data/airport-codes_csv.csv')
print_data(airports)

data count: 55075


Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237,,US,US-AR,Newport,,,,"-91.254898, 35.6087"
5,00AS,small_airport,Fulton Airport,1100,,US,US-OK,Alex,00AS,,00AS,"-97.8180194, 34.9428028"
6,00AZ,small_airport,Cordes Airport,3810,,US,US-AZ,Cordes,00AZ,,00AZ,"-112.16500091552734, 34.305599212646484"
7,00CA,small_airport,Goldstone /Gts/ Airport,3038,,US,US-CA,Barstow,00CA,,00CA,"-116.888000488, 35.350498199499995"
8,00CL,small_airport,Williams Ag Airport,87,,US,US-CA,Biggs,00CL,,00CL,"-121.763427, 39.427188"
9,00CN,heliport,Kitchen Creek Helibase Heliport,3350,,US,US-CA,Pine Valley,00CN,,00CN,"-116.4597417, 32.7273736"


In [5]:
countries = load_csv(spark, 'data/countries.csv')
print_data(countries)

data count: 289


Unnamed: 0,code,name
0,582,"MEXICO Air Sea, and Not Reported (I-94, no lan..."
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA
5,324,ANGOLA
6,529,ANGUILLA
7,518,ANTIGUA-BARBUDA
8,687,ARGENTINA
9,151,ARMENIA


In [6]:
temperature = load_csv(spark, 'data/GlobalLandTemperaturesByCitySample.csv')
print_data(temperature)

data count: 1999


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.787999999999999,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.050999999999998,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
9,1744-08-01,,,Århus,Denmark,57.05N,10.33E


In [7]:
immigration = load_parquet(spark, 'data/sas_data')
print_data(immigration)

data count: 3096313


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,5748517.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,CA,20582.0,...,,M,1976.0,10292016,F,,QF,94953870000.0,11,B1
1,5748518.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,NV,20591.0,...,,M,1984.0,10292016,F,,VA,94955620000.0,7,B1
2,5748519.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20582.0,...,,M,1987.0,10292016,M,,DL,94956410000.0,40,B1
3,5748520.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1987.0,10292016,F,,DL,94956450000.0,40,B1
4,5748521.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1988.0,10292016,M,,DL,94956390000.0,40,B1
5,5748522.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20579.0,...,,M,1959.0,10292016,M,,NZ,94981800000.0,10,B2
6,5748523.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,...,,M,1950.0,10292016,F,,NZ,94979690000.0,10,B2
7,5748524.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,...,,M,1975.0,10292016,F,,NZ,94979750000.0,10,B2
8,5748525.0,2016.0,4.0,245.0,464.0,HOU,20574.0,1.0,FL,20581.0,...,,M,1989.0,10292016,M,,NZ,94973250000.0,28,B2
9,5748526.0,2016.0,4.0,245.0,464.0,LOS,20574.0,1.0,CA,20581.0,...,,M,1990.0,10292016,F,,NZ,95013550000.0,2,B2


### Step 2.2: Clean the datasets

#### Clean demographic dataset

* Cast numeric data to integer or float based on the meaning.
* Fill missing values in numeric columns with 0.
* Rename the columns to snake_case.
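
The three cleaning rules above can be illustrated on a single record. This is a minimal plain-Python sketch, not the actual `clean_demographics` implementation (which operates on a Spark DataFrame). The column groups, the helper names `to_snake_case` and `clean_demographics_row`, and the space-separated input labels are assumptions made for illustration.

```python
import re

# Hypothetical column groups; the real clean_demographics may differ.
INT_COLUMNS = ("count", "female_population", "foreign_born",
               "male_population", "number_of_veterans", "total_population")
FLOAT_COLUMNS = ("average_household_size", "median_age")


def to_snake_case(name):
    """Turn a label such as 'Median Age' into 'median_age'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


def clean_demographics_row(row):
    """Rename keys to snake_case, cast numeric fields, fill gaps with 0."""
    cleaned = {to_snake_case(key): value for key, value in row.items()}
    for column in INT_COLUMNS:
        value = cleaned.get(column)
        # Go through float first so inputs such as '40270.0' cast cleanly.
        cleaned[column] = int(float(value)) if value not in (None, "") else 0
    for column in FLOAT_COLUMNS:
        value = cleaned.get(column)
        cleaned[column] = float(value) if value not in (None, "") else 0.0
    return cleaned


# Values mirror the San Juan row shown in the exploration output.
san_juan = clean_demographics_row({
    "City": "San Juan", "State Code": "PR", "Median Age": 41.4,
    "Foreign Born": None, "Number of Veterans": None,
    "Average Household Size": None, "Count": 335559,
    "Male Population": 155408, "Female Population": 186829,
    "Total Population": 342237,
})
print(san_juan["foreign_born"], san_juan["average_household_size"])
```

Filling with 0 keeps the later sums simple, but it also pulls averages such as `average_household_size` toward zero for cities with missing values, which is worth keeping in mind when reading the state-level figures.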

In [8]:
demographics = clean_demographics(demographics)
print_data(demographics)

data count: 2891


Unnamed: 0,average_household_size,city,count,female_population,foreign_born,male_population,median_age,number_of_veterans,race,state,state_code,total_population
0,2.56,Wichita,8791,197601,40270,192354,34.599998,23978,American Indian and Alaska Native,Kansas,KS,389955
1,2.67,Allen,22304,59581,19652,60626,33.5,5691,Black or African-American,Pennsylvania,PA,120207
2,2.74,Danbury,8454,41227,25675,43435,37.299999,3752,Black or African-American,Connecticut,CT,84662
3,2.39,Nashville,67526,340365,88193,314231,34.099998,27942,Hispanic or Latino,Tennessee,TN,654596
4,2.7,Stamford,11013,63936,44003,64941,35.400002,2269,Asian,Connecticut,CT,128877
5,0.0,San Juan,335559,186829,0,155408,41.400002,0,Hispanic or Latino,Puerto Rico,PR,342237
6,3.28,Provo,108471,59027,10925,56231,23.6,2177,White,Utah,UT,115258
7,3.13,San Marcos,4447,47688,21558,45246,35.400002,5189,Black or African-American,California,CA,92934
8,3.27,Escondido,3151,74907,46298,76551,33.299999,8110,American Indian and Alaska Native,California,CA,151458
9,0.0,Caguas,76349,42265,0,34743,40.400002,0,Hispanic or Latino,Puerto Rico,PR,77008


#### Clean airport dataset

* Remove rows with a null value in a column that will be used as a foreign key.
* Cast `elevation_ft` to float.
* Take a substring of `iso_region` to get the region code.
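
As a rough illustration of these rules, the sketch below cleans one airport record in plain Python. It is not the real `clean_airports` function. The helper name `clean_airport_row` is hypothetical, and treating `local_code` as the key column is an assumption based on the dim_airports field list. The drop in row count in the output below suggests the real function may apply additional filters that are not shown here.

```python
def clean_airport_row(row, fk_columns=("local_code",)):
    """Return a cleaned copy of row, or None if a key column is missing."""
    # Assumption: local_code is the column later used as the foreign key.
    if any(not row.get(column) for column in fk_columns):
        return None
    cleaned = dict(row)
    elevation = row.get("elevation_ft")
    cleaned["elevation_ft"] = (
        float(elevation) if elevation not in (None, "") else None
    )
    # 'US-KS' -> 'KS'; values without a dash are kept unchanged.
    region = row.get("iso_region") or ""
    cleaned["iso_region"] = region.split("-", 1)[-1]
    return cleaned


kept = clean_airport_row({
    "ident": "00AA", "local_code": "00AA",
    "elevation_ft": "3435", "iso_region": "US-KS",
})
dropped = clean_airport_row({
    "ident": "00AR", "local_code": None,
    "elevation_ft": "237", "iso_region": "US-AR",
})
print(kept["elevation_ft"], kept["iso_region"], dropped)
```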

In [9]:
airports = clean_airports(airports)
print_data(airports)

data count: 14383


Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
1,00AK,small_airport,Lowell Field,450.0,,US,AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
2,00AL,small_airport,Epps Airpark,820.0,,US,AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
3,00AS,small_airport,Fulton Airport,1100.0,,US,OK,Alex,00AS,,00AS,"-97.8180194, 34.9428028"
4,00AZ,small_airport,Cordes Airport,3810.0,,US,AZ,Cordes,00AZ,,00AZ,"-112.16500091552734, 34.305599212646484"
5,00CA,small_airport,Goldstone /Gts/ Airport,3038.0,,US,CA,Barstow,00CA,,00CA,"-116.888000488, 35.350498199499995"
6,00CL,small_airport,Williams Ag Airport,87.0,,US,CA,Biggs,00CL,,00CL,"-121.763427, 39.427188"
7,00FA,small_airport,Grass Patch Airport,53.0,,US,FL,Bushnell,00FA,,00FA,"-82.21900177001953, 28.64550018310547"
8,00FL,small_airport,River Oak Airport,35.0,,US,FL,Okeechobee,00FL,,00FL,"-80.96920013427734, 27.230899810791016"
9,00GA,small_airport,Lt World Airport,700.0,,US,GA,Lithonia,00GA,,00GA,"-84.06829833984375, 33.76750183105469"


#### Clean countries dataset

* Normalize the country names so that they can be matched against the other datasets in later operations.

In [10]:
countries = clean_countries(countries)
print_data(countries)

data count: 289


Unnamed: 0,code,name
0,582,MEXICO
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA
5,324,ANGOLA
6,529,ANGUILLA
7,518,ANTIGUA-BARBUDA
8,687,ARGENTINA
9,151,ARMENIA


#### Clean temperature dataset

* Remove rows where `AverageTemperature` is missing.
* Rename the columns to snake_case.

In [11]:
temperature = clean_temperature(temperature)
print_data(temperature)

data count: 1927


Unnamed: 0,dt,average_temperature,average_temperature_uncertainty,city,country,latitude,longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1744-04-01,5.787999999999999,3.624,Århus,Denmark,57.05N,10.33E
2,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
3,1744-06-01,14.050999999999998,1.347,Århus,Denmark,57.05N,10.33E
4,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
5,1744-09-01,12.781,1.454,Århus,Denmark,57.05N,10.33E
6,1744-10-01,7.95,1.63,Århus,Denmark,57.05N,10.33E
7,1744-11-01,4.638999999999999,1.3019999999999998,Århus,Denmark,57.05N,10.33E
8,1744-12-01,0.1219999999999998,1.756,Århus,Denmark,57.05N,10.33E
9,1745-01-01,-1.3330000000000002,1.642,Århus,Denmark,57.05N,10.33E


#### Clean immigration dataset

* Drop unnecessary columns.
* Cast numeric data to integer.
* Cast date data to a date string, `%Y-%m-%d`.
* Remove rows where any of the foreign-key columns (`i94cit`, `i94port`, `i94addr`) is missing.
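
The date cast deserves a short illustration. The raw `arrdate` and `depdate` values look like SAS date values, which count days from 1960-01-01. That reading is an assumption, but it agrees with the sample rows, where 20574.0 becomes 2016-04-30. The helper name `sas_date_to_string` is hypothetical and this plain-Python sketch is not the code used inside `clean_immigration`.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date


def sas_date_to_string(sas_days):
    """Convert a SAS day count (e.g. 20574.0) to a '%Y-%m-%d' string."""
    if sas_days is None:
        return None
    return (SAS_EPOCH + timedelta(days=int(sas_days))).strftime("%Y-%m-%d")


print(sas_date_to_string(20574.0))  # 2016-04-30
print(sas_date_to_string(20582.0))  # 2016-05-08
```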

In [12]:
immigration = clean_immigration(immigration)
print_data(immigration)

data count: 2943721


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,dtadfile,gender,airline,fltno,visatype
0,5748517,2016,4,245,438,LOS,2016-04-30,1,CA,2016-05-08,40,1,20160430,F,QF,11,B1
1,5748518,2016,4,245,438,LOS,2016-04-30,1,NV,2016-05-17,32,1,20160430,F,VA,7,B1
2,5748519,2016,4,245,438,LOS,2016-04-30,1,WA,2016-05-08,29,1,20160430,M,DL,40,B1
3,5748520,2016,4,245,438,LOS,2016-04-30,1,WA,2016-05-14,29,1,20160430,F,DL,40,B1
4,5748521,2016,4,245,438,LOS,2016-04-30,1,WA,2016-05-14,28,1,20160430,M,DL,40,B1
5,5748522,2016,4,245,464,HHW,2016-04-30,1,HI,2016-05-05,57,2,20160430,M,NZ,10,B2
6,5748523,2016,4,245,464,HHW,2016-04-30,1,HI,2016-05-12,66,2,20160430,F,NZ,10,B2
7,5748524,2016,4,245,464,HHW,2016-04-30,1,HI,2016-05-12,41,2,20160430,F,NZ,10,B2
8,5748525,2016,4,245,464,HOU,2016-04-30,1,FL,2016-05-07,27,2,20160430,M,NZ,28,B2
9,5748526,2016,4,245,464,LOS,2016-04-30,1,CA,2016-05-07,26,2,20160430,F,NZ,2,B2


## Step 3: Define the Data Model

### Step 3.1 Conceptual Data Model

The data model is implemented by following the **star schema** with a fact table and several dimension tables.

#### 3.1.1 Dimension Tables

**dim_demographics**

This table aggregates the demographics data at the state level. In the source data, each city appears in several rows (one per race), and the population, veteran and foreign-born figures are repeated in every one of them. These columns are therefore first reduced with the `first` function when grouping by `city`, and then combined with `sum` to get the state totals. The median age and average household size are aggregated with `avg` when grouping by state. Finally, the race counts are extracted into separate columns with `pivot`.


| Field Name|
| :--- |
| state_code (FK) |
| state |
| median_age |
| male_population |
| female_population |
| total_population |
| number_of_veterans |
| average_household_size |
| foreign_born |
| hispanic_or_latino |
| american_indian_and_alaska_native |
| black_or_african_american |
| white |
| asian |
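
The aggregation described for dim_demographics can be sketched in plain Python. This is an illustration of the `first`, `sum`, `avg` and `pivot` steps under stated assumptions, not the Spark code in `transform_demographics`. The function names are hypothetical, the race counts are assumed to be summed per state, and the sample numbers are made up.

```python
from collections import defaultdict
from statistics import mean

# Column groups taken from the dim_demographics field list.
SUMMED = ("male_population", "female_population", "total_population",
          "number_of_veterans", "foreign_born")
AVERAGED = ("median_age", "average_household_size")


def race_column(race):
    """'Black or African-American' -> 'black_or_african_american'."""
    return race.lower().replace("-", "_").replace(" ", "_")


def aggregate_by_state(rows):
    """Collapse one-row-per-(city, race) records into one record per state."""
    first_per_city = {}                                   # `first` per city
    race_totals = defaultdict(lambda: defaultdict(int))   # `pivot` on race
    for row in rows:
        first_per_city.setdefault((row["state_code"], row["city"]), row)
        race_totals[row["state_code"]][race_column(row["race"])] += row["count"]

    cities_by_state = defaultdict(list)
    for (state_code, _city), row in first_per_city.items():
        cities_by_state[state_code].append(row)

    result = []
    for state_code, cities in cities_by_state.items():
        record = {"state_code": state_code, "state": cities[0]["state"]}
        for column in SUMMED:       # `sum` of the de-duplicated city figures
            record[column] = sum(city[column] for city in cities)
        for column in AVERAGED:     # `avg` across the cities of the state
            record[column] = mean(city[column] for city in cities)
        record.update(race_totals[state_code])
        result.append(record)
    return result


# Illustrative numbers only, not taken from the real dataset.
base_a = {"state": "Xland", "state_code": "XX", "city": "A",
          "male_population": 40, "female_population": 60,
          "total_population": 100, "number_of_veterans": 5,
          "foreign_born": 10, "median_age": 30.0,
          "average_household_size": 2.0}
base_b = {**base_a, "city": "B", "male_population": 100,
          "female_population": 100, "total_population": 200,
          "number_of_veterans": 10, "foreign_born": 20,
          "median_age": 40.0, "average_household_size": 3.0}
rows = [
    {**base_a, "race": "White", "count": 60},
    {**base_a, "race": "Asian", "count": 40},
    {**base_b, "race": "White", "count": 150},
]
state = aggregate_by_state(rows)[0]
print(state["total_population"], state["median_age"], state["white"])
```

City A contributes its population once even though it appears in two race rows, which is the reason for taking `first` before `sum`.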

**dim_airports**

This table is only cleaned. No further processing is applied.

| Field Name|
| :--- |
| local_code (FK) |
| ident |
| type |
| name |
| elevation_ft |
| continent |
| iso_country |
| iso_region |
| municipality |
| gps_code |
| iata_code |
| coordinates |

**dim_countries**

The country information was first extracted from `I94_SAS_Labels_Descriptions.SAS` and stored in a CSV file. This table combines the countries dataset and the temperature dataset. A lowercase country name was added to both datasets for joining purposes, and it is dropped afterwards to avoid duplication in the dimension table.

| Field Name|
| :--- |
| code (FK) |
| name |
| average_temperature |
| latitude |
| longitude |
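
The join on a lowercase name can be sketched as follows. This is a plain-Python illustration rather than the Spark code in `model_dim_countries`. A left join is assumed so that countries without temperature data are kept, because the join type is not stated here. The function name `build_dim_countries` and the code used for Denmark are placeholders.

```python
def build_dim_countries(countries, temperature):
    """Left-join countries with per-country temperature on a lowercase name."""
    by_name = {row["country"].lower(): row for row in temperature}
    dim = []
    for country in countries:
        match = by_name.get(country["name"].lower(), {})
        dim.append({
            "code": country["code"],
            "name": country["name"],
            "average_temperature": match.get("average_temperature"),
            "latitude": match.get("latitude"),
            "longitude": match.get("longitude"),
        })  # the lowercase join key is never stored in the result
    return dim


countries = [
    {"code": 582, "name": "MEXICO"},
    {"code": -1, "name": "DENMARK"},  # placeholder code, not the real I94 code
]
temperature = [{"country": "Denmark", "average_temperature": 7.44682,
                "latitude": "57.05N", "longitude": "10.33E"}]

for record in build_dim_countries(countries, temperature):
    print(record)
```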

#### 3.1.2 Fact Table

**fact_immigration**

The immigration data was first cleaned and stripped of unnecessary columns. The arrival date was then split into year, month and day for partitioning. In addition, the immigration dataset is joined with the dimension tables and the processed countries dataset to remove rows that have no match in the dimension tables.

| Field Name| FK |
| :--- | :--- |
| cicid | |
| i94yr | |
| i94mon | |
| i94cit | dim_countries |
| i94res | |
| i94port | dim_airports |
| arrdate | |
| i94mode | |
| i94addr | dim_demographics |
| depdate | |
| i94bir | |
| i94visa | |
| dtadfile | |
| gender | |
| airline | |
| fltno | |
| visatype | |
| arr_year | |
| arr_month | |
| arr_day | |
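
The two steps described for fact_immigration, splitting the arrival date and dropping rows without a match in the dimensions, can be sketched in plain Python. The helper names `add_arrival_parts` and `keep_matched` are hypothetical, and the key sets stand in for the dimension tables. The real pipeline performs these steps on Spark DataFrames.

```python
from datetime import datetime


def add_arrival_parts(row):
    """Add arr_year, arr_month, arr_day derived from the 'arrdate' string."""
    arrival = datetime.strptime(row["arrdate"], "%Y-%m-%d")
    return {**row, "arr_year": arrival.year,
            "arr_month": arrival.month, "arr_day": arrival.day}


def keep_matched(rows, country_codes, airport_codes, state_codes):
    """Keep only rows whose foreign keys all appear in the dimensions."""
    return [
        row for row in rows
        if row["i94cit"] in country_codes
        and row["i94port"] in airport_codes
        and row["i94addr"] in state_codes
    ]


rows = [
    {"cicid": 5748517, "i94cit": 245, "i94port": "LOS",
     "i94addr": "CA", "arrdate": "2016-04-30"},
    {"cicid": 5748522, "i94cit": 245, "i94port": "HHW",
     "i94addr": "HI", "arrdate": "2016-04-30"},
]
# Hypothetical key sets standing in for the three dimension tables.
matched = keep_matched(rows, country_codes={245},
                       airport_codes={"LOS"}, state_codes={"CA", "HI"})
fact = [add_arrival_parts(row) for row in matched]
print(fact)
```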

### Step 3.2 Mapping Out Data Pipelines

_List the steps necessary to pipeline the data into the chosen data model_

The basic steps of the pipeline are **loading**, **cleaning**, **transforming**, **modeling** and **quality checking**.

Apache Spark is the main tool used to implement these steps.

The relationship among these steps is shown in the DAG below.

![capstone dag](capstone-dag.png)

## Step 4: Run Pipelines to Model the Data 
### 4.1 Create the data model
Build the data pipelines to create the data model.

#### 4.1.1 Transform Data to Match Model

In [13]:
demographics = transform_demographics(demographics)
print_data(demographics)

data count: 49


Unnamed: 0,state,state_code,hispanic_or_latino,foreign_born,asian,male_population,average_household_size,total_population,female_population,black_or_african_american,median_age,number_of_veterans,american_indian_and_alaska_native,white
0,Mississippi,MS,7264,4861,2587,112147,2.595,242683,130536,167366,33.4,14792,323,71645
1,Utah,UT,201695,132819,48801,530818,3.175,1050591,519773,21893,30.98,39671,18746,889798
2,South Dakota,SD,12359,15309,6859,122718,2.345,245098,122380,13121,37.049999,16087,13782,213281
3,Kentucky,KY,50478,66488,32667,452483,2.395,929877,477394,202749,35.950001,56025,7772,705790
4,California,CA,9856464,7448257,4543730,12278281,3.100949,24822460,12544179,2047009,36.182482,928270,401386,14905129
5,Nebraska,NE,83812,71221,34243,357333,2.435,721233,363900,80668,33.25,39197,10599,600094
6,New Hampshire,NH,22473,27199,13989,97771,2.43,198198,100427,11043,37.799999,11005,1213,174085
7,Delaware,DE,5516,3336,1193,32680,2.45,71957,39277,44182,36.400002,3063,414,23743
8,Minnesota,MN,103229,215873,151544,702157,2.500909,1422403,720246,216731,35.618182,64894,25242,1050239
9,North Carolina,NC,354409,379327,178740,1466105,2.475,3060199,1594094,1029446,33.785715,166146,35209,1790136


In [14]:
countries = transform_countries(countries)
print_data(countries)

data count: 289


Unnamed: 0,code,name,lower_name
0,582,MEXICO,mexico
1,236,AFGHANISTAN,afghanistan
2,101,ALBANIA,albania
3,316,ALGERIA,algeria
4,102,ANDORRA,andorra
5,324,ANGOLA,angola
6,529,ANGUILLA,anguilla
7,518,ANTIGUA-BARBUDA,antigua-barbuda
8,687,ARGENTINA,argentina
9,151,ARMENIA,armenia


In [15]:
temperature = transform_temperature(temperature)
print_data(temperature)

data count: 1


Unnamed: 0,country,average_temperature,latitude,longitude,lower_name
0,Denmark,7.44682,57.05N,10.33E,denmark


In [16]:
immigration = transform_immigration(immigration)
print_data(immigration)

data count: 2943721


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,dtadfile,gender,airline,fltno,visatype,arr_year,arr_month,arr_day
0,5748517,2016,4,245,438,LOS,2016-04-30,1,CA,2016-05-08,40,1,20160430,F,QF,11,B1,2016,4,30
1,5748518,2016,4,245,438,LOS,2016-04-30,1,NV,2016-05-17,32,1,20160430,F,VA,7,B1,2016,4,30
2,5748519,2016,4,245,438,LOS,2016-04-30,1,WA,2016-05-08,29,1,20160430,M,DL,40,B1,2016,4,30
3,5748520,2016,4,245,438,LOS,2016-04-30,1,WA,2016-05-14,29,1,20160430,F,DL,40,B1,2016,4,30
4,5748521,2016,4,245,438,LOS,2016-04-30,1,WA,2016-05-14,28,1,20160430,M,DL,40,B1,2016,4,30
5,5748522,2016,4,245,464,HHW,2016-04-30,1,HI,2016-05-05,57,2,20160430,M,NZ,10,B2,2016,4,30
6,5748523,2016,4,245,464,HHW,2016-04-30,1,HI,2016-05-12,66,2,20160430,F,NZ,10,B2,2016,4,30
7,5748524,2016,4,245,464,HHW,2016-04-30,1,HI,2016-05-12,41,2,20160430,F,NZ,10,B2,2016,4,30
8,5748525,2016,4,245,464,HOU,2016-04-30,1,FL,2016-05-07,27,2,20160430,M,NZ,28,B2,2016,4,30
9,5748526,2016,4,245,464,LOS,2016-04-30,1,CA,2016-05-07,26,2,20160430,F,NZ,2,B2,2016,4,30


### 4.2 Model Datasets

**Dimension Tables**

The demographics and airports dimensions are modeled directly after cleaning and transforming. The countries dimension combines the countries dataset and the temperature dataset. They are joined using `lower_name` as the key, and the duplicated columns are removed before modeling.

In [18]:
model_dim_demographics(demographics, 'models/dim_demographics.parquet')
model_dim_airports(airports, 'models/dim_airports.parquet')
model_dim_countries(countries, temperature, 'models/dim_countries.parquet')

**Fact Table**

The fact immigration table is first joined with the dimension tables and the processed countries dataset to remove rows without a match. It is then written out partitioned by arrival year, month and day.
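
To make the partitioning concrete: when Spark writes Parquet with `partitionBy`, each distinct combination of partition values gets its own `column=value` directory. Assuming `model_fact_immigration` partitions this way (its body is not shown here), the directory for one row can be sketched as below. The helper name `partition_dir` is hypothetical.

```python
from pathlib import PurePosixPath


def partition_dir(base, row, keys=("arr_year", "arr_month", "arr_day")):
    """Build the Hive-style partition directory a row would be written to."""
    path = PurePosixPath(base)
    for key in keys:
        path /= f"{key}={row[key]}"
    return str(path)


print(partition_dir(
    "models/fact_immigration.parquet",
    {"arr_year": 2016, "arr_month": 4, "arr_day": 30},
))
# models/fact_immigration.parquet/arr_year=2016/arr_month=4/arr_day=30
```

With this layout, a query that filters on the arrival date only has to read the matching directories.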

In [19]:
model_fact_immigration(immigration, demographics, airports, countries, 'models/fact_immigration.parquet')

### 4.3 Data Quality Checks

_**Explain the data quality checks you'll perform to ensure the pipeline ran as expected**_

The data quality checks focus on the following:

1. Ensure the fact table and all the dimension tables contain data by counting their rows.

2. Ensure the integrity of the fact table against all the dimension tables by joining with the `left_anti` strategy and counting the rows of the joined result.
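
Both checks can be illustrated in plain Python. This is a sketch of the idea, not the `check_existence` and `check_fact_table_integrity` functions from `utils.quality_checking`. The names `has_rows` and `unmatched_rows` are hypothetical. An anti-join returns the fact rows that have no partner in the dimension, so integrity holds when the result is empty.

```python
def has_rows(rows):
    """Existence check: the table must contain at least one row."""
    return len(rows) > 0


def unmatched_rows(fact_rows, fact_key, dim_rows, dim_key):
    """Anti-join: fact rows whose key value is absent from the dimension."""
    known = {row[dim_key] for row in dim_rows}
    return [row for row in fact_rows if row[fact_key] not in known]


fact = [{"cicid": 5748517, "i94addr": "CA"},
        {"cicid": 5748518, "i94addr": "NV"}]
dim_states = [{"state_code": "CA"}, {"state_code": "NV"}]

print(has_rows(fact) and has_rows(dim_states))
# Integrity holds when the anti-join comes back empty.
print(len(unmatched_rows(fact, "i94addr", dim_states, "state_code")) == 0)
```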
 

#### Data Existence Check

In [20]:
# Load data from parquet files
dim_demographics = load_parquet(spark, 'models/dim_demographics.parquet')
dim_airports = load_parquet(spark, 'models/dim_airports.parquet')
dim_countries = load_parquet(spark, 'models/dim_countries.parquet')

In [21]:
check_existence(dim_demographics)
check_existence(dim_airports)
check_existence(dim_countries)

True

In [22]:
fact_immigration = load_parquet(spark, 'models/fact_immigration.parquet')

In [23]:
check_existence(fact_immigration)

True

#### Data Integrity Check

In [25]:
# Check fact table integrity; the order of the arguments matters.
check_fact_table_integrity(
    fact_immigration,
    dim_demographics,
    dim_airports,
    dim_countries
)

True

### 4.4 Data dictionary

**Fact Table, Immigration**

| Column Name | Description |
| :--- | :--- |
| CICID | Record ID |
| I94YR | 4 digit year |
| I94MON | Numeric month |
| I94CIT | Country of citizenship |
| I94RES | Country of residence |
| I94PORT | Airport of admittance into the USA |
| ARRDATE | Arrival date in the USA |
| I94MODE | Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported) |
| I94ADDR | State of arrival |
| DEPDATE | Departure date |
| I94BIR | Age of the visitor |
| I94VISA | Visa codes: (1 = Business; 2 = Pleasure; 3 = Student) |
| DTADFILE | Character date field |
| GENDER | Gender of the visitor |
| AIRLINE | Airline used to arrive in U.S. |
| FLTNO | Flight number of Airline used to arrive in U.S. |
| VISATYPE | Class of admission legally admitting the non-immigrant to temporarily stay in U.S. |
| arr_year | arrival year, for data partitioning |
| arr_month | arrival month, for data partitioning |
| arr_day | arrival day, for data partitioning |

**Dimension Table, Demographics**

| Column Name | Description |
| :--- | :--- |
| STATE_CODE | Two-letter code of the state |
| STATE | Name of the state |
| MEDIAN_AGE | Median age in the state (estimation) |
| AVERAGE_HOUSEHOLD_SIZE | Average number of people in a household in the state (estimation) |
| TOTAL_POPULATION | Total population of the state |
| FEMALE_POPULATION | Female population of the state |
| MALE_POPULATION | Male population of the state |
| NUMBER_OF_VETERANS | Population of veteran citizens |
| BLACK_OR_AFRICAN_AMERICAN | Population belonging to this ethnic group |
| HISPANIC_OR_LATINO | Population belonging to this ethnic group |
| ASIAN | Population belonging to this ethnic group |
| AMERICAN_INDIAN_AND_ALASKA_NATIVE | Population belonging to this ethnic group |
| WHITE | Population belonging to this ethnic group |
| FOREIGN_BORN | Population of citizens born outside of US |

**Dimension Table, Airports**


| Column Name | Description |
| :--- | :--- |
| IDENT | Identification code |
| TYPE | Type of the Airport |
| NAME | Name of the Airport |
| ELEVATION_FT | Elevation above the sea level in feet |
| CONTINENT | Continent code |
| ISO_COUNTRY | Country code according to ISO |
| ISO_REGION | Region code according to ISO, 'US' is removed |
| MUNICIPALITY | Municipality where the airport is located |
| GPS_CODE | GPS code |
| IATA_CODE | Code of the airport assigned by International Air Transport Association |
| LOCAL_CODE | Local code of the airport |
| COORDINATES | GPS coordinates - longitude and latitude |

**Dimension Table, Countries**

| Column Name | Description |
| :--- | :--- |
| CODE | Country Code |
| NAME | Country Name |
| AVERAGE_TEMPERATURE | Average temperature of the country between 1743 and 2013 |
| LATITUDE | GPS Latitude |
| LONGITUDE | GPS Longitude |

## Step 5: Complete Project Write Up

_**Clearly state the rationale for the choice of tools and technologies for the project.**_

I wanted technologies that can run the whole process both locally and in the cloud. Therefore, **Apache Spark** was chosen to build the whole data pipeline. I can develop and validate the pipeline locally with a relatively small amount of data, then deploy it to a cloud service such as AWS EMR to work with large amounts of data. In addition, the Parquet format used with **Apache Spark** improves performance over the raw data and scales to large data sizes.

_**Propose how often the data should be updated and why.**_

The ideal update schedule is daily, because the fact table (immigration data) is partitioned by day.

_**Write a description of how you would approach the problem differently under the following scenarios**_

* **The data was increased by 100x.**

    Because the pipeline is built with **Apache Spark**, it is easy to move it to a cloud cluster such as AWS EMR to handle large amounts of data. The cluster can also be scaled out easily by adding nodes. In this case, S3 can be considered as the storage for the processed data, because its capacity is effectively unlimited and the processed data can be distributed globally.


* **The data populates a dashboard that must be updated on a daily basis by 7am every day.**

    **Apache Airflow** is a good tool for scheduling the job to run daily at the expected time. Because the functions are modularized, it is easy to integrate them into an Airflow DAG file and operators.


* **The database needed to be accessed by 100+ people.**

    We can use either AWS S3 or AWS Redshift to share the data with 100+ people. The choice depends on the format in which we finally decide to store the processed data. With Redshift, we can grant read-only permission to different people. With S3, we can host a static page with download features on S3, or use Lambda with Cloudflare to provide an API through which other people can download the data. With this method, we can disable direct access to S3 for security purposes.