# Project i94
##### (Step4_ETL.ipynb)
**Note:** This notebook covers the following work steps:
* Running the data pipelines
* Inspecting data integrity and quality

---

- This step can be automated by running [**etl.py**](./etl.py) from the terminal:
> \>> `python etl.py`
 
- Alternatively, each work step can be imported and run in this notebook (as shown below)

---

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model

Import libraries and files

In [1]:
import pandas as pd
import os

from pyspark.sql import SparkSession
from pyspark.sql import types as T 

from utility_functions import cleaning_immigration_data, cleaning_demographic_data
from utility_functions import quality_check

Create Spark Session

In [2]:
# Start spark session
spark = (SparkSession 
            .builder 
            .config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11") 
            .enableHiveSupport().getOrCreate()
        )

input_data = '../data/'
output_data = '../data/'

Create dimension table from immigration data
- Because of its size, the SAS file is not included in this repo

In [5]:
# get filepath to the immigration data file
immigration_data = os.path.join(input_data,"18-83510-I94-Data-2016/i94_{}16_sub.sas7bdat")
month = 'apr'

# read i94_immigration data file
df = (spark.read.format('com.github.saurfang.sas.spark') 
            .load(immigration_data.format(month))
    )

# clean and prep immigration data 
df = cleaning_immigration_data(df)

# extract columns to create dimension table
dim_immigration_table = (df.groupBy(['year', 'month','entry_port','destination_state',
                                    'citizenship','age','purpose',
                                    'visa_type']).agg({'count':'sum'})
                        .withColumnRenamed("sum(count)", "count")
                        ).dropna()

# write dim table to parquet files partitioned by destination_state 
dim_immigration_table.write.mode('append').partitionBy('destination_state').parquet(output_data+'dim_immigration.parquet')
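
For readers less familiar with the Spark idiom, the aggregation above can be sketched in plain Python. This is an illustration only, not the notebook's code: the sample rows are made up, only three of the eight grouping columns are used to keep it short, and `aggregate_counts` is a hypothetical helper. Spark's `groupBy` keeps null keys as their own group, and it is the trailing `dropna()` that discards them; the sketch follows the same order.

```python
from collections import defaultdict

# Subset of the real grouping keys, chosen only to keep the example short
GROUP_COLS = ("destination_state", "visa_type", "purpose")


def aggregate_counts(rows, group_cols=GROUP_COLS):
    """Sum `count` per key, then drop groups whose key contains None."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row.get(col) for col in group_cols)
        totals[key] += row["count"]

    result = []
    for key, total in totals.items():
        # Equivalent of dropna(): skip any group with a missing key value
        if any(value is None for value in key):
            continue
        result.append({**dict(zip(group_cols, key)), "count": total})
    return result


# Made-up rows: each raw record carries count = 1.0
sample_rows = [
    {"destination_state": "CA", "visa_type": "B2", "purpose": 2, "count": 1.0},
    {"destination_state": "CA", "visa_type": "B2", "purpose": 2, "count": 1.0},
    {"destination_state": "NY", "visa_type": "F1", "purpose": 3, "count": 1.0},
    {"destination_state": None, "visa_type": "B1", "purpose": 1, "count": 1.0},
]

dim_rows = aggregate_counts(sample_rows)
for row in dim_rows:
    print(row)
```

The two identical California rows collapse into one row with a count of 2.0, and the row without a destination state is dropped.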


Create dimension table from demographic data

In [6]:
# read demographic data file
df = (spark.read.format('csv')
                         .option("header","true")
                         .option("inferSchema","true")
                         .option("sep",";")
                         .load('../data/us-cities-demographics.csv')
                    )

# clean and prep demographic data 
df = cleaning_demographic_data(df)

# extract columns for dim table    
dim_demographic_table = df.select(['state_code','city','median_age','foreign_born',
                                   'total_population','race','race_count'])

# write dim demographic table to parquet files partitioned by state_code 
dim_demographic_table.write.mode('append').partitionBy('state_code').parquet(output_data+'dim_demographic.parquet')
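
`partitionBy` writes one subdirectory per distinct value of the partition column, named `column=value`. The sketch below only builds the expected directory names so the layout is easy to picture. It does not touch Spark or the file system, `partition_dirs` is a hypothetical helper, and the state codes are sample values.

```python
import posixpath


def partition_dirs(output_root, table_name, column, values):
    """Return the Hive-style partition directories expected under a table path."""
    base = posixpath.join(output_root, table_name)
    return [posixpath.join(base, f"{column}={value}") for value in sorted(set(values))]


# Sample state codes; duplicates map to the same directory
dirs = partition_dirs("../data/", "dim_demographic.parquet", "state_code", ["NY", "CA", "NY"])
for path in dirs:
    print(path)
```

Because the write mode is `append`, running the cell a second time adds a second copy of the rows to the same directories. `mode('overwrite')` is the usual choice when a rerun should replace earlier output.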


Create the fact table by joining selected columns from the dimension tables

In [7]:
# Create temporary views of the immigration and demographic data
dim_immigration_table.createOrReplaceTempView("immigration_view")
dim_demographic_table.createOrReplaceTempView("demographic_view")

# Create the fact table by joining the immigration and demographic views
fact_table = spark.sql('''
    SELECT i.year AS year,
        i.month AS month,
        i.citizenship AS origin_country,
        i.entry_port AS entry_port,
        i.visa_type AS visa_type,
        i.purpose AS visit_purpose,
        i.destination_state AS state_code,
        d.city AS city,
        d.total_population AS city_population,
        SUM(i.count) AS immigration_count
    FROM immigration_view AS i
    JOIN demographic_view AS d 
        ON (i.destination_state = d.state_code)
    GROUP BY 
        1, 2, 3, 4, 5, 6, 7, 8, 9 
    ''')

# Write fact table to parquet files partitioned by state
fact_table.write.mode("append").partitionBy("state_code").parquet(output_data+"fact_table.parquet")
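
One property of this query is worth keeping in mind when reading `immigration_count`. The join key is the state alone, so each immigration row is paired with every demographic row of that state. Assuming the demographic view holds one row per city and race (which its `race` and `race_count` columns suggest), the same arrivals are counted once per matching demographic row. The plain-Python sketch below reproduces that arithmetic on made-up data; it is an illustration, not the notebook's code.

```python
from collections import defaultdict

# Made-up data: 10 arrivals headed to California
immigration = [{"destination_state": "CA", "count": 10}]

# Made-up data: one row per (city, race), as assumed above
demographic = [
    {"state_code": "CA", "city": "Los Angeles", "total_population": 100, "race": "race_a"},
    {"state_code": "CA", "city": "Los Angeles", "total_population": 100, "race": "race_b"},
    {"state_code": "CA", "city": "San Diego", "total_population": 50, "race": "race_a"},
]

# Inner join on the state code only
joined = [
    {**i, **d}
    for i in immigration
    for d in demographic
    if i["destination_state"] == d["state_code"]
]

# GROUP BY state, city, population with SUM(count)
totals = defaultdict(int)
for row in joined:
    totals[(row["state_code"], row["city"], row["total_population"])] += row["count"]

fact = dict(totals)
for key, value in fact.items():
    print(key, value)
```

Here 10 arrivals to California become 20 for Los Angeles and 10 for San Diego, so the values are best read as state-level arrivals repeated per city row rather than arrivals to each city. Selecting distinct `state_code`, `city`, `total_population` rows from the demographic view before joining would remove the per-race repetition.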


---

#### 4.2 Data Quality Checks
 The `quality_check` function is defined in [**utility_functions.py**](./utility_functions.py). It performs two checks:
 * Count the total number of rows to confirm the table is not empty
 * Count the null values in each column of the table
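
The source of `quality_check` is not reproduced in this notebook. As a rough guide to what such a check does, here is a plain-Python sketch that works on a list of dictionaries instead of a Spark DataFrame and mirrors the messages printed below. The name `quality_check_sketch`, its return value, and the sample rows are assumptions made for illustration.

```python
def quality_check_sketch(rows, columns, description):
    """Print per-column null counts and the total row count; return both."""
    print(f"Performing quality checks for {description}")

    null_counts = {}
    for column in columns:
        # A missing key is treated the same as an explicit None
        null_counts[column] = sum(1 for row in rows if row.get(column) is None)
        print(f"Number of null values for {column} is: {null_counts[column]}")

    total_rows = len(rows)
    print(f"Total number of rows is: {total_rows}")
    return null_counts, total_rows


# Made-up rows with one missing city
sample = [
    {"state_code": "CA", "city": "Los Angeles"},
    {"state_code": "NY", "city": None},
    {"state_code": "TX", "city": "Houston"},
]

null_counts, total_rows = quality_check_sketch(sample, ["state_code", "city"], "sample table")
```

In an automated run, the returned values could be turned into hard failures, for example by asserting that `total_rows > 0` and that every entry in `null_counts` is zero.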

Quality check for **demographic** dimension table 

In [8]:
quality_check(dim_demographic_table,'demographic dimension table')

Performing quality checks for demographic dimension table
Number of null values for state_code is: 0
Number of null values for city is: 0
Number of null values for median_age is: 0
Number of null values for foreign_born is: 0
Number of null values for total_population is: 0
Number of null values for race is: 0
Number of null values for race_count is: 0
Total number of rows is: 2875


Quality check for **immigration** dimension table 

In [9]:
quality_check(dim_immigration_table, 'immigration dimension table')

Performing quality checks for immigration dimension table
Number of null values for year is: 0
Number of null values for month is: 0
Number of null values for entry_port is: 0
Number of null values for destination_state is: 0
Number of null values for citizenship is: 0
Number of null values for age is: 0
Number of null values for purpose is: 0
Number of null values for visa_type is: 0
Number of null values for count is: 0
Total number of rows is: 856233


Quality check for **fact** table

In [10]:
quality_check(fact_table, 'fact table')

Performing quality checks for fact table
Number of null values for year is: 0
Number of null values for month is: 0
Number of null values for origin_country is: 0
Number of null values for entry_port is: 0
Number of null values for visa_type is: 0
Number of null values for visit_purpose is: 0
Number of null values for state_code is: 0
Number of null values for city is: 0
Number of null values for city_population is: 0
Number of null values for immigration_count is: 0
Total number of rows is: 3233402


---

#### 4.3 Data dictionary 

**Dimension Table 1**: immigration dataset
- **year**: 4-digit calendar year of visitor's arrival
- **month**: Calendar month of visitor's arrival
- **entry_port**: 3-digit code for visitor's port of entry, as defined [here](../data/port_dictionary.txt) 
- **destination_state**: Abbreviated state code of visitor's destination
- **citizenship**: 3-digit code for visitor's country of origin, as defined [here](../data/country_code_dictionary.txt) 
- **age**: Age of visitor at the time of arrival
- **purpose**: Purpose of visit, e.g., 1: business, 2: pleasure, 3: student 
- **visa_type**: Visa types, e.g., F1, F2, B1, etc.
- **count**: Total number of arrivals

**Dimension Table 2**: demographic dataset
- **state_code**: Abbreviated state code
- **city**: City name
- **median_age**: Median age in city
- **foreign_born**: Number of foreign-born residents in city
- **total_population**: Size of population in city
- **race**: Race or ethnic group in city
- **race_count**: Size of that group's population in city

**Fact Table**: Joined from immigration & demographic datasets
- **year**: 4-digit calendar year of visitor's arrival
- **month**: Calendar month of visitor's arrival
- **origin_country**: 3-digit code for visitor's country of origin 
- **entry_port**: 3-digit code for visitor's port of entry
- **visa_type**: Visa types, e.g., F1, F2, B1, etc.
- **visit_purpose**: Purpose of visit, e.g., 1: business, 2: pleasure, 3: student 
- **state_code**: Abbreviated state code
- **city**: City name
- **city_population**: Size of population in city
- **immigration_count**: Total number of arrivals
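
The fact table's columns can be traced back to the two dimension tables through the aliases in the `SELECT` statement of section 4.1. The sketch below restates those aliases as data and checks them against the column lists in this dictionary. `FACT_LINEAGE` and `check_lineage` are hypothetical names introduced here, not part of the project code.

```python
IMMIGRATION_COLS = ["year", "month", "entry_port", "destination_state",
                    "citizenship", "age", "purpose", "visa_type", "count"]
DEMOGRAPHIC_COLS = ["state_code", "city", "median_age", "foreign_born",
                    "total_population", "race", "race_count"]

# fact column -> (source table, source column), taken from the SQL aliases
FACT_LINEAGE = {
    "year": ("immigration", "year"),
    "month": ("immigration", "month"),
    "origin_country": ("immigration", "citizenship"),
    "entry_port": ("immigration", "entry_port"),
    "visa_type": ("immigration", "visa_type"),
    "visit_purpose": ("immigration", "purpose"),
    "state_code": ("immigration", "destination_state"),
    "city": ("demographic", "city"),
    "city_population": ("demographic", "total_population"),
    "immigration_count": ("immigration", "count"),  # SUM(i.count)
}


def check_lineage(lineage):
    """Return fact columns whose source column is absent from its source table."""
    sources = {
        "immigration": set(IMMIGRATION_COLS),
        "demographic": set(DEMOGRAPHIC_COLS),
    }
    return [fact_col for fact_col, (table, col) in lineage.items()
            if col not in sources[table]]


missing = check_lineage(FACT_LINEAGE)
print(missing)
```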

---

**Next:** [Project Summary](./Step5_Summary.md)