# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import re
from pyspark.sql import SparkSession
import os
import glob
import configparser
from datetime import datetime, timedelta, date
from pyspark.sql import types as T
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, col, monotonically_increasing_id
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear
from pyspark.sql.functions import isnan, when, count
from pyspark.sql.types import IntegerType, BooleanType, DateType, LongType, StringType, FloatType

In [2]:
# Declare shared path constants

US_CITY_DEMOGRAPHICS_SOURCE_FILE_PATH = 'us-cities-demographics.csv'
I94_DICTIONARY_SOURCE_FILE_PATH = 'I94_SAS_Labels_Descriptions.SAS'


GARTHERED_DATA_PATH = 'garthered_data/'
GARTHERED_IMMIGRATION_DATA_PATH = GARTHERED_DATA_PATH + 'immigration/'
GARTHERED_US_CITY_DEMOGRAPHICS_DATA_PATH = GARTHERED_DATA_PATH + "us_cities_demogarphics.parquet"
GARTHERED_I94_COUNTRY_DATA_PATH = GARTHERED_DATA_PATH + "i94_country.parquet"
GARTHERED_I94_AIRPORT_DATA_PATH = GARTHERED_DATA_PATH + "i94_airport.parquet"
GARTHERED_I94_IMMIGRATION_MODE_DATA_PATH = GARTHERED_DATA_PATH + "i94_immigration_mode.parquet"
GARTHERED_I94_US_STATE_DATA_PATH = GARTHERED_DATA_PATH + "i94_us_state.parquet"
GARTHERED_I94_VISA_TYPE_DATA_PATH = GARTHERED_DATA_PATH + "i94_visa_type.parquet"


OUTPUT_DATA = 'output_data/'

In [3]:
# Initialize the SparkSession (SparkSession is already imported in the first cell)

spark = (
    SparkSession.builder
    .config("spark.jars.repositories", "https://repos.spark-packages.org/")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)


In [4]:
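# Convert a SAS date (number of days since 1960-01-01) to a timestamp
@udf(T.TimestampType())
def to_timestamp(d):
    # Check for None explicitly so that day 0 (1960-01-01) is not treated as missing
    if d is not None:
        return datetime(1960, 1, 1) + timedelta(days=int(d))
    return None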
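
The date conversion can be checked outside Spark with a small worked example. This is a minimal sketch in plain Python; the helper name `sas_days_to_datetime` is hypothetical and not part of the notebook. It assumes, as the UDF above does, that the value counts days since 1960-01-01.

```python
from datetime import datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)  # day 0 of the SAS date scale


def sas_days_to_datetime(days):
    """Convert a SAS day count to a datetime; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


# 20659.0 is the arrdate value shown in the sample immigration row
print(sas_days_to_datetime(20659.0))  # 2016-07-24 00:00:00
```

The result agrees with the `arrdate` value that appears for the same row after the cast in Step 2.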

# Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?



- immigration
- us-cities

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

#### 1.1 Gather immigration data

In [7]:
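# List the files under a directory that match a glob pattern

def get_files(filepath, ext):
    """Return (absolute path, file name) pairs for files under filepath that match ext."""
    all_files = []
    for root, _dirs, _files in os.walk(filepath):
        # glob applies the pattern inside each directory visited by os.walk
        for f in glob.glob(os.path.join(root, ext)):
            all_files.append((os.path.abspath(f), os.path.basename(f)))

    return all_files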
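
To see what this kind of recursive file listing returns, here is a small sketch run against a temporary directory. It uses `pathlib.Path.rglob` as an alternative to the `os.walk` plus `glob` combination above; the function name `find_files` and the file names are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def find_files(root, pattern):
    """Return sorted (absolute_path, file_name) pairs for files under root matching pattern."""
    return sorted(
        (str(p.resolve()), p.name)
        for p in Path(root).rglob(pattern)
        if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny tree: one match at the top level, one nested, one non-match
    os.makedirs(os.path.join(tmp, "nested"))
    for rel in ("a.sas7bdat", os.path.join("nested", "b.sas7bdat"), "notes.txt"):
        Path(tmp, rel).touch()

    found = find_files(tmp, "*.sas7bdat")
    names = [name for _, name in found]
    print(names)  # ['a.sas7bdat', 'b.sas7bdat']
```

Each pair keeps the bare file name next to the absolute path, which is what the gather loop needs to name its Parquet output.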

In [12]:
# List the immigration source files

list_source_i94_immigration_file_path = get_files('../../data/18-83510-I94-Data-2016', '*.sas7bdat')

In [17]:
# Gather all data from the source folder and write it to Parquet files
# (the "header" option only applies to text formats such as CSV, so it is omitted)

for file_path in list_source_i94_immigration_file_path:
    df_spark = spark.read.format('com.github.saurfang.sas.spark').load(file_path[0])
    df_spark.write \
        .mode("overwrite") \
        .parquet(f"{GARTHERED_IMMIGRATION_DATA_PATH}/{file_path[1]}.parquet")

#### 1.2 Gather US city demographics data

In [57]:
us_city_demographics_raw_df_spark = spark.read.csv(US_CITY_DEMOGRAPHICS_SOURCE_FILE_PATH, header=True, sep=';')

In [86]:
# Rename the multi-word columns to snake case and write the data to a Parquet file

us_city_demographics_raw_df_spark \
    .withColumnRenamed("Median Age", "median_age") \
    .withColumnRenamed("Male Population", "male_population") \
    .withColumnRenamed("Female Population", "female_population") \
    .withColumnRenamed("Total Population", "total_population") \
    .withColumnRenamed("Number of Veterans", "number_of_veterans") \
    .withColumnRenamed("Average Household Size", "average_household_size") \
    .withColumnRenamed("State Code", "state_code") \
    .write \
    .mode("overwrite") \
    .parquet(GARTHERED_US_CITY_DEMOGRAPHICS_DATA_PATH)

#### 1.3 Gather countries from dictionary

In [48]:
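# Read the label file once; the later cells slice this list by line range
with open(I94_DICTIONARY_SOURCE_FILE_PATH, 'r') as dictionary_file:
    i94_dictionary_lines = dictionary_file.readlines()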

In [65]:
i94_country_raw_df_spark = spark.createDataFrame(
    pd.DataFrame(
        [re.search(r"^\s*(\S+)\s*=\s*'\s*(\S.*\S)\s*'.*$", line).groups() for line in i94_dictionary_lines[9:298]],
        columns=['country_code', 'country_name']
    )
)

In [66]:
# Write it to parquet file

i94_country_raw_df_spark.write.mode("overwrite").parquet(GARTHERED_I94_COUNTRY_DATA_PATH)
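
The regular expression used for the country lines can be tried on a few sample lines before running it over the whole file. The sketch below is illustrative only: the sample lines are assumptions about the layout of the label file, and `parse_pairs` is a hypothetical helper. Unlike the list comprehension above, it skips lines that do not match instead of failing on `None.groups()`.

```python
import re

# Same pattern as the country cell: code = 'label'
PAIR_PATTERN = re.compile(r"^\s*(\S+)\s*=\s*'\s*(\S.*\S)\s*'.*$")


def parse_pairs(lines, pattern=PAIR_PATTERN):
    """Return (code, label) tuples, skipping lines that do not match."""
    pairs = []
    for line in lines:
        match = pattern.search(line)
        if match:
            pairs.append(match.groups())
    return pairs


# Assumed line layout, for illustration only
sample_lines = [
    "   236 =  'AFGHANISTAN'\n",
    "   101 =  'ALBANIA'\n",
    "/* a comment line that the pattern should skip */\n",
]
parsed = parse_pairs(sample_lines)
print(parsed)  # [('236', 'AFGHANISTAN'), ('101', 'ALBANIA')]
```

Skipping non-matching lines makes the parsing less sensitive to the hard-coded line ranges, at the cost of hiding lines that were expected to match.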

#### 1.4 Gather I94 airports from dictionary

In [67]:
i94_airport_raw_df_spark = spark.createDataFrame(
    pd.DataFrame(
        [re.search(r"^\s*'\s*(\S+)\s*'\s*=\s*'\s*(\S.*\S)\s*'.*$", line).groups() for line in i94_dictionary_lines[302:962]],
        columns=['airport_code', 'airport_name']
    )
)

In [68]:
# Write it to parquet file

i94_airport_raw_df_spark.write.mode("overwrite").parquet(GARTHERED_I94_AIRPORT_DATA_PATH)

#### 1.5 Gather modes from dictionary

In [69]:
i94_immigration_mode_raw_df_spark = spark.createDataFrame(
    pd.DataFrame(
        [re.search(r"^\s*(\S+)\s*=\s*'\s*(\S.*\S)\s*'.*$", line).groups() for line in i94_dictionary_lines[972:976]],
        columns=['mode_code', 'mode_name']
    )
)

In [70]:
# Write it to parquet file

i94_immigration_mode_raw_df_spark.write.mode("overwrite").parquet(GARTHERED_I94_IMMIGRATION_MODE_DATA_PATH)

#### 1.6 Gather US states from dictionary

In [71]:
i94_us_state_raw_df_spark = spark.createDataFrame(
    pd.DataFrame(
        [re.search(r"^\s*'(\S+)'\s*=\s*'\s*(\S.*\S)\s*'.*$", line).groups() for line in i94_dictionary_lines[981:1036]],
        columns=['state_code', 'state_name']
    )
)

In [72]:
# Write it to parquet file

i94_us_state_raw_df_spark.write.mode("overwrite").parquet(GARTHERED_I94_US_STATE_DATA_PATH)

#### 1.7 Gather visa types from dictionary

In [73]:
i94_visa_type_raw_df_spark = spark.createDataFrame(
    pd.DataFrame(
        [re.search(r"^\s*(\S+)\s*=\s*(\S.*\S).*$", line).groups() for line in i94_dictionary_lines[1046:1049]],
        columns=['visa_code', 'visa_type']
    )
)

In [74]:
# Write it to parquet file

i94_visa_type_raw_df_spark.write.mode("overwrite").parquet(GARTHERED_I94_VISA_TYPE_DATA_PATH)

# Step 2: Explore and Assess the Data


## 2.1. The I94 immigration data

### 1) Load data into a Spark DataFrame

In [5]:
# Load immigration data

immigration_df_spark = spark.read.option("mergeSchema", "true").parquet(f"{GARTHERED_IMMIGRATION_DATA_PATH}/*r16_sub.sas7bdat.parquet")

In [6]:
immigration_df_spark.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

In [6]:
immigration_df_spark.show(5)

+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+--------+-----------+-----------+----------+-----------+-------------+
|    cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|       admnum|fltno|visatype|validres|delete_days|delete_mexl|delete_dup|delete_visa|delete_recdup|
+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+--------+-----------+-----------+----------+-----------+-------------+
|5680949.0|2016.0|   7.0| 117.0| 117.0|    NYC|20659.0|    1.0|     NY|   null|  30.0|    3.0|  1.0|20160724|     N

### 2) Standardize data type

Cast data for these columns:

 - cicid: long
 - i94yr: integer
 - i94mon: integer
 - i94cit: integer
 - i94res: integer
 - arrdate: timestamp
 - i94mode: integer
 - i94addr: string
 - depdate: timestamp
 - i94bir: integer
 - i94visa: integer
 - biryear: integer
 - admnum: long
 



In [12]:
immigration_standardize_df_spark = immigration_df_spark\
                                    .withColumn("cicid", col("cicid").cast(LongType())) \
                                    .withColumn("i94yr", col("i94yr").cast(IntegerType())) \
                                    .withColumn("i94mon", col("i94mon").cast(IntegerType())) \
                                    .withColumn("i94cit", col("i94cit").cast(IntegerType())) \
                                    .withColumn("i94res", col("i94res").cast(IntegerType())) \
                                    .withColumn("arrdate", to_timestamp("arrdate")) \
                                    .withColumn("i94mode", col("i94mode").cast(IntegerType())) \
                                    .withColumn("depdate", to_timestamp("depdate")) \
                                    .withColumn("i94bir", col("i94bir").cast(IntegerType())) \
                                    .withColumn("i94visa", col("i94visa").cast(IntegerType())) \
                                    .withColumn("biryear", col("biryear").cast(IntegerType())) \
                                    .withColumn("admnum", col("admnum").cast(LongType()))


In [12]:
immigration_standardize_df_spark.printSchema()

root
 |-- cicid: long (nullable = true)
 |-- i94yr: integer (nullable = true)
 |-- i94mon: integer (nullable = true)
 |-- i94cit: integer (nullable = true)
 |-- i94res: integer (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: timestamp (nullable = true)
 |-- i94mode: integer (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: timestamp (nullable = true)
 |-- i94bir: integer (nullable = true)
 |-- i94visa: integer (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: integer (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: long (n

In [10]:
# Show example data

immigration_standardize_df_spark.show(5)

+-------+-----+------+------+------+-------+-------------------+-------+-------+-------------------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+----------+-----+--------+--------+-----------+-----------+----------+-----------+-------------+
|  cicid|i94yr|i94mon|i94cit|i94res|i94port|            arrdate|i94mode|i94addr|            depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|    admnum|fltno|visatype|validres|delete_days|delete_mexl|delete_dup|delete_visa|delete_recdup|
+-------+-----+------+------+------+-------+-------------------+-------+-------+-------------------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+----------+-----+--------+--------+-----------+-----------+----------+-----------+-------------+
|5680949| 2016|     7|   117|   117|    NYC|2016-07-24 00:00:

### 3) Drop rows with null values and drop duplicate rows

- Drop rows in which any of these columns is null (`dropna` does not remove empty strings):

[ 'i94cit', 'i94port', 'i94res', 'i94addr', 'i94mode' ]


- Drop duplicate rows, treating rows with the same values in these columns as duplicates:

["cicid", "admnum"]

In [13]:
immigration_df_spark_clean = immigration_standardize_df_spark\
                                .dropna(subset=['i94cit', 'i94port', 'i94res', 'i94addr', 'i94mode'])\
                                .dropDuplicates(subset=["cicid", "admnum"])
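
The cleaning rule can be illustrated without a Spark session. The following is a plain-Python sketch of the same two steps on a list of dicts; the rows are made up, and `clean_rows` is a hypothetical helper. One difference to keep in mind: this sketch keeps the first row seen for each key, whereas Spark's `dropDuplicates` does not guarantee which duplicate survives.

```python
REQUIRED_COLUMNS = ["i94cit", "i94port", "i94res", "i94addr", "i94mode"]
KEY_COLUMNS = ("cicid", "admnum")


def clean_rows(rows):
    """Drop rows with a missing required value, then keep one row per key."""
    seen_keys = set()
    kept = []
    for row in rows:
        if any(row.get(column) is None for column in REQUIRED_COLUMNS):
            continue  # a required column is missing
        key = tuple(row.get(column) for column in KEY_COLUMNS)
        if key in seen_keys:
            continue  # a row with the same (cicid, admnum) was already kept
        seen_keys.add(key)
        kept.append(row)
    return kept


# Made-up rows for illustration only
base = {"i94cit": 103, "i94port": "NEW", "i94res": 103, "i94addr": "MD", "i94mode": 1}
rows = [
    {**base, "cicid": 118, "admnum": 55436239033},
    {**base, "cicid": 118, "admnum": 55436239033},         # duplicate key
    {**base, "cicid": 120, "admnum": 1, "i94addr": None},  # missing state
]
cleaned = clean_rows(rows)
print(len(cleaned))  # 1
```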

In [22]:
immigration_df_spark_clean.show(5)

+-----+-----+------+------+------+-------+-------------------+-------+-------+-------------------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-----------+-----+--------+
|cicid|i94yr|i94mon|i94cit|i94res|i94port|            arrdate|i94mode|i94addr|            depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|     admnum|fltno|visatype|
+-----+-----+------+------+------+-------+-------------------+-------+-------+-------------------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-----------+-----+--------+
|  118| 2016|     4|   103|   103|    NEW|2016-04-01 00:00:00|      1|     MD|2016-04-05 00:00:00|    53|      2|  1.0|20160401|    null| null|      G|      O|   null|      M|   1963|06292016|     F|  null|     LH|55436239033|00402|      WT|
|  120| 2016|     4|   103|   10

## 2.2. The US city demographics

### 1) Load data into a Spark DataFrame

In [17]:
# Load data

us_cities_demographics_df_spark = spark.read.option("mergeSchema", "true").parquet(GARTHERED_US_CITY_DEMOGRAPHICS_DATA_PATH)

In [89]:
us_cities_demographics_df_spark.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- median_age: string (nullable = true)
 |-- male_population: string (nullable = true)
 |-- female_population: string (nullable = true)
 |-- total_population: string (nullable = true)
 |-- number_of_veterans: string (nullable = true)
 |-- Foreign-born: string (nullable = true)
 |-- average_household_size: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: string (nullable = true)



In [10]:
# Show example data

us_cities_demographics_df_spark.show(5)

+----------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+-----+
|            City|        State|median_age|male_population|female_population|total_population|number_of_veterans|Foreign-born|average_household_size|state_code|                Race|Count|
+----------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+-----+
|   Silver Spring|     Maryland|      33.8|          40601|            41862|           82463|              1562|       30908|                   2.6|        MD|  Hispanic or Latino|25924|
|          Quincy|Massachusetts|      41.0|          44129|            49500|           93629|              4147|       32935|                  2.39|        MA|               White|58723|
|          Hoover|      Alabama|      38.5|          38040| 

### 2) Standardize column names

Rename the following columns to snake case:

In [18]:
us_cities_demographics_standardize_df_spark = us_cities_demographics_df_spark\
                                                .withColumnRenamed("City", "city") \
                                                .withColumnRenamed("State", "state") \
                                                .withColumnRenamed("Median Age", "median_age") \
                                                .withColumnRenamed("Male Population", "male_population") \
                                                .withColumnRenamed("Female Population", "female_population") \
                                                .withColumnRenamed("Total Population", "total_population") \
                                                .withColumnRenamed("Number of Veterans", "number_of_veterans") \
                                                .withColumnRenamed("Foreign-born", "foreign_born") \
                                                .withColumnRenamed("Average Household Size", "average_household_size") \
                                                .withColumnRenamed("State Code", "state_code") \
                                                .withColumnRenamed("Race", "race") \
                                                .withColumnRenamed("Count", "count")

In [92]:
us_cities_demographics_standardize_df_spark.printSchema()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- median_age: string (nullable = true)
 |-- male_population: string (nullable = true)
 |-- female_population: string (nullable = true)
 |-- total_population: string (nullable = true)
 |-- number_of_veterans: string (nullable = true)
 |-- foreign_born: string (nullable = true)
 |-- average_household_size: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- race: string (nullable = true)
 |-- count: string (nullable = true)



## 2.3. The I94 countries data

The I94 countries data is dictionary (lookup) data, so it does not need to be standardized or cleaned.

In [19]:
# Load data to spark dataframe

i94_country_df_spark = spark.read.option("mergeSchema", "true").parquet(GARTHERED_I94_COUNTRY_DATA_PATH)

In [95]:
i94_country_df_spark.printSchema()

root
 |-- country_code: string (nullable = true)
 |-- country_name: string (nullable = true)



In [96]:
i94_country_df_spark.show(5)

+------------+--------------------+
|country_code|        country_name|
+------------+--------------------+
|         582|MEXICO Air Sea, a...|
|         236|         AFGHANISTAN|
|         101|             ALBANIA|
|         316|             ALGERIA|
|         102|             ANDORRA|
+------------+--------------------+
only showing top 5 rows



## 2.4. The I94 Airport data

The I94 airport data is dictionary (lookup) data, so it does not need to be standardized or cleaned.

In [20]:
# Load airport data to spark dataframe

i94_airport_df_spark = spark.read.option("mergeSchema", "true").parquet(GARTHERED_I94_AIRPORT_DATA_PATH)

In [98]:
i94_airport_df_spark.printSchema()

root
 |-- airport_code: string (nullable = true)
 |-- airport_name: string (nullable = true)



In [99]:
i94_airport_df_spark.show(5)

+------------+--------------------+
|airport_code|        airport_name|
+------------+--------------------+
|         ALC|           ALCAN, AK|
|         ANC|       ANCHORAGE, AK|
|         BAR|BAKER AAF - BAKER...|
|         DAC|   DALTONS CACHE, AK|
|         PIZ|DEW STATION PT LA...|
+------------+--------------------+
only showing top 5 rows



## 2.5. The I94 Immigration Mode data

The I94 immigration mode data is dictionary (lookup) data, so it does not need to be standardized or cleaned.

In [21]:
# Load data to spark dataframe

i94_immigration_mode_df_spark = spark.read.option("mergeSchema", "true").parquet(GARTHERED_I94_IMMIGRATION_MODE_DATA_PATH)

In [101]:
i94_immigration_mode_df_spark.printSchema()

root
 |-- mode_code: string (nullable = true)
 |-- mode_name: string (nullable = true)



In [102]:
i94_immigration_mode_df_spark.show(5)

+---------+------------+
|mode_code|   mode_name|
+---------+------------+
|        1|         Air|
|        2|         Sea|
|        3|        Land|
|        9|Not reported|
+---------+------------+



## 2.6. The I94 US state

The I94 US state data is dictionary (lookup) data, so it does not need to be standardized or cleaned.

In [22]:
# Load data to dataframe

i94_us_state_df_spark = spark.read.option("mergeSchema", "true").parquet(GARTHERED_I94_US_STATE_DATA_PATH)

In [105]:
i94_us_state_df_spark.printSchema()

root
 |-- state_code: string (nullable = true)
 |-- state_name: string (nullable = true)



In [104]:
i94_us_state_df_spark.show(5)

+----------+----------+
|state_code|state_name|
+----------+----------+
|        AL|   ALABAMA|
|        AK|    ALASKA|
|        AZ|   ARIZONA|
|        AR|  ARKANSAS|
|        CA|CALIFORNIA|
+----------+----------+
only showing top 5 rows



## 2.7. The I94 Visa Type

The I94 visa type data is dictionary (lookup) data, so it does not need to be standardized or cleaned.

In [23]:
# Load data to dataframe

i94_visa_type_df_spark = spark.read.option("mergeSchema", "true").parquet(GARTHERED_I94_VISA_TYPE_DATA_PATH)

In [107]:
i94_visa_type_df_spark.printSchema()

root
 |-- visa_code: string (nullable = true)
 |-- visa_type: string (nullable = true)



In [108]:
i94_visa_type_df_spark.show(5)

+---------+---------+
|visa_code|visa_type|
+---------+---------+
|        1| Business|
|        2| Pleasure|
|        3|  Student|
+---------+---------+



# Step 3: Define the Data Model

## 3.1 Conceptual Data Model

I have developed a set of fact and dimension tables as a star schema for these reasons:
- Data analysts and other business professionals can use it to gain deeper insight into historical immigration figures, trends, and statistics.
- Because the relationships between the fact table and the dimension tables are simple, analysts can write ad hoc queries that join and aggregate the data without navigating a complex schema.

![conceptual data model](./images/conceptual-model.png)
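
As an illustration of the kind of query a star schema makes easy, the sketch below joins a fact table to one dimension and aggregates. It uses an in-memory SQLite database from the Python standard library rather than Spark, and the table contents are a few sample rows chosen for illustration; only the table and column names follow this model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visa_type_dim (visa_code INTEGER PRIMARY KEY, visa_type TEXT);
    CREATE TABLE immigration_fact (cicid INTEGER PRIMARY KEY, visa_code INTEGER);
    INSERT INTO visa_type_dim VALUES (1, 'Business'), (2, 'Pleasure'), (3, 'Student');
    INSERT INTO immigration_fact VALUES (81663, 2), (83762, 3), (247835, 2);
""")

# One join from the fact table to a dimension, then a simple aggregate
arrivals_by_visa = conn.execute("""
    SELECT d.visa_type, COUNT(*) AS arrivals
    FROM immigration_fact AS f
    JOIN visa_type_dim AS d ON f.visa_code = d.visa_code
    GROUP BY d.visa_type
    ORDER BY arrivals DESC, d.visa_type
""").fetchall()
conn.close()

print(arrivals_by_visa)  # [('Pleasure', 2), ('Student', 1)]
```

Every dimension is one join away from the fact table, so questions such as arrivals by state, airport, or travel mode follow the same pattern.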

## 3.2 Mapping Out Data Pipelines

- Gather the data from the source files
- Load the data into Spark DataFrames
- Create Fact tables
- Create Dimension tables
- Write data into parquet files


Fact table:

- immigration_fact

Dimension tables:

- us_state_demographics_dim
- country_dim
- airport_dim
- immigration_mode_dim
- us_state_dim
- visa_type_dim

# Step 4: Run Pipelines to Model the Data 


## 4.1 Create the data model


### 1) Build the immigration fact table

Build the immigration fact table by selecting the columns that are useful for analysis.

In [14]:
# Create view for immigration table

immigration_df_spark_clean.createOrReplaceTempView("immigration_view")

In [15]:
# Select the useful columns from the immigration data and give them readable names

immigration_fact_table = spark.sql("""
        SELECT  DISTINCT cicid,
                         i94yr   AS year,
                         i94mon  AS month,
                         i94cit  AS country_code,
                         i94res  AS residence,
                         i94port AS airport_code,
                         arrdate AS arrival_date,
                         i94mode AS mode_code, 
                         i94addr AS state_code,
                         depdate AS departure_date,
                         i94bir  AS age,
                         i94visa AS visa_code,
                         biryear AS birth_year,
                         gender  AS gender,
                         airline AS airline,
                         admnum  AS admission_number,
                         fltno   AS flight_number
        FROM immigration_view
    """)

- Final schema for fact table

In [47]:
immigration_fact_table.printSchema()

root
 |-- cicid: long (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- country_code: integer (nullable = true)
 |-- residence: integer (nullable = true)
 |-- city_code: string (nullable = true)
 |-- airport_code: string (nullable = true)
 |-- arrival_date: timestamp (nullable = true)
 |-- mode_code: integer (nullable = true)
 |-- state_code: string (nullable = true)
 |-- departure_date: timestamp (nullable = true)
 |-- age: integer (nullable = true)
 |-- visa_code: integer (nullable = true)
 |-- birth_year: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admission_number: long (nullable = true)
 |-- flight_number: string (nullable = true)



 - Example data

In [48]:
immigration_fact_table.show(5)

+------+----+-----+------------+---------+---------+------------+-------------------+---------+----------+-------------------+---+---------+----------+------+-------+----------------+-------------+
| cicid|year|month|country_code|residence|city_code|airport_code|       arrival_date|mode_code|state_code|     departure_date|age|visa_code|birth_year|gender|airline|admission_number|flight_number|
+------+----+-----+------------+---------+---------+------------+-------------------+---------+----------+-------------------+---+---------+----------+------+-------+----------------+-------------+
| 39577|2016|    3|         209|      209|      HHW|         HHW|2016-03-01 00:00:00|        1|        HI|2016-03-05 00:00:00| 20|        2|      1996|     M|     JL|     53655815733|        00786|
| 42195|2016|    4|         135|      135|      TAM|         TAM|2016-04-01 00:00:00|        1|        FL|2016-04-16 00:00:00| 80|        2|      1936|     F|     BA|     55443641033|        02167|
|206889|20

- Write the immigration fact table as Parquet data

In [75]:
immigration_fact_table_path = OUTPUT_DATA + "immigration_fact_table.parquet"
immigration_fact_table.write.mode("overwrite").parquet(immigration_fact_table_path)

### 2) Build the US state demographics dimension table

Build the US state demographics dimension table from the US city demographics source data.

- Final Schema

In [19]:
us_cities_demographics_standardize_df_spark.printSchema()

root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- median_age: string (nullable = true)
 |-- male_population: string (nullable = true)
 |-- female_population: string (nullable = true)
 |-- total_population: string (nullable = true)
 |-- number_of_veterans: string (nullable = true)
 |-- foreign_born: string (nullable = true)
 |-- average_household_size: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- race: string (nullable = true)
 |-- count: string (nullable = true)



- Example data

In [20]:
us_cities_demographics_standardize_df_spark.show(5)

+----------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+-----+
|            city|        state|median_age|male_population|female_population|total_population|number_of_veterans|foreign_born|average_household_size|state_code|                race|count|
+----------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+-----+
|   Silver Spring|     Maryland|      33.8|          40601|            41862|           82463|              1562|       30908|                   2.6|        MD|  Hispanic or Latino|25924|
|          Quincy|Massachusetts|      41.0|          44129|            49500|           93629|              4147|       32935|                  2.39|        MA|               White|58723|
|          Hoover|      Alabama|      38.5|          38040| 

- Write the city demographics dimension table as Parquet data

In [14]:
us_state_demographics_dim_table_path = OUTPUT_DATA + "us_state_demographics_dim_table.parquet"
us_cities_demographics_standardize_df_spark.write.mode("overwrite").parquet(us_state_demographics_dim_table_path)

### 3) Build dimension country table

- Final schema

In [33]:
i94_country_df_spark.printSchema()

root
 |-- country_code: string (nullable = true)
 |-- country_name: string (nullable = true)



- Example data

In [21]:
i94_country_df_spark.show(5)

+------------+--------------------+
|country_code|        country_name|
+------------+--------------------+
|         582|MEXICO Air Sea, a...|
|         236|         AFGHANISTAN|
|         101|             ALBANIA|
|         316|             ALGERIA|
|         102|             ANDORRA|
+------------+--------------------+
only showing top 5 rows



- Write the country dimension table as Parquet data

In [77]:
country_dim_table_path = OUTPUT_DATA + "country_dim_table.parquet"
i94_country_df_spark.write.mode("overwrite").parquet(country_dim_table_path)

### 4) Build dimension airport table

- Final schema

In [49]:
i94_airport_df_spark.printSchema()

root
 |-- airport_code: string (nullable = true)
 |-- airport_name: string (nullable = true)



- Example data

In [50]:
i94_airport_df_spark.show(5)

+------------+--------------------+
|airport_code|        airport_name|
+------------+--------------------+
|         ALC|           ALCAN, AK|
|         ANC|       ANCHORAGE, AK|
|         BAR|BAKER AAF - BAKER...|
|         DAC|   DALTONS CACHE, AK|
|         PIZ|DEW STATION PT LA...|
+------------+--------------------+
only showing top 5 rows



- Write the airport dimension table as Parquet data

In [51]:
airport_dim_table_path = OUTPUT_DATA + "airport_dim_table.parquet"
i94_airport_df_spark.write.mode("overwrite").parquet(airport_dim_table_path)

### 5) Build dimension immigration mode table

- Final schema

In [36]:
i94_immigration_mode_df_spark.printSchema()

root
 |-- mode_code: string (nullable = true)
 |-- mode_name: string (nullable = true)



- Example data

In [24]:
i94_immigration_mode_df_spark.show(5)

+---------+------------+
|mode_code|   mode_name|
+---------+------------+
|        1|         Air|
|        2|         Sea|
|        3|        Land|
|        9|Not reported|
+---------+------------+



- Write the immigration mode dimension table as Parquet data

In [79]:
immigration_mode_dim_table_path = OUTPUT_DATA + "immigration_mode_dim_table.parquet"
i94_immigration_mode_df_spark.write.mode("overwrite").parquet(immigration_mode_dim_table_path)

### 6) Build dimension us state table

- Final schema

In [25]:
i94_us_state_df_spark.printSchema()

root
 |-- state_code: string (nullable = true)
 |-- state_name: string (nullable = true)



- Example data

In [27]:
i94_us_state_df_spark.show(5)

+----------+----------+
|state_code|state_name|
+----------+----------+
|        AL|   ALABAMA|
|        AK|    ALASKA|
|        AZ|   ARIZONA|
|        AR|  ARKANSAS|
|        CA|CALIFORNIA|
+----------+----------+
only showing top 5 rows



- Write the US state dimension table as Parquet data

In [80]:
us_state_dim_table_path = OUTPUT_DATA + "us_state_dim_table.parquet"
i94_us_state_df_spark.write.mode("overwrite").parquet(us_state_dim_table_path)

### 7) Build dimension visa type table

- Final schema

In [29]:
i94_visa_type_df_spark.printSchema()

root
 |-- visa_code: string (nullable = true)
 |-- visa_type: string (nullable = true)



- Example data

In [32]:
i94_visa_type_df_spark.show(5)

+---------+---------+
|visa_code|visa_type|
+---------+---------+
|        1| Business|
|        2| Pleasure|
|        3|  Student|
+---------+---------+



- Write the visa type dimension table as Parquet data

In [81]:
visa_type_dim_table_path = OUTPUT_DATA + "visa_type_dim_table.parquet"
i94_visa_type_df_spark.write.mode("overwrite").parquet(visa_type_dim_table_path)

## 4.2 Data Quality Checks

Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
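
A unique-key check, one of the integrity constraints listed above, could look roughly like the sketch below. It works on plain Python tuples so that it runs without Spark; the helper name `find_duplicate_keys` and the sample keys are hypothetical. In Spark, a comparable check would usually group by the key columns and filter on the group count.

```python
from collections import Counter


def find_duplicate_keys(keys):
    """Return the sorted list of keys that occur more than once."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Made-up (cicid, admnum) pairs for illustration only
sample_keys = [(118, 55436239033), (120, 55436239034), (118, 55436239033)]
duplicates = find_duplicate_keys(sample_keys)
print(duplicates)  # [(118, 55436239033)]
```

An empty result means the key is unique; any other result lists the offending keys so they can be inspected.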
 
Run Quality Checks

### 1) Check existing data

In [35]:
def check_existing_data(df, df_name):
    """
    Print a confirmation if the DataFrame has at least one row;
    otherwise raise ValueError.
    """
    if df.count() > 0:
        print(f"Existed data on dataframe: {df_name}.")
    else:
        raise ValueError(f"Not existed data on dataframe: {df_name}")



In [37]:
list_dataframe_to_check = [
    [immigration_fact_table, "immigration_fact_table"],
    [i94_country_df_spark, "i94_country_df_spark"],
    [i94_airport_df_spark, "i94_airport_df_spark"],
    [i94_immigration_mode_df_spark, "i94_immigration_mode_df_spark"],
    [i94_us_state_df_spark, "i94_us_state_df_spark"],
    [i94_visa_type_df_spark, "i94_visa_type_df_spark"]
]

for df in list_dataframe_to_check:
    check_existing_data(*df)


Existed data on dataframe: immigration_fact_table.
Existed data on dataframe: i94_country_df_spark.
Existed data on dataframe: i94_airport_df_spark.
Existed data on dataframe: i94_immigration_mode_df_spark.
Existed data on dataframe: i94_us_state_df_spark.
Existed data on dataframe: i94_visa_type_df_spark.


### 2) Checking relational data

In [None]:
# Check the relation between the immigration data and the visa types

immigration_joined = immigration_fact_table.join(i94_visa_type_df_spark, 
                            (immigration_fact_table.visa_code == i94_visa_type_df_spark.visa_code)
                    )

In [39]:
immigration_joined.printSchema()

root
 |-- cicid: long (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- country_code: integer (nullable = true)
 |-- residence: integer (nullable = true)
 |-- airport_code: string (nullable = true)
 |-- arrival_date: timestamp (nullable = true)
 |-- mode_code: integer (nullable = true)
 |-- state_code: string (nullable = true)
 |-- departure_date: timestamp (nullable = true)
 |-- age: integer (nullable = true)
 |-- visa_code: integer (nullable = true)
 |-- birth_year: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admission_number: long (nullable = true)
 |-- flight_number: string (nullable = true)
 |-- visa_code: string (nullable = true)
 |-- visa_type: string (nullable = true)



In [45]:
immigration_joined.select(col('cicid'), col('visa_type')).show(5)

+------+---------+
| cicid|visa_type|
+------+---------+
| 81663| Pleasure|
| 83762|  Student|
|247835| Pleasure|
|267801| Pleasure|
|276293| Pleasure|
+------+---------+
only showing top 5 rows
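
An inner join only shows the rows that matched, so it cannot reveal fact rows whose code is missing from the dimension. A complementary check lists the unmatched codes. The sketch below does this in plain Python with hypothetical names and sample values; it compares codes as strings because `visa_code` is an integer in the fact table and a string in the dimension. In Spark, a join with `how="left_anti"` would serve the same purpose.

```python
def find_orphan_codes(fact_codes, dim_codes):
    """Return fact codes (as strings) that have no match in the dimension."""
    known = {str(code) for code in dim_codes}
    present = {str(code) for code in fact_codes if code is not None}
    return sorted(present - known)


# Sample values for illustration only; 7 is a deliberately unknown code
orphans = find_orphan_codes([2, 3, 2, 7, None], ["1", "2", "3"])
print(orphans)  # ['7']
```

An empty list means every non-null code in the fact table resolves to a dimension row.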



## 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.
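
One possible starting point for the data dictionary is to record where each fact-table field comes from. The sketch below lists only the field-to-source mapping defined by the `SELECT` statement in section 4.1; descriptions would still need to be written by hand. The names `IMMIGRATION_FACT_SOURCES` and `render_data_dictionary` are hypothetical.

```python
# Field name in immigration_fact -> column name in the I94 source data
IMMIGRATION_FACT_SOURCES = {
    "cicid": "cicid",
    "year": "i94yr",
    "month": "i94mon",
    "country_code": "i94cit",
    "residence": "i94res",
    "airport_code": "i94port",
    "arrival_date": "arrdate",
    "mode_code": "i94mode",
    "state_code": "i94addr",
    "departure_date": "depdate",
    "age": "i94bir",
    "visa_code": "i94visa",
    "birth_year": "biryear",
    "gender": "gender",
    "airline": "airline",
    "admission_number": "admnum",
    "flight_number": "fltno",
}


def render_data_dictionary(table_name, sources):
    """Render the mapping as a Markdown table, one row per field."""
    lines = [f"Table: {table_name}", "", "| Field | Source column |", "| --- | --- |"]
    lines += [f"| {field} | {source} |" for field, source in sources.items()]
    return "\n".join(lines)


dictionary_text = render_data_dictionary("immigration_fact", IMMIGRATION_FACT_SOURCES)
print(dictionary_text)
```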

# Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
  * The database needed to be accessed by 100+ people.