# Project 08 - Analysis of U.S. Immigration (I-94) Data
### Udacity Data Engineer - Capstone Project
> by Peter Wissel | 2021-04-03

## Project Overview
This project works with a data set for immigration to the United States. The supplementary datasets will include data on
airport codes, U.S. city demographics and temperature data.

The following process is divided into five sub-steps to illustrate how to answer the questions set by the business
analytics team.

This project file covers the following steps:
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data

##### 3.1.2. At what airports do foreign persons arrive for immigration to the U.S.? [(Data pipeline)](#question2_data_pipeline) <a name="question2_description"></a>
**Airport dimension**
1. Clean data and create staging table `st_immigration_airports` from file
   [`I94_SAS_Labels_I94PORT.txt`](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt)
   with the columns `st_ia_airport_code` as referencing column, `st_ia_airport_name` and `st_ia_airport_state_code`.

    Note that the I-94 airport code is **not** the same as the [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) code and
    does not correspond to it. For example, the I-94 list uses `SFR` (I94: 'SFR' = 'SAN FRANCISCO, CA') for San
    Francisco Airport instead of the IATA code `SFO`. As an IATA code, `SFR` normally denotes San Fernando, CA, USA.

    **Project decision:** Data from file [airport-codes.csv](../P8_capstone_resource_files/airport-codes_csv.csv) will **not** be linked to the
    I-94 airport codes, because matching on the code alone would produce incorrect assignments.
2. Add the column `st_i94_port_state_code` to staging table `st_i94_immigrations`, derived from staging table `st_immigration_airports`
   (`st_ia_airport_state_code --> st_i94_port_state_code`). This information is needed to connect the
   `us-cities-demographics.json` file later on.
3. Add column `st_i94_port_state_code --> f_i94_port_state_code` to fact table `f_i94_immigrations`
4. Create dimension `d_immigration_airports` based on staging table `st_immigration_airports`.
5. Map dimension `d_immigration_airports` to fact table `f_i94_immigrations` based on columns
   (`st_immigration_airports.st_ia_airport_code` --> `d_immigration_airports.d_ia_id`) ==
   (`st_i94_immigrations.st_i94_port` --> `f_i94_immigrations.d_ia_id`).
6. Answer Project Question 2: At what airports do foreign persons arrive for immigration to the U.S.?


##### 4.1.2. At what airports do foreign persons arrive for immigration to the U.S.? [(Description)](#question2_description) <a name="question2_data_pipeline"></a>
**Airport dimension**
1. Clean data and create staging table `st_immigration_airports` from file
   [`I94_SAS_Labels_I94PORT.txt`](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt)
   with the columns `st_ia_airport_code` as referencing column, `st_ia_airport_name` and `st_ia_airport_state_code`.

    Note that the I-94 airport code is **not** the same as the [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) code and
    does not correspond to it. For example, the I-94 list uses `SFR` (I94: 'SFR' = 'SAN FRANCISCO, CA') for San
    Francisco Airport instead of the IATA code `SFO`. As an IATA code, `SFR` normally denotes San Fernando, CA, USA.

    **Project decision:** Data from file [airport-codes.csv](../P8_capstone_resource_files/airport-codes_csv.csv) will **not** be linked to the
    I-94 airport codes, because matching on the code alone would produce incorrect assignments.
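
The project decision can be illustrated with a minimal sketch in plain Python. The two lookup tables are illustrative assumptions built only from the example in the note above (they are not read from the project files), and `naive_link` is a hypothetical helper that shows what a join on the bare code would return.

```python
# Illustrative values only, taken from the note above (not read from the project files).
i94_ports = {"SFR": "SAN FRANCISCO, CA"}
iata_codes = {"SFO": "San Francisco, CA", "SFR": "San Fernando, CA"}


def naive_link(i94_code):
    """Look up an I-94 port code as if it were an IATA code (the join the project avoids)."""
    return iata_codes.get(i94_code)


for code, i94_name in i94_ports.items():
    print(f"I-94 {code} = {i94_name!r}, naive IATA match = {naive_link(code)!r}")
```

The lookup succeeds, but it returns a different place than the I-94 label describes, which is the kind of silent mismatch the decision is meant to rule out.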

In [1]:
###### Imports and Installs section
import shutil
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType, LongType, TimestampType, DateType
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, DataFrameNaFunctions
from pyspark.sql.functions import when, count, col, to_date, datediff, date_format, month
import re
import json
from os import path

MAX_MEMORY = "5g"

# Note: a second `.appName(...)` call would override the first one, so the name is set only once.
spark = SparkSession \
    .builder \
    .appName("etl pipeline for project 8 - I94 data") \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:3.0.0-s_2.12") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .enableHiveSupport() \
    .getOrCreate()

# setting the current LOG-Level
spark.sparkContext.setLogLevel('ERROR')

In [2]:
"""
Next Steps: Carefully clean list of airports
1. read all available information from file
2. filter all elements on different regex conditions and store them into a new data frame called `df_st_immigration_airports`
3. store cleaned data frame `df_st_immigration_airports` to disk
"""

# path of txt file
filepath_immigration_airports = "../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt"

# read txt file into data frame
df_txt_immigration_airports_raw = spark.read.text(filepath_immigration_airports)

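# Strict pattern: requires a comma between airport name and state code --> less error-prone --> 582 entries
# Note: `{2,3}` sits inside the character class, so it matches the literal characters `{`, `2`, `,`, `3`, `}`
# instead of acting as a quantifier. The pattern is kept unchanged so that the counts below stay reproducible.
regex_cleaned = r"^\s+'([.\w{2,3} ]*)'\s+=\s+'([\w -.\/]*),\s* ([\w\/]+)"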

df_st_immigration_airports_regex_cleaned = df_txt_immigration_airports_raw\
    .select( F.regexp_extract('value',regex_cleaned, 1).alias('st_ia_airport_code'),
             F.regexp_extract('value',regex_cleaned, 2).alias('st_ia_airport_name'),
             F.regexp_extract('value',regex_cleaned, 3).alias('st_ia_airport_state_code')) \
    .drop_duplicates() \
    .filter("st_ia_airport_code != ''")  \
    .sort("st_ia_airport_state_code", "st_ia_airport_code") \
    .select("st_ia_airport_code", "st_ia_airport_name", "st_ia_airport_state_code")

print(df_st_immigration_airports_regex_cleaned.count())
df_st_immigration_airports_regex_cleaned.show(10, False)

582
+------------------+------------------------+------------------------+
|st_ia_airport_code|st_ia_airport_name      |st_ia_airport_state_code|
+------------------+------------------------+------------------------+
|5KE               |KETCHIKAN               |AK                      |
|ALC               |ALCAN                   |AK                      |
|ANC               |ANCHORAGE               |AK                      |
|BAR               |BAKER AAF - BAKER ISLAND|AK                      |
|DAC               |DALTONS CACHE           |AK                      |
|DTH               |DUTCH HARBOR            |AK                      |
|EGL               |EAGLE                   |AK                      |
|FRB               |FAIRBANKS               |AK                      |
|HOM               |HOMER                   |AK                      |
|HYD               |HYDER                   |AK                      |
+------------------+------------------------+------------------------+
only showing top 10 rows

In [3]:
# Lenient pattern: the comma is optional, so malformed entries such as `Collapsed (BUF)` are captured too --> 660 entries
regex = r"^\s+'([.\w{2,3} ]*)'\s+=\s+'([\w -.\/]*)\s*,*\s* ([\w\/]+)"

df_st_immigration_airports = df_txt_immigration_airports_raw\
    .select( F.regexp_extract('value',regex, 1).alias('st_ia_airport_code'),
             F.regexp_extract('value',regex, 2).alias('st_ia_airport_name'),
             F.regexp_extract('value',regex, 3).alias('st_ia_airport_state_code')) \
    .drop_duplicates() \
    .filter("st_ia_airport_code != ''")  \
    .sort("st_ia_airport_state_code", "st_ia_airport_code")

print(df_st_immigration_airports.count())
df_st_immigration_airports.show(1000, False)

660
+------------------+------------------------------+------------------------+
|st_ia_airport_code|st_ia_airport_name            |st_ia_airport_state_code|
+------------------+------------------------------+------------------------+
|BUS               |Collapsed (BUF)               |06/15                   |
|FRG               |Collapsed (FOK)               |06/15                   |
|HRL               |Collapsed (HLG)               |06/15                   |
|IAG               |Collapsed (NIA)               |06/15                   |
|ISP               |Collapsed (FOK)               |06/15                   |
|JSJ               |Collapsed (SAJ)               |06/15                   |
|PHN               |Collapsed (PHU)               |06/15                   |
|STN               |Collapsed (STR)               |06/15                   |
|T01               |Collapsed (SEA)               |06/15                   |
|VMB               |Collapsed (VNB)               |06/15                
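
To make the difference between the two patterns visible without a Spark session, here is a minimal sketch using Python's `re` module. Both patterns are copied from the cells above; the two sample lines are hypothetical and only modelled on the layout of the label file, and `parse_port_line` is an illustrative helper. Spark's `regexp_extract` uses Java regular expressions, so this only approximates the behaviour for these simple inputs.

```python
import re

# Patterns copied from the cells above.
REGEX_CLEANED = r"^\s+'([.\w{2,3} ]*)'\s+=\s+'([\w -.\/]*),\s* ([\w\/]+)"
REGEX_ALL = r"^\s+'([.\w{2,3} ]*)'\s+=\s+'([\w -.\/]*)\s*,*\s* ([\w\/]+)"

# Hypothetical sample lines, modelled on the layout of the label file.
SAMPLE_LINES = [
    "   'ANC'\t=\t'ANCHORAGE, AK'",
    "   'BUS'\t=\t'Collapsed (BUF) 06/15'",
]


def parse_port_line(line, pattern):
    """Return (code, name, state) for a matching line, or None if the pattern does not match."""
    match = re.match(pattern, line)
    return match.groups() if match else None


for line in SAMPLE_LINES:
    print("strict :", parse_port_line(line, REGEX_CLEANED))
    print("lenient:", parse_port_line(line, REGEX_ALL))
```

In this sketch the strict pattern rejects the line without a comma, while the lenient pattern accepts it and puts `06/15` into the state-code group. The lenient pattern also leaves the trailing comma in the airport name of the regular line, which is consistent with the later cleaning step that strips commas from `st_ia_airport_name`.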

In [4]:
# Entries matched only by the lenient pattern ==> 660 - 582 = 78
df_st_immigration_airports \
    .join(df_st_immigration_airports_regex_cleaned,
          df_st_immigration_airports.st_ia_airport_code == df_st_immigration_airports_regex_cleaned.st_ia_airport_code,
          'left_anti')  \
    .show(10000, False)

+------------------+---------------------+------------------------+
|st_ia_airport_code|st_ia_airport_name   |st_ia_airport_state_code|
+------------------+---------------------+------------------------+
|BUS               |Collapsed (BUF)      |06/15                   |
|FRG               |Collapsed (FOK)      |06/15                   |
|HRL               |Collapsed (HLG)      |06/15                   |
|IAG               |Collapsed (NIA)      |06/15                   |
|ISP               |Collapsed (FOK)      |06/15                   |
|JSJ               |Collapsed (SAJ)      |06/15                   |
|PHN               |Collapsed (PHU)      |06/15                   |
|STN               |Collapsed (STR)      |06/15                   |
|T01               |Collapsed (SEA)      |06/15                   |
|VMB               |Collapsed (VNB)      |06/15                   |
|MAP               |MARIPOSA             |AZ                      |
|.GA               |No PORT              |Code  
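
The `left_anti` join above keeps the rows of the left data frame whose key has no partner in the right one. A minimal plain-Python sketch of that semantics follows; `left_anti` is a hypothetical helper and the two small row lists are illustrative stand-ins for the lenient and strict data frames.

```python
def left_anti(left_rows, right_rows, key):
    """Return the rows of left_rows whose key value does not occur in right_rows."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


# Illustrative stand-ins for the lenient (660 rows) and strict (582 rows) data frames.
lenient = [
    {"st_ia_airport_code": "ANC", "st_ia_airport_name": "ANCHORAGE"},
    {"st_ia_airport_code": "BUS", "st_ia_airport_name": "Collapsed (BUF)"},
]
strict = [
    {"st_ia_airport_code": "ANC", "st_ia_airport_name": "ANCHORAGE"},
]

only_lenient = left_anti(lenient, strict, "st_ia_airport_code")
print(only_lenient)
```

The arithmetic 660 - 582 = 78 matches the anti-join result only if every airport code is unique within each data frame and every strictly parsed code also appears in the lenient set; the sketch makes the same assumption.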

In [5]:
# correct all entries that the lenient pattern did not parse cleanly
df_st_immigration_airports = df_st_immigration_airports \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r'Collapsed \(\w+\)|No PORT|UNKNOWN', 'Invalid Airport Entry').alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r'06/15|Code|POE', 'Invalid State Code').alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^DERBY LINE,.*", "DERBY LINE, VT (RT. 5)").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"5", "VT").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^LOUIS BOTHA, SOUTH", "LOUIS BOTHA").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"AFRICA", "SOUTH AFRICA").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r",", "").alias("st_ia_airport_name"),
            "st_ia_airport_state_code") \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^PASO DEL", "PASO DEL NORTE").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"NORTE", "TX").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^UNIDENTIFED AIR /?", "Invalid Airport Entry").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"^SEAPORT?", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"Abu", "Abu Dhabi").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"Dhabi", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"DOVER-AFB", "Invalid Airport Entry").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"DE", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"NOT REPORTED/UNKNOWNGALES", "NOGALES").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"AZ", "AZ").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^NOT", "Invalid Airport Entry").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"REPORTED/UNKNOWN", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"INVALID - IWAKUNI", "IWAKUNI").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"JAPAN", "JAPAN").alias("st_ia_airport_state_code")) \
    .sort("st_ia_airport_name", "st_ia_airport_code")

print(df_st_immigration_airports.count())
df_st_immigration_airports.show(1000, False)

660
+------------------+-----------------------------+------------------------+
|st_ia_airport_code|st_ia_airport_name           |st_ia_airport_state_code|
+------------------+-----------------------------+------------------------+
|ABE               |ABERDEEN                     |WA                      |
|ADS               |ADDISON AIRPORT- ADDISON     |TX                      |
|AGA               |AGANA                        |GU                      |
|AGU               |AGUADILLA                    |PR                      |
|BOI               |AIR TERM. (GOWEN FLD) BOISE  |ID                      |
|AKR               |AKRON                        |OH                      |
|CAK               |AKRON                        |OH                      |
|ALA               |ALAMAGORDO                   |NM                      |
|ALB               |ALBANY                       |NY                      |
|CHO               |ALBEMARLE CHARLOTTESVILLE    |VA                      |
|ABQ    
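
The replacements in the cell above match on column text, so each pattern is applied to every row in which it occurs. As a sketch of an alternative, the correction can be keyed on the airport code so that only the intended row changes. This is plain Python, not the notebook's method; the codes `P1` and `P2` and the mapping `STATE_OVERRIDES` are hypothetical. In Spark, the same idea could be expressed with `F.when(...).otherwise(...)` on `st_ia_airport_code`.

```python
import re

# Hypothetical rows: P1 stands for an entry that needs correcting, P2 for a regular port in the same state.
rows = [
    {"st_ia_airport_code": "P1", "st_ia_airport_state_code": "DE"},
    {"st_ia_airport_code": "P2", "st_ia_airport_state_code": "DE"},
]

# Corrections keyed by airport code (hypothetical mapping).
STATE_OVERRIDES = {"P1": "Invalid State Code"}


def fix_by_pattern(row):
    """Column-wide replacement: every state code containing 'DE' is rewritten."""
    new_state = re.sub(r"DE", "Invalid State Code", row["st_ia_airport_state_code"])
    return dict(row, st_ia_airport_state_code=new_state)


def fix_by_key(row):
    """Keyed replacement: only rows listed in STATE_OVERRIDES are rewritten."""
    new_state = STATE_OVERRIDES.get(row["st_ia_airport_code"], row["st_ia_airport_state_code"])
    return dict(row, st_ia_airport_state_code=new_state)


by_pattern = [fix_by_pattern(r) for r in rows]
by_key = [fix_by_key(r) for r in rows]
print([r["st_ia_airport_state_code"] for r in by_pattern])
print([r["st_ia_airport_state_code"] for r in by_key])
```

Whether the column-wide variant touches more rows than intended depends on the actual label file, so it is worth checking the affected rows for short patterns such as `DE` or `5`.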

In [6]:
# check whether the formerly invalid entries are cleaned correctly
# entries matched only by the lenient pattern ==> 660 - 582 = 78
df_st_immigration_airports \
    .join(df_st_immigration_airports_regex_cleaned,
          df_st_immigration_airports.st_ia_airport_code == df_st_immigration_airports_regex_cleaned.st_ia_airport_code, 'left_anti')  \
    .show(10000, False)

+------------------+---------------------+------------------------+
|st_ia_airport_code|st_ia_airport_name   |st_ia_airport_state_code|
+------------------+---------------------+------------------------+
|MAA               |Abu Dhabi            |Invalid State Code      |
|.GA               |Invalid Airport Entry|Invalid State Code      |
|060               |Invalid Airport Entry|Invalid State Code      |
|5T6               |Invalid Airport Entry|Invalid State Code      |
|74S               |Invalid Airport Entry|Invalid State Code      |
|888               |Invalid Airport Entry|Invalid State Code      |
|A2A               |Invalid Airport Entry|Invalid State Code      |
|ADU               |Invalid Airport Entry|Invalid State Code      |
|AG                |Invalid Airport Entry|Invalid State Code      |
|AKT               |Invalid Airport Entry|Invalid State Code      |
|AMT               |Invalid Airport Entry|Invalid State Code      |
|ASI               |Invalid Airport Entry|Invali

In [7]:
# Write data as new CSV file to disk
location_to_write = '../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/st_immigration_airports.csv'

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_immigration_airports \
    .coalesce(1)\
    .write\
    .mode("overwrite") \
    .csv(location_to_write, header = 'true')

In [8]:
# write df_st_immigration_airports back to stage area on file system
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_immigration_airports \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .parquet(location_to_write, compression="gzip")
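
The "delete folder if it already exists" step recurs before every write in this notebook. A minimal sketch of a reusable helper is shown below; `reset_output_dir` is a hypothetical name, and the demonstration uses a temporary directory instead of the project paths. Because the writers above already use overwrite mode, the explicit cleanup mainly acts as an extra safeguard.

```python
import shutil
import tempfile
from pathlib import Path


def reset_output_dir(location):
    """Delete `location` if it exists (directory or single file) and return it as a Path."""
    target = Path(location)
    if target.is_dir():
        shutil.rmtree(target)
    elif target.exists():
        target.unlink()
    return target


# Demonstration with a throwaway directory instead of the project paths.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "st_immigration_airports"
    out.mkdir()
    (out / "part-00000.parquet").write_text("placeholder")
    reset_output_dir(out)
    still_there = out.exists()

print(still_there)
```

Unlike a bare `shutil.rmtree`, the helper also copes with a target that is a single file rather than a directory.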

In [9]:
# Read written data frame back into memory
# st_immigration_airports:
location_st_immigration_airports = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"
df_st_immigration_airports = spark.read.parquet(location_st_immigration_airports)

# current Schema of staging table st_immigration_airports
print(df_st_immigration_airports.count())
df_st_immigration_airports.printSchema()
df_st_immigration_airports.show(10, False)


660
root
 |-- st_ia_airport_code: string (nullable = true)
 |-- st_ia_airport_name: string (nullable = true)
 |-- st_ia_airport_state_code: string (nullable = true)

+------------------+---------------------------+------------------------+
|st_ia_airport_code|st_ia_airport_name         |st_ia_airport_state_code|
+------------------+---------------------------+------------------------+
|ABE               |ABERDEEN                   |WA                      |
|ADS               |ADDISON AIRPORT- ADDISON   |TX                      |
|AGA               |AGANA                      |GU                      |
|AGU               |AGUADILLA                  |PR                      |
|BOI               |AIR TERM. (GOWEN FLD) BOISE|ID                      |
|AKR               |AKRON                      |OH                      |
|CAK               |AKRON                      |OH                      |
|ALA               |ALAMAGORDO                 |NM                      |
|ALB               |

2. Add the column `st_ia_airport_state_code --> st_i94_port_state_code` to staging table `st_i94_immigrations` based on staging
   table `st_immigration_airports`. This information is needed to connect the `us-cities-demographics.json` file later on.

In [10]:
# read df_st_i94_immigrations staging table and add column `st_i94_port_state_code` to it. Write data frame back to disk.

# Read written data frame back into memory
# st_i94_immigrations:
location_st_i94_immigrations = "../P8_capstone_resource_files/parquet_stage/PQ1/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_st_i94_immigrations)

# st_immigration_airports:
location_st_immigration_airports = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"
df_st_immigration_airports = spark.read.parquet(location_st_immigration_airports)


print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(5, False)

print(df_st_immigration_airports.count())
df_st_immigration_airports.printSchema()
df_st_immigration_airports.show(5, False)

12228839
root
 |-- st_i94_cit: integer (nullable = true)
 |-- st_i94_port: string (nullable = true)
 |-- st_i94_addr: string (nullable = true)
 |-- st_i94_arrdate: integer (nullable = true)
 |-- st_i94_arrdate_iso: date (nullable = true)
 |-- st_i94_depdate: integer (nullable = true)
 |-- st_i94_depdate_iso: date (nullable = true)
 |-- st_i94_dtadfile: date (nullable = true)
 |-- st_i94_matflag: string (nullable = true)
 |-- st_i94_count: integer (nullable = true)
 |-- st_i94_id: integer (nullable = true)
 |-- st_i94_year: integer (nullable = true)
 |-- st_i94_month: integer (nullable = true)

+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+-----------+------------+
|st_i94_cit|st_i94_port|st_i94_addr|st_i94_arrdate|st_i94_arrdate_iso|st_i94_depdate|st_i94_depdate_iso|st_i94_dtadfile|st_i94_matflag|st_i94_count|st_i94_id|st_i94_year|st_i94_month|
+----------+-----------+-------

In [11]:
########################################################################################################################
# check that st_i94_depdate_iso is 1900-01-01 (default value: no onward travel is planned) where st_i94_depdate is 0
df_st_i94_immigrations \
    .filter(df_st_i94_immigrations.st_i94_depdate == 0)\
    .show(5, False)

+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+-----------+------------+
|st_i94_cit|st_i94_port|st_i94_addr|st_i94_arrdate|st_i94_arrdate_iso|st_i94_depdate|st_i94_depdate_iso|st_i94_dtadfile|st_i94_matflag|st_i94_count|st_i94_id|st_i94_year|st_i94_month|
+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+-----------+------------+
|254       |NYC        |KY         |20661         |2016-07-26        |0             |1900-01-01        |2016-07-26     |NA            |1           |8614304  |2016       |7           |
|254       |BOS        |MA         |20661         |2016-07-26        |0             |1900-01-01        |2016-07-26     |NA            |1           |8682856  |2016       |7           |
|254       |DET        |MA         |20661         |2016-07-26        |0         
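
The relationship between the integer date columns and their `_iso` counterparts can be reproduced with a short sketch. SAS date values count days since 1960-01-01; mapping the value `0` to the sentinel `1900-01-01` is an assumption inferred from the output above, not taken from the upstream cleaning code, and `sas_to_iso` is a hypothetical helper.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)     # SAS date values count days from this date
NO_DEPARTURE = date(1900, 1, 1)  # sentinel seen in the output above for st_i94_depdate == 0


def sas_to_iso(sas_days):
    """Convert a SAS day count to a date; 0 is treated as 'no departure recorded' (assumption)."""
    if sas_days == 0:
        return NO_DEPARTURE
    return SAS_EPOCH + timedelta(days=sas_days)


print(sas_to_iso(20661))  # arrival value from the rows above
print(sas_to_iso(0))      # departure value from the rows above
```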

In [12]:
# add column `st_i94_port_state_code` to data frame st_i94_immigrations
df_st_i94_immigrations = df_st_i94_immigrations \
    .join(df_st_immigration_airports,
          [df_st_i94_immigrations.st_i94_port == df_st_immigration_airports.st_ia_airport_code], 'left_outer') \
    .drop("st_ia_airport_code", "st_ia_airport_name") \
    .withColumnRenamed("st_ia_airport_state_code", "st_i94_port_state_code")
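
A left outer join used as a lookup keeps the row count of the left table only if the join key is unique in the lookup table. The sketch below shows both the uniqueness check and the lookup in plain Python; `duplicate_keys` and `add_port_state` are hypothetical helpers, and the rows are small illustrative samples that reuse port codes seen in the outputs.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that occur more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def add_port_state(immigrations, airports):
    """Left-outer lookup: unmatched ports get None, and no immigration row is dropped."""
    state_by_code = {a["st_ia_airport_code"]: a["st_ia_airport_state_code"] for a in airports}
    return [dict(row, st_i94_port_state_code=state_by_code.get(row["st_i94_port"]))
            for row in immigrations]


# Illustrative sample rows.
airports = [
    {"st_ia_airport_code": "ANC", "st_ia_airport_state_code": "AK"},
    {"st_ia_airport_code": "NYC", "st_ia_airport_state_code": "NY"},
]
immigrations = [{"st_i94_port": "NYC"}, {"st_i94_port": "OCA"}]

print(duplicate_keys(airports, "st_ia_airport_code"))
enriched = add_port_state(immigrations, airports)
print(enriched)
```

A port code without a partner in the airport table ends up with a missing state code, which is the situation the null check in the following cells looks for.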


In [13]:
# count rows per `st_i94_port_state_code`; null values are shown as 'NA'
# (a `.sort(...)` followed by `.orderBy(...)` is replaced by the later call, so one combined ordering is used)
df_st_i94_immigrations \
    .fillna(value='NA', subset=['st_i94_port_state_code']) \
    .groupBy("st_i94_port_state_code") \
    .count() \
    .orderBy("count", "st_i94_port_state_code") \
    .show(500)

+----------------------+-------+
|st_i94_port_state_code|  count|
+----------------------+-------+
|                    NA|      1|
|              ANTILLES|      4|
|                    AR|      6|
|                 JAPAN|      7|
|                    IA|     14|
|                    OK|     15|
|                    MS|     20|
|                    NE|     35|
|            WASHINGTON|     40|
|                    KY|     61|
|                    KS|     82|
|                BRAZIL|    113|
|                    NH|    178|
|                    VA|    207|
|                    SC|    246|
|                    AL|    265|
|                Canada|    266|
|                    WV|    432|
|                    WI|    999|
|                    IN|   1720|
|                    NM|   1812|
|                    TN|   2388|
|                    CT|   2612|
|                    MO|   2761|
|               BERMUDA|   3084|
|                    RI|   3085|
|                    ID|   3086|
|         

In [14]:
# get entry with null value
df_st_i94_immigrations \
    .filter(col("st_i94_port_state_code").isNull()).show()

+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+-----------+------------+----------------------+
|st_i94_cit|st_i94_port|st_i94_addr|st_i94_arrdate|st_i94_arrdate_iso|st_i94_depdate|st_i94_depdate_iso|st_i94_dtadfile|st_i94_matflag|st_i94_count|st_i94_id|st_i94_year|st_i94_month|st_i94_port_state_code|
+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+-----------+------------+----------------------+
|       117|        OCA|         NY|         20682|        2016-08-16|             0|        1900-01-01|     2016-08-16|            NA|           1| 10124860|       2016|           8|                  null|
+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+-----------+--

In [15]:
# get status
print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(5, False)

12228839
root
 |-- st_i94_cit: integer (nullable = true)
 |-- st_i94_port: string (nullable = true)
 |-- st_i94_addr: string (nullable = true)
 |-- st_i94_arrdate: integer (nullable = true)
 |-- st_i94_arrdate_iso: date (nullable = true)
 |-- st_i94_depdate: integer (nullable = true)
 |-- st_i94_depdate_iso: date (nullable = true)
 |-- st_i94_dtadfile: date (nullable = true)
 |-- st_i94_matflag: string (nullable = true)
 |-- st_i94_count: integer (nullable = true)
 |-- st_i94_id: integer (nullable = true)
 |-- st_i94_year: integer (nullable = true)
 |-- st_i94_month: integer (nullable = true)
 |-- st_i94_port_state_code: string (nullable = true)

+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+-----------+------------+----------------------+
|st_i94_cit|st_i94_port|st_i94_addr|st_i94_arrdate|st_i94_arrdate_iso|st_i94_depdate|st_i94_depdate_iso|st_i94_dtadfile|st_i94_matflag|st_

In [16]:
# write st_i94_immigrations back to file system
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ2/st_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_i94_immigrations \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .partitionBy("st_i94_year", "st_i94_month") \
    .parquet(location_to_write, compression="gzip")

3. Add new column `st_i94_port_state_code --> f_i94_port_state_code` to existing fact table `f_i94_immigrations`.

In [17]:
# Read data frames back into memory
# st_i94_immigrations with column `st_i94_port_state_code`:
location_st_i94_immigrations = "../P8_capstone_resource_files/parquet_stage/PQ2/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_st_i94_immigrations)

# f_i94_immigrations:
location_f_i94_immigrations = "../P8_capstone_resource_files/parquet_star/PQ1/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_f_i94_immigrations)

# show current schemas
print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

12228839
root
 |-- st_i94_cit: integer (nullable = true)
 |-- st_i94_port: string (nullable = true)
 |-- st_i94_addr: string (nullable = true)
 |-- st_i94_arrdate: integer (nullable = true)
 |-- st_i94_arrdate_iso: date (nullable = true)
 |-- st_i94_depdate: integer (nullable = true)
 |-- st_i94_depdate_iso: date (nullable = true)
 |-- st_i94_dtadfile: date (nullable = true)
 |-- st_i94_matflag: string (nullable = true)
 |-- st_i94_count: integer (nullable = true)
 |-- st_i94_id: integer (nullable = true)
 |-- st_i94_port_state_code: string (nullable = true)
 |-- st_i94_year: integer (nullable = true)
 |-- st_i94_month: integer (nullable = true)

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: inte

In [18]:
# get only the needed columns to join
df_st_i94_immigrations_2_join = df_st_i94_immigrations \
    .select("st_i94_id", "st_i94_port_state_code")


# add new columns to fact table `df_f_i94_immigrations`
df_f_i94_immigrations = df_f_i94_immigrations  \
    .join(df_st_i94_immigrations_2_join, df_f_i94_immigrations.f_i94_id == df_st_i94_immigrations_2_join.st_i94_id, 'inner') \
    .drop("st_i94_id") \
    .withColumnRenamed("st_i94_port_state_code", "f_i94_port_state_code") \
    .withColumn("d_sd_id", col("f_i94_addr"))

In [19]:
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5, False)

root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)

+---------+----------+----------+-----------------+-----------------+--------------+-------------+-----------+--------+-------+-------+----------+----------+----------+-----------+---------------------+-------+
|f_i94_cit|f_i94_port|f_i94_addr|f_i94_arrdate_iso|f_i94_

In [20]:
# write fact table f_i94_immigrations (~ 109.7 MB)
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ2/f_i94_immigrations"

if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_f_i94_immigrations \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .partitionBy("f_i94_year", "f_i94_month") \
    .parquet(location_to_write, compression="gzip")

4. Create dimension `d_immigration_airports` based on staging table `st_immigration_airports`.

In [21]:
# st_immigration_airports:
location_st_immigration_airports = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"
df_d_immigration_airports = spark.read.parquet(location_st_immigration_airports)

print(df_d_immigration_airports.count())
df_d_immigration_airports.printSchema()
df_d_immigration_airports.show(5, False)

660
root
 |-- st_ia_airport_code: string (nullable = true)
 |-- st_ia_airport_name: string (nullable = true)
 |-- st_ia_airport_state_code: string (nullable = true)

+------------------+---------------------------+------------------------+
|st_ia_airport_code|st_ia_airport_name         |st_ia_airport_state_code|
+------------------+---------------------------+------------------------+
|ABE               |ABERDEEN                   |WA                      |
|ADS               |ADDISON AIRPORT- ADDISON   |TX                      |
|AGA               |AGANA                      |GU                      |
|AGU               |AGUADILLA                  |PR                      |
|BOI               |AIR TERM. (GOWEN FLD) BOISE|ID                      |
+------------------+---------------------------+------------------------+
only showing top 5 rows



In [22]:
df_d_immigration_airports = df_d_immigration_airports  \
    .withColumn("d_ia_id", df_d_immigration_airports.st_ia_airport_code) \
    .withColumnRenamed("st_ia_airport_code", "d_ia_airport_code") \
    .withColumnRenamed("st_ia_airport_name", "d_ia_airport_name") \
    .withColumnRenamed("st_ia_airport_state_code", "d_ia_airport_state_code")

df_d_immigration_airports.printSchema()
df_d_immigration_airports.show(5, False)

root
 |-- d_ia_airport_code: string (nullable = true)
 |-- d_ia_airport_name: string (nullable = true)
 |-- d_ia_airport_state_code: string (nullable = true)
 |-- d_ia_id: string (nullable = true)

+-----------------+---------------------------+-----------------------+-------+
|d_ia_airport_code|d_ia_airport_name          |d_ia_airport_state_code|d_ia_id|
+-----------------+---------------------------+-----------------------+-------+
|ABE              |ABERDEEN                   |WA                     |ABE    |
|ADS              |ADDISON AIRPORT- ADDISON   |TX                     |ADS    |
|AGA              |AGANA                      |GU                     |AGA    |
|AGU              |AGUADILLA                  |PR                     |AGU    |
|BOI              |AIR TERM. (GOWEN FLD) BOISE|ID                     |BOI    |
+-----------------+---------------------------+-----------------------+-------+
only showing top 5 rows



In [23]:
# write dimension table d_immigration_airports to disk (~ 10 kB)
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ2/d_immigration_airports"

# delete folder if it already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

# the table is tiny, so a single output file is sufficient
df_d_immigration_airports \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .parquet(location_to_write, compression="gzip")
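
The delete-before-write step above recurs for every table that is written to disk. As a minimal sketch, it could be wrapped in a small helper. The name `reset_output_dir` is hypothetical and not part of the project code, and the demo uses a throwaway temporary directory rather than the project path.

```python
import shutil
import tempfile
from pathlib import Path


def reset_output_dir(location: str) -> Path:
    """Delete the directory `location` if it exists and return it as a Path.

    Hypothetical helper for illustration; the project code inlines this logic.
    """
    target = Path(location)
    if target.exists():
        # Remove the directory together with all files left from an earlier run.
        shutil.rmtree(target)
    return target


# Demo on a temporary directory (assumed layout, not the real output folder).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "d_immigration_airports"
    demo_dir.mkdir()
    (demo_dir / "part-00000.parquet").write_text("stale data")

    reset_output_dir(str(demo_dir))
    demo_dir_exists_after_reset = demo_dir.exists()

print(demo_dir_exists_after_reset)  # False
```

Since the writer already runs in overwrite mode, the manual deletion is mostly a safeguard against leftovers from earlier runs.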


5. Map dimension `d_immigration_airports` to fact table `f_i94_immigrations` using the columns
   (`st_immigration_airports.st_ia_airport_code` --> `d_immigration_airports.d_ia_id`) ==
   (`st_i94_immigration.st_i94_port` --> `f_i94_immigrations.d_ia_id`).

6. Answer Project Question 2: At what airports do foreign persons arrive for immigration to the U.S.?
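
The mapping in step 5 is only complete if every `d_ia_id` in the fact table also exists in the dimension, because an inner join silently drops fact rows without a partner. The following plain-Python sketch illustrates such a check on made-up sample keys. The function name `unmatched_port_codes` and the sample values are assumptions for illustration; on the full data set the same idea could be expressed in Spark as a `left_anti` join.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def unmatched_port_codes(fact_keys: Iterable[str],
                         dimension_keys: Set[str]) -> Dict[str, int]:
    """Count fact rows per key that has no matching key in the dimension."""
    counts = Counter(key for key in fact_keys if key not in dimension_keys)
    return dict(counts)


# Made-up sample values, not taken from the real I-94 data.
sample_fact_keys = ["NYC", "MIA", "NYC", "XXX", "XXX", "LOS"]
sample_dimension_keys = {"NYC", "MIA", "LOS", "CHI"}

orphans = unmatched_port_codes(sample_fact_keys, sample_dimension_keys)
print(orphans)  # {'XXX': 2}
```

An empty result means that the join loses no fact rows. Any key that does show up points to a port code missing from the airport dimension.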


In [24]:
# Read the written data frames back into memory
df_f_i94_immigrations = spark.read.parquet("../P8_capstone_resource_files/parquet_star/PQ2/f_i94_immigrations")
df_d_immigration_airports = spark.read.parquet("../P8_capstone_resource_files/parquet_star/PQ2/d_immigration_airports")

# check read data frames
print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5, False)

print(df_d_immigration_airports.count())
df_d_immigration_airports.printSchema()
df_d_immigration_airports.show(5, False)

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)

+---------+----------+----------+-----------------+-----------------+--------------+-------------+-----------+--------+-------+-------+----------+----------+---------------------+-------+----------+-----------+
|f_i94_cit|f_i94_port|f_i94_addr|f_i94_arrdate_i

In [25]:
# Register data frames as Views
df_f_i94_immigrations.createOrReplaceTempView("f_i94_immigrations")
df_d_immigration_airports.createOrReplaceTempView("d_immigration_airports")

# SQL to answer project question 2 (At what airports do foreign persons arrive for immigration to the U.S.?)
df_pq2 = spark.sql(" select   d_ia.d_ia_airport_code as airport_code"
                     "       ,d_ia.d_ia_airport_name as airport_name"
                     "       ,d_ia.d_ia_airport_state_code as airport_state_code"
                     "       ,sum(f_i94.f_i94_count) as immigrants"
                     "       ,RANK() OVER (ORDER BY sum(f_i94.f_i94_count) desc) Immigration_airport_rank"
                     " from f_i94_immigrations f_i94"
                     " join d_immigration_airports d_ia on f_i94.d_ia_id = d_ia.d_ia_id"
                     " group by airport_code"
                     "       , airport_name"
                     "       , airport_state_code"
                     " order by Immigration_airport_rank asc ")

df_pq2.show(5000, False)

+------------+----------------------------+------------------+----------+------------------------+
|airport_code|airport_name                |airport_state_code|immigrants|Immigration_airport_rank|
+------------+----------------------------+------------------+----------+------------------------+
|NYC         |NEW YORK                    |NY                |1669429   |1                       |
|MIA         |MIAMI                       |FL                |1139100   |2                       |
|LOS         |LOS ANGELES                 |CA                |1134611   |3                       |
|CHI         |CHICAGO                     |IL                |792628    |4                       |
|NEW         |NEWARK/TETERBORO            |NJ                |663630    |5                       |
|SFR         |SAN FRANCISCO               |CA                |628438    |6                       |
|HOU         |HOUSTON                     |TX                |609343    |7                       |
|ATL      