# Project 08 - Analysis of U.S. Immigration (I-94) Data
### Udacity Data Engineer - Capstone Project
> by Peter Wissel | 2021-05-05

## Project Overview
This project analyzes a data set on immigration to the United States. Supplementary data sets cover
airport codes, U.S. city demographics and temperature data.

The following process is divided into different sub-steps to illustrate how to answer the questions set by the business
analytics team.

The project file covers the following steps:
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data

##### 3.1.3. At what times do foreign persons arrive for immigration to the U.S.? [(Data pipeline)](#question3_data_pipeline) <a name="question3_description"></a>
**Date dimensions**

`st_i94_arrdate` and `st_i94_depdate` from staging table `st_i94_immigrations` hold dates in the SAS-specific date format,
which counts days since 1960-01-01. These columns are converted to DateType format in the staging table
`st_i94_immigrations` as columns named `st_i94_arrdate_iso` and `st_i94_depdate_iso`.
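
The conversion can be illustrated outside of Spark with a minimal sketch. The function name `sas_to_iso` and the constants are hypothetical. Treating a missing value or `0` as "no date" and mapping it to the default date 1900-01-01 is an assumption derived from the sample output in this document, not a confirmed rule of the original ETL code.

```python
from datetime import date, timedelta
from typing import Optional

SAS_EPOCH = date(1960, 1, 1)      # day 0 of the SAS calendar
DEFAULT_DATE = date(1900, 1, 1)   # default used in this project instead of null


def sas_to_iso(sas_days: Optional[int]) -> date:
    """Convert a SAS day count into a date (hypothetical helper)."""
    # Assumption: None and 0 both mean "no date recorded".
    if not sas_days:
        return DEFAULT_DATE
    return SAS_EPOCH + timedelta(days=sas_days)


print(sas_to_iso(20454))  # 2016-01-01
print(sas_to_iso(20428))  # 2015-12-06
print(sas_to_iso(0))      # 1900-01-01
```

The values 20454 and 20428 are taken from the `st_i94_arrdate` and `st_i94_depdate` columns shown in the pipeline output.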

Read the date values from columns `st_i94_immigrations.st_i94_arrdate_iso` and `st_i94_immigrations.st_i94_depdate_iso`.
Determine a valid MIN(), MAX() and default (null value representation) date. Clean the data and rewrite staging table `st_i94_immigrations` if needed.
Finally, create two gap-free dimensions, `d_date_arrivals` and `d_date_departures`, from it.

1. Read data and get min() and max() value out of `st_i94_arrdate_iso` and `st_i94_depdate_iso`
2. Clean date column `st_i94_depdate_iso`: Valid entries are between 2016-01-01 and 2017-06-14. Earlier and later values
   will be set to the null / default value (1900-01-01)
3. Update fact table `f_i94_immigrations` with the cleaned values of column `st_i94_depdate_iso`
4. Generate new date staging tables (`st_date_arrivals`, `st_date_departures`) based on default, min and max values
5. Append date specific columns to staging tables, create a dimension out of it and store it
6. Map dimension `d_date_arrivals` to fact table `f_i94_immigrations` based on columns
   (`st_date_arrivals.st_da_date` --> `d_date_arrivals.d_da_id`) == (`st_i94_immigrations.st_i94_arrdate_iso` --> `f_i94_immigrations.d_da_id`).
7. Map dimension `d_date_departures` to fact table `f_i94_immigrations` based on columns
   (`st_date_departures.st_dd_date` --> `d_date_departures.d_dd_id`) == (`st_i94_immigrations.st_i94_depdate_iso` --> `f_i94_immigrations.d_dd_id`).
8. Answer Project Question 3.1: At what times do foreign persons arrive for immigration to the U.S.?
9. Answer Project Question 3.2: When a foreign person comes to the U.S. for immigration, do they travel on to another state?
10. Answer Project Question 3.3: If a foreign person travels to another state, after which period of time does this happen?


Both date dimensions are built from one physical table. This method is called
[Role-Playing Dimensions](https://dba.stackexchange.com/questions/137971/how-many-date-dimensions-for-one-fact)
![Role-Playing Dimension](../P8_capstone_documentation/11_P8_RolePlayingDimension.png).
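
As a rough illustration of the role-playing idea, the sketch below resolves two foreign keys of one fact row against a single date table. The reduced attribute set, the sample rows and the function `resolve_roles` are hypothetical. Only the key names `d_da_id` and `d_dd_id` come from this project.

```python
from datetime import date

# One physical date table, keyed by date (hypothetical, reduced attribute set).
date_table = {
    date(2016, 1, 1): {"year": 2016, "month": 1, "day": 1},
    date(2016, 1, 5): {"year": 2016, "month": 1, "day": 5},
}

# A fact row references the same table twice, once per role.
fact_row = {"f_i94_id": 1, "d_da_id": date(2016, 1, 1), "d_dd_id": date(2016, 1, 5)}


def resolve_roles(fact, dates):
    """Look up arrival and departure attributes from the single date table."""
    return {
        "arrival": dates[fact["d_da_id"]],
        "departure": dates[fact["d_dd_id"]],
    }


roles = resolve_roles(fact_row, date_table)
print(roles["arrival"])
print(roles["departure"])
```

The same rows serve as arrival date and as departure date. Only the key used for the lookup decides which role the date table plays.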


### Step 4: Run ETL to Model the Data
##### 4.1.3. At what times do foreign persons arrive for immigration to the U.S.? [(Data description)](#question3_description) <a name="question3_data_pipeline"></a>
**Date dimensions**

`st_i94_arrdate` and `st_i94_depdate` from staging table `st_i94_immigrations` hold dates in the SAS-specific date format,
which counts days since 1960-01-01. These columns are converted to DateType format in the staging table
`st_i94_immigrations` as columns named `st_i94_arrdate_iso` and `st_i94_depdate_iso`.

Read the date values from columns `st_i94_immigrations.st_i94_arrdate_iso` and `st_i94_immigrations.st_i94_depdate_iso`.
Determine a valid MIN(), MAX() and default (null value representation) date. Clean the data and rewrite staging table `st_i94_immigrations` if needed.
Finally, create two gap-free dimensions, `d_date_arrivals` and `d_date_departures`, from it.

1. Read data and get min() and max() value out of `st_i94_arrdate_iso` and `st_i94_depdate_iso`

In [1]:
###### Imports and Installs section
import shutil
import pandas as pd
import pyspark.sql.functions as F
# import spark as spark
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType, LongType, TimestampType, DateType
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, DataFrameNaFunctions
from pyspark.sql.functions import when, count, col, to_date, datediff, date_format, month
import re
import json
from os import path
MAX_MEMORY = "5g"

spark = SparkSession\
    .builder\
    .appName("etl pipeline for project 8 - I94 data") \
    .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12")\
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .enableHiveSupport()\
    .getOrCreate()

# setting the current LOG-Level
spark.sparkContext.setLogLevel('ERROR')

In [2]:
# Read written data frame back into memory
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ2/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_to_read)
df_st_i94_immigrations.printSchema()

root
 |-- st_i94_cit: integer (nullable = true)
 |-- st_i94_port: string (nullable = true)
 |-- st_i94_addr: string (nullable = true)
 |-- st_i94_arrdate: integer (nullable = true)
 |-- st_i94_arrdate_iso: date (nullable = true)
 |-- st_i94_depdate: integer (nullable = true)
 |-- st_i94_depdate_iso: date (nullable = true)
 |-- st_i94_dtadfile: date (nullable = true)
 |-- st_i94_matflag: string (nullable = true)
 |-- st_i94_count: integer (nullable = true)
 |-- st_i94_id: integer (nullable = true)
 |-- st_i94_port_state_code: string (nullable = true)
 |-- st_i94_year: integer (nullable = true)
 |-- st_i94_month: integer (nullable = true)



In [3]:
# Get an overview about valid data - check some different perspectives
# get valid min and max date from date_fields `st_i94_arrdate_iso` and `st_i94_depdate_iso`

st_i94_arrdate_depdate_iso = df_st_i94_immigrations.select(
    F.min(col("st_i94_arrdate_iso")).alias("st_i94_arrdate_iso_min"),
    F.max(col("st_i94_arrdate_iso")).alias("st_i94_arrdate_iso_max"),
    F.min(col("st_i94_depdate_iso")).alias("st_i94_depdate_iso_min"),
    F.max(col("st_i94_depdate_iso")).alias("st_i94_depdate_iso_max"),
)

print(st_i94_arrdate_depdate_iso)


+----------------------+----------------------+----------------------+----------------------+
|st_i94_arrdate_iso_min|st_i94_arrdate_iso_max|st_i94_depdate_iso_min|st_i94_depdate_iso_max|
+----------------------+----------------------+----------------------+----------------------+
|            2016-01-01|            2016-12-31|            1900-01-01|            2092-05-09|
+----------------------+----------------------+----------------------+----------------------+



In [4]:
# get an overview about arrdate:
# all distinct date values
print(df_st_i94_immigrations.select("st_i94_arrdate_iso").distinct().count())
# Most entries on which date?
df_st_i94_immigrations.select("st_i94_arrdate_iso").groupBy("st_i94_arrdate_iso").count().sort("count", ascending=False).show(1000, False)
# get all date values. Is there a large gap or date values out of range?
df_st_i94_immigrations.select("st_i94_arrdate_iso").groupBy("st_i94_arrdate_iso").count().sort("st_i94_arrdate_iso", ascending=True).show(1000, False)

"""
Findings:

Everything seems to be valid. Date values start on 1 January 2016 and end on 31 December 2016.
"""

366
+------------------+-----+
|st_i94_arrdate_iso|count|
+------------------+-----+
|2016-08-01        |49269|
|2016-07-23        |49171|
|2016-07-29        |49141|
|2016-07-30        |49066|
|2016-07-22        |48916|
|2016-07-28        |48358|
|2016-07-21        |48185|
|2016-07-24        |48028|
|2016-07-25        |47361|
|2016-08-04        |46963|
|2016-07-20        |46743|
|2016-07-27        |46409|
|2016-07-31        |45662|
|2016-08-05        |44889|
|2016-08-02        |44452|
|2016-07-01        |43395|
|2016-09-01        |43083|
|2016-07-09        |42250|
|2016-07-15        |42189|
|2016-07-16        |41506|
|2016-04-16        |41425|
|2016-06-30        |41376|
|2016-09-03        |41188|
|2016-07-10        |41179|
|2016-07-11        |41041|
|2016-07-07        |41009|
|2016-07-14        |40960|
|2016-07-08        |40951|
|2016-09-10        |40923|
|2016-07-02        |40850|
|2016-07-17        |40609|
|2016-09-02        |40509|
|2016-12-20        |40469|
|2016-09-11        |4044


In [5]:
# get an overview about depdate:
# all distinct date values
print(df_st_i94_immigrations.select("st_i94_depdate_iso").distinct().count())
# Most entries on which date?
df_st_i94_immigrations.select("st_i94_depdate_iso").groupBy("st_i94_depdate_iso").count().sort("count", ascending=False).show(10, False)
# get all date values. Is there a large gap or date values out of range?
df_st_i94_immigrations.select("st_i94_depdate_iso").groupBy("st_i94_depdate_iso").count().sort("st_i94_depdate_iso", ascending=True).show(1000, False)

715
+------------------+------+
|st_i94_depdate_iso|count |
+------------------+------+
|1900-01-01        |999506|
|2016-08-13        |47475 |
|2016-08-20        |47468 |
|2016-08-06        |47158 |
|2016-07-31        |46593 |
|2016-07-30        |46308 |
|2016-08-14        |46240 |
|2016-08-07        |45422 |
|2016-08-15        |44867 |
|2016-08-27        |44791 |
+------------------+------+
only showing top 10 rows

+------------------+------+
|st_i94_depdate_iso|count |
+------------------+------+
|1900-01-01        |999506|
|1961-08-21        |1     |
|1961-10-31        |1     |
|1962-08-20        |1     |
|1971-08-31        |1     |
|1971-09-04        |1     |
|1971-11-10        |1     |
|1981-09-11        |1     |
|2001-01-04        |1     |
|2001-01-16        |1     |
|2001-03-31        |1     |
|2001-07-20        |1     |
|2001-10-31        |1     |
|2006-01-08        |1     |
|2006-08-19        |1     |
|2006-10-02        |1     |
|2006-11-14        |1     |
|2006-11-17       

In [7]:
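# compare st_i94_arrdate with st_i94_depdate. A departure date earlier than the arrival date is a logical error!

# Flag the rows to be corrected in the last column:
# 1111-01-01 = depdate before 2016-01-01, 2222-01-01 = depdate after 2017-06-14, 3333-01-01 = arrdate after depdate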
df_st_i94_immigrations \
    .groupBy("st_i94_arrdate_iso", "st_i94_arrdate", "st_i94_depdate", "st_i94_depdate_iso") \
    .count() \
    .withColumn("st_i94_depdate_iso_wrong_dates",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1111-01-01")\
                .when(col("st_i94_depdate_iso") > "2017-06-14", "2222-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "3333-01-01")
                .otherwise(" ").cast(StringType())) \
    .orderBy("st_i94_arrdate_iso", "st_i94_depdate_iso") \
    .show(20)


+------------------+--------------+--------------+------------------+-----+------------------------------+
|st_i94_arrdate_iso|st_i94_arrdate|st_i94_depdate|st_i94_depdate_iso|count|st_i94_depdate_iso_wrong_dates|
+------------------+--------------+--------------+------------------+-----+------------------------------+
|        2016-01-01|         20454|             0|        1900-01-01| 3977|                    1111-01-01|
|        2016-01-01|         20454|         20428|        2015-12-06|    1|                    1111-01-01|
|        2016-01-01|         20454|         20455|        2016-01-02| 1089|                              |
|        2016-01-01|         20454|         20456|        2016-01-03|  830|                              |
|        2016-01-01|         20454|         20457|        2016-01-04|  749|                              |
|        2016-01-01|         20454|         20458|        2016-01-05|  724|                              |
|        2016-01-01|         20454|  

In [None]:
"""
Findings:
715 distinct date values ==> more distinct departure dates than the 366 days of the year 2016
==> That's possible. Many immigrants already know their departure date.

1900-01-01 (start date): This date is used as the default value instead of a null value

arrdate starts on 2016-01-01 ==> The departure date cannot be earlier than the arrival date! --> each date before
2016-01-01 must be set to 1900-01-01 as null/default value

depdate greater than 2017-06-14 is not realistic, because only very few depdate entries fall into this range of dates
==> entries must be set to 1900-01-01 (null/default)

arrdate describes the first arrival in the U.S. After that, the immigrants may decide to travel to different states in the U.S.
Conclusion: arrdate must not be later than depdate (rows with arrdate > depdate are flagged; valid example: 2016-01-01 < 2016-01-02)

The following table shows some wrong dates where arrdate > depdate
+------------------+--------------+--------------+------------------+-----+------------------------------+
|st_i94_arrdate_iso|st_i94_arrdate|st_i94_depdate|st_i94_depdate_iso|count|st_i94_depdate_iso_wrong_dates|
+------------------+--------------+--------------+------------------+-----+------------------------------+
|        2016-01-02|         20455|         20454|        2016-01-01|    1|                    3333-01-01|
|        2016-01-08|         20461|         20454|        2016-01-01|    1|                    3333-01-01|
|        2016-01-08|         20461|         20459|        2016-01-06|    2|                    3333-01-01|
|        2016-01-08|         20461|         20460|        2016-01-07|    3|                    3333-01-01|
+------------------+--------------+--------------+------------------+-----+------------------------------+
"""
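
The findings above can be condensed into one cleaning rule. The following plain-Python sketch mirrors the three conditions used in the Spark cells of this section. The function `clean_depdate` and the constant names are hypothetical and only serve as an illustration of the rule, not as a replacement for the Spark code.

```python
from datetime import date

DEFAULT_DATE = date(1900, 1, 1)   # null/default representation
VALID_FROM = date(2016, 1, 1)     # first arrival date in the data set
VALID_TO = date(2017, 6, 14)      # latest departure date considered realistic


def clean_depdate(arrdate: date, depdate: date) -> date:
    """Return depdate if it is plausible, otherwise the default date."""
    if depdate < VALID_FROM or depdate > VALID_TO or arrdate > depdate:
        return DEFAULT_DATE
    return depdate


print(clean_depdate(date(2016, 1, 1), date(2015, 12, 6)))  # 1900-01-01
print(clean_depdate(date(2016, 1, 8), date(2016, 1, 6)))   # 1900-01-01
print(clean_depdate(date(2016, 1, 1), date(2016, 1, 2)))   # 2016-01-02
```

Note that a departure on the arrival day itself passes the rule, because only `arrdate > depdate` is treated as an error.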


2. Clean date column `st_i94_depdate_iso`: Valid entries are between 2016-01-01 and 2017-06-14. Earlier and
   later values will be set to the null / default value (1900-01-01)

In [6]:
# show corrected column `st_i94_depdate_iso_corrected`
df_st_i94_immigrations \
    .groupBy("st_i94_arrdate_iso", "st_i94_arrdate", "st_i94_depdate", "st_i94_depdate_iso") \
    .count() \
    .withColumn("st_i94_depdate_iso_corrected",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1900-01-01")\
                .when(col("st_i94_depdate_iso") > "2017-06-14", "1900-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "1900-01-01")
                .otherwise(col("st_i94_depdate_iso")).cast(DateType())) \
    .orderBy("st_i94_arrdate_iso", "st_i94_depdate_iso") \
    .show(20)

+------------------+--------------+--------------+------------------+-----+----------------------------+
|st_i94_arrdate_iso|st_i94_arrdate|st_i94_depdate|st_i94_depdate_iso|count|st_i94_depdate_iso_corrected|
+------------------+--------------+--------------+------------------+-----+----------------------------+
|        2016-01-01|         20454|             0|        1900-01-01| 3977|                  1900-01-01|
|        2016-01-01|         20454|         20428|        2015-12-06|    1|                  1900-01-01|
|        2016-01-01|         20454|         20455|        2016-01-02| 1089|                  2016-01-02|
|        2016-01-01|         20454|         20456|        2016-01-03|  830|                  2016-01-03|
|        2016-01-01|         20454|         20457|        2016-01-04|  749|                  2016-01-04|
|        2016-01-01|         20454|         20458|        2016-01-05|  724|                  2016-01-05|
|        2016-01-01|         20454|         20459|     

In [7]:
# correct the date values in column `st_i94_depdate_iso`
df_st_i94_immigrations = df_st_i94_immigrations \
    .withColumn("st_i94_depdate_iso",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1900-01-01") \
                .when(col("st_i94_depdate_iso") > "2017-06-14", "1900-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "1900-01-01")
                .otherwise(col("st_i94_depdate_iso")).cast(DateType()))

In [8]:
df_st_i94_immigrations \
    .groupBy("st_i94_arrdate_iso", "st_i94_arrdate", "st_i94_depdate", "st_i94_depdate_iso") \
    .count() \
    .withColumn("st_i94_depdate_iso_wrong_dates",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1111-01-01")\
                .when(col("st_i94_depdate_iso") > "2017-06-14", "2222-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "3333-01-01")
                .otherwise(" ").cast(StringType())) \
    .orderBy("st_i94_depdate_iso_wrong_dates", ascending=False) \
    .show(20)

+------------------+--------------+--------------+------------------+-----+------------------------------+
|st_i94_arrdate_iso|st_i94_arrdate|st_i94_depdate|st_i94_depdate_iso|count|st_i94_depdate_iso_wrong_dates|
+------------------+--------------+--------------+------------------+-----+------------------------------+
|        2016-12-18|         20806|         20798|        1900-01-01|    1|                    1111-01-01|
|        2016-08-25|         20691|         20689|        1900-01-01|    4|                    1111-01-01|
|        2016-07-09|         20644|         20639|        1900-01-01|    1|                    1111-01-01|
|        2016-07-19|         20654|         20647|        1900-01-01|    2|                    1111-01-01|
|        2016-02-29|         20513|         20492|        1900-01-01|    1|                    1111-01-01|
|        2016-09-15|         20712|             0|        1900-01-01| 2106|                    1111-01-01|
|        2016-05-01|         20575|  

In [9]:
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(50)

root
 |-- st_i94_cit: integer (nullable = true)
 |-- st_i94_port: string (nullable = true)
 |-- st_i94_addr: string (nullable = true)
 |-- st_i94_arrdate: integer (nullable = true)
 |-- st_i94_arrdate_iso: date (nullable = true)
 |-- st_i94_depdate: integer (nullable = true)
 |-- st_i94_depdate_iso: date (nullable = true)
 |-- st_i94_dtadfile: date (nullable = true)
 |-- st_i94_matflag: string (nullable = true)
 |-- st_i94_count: integer (nullable = true)
 |-- st_i94_id: integer (nullable = true)
 |-- st_i94_port_state_code: string (nullable = true)
 |-- st_i94_year: integer (nullable = true)
 |-- st_i94_month: integer (nullable = true)

+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+----------------------+-----------+------------+
|st_i94_cit|st_i94_port|st_i94_addr|st_i94_arrdate|st_i94_arrdate_iso|st_i94_depdate|st_i94_depdate_iso|st_i94_dtadfile|st_i94_matflag|st_i94_count

In [10]:
# write st_i94_immigrations back to file system
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ3/st_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .partitionBy('st_i94_year', 'st_i94_month') \
    .parquet(location_to_write, compression="gzip")
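
Writing with `partitionBy` produces one sub-directory per distinct value combination, named in the Hive style `column=value`. The sketch below only illustrates this directory naming. The helper `partition_dir` is hypothetical and the base path is shortened for the example.

```python
from pathlib import PurePosixPath


def partition_dir(base: str, year: int, month: int) -> str:
    """Return the Hive-style directory for one (year, month) partition."""
    return str(PurePosixPath(base) / f"st_i94_year={year}" / f"st_i94_month={month}")


print(partition_dir("parquet_stage/PQ3/st_i94_immigrations", 2016, 1))
# parquet_stage/PQ3/st_i94_immigrations/st_i94_year=2016/st_i94_month=1
```

When the base folder is read back, Spark restores the partition columns from these directory names, which is presumably why `st_i94_year` and `st_i94_month` appear at the end of the schema.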

3. Update fact table `f_i94_immigrations` with the cleaned values of column `st_i94_depdate_iso`

In [11]:
# Read data frames back into memory
# st_i94_immigrations with column `st_i94_port_state_code`:
location_st_i94_immigrations = "../P8_capstone_resource_files/parquet_stage/PQ3/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_st_i94_immigrations)

# f_i94_immigrations:
location_f_i94_immigrations = "../P8_capstone_resource_files/parquet_star/PQ2/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_f_i94_immigrations)

# show current schemas
print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(5,False)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5,False)

12228839
root
 |-- st_i94_cit: integer (nullable = true)
 |-- st_i94_port: string (nullable = true)
 |-- st_i94_addr: string (nullable = true)
 |-- st_i94_arrdate: integer (nullable = true)
 |-- st_i94_arrdate_iso: date (nullable = true)
 |-- st_i94_depdate: integer (nullable = true)
 |-- st_i94_depdate_iso: date (nullable = true)
 |-- st_i94_dtadfile: date (nullable = true)
 |-- st_i94_matflag: string (nullable = true)
 |-- st_i94_count: integer (nullable = true)
 |-- st_i94_id: integer (nullable = true)
 |-- st_i94_port_state_code: string (nullable = true)
 |-- st_i94_year: integer (nullable = true)
 |-- st_i94_month: integer (nullable = true)

+----------+-----------+-----------+--------------+------------------+--------------+------------------+---------------+--------------+------------+---------+----------------------+-----------+------------+
|st_i94_cit|st_i94_port|st_i94_addr|st_i94_arrdate|st_i94_arrdate_iso|st_i94_depdate|st_i94_depdate_iso|st_i94_dtadfile|st_i94_matflag|st_

In [12]:
# add column 'st_i94_depdate_iso' to fact table 'f_i94_immigrations'
df_st_i94_immigrations_2_join = df_st_i94_immigrations \
    .select("st_i94_id" , "st_i94_depdate_iso")


In [13]:
df_st_i94_immigrations_2_join.select("st_i94_id").printSchema()
df_f_i94_immigrations.select("f_i94_id").printSchema()
df_st_i94_immigrations_2_join.select("st_i94_id").filter("st_i94_id == 8987702").show()
df_f_i94_immigrations.select("f_i94_id").filter("f_i94_id == 8987702").show(1)

root
 |-- st_i94_id: integer (nullable = true)

root
 |-- f_i94_id: integer (nullable = true)

+---------+
|st_i94_id|
+---------+
|  8987702|
+---------+

+--------+
|f_i94_id|
+--------+
| 8987702|
+--------+



In [14]:
"""
Check that there are no null values in columns to join! Otherwise you can run out of memory!
A hint came from

https://medium.com/@yhoso/resolving-weird-spark-errors-f34324943e1c
-->An error occurred while calling o64.cacheTable. or An error occurred while calling o206.showString
"""
# check whether there are still null values in the result data frame
df_st_i94_immigrations_2_join\
    .select([count( when(col(c).isNull(), c) )
            .alias(c) for c in df_st_i94_immigrations_2_join.columns])\
    .toPandas().T

Unnamed: 0,0
st_i94_id,0
st_i94_depdate_iso,0
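
The per-column null count used in this check can be pictured with a small plain-Python sketch. The function `count_nulls` and the sample rows are hypothetical. In the pipeline itself the count is done in Spark with `count(when(col(c).isNull(), c))`.

```python
from datetime import date


def count_nulls(rows):
    """Count None values per column for a list of dict rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + int(value is None)
    return counts


# Hypothetical sample rows; the second row has a missing departure date.
sample = [
    {"st_i94_id": 1, "st_i94_depdate_iso": date(2016, 1, 2)},
    {"st_i94_id": 2, "st_i94_depdate_iso": None},
]
print(count_nulls(sample))  # {'st_i94_id': 0, 'st_i94_depdate_iso': 1}
```

A result of 0 for every join column, as in the output above, means the join key is complete.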


In [15]:
df_f_i94_immigrations\
    .select([count( when(col(c).isNull(), c) )
            .alias(c) for c in df_f_i94_immigrations.columns])\
    .toPandas().T

Unnamed: 0,0
f_i94_cit,0
f_i94_port,0
f_i94_addr,0
f_i94_arrdate_iso,0
f_i94_depdate_iso,0
f_i94_dtadfile,0
f_i94_matflag,0
f_i94_count,0
f_i94_id,0
d_ic_id,0


In [16]:
# replace null values in column 'f_i94_port_state_code' with 'NA'
df_f_i94_immigrations = df_f_i94_immigrations.fillna(value='NA', subset=['f_i94_port_state_code'])

In [17]:
df_f_i94_immigrations\
    .select([count( when(col(c).isNull(), c) )
            .alias(c) for c in df_f_i94_immigrations.columns])\
    .toPandas().T

Unnamed: 0,0
f_i94_cit,0
f_i94_port,0
f_i94_addr,0
f_i94_arrdate_iso,0
f_i94_depdate_iso,0
f_i94_dtadfile,0
f_i94_matflag,0
f_i94_count,0
f_i94_id,0
d_ic_id,0


In [18]:
# .coalesce(5) --> reduce the data frame to at most 5 partitions before the join
df_f_i94_immigrations = df_f_i94_immigrations \
    .coalesce(5) \
    .join(df_st_i94_immigrations_2_join, F.col("st_i94_id") == F.col("f_i94_id")
          , 'inner') \
    .withColumn("f_i94_depdate_iso", col("st_i94_depdate_iso")) \
    .drop("st_i94_id", "st_i94_depdate_iso")

In [19]:
# update column 'd_dd_id' with newly updated values from column 'f_i94_depdate_iso'
df_f_i94_immigrations = df_f_i94_immigrations \
    .withColumn("d_dd_id", col("f_i94_depdate_iso"))
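
The join and the key update above can be summarized in a plain-Python sketch. The function `update_depdate` and the sample rows are hypothetical. The sketch assumes unique ids on both sides and, like the inner join, drops fact rows without a staging match.

```python
from datetime import date


def update_depdate(fact_rows, staging_rows):
    """Overwrite f_i94_depdate_iso and d_dd_id with the cleaned staging value."""
    cleaned = {row["st_i94_id"]: row["st_i94_depdate_iso"] for row in staging_rows}
    updated = []
    for fact in fact_rows:
        key = fact["f_i94_id"]
        if key in cleaned:  # inner join: unmatched fact rows are dropped
            updated.append({**fact,
                            "f_i94_depdate_iso": cleaned[key],
                            "d_dd_id": cleaned[key]})
    return updated


facts = [{"f_i94_id": 1, "f_i94_depdate_iso": date(2092, 5, 9), "d_dd_id": date(2092, 5, 9)}]
staging = [{"st_i94_id": 1, "st_i94_depdate_iso": date(1900, 1, 1)}]
result = update_depdate(facts, staging)
print(result[0]["f_i94_depdate_iso"], result[0]["d_dd_id"])  # 1900-01-01 1900-01-01
```

After the update, the referencing key `d_dd_id` and the column `f_i94_depdate_iso` hold the same value for every row, which is what the following check verifies.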

In [20]:
# check again if values of referencing column "d_dd_id" are equal to column "f_i94_depdate_iso"
df_f_i94_immigrations \
    .filter(col("d_dd_id") != col("f_i94_depdate_iso")) \
    .groupBy("d_dd_id", "f_i94_depdate_iso") \
    .count() \
    .orderBy("d_dd_id") \
    .show(5)

+-------+-----------------+-----+
|d_dd_id|f_i94_depdate_iso|count|
+-------+-----------------+-----+
+-------+-----------------+-----+



In [23]:
# write f_i94_immigrations back to file system
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)


In [24]:
df_f_i94_immigrations \
    .repartition(1) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite')\
    .partitionBy('f_i94_year', 'f_i94_month') \
    .parquet(location_to_write, compression="gzip")

4. Generate new date staging tables (`st_date_arrivals`, `st_date_departures`) based on default, min and max values

In [26]:
# Create new data frame with date series
def generate_dates(spark, range_list, dt_col="date_time_ref", interval=60*60*24):  # TODO: attention to sparkSession
    """
    Create a Spark DataFrame with a single column named dt_col that holds a range of dates within a specified
    interval (start and stop included).
    With hourly data, dates end at 23:00 of the stop day.
    (https://stackoverflow.com/questions/57537760/pyspark-how-to-generate-a-dataframe-composed-of-datetime-range)

    :param spark: SparkSession or sqlContext depending on environment (server vs local)
    :param range_list: start and stop, e.g. strings formatted as "2018-01-20" or "2018-01-20 00:00:00"
    :param dt_col: string with the name of the generated date column
    :param interval: number of seconds between two entries (default: one day)

    :returns: df from range
    """
    start, stop = range_list
    # one-row data frame that holds both boundaries
    temp_df = spark.createDataFrame([(start, stop)], ("start", "stop"))
    temp_df = temp_df.select([F.col(c).cast("timestamp") for c in ("start", "stop")])
    # add one day so that the stop date itself is included in the range
    temp_df = temp_df.withColumn("stop", F.date_add("stop", 1).cast("timestamp"))
    # convert both boundaries to epoch seconds
    temp_df = temp_df.select([F.col(c).cast("long") for c in ("start", "stop")])
    start, stop = temp_df.first()
    # one row per interval, converted back from epoch seconds to a date
    return spark.range(start, stop, interval).select(F.col("id").cast("timestamp").cast("date").alias(dt_col))
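
For comparison, the same gap-free, inclusive date series can be sketched in plain Python. The function `date_range_inclusive` is hypothetical and not part of the pipeline. It steps in calendar days instead of seconds, so it does not depend on time zone settings.

```python
from datetime import date, timedelta


def date_range_inclusive(start: date, stop: date):
    """Return every calendar day from start to stop, both included."""
    days = (stop - start).days
    return [start + timedelta(days=offset) for offset in range(days + 1)]


arrivals = date_range_inclusive(date(2016, 1, 1), date(2016, 12, 31))
print(len(arrivals))              # 366 (2016 is a leap year)
print(arrivals[0], arrivals[-1])  # 2016-01-01 2016-12-31
```

The count of 366 matches the number of distinct arrival dates found in the staging table.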

In [27]:
# Create new staging tables 'st_date_arrivals' and 'st_date_departures' with min and max date values
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)



In [28]:
# check if all date values from "f_i94_arrdate_iso" are valid
df_f_i94_immigrations\
    .groupBy("f_i94_arrdate_iso")\
    .count()\
    .orderBy("f_i94_arrdate_iso")\
    .show(5000)

+-----------------+-----+
|f_i94_arrdate_iso|count|
+-----------------+-----+
|       2016-01-01|23410|
|       2016-01-02|23665|
|       2016-01-03|27009|
|       2016-01-04|29974|
|       2016-01-05|28897|
|       2016-01-06|28712|
|       2016-01-07|30191|
|       2016-01-08|31174|
|       2016-01-09|33442|
|       2016-01-10|36234|
|       2016-01-11|33429|
|       2016-01-12|28912|
|       2016-01-13|27789|
|       2016-01-14|27549|
|       2016-01-15|30710|
|       2016-01-16|33242|
|       2016-01-17|33987|
|       2016-01-18|32927|
|       2016-01-19|27734|
|       2016-01-20|26643|
|       2016-01-21|25519|
|       2016-01-22|25669|
|       2016-01-23|22433|
|       2016-01-24|27801|
|       2016-01-25|28365|
|       2016-01-26|21440|
|       2016-01-27|20802|
|       2016-01-28|20396|
|       2016-01-29|23002|
|       2016-01-30|27172|
|       2016-01-31|27740|
|       2016-02-01|28419|
|       2016-02-02|23660|
|       2016-02-03|23306|
|       2016-02-04|23656|
|       2016

In [29]:
# Get min and max values for "f_i94_arrdate_iso"
f_i94_arrdate_iso_min, f_i94_arrdate_iso_max = df_f_i94_immigrations \
    .select(F.min("f_i94_arrdate_iso").alias("f_i94_arrdate_iso_min"),
            F.max("f_i94_arrdate_iso").alias("f_i94_arrdate_iso_max")) \
    .first()


print(f"f_i94_arrdate_iso_min: {f_i94_arrdate_iso_min}")
print(f"f_i94_arrdate_iso_max: {f_i94_arrdate_iso_max}")


f_i94_arrdate_iso_min: 2016-01-01
f_i94_arrdate_iso_max: 2016-12-31


In [30]:
# create new staging table "st_date_arrivals"
date_range = [f_i94_arrdate_iso_min, f_i94_arrdate_iso_max]
dt_col="st_da_date"
df_st_date_arrivals = generate_dates(spark, date_range, dt_col)

df_st_date_arrivals.printSchema()
df_st_date_arrivals.head(5)

root
 |-- st_da_date: date (nullable = false)



[Row(st_da_date=datetime.date(2016, 1, 1)),
 Row(st_da_date=datetime.date(2016, 1, 2)),
 Row(st_da_date=datetime.date(2016, 1, 3)),
 Row(st_da_date=datetime.date(2016, 1, 4)),
 Row(st_da_date=datetime.date(2016, 1, 5))]

In [31]:
df_st_date_arrivals.tail(5)

[Row(st_da_date=datetime.date(2016, 12, 27)),
 Row(st_da_date=datetime.date(2016, 12, 28)),
 Row(st_da_date=datetime.date(2016, 12, 29)),
 Row(st_da_date=datetime.date(2016, 12, 30)),
 Row(st_da_date=datetime.date(2016, 12, 31))]

5. Append date specific columns to staging tables, create a dimension from it and save it to the file system.

In [32]:
# create new columns of st_date_arrivals table
# https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

df_st_date_arrivals = df_st_date_arrivals \
    .withColumn("st_da_id", col("st_da_date")) \
    .withColumn("st_da_year", F.year(col("st_da_date"))) \
    .withColumn("st_da_year_quarter", F.concat_ws('/', F.year(col("st_da_date")), F.quarter(col("st_da_date")))) \
    .withColumn("st_da_year_month", F.concat_ws('/', F.year(col("st_da_date")), date_format(col("st_da_date"), 'MM'))) \
    .withColumn("st_da_quarter", F.quarter(col("st_da_date"))) \
    .withColumn("st_da_month", F.month(col("st_da_date"))) \
    .withColumn("st_da_week", F.weekofyear(col("st_da_date"))) \
    .withColumn("st_da_weekday", F.date_format(col("st_da_date"),'EEEE')) \
    .withColumn("st_da_weekday_short", F.date_format(col("st_da_date"),'EEE')) \
    .withColumn("st_da_dayofweek", F.dayofweek(col("st_da_date"))) \
    .withColumn("st_da_day", F.dayofmonth(col("st_da_date")) )

df_st_date_arrivals.printSchema()
df_st_date_arrivals.show(5)

root
 |-- st_da_date: date (nullable = false)
 |-- st_da_id: date (nullable = false)
 |-- st_da_year: integer (nullable = false)
 |-- st_da_year_quarter: string (nullable = false)
 |-- st_da_year_month: string (nullable = false)
 |-- st_da_quarter: integer (nullable = false)
 |-- st_da_month: integer (nullable = false)
 |-- st_da_week: integer (nullable = false)
 |-- st_da_weekday: string (nullable = false)
 |-- st_da_weekday_short: string (nullable = false)
 |-- st_da_dayofweek: integer (nullable = false)
 |-- st_da_day: integer (nullable = false)

+----------+----------+----------+------------------+----------------+-------------+-----------+----------+-------------+-------------------+---------------+---------+
|st_da_date|  st_da_id|st_da_year|st_da_year_quarter|st_da_year_month|st_da_quarter|st_da_month|st_da_week|st_da_weekday|st_da_weekday_short|st_da_dayofweek|st_da_day|
+----------+----------+----------+------------------+----------------+-------------+-----------+----------+-
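
To make the derived attributes easier to check by hand, the sketch below computes the same columns for a single date in plain Python. The function `date_attributes` is hypothetical. It assumes that Spark's `weekofyear` follows ISO 8601 week numbering and that `dayofweek` counts from 1 = Sunday to 7 = Saturday, and it uses English weekday names as an assumption about the session locale.

```python
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def date_attributes(d: date) -> dict:
    """Derive the date dimension attributes for one calendar day."""
    quarter = (d.month - 1) // 3 + 1
    name = WEEKDAYS[d.weekday()]
    return {
        "st_da_id": d,
        "st_da_year": d.year,
        "st_da_year_quarter": f"{d.year}/{quarter}",
        "st_da_year_month": f"{d.year}/{d.month:02d}",
        "st_da_quarter": quarter,
        "st_da_month": d.month,
        "st_da_week": d.isocalendar()[1],          # ISO week number
        "st_da_weekday": name,
        "st_da_weekday_short": name[:3],
        "st_da_dayofweek": d.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday
        "st_da_day": d.day,
    }


row = date_attributes(date(2016, 1, 1))
print(row["st_da_year_month"], row["st_da_week"], row["st_da_weekday"], row["st_da_dayofweek"])
# 2016/01 53 Friday 6
```

Under ISO numbering, 1 January 2016 still belongs to week 53 of the previous year, so `st_da_week` is not always aligned with `st_da_year`.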

In [33]:
# persist staging time table 'st_date_arrivals'
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_arrivals"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_arrivals \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .parquet(location_to_write, compression="gzip")

In [34]:
# create dimension 'd_date_arrivals' from staging table 'st_date_arrivals'
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_arrivals"
df_st_date_arrivals = spark.read.parquet(location_to_read)

print(df_st_date_arrivals.count())
df_st_date_arrivals.printSchema()

df_st_date_arrivals = df_st_date_arrivals \
    .withColumnRenamed("st_da_date", "d_da_date") \
    .withColumnRenamed("st_da_id", "d_da_id") \
    .withColumnRenamed("st_da_year", "d_da_year") \
    .withColumnRenamed("st_da_year_quarter", "d_da_year_quarter") \
    .withColumnRenamed("st_da_year_month", "d_da_year_month") \
    .withColumnRenamed("st_da_quarter", "d_da_quarter") \
    .withColumnRenamed("st_da_month", "d_da_month") \
    .withColumnRenamed("st_da_week", "d_da_week") \
    .withColumnRenamed("st_da_weekday", "d_da_weekday") \
    .withColumnRenamed("st_da_weekday_short", "d_da_weekday_short") \
    .withColumnRenamed("st_da_dayofweek", "d_da_dayofweek") \
    .withColumnRenamed("st_da_day", "d_da_day")

df_st_date_arrivals.printSchema()
df_st_date_arrivals.show(5)


location_to_write = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_arrivals"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_arrivals \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .parquet(location_to_write, compression="gzip")

366
root
 |-- st_da_date: date (nullable = true)
 |-- st_da_id: date (nullable = true)
 |-- st_da_year: integer (nullable = true)
 |-- st_da_year_quarter: string (nullable = true)
 |-- st_da_year_month: string (nullable = true)
 |-- st_da_quarter: integer (nullable = true)
 |-- st_da_month: integer (nullable = true)
 |-- st_da_week: integer (nullable = true)
 |-- st_da_weekday: string (nullable = true)
 |-- st_da_weekday_short: string (nullable = true)
 |-- st_da_dayofweek: integer (nullable = true)
 |-- st_da_day: integer (nullable = true)

root
 |-- d_da_date: date (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_da_year: integer (nullable = true)
 |-- d_da_year_quarter: string (nullable = true)
 |-- d_da_year_month: string (nullable = true)
 |-- d_da_quarter: integer (nullable = true)
 |-- d_da_month: integer (nullable = true)
 |-- d_da_week: integer (nullable = true)
 |-- d_da_weekday: string (nullable = true)
 |-- d_da_weekday_short: string (nullable = true)
 |-- d_da

In [35]:
# Creation of the second dimension named `d_date_departures` based on fact column `f_i94_depdate_iso`.
# Create new staging table 'st_date_departures' with min, max and default date values
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)



In [36]:
# check if all date values from "f_i94_depdate_iso" are valid
df_f_i94_immigrations\
    .groupBy("f_i94_depdate_iso")\
    .count()\
    .orderBy("f_i94_depdate_iso")\
    .show(5000)

+-----------------+-------+
|f_i94_depdate_iso|  count|
+-----------------+-------+
|       1900-01-01|1004028|
|       2016-01-02|   1089|
|       2016-01-03|   2073|
|       2016-01-04|   2552|
|       2016-01-05|   3339|
|       2016-01-06|   4415|
|       2016-01-07|   6061|
|       2016-01-08|   8682|
|       2016-01-09|  10346|
|       2016-01-10|  11739|
|       2016-01-11|  10111|
|       2016-01-12|   9934|
|       2016-01-13|  11472|
|       2016-01-14|  13767|
|       2016-01-15|  16892|
|       2016-01-16|  16578|
|       2016-01-17|  15290|
|       2016-01-18|  12871|
|       2016-01-19|  11538|
|       2016-01-20|  13869|
|       2016-01-21|  16529|
|       2016-01-22|  20039|
|       2016-01-23|  17967|
|       2016-01-24|  17568|
|       2016-01-25|  16841|
|       2016-01-26|  15086|
|       2016-01-27|  16466|
|       2016-01-28|  19092|
|       2016-01-29|  22176|
|       2016-01-30|  22846|
|       2016-01-31|  20686|
|       2016-02-01|  17054|
|       2016-02-02| 

In [37]:
# extract default, min and max date from column 'f_i94_depdate_iso'
# get default and min value: the two smallest distinct dates are the default (1900-01-01) and the real minimum
f_i94_depdate_iso_default, f_i94_depdate_iso_min = df_f_i94_immigrations\
    .select("f_i94_depdate_iso") \
    .distinct() \
    .orderBy("f_i94_depdate_iso", ascending=True) \
    .limit(2) \
    .select(F.min("f_i94_depdate_iso").alias("f_i94_depdate_iso_default"),
            F.max("f_i94_depdate_iso").alias("f_i94_depdate_iso_min")) \
    .first()

# get max value
f_i94_depdate_iso_max = df_f_i94_immigrations \
    .select(F.max("f_i94_depdate_iso").alias("f_i94_depdate_iso_max")) \
    .first()[0]

# check selected data
print(f"f_i94_depdate_iso_default: {f_i94_depdate_iso_default}")
print(f"f_i94_depdate_iso_min: {f_i94_depdate_iso_min}")
print(f"f_i94_depdate_iso_max: {f_i94_depdate_iso_max}")

f_i94_depdate_iso_default: 1900-01-01
f_i94_depdate_iso_min: 2016-01-02
f_i94_depdate_iso_max: 2017-06-14
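
The extraction relies on the default date sorting before every real date. A minimal plain-Python sketch of the same idea follows, using a handful of made-up sample values (the list `departure_dates` is an assumption for illustration, not project data):

```python
from datetime import date

# illustrative sample, not real data: the default date plus a few departure dates
departure_dates = [
    date(2016, 1, 5), date(1900, 1, 1), date(2017, 6, 14),
    date(2016, 1, 2), date(1900, 1, 1), date(2016, 3, 9),
]

distinct_sorted = sorted(set(departure_dates))

# the smallest distinct value is the default, the second smallest the real minimum
default_date, min_date = distinct_sorted[:2]
max_date = distinct_sorted[-1]

print(default_date, min_date, max_date)
```

This approach only holds while the default is smaller than every valid date and at least one row carries it. Otherwise the two smallest values would both be real dates.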


In [38]:
# create new staging table "st_date_departures"
date_range_default = [f_i94_depdate_iso_default, f_i94_depdate_iso_default]
date_range_min_max = [f_i94_depdate_iso_min, f_i94_depdate_iso_max]

# check valid date ranges
print(date_range_default)
print(date_range_min_max)

# create new data frames for the default date and for the min-max date range
dt_col="st_dd_date"
df_st_date_departures_default = generate_dates(spark, date_range_default, dt_col)
df_st_date_departures_min_max = generate_dates(spark, date_range_min_max, dt_col)

[datetime.date(1900, 1, 1), datetime.date(1900, 1, 1)]
[datetime.date(2016, 1, 2), datetime.date(2017, 6, 14)]


In [39]:
# combine both data frames to append `1900-01-01` to all other dates
df_st_date_departures = df_st_date_departures_default.union(df_st_date_departures_min_max)

In [40]:
df_st_date_departures.printSchema()
df_st_date_departures.head(5)

root
 |-- st_dd_date: date (nullable = false)



[Row(st_dd_date=datetime.date(1900, 1, 1)),
 Row(st_dd_date=datetime.date(2016, 1, 2)),
 Row(st_dd_date=datetime.date(2016, 1, 3)),
 Row(st_dd_date=datetime.date(2016, 1, 4)),
 Row(st_dd_date=datetime.date(2016, 1, 5))]

In [41]:
df_st_date_departures.tail(5)



[Row(st_dd_date=datetime.date(2017, 6, 10)),
 Row(st_dd_date=datetime.date(2017, 6, 11)),
 Row(st_dd_date=datetime.date(2017, 6, 12)),
 Row(st_dd_date=datetime.date(2017, 6, 13)),
 Row(st_dd_date=datetime.date(2017, 6, 14))]
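
`generate_dates` is a project helper whose definition is not shown in this section. As a rough, Spark-free approximation of what it is used for here, the sketch below builds a gap-free list of calendar days between two bounds and places the default date in front. The function name `date_range_inclusive` is hypothetical.

```python
from datetime import date, timedelta


def date_range_inclusive(start: date, end: date) -> list:
    """Hypothetical stand-in: every calendar day from start to end, both included."""
    return [start + timedelta(days=offset) for offset in range((end - start).days + 1)]


default_days = date_range_inclusive(date(1900, 1, 1), date(1900, 1, 1))
min_max_days = date_range_inclusive(date(2016, 1, 2), date(2017, 6, 14))

# same idea as the union above: the default date first, then the continuous range
all_days = default_days + min_max_days
print(len(all_days), all_days[0], all_days[-1])
```

With these bounds the list holds 531 entries: 530 days from 2016-01-02 through 2017-06-14 plus the default date.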

In [42]:
# Append date specific columns to staging table `st_date_departures`.
# create new columns of st_date_departures table
# https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

df_st_date_departures = df_st_date_departures \
    .withColumn("st_dd_id", col("st_dd_date")) \
    .withColumn("st_dd_year", F.year(col("st_dd_date"))) \
    .withColumn("st_dd_year_quarter", F.concat_ws('/', F.year(col("st_dd_date")), F.quarter(col("st_dd_date")))) \
    .withColumn("st_dd_year_month", F.concat_ws('/', F.year(col("st_dd_date")), F.date_format(col("st_dd_date"), 'MM'))) \
    .withColumn("st_dd_quarter", F.quarter(col("st_dd_date"))) \
    .withColumn("st_dd_month", F.month("st_dd_date")) \
    .withColumn("st_dd_week", F.weekofyear(col("st_dd_date"))) \
    .withColumn("st_dd_weekday", F.date_format(col("st_dd_date"),'EEEE')) \
    .withColumn("st_dd_weekday_short", F.date_format(col("st_dd_date"),'EEE')) \
    .withColumn("st_dd_dayofweek", F.dayofweek(col("st_dd_date"))) \
    .withColumn("st_dd_day", F.dayofmonth(col("st_dd_date")) )

In [43]:
# get prepared staging table
df_st_date_departures.printSchema()
df_st_date_departures.show(5)

root
 |-- st_dd_date: date (nullable = false)
 |-- st_dd_id: date (nullable = false)
 |-- st_dd_year: integer (nullable = false)
 |-- st_dd_year_quarter: string (nullable = false)
 |-- st_dd_year_month: string (nullable = false)
 |-- st_dd_quarter: integer (nullable = false)
 |-- st_dd_month: integer (nullable = false)
 |-- st_dd_week: integer (nullable = false)
 |-- st_dd_weekday: string (nullable = false)
 |-- st_dd_weekday_short: string (nullable = false)
 |-- st_dd_dayofweek: integer (nullable = false)
 |-- st_dd_day: integer (nullable = false)

+----------+----------+----------+------------------+----------------+-------------+-----------+----------+-------------+-------------------+---------------+---------+
|st_dd_date|  st_dd_id|st_dd_year|st_dd_year_quarter|st_dd_year_month|st_dd_quarter|st_dd_month|st_dd_week|st_dd_weekday|st_dd_weekday_short|st_dd_dayofweek|st_dd_day|
+----------+----------+----------+------------------+----------------+-------------+-----------+----------+-

In [44]:
# persist staging time table 'st_date_departures'
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_departures"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_departures \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .parquet(location_to_write, compression="gzip")

In [45]:
# create dimension 'd_date_departures' from staging table 'st_date_departures'
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_departures"
df_st_date_departures = spark.read.parquet(location_to_read)

print(df_st_date_departures.count())
df_st_date_departures.printSchema()


df_st_date_departures = df_st_date_departures \
    .withColumnRenamed("st_dd_date", "d_dd_date") \
    .withColumnRenamed("st_dd_id", "d_dd_id") \
    .withColumnRenamed("st_dd_year", "d_dd_year") \
    .withColumnRenamed("st_dd_year_quarter", "d_dd_year_quarter") \
    .withColumnRenamed("st_dd_year_month", "d_dd_year_month") \
    .withColumnRenamed("st_dd_quarter", "d_dd_quarter") \
    .withColumnRenamed("st_dd_month", "d_dd_month") \
    .withColumnRenamed("st_dd_week", "d_dd_week") \
    .withColumnRenamed("st_dd_weekday", "d_dd_weekday") \
    .withColumnRenamed("st_dd_weekday_short", "d_dd_weekday_short") \
    .withColumnRenamed("st_dd_dayofweek", "d_dd_dayofweek") \
    .withColumnRenamed("st_dd_day", "d_dd_day")

df_st_date_departures.printSchema()
df_st_date_departures.show(5)


location_to_write = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_departures"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_departures \
    .repartition(1) \
    .write \
    .mode("overwrite") \
    .parquet(location_to_write, compression="gzip")

531
root
 |-- st_dd_date: date (nullable = true)
 |-- st_dd_id: date (nullable = true)
 |-- st_dd_year: integer (nullable = true)
 |-- st_dd_year_quarter: string (nullable = true)
 |-- st_dd_year_month: string (nullable = true)
 |-- st_dd_quarter: integer (nullable = true)
 |-- st_dd_month: integer (nullable = true)
 |-- st_dd_week: integer (nullable = true)
 |-- st_dd_weekday: string (nullable = true)
 |-- st_dd_weekday_short: string (nullable = true)
 |-- st_dd_dayofweek: integer (nullable = true)
 |-- st_dd_day: integer (nullable = true)

root
 |-- d_dd_date: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- d_dd_year: integer (nullable = true)
 |-- d_dd_year_quarter: string (nullable = true)
 |-- d_dd_year_month: string (nullable = true)
 |-- d_dd_quarter: integer (nullable = true)
 |-- d_dd_month: integer (nullable = true)
 |-- d_dd_week: integer (nullable = true)
 |-- d_dd_weekday: string (nullable = true)
 |-- d_dd_weekday_short: string (nullable = true)
 |-- d_dd

6. Map dimension `d_date_arrivals` to fact table `f_i94_immigrations` based on columns
   (`st_date_arrivals.st_da_date` --> `d_date_arrivals.d_da_id`) == (`st_i94_immigration.st_i94_arrdate_iso` -->
   `f_i94_immigrations.d_da_id`).

7. Map dimension `d_date_departures` to fact table `f_i94_immigrations` based on columns
   (`st_date_departures.st_dd_date` --> `d_date_departures.d_dd_id`) == (`st_i94_immigration.st_i94_depdate_iso` -->
   `f_i94_immigrations.d_dd_id`).

8. Answer Project Question 3: At what times do foreign persons arrive for immigration to the U.S.?

In [46]:
# reload fact table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)
print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5, False)

location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_arrivals"
df_d_date_arrivals = spark.read.parquet(location_to_read)
print(df_d_date_arrivals.count())
df_d_date_arrivals.printSchema()
df_d_date_arrivals.show(5, False)

location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_departures"
df_d_date_departures = spark.read.parquet(location_to_read)
print(df_d_date_departures.count())
df_d_date_departures.printSchema()
df_d_date_departures.show(5, False)

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)

+---------+----------+----------+-----------------+-----------------+--------------+-------------+-----------+--------+-------+-------+----------+----------+---------------------+-------+----------+-----------+
|f_i94_cit|f_i94_port|f_i94_addr|f_i94_arrdate_i

In [47]:
# Register data frames as Views
df_f_i94_immigrations.createOrReplaceTempView("f_i94_immigrations")
df_d_date_arrivals.createOrReplaceTempView("d_date_arrivals")
df_d_date_departures.createOrReplaceTempView("d_date_departures")
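
Before querying, it can be worth confirming that every date key in the fact table has a partner in its dimension, since an inner join silently drops unmatched rows. The sketch below shows the check on toy data in plain Python; in Spark the same idea is usually expressed as a `left_anti` join. All values and the name `missing_keys` are illustrative assumptions.

```python
from datetime import date

# toy stand-ins for f_i94_immigrations.d_dd_id and d_date_departures.d_dd_id
fact_keys = [date(1900, 1, 1), date(2016, 1, 2), date(2016, 1, 3), date(2016, 1, 2)]
dimension_keys = {date(1900, 1, 1), date(2016, 1, 2), date(2016, 1, 3), date(2016, 1, 4)}


def missing_keys(fact, dimension):
    """Return the fact keys that have no matching dimension row (an anti-join)."""
    return sorted(set(fact) - set(dimension))


# every toy key resolves, so nothing is reported
print(missing_keys(fact_keys, dimension_keys))

# a key outside the dimension's date range is reported
print(missing_keys(fact_keys + [date(2018, 1, 1)], dimension_keys))
```

An empty result means the gap-free dimension covers all fact rows, so inner and left joins return the same number of rows.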

8. Answer Project Question 3.1: At what times do foreign persons arrive for immigration to the U.S.?

In [48]:
# SQL to answer Project Question 3.1: At what times do foreign persons arrive for immigration to the U.S.?
df_pq3_1 = spark.sql("select da.d_da_year_month as Year_Month"
                     "      ,count(f_i94.f_i94_count) as  Immigrants"
                     "      ,RANK() OVER (ORDER BY count(f_i94.f_i94_count) desc) Immigrants_rank"
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " group by Year_Month "
                     " order by Year_Month  "
                     )

df_pq3_1.show(20, False)

df_pq3_11 = spark.sql("select da.d_da_year_month as Year_Month"
                     "      ,count(f_i94.f_i94_count) as  Immigrants"
                     "      ,RANK() OVER (ORDER BY count(f_i94.f_i94_count) desc) Immigrants_rank"
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " group by Year_Month "
                     " order by Immigrants_rank  "
                     )

df_pq3_11.show(20, False)



+----------+----------+---------------+
|Year_Month|Immigrants|Immigrants_rank|
+----------+----------+---------------+
|2016/01   |865969    |11             |
|2016/02   |696579    |12             |
|2016/03   |939031    |9              |
|2016/04   |1042752   |6              |
|2016/05   |1042120   |7              |
|2016/06   |1080648   |4              |
|2016/07   |1328974   |1              |
|2016/08   |1160455   |2              |
|2016/09   |1125245   |3              |
|2016/10   |1050757   |5              |
|2016/11   |904804    |10             |
|2016/12   |991505    |8              |
+----------+----------+---------------+

+----------+----------+---------------+
|Year_Month|Immigrants|Immigrants_rank|
+----------+----------+---------------+
|2016/07   |1328974   |1              |
|2016/08   |1160455   |2              |
|2016/09   |1125245   |3              |
|2016/06   |1080648   |4              |
|2016/10   |1050757   |5              |
|2016/04   |1042752   |6              |
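
The ranking column can be reproduced outside Spark. The following sketch applies the `RANK()` rule (ties share a rank, and the following rank is skipped accordingly) to the monthly counts printed in the first output above. The function name `rank_desc` is an assumption for illustration.

```python
# monthly arrival counts copied from the query output above
immigrants_per_month = {
    "2016/01": 865969, "2016/02": 696579, "2016/03": 939031, "2016/04": 1042752,
    "2016/05": 1042120, "2016/06": 1080648, "2016/07": 1328974, "2016/08": 1160455,
    "2016/09": 1125245, "2016/10": 1050757, "2016/11": 904804, "2016/12": 991505,
}


def rank_desc(values: dict) -> dict:
    """SQL-style RANK() in descending order: rank = 1 + number of strictly larger values."""
    return {key: 1 + sum(other > value for other in values.values())
            for key, value in values.items()}


ranks = rank_desc(immigrants_per_month)
for month in sorted(ranks, key=ranks.get)[:3]:
    print(month, immigrants_per_month[month], ranks[month])

print(sum(immigrants_per_month.values()))
```

The twelve monthly counts add up to 12228839, the row count of the fact table, so the join to `d_date_arrivals` does not lose any fact rows.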

9. Answer Project Question 3.2: When a foreign person comes to the U.S. for immigration, do they travel on to
   another state?

In [49]:
# SQL to answer Project Question 3.2: When a foreign person comes to the U.S. for immigration, do they travel on to
# another state?
df_pq3_2 = spark.sql("select da.d_da_year_month as Year_Month_arrival"
                     "      ,dd.d_dd_year_month as Year_Month_departure"
                     "      ,count(f_i94.f_i94_count) as Immigrants "
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " left join d_date_departures dd on dd.d_dd_id = f_i94.d_dd_id  "
                     " group by Year_Month_arrival, Year_Month_departure"
                     " order by Year_Month_arrival, Year_Month_departure, Immigrants"
                     )

df_pq3_2.show(5000, False)

+------------------+--------------------+----------+
|Year_Month_arrival|Year_Month_departure|Immigrants|
+------------------+--------------------+----------+
|2016/01           |1900/01             |153706    |
|2016/01           |2016/01             |387914    |
|2016/01           |2016/02             |226693    |
|2016/01           |2016/03             |80526     |
|2016/01           |2016/04             |17130     |
|2016/02           |1900/01             |66130     |
|2016/02           |2016/02             |360331    |
|2016/02           |2016/03             |205680    |
|2016/02           |2016/04             |40169     |
|2016/02           |2016/05             |24269     |
|2016/03           |1900/01             |48596     |
|2016/03           |2016/03             |442997    |
|2016/03           |2016/04             |260733    |
|2016/03           |2016/05             |76461     |
|2016/03           |2016/06             |48895     |
|2016/03           |2016/07             |24286     |
|2016/03    

In [50]:
df_pq3_2 = spark.sql("select da.d_da_year_month as Year_Month_arrival"
                     "      ,dd.d_dd_year_month as Year_Month_departure"
                     "      ,count(f_i94.f_i94_count) as Immigrants "
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " left join d_date_departures dd on dd.d_dd_id = f_i94.d_dd_id  "
                     " group by Year_Month_arrival, Year_Month_departure"
                     " order by Immigrants desc "
                     )

df_pq3_2.show(5000, False)


+------------------+--------------------+----------+
|Year_Month_arrival|Year_Month_departure|Immigrants|
+------------------+--------------------+----------+
|2016/07           |2016/07             |565670    |
|2016/08           |2016/08             |559430    |
|2016/07           |2016/08             |532022    |
|2016/10           |2016/10             |531023    |
|2016/05           |2016/05             |524489    |
|2016/04           |2016/04             |521618    |
|2016/09           |2016/09             |517067    |
|2016/06           |2016/06             |471452    |
|2016/03           |2016/03             |442997    |
|2016/11           |2016/11             |432226    |
|2016/01           |2016/01             |387914    |
|2016/12           |2017/01             |385122    |
|2016/12           |2016/12             |364770    |
|2016/02           |2016/02             |360331    |
|2016/06           |2016/07             |359382    |
|2016/08           |2016/09             |327683    |
|2016/04    
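
One way to read these figures is as shares per arrival month. The sketch below does this for the January 2016 arrivals, using the five rows shown for `2016/01` in the first output above. The variable names are illustrative assumptions.

```python
# rows for arrival month 2016/01, copied from the first query output above
departures_for_2016_01 = {
    "1900/01": 153706,  # default date: departure date missing or outside the valid range
    "2016/01": 387914,
    "2016/02": 226693,
    "2016/03": 80526,
    "2016/04": 17130,
}

total = sum(departures_for_2016_01.values())
shares = {month: count / total for month, count in departures_for_2016_01.items()}

for month, share in shares.items():
    print(f"{month}: {share:.1%}")
```

The five rows sum to 865969, which equals the January 2016 arrival count from Question 3.1, so the breakdown for that month is complete.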

10. Answer Project Question 3.3: If a foreign person travels to another state after immigration, after which period of
    time does this happen?

In [51]:
# DF to answer Project Question 3.3: If a foreign person travels to another state, after which period of time does this
# happen?
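from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank

# rank stay durations by number of immigrants; the window has no partitionBy, so Spark evaluates it in a single
# partition, which is acceptable here because the aggregated result is small
windowSpec = Window.orderBy(col("immigrants").desc())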

df_f_i94_immigrations \
    .join(df_d_date_arrivals , df_f_i94_immigrations.d_da_id == df_d_date_arrivals.d_da_id) \
    .join(df_d_date_departures, df_f_i94_immigrations.d_dd_id == df_d_date_departures.d_dd_id) \
    .filter("f_i94_depdate_iso != '1900-01-01'") \
    .withColumn("departure_days_after_arrival", F.datediff(col("f_i94_depdate_iso"), col("f_i94_arrdate_iso"))) \
    .select( "d_da_date"
            ,"d_dd_date"
            ,"departure_days_after_arrival") \
    .groupBy("departure_days_after_arrival").count() \
    .withColumnRenamed("count", "immigrants") \
    .withColumn("dense_rank",dense_rank().over(windowSpec)) \
    .show(500)


+----------------------------+----------+----------+
|departure_days_after_arrival|immigrants|dense_rank|
+----------------------------+----------+----------+
|                           4|    608273|         1|
|                           5|    574832|         2|
|                           3|    562640|         3|
|                           7|    554430|         4|
|                           6|    533309|         5|
|                           1|    512566|         6|
|                           8|    443648|         7|
|                           2|    417354|         8|
|                           9|    402639|         9|
|                          10|    386824|        10|
|                          14|    372211|        11|
|                          11|    334037|        12|
|                          13|    306093|        13|
|                          12|    305318|        14|
|                          15|    275895|        15|
|                          16|    231645|     
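
The stay duration logic can be illustrated without Spark. The sketch below computes the day difference for a few made-up arrival and departure pairs, skips the default departure date, and assigns a dense rank to the resulting counts. The sample records and the name `dense_rank_desc` are assumptions for illustration.

```python
from collections import Counter
from datetime import date

DEFAULT_DATE = date(1900, 1, 1)

# made-up (arrival, departure) pairs, not real data
records = [
    (date(2016, 7, 1), date(2016, 7, 5)),
    (date(2016, 7, 2), date(2016, 7, 6)),
    (date(2016, 7, 3), date(2016, 7, 6)),
    (date(2016, 7, 3), date(2016, 7, 10)),
    (date(2016, 7, 4), DEFAULT_DATE),  # no valid departure date, excluded below
]

# like datediff(departure, arrival), after filtering out the default date
stay_days = Counter((dep - arr).days for arr, dep in records if dep != DEFAULT_DATE)


def dense_rank_desc(counts: dict) -> dict:
    """DENSE_RANK() in descending order: ties share a rank and no rank is skipped."""
    distinct_desc = sorted(set(counts.values()), reverse=True)
    return {key: distinct_desc.index(value) + 1 for key, value in counts.items()}


ranks = dense_rank_desc(stay_days)
for days, immigrants in stay_days.most_common():
    print(days, immigrants, ranks[days])
```

Excluding the default date matters here: a difference against 1900-01-01 would produce large negative durations that say nothing about actual stays.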