# Project 08 - Analysis of U.S. Immigration (I-94) Data
### Udacity Data Engineer - Capstone Project
> by Peter Wissel | 2021-04-03

## Project Overview
This project analyzes a dataset on immigration to the United States. Supplementary datasets cover airport codes,
U.S. city demographics, and temperature data.

The process is divided into five sub-steps that illustrate how to answer the questions posed by the business
analytics team.

This project file covers the following steps:
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data

##### 3.1.4. To which states in the U.S. do immigrants want to continue their travel after their initial arrival and what demographics can immigrants expect when they arrive in the destination state, such as average temperature, population numbers or population density? [(Data pipeline)](#question4_data_pipeline) <a name="question4_description"></a>
1. Clean data and create staging table `st_state_destinations` from file
   [I94_SAS_Labels_I94ADDR.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt)
   based on columns `st_sd_state_code` and `st_sd_state_name`.
2. Extract some demographic data from file [us-cities-demographics.json](../P8_capstone_resource_files/us-cities-demographics.json)
   like `age_median`, `population_male`, `population_female`, `population_total` or `foreign_born` and add them to staging
   table `st_state_destinations`.
3. Creation of a dimension named `d_state_destinations` based on staging table `st_state_destinations`.
4. Mapping of dimension `d_state_destinations` to fact table `f_i94_immigration` based on columns
   (`st_state_destinations.st_sd_state_code` --> `d_state_destinations.d_sd_id`) ==
   (`st_i94_immigration.st_i94_addr` --> `f_i94_immigration.d_sd_id`).
5. Clean fact table `f_i94_immigration` based on the dimension `d_state_destinations`. All unrecognized state codes will
   be set to `99` (all other codes).
6. Answer Project Question 4: To which states in the U.S. do immigrants want to continue their travel after their initial
   arrival and what demographics can immigrants expect when they arrive in the destination state, such as average
   temperature, population numbers or population density?


##### 4.1.4. To which states in the U.S. do immigrants want to continue their travel after their initial arrival and what demographics can immigrants expect when they arrive in the destination state, such as average temperature, population numbers or population density? [(Data description)](#question4_description) <a name="question4_data_pipeline"></a>
1. Clean data and create staging table `st_state_destinations` from file
   [I94_SAS_Labels_I94ADDR.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt)
   based on columns `st_sd_state_code` and `st_sd_state_name`.

In [1]:
###### Imports and Installs section
import shutil
import pandas as pd
import pyspark.sql.functions as F
# import spark as spark
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType, LongType, TimestampType, DateType
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, DataFrameNaFunctions
from pyspark.sql.functions import when, count, col, to_date, datediff, date_format, month
import re
import json
from os import path

MAX_MEMORY = "5g"

spark = SparkSession\
    .builder\
    .appName("etl pipeline for project 8 - I94 data") \
    .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12")\
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .enableHiveSupport()\
    .getOrCreate()

# setting the current LOG-Level
spark.sparkContext.setLogLevel('ERROR')

In [2]:
# get data
location_to_read = "../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt"
df_st_I94_SAS_Labels_I94ADDR = spark.read.text(location_to_read)

df_st_I94_SAS_Labels_I94ADDR.printSchema()
df_st_I94_SAS_Labels_I94ADDR.show(5, False)

# extract state code (group 1) and state name (group 2) with a regex
regex_cleaned = r"^\s+'([9+A-Z]+)'='([A-Z\s.]+)'"

df_st_I94_SAS_Labels_I94ADDR_regex_cleaned = df_st_I94_SAS_Labels_I94ADDR \
    .select( F.regexp_extract('value',regex_cleaned, 1).alias('st_sd_state_code'),
             F.regexp_extract('value',regex_cleaned, 2).alias('st_sd_state_name')) \
    .drop_duplicates() \
    .orderBy("st_sd_state_code")

print(df_st_I94_SAS_Labels_I94ADDR_regex_cleaned.count())
df_st_I94_SAS_Labels_I94ADDR_regex_cleaned.show(100)

root
 |-- value: string (nullable = true)

+------------------+
|value             |
+------------------+
|	'AL'='ALABAMA'   |
|	'AK'='ALASKA'    |
|	'AZ'='ARIZONA'   |
|	'AR'='ARKANSAS'  |
|	'CA'='CALIFORNIA'|
+------------------+
only showing top 5 rows

55
+----------------+-----------------+
|st_sd_state_code| st_sd_state_name|
+----------------+-----------------+
|              99|  ALL OTHER CODES|
|              AK|           ALASKA|
|              AL|          ALABAMA|
|              AR|         ARKANSAS|
|              AZ|          ARIZONA|
|              CA|       CALIFORNIA|
|              CO|         COLORADO|
|              CT|      CONNECTICUT|
|              DC|DIST. OF COLUMBIA|
|              DE|         DELAWARE|
|              FL|          FLORIDA|
|              GA|          GEORGIA|
|              GU|             GUAM|
|              HI|           HAWAII|
|              IA|             IOWA|
|              ID|            IDAHO|
|              IL|         ILLINOIS|


In [3]:
# This step is optional
location_to_write = "../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.csv"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_I94_SAS_Labels_I94ADDR_regex_cleaned\
    .coalesce(1)\
    .write\
    .option("header", "true")\
    .csv(location_to_write, mode='overwrite')

2. Extract some demographic data from file [us-cities-demographics.json](../P8_capstone_resource_files/us-cities-demographics.json)
   like `age_median`, `population_male`, `population_female`, `population_total` or `foreign_born` and add them to staging
   table `st_state_destinations`.


In [4]:
# get data from JSON-file
location_to_read = "../P8_capstone_resource_files/us-cities-demographics.json"
df_us_cities_demographics = spark.read.json(location_to_read)

print(df_us_cities_demographics.count())
df_us_cities_demographics.printSchema()

2891
root
 |-- datasetid: string (nullable = true)
 |-- fields: struct (nullable = true)
 |    |-- average_household_size: double (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- count: long (nullable = true)
 |    |-- female_population: long (nullable = true)
 |    |-- foreign_born: long (nullable = true)
 |    |-- male_population: long (nullable = true)
 |    |-- median_age: double (nullable = true)
 |    |-- number_of_veterans: long (nullable = true)
 |    |-- race: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- state_code: string (nullable = true)
 |    |-- total_population: long (nullable = true)
 |-- record_timestamp: string (nullable = true)
 |-- recordid: string (nullable = true)



In [5]:
#Check data for further processing
df_us_cities_demographics \
    .filter("fields.state == 'Alabama'") \
    .select("fields.state_code"
            , "fields.state"
            , "fields.city"
            , "fields.median_age"
            , "fields.male_population"
            , "fields.female_population"
            , "fields.total_population"
            , "fields.foreign_born") \
    .distinct()\
    .orderBy("fields.state_code")\
    .show(50)


+----------+-------+----------+----------+---------------+-----------------+----------------+------------+
|state_code|  state|      city|median_age|male_population|female_population|total_population|foreign_born|
+----------+-------+----------+----------+---------------+-----------------+----------------+------------+
|        AL|Alabama|Tuscaloosa|      29.1|          47293|            51045|           98338|        4706|
|        AL|Alabama|    Dothan|      38.9|          32172|            35364|           67536|        1699|
|        AL|Alabama|Huntsville|      38.1|          91764|            97350|          189114|       12691|
|        AL|Alabama|    Mobile|      38.0|          91275|           103030|          194305|        7234|
|        AL|Alabama|Birmingham|      35.6|         102122|           112789|          214911|        8258|
|        AL|Alabama|Montgomery|      35.4|          94582|           106004|          200586|        9337|
|        AL|Alabama|    Hoover|      

In [6]:
# Aggregate the city-level rows to one row per state.
# Note: the population columns are averages over the city rows of a state, not state totals.
df_us_cities_demographics_agg = df_us_cities_demographics \
    .groupBy("fields.state_code", "fields.state") \
    .agg(  F.round(F.avg('fields.median_age'),1).alias('st_sd_age_median')
          ,F.avg('fields.male_population').cast(IntegerType()).alias('st_sd_population_male')
          ,F.avg('fields.female_population').cast(IntegerType()).alias('st_sd_population_female')
          ,F.avg('fields.total_population').cast(IntegerType()).alias('st_sd_population_total')
          ,F.avg('fields.foreign_born').cast(IntegerType()).alias('st_sd_foreign_born')
           ) \
    .orderBy("fields.state_code")\
    .withColumnRenamed("fields.state_code", "state_code")\
    .withColumnRenamed("fields.state", "state")
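
The aggregation above can be sketched in plain Python to show what the resulting numbers mean: each state value is the average of its city-level rows, truncated to an integer by the cast. The function name `average_by_state` is an assumption for illustration. The three rows are copied from the Alabama sample output, so the result will not match the notebook's figure for AL, which is computed over all rows of the state.

```python
from collections import defaultdict

# A few city rows taken from the Alabama sample shown earlier:
# (state_code, city, male_population, female_population)
city_rows = [
    ("AL", "Tuscaloosa", 47293, 51045),
    ("AL", "Dothan", 32172, 35364),
    ("AL", "Huntsville", 91764, 97350),
]


def average_by_state(rows):
    """Average the population columns per state and truncate to int.

    int() truncates toward zero, which mirrors casting a double to an
    integer type rather than rounding it.
    """
    grouped = defaultdict(list)
    for state_code, _city, male, female in rows:
        grouped[state_code].append((male, female))

    result = {}
    for state_code, values in grouped.items():
        n = len(values)
        avg_male = sum(v[0] for v in values) / n
        avg_female = sum(v[1] for v in values) / n
        result[state_code] = (int(avg_male), int(avg_female))
    return result


state_averages = average_by_state(city_rows)
print(state_averages)
```

Because these are per-city averages, they should be read as "size of a typical listed city in the state", not as the state's population.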

In [7]:
print(df_us_cities_demographics_agg.count())
df_us_cities_demographics_agg.printSchema()
df_us_cities_demographics_agg.show(500)

49
root
 |-- state_code: string (nullable = true)
 |-- state: string (nullable = true)
 |-- st_sd_age_median: double (nullable = true)
 |-- st_sd_population_male: integer (nullable = true)
 |-- st_sd_population_female: integer (nullable = true)
 |-- st_sd_population_total: integer (nullable = true)
 |-- st_sd_foreign_born: integer (nullable = true)

+----------+--------------------+----------------+---------------------+-----------------------+----------------------+------------------+
|state_code|               state|st_sd_age_median|st_sd_population_male|st_sd_population_female|st_sd_population_total|st_sd_foreign_born|
+----------+--------------------+----------------+---------------------+-----------------------+----------------------+------------------+
|        AK|              Alaska|            32.2|               152945|                 145750|                298695|             33258|
|        AL|             Alabama|            36.2|                72005|                  79

In [8]:
# Left join "df_st_I94_SAS_Labels_I94ADDR_regex_cleaned" with "df_us_cities_demographics_agg" to get the new data frame "df_st_state_destinations".
# Fill null values (state codes without demographic data) with 0.
df_st_state_destinations = df_st_I94_SAS_Labels_I94ADDR_regex_cleaned \
    .join(df_us_cities_demographics_agg, df_st_I94_SAS_Labels_I94ADDR_regex_cleaned.st_sd_state_code ==
          df_us_cities_demographics_agg.state_code, 'left'  ) \
    .drop("state_code", "state") \
    .withColumn("st_sd_state_name", F.initcap(col("st_sd_state_name"))) \
    .fillna(value=0.0 ,subset=['st_sd_age_median'])\
    .fillna(value=0 ,subset=['st_sd_population_male'])\
    .fillna(value=0 ,subset=['st_sd_population_female'])\
    .fillna(value=0 ,subset=['st_sd_population_total'])\
    .fillna(value=0 ,subset=['st_sd_foreign_born'])
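
As a rough illustration of the left join and null filling, the following plain-Python sketch keeps every label row and falls back to 0 where no demographic data exists. The function `build_staging_rows`, the `DEFAULTS` mapping and the reduced column set are assumptions for illustration; the AK values are taken from the aggregated output shown earlier.

```python
# Label rows: state code -> state name (upper case, as in the label file).
labels = {"AK": "ALASKA", "WY": "WYOMING", "99": "ALL OTHER CODES"}

# Aggregated demographics keyed by state code; WY and 99 have no entry.
demographics = {"AK": {"age_median": 32.2, "population_total": 298695}}

# Fill values used when the "right side" of the left join is missing.
DEFAULTS = {"age_median": 0.0, "population_total": 0}


def build_staging_rows(labels, demographics):
    """Left-join labels with demographics and fill missing values with 0."""
    rows = []
    for code, name in sorted(labels.items()):
        stats = demographics.get(code, {})  # empty dict means "no match"
        row = {"state_code": code, "state_name": name.title()}
        for column, default in DEFAULTS.items():
            row[column] = stats.get(column, default)
        rows.append(row)
    return rows


staging_rows = build_staging_rows(labels, demographics)
for row in staging_rows:
    print(row)
```

One consequence of this design is that a filled 0 cannot be told apart from a genuine 0 in the source data, so downstream queries may want to treat 0 as "unknown" for these columns.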

In [9]:
df_st_state_destinations.show()

+----------------+-----------------+----------------+---------------------+-----------------------+----------------------+------------------+
|st_sd_state_code| st_sd_state_name|st_sd_age_median|st_sd_population_male|st_sd_population_female|st_sd_population_total|st_sd_foreign_born|
+----------------+-----------------+----------------+---------------------+-----------------------+----------------------+------------------+
|              WY|          Wyoming|             0.0|                    0|                      0|                     0|                 0|
|              TN|        Tennessee|            34.3|               116458|                 126499|                242958|             20457|
|              KS|           Kansas|            34.8|                80592|                  83447|                164039|             16949|
|              IA|             Iowa|            32.5|                52119|                  53880|                106000|              9112|
|     

In [10]:
# check results
print(df_st_state_destinations.count())
df_st_state_destinations.printSchema()
df_st_state_destinations\
    .orderBy("st_sd_state_code") \
    .show(100)

55
root
 |-- st_sd_state_code: string (nullable = true)
 |-- st_sd_state_name: string (nullable = true)
 |-- st_sd_age_median: double (nullable = false)
 |-- st_sd_population_male: integer (nullable = true)
 |-- st_sd_population_female: integer (nullable = true)
 |-- st_sd_population_total: integer (nullable = true)
 |-- st_sd_foreign_born: integer (nullable = true)

+----------------+-----------------+----------------+---------------------+-----------------------+----------------------+------------------+
|st_sd_state_code| st_sd_state_name|st_sd_age_median|st_sd_population_male|st_sd_population_female|st_sd_population_total|st_sd_foreign_born|
+----------------+-----------------+----------------+---------------------+-----------------------+----------------------+------------------+
|              99|  All Other Codes|             0.0|                    0|                      0|                     0|                 0|
|              AK|           Alaska|            32.2|         

In [11]:
df_st_state_destinations

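+----------------+-----------------+----------------+---------------------+-----------------------+----------------------+------------------+
|st_sd_state_code| st_sd_state_name|st_sd_age_median|st_sd_population_male|st_sd_population_female|st_sd_population_total|st_sd_foreign_born|
+----------------+-----------------+----------------+---------------------+-----------------------+----------------------+------------------+
|              WY|          Wyoming|             0.0|                    0|                      0|                     0|                 0|
|              TN|        Tennessee|            34.3|               116458|                 126499|                242958|             20457|
|              KS|           Kansas|            34.8|                80592|                  83447|                164039|             16949|
|              IA|             Iowa|            32.5|                52119|                  53880|                106000|              9112|
|              NV|           Nevada|            36.1|               124293|                 124677|                248971|             53481|
|              IN|          Indiana|            33.8|                86272|                  92115|                178388|             13992|
|              PR|      Puerto Rico|            40.7|                71542|                  84676|                156219|                 0|
|              WI|        Wisconsin|            33.5|                76533|                  80482|                157016|             13854|
|              NY|         New York|            35.6|               433755|                 473689|                907445|            318275|
|              RI|     Rhode Island|            37.7|                51304|                  53227|                104532|             22710|
+----------------+-----------------+----------------+---------------------+-----------------------+----------------------+------------------+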


In [13]:
# store staging table
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ4/st_state_destinations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_state_destinations \
    .repartition(1) \
    .write \
    .mode('overwrite') \
    .parquet(location_to_write, compression="gzip")


3. Creation of a dimension named `d_state_destinations` based on staging table `st_state_destinations`.

In [14]:
# get data to process and store dimension "d_state_destinations"
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ4/st_state_destinations"
df_st_state_destinations = spark.read.parquet(location_to_read)

print(df_st_state_destinations.count())
df_st_state_destinations.printSchema()

df_st_state_destinations = df_st_state_destinations \
    .withColumn("d_sd_id", col("st_sd_state_code")) \
    .withColumnRenamed("st_sd_state_code", "d_sd_state_code") \
    .withColumnRenamed("st_sd_state_name", "d_sd_state_name") \
    .withColumnRenamed("st_sd_age_median", "d_sd_age_median") \
    .withColumnRenamed("st_sd_population_male", "d_sd_population_male") \
    .withColumnRenamed("st_sd_population_female", "d_sd_population_female") \
    .withColumnRenamed("st_sd_population_total", "d_sd_population_total") \
    .withColumnRenamed("st_sd_foreign_born", "d_sd_foreign_born")

df_st_state_destinations.printSchema()

# store dimension table
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_state_destinations \
    .repartition(1) \
    .write \
    .mode('overwrite') \
    .parquet(location_to_write, compression="gzip")

df_st_state_destinations.orderBy("d_sd_state_code").show(1000)

55
root
 |-- st_sd_state_code: string (nullable = true)
 |-- st_sd_state_name: string (nullable = true)
 |-- st_sd_age_median: double (nullable = true)
 |-- st_sd_population_male: integer (nullable = true)
 |-- st_sd_population_female: integer (nullable = true)
 |-- st_sd_population_total: integer (nullable = true)
 |-- st_sd_foreign_born: integer (nullable = true)

root
 |-- d_sd_state_code: string (nullable = true)
 |-- d_sd_state_name: string (nullable = true)
 |-- d_sd_age_median: double (nullable = true)
 |-- d_sd_population_male: integer (nullable = true)
 |-- d_sd_population_female: integer (nullable = true)
 |-- d_sd_population_total: integer (nullable = true)
 |-- d_sd_foreign_born: integer (nullable = true)
 |-- d_sd_id: string (nullable = true)

+---------------+-----------------+---------------+--------------------+----------------------+---------------------+-----------------+-------+
|d_sd_state_code|  d_sd_state_name|d_sd_age_median|d_sd_population_male|d_sd_population_f
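
The step from staging table to dimension is essentially a prefix rename plus a key column. A minimal sketch in plain Python, assuming a hypothetical helper `to_dimension_row` and a reduced column set, with the Tennessee values taken from the staging output above:

```python
def to_dimension_row(staging_row, old_prefix="st_sd_", new_prefix="d_sd_"):
    """Rename staging columns to dimension columns and add the key d_sd_id."""
    renamed = {}
    for key, value in staging_row.items():
        if key.startswith(old_prefix):
            key = new_prefix + key[len(old_prefix):]
        renamed[key] = value
    # The dimension key is the state code itself (a natural key).
    renamed[new_prefix + "id"] = staging_row[old_prefix + "state_code"]
    return renamed


staging_row = {
    "st_sd_state_code": "TN",
    "st_sd_state_name": "Tennessee",
    "st_sd_age_median": 34.3,
    "st_sd_population_total": 242958,
}

dimension_row = to_dimension_row(staging_row)
print(dimension_row)
```

Using the state code as the key keeps the fact table readable, at the cost of tying the key to the source system's coding scheme.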

4. Mapping of dimension `d_state_destinations` to fact table `f_i94_immigration` based on columns
   (`st_state_destinations.st_sd_state_code` --> `d_state_destinations.d_sd_id`) ==
   (`st_i94_immigration.st_i94_addr` --> `f_i94_immigration.d_sd_id`).

5. Clean fact table `f_i94_immigration` based on the dimension `d_state_destinations`. All unrecognized state codes will
   be set to `99` (all other codes).
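
The cleaning rule from step 5 can be sketched as a small function: a code is kept if it exists in the dimension, otherwise it is replaced by the catch-all code `99`. The names `clean_state_code` and `FALLBACK_CODE` are assumptions for illustration; the notebook itself implements the rule with a left join against the dimension keys.

```python
FALLBACK_CODE = "99"  # catch-all key: "All Other Codes"


def clean_state_code(raw_code, valid_codes):
    """Return raw_code if it is a known dimension key, else the fallback code."""
    return raw_code if raw_code in valid_codes else FALLBACK_CODE


# A small subset of dimension keys, for illustration only.
valid_codes = {"99", "AK", "AL", "CA", "FL", "NY"}

# Raw values as they can appear in the fact table, including a missing value.
raw_values = ["CA", "**", ".7", "00", None]
cleaned = [clean_state_code(value, valid_codes) for value in raw_values]
print(cleaned)
```

Missing values (`None`) fall into the same bucket as malformed codes, which matches the behaviour of a left join where no reference row is found.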

In [15]:
#Get data for further processing
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"
df_d_state_destinations = spark.read.parquet(location_to_read)

print(df_d_state_destinations.count())
df_d_state_destinations.printSchema()

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)

55
root
 |-- d_sd_state_code: string (nullable = true)
 |-- d_sd_state_name: string (nullable = true)
 |-- d_sd_age_median: double (nullable = true)
 |-- d_sd_population_male: integer (nullable = true)
 |-- d_sd_population_female: integer (nullable = true)
 |

In [16]:
# prepare data frame `df_f_i94_immigrations_2_join` to get only the allowed state codes
df_f_i94_immigrations_2_join = df_d_state_destinations \
    .select("d_sd_id")\
    .withColumnRenamed("d_sd_id", "d_sd_id_reference") \
    .orderBy("d_sd_id_reference")

print(df_f_i94_immigrations_2_join.count())
df_f_i94_immigrations_2_join.printSchema()
df_f_i94_immigrations_2_join.show(60)

55
root
 |-- d_sd_id_reference: string (nullable = true)

+-----------------+
|d_sd_id_reference|
+-----------------+
|               99|
|               AK|
|               AL|
|               AR|
|               AZ|
|               CA|
|               CO|
|               CT|
|               DC|
|               DE|
|               FL|
|               GA|
|               GU|
|               HI|
|               IA|
|               ID|
|               IL|
|               IN|
|               KS|
|               KY|
|               LA|
|               MA|
|               MD|
|               ME|
|               MI|
|               MN|
|               MO|
|               MS|
|               MT|
|               NC|
|               ND|
|               NE|
|               NH|
|               NJ|
|               NM|
|               NV|
|               NY|
|               OH|
|               OK|
|               OR|
|               PA|
|               PR|
|               RI|
|               SC|
| 

In [17]:
# prepare and create a cleaned column "d_sd_id_cleaned"
df_f_i94_immigrations \
    .select("d_sd_id", "f_i94_addr") \
    .join(df_f_i94_immigrations_2_join, df_f_i94_immigrations_2_join.d_sd_id_reference == df_f_i94_immigrations.d_sd_id, 'left') \
    .withColumn("d_sd_id_cleaned", when(col("d_sd_id_reference").isNull(), "99")\
                .otherwise(col("d_sd_id_reference"))) \
    .filter(col("d_sd_id_reference").isNull())\
    .distinct() \
    .orderBy("d_sd_id") \
    .show(5000)

+-------+----------+-----------------+---------------+
|d_sd_id|f_i94_addr|d_sd_id_reference|d_sd_id_cleaned|
+-------+----------+-----------------+---------------+
|     **|        **|             null|             99|
|     ..|        ..|             null|             99|
|     .7|        .7|             null|             99|
|     .9|        .9|             null|             99|
|     .A|        .A|             null|             99|
|     .C|        .C|             null|             99|
|     .D|        .D|             null|             99|
|     .F|        .F|             null|             99|
|     .H|        .H|             null|             99|
|     .I|        .I|             null|             99|
|     .K|        .K|             null|             99|
|     .L|        .L|             null|             99|
|     .M|        .M|             null|             99|
|     .N|        .N|             null|             99|
|     .O|        .O|             null|             99|
|     .S| 

In [18]:
# Clean column "d_sd_id" of the fact table: keep codes found in the dimension, set all others to "99".
df_f_i94_immigrations = df_f_i94_immigrations \
    .join(df_f_i94_immigrations_2_join, df_f_i94_immigrations_2_join.d_sd_id_reference == df_f_i94_immigrations.d_sd_id, 'left') \
    .withColumn("d_sd_id", when(col("d_sd_id_reference").isNull(), "99")\
                .otherwise(col("d_sd_id_reference"))) \
    .drop("d_sd_id_reference")

In [19]:
# check corrected column "d_sd_id"
df_f_i94_immigrations \
    .select("d_sd_id", "f_i94_addr") \
    .distinct() \
    .orderBy("d_sd_id", "f_i94_addr") \
    .show(5000)

+-------+----------+
|d_sd_id|f_i94_addr|
+-------+----------+
|     99|        **|
|     99|        ..|
|     99|        .7|
|     99|        .9|
|     99|        .A|
|     99|        .C|
|     99|        .D|
|     99|        .F|
|     99|        .H|
|     99|        .I|
|     99|        .K|
|     99|        .L|
|     99|        .M|
|     99|        .N|
|     99|        .O|
|     99|        .S|
|     99|        .T|
|     99|        .V|
|     99|        .W|
|     99|         /|
|     99|         0|
|     99|        00|
|     99|        01|
|     99|        02|
|     99|        03|
|     99|        04|
|     99|        06|
|     99|        07|
|     99|        08|
|     99|         1|
|     99|        10|
|     99|        11|
|     99|        12|
|     99|        13|
|     99|        14|
|     99|        15|
|     99|        16|
|     99|        17|
|     99|        18|
|     99|        19|
|     99|         2|
|     99|        20|
|     99|        21|
|     99|        22|
|     99|    

In [20]:
print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)



In [21]:
# write fact table f_i94_immigrations (~137.7 MB)
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ4/f_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_f_i94_immigrations \
    .repartition(1) \
    .write \
    .mode('overwrite') \
    .partitionBy('f_i94_year', 'f_i94_month') \
    .parquet(location_to_write, compression="gzip")


6. Answer Project Question 4: To which states in the U.S. do immigrants want to continue their travel after their initial
   arrival and what demographics can immigrants expect when they arrive in the destination state, such as average
   temperature, population numbers or population density?

In [22]:
#Get data for further processing
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()


location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"
df_d_state_destinations = spark.read.parquet(location_to_read)

print(df_d_state_destinations.count())
df_d_state_destinations.printSchema()

12228839
root
 |-- f_i94_cit: integer (nullable = true)
 |-- f_i94_port: string (nullable = true)
 |-- f_i94_addr: string (nullable = true)
 |-- f_i94_arrdate_iso: date (nullable = true)
 |-- f_i94_depdate_iso: date (nullable = true)
 |-- f_i94_dtadfile: date (nullable = true)
 |-- f_i94_matflag: string (nullable = true)
 |-- f_i94_count: integer (nullable = true)
 |-- f_i94_id: integer (nullable = true)
 |-- d_ic_id: integer (nullable = true)
 |-- d_ia_id: string (nullable = true)
 |-- d_da_id: date (nullable = true)
 |-- d_dd_id: date (nullable = true)
 |-- f_i94_port_state_code: string (nullable = true)
 |-- d_sd_id: string (nullable = true)
 |-- f_i94_year: integer (nullable = true)
 |-- f_i94_month: integer (nullable = true)

55
root
 |-- d_sd_state_code: string (nullable = true)
 |-- d_sd_state_name: string (nullable = true)
 |-- d_sd_age_median: double (nullable = true)
 |-- d_sd_population_male: integer (nullable = true)
 |-- d_sd_population_female: integer (nullable = true)
 |

In [24]:
# Register data frames as Views
df_f_i94_immigrations.createOrReplaceTempView("f_i94_immigrations")
df_d_state_destinations.createOrReplaceTempView("d_state_destinations")


# Answer project question 4: the top destination state is "California"
df_pq4 = spark.sql(" select "
                   "        RANK() OVER (ORDER BY count(f_i94.f_i94_count) desc) immigrants_continue_travel_rank"
                   "       ,d_sd.d_sd_state_code as state_code"
                   "       ,d_sd.d_sd_state_name as state_name"
                   "       ,count(f_i94.f_i94_count) as immigrants_continue_travel "
                   "       ,d_sd.d_sd_age_median as age_median"
                   "       ,d_sd.d_sd_population_male as population_male"
                   "       ,d_sd.d_sd_population_female as population_female"
                   "       ,d_sd.d_sd_population_total as population_total"
                   "       ,d_sd.d_sd_foreign_born as foreign_born"
                   " from f_i94_immigrations f_i94"
                   " join d_state_destinations d_sd on d_sd.d_sd_id == f_i94.d_sd_id"
                   " group by state_code"
                   "         ,state_name"
                   "         ,age_median"
                   "         ,population_male"
                   "         ,population_female"
                   "         ,population_total"
                   "         ,foreign_born"
                   " order by immigrants_continue_travel desc ")

df_pq4.show(500)


+-------------------------------+----------+-----------------+--------------------------+----------+---------------+-----------------+----------------+------------+
|immigrants_continue_travel_rank|state_code|       state_name|immigrants_continue_travel|age_median|population_male|population_female|population_total|foreign_born|
+-------------------------------+----------+-----------------+--------------------------+----------+---------------+-----------------+----------------+------------+
|                              1|        CA|       California|                   1643595|      36.2|          90319|            92290|          182609|       54821|
|                              2|        FL|          Florida|                   1574311|      39.5|          70602|            75919|          145523|       35340|
|                              3|        NY|         New York|                   1387808|      35.6|         433755|           473689|          907445|      318275|
|         
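
The query logic (inner join on the state key, count per state, `RANK()` by count) can be sketched without Spark as follows. The function `rank_destinations` and the toy fact values are assumptions for illustration and do not reproduce the real counts. The tie handling follows `RANK()` semantics: equal counts share a rank and the next rank is skipped.

```python
from collections import Counter


def rank_destinations(fact_state_codes, dimension):
    """Count fact rows per state and rank them like SQL RANK() (descending).

    Fact rows whose code is not a key of `dimension` are dropped,
    which corresponds to an inner join.
    """
    counts = Counter(code for code in fact_state_codes if code in dimension)
    # Sort by count descending, then by code for a stable, repeatable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    ranked = []
    previous_count = None
    rank = 0
    for position, (code, count) in enumerate(ordered, start=1):
        if count != previous_count:  # new count value: rank jumps to position
            rank = position
            previous_count = count
        ranked.append((rank, code, dimension[code], count))
    return ranked


# Toy data: invented fact rows, dimension names as in the notebook output.
dimension = {"CA": "California", "FL": "Florida", "NY": "New York"}
fact_state_codes = ["CA", "FL", "CA", "NY", "CA", "FL", "NY", "ZZ"]

ranking = rank_destinations(fact_state_codes, dimension)
for row in ranking:
    print(row)
```

Since every fact row has been mapped to a valid key (or `99`) in the cleaning step, the inner join in the notebook should not drop rows; the `ZZ` row here only demonstrates what would happen without that cleaning.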