# Project 08 - Analysis of U.S. Immigration (I-94) Data
### Udacity Data Engineer - Capstone Project
> by Peter Wissel | 2021-05-05

## Project Overview
This project works with a data set for immigration to the United States. The supplementary datasets will include data on
airport codes, U.S. city demographics and temperature data.

The process is divided into the following five steps, which show how to answer the questions posed by the business
analytics team:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up


### Step 1: Scope the Project and Gather Data

##### Scope of the Project
Based on the given data set, the following four project questions (PQ) are posed for business analysis and need to be
answered in this project. The data pipeline and the star data model are fully aligned with these questions.

1. From which country do immigrants come to the U.S. and how many?
2. At what airports do foreign persons arrive for immigration to the U.S.?
3. At what times do foreign persons arrive for immigration to the U.S.?
4. To which states in the U.S. do immigrants want to continue their travel after their initial arrival and what
   demographics can immigrants expect when they arrive in the destination state, such as average temperature, population
   numbers or population density?


##### Gather Data
The project works primarily with a dataset based on immigration data (I94) to the United States.

- Gathering Data (given data sets):
    1. [Immigration data '18-83510-I94-Data-2016' to the U.S.](https://travel.trade.gov/research/programs/i94/description.asp)
    2. [airport-codes_csv.csv: Airports around the world](https://datahub.io/core/airport-codes#data)
    3. [us-cities-demographics.csv: US cities and their demographic information](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/)
    4. [GlobalLandTemperaturesByCity.csv: Temperature grouped by City and Country](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)

### Step 2: Explore and Assess the Data
This step explores the given data sets to find insights and data quality issues.

#### Summary for Immigration data `18-83510-I94-Data-2016` to the U.S.:
* **Source**: [Visitor Arrivals Program (I-94 Form)](https://travel.trade.gov/research/programs/i94/description.asp)
* **Description**: [I94_SAS_Labels_Descriptions.SAS](../P8_capstone_resource_files/I94_SAS_Labels_Descriptions.SAS) file
contains descriptions for the I94 data
* **Data**: Monthly files for the year 2016
* **Format**: SAS (SAS7BDAT - e.g. `i94_apr16_sub.sas7bdat`)
* **Rows**: Over 3 million rows in each file. In total, about 40 million rows.
* **Data description**: Data has 29 columns containing information about event date, arriving person, airport, airline, etc.
![I94-immigration-data example](../P8_capstone_documentation/10_P8_immigration_data_sample.png)
NOTE: This data set is commercial and must be purchased. The year 2016 is included and available for the Udacity DEND course.

##### Immigration data '18-83510-I94-Data-2016' to the U.S.
   The descriptions for the listed columns were taken from file [I94_SAS_Labels_Descriptions.SAS](../P8_capstone_resource_files/I94_SAS_Labels_Descriptions.SAS).

    - **i94yr:** 4 digit year
    - **i94mon:** numeric month
    - **i94cit + i94res:** Country where the immigrants come from - `Country code, country name`
    Look at file [I94_SAS_Labels_I94CIT_I94RES.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94CIT_I94RES.txt) for more details.

            438 =  'AUSTRALIA'
            112 =  'GERMANY'
    ! Note that the I94 country codes are different from the ISO country numbers.

   - **i94port:** arrival airport - `Airport code, Airport city, State of Airport`. Note that the airport code is **not** the same as the [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) code.
     [IATA-Code Search Engine](https://www.iatacodes.de/)

   The port codes in the I-94 data do not follow the IATA standard. For example, `SFR` is used for San
   Francisco Airport rather than the more common `SFO` designation.

            'SFR'	=	'SAN FRANCISCO, CA     '
            'LOS'	=	'LOS ANGELES, CA       '
            'NYC'	=	'NEW YORK, NY          '

    Look at file [I94_SAS_Labels_I94PORT.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt) for more details.

   - **arrdate:** Arrival date in the U.S. (SAS Date format)

            SAS: Start Date is 01.01.1960 (SAS - Days since 1/1/1960: 0)
            Example:
            01.01.1960: (SAS: Days since 1/1/1960: 0)
            01.01.1970: (SAS: Days since 1/1/1960: 3653)

        Take a look at [Free SAS Date Calculator](https://www.sastipsbyhal.com)


    - **i94mode:** Type of immigration to U.S.
    Look at file [I94_SAS_Labels_I94MODE.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94MODE.txt) for more details.

            1 = 'Air'
            2 = 'Sea'
            3 = 'Land'
            9 = 'Not reported'

    - **i94addr:** State to which the immigrants want to travel.
      Look at file [I94_SAS_Labels_I94ADDR.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt) for more details.

            'AL'='ALABAMA'
            'IN'='INDIANA'

    - **depdate:** Departure date from USA (SAS Date format) -> look at `arrdate` for calculation

    - **i94bir:** Age of respondent in years
    - **i94visa:** Visa codes collapsed into three categories:
      Look at file [I94_SAS_Labels_I94VISA.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94VISA.txt) for more details.

            1 = Business
            2 = Pleasure
            3 = Student

    - **count:** value is for summary statistics
    - **dtadfile:** Date added to I-94 Files - Character date field as YYYYMMDD (represents `arrdate`)
    - **visapost:** Department of state where Visa was issued
    - **occup:** Occupation that will be performed in U.S.
    - **entdepa:** Arrival Flag - admitted or paroled into the U.S.
    - **entdepd:** Departure Flag - Departed, lost I-94 or is deceased
    - **entdepu:** Update Flag - Either apprehended, overstayed, adjusted to perm residence
    - **matflag:** Match flag - Match of arrival and departure records
    - **biryear:** 4 digit year of birth
    - **dtaddto:** Date to which admitted to U.S. (allowed to stay until) - Character date field as MMDDYYYY (represents `depdate`)
    - **gender:** Gender - Non-immigrant sex
    - **insnum:** INS (Immigration and Naturalization Service) number
    - **airline:** Airline used to arrive in U.S.
    - **admnum:** Admission Number
    - **fltno:** Flight number of Airline used to arrive in U.S.
    - **visatype:** Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
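
The SAS date arithmetic described for `arrdate` and `depdate` can be sketched in plain Python. This is only a minimal illustration that is independent of Spark; the name `sas_days_to_date` is a hypothetical helper and not part of the project code.

```python
from datetime import date, timedelta

# SAS counts days starting at 1960-01-01 (day 0)
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(sas_days):
    """Convert a SAS day count (days since 1960-01-01) into a calendar date."""
    # The I-94 files store the day count as a floating point number, so cast to int first
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_days_to_date(0))     # 1960-01-01
print(sas_days_to_date(3653))  # 1970-01-01
```

Both printed values match the examples listed for `arrdate` above.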


##### Imports and Installs section

In [None]:
import shutil
import pandas as pd
import pyspark.sql.functions as F
# import spark as spark
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType, LongType, TimestampType, DateType
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, DataFrameNaFunctions
from pyspark.sql.functions import when, count, col, to_date, datediff, date_format, month
import re
import json
from os import path

##### Create Pandas and SparkSession to create data frames from source data

In [None]:
# If code will be executed in Udacity workbench --> use the following config(...)
#spark = SparkSession.builder.config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11").enableHiveSupport().getOrCreate()

# The version number for "saurfang:spark-sas7bdat" had to be updated for the local installation
MAX_MEMORY = "5g"

# Note: appName() is set only once; a second appName() call would override the first one
spark = SparkSession\
    .builder\
    .appName("etl pipeline for project 8 - I94 data") \
    .config("spark.jars.packages","saurfang:spark-sas7bdat:3.0.0-s_2.12")\
    .config('spark.sql.repl.eagerEval.enabled', True) \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .enableHiveSupport()\
    .getOrCreate()

# setting the current LOG-Level
spark.sparkContext.setLogLevel('ERROR')


In [None]:
# Read data from Immigration data '18-83510-I94-Data-2016' to the U.S.
filepath = '../P8_capstone_resource_files/immigration_data/18-83510-I94-Data-2016/i94_feb16_sub.sas7bdat'
df_pd_i94 = pd.read_sas(filepath, format=None, index=None, encoding=None, chunksize=None, iterator=False)

In [None]:
# Show data (1st 5 rows)
df_pd_i94.head()

In [None]:
# Show data (last 5 rows)
df_pd_i94.tail()

In [None]:
# Get an overview about filled fields (not null)
df_pd_i94.count()

#### Summary for Airport Codes [`airport-codes_csv.csv`](../P8_capstone_resource_files/airport-codes_csv.csv):
* **Source**: [datahub.io - Airport codes](https://datahub.io/core/airport-codes#data)
* **Description**: Airport codes from around the world contain codes that may refer to either IATA airport code, a
  three-letter code which is used in passenger reservation, ticketing and baggage-handling systems, or the ICAO airport
  code which is a four letter code used by ATC systems and for airports that do not have an IATA airport code.
* **Data**: Large file, containing information about all airports from [this site](https://ourairports.com/data/)
* **Format**: CSV File - Comma separated text file format
* **Rows**: over 55k
* **Data description**: Detailed information about each listed airport is displayed in 12 columns.
  ![08_P8_airport-codes_csv.png](../P8_capstone_documentation/08_P8_airport-codes_csv.png)


##### Read data from file Airport Codes: `airport-codes_csv.csv`

In [None]:
filepath = '../P8_capstone_resource_files/airport-codes_csv.csv'
df_pd_airport = pd.read_csv(filepath)

In [None]:
# Show data (1st 5 rows)
df_pd_airport.head()

In [None]:
# Show data (last 5 rows)
df_pd_airport.tail()

In [None]:
# Get an overview about filled fields
df_pd_airport.count()

#### Summary for US Cities: Demographics [`us-cities-demographics.json`](../P8_capstone_resource_files/us-cities-demographics.json):
* **Source:** [US Cities: Demographics ](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/information/)
* **Description:** This dataset contains information about the demographics of all US cities and census-designated places
  with a population greater or equal to 65,000. This data comes from the [US Census Bureau's 2015 American Community Survey](https://www.census.gov/en.html).
* **Data:** Structured data about City, State, Age, Population, etc.
* **Format:** JSON File - Structured data
* **Rows:** 2.8k
* **Data description:** 12 columns describing facts from cities across the U.S. about demographics.
  ![15_P8_us-cities-demographics.png](../P8_capstone_documentation/12_P8_us-cities-demographics.png)


##### Read data from file US Cities and their demographic information: `us-cities-demographics.json`

In [None]:
filepath = '../P8_capstone_resource_files/us-cities-demographics.json'
df_pd_us_cities = pd.read_json(filepath, orient='columns')

In [None]:
# Show data (1st 5 rows)
df_pd_us_cities.head()

In [None]:
# Show data (last 5 rows)
df_pd_us_cities.tail()

In [None]:
# Get an overview about filled fields
df_pd_us_cities.count()

#### Summary for World Temperature Data [`GlobalLandTemperaturesByCity.csv`](../P8_capstone_resource_files/GlobalLandTemperaturesByCity.csv):
* **Source:** [World Temperature Data: Temperature grouped by City and Country](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)
* **Description:** Climate Change: Earth Surface Temperature Data. Global temperatures since 1750.
* **Data:**  Structured data about Average Temperature, City, Country, Location (Latitude and Longitude)
* **Format:** CSV File - Comma separated text file format
* **Rows:** 8.5 million entries
* **Data description:** Temperature record as time series information since 1750.
  ![09_P8_GlobalLandTemperaturesByCity.png](../P8_capstone_documentation/09_P8_GlobalLandTemperaturesByCity.png)
* **Note:** Temperature data must be formatted correctly

##### Read data from World Temperature Data where Temperature is grouped by City and Country: `GlobalLandTemperaturesByCity.csv`

In [None]:
filepath = '../P8_capstone_resource_files/GlobalLandTemperaturesByCity.csv'
df_pd_temperature = pd.read_csv(filepath)

In [None]:
# Show data (1st 5 rows)
df_pd_temperature.head()

In [None]:
# Show data (last 5 rows)
df_pd_temperature.tail()

In [None]:
# Get an overview about filled fields
df_pd_temperature.count()


#### Findings from Immigration data `18-83510-I94-Data-2016` to the U.S.:

###### 1. `df_spark_i94.i94cit`:
- The country code does not match the `iso-3166` country code, which complicates further analysis
- Null values in column `i94cit`

###### 2. `df_spark_i94.i94port`:
- Airport code `i94port` from file [I94_SAS_Labels_I94PORT.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt)
does not correspond to the 3-letter [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) airport codes.
**Project decision**: Only the given I-94 airport codes are used.

###### 3. `df_spark_i94.arrdate` / `df_spark_i94.depdate`:
- `arrdate` and `depdate` are in SAS date format (a numeric day count), whose epoch starts on 1960-01-01. These date values will be converted into DateType.

###### 4. `df_spark_i94.i94addr`:
- Null values in column `i94addr`

###### 5. [I94_SAS_Labels_I94ADDR.txt.I94ADDR](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt):
- The `I94ADDR` state descriptions contain a typo: 'WI'='WISCONS**O**N' instead of 'WI'='WISCONS**I**N'. **Project decision:**
This single incorrect US state name will be corrected manually.
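
The manual correction from finding 5 could be applied while parsing the label file. The following is a minimal sketch in plain Python, assuming that every label line has the form `'AL'='ALABAMA'` as shown earlier; `parse_state_labels` and `STATE_NAME_FIXES` are hypothetical names, not part of the project code.

```python
import re

# One label line looks like  'AL'='ALABAMA'
LABEL_PATTERN = re.compile(r"'([^']+)'\s*=\s*'([^']*)'")

# Manual corrections keyed by state code (only the typo named in the findings)
STATE_NAME_FIXES = {"WI": "WISCONSIN"}


def parse_state_labels(lines):
    """Return {state_code: state_name} built from I94ADDR label lines."""
    states = {}
    for line in lines:
        match = LABEL_PATTERN.search(line)
        if match is None:
            continue  # skip lines that are not label assignments
        code = match.group(1).strip()
        name = match.group(2).strip()
        # use the corrected name if one is registered for this code
        states[code] = STATE_NAME_FIXES.get(code, name)
    return states


sample = ["'AL'='ALABAMA'", "'IN'='INDIANA'", "'WI'='WISCONSON'"]
print(parse_state_labels(sample))
```

Keeping the correction in a small dictionary makes the manual decision visible in one place instead of editing the source file.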


### Step 3.0: Define the Data Model
#### Conceptual Data Model
The following data model is based on the four main questions to be answered. For this reason, I decided to select only
the fields from the source data that are needed to answer them. After the data has been read in and written to the
staging tables for transformation, a star data model will be created for data analysis. Note: This project is not large enough
to justify storing the data in a core data warehouse (3NF) as a preliminary stage.

![I94-Data Staging Tables](../P8_capstone_documentation/05_P8_I94-Data_Staging_Tables.png)
![I94-Data ER-Model](../P8_capstone_documentation/06_P8_I94-Data_Model.png)

#### Mapping Out Data Pipelines
The following steps describe how the data is piped into the chosen data model to answer the project questions.

**Project decision:** As already mentioned, the data model is built up step by step, always with the project questions in mind.
Another common way to build a star data model is to create all staging tables first and then the dimension and fact tables.
That approach is not used in the following description.


##### 3.1.1. From which country do immigrants come to the U.S. and how many? [(Data pipeline)](#question1_data_pipeline) <a name="question1_description"></a>
1. Clean data and create staging table `st_i94_immigration` from files `i94_<month>16_sub.sas7bdat`
2. Clean data and create staging table `st_immigration_countries` from file
   [`I94_SAS_Labels_I94CIT_I94RES.txt`](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94CIT_I94RES.txt)
3. Creation of a fact table named `f_i94_immigration` based on staging table `st_i94_immigration`.
4. Creation of a dimension named `d_immigration_countries` based on staging table `st_immigration_countries`.
5. Mapping of dimension `d_immigration_countries` to  fact table `f_i94_immigration` based on columns
   (`st_i94_immigration.st_i94_cit` --> `f_i94_immigration.d_ic_id`) == (`st_immigration_countries.st_ic_country_code`
   --> `d_immigration_countries.d_ic_id` )
6. Answer Project Question 1: From which country do immigrants come to the U.S. and how many?

**NOTE:** The three columns `st_i94_port_iso`, `st_i94_port_state_code` and `st_i94_port_city` will be inserted after
creation of the staging table `st_immigration_airports` within the next step.
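
The idea behind steps 2 to 6 can be sketched without Spark: parse the country labels into a lookup (the dimension) and count the arrivals per country by resolving the country code of each fact row. This is only an illustration under stated assumptions. The label format follows the example `438 =  'AUSTRALIA'` shown earlier, the fact rows are made-up sample values, and `parse_country_labels` and `arrivals_by_country` are hypothetical helper names.

```python
import re
from collections import Counter

# One label line looks like   438 =  'AUSTRALIA'
COUNTRY_PATTERN = re.compile(r"^\s*(\d+)\s*=\s*'([^']*)'")


def parse_country_labels(lines):
    """Return {country_code: country_name} built from I94CIT/I94RES label lines."""
    countries = {}
    for line in lines:
        match = COUNTRY_PATTERN.match(line)
        if match:
            countries[int(match.group(1))] = match.group(2).strip()
    return countries


def arrivals_by_country(fact_rows, countries, unknown="UNKNOWN"):
    """Sum the arrival counts per country name; unmapped codes fall back to `unknown`."""
    totals = Counter()
    for country_code, arrivals in fact_rows:
        totals[countries.get(country_code, unknown)] += arrivals
    return totals


labels = ["   438 =  'AUSTRALIA'", "   112 =  'GERMANY'"]
# made-up fact rows: (country code, count); code 0 stands for a filled null value
facts = [(438, 1), (112, 1), (112, 1), (0, 1)]

countries = parse_country_labels(labels)
print(arrivals_by_country(facts, countries).most_common())
```

In the actual pipeline the same lookup is expressed as a join between the fact table and the dimension `d_immigration_countries`.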

##### 3.1.2. At what airports do foreign persons arrive for immigration to the U.S.? [(Data pipeline)](#question2_data_pipeline) <a name="question2_description"></a>
**Airport dimension**
1. Clean data and create staging table `st_immigration_airports` from file
   [`I94_SAS_Labels_I94PORT.txt`](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt)
   with the columns `st_ia_airport_code` as referencing column, `st_ia_airport_name` and `st_ia_airport_state_code`.

    Note that the I-94 airport code is **not** the same as the [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) code and
    does not correspond to it. Therefore, `SFR` (I94: 'SFR' = 'SAN FRANCISCO, CA') is used for San
    Francisco Airport in this scenario instead of `SFO`. `SFR` means normally San Fernando, CA, USA.

    **Project decision:** Data from file [airport-codes.csv](../P8_capstone_resource_files/airport-codes_csv.csv) will **not** be linked to the
    I-94 airport codes because incorrect assignments should not be made.
2. Add the column `st_i94_port_state_code` to staging table `st_i94_immigration` based on staging table `st_immigration_airports`. This
   information is needed to connect the `us-cities-demographics.json` file later on.
   `st_ia_airport_state_code --> st_i94_port_state_code`
3. Add column `st_i94_port_state_code --> f_i94_port_state_code` to fact table `f_i94_immigrations`
4. Creation of a dimension named `d_immigration_airports` based on staging table `st_immigration_airports`.
5. Mapping of dimension `d_immigration_airports` to  fact table `f_i94_immigration` based on columns
   (`st_immigration_airports.st_ia_airport_code` --> `d_immigration_airports.d_ia_id`) ==
   (`st_i94_immigration.st_i94_port` --> `f_i94_immigration.d_ia_id`).
6. Answer Project Question 2: At what airports do foreign persons arrive for immigration to the U.S.?
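
Step 1 splits each port label into an airport code, a name and a state code. A minimal sketch of that split in plain Python, assuming the label format `'SFR'	=	'SAN FRANCISCO, CA     '` shown earlier; `parse_port_label` is a hypothetical helper, and returning `None` as state code for descriptions without a comma is an assumption of this sketch.

```python
import re

# One label line looks like  'SFR'<tab>=<tab>'SAN FRANCISCO, CA     '
PORT_PATTERN = re.compile(r"'([^']+)'\s*=\s*'([^']*)'")


def parse_port_label(line):
    """Return (airport_code, airport_name, state_code) or None if the line is no label."""
    match = PORT_PATTERN.search(line)
    if match is None:
        return None
    code = match.group(1).strip()
    description = match.group(2).strip()
    # The state code follows the last comma of the description
    name, separator, state = description.rpartition(",")
    if not separator:
        # no comma found: keep the whole description as name, state code unknown
        return code, description, None
    return code, name.strip(), state.strip()


print(parse_port_label("'SFR'\t=\t'SAN FRANCISCO, CA     '"))
# ('SFR', 'SAN FRANCISCO', 'CA')
```

Splitting at the last comma keeps names that themselves contain a comma intact.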

##### 3.1.3. At what times do foreign persons arrive for immigration to the U.S.? [(Data pipeline)](#question3_data_pipeline) <a name="question3_description"></a>
**Date dimensions**

`st_i94_arrdate` and `st_i94_depdate` from staging table `st_i94_immigration` contain dates in the SAS-specific date format.
The SAS date calculation starts on 1960-01-01. These columns are converted to DateType in the staging table
`st_i94_immigrations` as columns named `st_i94_arrdate_iso` and `st_i94_depdate_iso`.

Get date values from columns `st_i94_immigration.st_i94_arrdate_iso` and `st_i94_immigration.st_i94_depdate_iso`.
Determine a valid MIN(), MAX() and default (null value representation) date. Clean the data and rewrite staging table `st_i94_immigrations` if needed.
Finally, create the two gapless dimensions `d_date_arrivals` and `d_date_departures` from these values.

1. Read data and get the min() and max() values of `st_i94_arrdate_iso` and `st_i94_depdate_iso`
2. Clean date column `st_i94_depdate_iso`: Valid entries are between 2016-01-01 and 2017-06-14. Earlier and later values
   will be set to the default value (1900-01-01), which represents null
3. Update fact table `f_i94_immigrations` with the cleaned values of column `st_i94_depdate_iso`
4. Generate new date staging tables (`st_date_arrivals`, `st_date_departures`) based on default, min and max values
5. Append date-specific columns to the staging tables, create a dimension from each of them and store it
6. Map dimension `d_date_arrivals` to  fact table `f_i94_immigration` based on columns
   (`st_date_arrivals.st_da_date` --> `d_date_arrivals.d_da_id`) == (`st_i94_immigration.st_i94_arrdate_iso` --> `f_i94_immigration.d_da_id`).
7. Map dimension `d_date_departures` to  fact table `f_i94_immigration` based on columns
   (`st_date_departures.st_dd_date` --> `d_date_departures.d_dd_id`) == (`st_i94_immigration.st_i94_depdate_iso` --> `f_i94_immigration.d_dd_id`).
8. Answer Project Question 3.1: At what times do foreign persons arrive for immigration to the U.S.?
9. Answer Project Question 3.2: When a foreign person comes to the U.S. for immigration, do they travel on to another state?
10. Answer Project Question 3.3: If a foreign person travels to another state, after which period of time does this happen?
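
The cleaning rule from step 2 can be sketched as a small function. This is a plain Python illustration, not the Spark implementation; treating both interval bounds as inclusive is an assumption, and `clean_departure_date` is a hypothetical name.

```python
from datetime import date

DEFAULT_DATE = date(1900, 1, 1)  # default value that represents null
VALID_FROM = date(2016, 1, 1)    # assumption: bounds are inclusive
VALID_TO = date(2017, 6, 14)


def clean_departure_date(value):
    """Return `value` if it lies in the valid range, otherwise the default date."""
    if value is None or not (VALID_FROM <= value <= VALID_TO):
        return DEFAULT_DATE
    return value


print(clean_departure_date(date(2016, 4, 30)))  # 2016-04-30
print(clean_departure_date(date(2084, 5, 16)))  # 1900-01-01
```

Mapping invalid dates to one default value means every fact row can still be joined to a row of the date dimension.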


The creation of those two date dimensions is based on one physical table. This method is called
[Role-Playing Dimensions](https://dba.stackexchange.com/questions/137971/how-many-date-dimensions-for-one-fact)
![Role-Playing Dimension](../P8_capstone_documentation/11_P8_RolePlayingDimension.png).
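
A gapless date dimension that is reused in two roles could look like the following sketch. It is written in plain Python under stated assumptions: the column names (`date`, `year`, `month`, `day`, `weekday`) and the function name `build_date_dimension` are hypothetical, and the default date is expected to lie outside the range between min and max.

```python
from datetime import date, timedelta


def build_date_dimension(min_date, max_date, default_date=date(1900, 1, 1)):
    """Return one row per day from min_date to max_date, preceded by the default date."""
    days = [default_date]
    current = min_date
    while current <= max_date:
        days.append(current)
        current += timedelta(days=1)  # step day by day so that no gaps occur
    return [
        {
            "date": day,
            "year": day.year,
            "month": day.month,
            "day": day.day,
            "weekday": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
        }
        for day in days
    ]


date_dimension = build_date_dimension(date(2016, 1, 1), date(2016, 1, 3))

# Role-playing: the same physical rows serve both roles
d_date_arrivals = date_dimension
d_date_departures = date_dimension

print(len(date_dimension))  # 4 rows: the default date plus three days
```

Because both roles point to the same rows, a change to the date attributes has to be made only once.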


##### 3.1.4. To which states in the U.S. do immigrants want to continue their travel after their initial arrival and what demographics can immigrants expect when they arrive in the destination state, such as average temperature, population numbers or population density? [(Data pipeline)](#question4_data_pipeline) <a name="question4_description"></a>
1. Clean data and create staging table `st_state_destinations` from file
   [I94_SAS_Labels_I94ADDR.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt)
   based on columns `st_sd_state_code` and `st_sd_state_name`.
2. Extract some demographic data from file [us-cities-demographics.json](../P8_capstone_resource_files/us-cities-demographics.json)
   like `age_median`, `population_male`, `population_female`, `population_total` or `foreign_born` and add them to staging
   table `st_state_destinations`.
3. Creation of a dimension named `d_state_destinations` based on staging table `st_state_destinations`.
4. Mapping of dimension `d_state_destinations` to  fact table `f_i94_immigration` based on columns
   (`st_state_destinations.st_sd_state_code` --> `d_state_destinations.d_sd_id`) ==
   (`st_i94_immigration.st_i94_addr` --> `f_i94_immigration.d_sd_id`).
5. Clean fact table `f_i94_immigration` based on the dimension `d_state_destinations`. All unrecognizable state codes will
   be set to 99 (all other countries).
6. Answer Project Question 4: To which states in the U.S. do immigrants want to continue their travel after their initial
   arrival and what demographics can immigrants expect when they arrive in the destination state, such as average
   temperature, population numbers or population density?
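
Step 2 adds city-level demographic figures to a state-level staging table, so the city rows have to be aggregated per state. The sketch below shows one possible way in plain Python. It assumes flattened records with the hypothetical keys `city`, `state_code` and `population_total`, and it assumes that a city may appear in more than one row and must be counted only once. The sample numbers are made up for illustration.

```python
from collections import defaultdict


def population_by_state(city_rows):
    """Sum `population_total` per state, counting every (city, state) pair once."""
    seen = set()
    totals = defaultdict(int)
    for row in city_rows:
        key = (row["city"], row["state_code"])
        if key in seen:
            continue  # the same city is listed again; do not count it twice
        seen.add(key)
        totals[row["state_code"]] += row["population_total"]
    return dict(totals)


# made-up sample rows, not taken from the data set
sample_rows = [
    {"city": "City A", "state_code": "CA", "population_total": 100},
    {"city": "City A", "state_code": "CA", "population_total": 100},
    {"city": "City B", "state_code": "CA", "population_total": 50},
    {"city": "City C", "state_code": "NY", "population_total": 70},
]
print(population_by_state(sample_rows))  # {'CA': 150, 'NY': 70}
```

Values such as `age_median` cannot simply be summed; how they are combined per state is a separate design decision that this sketch does not cover.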

### Step 4.0: Run ETL to Model the Data
The following steps describe the ETL data pipeline.

#### 4.1 Create the data model
The data pipelines are built to create the data model.

##### 4.1.1. From which country do immigrants come to the U.S. and how many? [(Description)](#question1_description) <a name="question1_data_pipeline"></a>

1. Clean data and create staging table `st_i94_immigration` from files `i94_<month>16_sub.sas7bdat`

##### Convert SAS data into spark parquet files as 1st staging step #####

In [None]:
# original path in Udacity workspace
#df_spark =spark.read.format('com.github.saurfang.sas.spark')
# .load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

# The SAS files (e.g. i94_apr16_sub.sas7bdat) are split by month. The for loop reads each file and appends it
# to one parquet result set that is partitioned by month.

months_abbreviation = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]

# target location of the parquet result set (all months)
location_to_write = "../P8_capstone_resource_files/parquet_raw/i94_sas_data"

# Delete the folder once before the loop if it already exists.
# Deleting it inside the loop would remove the months written in earlier iterations.
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

for month_abbreviation in months_abbreviation:

    filepath_i94 = f"../P8_capstone_resource_files/immigration_data/18-83510-I94-Data-2016/" \
                   f"i94_{month_abbreviation}16_sub.sas7bdat"
    print(filepath_i94)

    # load current month
    # Note: optional load parameters, e.g.
    #       .load(filepath_i94, forceLowercaseNames=True, inferLong=True)
    df_spark_i94 = spark\
        .read\
        .format('com.github.saurfang.sas.spark')\
        .load(filepath_i94)

    # write data frame and append each month to the same parquet result set (ca. 815 MB in total)
    df_spark_i94 \
        .repartition(1) \
        .write \
        .mode('append') \
        .partitionBy('i94mon') \
        .parquet(location_to_write, compression="gzip")


##### Optional write methods (.csv & .csv.gz)

In [None]:
location_to_write = "../P8_capstone_resource_files/parquet_raw/i94_data.csv"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

# write data frame as uncompressed CSV file (approx. 5.9 GB)
df_spark_i94\
    .coalesce(1)\
    .write\
    .option("header", "true")\
    .csv(location_to_write)

In [None]:
# target folder of the compressed CSV file
location_to_write = "../P8_capstone_resource_files/parquet_raw/i94_data.csv.gz"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

# write data frame as compressed CSV file (approx. 885 MB)
df_spark_i94\
    .coalesce(1)\
    .write\
    .option("header", "true")\
    .option("codec", "org.apache.hadoop.io.compress.GzipCodec")\
    .csv(location_to_write)

##### Check written data frame

In [None]:
# Read written data frame back into memory
df_spark_i94 = spark.read.parquet("../P8_capstone_resource_files/parquet_raw/i94_sas_data")

# Alternatively, read only a subset of months
#df_spark_i94 = spark.read.parquet("../P8_capstone_resource_files/parquet/i94_sas_data/i94mon=12.0")
"""
df_spark_i94 = spark.read\
    .parquet("../P8_capstone_resource_files/parquet_raw/i94_sas_data/i94mon=4.0",
             "../P8_capstone_resource_files/parquet_raw/i94_sas_data/i94mon=5.0",
             "../P8_capstone_resource_files/parquet_raw/i94_sas_data/i94mon=6.0")


df_spark_i94 = spark.read\
    .parquet("../P8_capstone_resource_files/parquet_raw/i94_sas_data/i94mon=1.0",
             "../P8_capstone_resource_files/parquet_raw/i94_sas_data/i94mon=2.0",
             "../P8_capstone_resource_files/parquet_raw/i94_sas_data/i94mon=3.0")
"""

In [None]:
# Count the rows of the data frame
df_spark_i94.count()

In [None]:
# check current df Schema
df_spark_i94.printSchema()


In [None]:
# Show Summary statistics. Attention: This could take very long to compute!
df_spark_i94.describe()

In [None]:
# Check whether the conversion step worked as expected.
# NOTE: The following cells use a Python UDF. Spark therefore has to run the Python interpreter of the project's
# virtual environment: ..../Project_8_Data_Engineering_Capstone_Project/venv/bin/python3

In [None]:
# Preparation to get an enumeration of all elements within the data frame and a
# UDF to convert a SAS date (numeric day count) into a DateType() value.

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window
w = Window().orderBy(lit('A'))

# Function to convert a SAS date into a date value
def get_date_from_sas_date(sas_date):
    """
    Convert a SAS date (days since 1960-01-01) into a date value.
    If sas_date is missing or not greater than 0, the default value 1900-01-01 is returned.
    """
    if sas_date is None:
        return datetime(1900, 1, 1).date()
    sas_date_int = int(sas_date)
    if sas_date_int > 0:
        return (datetime(1960, 1, 1) + timedelta(days=sas_date_int)).date()
    else:
        return datetime(1900, 1, 1).date()

# register UDF function to calculate a DateType from given SAS date format
getDateFomSASDate = F.udf(lambda y: get_date_from_sas_date(y), DateType())
spark.udf.register("getDateFomSASDate", getDateFomSASDate)

In [None]:
## Transformation of the originally stored data from files `i94_<month>16_sub.sas7bdat`
# read parquet file
# fill up null values
# convert data into new columns
# select only needed columns

df_st_i94_immigrations = spark\
    .read\
    .parquet("../P8_capstone_resource_files/parquet_raw/i94_sas_data")\
    .fillna(value=0.0 ,subset=['i94cit'])\
    .fillna(value='99', subset=['i94addr'])\
    .fillna(value=0.0, subset=['depdate'])\
    .fillna(value='99991231', subset=['dtadfile'])\
    .fillna(value='NA', subset=['matflag'])\
    .withColumn("st_i94_cit", F.round("i94cit", 0).cast(IntegerType()))\
    .withColumn("st_i94_port", col("i94port"))\
    .withColumn("st_i94_addr", col("i94addr"))\
    .withColumn("st_i94_arrdate", F.round("arrdate").cast(IntegerType()))\
    .withColumn("st_i94_arrdate_iso", getDateFomSASDate("arrdate"))\
    .withColumn("st_i94_depdate", F.round("depdate").cast(IntegerType()))\
    .withColumn("st_i94_depdate_iso", getDateFomSASDate("depdate"))\
    .withColumn('st_i94_dtadfile', to_date('dtadfile','yyyyMMdd')) \
    .withColumn("st_i94_matflag", col("matflag"))\
    .withColumn("st_i94_count", F.round("count", 0).cast(IntegerType()))\
    .withColumn("st_i94_year", col("i94yr").cast(IntegerType()))\
    .withColumn("st_i94_month", col("i94mon").cast(IntegerType())) \
    .select(
              "st_i94_cit",
              "st_i94_port",
              "st_i94_addr",
              "st_i94_arrdate",
              "st_i94_arrdate_iso",
              "st_i94_depdate",
              "st_i94_depdate_iso",
              "st_i94_dtadfile",
              "st_i94_matflag",
              "st_i94_count",
              "st_i94_year",
              "st_i94_month" )

In [None]:
# compare the counts of the full dataset
# Count of rows         : 40,790,529
# Count of distinct rows: 12,228,839
print('Count of rows: {0}'.format(df_st_i94_immigrations.count()))
print('Count of distinct rows: {0}'.format(df_st_i94_immigrations.distinct().count()))

In [None]:
# Remove completely identical rows. This is only needed if the two counts from the step above differ.
df_st_i94_immigrations = df_st_i94_immigrations.drop_duplicates()

In [None]:
# compare the counts of the full dataset again
# Count of rows         : 12,228,839
# Count of distinct rows: 12,228,839
print('Count of rows: {0}'.format(df_st_i94_immigrations.count()))
print('Count of distinct rows: {0}'.format(df_st_i94_immigrations.distinct().count()))

In [None]:
# After dropping duplicates, a unique ID is created for each row.
# F.row_number().over(w) gives each record a unique, consecutive ID and starts with 1.
# F.monotonically_increasing_id() gives each record a unique and increasing ID and starts with 0;
# these IDs are not guaranteed to be consecutive.
df_st_i94_immigrations = df_st_i94_immigrations\
    .sort("st_i94_year", "st_i94_month", "st_i94_cit") \
    .withColumn("st_i94_id",  F.row_number().over(w))
#    .withColumn('st_i94_id_new', F.monotonically_increasing_id())

In [None]:
df_st_i94_immigrations.show()

In [None]:
# compare the counts of the full dataset again
# Count of rows         : 12,228,839
# Count of distinct rows: 12,228,839
print('Count of rows: {0}'.format(df_st_i94_immigrations.count()))
print('Count of distinct rows: {0}'.format(df_st_i94_immigrations.distinct().count()))

In [None]:
# *** OPTIONAL 1 ***
# Let's check whether there are any duplicates in the data irrespective of `st_i94_id`.
# Only columns other than the `st_i94_id` column:

print('Count of ids: {0}'.format(df_st_i94_immigrations.count()))
print('Count of distinct ids: {0}'.format(
    df_st_i94_immigrations.select( [
        c for c in df_st_i94_immigrations.columns if c != 'st_i94_id'
    ])
        .distinct()
        .count())
    )

In [None]:
# *** OPTIONAL 2 ***
# clean up if found duplicate rows irrespective of 'st_i94_id'
df_st_i94_immigrations = df_st_i94_immigrations.dropDuplicates(subset=[
c for c in df_st_i94_immigrations.columns if c != 'st_i94_id'
])

In [None]:
# Check for duplicates in ID column `st_i94_id`: both counts must be identical
df_st_i94_immigrations.agg(
    F.count('st_i94_id').alias('count'),
    F.countDistinct('st_i94_id').alias('distinct')
).show()

In [None]:
# Check the fraction of missing values in each column (0.0 = no missing values):
df_st_i94_immigrations.agg(*[
(1 - (F.count(c) / F.count('*'))).alias(c + '_missing')
for c in df_st_i94_immigrations.columns
]).show()

In [None]:
# check whether there are still null values in the result data frame
df_st_i94_immigrations\
    .select([count(when(col(c).isNull(), c)).alias(c)
             for c in df_st_i94_immigrations.columns])\
    .toPandas().T

In [None]:
# get current Schema of staging table st_i94_immigration
df_st_i94_immigrations.printSchema()

In [None]:
# check content of current staging table st_i94_immigration
df_st_i94_immigrations.limit(2).toPandas().T

**NOTE:** The column `st_i94_port_state_code` will be inserted after creation of the staging
table `st_immigration_airports` within the step [**2 - Airport dimension**](#question2_data_pipeline)

In [None]:
# write the data of all months into the same parquet result set
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ1/st_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

# write data frame as parquet file (40 Mio. Rows: ~601MB (GZIP) or 855 MB (uncompressed); 12 Mio. Rows: 101 MB (GZIP))
# NOTE: One column is still missing: `st_i94_port_state_code`.
df_st_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .partitionBy('st_i94_year', 'st_i94_month') \
    .parquet(location_to_write, compression="gzip")

"""
df_st_i94_immigrations\
    .write\
    .format('parquet') \
    .mode("overwrite") \
    .partitionBy('st_i94_year', 'st_i94_month', 'st_i94_port') \
    .saveAsTable('st_i94_immigrations',
                 format='parquet',
                 mode='overwrite',
                 compression="gzip",
                 path=filepath_st_i94_immigrations
                 )
"""

2. Clean data and create staging table `st_immigration_countries` from file [`I94_SAS_Labels_I94CIT_I94RES.txt`](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94CIT_I94RES.txt)

In [None]:
# path of txt file
filepath_immigration_countries = "../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94CIT_I94RES.txt"

# read txt file into data frame
df_txt_immigration_countries=spark.read.text(filepath_immigration_countries)

# create a new df with two columns (st_ic_country_code, st_ic_country_name) as staging table st_immigration_countries
df_st_immigration_countries = df_txt_immigration_countries\
    .select(F.regexp_extract('value', r'^\s*(\d*)\s*=  \'(\w*.*)\'', 1).alias('st_ic_country_code').cast(IntegerType()),
            F.regexp_extract('value', r'^\s*(\d*)\s*=  \'(\w*.*)\'', 2).alias('st_ic_country_name'))\
    .drop_duplicates()\
    .sort("st_ic_country_code")
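
The extraction above can be tried outside Spark on single label lines. The following is a minimal sketch in plain Python that reuses the same pattern; the sample lines are made up for illustration and only assumed to resemble the layout of `I94_SAS_Labels_I94CIT_I94RES.txt`, and `parse_country_label` is a hypothetical helper that is not part of the notebook.

```python
import re

# Same pattern as in the Spark cell above: numeric code, " =  ", quoted country name.
COUNTRY_LABEL = re.compile(r"^\s*(\d*)\s*=  '(\w*.*)'")


def parse_country_label(line):
    """Return (country_code, country_name), or None if the line does not match."""
    match = COUNTRY_LABEL.match(line)
    if match is None or match.group(1) == "":
        return None
    return int(match.group(1)), match.group(2)


# Hypothetical sample lines (not copied from the real file).
print(parse_country_label("   582 =  'MEXICO'"))
print(parse_country_label("a line without a code"))
```

Note that Spark's `regexp_extract` returns an empty string for lines that do not match, whereas this sketch returns `None`.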

In [None]:
# show prepared staging table st_immigration_countries
df_st_immigration_countries.sort("st_ic_country_code").show(500, False)

In [None]:
# compare the counts of the full dataset
# Count of rows         : 289
# Count of distinct rows: 289
print('Count of rows: {0}'.format(df_st_immigration_countries.count()))
print('Count of distinct rows: {0}'.format(df_st_immigration_countries.distinct().count()))

In [None]:
# remove completely identical rows. Only needed if the two counts from the step above differ!
df_st_immigration_countries = df_st_immigration_countries.drop_duplicates()

In [None]:
# show prepared staging table st_immigration_countries
df_st_immigration_countries.sort("st_ic_country_code").show(500, False)

In [None]:
# write staging table st_immigration_countries as parquet file
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ1/st_immigration_countries"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_immigration_countries \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")


3. Creation of a fact table named `f_i94_immigrations` based on staging table `st_i94_immigrations`.

In [None]:
# Read written data frame back into memory
df_st_i94_immigrations = spark.read.parquet("../P8_capstone_resource_files/parquet_stage/PQ1/st_i94_immigrations")

# show current Schema
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.count()

In [None]:
# create fact table f_i94_immigrations based on staging table st_i94_immigrations
df_f_i94_immigrations = df_st_i94_immigrations\
    .withColumnRenamed("st_i94_id", "f_i94_id")\
    .withColumnRenamed("st_i94_cit", "f_i94_cit")\
    .withColumnRenamed("st_i94_addr", "f_i94_addr")\
    .withColumnRenamed("st_i94_arrdate", "f_i94_arrdate")\
    .withColumnRenamed("st_i94_arrdate_iso", "f_i94_arrdate_iso")\
    .withColumnRenamed("st_i94_depdate", "f_i94_depdate")\
    .withColumnRenamed("st_i94_depdate_iso", "f_i94_depdate_iso")\
    .withColumnRenamed("st_i94_dtadfile", "f_i94_dtadfile")\
    .withColumnRenamed("st_i94_matflag", "f_i94_matflag")\
    .withColumnRenamed("st_i94_count", "f_i94_count")\
    .withColumnRenamed("st_i94_year", "f_i94_year")\
    .withColumnRenamed("st_i94_month", "f_i94_month")\
    .withColumnRenamed("st_i94_port", "f_i94_port")\
    .withColumn("d_ic_id", col("f_i94_cit"))\
    .withColumn("d_ia_id", col("f_i94_port")) \
    .withColumn("d_da_id", col("f_i94_arrdate_iso")) \
    .withColumn("d_dd_id", col("f_i94_depdate_iso")) \
    .drop("f_i94_arrdate")\
    .drop("f_i94_depdate")

# show current fact table Schema
df_f_i94_immigrations.printSchema()

In [None]:
# take a look inside the fact table f_i94_immigration
df_f_i94_immigrations.show(5,False)

In [None]:
# write fact table f_i94_immigration based on staging table st_i94_immigration (~ 69 MB)
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ1/f_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_f_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .partitionBy("f_i94_year", "f_i94_month")\
    .parquet(location_to_write, compression="gzip")


4. Creation of a dimension named `d_immigration_countries` based on staging table `st_immigration_countries`.

In [None]:
# Read written data frame back into memory
df_st_i94_immigration_countries = spark.read.parquet("../P8_capstone_resource_files/parquet_stage/PQ1/st_immigration_countries")

# show current Schema
df_st_i94_immigration_countries.printSchema()


# create dimension table d_immigration_countries based on staging table st_immigration_countries
df_d_i94_immigration_countries = df_st_i94_immigration_countries\
    .withColumn("d_ic_id", col("st_ic_country_code"))\
    .withColumnRenamed("st_ic_country_code", "d_ic_country_code")\
    .withColumnRenamed("st_ic_country_name", "d_ic_country_name")

# get current content of dimension table
df_d_i94_immigration_countries.printSchema()
df_d_i94_immigration_countries.sort("d_ic_id").show(5, False)


# write dimension table d_immigration_countries based on staging table st_immigration_countries
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ1/d_immigration_countries"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_d_i94_immigration_countries \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")

5. Mapping of dimension `d_immigration_countries` to fact table `f_i94_immigrations` based on columns
   (`st_i94_immigrations.st_i94_cit` --> `f_i94_immigrations.d_ic_id`) == (`st_immigration_countries.st_ic_country_code`
   --> `d_immigration_countries.d_ic_id`)

6. Answer Project Question 1: From which country do immigrants come to the U.S. and how many?

In [None]:
# Read written data frame back into memory
df_f_i94_immigrations = spark.read.parquet("../P8_capstone_resource_files/parquet_star/PQ1/f_i94_immigrations")
df_d_immigration_countries = spark.read.parquet("../P8_capstone_resource_files/parquet_star/PQ1/d_immigration_countries")

# check read data frames
df_f_i94_immigrations.printSchema()
df_d_immigration_countries.printSchema()

In [None]:
# Register data frames as Views
df_f_i94_immigrations.createOrReplaceTempView("f_i94_immigrations")
df_d_immigration_countries.createOrReplaceTempView("d_immigration_countries")


# SQL to answer project question 1 (From which country do immigrants come to the U.S. and how many?)
df_pq1 = spark.sql("select f_i94.f_i94_cit as country_id"
                   "     , d_ic.d_ic_country_name as country"
                   "     , count(f_i94.f_i94_count) as immigrants"
                   "     , RANK() OVER (ORDER BY count(f_i94.f_i94_count) desc) Immigrants_rank"
                   "  from f_i94_immigrations f_i94"
                   "  join d_immigration_countries d_ic on d_ic.d_ic_id = f_i94.d_ic_id"
                   " group by f_i94.f_i94_cit"
                   "         ,d_ic.d_ic_country_name"
                   "  order by Immigrants_rank"
                   "")

# Show top 10 countries where Immigrants come from and how many
df_pq1.filter(df_pq1.Immigrants_rank < 11).show()
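
The query above counts rows per country and ranks the countries with `RANK()`, which gives tied countries the same rank and then skips the following rank(s). As a minimal sketch of that behaviour in plain Python (the function name `rank_by_count` and the sample country IDs are hypothetical and only serve as illustration):

```python
from collections import Counter


def rank_by_count(country_ids):
    """Count occurrences per country ID and rank them like SQL RANK() (ties share a rank)."""
    counts = Counter(country_ids)
    # Highest count first; the ID is only used to make the order of ties deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

    ranked = []
    previous_count, previous_rank = None, 0
    for position, (country_id, immigrants) in enumerate(ordered, start=1):
        # A tie keeps the rank of the previous row; otherwise the rank is the row position.
        rank = previous_rank if immigrants == previous_count else position
        ranked.append((country_id, immigrants, rank))
        previous_count, previous_rank = immigrants, rank
    return ranked


# Hypothetical country IDs, one entry per immigration record.
for row in rank_by_count([276, 276, 276, 135, 135, 111, 111, 999]):
    print(row)
```

Because of ties, a filter such as `Immigrants_rank < 11` can return more than ten rows.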


##### 4.1.2. At what airports do foreign persons arrive for immigration to the U.S.? [(Description)](#question2_description) <a name="question2_data_pipeline"></a>
**Airport dimension**
1. Clean data and create staging table `st_immigration_airports` from file
   [`I94_SAS_Labels_I94PORT.txt`](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt)
   with the columns `st_ia_airport_code` as referencing column, `st_ia_airport_name` and `st_ia_airport_state_code`.

    Note that the I-94 airport code is **not** the same as the [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) code and
    cannot be mapped to it. For example, the I-94 data uses `SFR` ('SFR' = 'SAN FRANCISCO, CA') for San
    Francisco Airport instead of the IATA code `SFO`; `SFR` normally stands for San Fernando, CA, USA.

    **Project decision:** Data from file [airport-codes.csv](../P8_capstone_resource_files/airport-codes_csv.csv) will **not** be linked to the
    I-94 airport codes, in order to avoid incorrect assignments.

In [None]:
"""
Next Steps: Carefully clean list of airports
1. read all available information from file
2. filter all elements on different regex conditions and store them into a new data frame called `df_st_immigration_airports`
3. store cleaned data frame `df_st_immigration_airports` to disk
"""

# path of txt file
filepath_immigration_airports = "../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt"

# read txt file into data frame
df_txt_immigration_airports_raw = spark.read.text(filepath_immigration_airports)

# get regex_cleaned values --> less error prone --> 582 Entries
regex_cleaned = r"^\s+'([.\w{2,3} ]*)'\s+=\s+'([\w -.\/]*),\s* ([\w\/]+)"

df_st_immigration_airports_regex_cleaned = df_txt_immigration_airports_raw\
    .select( F.regexp_extract('value',regex_cleaned, 1).alias('st_ia_airport_code'),
             F.regexp_extract('value',regex_cleaned, 2).alias('st_ia_airport_name'),
             F.regexp_extract('value',regex_cleaned, 3).alias('st_ia_airport_state_code')) \
    .drop_duplicates() \
    .filter("st_ia_airport_code != ''")  \
    .sort("st_ia_airport_state_code", "st_ia_airport_code") \
    .select("st_ia_airport_code", "st_ia_airport_name", "st_ia_airport_state_code")

print(df_st_immigration_airports_regex_cleaned.count())
df_st_immigration_airports_regex_cleaned.show(10, False)
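
To see what the stricter pattern `regex_cleaned` captures, it can be applied to single lines in plain Python. This is only a sketch: the two sample lines are made up and merely assumed to resemble the layout of `I94_SAS_Labels_I94PORT.txt`, and `parse_port_label` is a hypothetical helper.

```python
import re

# Same pattern as `regex_cleaned` above: quoted code, "=", then "name, state".
REGEX_CLEANED = re.compile(r"^\s+'([.\w{2,3} ]*)'\s+=\s+'([\w -.\/]*),\s* ([\w\/]+)")


def parse_port_label(line):
    """Return (airport_code, airport_name, state_code), or None if the line does not match."""
    match = REGEX_CLEANED.match(line)
    return match.groups() if match else None


# Hypothetical sample lines: the first has the "name, state" form, the second has no comma.
print(parse_port_label("   'SFR'\t=\t'SAN FRANCISCO, CA       '"))
print(parse_port_label("   'XXX'\t=\t'NOT REPORTED/UNKNOWN  '"))
```

Lines without a comma between name and state are rejected by this pattern, which is why the looser pattern in the next cell returns more entries.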

In [None]:
# get regex_all values --> with errors like `Collapsed (BUF)` --> 660 Entries
regex = r"^\s+'([.\w{2,3} ]*)'\s+=\s+'([\w -.\/]*)\s*,*\s* ([\w\/]+)"

df_st_immigration_airports = df_txt_immigration_airports_raw\
    .select( F.regexp_extract('value',regex, 1).alias('st_ia_airport_code'),
             F.regexp_extract('value',regex, 2).alias('st_ia_airport_name'),
             F.regexp_extract('value',regex, 3).alias('st_ia_airport_state_code')) \
    .drop_duplicates() \
    .filter("st_ia_airport_code != ''")  \
    .sort("st_ia_airport_state_code", "st_ia_airport_code")

print(df_st_immigration_airports.count())
df_st_immigration_airports.show(1000, False)

In [None]:
# Difference of the remaining entries ==> 660 - 582 = 78
df_st_immigration_airports \
    .join(df_st_immigration_airports_regex_cleaned,
          df_st_immigration_airports.st_ia_airport_code == df_st_immigration_airports_regex_cleaned.st_ia_airport_code,
          'left_anti')  \
    .show(10000, False)
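
The `left_anti` join above keeps only those rows of the left data frame whose key has no partner in the right data frame. A minimal plain-Python sketch of the same idea, with made-up rows and a hypothetical helper `left_anti`:

```python
def left_anti(left_rows, right_rows, key):
    """Return the rows of left_rows whose key value does not occur in right_rows."""
    right_keys = {row[key] for row in right_rows}
    return [row for row in left_rows if row[key] not in right_keys]


# Hypothetical rows: the loosely parsed list contains one entry more than the cleaned list.
loosely_parsed = [
    {"st_ia_airport_code": "SFR", "st_ia_airport_name": "SAN FRANCISCO"},
    {"st_ia_airport_code": "XXX", "st_ia_airport_name": "NOT"},
]
cleanly_parsed = [
    {"st_ia_airport_code": "SFR", "st_ia_airport_name": "SAN FRANCISCO"},
]

print(left_anti(loosely_parsed, cleanly_parsed, "st_ia_airport_code"))
```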

In [None]:
# correct all entries that the stricter regex did not parse cleanly
df_st_immigration_airports = df_st_immigration_airports \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r'Collapsed \(\w+\)|No PORT|UNKNOWN', 'Invalid Airport Entry').alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r'06/15|Code|POE', 'Invalid State Code').alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^DERBY LINE,.*", "DERBY LINE, VT (RT. 5)").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"5", "VT").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^LOUIS BOTHA, SOUTH", "LOUIS BOTHA").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"AFRICA", "SOUTH AFRICA").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r",", "").alias("st_ia_airport_name"),
            "st_ia_airport_state_code") \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^PASO DEL", "PASO DEL NORTE").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"NORTE", "TX").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^UNIDENTIFED AIR /?", "Invalid Airport Entry").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"^SEAPORT?", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"Abu", "Abu Dhabi").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"Dhabi", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"DOVER-AFB", "Invalid Airport Entry").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"DE", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"NOT REPORTED/UNKNOWNGALES", "NOGALES").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"AZ", "AZ").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"^NOT", "Invalid Airport Entry").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"REPORTED/UNKNOWN", "Invalid State Code").alias("st_ia_airport_state_code")) \
    .select("st_ia_airport_code",
            F.regexp_replace('st_ia_airport_name', r"INVALID - IWAKUNI", "IWAKUNI").alias("st_ia_airport_name"),
            F.regexp_replace("st_ia_airport_state_code", r"JAPAN", "JAPAN").alias("st_ia_airport_state_code")) \
    .sort("st_ia_airport_name", "st_ia_airport_code")

print(df_st_immigration_airports.count())
df_st_immigration_airports.show(1000, False)
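
Each `regexp_replace` above is applied to the whole column, so a pattern without anchors can also change values it was not written for. The following plain-Python sketch illustrates the effect with made-up values; whether such rows actually exist in the real file has not been verified here, and both helper functions are hypothetical.

```python
import re


def fix_unanchored(state_code):
    # Mirrors the style used above: the pattern may match anywhere in the value.
    return re.sub(r"AFRICA", "SOUTH AFRICA", state_code)


def fix_anchored(state_code):
    # ^...$ restricts the replacement to values that consist of the pattern only.
    return re.sub(r"^AFRICA$", "SOUTH AFRICA", state_code)


print(fix_unanchored("AFRICA"))        # the intended target
print(fix_unanchored("SOUTH AFRICA"))  # already correct, but rewritten again
print(fix_anchored("SOUTH AFRICA"))    # left untouched
```

The same anchors can be used inside `F.regexp_replace`. Corrections that should apply to one specific airport only could alternatively be wrapped in a condition on `st_ia_airport_code`, for example with `F.when(...)`.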

In [None]:
# check if former invalid entries are cleaned correctly
# Difference of the remaining entries ==> 660 - 582 = 78
df_st_immigration_airports \
    .join(df_st_immigration_airports_regex_cleaned,
          df_st_immigration_airports.st_ia_airport_code == df_st_immigration_airports_regex_cleaned.st_ia_airport_code, 'left_anti')  \
    .show(10000, False)

In [None]:
# Write data as new CSV file to disk
location_to_write = '../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/st_immigration_airports.csv'

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_immigration_airports \
    .coalesce(1)\
    .write\
    .mode("overwrite") \
    .csv(location_to_write, header = 'true')

In [None]:
# write df_st_immigration_airports back to stage area on file system
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_immigration_airports \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")

In [None]:
# Read written data frame back into memory
# st_immigration_airports:
location_st_immigration_airports = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"
df_st_immigration_airports = spark.read.parquet(location_st_immigration_airports)

# current Schema of staging table st_immigration_airports
print(df_st_immigration_airports.count())
df_st_immigration_airports.printSchema()
df_st_immigration_airports.show(10, False)


2. Add the column `st_ia_airport_state_code --> st_i94_port_state_code` to staging table `st_i94_immigrations` based on staging
   table `st_immigration_airports`. This information is needed later on to connect the `us-cities-demographics.json` file.

In [None]:
# read df_st_i94_immigrations staging table and add column `st_i94_port_state_code` to it. Write data frame back to disk.

# Read written data frame back into memory
# st_i94_immigrations:
location_st_i94_immigrations = "../P8_capstone_resource_files/parquet_stage/PQ1/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_st_i94_immigrations)

# st_immigration_airports:
location_st_immigration_airports = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"
df_st_immigration_airports = spark.read.parquet(location_st_immigration_airports)


print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(5, False)

print(df_st_immigration_airports.count())
df_st_immigration_airports.printSchema()
df_st_immigration_airports.show(5, False)

In [None]:
########################################################################################################################
# inspect rows with st_i94_depdate == 0 to check whether st_i94_depdate_iso is 1900-01-01 (default value - no onward travel is planned)
df_st_i94_immigrations \
    .filter(df_st_i94_immigrations.st_i94_depdate == 0)\
    .show(5, False)

In [None]:
# add column `st_i94_port_state_code` to data frame st_i94_immigrations
df_st_i94_immigrations = df_st_i94_immigrations \
    .join(df_st_immigration_airports,
          [df_st_i94_immigrations.st_i94_port == df_st_immigration_airports.st_ia_airport_code], 'left_outer') \
    .drop("st_ia_airport_code", "st_ia_airport_name") \
    .withColumnRenamed("st_ia_airport_state_code", "st_i94_port_state_code")


In [None]:
# check if `st_i94_port_state_code` has null values (shown as 'NA')
df_st_i94_immigrations\
    .fillna(value='NA', subset=['st_i94_port_state_code'])\
    .groupBy("st_i94_port_state_code")\
    .count() \
    .orderBy("count", "st_i94_port_state_code")\
    .show(500)

In [None]:
# get entry with null value
df_st_i94_immigrations \
    .filter(col("st_i94_port_state_code").isNull()).show()

In [None]:
# rename

In [None]:
# get status
print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(5, False)

In [None]:
# write st_i94_immigrations back to file system
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ2/st_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .partitionBy('st_i94_year', 'st_i94_month') \
    .parquet(location_to_write, compression="gzip")

3. Add new column `st_i94_port_state_code --> f_i94_port_state_code` to existing fact table `f_i94_immigrations`.

In [None]:
# Read data frames back into memory
# st_i94_immigrations with column `st_i94_port_state_code`:
location_st_i94_immigrations = "../P8_capstone_resource_files/parquet_stage/PQ2/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_st_i94_immigrations)

# f_i94_immigrations:
location_f_i94_immigrations = "../P8_capstone_resource_files/parquet_star/PQ1/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_f_i94_immigrations)

# show current schemas
print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

In [None]:
# get only the needed columns to join
df_st_i94_immigrations_2_join = df_st_i94_immigrations \
    .select("st_i94_id", "st_i94_port_state_code")


# add new columns to fact table `df_f_i94_immigrations`
df_f_i94_immigrations = df_f_i94_immigrations  \
    .join(df_st_i94_immigrations_2_join, df_f_i94_immigrations.f_i94_id == df_st_i94_immigrations_2_join.st_i94_id, 'inner') \
    .drop("st_i94_id") \
    .withColumnRenamed("st_i94_port_state_code", "f_i94_port_state_code") \
    .withColumn("d_sd_id", col("f_i94_addr"))

In [None]:
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5, False)

In [None]:
# write fact table f_i94_immigration (~ 109,7 MB)
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ2/f_i94_immigrations"

if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_f_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .partitionBy("f_i94_year", "f_i94_month")\
    .parquet(location_to_write, compression="gzip")

4. Creation of a dimension named `d_immigration_airports` based on staging table `st_immigration_airports`.

In [None]:
# st_immigration_airports:
location_st_immigration_airports = "../P8_capstone_resource_files/parquet_stage/PQ2/st_immigration_airports"
df_d_immigration_airports = spark.read.parquet(location_st_immigration_airports)

print(df_d_immigration_airports.count())
df_d_immigration_airports.printSchema()
df_d_immigration_airports.show(5, False)

In [None]:
df_d_immigration_airports = df_d_immigration_airports  \
    .withColumn("d_ia_id", df_d_immigration_airports.st_ia_airport_code) \
    .withColumnRenamed("st_ia_airport_code", "d_ia_airport_code") \
    .withColumnRenamed("st_ia_airport_name", "d_ia_airport_name") \
    .withColumnRenamed("st_ia_airport_state_code", "d_ia_airport_state_code")

df_d_immigration_airports.printSchema()
df_d_immigration_airports.show(5, False)

In [None]:
# write dimension table d_immigration_airports to disk (~ 10 kB)
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ2/d_immigration_airports"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_d_immigration_airports \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")


5. Mapping of dimension `d_immigration_airports` to fact table `f_i94_immigrations` based on columns
   (`st_immigration_airports.st_ia_airport_code` --> `d_immigration_airports.d_ia_id`) ==
   (`st_i94_immigrations.st_i94_port` --> `f_i94_immigrations.d_ia_id`).

6. Answer Project Question 2: At what airports do foreign persons arrive for immigration to the U.S.?


In [None]:
# Read written data frame back into memory
df_f_i94_immigrations = spark.read.parquet("../P8_capstone_resource_files/parquet_star/PQ2/f_i94_immigrations")
df_d_immigration_airports = spark.read.parquet("../P8_capstone_resource_files/parquet_star/PQ2/d_immigration_airports")

# check read data frames
print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5, False)

print(df_d_immigration_airports.count())
df_d_immigration_airports.printSchema()
df_d_immigration_airports.show(5, False)

In [None]:
# Register data frames as Views
df_f_i94_immigrations.createOrReplaceTempView("f_i94_immigrations")
df_d_immigration_airports.createOrReplaceTempView("d_immigration_airports")

# SQL to answer project question 2 (At what airports do foreign persons arrive for immigration to the U.S.?)
df_pq2 = spark.sql(" select   d_ia.d_ia_airport_code as airport_code"
                     "       ,d_ia.d_ia_airport_name as airport_name"
                     "       ,d_ia.d_ia_airport_state_code as airport_state_code"
                     "       ,sum(f_i94.f_i94_count) as immigrants"
                     "       ,RANK() OVER (ORDER BY sum(f_i94.f_i94_count) desc) Immigration_airport_rank"
                     " from f_i94_immigrations f_i94"
                     " join d_immigration_airports d_ia on f_i94.d_ia_id = d_ia.d_ia_id"
                     " group by airport_code"
                     "       , airport_name"
                     "       , airport_state_code"
                     " order by Immigration_airport_rank asc ")

df_pq2.show(5000, False)

##### 4.1.3. At what times do foreign persons arrive for immigration to the U.S.? [(Description)](#question3_description) <a name="question3_data_pipeline"></a>
**Date dimensions**

`st_i94_arrdate` and `st_i94_depdate` from staging table `st_i94_immigrations` contain dates in the SAS-specific date format,
which counts the days since 1960-01-01. These columns are converted to DateType format in the staging table
`st_i94_immigrations` as columns named `st_i94_arrdate_iso` and `st_i94_depdate_iso`.

Get date values from columns `st_i94_immigration.st_i94_arrdate_iso` and `st_i94_immigration.st_i94_depdate_iso`.
Get a valid MIN(), MAX() and default (null value representation) date. Clean data and rewrite staging table 'st_i94_immigrations' if needed.
Finally, create two dimensions 'd_date_arrivals' and 'd_date_departures' out of it without gaps.
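
The conversion from a SAS date value to a calendar date can be sketched in plain Python as follows. The helper `sas_to_date` is hypothetical and only illustrates the arithmetic; it is not the Spark code used to build the `_iso` columns. The value 20454 is taken from the findings table further below.

```python
from datetime import date, timedelta

# SAS date values count the days since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(sas_days):
    """Convert a SAS date value (days since 1960-01-01) to a date; None stays None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_to_date(20454))  # 2016-01-01
print(sas_to_date(20461))  # 2016-01-08
```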

1. Read data and get min() and max() value out of `st_i94_arrdate_iso` and `st_i94_depdate_iso`

In [None]:
# Read written data frame back into memory
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ2/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_to_read)
df_st_i94_immigrations.printSchema()

In [None]:
# Get an overview about valid data - check some different perspectives
# get valid min and max date from date_fields `st_i94_arrdate_iso` and `st_i94_depdate_iso`

st_i94_arrdate_depdate_iso = df_st_i94_immigrations.select(
    F.min(col("st_i94_arrdate_iso")).alias("st_i94_arrdate_iso_min"),
    F.max(col("st_i94_arrdate_iso")).alias("st_i94_arrdate_iso_max"),
    F.min(col("st_i94_depdate_iso")).alias("st_i94_depdate_iso_min"),
    F.max(col("st_i94_depdate_iso")).alias("st_i94_depdate_iso_max"),
)

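# print() would only show the data frame's schema summary, show() displays the values
st_i94_arrdate_depdate_iso.show(truncate=False)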


In [None]:
# get an overview about arrdate:
# all distinct date values
print(df_st_i94_immigrations.select("st_i94_arrdate_iso").distinct().count())
# Most entries on which date?
df_st_i94_immigrations.select("st_i94_arrdate_iso").groupBy("st_i94_arrdate_iso").count().sort("count", ascending=False).show(1000, False)
# get all date values. Is there a large gap or date values out of range?
df_st_i94_immigrations.select("st_i94_arrdate_iso").groupBy("st_i94_arrdate_iso").count().sort("st_i94_arrdate_iso", ascending=True).show(1000, False)

"""
Findings:

Everything seems to be valid. Date values start on the 1st of January 2016 and end on the 31st of December 2016.
"""

In [None]:
# get an overview about depdate:
# all distinct date values
print(df_st_i94_immigrations.select("st_i94_depdate_iso").distinct().count())
# Most entries on which date?
df_st_i94_immigrations.select("st_i94_depdate_iso").groupBy("st_i94_depdate_iso").count().sort("count", ascending=False).show(10, False)
# get all date values. Is there a large gap or date values out of range?
df_st_i94_immigrations.select("st_i94_depdate_iso").groupBy("st_i94_depdate_iso").count().sort("st_i94_depdate_iso", ascending=True).show(1000, False)

In [None]:
# compare st_i94_arrdate with st_i94_depdate. If the departure date is earlier than the arrival date, there is a logical error!

# Rows that need to be corrected are flagged in the additional column `st_i94_depdate_iso_wrong_dates`
df_st_i94_immigrations \
    .groupBy("st_i94_arrdate_iso", "st_i94_arrdate", "st_i94_depdate", "st_i94_depdate_iso") \
    .count() \
    .withColumn("st_i94_depdate_iso_wrong_dates",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1111-01-01")\
                .when(col("st_i94_depdate_iso") > "2017-06-14", "2222-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "3333-01-01")
                .otherwise(" ").cast(StringType())) \
    .orderBy("st_i94_arrdate_iso", "st_i94_depdate_iso") \
    .show(5000)


In [None]:
"""
Findings:
715 distinct date values ==> more than the 366 days of the year 2016
==> That is plausible: many immigrants already know their departure date, which may lie beyond 2016.

1900-01-01 (start date): This date is used as the default value instead of a null value.

arrdate starts on 2016-01-01 ==> The departure date cannot be earlier than the arrival date! --> each date before
2016-01-01 must be set to 1900-01-01 as null/default value.

A depdate later than 2017-06-14 is not realistic, because there are only very few depdate entries in this range of dates
==> these entries must be set to 1900-01-01 (null/default).

arrdate describes the first arrival in the U.S. After that, the immigrants decide to travel on to different states in the U.S.
Conclusion: arrdate must not be later than depdate (arrdate <= depdate, e.g. 2016-01-01 <= 2016-01-02)

The following table shows some wrong dates where arrdate > depdate
+------------------+--------------+--------------+------------------+-----+------------------------------+
|st_i94_arrdate_iso|st_i94_arrdate|st_i94_depdate|st_i94_depdate_iso|count|st_i94_depdate_iso_wrong_dates|
+------------------+--------------+--------------+------------------+-----+------------------------------+
|        2016-01-02|         20455|         20454|        2016-01-01|    1|                    3333-01-01|
|        2016-01-08|         20461|         20454|        2016-01-01|    1|                    3333-01-01|
|        2016-01-08|         20461|         20459|        2016-01-06|    2|                    3333-01-01|
|        2016-01-08|         20461|         20460|        2016-01-07|    3|                    3333-01-01|
+------------------+--------------+--------------+------------------+-----+------------------------------+
"""


2. Clean date column "st_i94_depdate_iso" and "st_": Valid entries are between 2016-01-01 and 2017-06-14. Values before or
   after this range will be set to the null / default value (1900-01-01).

In [None]:
# show corrected column `st_i94_depdate_iso_corrected`
df_st_i94_immigrations \
    .groupBy("st_i94_arrdate_iso", "st_i94_arrdate", "st_i94_depdate", "st_i94_depdate_iso") \
    .count() \
    .withColumn("st_i94_depdate_iso_corrected",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1900-01-01")\
                .when(col("st_i94_depdate_iso") > "2017-06-14", "1900-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "1900-01-01")
                .otherwise(col("st_i94_depdate_iso")).cast(DateType())) \
    .orderBy("st_i94_arrdate_iso", "st_i94_depdate_iso") \
    .show(5000)

In [None]:
# correct the date values in column `st_i94_depdate_iso`
df_st_i94_immigrations = df_st_i94_immigrations \
    .withColumn("st_i94_depdate_iso",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1900-01-01") \
                .when(col("st_i94_depdate_iso") > "2017-06-14", "1900-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "1900-01-01")
                .otherwise(col("st_i94_depdate_iso")).cast(DateType()))

In [None]:
df_st_i94_immigrations \
    .groupBy("st_i94_arrdate_iso", "st_i94_arrdate", "st_i94_depdate", "st_i94_depdate_iso") \
    .count() \
    .withColumn("st_i94_depdate_iso_wrong_dates",
                 when(col("st_i94_depdate_iso") < "2016-01-01", "1111-01-01")\
                .when(col("st_i94_depdate_iso") > "2017-06-14", "2222-01-01") \
                .when(col("st_i94_arrdate_iso") > col("st_i94_depdate_iso"), "3333-01-01")
                .otherwise(" ").cast(StringType())) \
    .orderBy("st_i94_depdate_iso_wrong_dates", ascending=False) \
    .show(5000)

In [None]:
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(500)

In [None]:
# write st_i94_immigrations back to file system
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ3/st_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .partitionBy('st_i94_year', 'st_i94_month') \
    .parquet(location_to_write, compression="gzip")

3. Update fact table `f_i94_immigrations` with the cleaned values of column `st_i94_depdate_iso`.
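
As a rough illustration of this step, the update behaves like a lookup of each fact row's id in the cleaned staging data. The sketch below uses plain Python dictionaries instead of Spark data frames; `update_departure_dates` is a hypothetical helper, and it assumes that every staging id is unique.

```python
from datetime import date


def update_departure_dates(fact_rows, staging_depdate_by_id):
    """Inner-join fact rows with the cleaned staging departure dates by id."""
    updated = []
    for row in fact_rows:
        cleaned = staging_depdate_by_id.get(row["f_i94_id"])
        if cleaned is None:
            continue  # inner join: fact rows without a staging match are dropped
        # the cleaned date becomes both the fact column and the dimension key
        updated.append({**row, "f_i94_depdate_iso": cleaned, "d_dd_id": cleaned})
    return updated


facts = [
    {"f_i94_id": 1, "d_dd_id": date(2016, 1, 1)},
    {"f_i94_id": 2, "d_dd_id": date(2015, 12, 31)},
]
staging = {1: date(2016, 1, 1), 2: date(1900, 1, 1)}
result = update_departure_dates(facts, staging)
print(result)
```

Because the join is an inner join, comparing the row count before and after the update is a cheap way to notice lost fact rows.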

In [None]:
# Read data frames back into memory
# st_i94_immigrations with column `st_i94_port_state_code`:
location_st_i94_immigrations = "../P8_capstone_resource_files/parquet_stage/PQ3/st_i94_immigrations"
df_st_i94_immigrations = spark.read.parquet(location_st_i94_immigrations)

# f_i94_immigrations:
location_f_i94_immigrations = "../P8_capstone_resource_files/parquet_star/PQ2/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_f_i94_immigrations)

# show current schemas
print(df_st_i94_immigrations.count())
df_st_i94_immigrations.printSchema()
df_st_i94_immigrations.show(5,False)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5,False)

In [None]:
# add column 'st_i94_depdate_iso' to fact table 'f_i94_immigrations'
df_st_i94_immigrations_2_join = df_st_i94_immigrations \
    .select("st_i94_id" , "st_i94_depdate_iso")

In [None]:
df_st_i94_immigrations_2_join.printSchema()
df_st_i94_immigrations_2_join.show(5,False)

In [None]:
df_f_i94_immigrations = df_f_i94_immigrations \
    .join(df_st_i94_immigrations_2_join, df_f_i94_immigrations.f_i94_id == df_st_i94_immigrations_2_join.st_i94_id, 'inner') \
    .withColumn("f_i94_depdate_iso", col("st_i94_depdate_iso")) \
    .drop("st_i94_id", "st_i94_depdate_iso")

In [None]:
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5,False)

In [None]:

# check if values of referencing column "d_dd_id" are equal to column "f_i94_depdate_iso"
df_f_i94_immigrations \
    .filter(col("d_dd_id") != col("f_i94_depdate_iso")) \
    .groupBy("d_dd_id", "f_i94_depdate_iso") \
    .count() \
    .orderBy("d_dd_id") \
    .show(5)

In [None]:
# update column 'd_dd_id' with newly updated values from column 'f_i94_depdate_iso'
df_f_i94_immigrations = df_f_i94_immigrations \
    .withColumn("d_dd_id", col("f_i94_depdate_iso"))

In [None]:
# check again if values of referencing column "d_dd_id" are equal to column "f_i94_depdate_iso"
df_f_i94_immigrations \
    .filter(col("d_dd_id") != col("f_i94_depdate_iso")) \
    .groupBy("d_dd_id", "f_i94_depdate_iso") \
    .count() \
    .orderBy("d_dd_id") \
    .show(5)

In [None]:
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5)

In [None]:
# write f_i94_immigrations back to file system
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_f_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite')\
    .partitionBy('f_i94_year', 'f_i94_month') \
    .parquet(location_to_write, compression="gzip")


4. Generate new date staging tables (`st_date_arrivals`, `st_date_departures`) based on default, min and max values
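
The idea behind the date staging tables can be sketched in plain Python: every calendar day between the minimum and maximum value is generated, and for departures the default date is added as a separate one-day range. `date_series` is an illustrative name, and the dates below are made-up sample values rather than results from the data set.

```python
from datetime import date, timedelta


def date_series(start, stop):
    """Return every calendar day from start to stop, both included."""
    days = (stop - start).days
    return [start + timedelta(days=n) for n in range(days + 1)]


default_day = date(1900, 1, 1)
# departures: the default date first, followed by the continuous min/max range
departures = date_series(default_day, default_day) + date_series(date(2016, 1, 1), date(2016, 1, 5))
print(departures)
```

Generating the default date as its own range avoids filling the dimension with every day between 1900 and 2016.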

In [None]:
# Create new data frame with date series
def generate_dates(spark, range_list, dt_col="date_time_ref", interval=60*60*24): # TODO: attention to sparkSession
    """
    Create a Spark DataFrame with a single column named dt_col that holds a range of dates
    at the given interval (start and stop included).
    With hourly data, dates end at hour 23 of the stop day.
    (https://stackoverflow.com/questions/57537760/pyspark-how-to-generate-a-dataframe-composed-of-datetime-range)

    :param spark: SparkSession or sqlContext depending on environment (server vs local)
    :param range_list: two values [start, stop], e.g. strings formatted as "2018-01-20" or "2018-01-20 00:00:00"
    :param dt_col: name of the resulting column (the values are cast to date)
    :param interval: step size in seconds (default: one day)

    :returns: df from range
    """
    start,stop = range_list
    temp_df = spark.createDataFrame([(start, stop)], ("start", "stop"))
    temp_df = temp_df.select([F.col(c).cast("timestamp") for c in ("start", "stop")])
    temp_df = temp_df.withColumn("stop",F.date_add("stop",1).cast("timestamp"))
    temp_df = temp_df.select([F.col(c).cast("long") for c in ("start", "stop")])
    start, stop = temp_df.first()
    return spark.range(start,stop,interval).select(F.col("id").cast("timestamp").cast("date").alias(dt_col))

In [None]:
# Create new staging tables 'st_date_arrivals' and 'st_date_departure' with min and max date values
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

In [None]:
# check if all date values from "f_i94_arrdate_iso" are valid
df_f_i94_immigrations\
    .groupBy("f_i94_arrdate_iso")\
    .count()\
    .orderBy("f_i94_arrdate_iso")\
    .show(5000)

In [None]:
# Get min and max values for "f_i94_arrdate"
f_i94_arrdate_iso_min, f_i94_arrdate_iso_max =  df_f_i94_immigrations \
    .select(F.min("f_i94_arrdate_iso").alias("f_i94_arrdate_iso_min"), \
            F.max("f_i94_arrdate_iso").alias("f_i94_arrdate_iso_max")) \
    .first()


print(f"f_i94_arrdate_iso_min: {f_i94_arrdate_iso_min}")
print(f"f_i94_arrdate_iso_max: {f_i94_arrdate_iso_max}")


In [None]:
# create new staging table "st_date_arrivals"
date_range = [f_i94_arrdate_iso_min, f_i94_arrdate_iso_max]
dt_col="st_da_date"
df_st_date_arrivals = generate_dates(spark, date_range, dt_col)

df_st_date_arrivals.printSchema()
df_st_date_arrivals.head(5)

In [None]:
df_st_date_arrivals.tail(5)

5. Append date specific columns to staging tables, create a dimension from it and save it to the file system.
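
The derived date columns can be cross-checked for single dates with a small plain-Python sketch. `date_attributes` is a hypothetical helper; it mirrors the Spark functions used in the next cell under the assumption that `weekofyear` follows ISO 8601 week numbering and that `dayofweek` counts from 1 = Sunday to 7 = Saturday.

```python
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def date_attributes(d):
    """Derive the date dimension attributes for a single calendar day."""
    quarter = (d.month - 1) // 3 + 1
    weekday = WEEKDAYS[d.weekday()]  # date.weekday(): 0 = Monday
    return {
        "id": d,
        "year": d.year,
        "year_quarter": f"{d.year}/{quarter}",
        "year_month": f"{d.year}/{d.month:02d}",  # zero-padded month, like pattern 'MM'
        "quarter": quarter,
        "month": d.month,
        "week": d.isocalendar()[1],            # ISO 8601 week number
        "weekday": weekday,
        "weekday_short": weekday[:3],
        "dayofweek": d.isoweekday() % 7 + 1,   # 1 = Sunday ... 7 = Saturday
        "day": d.day,
    }


print(date_attributes(date(2016, 1, 1)))
```

Note that 2016-01-01 falls into ISO week 53 of the previous year, so the week number alone does not identify a week without its year.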

In [None]:
# create new columns of st_date_arrivals table
# https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

df_st_date_arrivals = df_st_date_arrivals \
    .withColumn("st_da_id", col("st_da_date")) \
    .withColumn("st_da_year", F.year(col("st_da_date"))) \
    .withColumn("st_da_year_quarter", F.concat_ws('/', F.year(col("st_da_date")), F.quarter(col("st_da_date")))) \
    .withColumn("st_da_year_month", F.concat_ws('/', F.year(col("st_da_date")), F.date_format(col("st_da_date"), 'MM'))) \
    .withColumn("st_da_quarter", F.quarter(col("st_da_date"))) \
    .withColumn("st_da_month", F.month(col("st_da_date"))) \
    .withColumn("st_da_week", F.weekofyear(col("st_da_date"))) \
    .withColumn("st_da_weekday", F.date_format(col("st_da_date"),'EEEE')) \
    .withColumn("st_da_weekday_short", F.date_format(col("st_da_date"),'EEE')) \
    .withColumn("st_da_dayofweek", F.dayofweek(col("st_da_date"))) \
    .withColumn("st_da_day", F.dayofmonth(col("st_da_date")) )

df_st_date_arrivals.printSchema()
df_st_date_arrivals.show(5)

In [None]:
# persist staging time table 'st_date_arrivals'
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_arrivals"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_arrivals \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")

In [None]:
# create dimension 'd_date_arrivals' from staging table 'st_date_arrivals'
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_arrivals"
df_st_date_arrivals = spark.read.parquet(location_to_read)

print(df_st_date_arrivals.count())
df_st_date_arrivals.printSchema()

df_st_date_arrivals = df_st_date_arrivals \
    .withColumnRenamed("st_da_date", "d_da_date") \
    .withColumnRenamed("st_da_id", "d_da_id") \
    .withColumnRenamed("st_da_year", "d_da_year") \
    .withColumnRenamed("st_da_year_quarter", "d_da_year_quarter") \
    .withColumnRenamed("st_da_year_month", "d_da_year_month") \
    .withColumnRenamed("st_da_quarter", "d_da_quarter") \
    .withColumnRenamed("st_da_month", "d_da_month") \
    .withColumnRenamed("st_da_week", "d_da_week") \
    .withColumnRenamed("st_da_weekday", "d_da_weekday") \
    .withColumnRenamed("st_da_weekday_short", "d_da_weekday_short") \
    .withColumnRenamed("st_da_dayofweek", "d_da_dayofweek") \
    .withColumnRenamed("st_da_day", "d_da_day")

df_st_date_arrivals.printSchema()
df_st_date_arrivals.show(5)


location_to_write = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_arrivals"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_arrivals \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")

In [None]:
# Creation of the second dimension named `d_date_departures` based on fact column `f_i94_depdate_iso`.
# Create new staging table 'st_date_departure' with min, max and default date values
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

In [None]:
# check if all date values from "f_i94_depdate_iso" are valid
df_f_i94_immigrations\
    .groupBy("f_i94_depdate_iso")\
    .count()\
    .orderBy("f_i94_depdate_iso")\
    .show(5000)

In [None]:
# extract default, min and max date from column 'f_i94_depdate_iso'
# get default and min value
f_i94_depdate_iso_default, f_i94_depdate_iso_min = df_f_i94_immigrations\
    .select("f_i94_depdate_iso") \
    .distinct() \
    .orderBy("f_i94_depdate_iso", ascending=True) \
    .limit(2) \
    .select(F.min("f_i94_depdate_iso").alias("f_i94_depdate_iso_default"),
            F.max("f_i94_depdate_iso").alias("f_i94_depdate_iso_min")) \
    .first()

# get max value
f_i94_depdate_iso_max = df_f_i94_immigrations \
    .select(F.max("f_i94_depdate_iso").alias("f_i94_depdate_iso_max")) \
    .first()[0]

# check selected data
print(f"f_i94_depdate_iso_default: {f_i94_depdate_iso_default}")
print(f"f_i94_depdate_iso_min: {f_i94_depdate_iso_min}")
print(f"f_i94_depdate_iso_max: {f_i94_depdate_iso_max}")

In [None]:
# create new staging table "st_date_departures"
date_range_default = [f_i94_depdate_iso_default, f_i94_depdate_iso_default]
date_range_min_max = [f_i94_depdate_iso_min, f_i94_depdate_iso_max]

# check valid date ranges
print(date_range_default)
print(date_range_min_max)

# create new data frames for the default date and for the min/max date range
dt_col="st_dd_date"
df_st_date_departures_default = generate_dates(spark, date_range_default, dt_col)
df_st_date_departures_min_max = generate_dates(spark, date_range_min_max, dt_col)

In [None]:
# combine both data frames to append `1900-01-01` to all other dates
df_st_date_departures = df_st_date_departures_default.union(df_st_date_departures_min_max)

In [None]:
df_st_date_departures.printSchema()
df_st_date_departures.head(5)

In [None]:
df_st_date_departures.tail(5)



In [None]:
# Append date specific columns to staging table `st_date_departures`.
# create new columns of st_date_departures table
# https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

df_st_date_departures = df_st_date_departures \
    .withColumn("st_dd_id", col("st_dd_date")) \
    .withColumn("st_dd_year", F.year(col("st_dd_date"))) \
    .withColumn("st_dd_year_quarter", F.concat_ws('/', F.year(col("st_dd_date")), F.quarter(col("st_dd_date")))) \
    .withColumn("st_dd_year_month", F.concat_ws('/', F.year(col("st_dd_date")), F.date_format(col("st_dd_date"), "MM"))) \
    .withColumn("st_dd_quarter", F.quarter(col("st_dd_date"))) \
    .withColumn("st_dd_month", F.month("st_dd_date")) \
    .withColumn("st_dd_week", F.weekofyear(col("st_dd_date"))) \
    .withColumn("st_dd_weekday", F.date_format(col("st_dd_date"),'EEEE')) \
    .withColumn("st_dd_weekday_short", F.date_format(col("st_dd_date"),'EEE')) \
    .withColumn("st_dd_dayofweek", F.dayofweek(col("st_dd_date"))) \
    .withColumn("st_dd_day", F.dayofmonth(col("st_dd_date")) )

In [None]:
# get prepared staging table
df_st_date_departures.printSchema()
df_st_date_departures.show(5)

In [None]:
# persist staging time table 'st_date_departures'
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_departures"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_departures \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")

In [None]:
# create dimension 'd_date_departures' from staging table 'st_date_departures'
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ3/st_date_departures"
df_st_date_departures = spark.read.parquet(location_to_read)

print(df_st_date_departures.count())
df_st_date_departures.printSchema()


df_st_date_departures = df_st_date_departures \
    .withColumnRenamed("st_dd_date", "d_dd_date") \
    .withColumnRenamed("st_dd_id", "d_dd_id") \
    .withColumnRenamed("st_dd_year", "d_dd_year") \
    .withColumnRenamed("st_dd_year_quarter", "d_dd_year_quarter") \
    .withColumnRenamed("st_dd_year_month", "d_dd_year_month") \
    .withColumnRenamed("st_dd_quarter", "d_dd_quarter") \
    .withColumnRenamed("st_dd_month", "d_dd_month") \
    .withColumnRenamed("st_dd_week", "d_dd_week") \
    .withColumnRenamed("st_dd_weekday", "d_dd_weekday") \
    .withColumnRenamed("st_dd_weekday_short", "d_dd_weekday_short") \
    .withColumnRenamed("st_dd_dayofweek", "d_dd_dayofweek") \
    .withColumnRenamed("st_dd_day", "d_dd_day")

df_st_date_departures.printSchema()
df_st_date_departures.show(5)


location_to_write = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_departures"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_date_departures \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")

6. Map dimension `d_date_arrivals` to fact table `f_i94_immigration` based on columns
   (`st_date_arrivals.st_da_date` --> `d_date_arrivals.d_da_id`) == (`st_i94_immigration.st_i94_arrdate_iso` --> `f_i94_immigration.d_da_id`).

7. Map dimension `d_date_departures` to fact table `f_i94_immigration` based on columns
   (`st_date_departures.st_dd_date` --> `d_date_departures.d_dd_id`) == (`st_i94_immigration.st_i94_depdate_iso` --> `f_i94_immigration.d_dd_id`).
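
Both mappings only work if every key in the fact table also exists in the corresponding dimension. A minimal plain-Python sketch of such a check is shown below; `missing_dimension_keys` is a hypothetical helper and the dates are made-up sample values.

```python
from datetime import date


def missing_dimension_keys(fact_keys, dimension_keys):
    """Return the fact keys that have no matching dimension key, sorted."""
    return sorted(set(fact_keys) - set(dimension_keys))


dimension_ids = [date(1900, 1, 1), date(2016, 1, 1), date(2016, 1, 2)]
fact_ids = [date(2016, 1, 1), date(1900, 1, 1), date(2016, 1, 3)]

# an empty result would mean that every fact row can be joined to the dimension
print(missing_dimension_keys(fact_ids, dimension_ids))
```

In Spark the same check could be expressed as a left anti join between the fact table and the dimension.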

8. Load the fact table and both date dimensions to answer Project Question 3: At what times do foreign persons arrive for immigration to the U.S.?

In [None]:
# reload fact table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)
print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()
df_f_i94_immigrations.show(5, False)

location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_arrivals"
df_d_date_arrivals = spark.read.parquet(location_to_read)
print(df_d_date_arrivals.count())
df_d_date_arrivals.printSchema()
df_d_date_arrivals.show(5, False)

location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_departures"
df_d_date_departures = spark.read.parquet(location_to_read)
print(df_d_date_departures.count())
df_d_date_departures.printSchema()
df_d_date_departures.show(5, False)

In [None]:
# Register data frames as Views
df_f_i94_immigrations.createOrReplaceTempView("f_i94_immigrations")
df_d_date_arrivals.createOrReplaceTempView("d_date_arrivals")
df_d_date_departures.createOrReplaceTempView("d_date_departures")

8. Answer Project Question 3.1: At what times do foreign persons arrive for immigration to the U.S.?
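
The queries in the next cell count the immigrants per arrival month and rank the months with `RANK()`. How `RANK()` treats ties can be sketched in plain Python: tied counts share a rank and the following rank is skipped. `rank_by_count` is an illustrative name and the month values are made-up sample data.

```python
from collections import Counter


def rank_by_count(values):
    """Count occurrences and rank them like SQL RANK() (ties share a rank, gaps follow)."""
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    ranked = []
    for position, (key, count) in enumerate(ordered, start=1):
        if ranked and ranked[-1][1] == count:
            rank = ranked[-1][2]  # tie: reuse the rank of the previous entry
        else:
            rank = position       # otherwise the rank equals the 1-based position
        ranked.append((key, count, rank))
    return ranked


arrival_months = ["2016/01"] * 3 + ["2016/02"] * 5 + ["2016/03"] * 3 + ["2016/04"]
for row in rank_by_count(arrival_months):
    print(row)
```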

In [None]:
# SQL to answer Project Question 3.1: At what times do foreign persons arrive for immigration to the U.S.?
df_pq3_1 = spark.sql("select da.d_da_year_month as Year_Month"
                     "      ,count(f_i94.f_i94_count) as  Immigrants"
                     "      ,RANK() OVER (ORDER BY count(f_i94.f_i94_count) desc) Immigrants_rank"
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " group by Year_Month "
                     " order by Year_Month  "
                     )

df_pq3_1.show(5000, False)

df_pq3_11 = spark.sql("select da.d_da_year_month as Year_Month"
                     "      ,count(f_i94.f_i94_count) as  Immigrants"
                     "      ,RANK() OVER (ORDER BY count(f_i94.f_i94_count) desc) Immigrants_rank"
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " group by Year_Month "
                     " order by Immigrants_rank  "
                     )

df_pq3_11.show(5000, False)



9. Answer Project Question 3.2: When a foreign person comes to the U.S. for immigration, do they travel on to
   another state?

In [None]:
# SQL to answer Project Question 3.2: When a foreign person comes to the U.S. for immigration, do they travel on to
# another state?
df_pq3_2 = spark.sql("select da.d_da_year_month as Year_Month_arrival"
                     "      ,dd.d_dd_year_month as Year_Month_departure"
                     "      ,count(f_i94.f_i94_count) as Immigrants "
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " left join d_date_departures dd on dd.d_dd_id = f_i94.d_dd_id  "
                     " group by Year_Month_arrival, Year_Month_departure"
                     " order by Year_Month_arrival, Year_Month_departure, Immigrants"
                     )

df_pq3_2.show(5000, False)

In [None]:
df_pq3_2 = spark.sql("select da.d_da_year_month as Year_Month_arrival"
                     "      ,dd.d_dd_year_month as Year_Month_departure"
                     "      ,count(f_i94.f_i94_count) as Immigrants "
                     "  from f_i94_immigrations f_i94"
                     "  join d_date_arrivals da on da.d_da_id = f_i94.d_da_id  "
                     " left join d_date_departures dd on dd.d_dd_id = f_i94.d_dd_id  "
                     " group by Year_Month_arrival, Year_Month_departure"
                     " order by Immigrants desc "
                     )

df_pq3_2.show(5000, False)


10. Answer Project Question 3.3: If a foreign person travels to another state after immigration, after which period of
    time does this happen?
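
The calculation behind this question can be sketched in plain Python: rows with the default departure date are skipped, the number of days between arrival and departure is counted, and the counts are ranked without gaps (dense rank). `stay_length_distribution` is a hypothetical helper and the date pairs are made-up sample values.

```python
from collections import Counter
from datetime import date

DEFAULT_DATE = date(1900, 1, 1)  # marks unknown or invalid departure dates


def stay_length_distribution(pairs):
    """Return (days, immigrants, dense_rank) tuples, most frequent stay length first."""
    counts = Counter((dep - arr).days for arr, dep in pairs if dep != DEFAULT_DATE)
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    result, rank, previous = [], 0, None
    for days, immigrants in ordered:
        if immigrants != previous:  # dense rank: increase by one per distinct count
            rank += 1
            previous = immigrants
        result.append((days, immigrants, rank))
    return result


pairs = [
    (date(2016, 1, 1), date(2016, 1, 3)),
    (date(2016, 1, 2), date(2016, 1, 4)),
    (date(2016, 1, 1), date(2016, 1, 2)),
    (date(2016, 1, 5), date(2016, 1, 6)),
    (date(2016, 1, 1), date(2016, 1, 8)),
    (date(2016, 1, 1), DEFAULT_DATE),
]
print(stay_length_distribution(pairs))
```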

In [None]:
# DF to answer Project Question 3.3: If a foreign person travels to another state, after which period of time does this happen?
from pyspark.sql.window import Window
from pyspark.sql.functions import dense_rank
windowSpec  = Window.orderBy(col("immigrants").desc())

df_f_i94_immigrations \
    .join(df_d_date_arrivals , df_f_i94_immigrations.d_da_id == df_d_date_arrivals.d_da_id) \
    .join(df_d_date_departures, df_f_i94_immigrations.d_dd_id == df_d_date_departures.d_dd_id) \
    .filter("f_i94_depdate_iso != '1900-01-01'") \
    .withColumn("departure_days_after_arrival", F.datediff(col("f_i94_depdate_iso"), col("f_i94_arrdate_iso"))) \
    .select( "d_da_date"
            ,"d_dd_date"
            ,"departure_days_after_arrival") \
    .groupBy("departure_days_after_arrival").count() \
    .withColumnRenamed("count", "immigrants") \
    .withColumn("dense_rank",dense_rank().over(windowSpec)) \
    .show(500)


##### 4.1.4. To which states in the U.S. do immigrants want to continue their travel after their initial arrival and what demographics can immigrants expect when they arrive in the destination state, such as average temperature, population numbers or population density? [(Data description)](#question4_description) <a name="question4_data_pipeline">
1. Clean data and create staging table `st_state_destinations` from file
   [I94_SAS_Labels_I94ADDR.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt)
   based on columns `st_sd_state_code` and `st_sd_state_name`.
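
How the regular expression of the next cell splits a label line can be tried out with Python's `re` module. The sample lines in this sketch are assumptions about the file layout and are not copied from the file; `parse_label_line` is an illustrative name.

```python
import re

# same pattern as in the Spark cell: group 1 = state code, group 2 = state name
PATTERN = re.compile(r"^\s+'([9+A-Z]+)'='([A-Z\s.]+)'")


def parse_label_line(line):
    """Return (state_code, state_name), or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


# hypothetical sample lines
for line in ["\t'CA'='CALIFORNIA'", "    'DC'='DIST. OF COLUMBIA'", "value i94addrl"]:
    print(repr(line), "->", parse_label_line(line))
```

Unlike this sketch, `regexp_extract` in Spark returns an empty string for lines that do not match, so such lines show up as rows with empty values. State names that contain lowercase letters do not match the second group either.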

In [None]:
# get data
location_to_read = "../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt"
df_st_I94_SAS_Labels_I94ADDR = spark.read.text(location_to_read)

df_st_I94_SAS_Labels_I94ADDR.printSchema()
df_st_I94_SAS_Labels_I94ADDR.show(5, False)

# regex to extract the state code (group 1) and the state name (group 2) from each line
regex_cleaned = r"^\s+'([9+A-Z]+)'='([A-Z\s.]+)'"

df_st_I94_SAS_Labels_I94ADDR_regex_cleaned = df_st_I94_SAS_Labels_I94ADDR \
    .select( F.regexp_extract('value',regex_cleaned, 1).alias('st_sd_state_code'),
             F.regexp_extract('value',regex_cleaned, 2).alias('st_sd_state_name')) \
    .drop_duplicates() \
    .orderBy("st_sd_state_code")

print(df_st_I94_SAS_Labels_I94ADDR_regex_cleaned.count())
df_st_I94_SAS_Labels_I94ADDR_regex_cleaned.show(100)

In [None]:
# This step is optional
location_to_write = "../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.csv"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_I94_SAS_Labels_I94ADDR_regex_cleaned\
    .coalesce(1)\
    .write\
    .option("header", "true")\
    .csv(location_to_write, mode='overwrite')

2. Extract some demographic data from file [us-cities-demographics.json](../P8_capstone_resource_files/us-cities-demographics.json)
   like `age_median`, `population_male`, `population_female`, `population_total` or `foreign_born` and add them to staging
   table `st_state_destinations`.
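
The state-level values are derived from city-level rows. The plain-Python sketch below shows the kind of aggregation used here, assuming that each state value is the average over the city rows of that state and that no values are missing. `aggregate_by_state` is a hypothetical helper and the rows are made-up sample values.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_state(city_rows):
    """Average the city-level values per state code."""
    by_state = defaultdict(list)
    for row in city_rows:
        by_state[row["state_code"]].append(row)
    return {
        state: {
            "age_median": round(mean(r["median_age"] for r in rows), 1),
            # truncated to an integer, like a cast to IntegerType
            "population_total": int(mean(r["total_population"] for r in rows)),
        }
        for state, rows in by_state.items()
    }


city_rows = [
    {"state_code": "AL", "median_age": 30.0, "total_population": 100},
    {"state_code": "AL", "median_age": 40.0, "total_population": 201},
    {"state_code": "CA", "median_age": 35.5, "total_population": 500},
]
print(aggregate_by_state(city_rows))
```

Because the values are averages of city rows, the population columns describe a typical city of the state and not the total population of the state.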


In [None]:
# get data from JSON-file
location_to_read = "../P8_capstone_resource_files/us-cities-demographics.json"
df_us_cities_demographics = spark.read.json(location_to_read)

print(df_us_cities_demographics.count())
df_us_cities_demographics.printSchema()

In [None]:
#Check data for further processing
df_us_cities_demographics \
    .filter("fields.state == 'Alabama'") \
    .select("fields.state_code"
            , "fields.state"
            , "fields.city"
            , "fields.median_age"
            , "fields.male_population"
            , "fields.female_population"
            , "fields.total_population"
            , "fields.foreign_born") \
    .distinct()\
    .orderBy("fields.state_code")\
    .show(50)


In [None]:
# Aggregate the city-level values per state. Each value is the average over the city rows
# of a state, not a state-wide total.
df_us_cities_demographics_agg = df_us_cities_demographics \
    .groupBy("fields.state_code", "fields.state") \
    .agg(  F.round(F.avg('fields.median_age'),1).alias('st_sd_age_median')
          ,F.avg('fields.male_population').cast(IntegerType()).alias('st_sd_population_male')
          ,F.avg('fields.female_population').cast(IntegerType()).alias('st_sd_population_female')
          ,F.avg('fields.total_population').cast(IntegerType()).alias('st_sd_population_total')
          ,F.avg('fields.foreign_born').cast(IntegerType()).alias('st_sd_foreign_born')
           ) \
    .orderBy("state_code")

In [None]:
print(df_us_cities_demographics_agg.count())
df_us_cities_demographics_agg.printSchema()
df_us_cities_demographics_agg.show(500)

# Join "df_st_I94_SAS_Labels_I94ADDR_regex_cleaned" and "df_us_cities_demographics_agg" to get new data frame "df_st_state_destinations"
# fill up null values with 0
# (after the groupBy, the grouping columns are named "state_code" and "state")
df_st_state_destinations = df_st_I94_SAS_Labels_I94ADDR_regex_cleaned \
    .join(df_us_cities_demographics_agg, df_st_I94_SAS_Labels_I94ADDR_regex_cleaned.st_sd_state_code ==
          df_us_cities_demographics_agg.state_code, 'left'  )\
    .drop("state_code", "state") \
    .withColumn("st_sd_state_name", F.initcap(col("st_sd_state_name"))) \
    .fillna(value=0.0 ,subset=['st_sd_age_median'])\
    .fillna(value=0 ,subset=['st_sd_population_male'])\
    .fillna(value=0 ,subset=['st_sd_population_female'])\
    .fillna(value=0 ,subset=['st_sd_population_total'])\
    .fillna(value=0 ,subset=['st_sd_foreign_born'])

In [None]:
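# exploratory join on city level (not aggregated); kept under a separate name so that the
# aggregated staging table `df_st_state_destinations` is preserved
df_st_state_destinations_by_city = df_st_I94_SAS_Labels_I94ADDR_regex_cleaned \
    .join(df_us_cities_demographics, df_st_I94_SAS_Labels_I94ADDR_regex_cleaned.st_sd_state_code ==
          df_us_cities_demographics.fields.state_code, 'left'  )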



In [None]:
df_st_state_destinations

In [None]:
# check results
print(df_st_state_destinations.count())
df_st_state_destinations.printSchema()
df_st_state_destinations\
    .orderBy("st_sd_state_code") \
    .show(100)

In [None]:
# store staging table
location_to_write = "../P8_capstone_resource_files/parquet_stage/PQ4/st_state_destinations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_state_destinations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")


3. Creation of a dimension named `d_state_destinations` based on staging table `st_state_destinations`.

In [None]:
# get data to process and store dimension "d_state_destinations"
location_to_read = "../P8_capstone_resource_files/parquet_stage/PQ4/st_state_destinations"
df_st_state_destinations = spark.read.parquet(location_to_read)

print(df_st_state_destinations.count())
df_st_state_destinations.printSchema()

df_st_state_destinations = df_st_state_destinations \
    .withColumn("d_sd_id", col("st_sd_state_code")) \
    .withColumnRenamed("st_sd_state_code", "d_sd_state_code") \
    .withColumnRenamed("st_sd_state_name", "d_sd_state_name") \
    .withColumnRenamed("st_sd_age_median", "d_sd_age_median") \
    .withColumnRenamed("st_sd_population_male", "d_sd_population_male") \
    .withColumnRenamed("st_sd_population_female", "d_sd_population_female") \
    .withColumnRenamed("st_sd_population_total", "d_sd_population_total") \
    .withColumnRenamed("st_sd_foreign_born", "d_sd_foreign_born")

df_st_state_destinations.printSchema()

# store dimension table
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_st_state_destinations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .parquet(location_to_write, compression="gzip")

df_st_state_destinations.orderBy("d_sd_state_code").show(1000)

4. Mapping of dimension `d_state_destinations` to fact table `f_i94_immigration` based on columns
   (`st_state_destinations.st_sd_state_code` --> `d_state_destinations.d_sd_id`) ==
   (`st_i94_immigration.st_i94_addr` --> `f_i94_immigration.d_sd_id`).

5. Clean fact table `f_i94_immigration` based on the dimension `d_state_destinations`. All values in column `d_sd_id`
   that are not found in the dimension will be set to `99` (the catch-all code).
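
The fallback rule can be written down in a few lines of plain Python. `clean_state_code` is a hypothetical helper and the set of valid codes is a made-up sample; in the pipeline the valid codes come from column `d_sd_id` of the dimension.

```python
def clean_state_code(code, valid_codes, fallback="99"):
    """Keep a state code that exists in the dimension, otherwise use the fallback."""
    return code if code in valid_codes else fallback


valid_codes = {"CA", "FL", "NY", "99"}  # sample values only
for raw in ["CA", "XX", None, "99"]:
    print(raw, "->", clean_state_code(raw, valid_codes))
```

Missing values (`None`) are mapped to the fallback as well, which corresponds to the null check after the left join in the cells below.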

In [None]:
#Get data for further processing
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"
df_d_state_destinations = spark.read.parquet(location_to_read)

print(df_d_state_destinations.count())
df_d_state_destinations.printSchema()

In [None]:
# prepare data frame `df_f_i94_immigrations_2_join` to get only the allowed state codes
df_f_i94_immigrations_2_join = df_d_state_destinations \
    .select("d_sd_id")\
    .withColumnRenamed("d_sd_id", "d_sd_id_reference") \
    .orderBy("d_sd_id_reference")

print(df_f_i94_immigrations_2_join.count())
df_f_i94_immigrations_2_join.printSchema()
df_f_i94_immigrations_2_join.show(60)

In [None]:
# prepare and create a cleaned column "d_sd_id_cleaned"
df_f_i94_immigrations \
    .select("d_sd_id", "f_i94_addr") \
    .join(df_f_i94_immigrations_2_join, df_f_i94_immigrations_2_join.d_sd_id_reference == df_f_i94_immigrations.d_sd_id, 'left') \
    .withColumn("d_sd_id_cleaned", when(col("d_sd_id_reference").isNull(), "99")\
                .otherwise(col("d_sd_id_reference"))) \
    .filter(col("d_sd_id_reference").isNull())\
    .distinct() \
    .orderBy("d_sd_id") \
    .show(5000)

In [None]:
# clean column "f_i94_immigrations.d_sd_id" by column "d_sd_id_cleaned (d_sd_id_reference)"
df_f_i94_immigrations = df_f_i94_immigrations \
    .join(df_f_i94_immigrations_2_join, df_f_i94_immigrations_2_join.d_sd_id_reference == df_f_i94_immigrations.d_sd_id, 'left') \
    .withColumn("d_sd_id", when(col("d_sd_id_reference").isNull(), "99")\
                .otherwise(col("d_sd_id_reference"))) \
    .drop("d_sd_id_reference")

In [None]:
# check corrected column "d_sd_id"
df_f_i94_immigrations \
    .select("d_sd_id", "f_i94_addr") \
    .distinct() \
    .orderBy("d_sd_id", "f_i94_addr") \
    .show(5000)

In [None]:
print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()

In [None]:
# write fact table f_i94_immigration (~ 137,7 MB)
location_to_write = "../P8_capstone_resource_files/parquet_star/PQ4/f_i94_immigrations"

# delete folder if already exists
if path.exists(location_to_write):
    shutil.rmtree(location_to_write)

df_f_i94_immigrations \
    .repartition(int(1)) \
    .write \
    .format("parquet")\
    .mode(saveMode='overwrite') \
    .partitionBy('f_i94_year', 'f_i94_month') \
    .parquet(location_to_write, compression="gzip")


6. Answer Project Question 4: To which states in the U.S. do immigrants want to continue their travel after their initial
   arrival and what demographics can immigrants expect when they arrive in the destination state, such as average
   temperature, population numbers or population density?

In [None]:
#Get data for further processing
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/f_i94_immigrations"
df_f_i94_immigrations = spark.read.parquet(location_to_read)

print(df_f_i94_immigrations.count())
df_f_i94_immigrations.printSchema()


location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"
df_d_state_destinations = spark.read.parquet(location_to_read)

print(df_d_state_destinations.count())
df_d_state_destinations.printSchema()

In [None]:
# Register data frames as Views
df_f_i94_immigrations.createOrReplaceTempView("f_i94_immigrations")
df_d_state_destinations.createOrReplaceTempView("d_state_destinations")


# Answer Project Question 4: The answer is "California"
df_pq4 = spark.sql(" select "
                   "        RANK() OVER (ORDER BY count(f_i94.f_i94_count) desc) immigrants_continue_travel_rank"
                   "       ,d_sd.d_sd_state_code as state_code"
                   "       ,d_sd.d_sd_state_name as state_name"
                   "       ,count(f_i94.f_i94_count) as immigrants_continue_travel "
                   "       ,d_sd.d_sd_age_median as age_median"
                   "       ,d_sd.d_sd_population_male as population_male"
                   "       ,d_sd.d_sd_population_female as population_female"
                   "       ,d_sd.d_sd_population_total as population_total"
                   "       ,d_sd.d_sd_foreign_born as foreign_born"
                   " from f_i94_immigrations f_i94"
                   " join d_state_destinations d_sd on d_sd.d_sd_id == f_i94.d_sd_id"
                   " group by state_code"
                   "         ,state_name"
                   "         ,age_median"
                   "         ,population_male"
                   "         ,population_female"
                   "         ,population_total"
                   "         ,foreign_born"
                   " order by immigrants_continue_travel desc ")

df_pq4.show(500)
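
The query above ranks the destination states with `RANK()` over the aggregated count. To make the ranking semantics concrete (equal counts share a rank and the following rank is skipped), here is a small plain-Python sketch. The state list is invented sample data and `rank_by_count` is a hypothetical helper used only for illustration:

```python
from collections import Counter

# Hypothetical sample of destination states, one entry per immigration record
destination_states = ["CA", "NY", "FL", "CA", "NY", "TX", "CA", "NY"]


def rank_by_count(values):
    """Return (rank, value, count) tuples ordered by count descending.

    Mirrors SQL RANK(): equal counts share a rank and the next rank is skipped.
    """
    counts = Counter(values).most_common()  # sorted by count, descending
    ranked = []
    previous_count = None
    current_rank = 0
    for position, (value, count) in enumerate(counts, start=1):
        if count != previous_count:
            # a new count value starts a new rank equal to the row position
            current_rank = position
            previous_count = count
        ranked.append((current_rank, value, count))
    return ranked


for rank, state, count in rank_by_count(destination_states):
    print(rank, state, count)
```

If consecutive ranks without gaps are preferred, Spark SQL also offers `DENSE_RANK()`.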




#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness

Run Quality Checks

##### 4.2.1 Define StructType and create result data frame

In [None]:
# Define format to store data quality result data frame
result_struct_type = StructType(
    [
         StructField("dq_result_table_name", StringType(), True)
        ,StructField("dq_result_null_entries", IntegerType(), True)
        ,StructField("dq_result_entries", IntegerType(), True)
        ,StructField("dq_result_status", StringType(), True)
    ]
)

In [None]:
# execute check commands

#####  4.2.2 Data Quality (dq) checks for table d_immigration_countries

In [None]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ1/d_immigration_countries"
df_dq_table_d_immigration_countries = spark.read.parquet(location_to_read)

# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_immigration_countries \
    .select("d_ic_id") \
    .where("d_ic_id is null or d_ic_id == ''") \
    .count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_immigration_countries.count()
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
# insert result into result_df
table_name = "d_immigration_countries"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(dq_check_result)

In [None]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

# create results data frame
df_dq_results = spark.createDataFrame(dq_results, result_struct_type)

# check df schema and content
df_dq_results.printSchema()
df_dq_results.show(100, False)
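
The status rule used in this check (no null or empty keys and at least one row) is repeated for every table below. As a sketch of how it could be captured once, the hypothetical helper `build_dq_result` returns a result row in the same column order as `result_struct_type`. The counts in the example calls are invented values, not real results:

```python
def build_dq_result(table_name, null_entries, entries):
    """Return one result row: (table name, null entries, total entries, status)."""
    # "OK" only if no key is null/empty and the table contains at least one row
    status = "OK" if null_entries < 1 and entries > 0 else "NOK"
    return (table_name, null_entries, entries, status)


# Hypothetical counts, for illustration only
print(build_dq_result("d_immigration_countries", 0, 250))  # passes both checks
print(build_dq_result("d_immigration_countries", 3, 250))  # fails: null keys found
print(build_dq_result("d_immigration_countries", 0, 0))    # fails: empty table
```

A list of such tuples can be passed to `spark.createDataFrame(...)` together with `result_struct_type`, as done in the cell above.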

#####  4.2.3 Data Quality (dq) checks for table d_immigration_airports

In [None]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ2/d_immigration_airports"
df_dq_table_d_immigration_airports = spark.read.parquet(location_to_read)

# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_immigration_airports \
    .select("d_ia_id") \
    .where("d_ia_id is null or d_ia_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_immigration_airports.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
# insert result into result_df
table_name = "d_immigration_airports"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

In [None]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [None]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

#####  4.2.4 Data Quality (dq) checks for table d_date_arrivals

In [None]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_arrivals"
df_dq_table_d_date_arrivals = spark.read.parquet(location_to_read)

In [None]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_date_arrivals \
    .select("d_da_id") \
    .where("d_da_id is null or d_da_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_date_arrivals.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
# insert result into result_df
table_name = "d_date_arrivals"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

In [None]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [None]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

#####  4.2.5 Data Quality (dq) checks for table d_date_departures

In [None]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_departures"
df_dq_table_d_date_departures = spark.read.parquet(location_to_read)

In [None]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_date_departures \
    .select("d_dd_id") \
    .where("d_dd_id is null or d_dd_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_date_departures.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
# insert result into result_df
table_name = "d_date_departures"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

In [None]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [None]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

#####  4.2.6 Data Quality (dq) checks for table d_state_destinations

In [None]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"
df_dq_table_d_state_destinations = spark.read.parquet(location_to_read)

In [None]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_state_destinations \
    .select("d_sd_id") \
    .where("d_sd_id is null or d_sd_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_state_destinations.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
# insert result into result_df
table_name = "d_state_destinations"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

In [None]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [None]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

#####  4.2.7 Data Quality (dq) checks for table f_i94_immigrations

In [None]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/f_i94_immigrations"
df_dq_table_f_i94_immigrations = spark.read.parquet(location_to_read)

In [None]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_f_i94_immigrations \
    .select(  "f_i94_id"
            , "d_ia_id"
            , "d_sd_id"
            , "d_da_id"
            , "d_dd_id"
            , "d_ic_id"
            ) \
    .where(  "    f_i94_id is null or f_i94_id == ''"
             " or d_ia_id is null or d_ia_id == ''"
             " or d_sd_id is null or d_sd_id == ''"
             " or d_da_id is null or d_da_id == ''"
             " or d_dd_id is null or d_dd_id == ''"
             " or d_ic_id is null or d_ic_id == ''") \
    .count()
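
The filter expression for the fact table follows the same pattern for each key column. A minimal sketch of generating that expression from a list of column names is shown below. `build_null_or_empty_filter` is a hypothetical helper, not part of the notebook:

```python
def build_null_or_empty_filter(key_columns):
    """Build a SQL filter that matches rows where any key column is null or empty."""
    conditions = [f"{column} is null or {column} == ''" for column in key_columns]
    return " or ".join(conditions)


# Key columns of the fact table as used in the check above
fact_key_columns = ["f_i94_id", "d_ia_id", "d_sd_id", "d_da_id", "d_dd_id", "d_ic_id"]

print(build_null_or_empty_filter(fact_key_columns))
```

The resulting string has the same form as the hand-written condition and could be passed to `.where(...)` in its place.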

In [None]:
# Check that table has > 0 rows
df_dq_check_content = df_dq_table_f_i94_immigrations.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
# insert result into result_df
table_name = "f_i94_immigrations"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

In [None]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

In [None]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [None]:
df_dq_results.show(10, False)
##------------------------------------------------------------------------#
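
Once all tables have been checked, the collected results can be evaluated in one place. The following sketch lists every table whose status is not "OK". `find_failed_checks` is a hypothetical helper and the sample rows are invented values that only mimic the layout of `result_struct_type`:

```python
def find_failed_checks(dq_results):
    """Return the table names whose data quality status is not "OK"."""
    return [
        table_name
        for table_name, _null_entries, _entries, status in dq_results
        if status != "OK"
    ]


# Hypothetical result rows: (table name, null entries, total entries, status)
sample_results = [
    ("d_immigration_countries", 0, 250, "OK"),
    ("d_date_arrivals", 2, 366, "NOK"),
    ("f_i94_immigrations", 0, 1000, "OK"),
]

failed_tables = find_failed_checks(sample_results)
if failed_tables:
    print(f"Data quality checks failed for: {failed_tables}")
else:
    print("All data quality checks passed")
```

The rows returned by `df_dq_results.collect()` should unpack the same way, since Spark `Row` objects behave like tuples.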


#### 4.3 Data dictionary
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where
it came from. You can include the data dictionary in the notebook or in a separate file.

Generate a dictionary from the table columns, then fill in the meaning of each column manually, as in the following example.

In [None]:
import json
import re


# create data dictionary of used star schema
def create_data_dictionary_from_df(locations_to_read):
    tables = {}

    # loop through the list of parquet locations to read (star schema)
    for location_to_read in locations_to_read:

        # get table name from the last path segment of the location string
        match = re.search(r"^.*/(\w+)$", location_to_read)
        if match is None:
            raise ValueError(f"Cannot derive table name from location: {location_to_read}")
        current_table = match.group(1)

        # read all column names of the current table
        current_table_columns = spark.read.parquet(location_to_read).columns

        # create one entry per column. Column descriptions will be filled in later.
        current_table_columns_dict = {
            column_name: {"column_name": column_name,
                          "column_description": "not set"}
            for column_name in current_table_columns
        }

        # Set current table name. Table description will be filled in later.
        tables[current_table] = {"table_name": current_table,
                                 "table_description": "not set",
                                 "columns": current_table_columns_dict}

    # add tables content to the resulting dict
    return {"tables": tables}


# add table and column descriptions manually
def update_descriptions(json_data_dictionary):

    # The following part is specific to this Project. Every description has to be configured separately

    # Table d_immigration_countries
    json_data_dictionary['tables']['d_immigration_countries']['table_description'] = \
        "Country from which immigrants come to the U.S."
    json_data_dictionary['tables']['d_immigration_countries']['columns']['d_ic_id']['column_description'] \
        = "PK of table d_immigration_countries"
    json_data_dictionary['tables']['d_immigration_countries']['columns']['d_ic_country_code']['column_description'] \
        = "Abbreviation of country code"
    json_data_dictionary['tables']['d_immigration_countries']['columns']['d_ic_country_name']['column_description'] \
        = "Name of country"

    # Table d_immigration_airports
    json_data_dictionary['tables']['d_immigration_airports']['table_description'] \
        = "Airport where foreign persons arrive in the U.S."
    json_data_dictionary['tables']['d_immigration_airports']['columns']['d_ia_id']['column_description'] \
        = "PK of table d_immigration_airports"
    json_data_dictionary['tables']['d_immigration_airports']['columns']['d_ia_airport_code']['column_description'] \
        = "Abbreviation code of Airport"
    json_data_dictionary['tables']['d_immigration_airports']['columns']['d_ia_airport_name']['column_description'] \
        = "Name of Airport"
    json_data_dictionary['tables']['d_immigration_airports']['columns']['d_ia_airport_state_code']['column_description'] \
        = "Abbreviation of state where Airport is located"

    # Table d_date_arrivals
    json_data_dictionary['tables']['d_date_arrivals']['table_description'] \
        = "Arrival date of foreign persons immigrating to the U.S."
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_id']['column_description'] \
        = "PK of table d_date_arrivals"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_date']['column_description'] \
        = "Date when foreign persons arrive for immigration to the U.S."
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_year']['column_description'] \
        = "Year of arrival like '2020'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_year_quarter']['column_description'] \
        = "Year and quarter of arrival like '2016/1'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_year_month']['column_description'] \
        = "Year and month of arrival like '2016/01'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_quarter']['column_description'] \
        = "Quarter of arrival like '1'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_month']['column_description'] \
        = "Month of arrival like '1'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_week']['column_description'] \
        = "Week of arrival like '53'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_weekday']['column_description'] \
        = "Day of week like 'Friday'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_weekday_short']['column_description'] \
        = "Day of week in short form like 'Fri'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_dayofweek']['column_description'] \
        = "Day of week as number like '6'"
    json_data_dictionary['tables']['d_date_arrivals']['columns']['d_da_day']['column_description'] \
        = "Day number of current date like 2016-01-01 --> 1"

    # Table d_date_departures
    json_data_dictionary['tables']['d_date_departures']['table_description'] = "Departure date from USA"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_id']['column_description'] \
        = "PK of table d_date_departures"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_date']['column_description'] \
        = "Date when foreign persons depart from the U.S."
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_year']['column_description'] \
        = "Year of departure like '2020'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_year_quarter']['column_description'] \
        = "Year and quarter of departure like '2016/1'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_year_month']['column_description'] \
        = "Year and month of departure like '2016/01'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_quarter']['column_description'] \
        = "Quarter of departure like '1'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_month']['column_description'] \
        = "Month of departure like '1'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_week']['column_description'] \
        = "Week of departure like '53'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_weekday']['column_description'] \
        = "Day of week like 'Friday'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_weekday_short']['column_description'] \
        = "Day of week in short form like 'Fri'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_dayofweek']['column_description'] \
        = "Day of week as number like '6'"
    json_data_dictionary['tables']['d_date_departures']['columns']['d_dd_day']['column_description'] \
        = "Day number of current date like 2016-01-01 --> 1"


    # Table d_state_destinations --> To which states in the U.S. do immigrants want to continue their travel after
    # their initial arrival and what demographics can immigrants expect when they arrive in the destination state, such
    # as average temperature, population numbers or population density?
    json_data_dictionary['tables']['d_state_destinations']['table_description'] \
        = "State to which immigrants want to continue their travel after initial arrival in the U.S."
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_id']['column_description'] \
        = "PK of table d_state_destinations"
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_state_code']['column_description'] \
        = "Abbreviation of State code"
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_state_name']['column_description'] \
        = "Full name of State"
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_age_median']['column_description'] \
        = "Median age of the population"
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_population_male']['column_description'] \
        = "Average of male population"
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_population_female']['column_description'] \
        = "Average of female population"
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_population_total']['column_description'] \
        = "Average of population"
    json_data_dictionary['tables']['d_state_destinations']['columns']['d_sd_foreign_born']['column_description'] \
        = "Average of the population born abroad"

    # Table f_i94_immigrations
    json_data_dictionary['tables']['f_i94_immigrations']['table_description'] = "I-94 Immigration data to the U.S."
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_id']['column_description'] \
        = "PK of table f_i94_immigrations"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['d_ia_id']['column_description'] \
        = "FK of table d_immigration_airports"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['d_sd_id']['column_description'] \
        = "FK of table d_state_destinations"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['d_da_id']['column_description'] \
        = "FK of table d_date_arrivals"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['d_dd_id']['column_description'] \
        = "FK of table d_date_departures"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['d_ic_id']['column_description'] \
        = "FK of table d_immigration_countries"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_cit']['column_description'] \
        = "Country where the immigrants come from"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_port']['column_description'] \
        = "Arrival airport of immigrants to the U.S."
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_addr']['column_description'] \
        = "State to which the immigrants want to travel"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_arrdate_iso']['column_description'] \
        = "Arrival date in the U.S."
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_depdate_iso']['column_description'] \
        = "Departure date from U.S."
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_dtadfile']['column_description'] \
        = "Date added to I-94 Files"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_matflag']['column_description'] \
        = "Match flag - Match of arrival and departure records"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_count']['column_description'] \
        = "Counter (1). This value is used for calculation purposes"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_year']['column_description'] \
        = "4 digit year when record added to I-94 Files"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_month']['column_description'] \
        = "Month when record added to I-94 Files"
    json_data_dictionary['tables']['f_i94_immigrations']['columns']['f_i94_port_state_code']['column_description'] \
        = "State code of state where immigration airport (I94PORT) is located"

    return json_data_dictionary


def persist_json_data(json_data, location_to_write):
    # write data to file in json format
    with open(location_to_write, "w") as outfile:
        json.dump(json_data, outfile, sort_keys=True, indent=4, ensure_ascii=False)

# File locations to get all columns of the source data to be described.
locations_to_read = [
    "../P8_capstone_resource_files/parquet_star/PQ1/d_immigration_countries",
    "../P8_capstone_resource_files/parquet_star/PQ2/d_immigration_airports",
    "../P8_capstone_resource_files/parquet_star/PQ3/d_date_arrivals",
    "../P8_capstone_resource_files/parquet_star/PQ3/d_date_departures",
    "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations",
    "../P8_capstone_resource_files/parquet_star/PQ4/f_i94_immigrations",
]

def main():
    # automatically create a data dictionary based on the loaded tables (data frames)
    json_data = create_data_dictionary_from_df(locations_to_read)

    # add descriptions to data dictionary for tables and table columns
    json_data = update_descriptions(json_data)

    # persist generated json_data to disk
    location_to_write = "../P8_capstone_documentation/10_P8_capstone_documentation_data_dictionary.json"
    persist_json_data(json_data, location_to_write)
    print("Creation of data dictionary finished")


if __name__ == "__main__":
    main()
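
To show the structure of the generated data dictionary without a Spark session, the sketch below builds the same nested layout from a plain mapping of storage location to column names. The helpers `table_name_from_location` and `build_data_dictionary` are hypothetical stand-ins for the Spark-based function above. The column names are the ones described for `d_immigration_countries`:

```python
import json
import re


def table_name_from_location(location):
    """Return the last path segment of a storage location as the table name."""
    match = re.search(r"^.*/(\w+)$", location)
    if match is None:
        raise ValueError(f"Cannot derive table name from location: {location}")
    return match.group(1)


def build_data_dictionary(columns_by_location):
    """Build the nested data dictionary; all descriptions start as "not set"."""
    tables = {}
    for location, columns in columns_by_location.items():
        table_name = table_name_from_location(location)
        tables[table_name] = {
            "table_name": table_name,
            "table_description": "not set",
            "columns": {
                column: {"column_name": column, "column_description": "not set"}
                for column in columns
            },
        }
    return {"tables": tables}


# Column list stands in for spark.read.parquet(location).columns
sample_columns_by_location = {
    "../P8_capstone_resource_files/parquet_star/PQ1/d_immigration_countries": [
        "d_ic_id", "d_ic_country_code", "d_ic_country_name",
    ],
}

data_dictionary = build_data_dictionary(sample_columns_by_location)
print(json.dumps(data_dictionary, sort_keys=True, indent=4, ensure_ascii=False))
```

Each `"not set"` placeholder is then overwritten by a manual description, which is what `update_descriptions` does for the real tables.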



