# Project 08 - Analysis of U.S. Immigration (I-94) Data
### Udacity Data Engineer - Capstone Project
> by Peter Wissel | 2021-05-05

## Project Overview
This project works with a dataset on immigration to the United States. Supplementary datasets cover airport codes,
U.S. city demographics, and temperature data.

The following process is divided into different sub-steps to illustrate how to answer the questions set by the business
analytics team.

The project file follows these steps:
* Step 4: Run ETL to Model the Data

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness

Run Quality Checks

##### 4.2.1 Define StructType and create result data frame

In [1]:
###### Imports and Installs section
import shutil
import pandas as pd
import pyspark.sql.functions as F
# import spark as spark
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType, LongType, TimestampType, DateType
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, DataFrameNaFunctions
from pyspark.sql.functions import when, count, col, to_date, datediff, date_format, month
import re
import json
from os import path


max_memory = "5g"

# Build the Spark session; appName is set only once, so the descriptive name is not overwritten
spark = SparkSession \
    .builder \
    .appName("etl pipeline for project 8 - I94 data") \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:3.0.0-s_2.12") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.executor.memory", max_memory) \
    .config("spark.driver.memory", max_memory) \
    .enableHiveSupport() \
    .getOrCreate()

# setting the current LOG-Level
spark.sparkContext.setLogLevel('ERROR')

# Define format to store data quality result data frame
result_struct_type = StructType(
    [
         StructField("dq_result_table_name", StringType(), True)
        ,StructField("dq_result_null_entries", IntegerType(), True)
        ,StructField("dq_result_entries", IntegerType(), True)
        ,StructField("dq_result_status", StringType(), True)
    ]
)

In [None]:
# execute check commands

#####  4.2.2 Data Quality (dq) checks for table d_immigration_countries

In [2]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ1/d_immigration_countries"
df_dq_table_d_immigration_countries = spark.read.parquet(location_to_read)

# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_immigration_countries \
    .select("d_ic_id") \
    .where("d_ic_id is null or d_ic_id == ''") \
    .count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_immigration_countries.count()
print(f"df_dq_check_content: {df_dq_check_content}")

df_dq_check_null_values: 0
df_dq_check_content: 289


In [3]:
# derive the check status (OK / NOK) for the result row
table_name = "d_immigration_countries"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(dq_check_result)

OK
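
The OK/NOK rule in the cell above is repeated for every table in this section. A minimal sketch of the same rule as a reusable function could look like this; the name `dq_status` is hypothetical and not part of the original notebook.

```python
def dq_status(null_count: int, row_count: int) -> str:
    """Return "OK" if no key is null/empty and the table has rows, else "NOK"."""
    # Same condition as in the notebook cells: no null/empty keys and at least one row.
    if null_count < 1 and row_count > 0:
        return "OK"
    return "NOK"


# Values taken from the d_immigration_countries check above.
print(dq_status(0, 289))
```

Keeping the rule in one place means a later change, such as tolerating a small number of null keys, only has to be made once.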


In [4]:
dq_results = [ (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result) ]
print(dq_results)

# create results data frame
df_dq_results = spark.createDataFrame(dq_results, result_struct_type)

# check df schema and content
df_dq_results.printSchema()
df_dq_results.show(100, False)

[('d_immigration_countries', 0, 289, 'OK')]
root
 |-- dq_result_table_name: string (nullable = true)
 |-- dq_result_null_entries: integer (nullable = true)
 |-- dq_result_entries: integer (nullable = true)
 |-- dq_result_status: string (nullable = true)

+-----------------------+----------------------+-----------------+----------------+
|dq_result_table_name   |dq_result_null_entries|dq_result_entries|dq_result_status|
+-----------------------+----------------------+-----------------+----------------+
|d_immigration_countries|0                     |289              |OK              |
+-----------------------+----------------------+-----------------+----------------+



#####  4.2.3 Data Quality (dq) checks for table d_immigration_airports

In [5]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ2/d_immigration_airports"
df_dq_table_d_immigration_airports = spark.read.parquet(location_to_read)

# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_immigration_airports \
    .select("d_ia_id") \
    .where("d_ia_id is null or d_ia_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_immigration_airports.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

df_dq_check_null_values: 0
df_dq_check_content: 660


In [6]:
# derive the check status (OK / NOK) for the result row
table_name = "d_immigration_airports"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

table_name: d_immigration_airports
dq_check_result: OK
df_dq_check_null_values: 0
df_dq_check_content: 660


In [7]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

[('d_immigration_airports', 0, 660, 'OK')]


In [8]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [9]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

+-----------------------+----------------------+-----------------+----------------+
|dq_result_table_name   |dq_result_null_entries|dq_result_entries|dq_result_status|
+-----------------------+----------------------+-----------------+----------------+
|d_immigration_countries|0                     |289              |OK              |
|d_immigration_airports |0                     |660              |OK              |
+-----------------------+----------------------+-----------------+----------------+
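
Each table check appends its result with a `union` on a one-row data frame. As an alternative sketch, the result tuples can be collected in a plain Python list and turned into a data frame once at the end. The helper name `make_dq_row` and the list `dq_rows` are assumptions for illustration; the final `createDataFrame` call is shown as a comment because it needs the `spark` session and `result_struct_type` defined in 4.2.1.

```python
def make_dq_row(table_name, null_count, row_count):
    """Build one result tuple in the column order of result_struct_type."""
    status = "OK" if null_count < 1 and row_count > 0 else "NOK"
    return (table_name, null_count, row_count, status)


# Collect one tuple per checked table (counts taken from the outputs above).
dq_rows = [
    make_dq_row("d_immigration_countries", 0, 289),
    make_dq_row("d_immigration_airports", 0, 660),
]
print(dq_rows)

# With the notebook's session, a single call would then build the data frame:
# df_dq_results = spark.createDataFrame(dq_rows, result_struct_type)
```

This avoids building a chain of unions, at the cost of not being able to inspect the intermediate data frame after every table.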



#####  4.2.4 Data Quality (dq) checks for table d_date_arrivals

In [11]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_arrivals"
df_dq_table_d_date_arrivals = spark.read.parquet(location_to_read)

In [12]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_date_arrivals \
    .select("d_da_id") \
    .where("d_da_id is null or d_da_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_date_arrivals.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

df_dq_check_null_values: 0
df_dq_check_content: 366


In [13]:
# derive the check status (OK / NOK) for the result row
table_name = "d_date_arrivals"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

table_name: d_date_arrivals
dq_check_result: OK
df_dq_check_null_values: 0
df_dq_check_content: 366


In [14]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

[('d_date_arrivals', 0, 366, 'OK')]


In [15]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [16]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

+-----------------------+----------------------+-----------------+----------------+
|dq_result_table_name   |dq_result_null_entries|dq_result_entries|dq_result_status|
+-----------------------+----------------------+-----------------+----------------+
|d_immigration_countries|0                     |289              |OK              |
|d_immigration_airports |0                     |660              |OK              |
|d_date_arrivals        |0                     |366              |OK              |
+-----------------------+----------------------+-----------------+----------------+



#####  4.2.5 Data Quality (dq) checks for table d_date_departures

In [17]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ3/d_date_departures"
df_dq_table_d_date_departures = spark.read.parquet(location_to_read)

In [18]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_date_departures \
    .select("d_dd_id") \
    .where("d_dd_id is null or d_dd_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_date_departures.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

df_dq_check_null_values: 0
df_dq_check_content: 531


In [19]:
# derive the check status (OK / NOK) for the result row
table_name = "d_date_departures"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

table_name: d_date_departures
dq_check_result: OK
df_dq_check_null_values: 0
df_dq_check_content: 531


In [20]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

[('d_date_departures', 0, 531, 'OK')]


In [21]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [22]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

+-----------------------+----------------------+-----------------+----------------+
|dq_result_table_name   |dq_result_null_entries|dq_result_entries|dq_result_status|
+-----------------------+----------------------+-----------------+----------------+
|d_immigration_countries|0                     |289              |OK              |
|d_immigration_airports |0                     |660              |OK              |
|d_date_arrivals        |0                     |366              |OK              |
|d_date_departures      |0                     |531              |OK              |
+-----------------------+----------------------+-----------------+----------------+



#####  4.2.6 Data Quality (dq) checks for table d_state_destinations

In [23]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/d_state_destinations"
df_dq_table_d_state_destinations = spark.read.parquet(location_to_read)

In [24]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_d_state_destinations \
    .select("d_sd_id") \
    .where("d_sd_id is null or d_sd_id == ''") \
    .count()

# Check that table has > 0 rows
df_dq_check_content = df_dq_table_d_state_destinations.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

df_dq_check_null_values: 0
df_dq_check_content: 55


In [25]:
# derive the check status (OK / NOK) for the result row
table_name = "d_state_destinations"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

table_name: d_state_destinations
dq_check_result: OK
df_dq_check_null_values: 0
df_dq_check_content: 55


In [26]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

[('d_state_destinations', 0, 55, 'OK')]


In [27]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [28]:
df_dq_results.show(10, False)

##------------------------------------------------------------------------#

+-----------------------+----------------------+-----------------+----------------+
|dq_result_table_name   |dq_result_null_entries|dq_result_entries|dq_result_status|
+-----------------------+----------------------+-----------------+----------------+
|d_immigration_countries|0                     |289              |OK              |
|d_immigration_airports |0                     |660              |OK              |
|d_date_arrivals        |0                     |366              |OK              |
|d_date_departures      |0                     |531              |OK              |
|d_state_destinations   |0                     |55               |OK              |
+-----------------------+----------------------+-----------------+----------------+



#####  4.2.7 Data Quality (dq) checks for table f_i94_immigrations

In [29]:
# read table
location_to_read = "../P8_capstone_resource_files/parquet_star/PQ4/f_i94_immigrations"
df_dq_table_f_i94_immigrations = spark.read.parquet(location_to_read)

In [30]:
# Check if key fields have valid values (no nulls or empty)
df_dq_check_null_values = df_dq_table_f_i94_immigrations \
    .select(  "f_i94_id"
            , "d_ia_id"
            , "d_sd_id"
            , "d_da_id"
            , "d_dd_id"
            , "d_ic_id"
            ) \
    .where(  "    f_i94_id is null or f_i94_id == ''"
             " or d_ia_id is null or d_ia_id == ''"
             " or d_sd_id is null or d_sd_id == ''"
             " or d_da_id is null or d_da_id == ''"
             " or d_dd_id is null or d_dd_id == ''"
             " or d_ic_id is null or d_ic_id == ''") \
    .count()
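
The filter string for the fact table repeats the same `is null or == ''` pattern for six key columns. A small sketch, assuming a hypothetical helper named `null_or_empty_predicate`, shows how the same Spark SQL expression could be generated from a list of column names instead of being written by hand.

```python
def null_or_empty_predicate(columns):
    """Return a Spark SQL filter string matching rows where any column is null or ''."""
    return " or ".join(f"{c} is null or {c} == ''" for c in columns)


# Key columns of f_i94_immigrations as listed in the cell above.
key_columns = ["f_i94_id", "d_ia_id", "d_sd_id", "d_da_id", "d_dd_id", "d_ic_id"]
predicate = null_or_empty_predicate(key_columns)
print(predicate)

# Usage with the notebook's data frame (requires the Spark session from 4.2.1):
# df_dq_check_null_values = df_dq_table_f_i94_immigrations.where(predicate).count()
```

Because every term is joined with `or`, operator precedence is not a concern, and the same helper works for the single-key dimension tables.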

In [31]:
# Check that table has > 0 rows
df_dq_check_content = df_dq_table_f_i94_immigrations.count()

print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

df_dq_check_null_values: 0
df_dq_check_content: 12228839


In [32]:
# derive the check status (OK / NOK) for the result row
table_name = "f_i94_immigrations"
if df_dq_check_null_values < 1 and df_dq_check_content > 0:
    dq_check_result = "OK"
else:
    dq_check_result = "NOK"

print(f"table_name: {table_name}")
print(f"dq_check_result: {dq_check_result}")
print(f"df_dq_check_null_values: {df_dq_check_null_values}")
print(f"df_dq_check_content: {df_dq_check_content}")

table_name: f_i94_immigrations
dq_check_result: OK
df_dq_check_null_values: 0
df_dq_check_content: 12228839


In [33]:
dq_results = [
    (table_name, df_dq_check_null_values, df_dq_check_content, dq_check_result)
]
print(dq_results)

[('f_i94_immigrations', 0, 12228839, 'OK')]


In [34]:
# add new row to current results data frame
new_row = spark.createDataFrame(dq_results, result_struct_type)
df_dq_results = df_dq_results.union(new_row)

In [35]:
df_dq_results.show(10, False)
##------------------------------------------------------------------------#



+-----------------------+----------------------+-----------------+----------------+
|dq_result_table_name   |dq_result_null_entries|dq_result_entries|dq_result_status|
+-----------------------+----------------------+-----------------+----------------+
|d_immigration_countries|0                     |289              |OK              |
|d_immigration_airports |0                     |660              |OK              |
|d_date_arrivals        |0                     |366              |OK              |
|d_date_departures      |0                     |531              |OK              |
|d_state_destinations   |0                     |55               |OK              |
|f_i94_immigrations     |0                     |12228839         |OK              |
+-----------------------+----------------------+-----------------+----------------+
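
The summary table only protects the pipeline if a failed check actually stops it. Below is a minimal sketch of such a gate, assuming the result rows have been collected to the driver as tuples in the column order of `result_struct_type`; the function name `failed_tables` is hypothetical.

```python
def failed_tables(rows):
    """Return the table names whose dq_result_status is not "OK"."""
    return [table_name for table_name, _nulls, _entries, status in rows if status != "OK"]


# Rows as shown in the summary table above.
rows = [
    ("d_immigration_countries", 0, 289, "OK"),
    ("d_immigration_airports", 0, 660, "OK"),
    ("d_date_arrivals", 0, 366, "OK"),
    ("d_date_departures", 0, 531, "OK"),
    ("d_state_destinations", 0, 55, "OK"),
    ("f_i94_immigrations", 0, 12228839, "OK"),
]

failed = failed_tables(rows)
if failed:
    raise ValueError(f"Data quality checks failed for: {failed}")
print("All data quality checks passed.")

# In the notebook, the rows could come from:
# rows = [tuple(r) for r in df_dq_results.collect()]
```

Raising an exception rather than only printing the status lets a scheduler or a calling script detect the failure.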

