# Data Engineering Capstone Project

#### Project Summary
Our objective is to gain insight into tourism by finding out how many people travel to each destination (state) and what the weather conditions (average temperature) are like there at that time of year, so that tourism operators can adjust their offerings accordingly.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# install pandas package, required for .toPandas() functionality
sc.install_pypi_package("pandas==0.25.1") 

An error was encountered:
Package already installed for current Spark context!
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 1110, in install_pypi_package
    raise ValueError("Package already installed for current Spark context!")
ValueError: Package already installed for current Spark context!



In [3]:
from pyspark.sql.functions import udf, to_date, col, month, year, dayofmonth, split, format_string, abs, isnan, when, count, substring, length, regexp_extract, monotonically_increasing_id
from datetime import datetime
from datetime import timedelta

import pyspark.sql.types as t
import pandas as pd
import numpy as np
import os
import configparser

In [35]:
# settings, specify the s3 bucket name on which to read data from and store data on
# config = configparser.ConfigParser()
# config.read('settings.cfg')

# ensure that the environment variables are set before the spark session is started, otherwise s3 cannot be accessed
# os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
# os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']
# s3_bucket=config['AWS']['AWS_S3_BUCKET_LOC']
s3_bucket='jjudacitydatalake'

### Step 1: Scope the Project and Gather Data

#### Scope 
We want to gain insight into tourism in the United States: how many people travel to each destination, and what are the weather conditions like at the arrival location? The goal is for tourism operators to understand why people travel there so they can adjust their tourism offerings.

We use the immigration dataset provided by Udacity (originally from the US National Tourism and Trade Office) which can be found here: https://travel.trade.gov/research/reports/i94/historical/2016.html. 

Our goal is to run a one-time analysis on the full (i.e. the whole year) dataset. We want to match the immigration data to historical temperatures.
One particular example is to count how many people are arriving in a certain state per time period (year, month) and what the expected historical average temperature is like at that period in time.

We'll store the immigration data and the temperature data on S3 with a suitable partitioning for efficient processing. The processing itself will be done via Apache Spark on an EMR cluster. Results will be written back to S3.

#### Describe and Gather Data 
##### Immigration data
Describe the data sets you're using. Where did it come from? What type of information is included? 

We use the immigration dataset provided by Udacity (originally from the US National Tourism and Trade Office), which can be found here: https://travel.trade.gov/research/reports/i94/historical/2016.html. The US National Tourism and Trade Office provides this dataset to the public so that third parties can gain useful insights. Each entry in the dataset is an arrival into the USA with additional data such as gender, date of arrival, airline and purpose of visit (we filter on tourism/pleasure).

The immigration data was provided on a separate disk in the Udacity workspace. I read this data in with pandas and re-wrote it without modifications as parquet files in the i94_parquet folder of my S3 bucket.

##### Temperature data
This dataset was provided by Udacity and is originally sourced from Kaggle. More information can be found here: https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data.




In [41]:
def readMultipleParquet(listPaths):
    """
    Given a list of paths, return Spark Dataframe for processing
    """
    sc.setJobGroup("Read", "Reading multiple parquet files")
    return spark.read.parquet(*listPaths)

listPaths = [f's3://{s3_bucket}/capstone/staging/i94_parquet/i94_apr16_sub.sas7bdat', f's3://{s3_bucket}/capstone/staging/i94_parquet/i94_may16_sub.sas7bdat']

immigration_staging = readMultipleParquet(listPaths)

print(f"Number of rows read: {immigration_staging.count()}")


Number of rows read: 6540562

## Examine the raw immigration data. 

In [6]:
immigration_staging.limit(10).show(truncate=False)

immigration_staging.printSchema()

+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|cicid    |i94yr |i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear|dtaddto |gender|insnum|airline|admnum        |fltno|visatype|
+---------+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|1252973.0|2016.0|5.0   |127.0 |127.0 |MIA    |20581.0|1.0    |FL     |20610.0|49.0  |1.0    |1.0  |20160507|BCH     |null |G      |N      |null   |M      |1967.0 |05062018|M     |null  |AA     |9.563070103E10|01466|E2      |
|1252974.0|2016.0|5.0   |127.0 |127.0 |MIA    |20581.0|1.0    |FL     |20587.0|30.0  |2.0    |1.

# Dataset enhancement

### This is part of Step 2: Explore and Assess the Data
Identify data quality issues, like missing values, duplicate data, etc. & Document steps necessary to clean the data.

## Tourism filtering
We are interested in tourism data. According to the data description, this means filtering on i94visa = 2 (Pleasure).

<code>
/* I94VISA - Visa codes collapsed into three categories:
   1 = Business
   2 = Pleasure
   3 = Student
*/
</code>

## Adding useful date fields
The arrdate and depdate fields are doubles that represent the number of days since 1960-01-01.
We add the columns arrdate_dt and depdate_dt (the same values parsed as proper dates), as well as day of month, month and year columns extracted from arrdate_dt.
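
As a minimal plain-Python sketch of this conversion (outside Spark), the helper below adds a day count to 1960-01-01. The name `sas_days_to_date` is illustrative and not part of the notebook, and it assumes the value is either `None` or a whole number of days stored as a float.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # day 0 of the arrdate/depdate day counts

def sas_days_to_date(sas_days):
    """Convert a day count since 1960-01-01 (e.g. 20581.0) to a date."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))

print(sas_days_to_date(20581.0))  # 2016-05-07
```

The value 20581.0 is the arrdate of the first row shown in the raw data above, which carries 20160507 in its dtadfile column.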

## Fixing US states
The US contains 50 states; together with DC this gives the 51 valid codes listed in valid_us_states below. The field i94addr can contain invalid data (i.e. entries outside this list). If this happens we replace the value with 'other'.
We also replace null values with 'other'.

## Dropping duplicates
We drop duplicate rows in the immigration dataset. Each row ought to be unique.

## Gender
Sometimes gender is null. As there's no way of enriching the data, we substitute the null value with 'unknown'.

In [7]:
def sasDateToDatetime(sasdate):
    """
    Given a SAS date value (the number of days since 1960-01-01), return the corresponding datetime object.
    Returns None when the value is missing.
    """
    return None if sasdate is None else datetime.strptime('1960-01-01', "%Y-%m-%d") + timedelta(sasdate)

valid_us_states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

sasdate_udf = udf(sasDateToDatetime, t.DateType())
imm = immigration_staging.\
    withColumn('arrdate_dt', sasdate_udf('arrdate')).\
    withColumn('depdate_dt', sasdate_udf('depdate')).\
    withColumn('arrdate_dayofmonth', dayofmonth(col('arrdate_dt'))).\
    withColumn('arrdate_month', month(col('arrdate_dt'))).\
    withColumn('arrdate_year', year(col('arrdate_dt'))).\
    withColumn('state', when(~col('i94addr').isin(valid_us_states), 'other').otherwise(col('i94addr'))).\
    fillna('other', subset='state').\
    fillna('unknown', subset='gender').\
    dropDuplicates().\
    select('i94port', 'biryear', 'gender', 'airline', 'i94visa', 'arrdate_dt', 'depdate_dt', 'arrdate_dayofmonth', 'arrdate_month', 'arrdate_year', 'state').\
    filter(col('i94visa') == 2).\
    withColumn('id_imm', monotonically_increasing_id())

print(f"Number of rows in final selected dataset: {imm.count()}")

# dataframe.filter(~dataframe.column.isin(array))
#     withColumn("state", when(immigration_staging["i94addr"] not in valid_us_states, 'other').otherwise(immigration_staging["i94addr"])).\


Number of rows in final selected dataset: 5388905

## Examine subset of immigration data

In [8]:
imm.limit(10).show(truncate=False)

imm.printSchema()

# result = imm.select('state').distinct().collect()
# print(f'All distinct states are: {result}')

+-------+-------+-------+-------+-------+----------+----------+------------------+-------------+------------+-----+----------+
|i94port|biryear|gender |airline|i94visa|arrdate_dt|depdate_dt|arrdate_dayofmonth|arrdate_month|arrdate_year|state|id_imm    |
+-------+-------+-------+-------+-------+----------+----------+------------------+-------------+------------+-----+----------+
|LVG    |1951.0 |M      |VS     |2.0    |2016-05-09|2016-05-22|9                 |5            |2016        |NV   |8589934592|
|ORL    |1994.0 |unknown|BA     |2.0    |2016-05-09|2016-05-23|9                 |5            |2016        |FL   |8589934593|
|MIA    |1994.0 |M      |BA     |2.0    |2016-05-09|2016-05-14|9                 |5            |2016        |FL   |8589934594|
|SFR    |1986.0 |unknown|BA     |2.0    |2016-05-09|2016-05-31|9                 |5            |2016        |CA   |8589934595|
|PIT    |1992.0 |M      |ZX     |2.0    |2016-05-09|2016-05-27|9                 |5            |2016        |PA

# Read in immigration data

Explore data skewness, find partition/clustering keys in order to parallelize processing

I have a gut feeling that the number of people arriving each month and day of month ought to be relatively constant. Let's validate this assumption:

In [9]:
imm.createOrReplaceTempView("immdata")
dataSkewQuery = spark.sql("""
select arrdate_month as month, arrdate_dayofmonth as day, count(*) as immnum
from immdata
group by month, day
order by month asc, day asc
""")
dataSkewQuery.toPandas()

    month  day  immnum
0       4    1   95254
1       4    2   83883
2       4    3   66613
3       4    4   69904
4       4    5   74432
..    ...  ...     ...
56      5   27  107975
57      5   28  104771
58      5   29   75577
59      5   30   64529
60      5   31   69553

[61 rows x 3 columns]

## Examine seasonality in the data: how many visitors arrive in the US each month?

In [10]:
imm.createOrReplaceTempView("immdata")
seasonality_table = spark.sql("""
select arrdate_month as month, count(*) as immnum
from immdata
group by month
order by month asc
""")
seasonality_table.toPandas()

   month   immnum
0      4  2530868
1      5  2858037

# Read in temperature data

In [11]:
temp_staging = spark.read.option("header", "true").csv(f's3://{s3_bucket}/capstone/staging/temperature_data/GlobalLandTemperaturesByCity.csv')
temp_staging.limit(10).show(truncate=False)

+----------+------------------+-----------------------------+-----+-------+--------+---------+
|dt        |AverageTemperature|AverageTemperatureUncertainty|City |Country|Latitude|Longitude|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
|1743-11-01|6.068             |1.7369999999999999           |Århus|Denmark|57.05N  |10.33E   |
|1743-12-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-01-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-02-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-03-01|null              |null                         |Århus|Denmark|57.05N  |10.33E   |
|1744-04-01|5.7879999999999985|3.6239999999999997           |Århus|Denmark|57.05N  |10.33E   |
|1744-05-01|10.644            |1.2830000000000001           |Århus|Denmark|57.05N  |10.33E   |
|1744-06-01|14.050999999999998|1.347              

In [12]:
df_temp = temp_staging.\
    filter(col('Country') == 'United States').\
    select(to_date(col("dt"),"yyyy-MM-dd").alias("dt"), 'AverageTemperature', 'City', 'Country', 'Latitude', 'Longitude').\
    withColumn('dayofmonth', dayofmonth(col('dt'))).\
    withColumn('month', month(col('dt'))).\
    withColumn('year', year(col('dt'))).\
    withColumn("latitude_rounded", format_string("%.0f", regexp_extract(col('Latitude'), r'\d+\.\d+', 0).cast(t.DoubleType()))).\
    withColumn("longitude_rounded", format_string("%.0f", regexp_extract(col('Longitude'), r'\d+\.\d+', 0).cast(t.DoubleType()))).\
    dropna()

In [13]:

df_temp.createOrReplaceTempView("tempdata_coord")
temp_table = spark.sql("""
select dayofmonth, month, latitude_rounded as lat, longitude_rounded as long, avg(AverageTemperature) as AvgTemp
from tempdata_coord
group by lat, long, month, dayofmonth
order by lat asc, long asc, month asc, dayofmonth asc
""")

temp_table = temp_table.withColumn("id_temp_coord", monotonically_increasing_id())
temp_table.show(truncate=False)

+----------+-----+---+----+------------------+-------------+
|dayofmonth|month|lat|long|AvgTemp           |id_temp_coord|
+----------+-----+---+----+------------------+-------------+
|1         |1    |27 |81  |17.598033333333333|0            |
|1         |2    |27 |81  |18.65918828451882 |1            |
|1         |3    |27 |81  |20.510491666666674|2            |
|1         |4    |27 |81  |22.596529166666656|3            |
|1         |5    |27 |81  |25.009309623430973|4            |
|1         |6    |27 |81  |27.00768049792531 |5            |
|1         |7    |27 |81  |27.85293775933608 |6            |
|1         |8    |27 |81  |27.92431535269709 |7            |
|1         |9    |27 |81  |26.861252100840332|8            |
|1         |10   |27 |81  |23.967126582278475|9            |
|1         |11   |27 |81  |20.727659663865523|10           |
|1         |12   |27 |81  |18.027008403361332|11           |
|1         |1    |27 |82  |18.587668103448276|12           |
|1         |2    |27 |82

# Read in airport code data

In [14]:
airport_codes_staging = spark.read.option("header", "true").csv(f's3://{s3_bucket}/capstone/staging/airportcodes_data/airport-codes_csv.csv')

# Mapping coordinates to states
Rounding the lat/long coordinates to 0 decimals introduces some noise: a single lat/long pair can cover multiple states, which is undesirable.

Below we count how often each state occurs within each lat/long pair and select the state with the highest count as the most representative state for that coordinate.
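
The selection rule can be illustrated with a small plain-Python sketch. The airport rows and the function name `most_common_state` are made up for illustration; the notebook itself does this with Spark SQL and a join.

```python
from collections import Counter, defaultdict

# Hypothetical airports as (rounded latitude, rounded longitude, state).
airports = [
    ("40", "75", "PA"),
    ("40", "75", "PA"),
    ("40", "75", "NJ"),
    ("26", "80", "FL"),
]

def most_common_state(rows):
    """Map each (lat, long) pair to the state that occurs most often there."""
    counts = defaultdict(Counter)
    for lat, lon, state in rows:
        counts[(lat, lon)][state] += 1
    return {coord: c.most_common(1)[0][0] for coord, c in counts.items()}

print(most_common_state(airports))
```

Note one difference: the join used in the notebook keeps every state that is tied for the maximum count, whereas `most_common(1)` in this sketch returns a single state per coordinate.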

In [15]:
# coordinates are specified in [longitude, latitude]
coordinates_split = split(airport_codes_staging['coordinates'], ',')
region_split = split(airport_codes_staging['iso_region'], '-')

df_airportcodes = airport_codes_staging.\
                    filter(col('iso_country') == 'US').\
                    withColumn("latitude", format_string("%.0f", abs(coordinates_split.getItem(1).cast(t.DoubleType())))).\
                    withColumn("longitude", format_string("%.0f", abs(coordinates_split.getItem(0).cast(t.DoubleType())))).\
                    withColumn("state", region_split.getItem(1)).\
                    withColumn('state', when(~col('state').isin(valid_us_states), 'other').otherwise(col('state'))).\
                    fillna('other', subset='state')

df_airportcodes.createOrReplaceTempView("aircodes")
# count the number of states for each lat/long pair
aircode_table1 = spark.sql("""
select latitude, longitude, state, count(state) as num
from aircodes
group by latitude, longitude, state
order by latitude, longitude, state
""")
# aircode_table1.show(truncate=False)

df_airportcodes.createOrReplaceTempView("aircodes")
# determine the maximum count per lat/long pair
aircode_table2 = spark.sql("""
select latitude as lat, longitude as long, max(num) as maxPerLatLong from (
    select latitude, longitude, state, count(state) as num
    from aircodes
    group by latitude, longitude, state
    order by latitude, longitude, state
)
group by lat, long
order by lat, long
""")
# aircode_table2.show(truncate=False)

# join both tables to get the state with the most counts for each lat/long pairs
aircode_table3 = aircode_table1.\
    join(aircode_table2, [aircode_table1.latitude == aircode_table2.lat, aircode_table1.longitude == aircode_table2.long, aircode_table1.num == aircode_table2.maxPerLatLong]).\
    drop('long', 'lat', 'num', 'maxPerLatLong').\
    withColumn("id_state_coord", monotonically_increasing_id())

aircode_table3.show(truncate=False)


+--------+---------+-----+--------------+
|latitude|longitude|state|id_state_coord|
+--------+---------+-----+--------------+
|0       |0        |other|0             |
|0       |1        |PA   |1             |
|1       |0        |NC   |2             |
|1       |1        |OK   |3             |
|1       |1        |PA   |4             |
|1       |1        |other|5             |
|1       |3        |NY   |6             |
|1       |3        |other|7             |
|12      |8        |CA   |8             |
|19      |155      |HI   |9             |
|19      |156      |HI   |10            |
|19      |17       |other|11            |
|2       |2        |other|12            |
|20      |155      |HI   |13            |
|20      |156      |HI   |14            |
|21      |156      |HI   |15            |
|21      |157      |HI   |16            |
|21      |158      |HI   |17            |
|22      |158      |HI   |18            |
|22      |159      |HI   |19            |
+--------+---------+-----+--------

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

We have 3 sets of data:
1. immigration data
2. temperature data
3. airport codes

Our immigration data is composed of arrival datetime fields (datetime column, month, dayofmonth, year), some personal info (gender, birth year, state they're entering) and the airline people flew with.

The temperature data is composed of temperatures linked to cities and countries. I tried to join on latitude and longitude rounded to 2 decimals, but this is too fine-grained.
2 decimals of latitude/longitude correspond to a real-world distance of approx 1.11 km. At that resolution there is no overlap between the places where temperatures are measured and the airport locations.
1 decimal corresponds to approx 11.1 km and zero decimals to approx 111 km.
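
The arithmetic behind these distances is sketched below, assuming roughly 111 km per degree of latitude. The function name `grid_size_km` is illustrative.

```python
KM_PER_DEGREE = 111.0  # approximate length of one degree of latitude

def grid_size_km(decimals):
    """Approximate distance covered by one step at the given number of decimals."""
    return KM_PER_DEGREE * 10 ** -decimals

for decimals in (2, 1, 0):
    print(f"{decimals} decimals: about {grid_size_km(decimals):.2f} km")
```

This figure holds for latitude; a degree of longitude is shorter away from the equator (it scales with the cosine of the latitude), so the rounded grid cells are not square.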

The airport code data is composed of iso-region (i.e. US-PA) and coordinates data (longitude, latitude).

In the end we'd like to relate the immigration data to temperatures.
We cannot do this directly, so the goal is: 
1. read in immigration data, extract state.
2. read in airport codes, extract latitude, longitude, round to zero decimals and extract state
3. read in temperature data, extract latitude, longitude, round to zero decimals

Using #3 we can create a helper table with columns: dayofmonth, month, lat, long, AvgTemp.
The idea of this helper table is to determine an average temperature at a given latitude, longitude for a given day of month and month combination.
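
The rounding step that feeds this helper table can be sketched in plain Python as follows. The function name `rounded_coordinate` is illustrative; like the regular expression in the notebook, it keeps only the numeric part and drops the hemisphere letter.

```python
import re

def rounded_coordinate(text):
    """Extract the numeric part of e.g. '57.05N' and round it to zero decimals."""
    match = re.search(r"\d+\.\d+", text)
    if match is None:
        return None
    return format(float(match.group(0)), ".0f")

print(rounded_coordinate("57.05N"))  # 57
print(rounded_coordinate("10.33E"))  # 10
```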

The airport codes table can be linked to this helper table to determine an average temperature for each state for a given day of month and month combination.

Finally, we can link immigration data to average temperatures via the day of month and month (of the arrival dates) and the state.
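
The chain of links described above can be sketched with plain Python dictionaries. All values and names below are made up for illustration; the notebook itself builds these links with Spark joins.

```python
from collections import defaultdict

# Rounded (lat, long) -> state, as derived from the airport codes.
state_by_coord = {("26", "80"): "FL", ("28", "81"): "FL", ("41", "74"): "NY"}
# (lat, long, month, dayofmonth) -> average temperature in degrees Celsius.
temp_by_coord_day = {
    ("26", "80", 4, 1): 24.0,
    ("28", "81", 4, 1): 22.0,
    ("41", "74", 4, 1): 10.0,
}

def temp_by_state_day(state_by_coord, temp_by_coord_day):
    """Average the coordinate-level temperatures per (state, month, dayofmonth)."""
    grouped = defaultdict(list)
    for (lat, lon, month, day), temp in temp_by_coord_day.items():
        state = state_by_coord.get((lat, lon))
        if state is not None:
            grouped[(state, month, day)].append(temp)
    return {key: sum(values) / len(values) for key, values in grouped.items()}

state_temps = temp_by_state_day(state_by_coord, temp_by_coord_day)

# Arrivals look up the temperature via state, month and day of month.
arrivals = [
    {"state": "FL", "month": 4, "dayofmonth": 1},
    {"state": "NY", "month": 4, "dayofmonth": 1},
]
for arrival in arrivals:
    key = (arrival["state"], arrival["month"], arrival["dayofmonth"])
    print(key, state_temps.get(key))
```

Including the state in the lookup key is what ties each arrival to the temperature of its own destination rather than to every coordinate measured on that day.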

#### 3.2 Mapping Out Data Pipelines
##### The data dictionary is included below

1. Read in immigration data, temperature data, airport codes data
2. Select relevant columns
    1. immigration data: i94port|biryear|gender |airline|i94visa|i94addr|arrdate_dt|depdate_dt|arrdate_dayofmonth|arrdate_month|arrdate_year
    1. temperature data: dayofmonth|month|lat|long|AvgTemp           
    1. airport codes data: lat|long|state
3. create fact and dimension tables
    1. dim_state_coord  
        1. id_state_coord (integer, unique)  
        1. state (string, state corresponding to rounded lat/long coordinate) 
        1. lat (latitude coordinate, rounded to zero decimals) 
        1. long (longitude coordinate, rounded to zero decimals) 
    1. dim_temp_coord  
        1. id_temp_coord (integer, unique) 
        1. day_of_month (integer, day of month) 
        1. month (integer, month) 
        1. lat (latitude coordinate, rounded to zero decimals) 
        1. long (longitude coordinate, rounded to zero decimals) 
        1. AvgTemp (average temperature in degrees Celsius corresponding to latitude, longitude coordinate) 
    1. dim_time  
        1. arr_dt (date, arrival datetime, unique) 
        1. day_of_month (integer, day of month) 
        1. month (integer, month) 
        1. year (integer, year) 
    1. dim_imm  
        1. id_imm (integer, unique) 
        1. i94port (string, port of arrival)
        1. biryear (integer, birth year of immigrant) 
        1. gender (string, gender of immigrant, is either M for male, F for female or unknown) 
        1. airline (string, airline people flew in with) 
        1. i94visa (integer, type of visa, filtered on 2 for Tourism) 
        1. i94addr (string, destination state)
        1. arr_dt (date, arrival date) 
        1. dep_dt (date, departure date) 
    1. fact_imm  
        1. id_imm (integer)  
        1. id_state_coord (integer)  
        1. id_temp_coord (integer)  
        1. arr_dt (date)  


### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

# Create dimension tables

Use .persist() on the smaller tables to optimize performance.

In [16]:
dim_state_coord = aircode_table3.persist()
dim_state_coord.limit(10).show(truncate=False)

dim_temp_coord = temp_table.persist()
dim_temp_coord.limit(10).show(truncate=False)

imm.createOrReplaceTempView("immdata")
dim_time = spark.sql("""
select distinct arrdate_dt as datetime, arrdate_dayofmonth as dayofmonth, arrdate_month as month, arrdate_year as year
from immdata
""").persist()
dim_time.limit(10).show(truncate=False)

dim_imm = imm

dim_imm.limit(10).show(truncate=False)

+--------+---------+-----+--------------+
|latitude|longitude|state|id_state_coord|
+--------+---------+-----+--------------+
|0       |0        |other|0             |
|0       |1        |PA   |1             |
|1       |0        |NC   |2             |
|1       |1        |OK   |3             |
|22      |160      |HI   |25769803776   |
|24      |166      |HI   |25769803777   |
|25      |80       |FL   |25769803778   |
|25      |81       |FL   |25769803779   |
|25      |82       |FL   |25769803780   |
|26      |80       |FL   |25769803781   |
+--------+---------+-----+--------------+

+----------+-----+---+----+------------------+-------------+
|dayofmonth|month|lat|long|AvgTemp           |id_temp_coord|
+----------+-----+---+----+------------------+-------------+
|1         |1    |27 |82  |18.587668103448276|17179869184  |
|1         |2    |27 |82  |18.725623376623375|17179869185  |
|1         |3    |27 |82  |19.821219827586205|17179869186  |
|1         |4    |27 |82  |21.626489270386273

# Create fact table

The fact table should have 4 columns
1. id_imm
1. id_state_coord
1. id_temp_coord
1. arr_dt

In [36]:
print(dim_imm.count())

5388905

In [39]:
fact_imm = dim_imm.\
    join(dim_time, [dim_imm.arrdate_dt == dim_time.datetime]).\
    join(dim_temp_coord, [dim_imm.arrdate_dayofmonth == dim_temp_coord.dayofmonth, dim_imm.arrdate_month == dim_temp_coord.month]).\
    join(dim_state_coord, [dim_temp_coord.lat == dim_state_coord.latitude, dim_temp_coord.long == dim_state_coord.longitude]).\
    select('id_imm', 'id_state_coord', 'id_temp_coord', 'datetime').\
    dropDuplicates()

fact_imm.limit(10).show(truncate=False)


17445547

# Write the dimension/fact tables to S3

In [21]:
# write every dimension/fact table to S3 as parquet, overwriting the output of earlier runs
tables_to_write = {
    "dim_state_coord": dim_state_coord,
    "dim_temp_coord": dim_temp_coord,
    "dim_time": dim_time,
    "dim_imm": dim_imm,
    "fact_imm": fact_imm,
}

for table_name, table_df in tables_to_write.items():
    s3_result_path = f"s3a://{s3_bucket}/capstone/processed/{table_name}"

    table_df.\
        write.\
        format("parquet").\
        mode("overwrite").\
        save(s3_result_path)



#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
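
As an illustration of the unique-key constraint mentioned in the list above, the plain-Python sketch below reports key values that occur more than once. The function name `duplicate_keys` and the sample rows are made up; on Spark the same idea would amount to comparing the row count with the count of distinct keys.

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return the values of `key` that occur in more than one row."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

# Made-up rows standing in for a dimension table.
rows = [{"id_imm": 1}, {"id_imm": 2}, {"id_imm": 2}]
print(duplicate_keys(rows, "id_imm"))  # [2]
```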
 

In [40]:
sc.setJobGroup("DataQuality", "Counting number of records in tables")

def recordCount(table_name):
    """
    Given a dataframe table_name, return the number of records in it
    """
    return table_name.count()

def checkNumberOfRows(actual_count, expected_count):
    """
    Given a number actual_count, compare to expected_count and raise ValueError exception if it differs.
    """
    if actual_count != expected_count:
        raise ValueError(f"The number of records found is {actual_count}, differing from expected value {expected_count}")

expectedRowCount = {dim_state_coord : 1200, dim_temp_coord : 1164, dim_time : 61, dim_imm : 5388905, fact_imm : 17445547}

# compare the actual row count of every table against its expected value
for table, expectedCount in expectedRowCount.items():
    checkNumberOfRows(recordCount(table), expectedCount)


In [22]:
# imm.createOrReplaceTempView('immdata')
# sc.setJobGroup("DataQuality", "Selecting distinct states")
# result = spark.sql("""
# select distinct state
# from immdata
# order by state asc
# """)
# result.show(result.count(), truncate=False)

sc.setJobGroup("DataQuality", "Counting total number of distinct states")
numDistinctStates = spark.sql("""
select count(distinct state) 
from immdata
""")

checkNumberOfRows(numDistinctStates.collect()[0]['count(DISTINCT state)'], len(valid_us_states) + 1)
# result.show(truncate=False)

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
In this project we chose AWS S3 for fast and efficient storage, combined with an AWS EMR cluster for efficient processing of large datasets.
S3 is also used to store the results of the analysis.

As the data is relatively static (immigration data grows slowly, one day at a time), this analysis was set up as a one-off: we put the data in place, run a one-time analysis and proceed with the results.

* Propose how often the data should be updated and why.
Temperature data is currently averaged over many years, so there's no need to update it often unless weather conditions across the US change a lot.

Immigration data grows by one year's worth of arrivals every year, so a yearly update frequency would be appropriate.

* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.<br>
If the data was increased by 100x, I'd look into splitting the data into smaller partitions. Furthermore, I would increase the number of nodes on the EMR cluster for faster processing. I'd also consider moving the data from S3 to the Hadoop file system on the cluster to reduce I/O across the network.
 
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.<br>
I'd create an Airflow workflow to schedule this task. Furthermore, I'd process the history once and rewrite the code so that only the newly added daily data is processed.
 
 * The database needed to be accessed by 100+ people.<br>
If the database needed to be accessed by 100+ people, I'd run the analysis once, cache the results somewhere and serve them from the cache. As the setup stands, there is no user input that could change the output.

# Analysis example

Which state receives the most tourists per month, and what is the temperature like at the destination?

In [40]:
sc.setJobGroup("Analysis", "Counting number of tourists per month and the average temperature therein")

fact_imm.createOrReplaceTempView("fact_imm")
dim_temp_coord.createOrReplaceTempView("temp")
dim_imm.createOrReplaceTempView("imm")
# select arrdate_month, state, imm.id_imm, fact_imm.id_imm, temp.id_temp_coord, fact_imm.id_temp_coord, temp.AvgTemp
analysis1 = spark.sql("""
select arrdate_month, state, avg(temp.AvgTemp), count(*) as tourist_num
from imm
join fact_imm on (imm.id_imm = fact_imm.id_imm)
join temp on (temp.id_temp_coord = fact_imm.id_temp_coord)
group by arrdate_month, state
order by tourist_num desc
""")
analysis1.limit(10).show(truncate=False)


+-------------+-----+------------------+-----------+
|arrdate_month|state|avg(AvgTemp)      |tourist_num|
+-------------+-----+------------------+-----------+
|4            |FL   |15.319045976272758|678321     |
|4            |NY   |15.399508138264684|564840     |
|4            |CA   |15.399421013491569|434376     |
|5            |FL   |17.65150757190719 |418473     |
|5            |NY   |17.402843662849904|335907     |
|5            |CA   |17.546777904672375|284391     |
|4            |other|15.152546680569664|269649     |
|5            |other|18.016987491220323|193266     |
|4            |HI   |15.469318783133671|175311     |
|5            |HI   |17.754536895845458|141372     |
+-------------+-----+------------------+-----------+