# World Temperature vs CO2 and SO2 Emissions
### Data Engineering Capstone Project

#### Project Summary
This Project creates a Data Lake type of ETL pipeline to process, clean and store data related to world temperature and emissions. Data can be used to analyse if country emissions have impact on world temperature.  

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Import necessary libraries
import pandas as pd
import re
from pyspark.sql import SparkSession
import os
import configparser
from datetime import datetime
from pyspark.sql import types as t
from pyspark.sql.functions import udf, col, monotonically_increasing_id
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear
#from pyspark.sql.functions import year, month, dayofmonth

In [2]:
# Read config file
config = configparser.ConfigParser()
config.read_file(open('dl.cfg'))

os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['AWS_ACCESS_KEY_ID']
os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['AWS_SECRET_ACCESS_KEY']

# NOTE: Use these if using AWS S3 as a storage
INPUT_DATA = config['AWS']['INPUT_DATA']
OUTPUT_DATA = config['AWS']['OUTPUT_DATA']

# NOTE: Use these if using local storage
INPUT_DATA_LOCAL          = config['LOCAL']['INPUT_DATA_LOCAL']
INPUT_DATA_I94_LOCAL      = config['LOCAL']['INPUT_DATA_I94_LOCAL']
INPUT_DATA_AIRPORT_LOCAL  = config['LOCAL']['INPUT_DATA_AIRPORT_LOCAL']
INPUT_DATA_COUNTRY_LOCAL  = config['LOCAL']['INPUT_DATA_COUNTRY_LOCAL']
INPUT_DATA_AIRPORT_I94_LOCAL  = config['LOCAL']['INPUT_DATA_AIRPORT_I94_LOCAL']
INPUT_DATA_COUNTRY_I94_LOCAL  = config['LOCAL']['INPUT_DATA_COUNTRY_I94_LOCAL']
OUTPUT_DATA_LOCAL         = config['LOCAL']['OUTPUT_DATA_LOCAL']

# NOTE: Use these when storing data on server.
INPUT_DATA_I94_SERVER     = config['SERVER']['INPUT_DATA_I94_SERVER']
INPUT_DATA_AIRPORT_SERVER = config['SERVER']['INPUT_DATA_AIRPORT_SERVER']
INPUT_DATA_COUNTRY_SERVER = config['SERVER']['INPUT_DATA_COUNTRY_SERVER']
OUTPUT_DATA_SERVER        = config['SERVER']['OUTPUT_DATA_SERVER']

# NOTE: Use these when storing data on AWS.
INPUT_DATA_I94            = config['AWS']['INPUT_DATA_I94']
INPUT_DATA_AIRPORT        = config['AWS']['INPUT_DATA_AIRPORT']
INPUT_DATA_COUNTRY        = config['AWS']['INPUT_DATA_COUNTRY']
OUTPUT_DATA               = config['AWS']['OUTPUT_DATA']

DATA_LOCATION             = config['COMMON']['DATA_LOCATION']
DATA_STORAGE              = config['COMMON']['DATA_STORAGE']

#print(AWS_ACCESS_KEY_ID)
#print(AWS_SECRET_ACCESS_KEY)

print(INPUT_DATA_LOCAL)
print(INPUT_DATA_I94_LOCAL)
print(INPUT_DATA_AIRPORT_LOCAL)
print(INPUT_DATA_COUNTRY_LOCAL)
print(INPUT_DATA_AIRPORT_I94_LOCAL)
print(INPUT_DATA_COUNTRY_I94_LOCAL)
print(OUTPUT_DATA_LOCAL)
print(DATA_LOCATION)
print(DATA_STORAGE)

data/
data/18-83510-I94-Data-2016/i94_jan16_sub.sas7bdat
data/airport_codes.csv
data/iso-3166-country-codes.json
data/i94_airport_codes.xlsx
data/i94_country_codes.xlsx
data/output_data/
local
parquet


### Step 1: Scope the Project and Gather Data

#### Scope 
_Explain what you plan to do in the project in more detail. What data do you use? What is your end solution look like? What tools did you use? etc_

Scope of the project is to create an ETL pipeline for processing, cleaning and storing data related to US I94 immigration data, and country codes.  

Output of the ETL pipeline: processed data stored in Star schema model to parquet files. 
Tools: python, pandas, pyspark, (Amazon AWS S3)

#### Describe and Gather Data 
_Describe the data sets you're using. Where did it come from? What type of information is included?_

Project's data contains the following pieces:
* **data/18-83510-I94-Data-2016/**: US I94 immigration data from 2016 (Jan-Dec).
    * Source: https://travel.trade.gov/research/reports/i94/historical/2016.html
    * Description: I94_SAS_Labels_Descriptions.txt file contains descriptions for the I94 data.
        * I94 dataset has SAS7BDAT file per each month of the year (e.g. i94_jan16_sub.sas7bdat).
        * Each file contains about 3M rows
        * Data has 28 columns containing information about event date, arriving person, airport, airline, etc.
    * I94 immigration data example:
    * ![I94-immigration-data example](./Udacity-DEND-Project-Capstone-I94ImmigrationData-20190812-2.png)
    * NOTE: This data is behind a pay wall and need to be purchased to get access. Data is available for Udacity DEND course.
    
* **data/i94_airport_codes.xlsx**: Airport codes and related cities defined in I94 data description file.
    * Source: https://travel.trade.gov/research/reports/i94/historical/2016.html
    * Description: I94 Airport codes data contains information about different airports around the world.
        * Columns: i94port, i94_airport_name
        * Data has 660 rows and 2 columns.
    * NOTE: I94 data uses its own codes for airports instead of using standard codes (like IATA). Therefore, I94 airport codes have been taken from I94 data description file and processed for ETL use.  
    * Airport Code example:
    * ![I94-AirportCode-data example](./Udacity-DEND-Project-Capstone-I94AirportCodeData-20190813-4.png)

* **data/i94_country_codes.xlsx**: Country codes defined in US I94 Immigration data description file. 
    * Source: https://travel.trade.gov/research/reports/i94/historical/2016.html
    * Description: I94 Country codes data contains information about countries people come to US from.
        * Columns: i94cit, i94_country_code
        * Data has 289 rows and 2 columns.
    * NOTE: I94 data uses its own codes for countries instead of using ISO-3166 standard codes. Therefore, I94 country codes have been taken from I94 data description file and processed for ETL use.
    * Country Code example:
    * ![CountryCode-data example](./Udacity-DEND-Project-Capstone-I94CountryCodeData-20190813-5.png)    
  
* **data/airport-codes.csv**: Airport codes and related cities.
    * Source: https://datahub.io/core/airport-codes#data
    * Description: Airpot codes data contains information about different airports around the world.
        * Columns: Airport code, name, type, location, etc.
        * Data has 48304 rows and 12 columns.
    * Airport Code example:
    * ![AirportCode-data example](./Udacity-DEND-Project-Capstone-AirportCodeData-20190812-3.png)

* **data/iso-3166-country-codes.json**: World country codes (ISO-3166)
    * Source: https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes
    ISO-3166-1 and ISO-3166-2 Country and Dependent Territories Lists with UN Regional Codes
    * ISO-3166: https://www.iso.org/iso-3166-country-codes.html
    * Country Code example:
    * ![CountryCode-data example](./Udacity-DEND-Project-Capstone-CountryCodesData-20190804-4.png)

### 1.1 Define config and read in data

In [3]:
# Set config
if DATA_LOCATION == "local":
    input_data        = INPUT_DATA_LOCAL
    i94_data          = INPUT_DATA_I94_LOCAL
    airport_codes     = INPUT_DATA_AIRPORT_LOCAL
    country_codes     = INPUT_DATA_COUNTRY_LOCAL
    airport_codes_i94 = INPUT_DATA_AIRPORT_I94_LOCAL
    country_codes_i94 = INPUT_DATA_COUNTRY_I94_LOCAL
    output_data       = OUTPUT_DATA_LOCAL
elif DATA_LOCATION == "server":
    input_data_bucket = INPUT_DATA_SERVER
    i94_data          = INPUT_DATA_I94_SERVER
    airport_codes     = INPUT_DATA_AIRPORT_SERVER
    country_codes     = INPUT_COUNTRY_SERVER
    airport_codes_i94 = INPUT_DATA_AIRPORT_I94_SERVER
    country_codes_i94 = INPUT_DATA_COUNTRY_I94_SERVER
    output_data       = OUTPUT_DATA_SERVER
elif DATA_LOCATION == "aws":
    input_data_bucket = INPUT_DATA
    i94_data          = INPUT_DATA_I94
    airport_codes     = INPUT_DATA_AIRPORT
    country_codes     = INPUT_COUNTRY
    airport_codes_i94 = INPUT_DATA_AIRPORT_I94
    country_codes_i94 = INPUT_DATA_COUNTRY_I94
    output_data       = OUTPUT_DATA
    
if DATA_STORAGE == "postgresql":
    pass
elif DATA_STORAGE == "parquet":
    data_storage      = DATA_STORAGE

In [None]:
# Read I94 immigration data:
i94_df = pd.read_sas(i94_data, 'sas7bdat', encoding='ISO-8859-1')

In [4]:
# Read airport code data:
airport_codes_df = pd.read_csv(airport_codes, header=0, sep=',')

# Read Global country codes data:
country_codes_df = pd.read_json(country_codes, orient="records")

# Read I94 Airport codes data:
airport_codes_i94_df = pd.read_excel(airport_codes_i94, header=0, index_col=0)

# Read I94 Country codes data:
country_codes_i94_df = pd.read_excel(country_codes_i94, header=0, index_col=0)

### 1.2 Show data snippets

In [None]:
# I94 immigration data
i94_df.head()

In [5]:
airport_codes_df.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [6]:
country_codes_df.head()

Unnamed: 0,alpha-2,alpha-3,country-code,intermediate-region,intermediate-region-code,iso_3166-2,name,region,region-code,sub-region,sub-region-code
0,AF,AFG,4,,,ISO 3166-2:AF,Afghanistan,Asia,142,Southern Asia,34
1,AX,ALA,248,,,ISO 3166-2:AX,Åland Islands,Europe,150,Northern Europe,154
2,AL,ALB,8,,,ISO 3166-2:AL,Albania,Europe,150,Southern Europe,39
3,DZ,DZA,12,,,ISO 3166-2:DZ,Algeria,Africa,2,Northern Africa,15
4,AS,ASM,16,,,ISO 3166-2:AS,American Samoa,Oceania,9,Polynesia,61


In [7]:
airport_codes_i94_df.head()

Unnamed: 0_level_0,i94_airport_name
i94port,Unnamed: 1_level_1
'ALC',"'ALCAN, AK '"
'ANC',"'ANCHORAGE, AK '"
'BAR',"'BAKER AAF - BAKER ISLAND, AK'"
'DAC',"'DALTONS CACHE, AK '"
'PIZ',"'DEW STATION PT LAY DEW, AK'"


In [8]:
country_codes_i94_df.head()

Unnamed: 0_level_0,i94_country_name
i94cit,Unnamed: 1_level_1
582,MEXICO'
236,'AFGHANISTAN'
101,'ALBANIA'
316,'ALGERIA'
102,'ANDORRA'


### 1.3 Create Spark session

In [9]:
#spark = SparkSession.builder\
#                     .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.0")\
#                     .getOrCreate()
#from pyspark.sql import SparkSession
spark = SparkSession.builder\
                    .config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
                    .enableHiveSupport().getOrCreate()

### 1.4 Read I94 immigration data to Spark

In [10]:
i94_schema = t.StructType([
                            t.StructField("alpha-2", t.StringType(), False),
                            t.StructField("alpha-3", t.StringType(), False),
                            t.StructField("country-code", t.IntegerType(), False),
                            t.StructField("intermediate-region", t.StringType(), False),
                            t.StructField("intermediate-region-code", t.StringType(), False),
                            t.StructField("iso-3166-2", t.StringType(), False),
                            t.StructField("name", t.StringType(), False),
                            t.StructField("region", t.StringType(), True),
                            t.StructField("region-code", t.StringType(), True),
                            t.StructField("sub-region", t.StringType(), True),
                            t.StructField("sub-region-code", t.StringType(), True),
                        ])

In [11]:
i94_df_spark =spark.read.format('com.github.saurfang.sas.spark').load(i94_data)

In [12]:
i94_df_spark.printSchema()
i94_df_spark.show(5, truncate=False)

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

### 1.5 Read Airport code data to Spark

In [13]:
#airport_schema = t.StructType([
#                            t.StructField("dt", t.StringType(), False),
#                            t.StructField("AverageTemperature", t.FloatType(), True),
#                            t.StructField("AverageTemperatureUncertainty", t.FloatType(), True),
#                            t.StructField("City", t.StringType(), False),
#                            t.StructField("Country", t.StringType(), False),
#                            t.StructField("Latitude", t.StringType(), False),
#                            t.StructField("Longitude", t.StringType(), False),
#                        ])
airport_codes_iata_df_spark = spark.read.csv(airport_codes, header=True)

In [14]:
airport_codes_iata_df_spark.printSchema()
airport_codes_iata_df_spark.show(5, truncate=False)

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)

+-----+-------------+----------------------------------+------------+---------+-----------+----------+------------+--------+---------+----------+-------------------------------------+
|ident|type         |name                              |elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|coordinates                          |
+-----+-------------+----------------------------------+------------+---------+-----------+----------+------------+--------+---------+---

### 1.6 Read ISO Country Code data to Spark

In [15]:
country_code_schema = t.StructType([
                            t.StructField("alpha-2", t.StringType(), False),
                            t.StructField("alpha-3", t.StringType(), False),
                            t.StructField("country-code", t.IntegerType(), False),
                            t.StructField("intermediate-region", t.StringType(), False),
                            t.StructField("intermediate-region-code", t.StringType(), False),
                            t.StructField("iso-3166-2", t.StringType(), False),
                            t.StructField("name", t.StringType(), False),
                            t.StructField("region", t.StringType(), True),
                            t.StructField("region-code", t.StringType(), True),
                            t.StructField("sub-region", t.StringType(), True),
                            t.StructField("sub-region-code", t.StringType(), True),
                        ])
country_codes_iso_df_spark = spark.createDataFrame(country_codes_df, schema=country_code_schema)

In [16]:
country_codes_iso_df_spark.printSchema()
country_codes_iso_df_spark.show(5, truncate=False)

root
 |-- alpha-2: string (nullable = false)
 |-- alpha-3: string (nullable = false)
 |-- country-code: integer (nullable = false)
 |-- intermediate-region: string (nullable = false)
 |-- intermediate-region-code: string (nullable = false)
 |-- iso-3166-2: string (nullable = false)
 |-- name: string (nullable = false)
 |-- region: string (nullable = true)
 |-- region-code: string (nullable = true)
 |-- sub-region: string (nullable = true)
 |-- sub-region-code: string (nullable = true)

+-------+-------+------------+-------------------+------------------------+-------------+--------------+-------+-----------+---------------+---------------+
|alpha-2|alpha-3|country-code|intermediate-region|intermediate-region-code|iso-3166-2   |name          |region |region-code|sub-region     |sub-region-code|
+-------+-------+------------+-------------------+------------------------+-------------+--------------+-------+-----------+---------------+---------------+
|AF     |AFG    |4           |        

### 1.7 Read I94 Airport code data to Spark

In [100]:
# Cleaning I94 Airport data first
ac = {"i94port_clean": [], "i94_airport_name_clean": [], "i94_state_clean": []}
codes = []
names = []
states = []
for index, row in airport_codes_i94_df.iterrows():
    y = re.sub("'", "", index)
    x = re.sub("'", "", row[0])
    z = re.sub("'", "", row[0]).split(",")
    y = y.strip()
    z[0] = z[0].strip()
    
    if len(z) == 2:
        codes.append(y)
        names.append(z[0])
        z[1] = z[1].strip()
        states.append(z[1])
    else:
        codes.append(y)
        names.append(z[0])
        states.append("NaN")

ac["i94port_clean"] = codes
ac["i94_airport_name_clean"] = names
ac["i94_state_clean"] = states

airport_codes_i94_df_clean = pd.DataFrame.from_dict(ac)

ac_path = output_data + "/airport_codes_i94_clean.csv"
airport_codes_i94_df_clean.to_csv(ac_path, sep=',')

In [101]:
airport_codes_i94_schema = t.StructType([
                            t.StructField("i94port", t.StringType(), False),
                            t.StructField("i94_airport_name", t.StringType(), False),
                            t.StructField("i94_airport_state", t.StringType(), False)
                        ])
airport_codes_i94_df_spark = spark.createDataFrame(airport_codes_i94_df_clean, schema=airport_codes_i94_schema)

In [102]:
airport_codes_i94_df_spark.printSchema()
airport_codes_i94_df_spark.show(5, truncate=False)

root
 |-- i94port: string (nullable = false)
 |-- i94_airport_name: string (nullable = false)
 |-- i94_airport_state: string (nullable = false)

+-------+------------------------+-----------------+
|i94port|i94_airport_name        |i94_airport_state|
+-------+------------------------+-----------------+
|ALC    |ALCAN                   |AK               |
|ANC    |ANCHORAGE               |AK               |
|BAR    |BAKER AAF - BAKER ISLAND|AK               |
|DAC    |DALTONS CACHE           |AK               |
|PIZ    |DEW STATION PT LAY DEW  |AK               |
+-------+------------------------+-----------------+
only showing top 5 rows



### 1.8 Read I94 Country Code data to Spark

In [86]:
# Cleaning I94 Country Code data first
cc = {"i94cit_clean": [], "i94_country_name_clean": []}
ccodes = []
cnames = []

for index, row in country_codes_i94_df.iterrows():
    x = re.sub("'", "", row[0]).strip()
    ccodes.append(index)
    cnames.append(x)

cc["i94cit_clean"] = ccodes
cc["i94_country_name_clean"] = cnames

country_codes_i94_df_clean = pd.DataFrame.from_dict(cc)

cc_path = input_data + "/country_codes_i94_clean.csv"
country_codes_i94_df_clean.to_csv(cc_path, sep=',')

In [87]:
country_codes_i94_schema = t.StructType([
                            t.StructField("i94cit", t.StringType(), False),
                            t.StructField("i94_country_name", t.StringType(), False)
                        ])
country_codes_i94_df_spark = spark.createDataFrame(country_codes_i94_df_clean, schema=country_codes_i94_schema)

In [88]:
country_codes_i94_df_spark.printSchema()
country_codes_i94_df_spark.show(5, truncate=False)

root
 |-- i94cit: string (nullable = false)
 |-- i94_country_name: string (nullable = false)

+------+----------------+
|i94cit|i94_country_name|
+------+----------------+
|582   |MEXICO          |
|236   |AFGHANISTAN     |
|101   |ALBANIA         |
|316   |ALGERIA         |
|102   |ANDORRA         |
+------+----------------+
only showing top 5 rows



### 1.9 Write Spark DataFrames to parquet files

In [23]:
start_time = datetime.now().strftime('%Y-%m-%d-%H-%M-%S-%f')
print(start_time)

2019-08-14-10-30-56-565240


In [24]:
# Write Temperature data to parquet file:
i94_df_path = output_data + "i94_staging.parquet" + "_" + start_time
print(f"OUTPUT: {i94_df_path}")
i94_df_spark.write.mode("overwrite").parquet(i94_df_path)
print("Writing DONE.")

# Read parquet file back to Spark:
i94_df_spark = spark.read.parquet(i94_df_path)

OUTPUT: data/output_data/i94_staging.parquet_2019-08-14-10-30-56-565240


In [42]:
#i94_df_spark.printSchema()
#i94_df_spark.show(5, truncate=False)

In [25]:
# Write I94 Airport data to parquet file:
airport_codes_i94_df_path = output_data + "airport_codes_i94_staging.parquet" + "_" + start_time
print(f"OUTPUT: {airport_codes_i94_df_path}")
airport_codes_i94_df_spark.write.mode("overwrite").parquet(airport_codes_i94_df_path)
print("Writing DONE.")

# Read parquet file back to Spark:
airport_codes_i94_df_spark = spark.read.parquet(airport_codes_i94_df_path)

OUTPUT: data/output_data/airport_codes_i94_staging.parquet_2019-08-14-10-30-56-565240


In [50]:
#airport_codes_i94_df_spark.printSchema()
#airport_codes_i94_df_spark.show(5, truncate=False)

In [27]:
# Write i94 Country data to parquet file:
country_codes_i94_df_path = output_data + "country_codes_i94_staging.parquet" + "_" + start_time
print(f"OUTPUT: {country_codes_i94_df_path}")
country_codes_i94_df_spark.write.mode("overwrite").parquet(country_codes_i94_df_path)
print("Writing DONE.")

# Read parquet file back to Spark:
country_codes_i94_df_spark = spark.read.parquet(country_codes_i94_df_path)

OUTPUT: data/output_data/country_codes_i94_staging.parquet_2019-08-14-10-30-56-565240


In [45]:
#country_codes_i94_df_spark.printSchema()
#country_codes_i94_df_spark.show(5, truncate=False)

In [28]:
# Write IATA Airport data to parquet file:
airport_codes_iata_df_path = output_data + "airport_codes_iata_staging.parquet" + "_" + start_time
print(f"OUTPUT: {airport_codes_iata_df_path}")
airport_codes_iata_df_spark.write.mode("overwrite").parquet(airport_codes_iata_df_path)
print("Writing DONE.")

# Read parquet file back to Spark:
airport_codes_iata_df_spark = spark.read.parquet(airport_codes_iata_df_path)

OUTPUT: data/output_data/airport_codes_iata_staging.parquet_2019-08-14-10-30-56-565240


In [63]:
#airport_codes_iata_df_spark.printSchema()
#airport_codes_iata_df_spark.show(5, truncate=False)

In [29]:
# Write ISO-3166 Country Code data to parquet file:
country_codes_iso_df_path = output_data + "country_codes_iso_staging.parquet" + "_" + start_time
print(f"OUTPUT: {country_codes_iso_df_path}")
country_codes_iso_df_spark.write.mode("overwrite").parquet(country_codes_iso_df_path)
print("Writing DONE.")

# Read parquet file back to Spark:
country_code_iso_df_spark = spark.read.parquet(country_codes_iso_df_path)

OUTPUT: data/output_data/country_codes_iso_staging.parquet_2019-08-14-10-30-56-565240


In [56]:
#country_codes_iso_df_spark.printSchema()
#country_codes_iso_df_spark.show(5, truncate=False)

--------------------
### Step 2: Explore and Assess the Data
#### Explore the Data 
_Identify data quality issues, like missing values, duplicate data, etc._

Input data has the following quality issues:
    
* I94 Immigration data: 
    * Departure date is missing from some of the records.
    * Gender column is missing some data.
    * Otherwise data seems to be clean for further processing.
    
* I94 Airport data: 
    * Data has quote marks and extra white spaces aftwer copy-paste operation.
    * Original data was cleaned-up already before importing to Spark.
    
* I94 Country code data: 
    * Data has quote marks and extra white spaces aftwer copy-paste operation.
    * Original data was cleaned-up already before importing to Spark.

* ISO3166 Country data:
    * Antarctica (row) is missing data from some columns.

#### Cleaning Steps
_Document steps necessary to clean the data_

Input data needs the following cleaning operations:
* I94 data:
    * No actions required. 
    * Missing departure info is assumed to mean that person has not left US, yet.
    * Missing gender info is handled as is.
* I94 Airport data: 
    * Remove quote marks and extra white spaces from the data.
* I94 Country Code data: 
    * Remove quote marks and extra white spaces from the data.
* ISO Country Code data:
    * No action required. Antarctica is handled as a special case to avoid duplicate data.

### 2.1 Cleaning the data

#### 2.1.1 Clean I94 Immigration data

In [30]:
# No cleaning actions. All necessary columns have clean data.
i94_df_spark.createOrReplaceTempView("immigrants_table_DF")
immigrants_table = spark.sql("""
    SELECT  cicid, i94yr, i94mon, i94cit, i94res, i94port, arrdate, \
            i94mode, airline, fltno, depdate, i94bir, i94visa, gender,  \
            visatype, admnum
    FROM immigrants_table_DF
    WHERE   cicid == null OR arrdate == null OR i94port == null \
            OR fltno == null OR i94mode == null OR admnum == null \
            OR gender == null OR admnum == null 
    ORDER BY arrdate
""")
immigrants_table.printSchema()
immigrants_table.show(20)

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- airline: string (nullable = true)
 |-- fltno: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- gender: string (nullable = true)
 |-- visatype: string (nullable = true)
 |-- admnum: double (nullable = true)

+-----+-----+------+------+------+-------+-------+-------+-------+-----+-------+------+-------+------+--------+------+
|cicid|i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|airline|fltno|depdate|i94bir|i94visa|gender|visatype|admnum|
+-----+-----+------+------+------+-------+-------+-------+-------+-----+-------+------+-------+------+--------+------+
+-----+-----+------+----

#### 2.1.2 Clean I94 Airport data

In [36]:
airport_codes_i94_df_clean.head(10)

Unnamed: 0,i94port_clean,i94_airport_name_clean,i94_state_clean
0,ALC,ALCAN,AK
1,ANC,ANCHORAGE,AK
2,BAR,BAKER AAF - BAKER ISLAND,AK
3,DAC,DALTONS CACHE,AK
4,PIZ,DEW STATION PT LAY DEW,AK
5,DTH,DUTCH HARBOR,AK
6,EGL,EAGLE,AK
7,FRB,FAIRBANKS,AK
8,HOM,HOMER,AK
9,HYD,HYDER,AK


#### 2.1.3 Clean I94 Country Code data

In [35]:
country_codes_i94_df_clean.head(10)

Unnamed: 0,i94cit_clean,i94_country_name_clean
0,582,MEXICO
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA
5,324,ANGOLA
6,529,ANGUILLA
7,518,ANTIGUA-BARBUDA
8,687,ARGENTINA
9,151,ARMENIA


#### 2.1.4 Clean ISO Country Codes data

In [67]:
country_codes_iso_df_spark.filter(country_codes_iso_df_spark.name == "Antarctica").show()

+-------+-------+------------+-------------------+------------------------+-------------+----------+------+-----------+----------+---------------+
|alpha-2|alpha-3|country-code|intermediate-region|intermediate-region-code|   iso-3166-2|      name|region|region-code|sub-region|sub-region-code|
+-------+-------+------------+-------------------+------------------------+-------------+----------+------+-----------+----------+---------------+
|     AQ|    ATA|          10|                   |                        |ISO 3166-2:AQ|Antarctica|      |           |          |               |
+-------+-------+------------+-------------------+------------------------+-------------+----------+------+-----------+----------+---------------+



In [66]:
#country_codes_df_spark.filter(country_codes_df_spark.name == "Antarctica").show()
new_country_codes_df = [["AQ", \
                        "ATA", \
                        10, \
                        "Antarctica", \
                        10, \
                        "ISO 3166-2:AQ", \
                        "Antarctica", \
                        "Antarctica", \
                        10, \
                        "Antarctica", \
                        10]]
fixed_df = spark.createDataFrame(new_country_codes_df, schema=country_code_schema)
country_code_df_cleaned = country_codes_iso_df_spark.union(fixed_df)
country_code_df_cleaned.filter(country_code_df_cleaned.name == "Antarctica").show()


+-------+-------+------------+-------------------+------------------------+-------------+----------+----------+-----------+----------+---------------+
|alpha-2|alpha-3|country-code|intermediate-region|intermediate-region-code|   iso-3166-2|      name|    region|region-code|sub-region|sub-region-code|
+-------+-------+------------+-------------------+------------------------+-------------+----------+----------+-----------+----------+---------------+
|     AQ|    ATA|          10|                   |                        |ISO 3166-2:AQ|Antarctica|          |           |          |               |
|     AQ|    ATA|          10|         Antarctica|                      10|ISO 3166-2:AQ|Antarctica|Antarctica|         10|Antarctica|             10|
+-------+-------+------------+-------------------+------------------------+-------------+----------+----------+-----------+----------+---------------+



---------
### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
_Map out the conceptual data model and explain why you chose that model_


#### 3.2 Mapping Out Data Pipelines
_List the steps necessary to pipeline the data into the chosen data model_


-----------
### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [31]:
# Write code here
start_time = datetime.now().strftime('%Y-%m-%d-%H-%M-%S-%f')
print(start_time)

2019-08-14-10-33-00-204832


#### 4.1.1 Create persons table + write to parquet file

In [32]:
# Create table
i94_df_spark = i94_df_spark.withColumn("person_id", monotonically_increasing_id())
i94_df_spark.createOrReplaceTempView("persons_table_DF")
persons_table = spark.sql("""
    SELECT  person_id AS person_id,
            admnum AS admission_nbr,
            i94res AS country_code, 
            i94bir AS age, 
            i94visa AS visa_code, 
            visatype AS visa_type, 
            gender AS gender
    FROM persons_table_DF
    ORDER BY country_code
""")
persons_table.printSchema()
persons_table.show(20)

root
 |-- person_id: long (nullable = false)
 |-- admission_nbr: double (nullable = true)
 |-- country_code: double (nullable = true)
 |-- age: double (nullable = true)
 |-- visa_code: double (nullable = true)
 |-- visa_type: string (nullable = true)
 |-- gender: string (nullable = true)

+---------+-------------+------------+----+---------+---------+------+
|person_id|admission_nbr|country_code| age|visa_code|visa_type|gender|
+---------+-------------+------------+----+---------+---------+------+
|        0| 3.46608285E8|       101.0|20.0|      3.0|       F1|     M|
|        1| 3.46627585E8|       101.0|20.0|      3.0|       F1|     M|
|        2| 3.81092385E8|       101.0|17.0|      2.0|       B2|     F|
|        3| 3.81087885E8|       101.0|45.0|      2.0|       B2|     F|
|        4| 3.81078685E8|       101.0|12.0|      2.0|       B2|     M|
|        5| 4.06155985E8|       101.0|33.0|      2.0|       B2|     M|
|        6| 4.17363085E8|       101.0|28.0|      3.0|       F1|     F|


In [33]:
# Write persons_table to parquet file:
persons_table_path = output_data + "persons_table.parquet" + "_" + start_time
print(f"OUTPUT: {persons_table_path}")
persons_table.write.mode("overwrite").parquet(persons_table_path)
print("Writing DONE.")

# Read parquet file back to Spark:
persons_table_df = spark.read.parquet(persons_table_path)

OUTPUT: data/output_data/persons_table.parquet_2019-08-14-10-33-00-204832
Writing DONE.


#### 4.1.2 Create countries table + write to parquet file

In [67]:
i94_df_spark = i94_df_spark.withColumn("i94res_str", i94_df_spark.i94res.cast(t.IntegerType()).cast(t.StringType()))
i94_df_spark.printSchema()


root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

In [68]:
i94_df_spark.head()

Row(cicid=7.0, i94yr=2016.0, i94mon=1.0, i94cit=101.0, i94res=101.0, i94port='BOS', arrdate=20465.0, i94mode=1.0, i94addr='MA', depdate=None, i94bir=20.0, i94visa=3.0, count=1.0, dtadfile=None, visapost=None, occup=None, entdepa='T', entdepd=None, entdepu=None, matflag=None, biryear=1996.0, dtaddto='D/S', gender='M', insnum=None, airline='LH', admnum=346608285.0, fltno='424', visatype='F1', person_id=0, i94res_int='101.0', i94res_str='101')

In [69]:
country_codes_i94_df_spark.printSchema()
country_codes_i94_df_spark.head()

root
 |-- i94cit: string (nullable = true)
 |-- i94_country_name: string (nullable = true)



Row(i94cit='527', i94_country_name='TURKS AND CAICOS ISLANDS')

In [70]:
country_codes_i94_df_spark.count()

289

In [71]:
#i94_df_spark_joined = i94_df_spark.join(country_codes_i94_df_spark, \
#                                        (i94_df_spark.i94res_str == country_codes_i94_df_spark.i94cit))

In [82]:
#i94_df_spark_joined.printSchema()
#i94_df_spark_joined.head()

In [73]:
i94_df_spark_joined.count()

2847924

In [80]:
# Create table
country_codes_i94_df_spark.createOrReplaceTempView("countries_table_DF")
countries_table = spark.sql("""
    SELECT  DISTINCT i94cit AS country_code, 
                     i94_country_name AS country_name
    FROM countries_table_DF AS countries
    ORDER BY country_name
""")
countries_table.printSchema()
countries_table.show(20)
countries_table.count()

root
 |-- country_code: string (nullable = true)
 |-- country_name: string (nullable = true)

+------------+---------------+
|country_code|   country_name|
+------------+---------------+
|         236|    AFGHANISTAN|
|         101|        ALBANIA|
|         316|        ALGERIA|
|         102|        ANDORRA|
|         324|         ANGOLA|
|         529|       ANGUILLA|
|         518|ANTIGUA-BARBUDA|
|         687|      ARGENTINA|
|         151|        ARMENIA|
|         532|          ARUBA|
|         438|      AUSTRALIA|
|         103|        AUSTRIA|
|         152|     AZERBAIJAN|
|         512|        BAHAMAS|
|         298|        BAHRAIN|
|         274|     BANGLADESH|
|         513|       BARBADOS|
|         153|        BELARUS|
|         104|        BELGIUM|
|         581|         BELIZE|
+------------+---------------+
only showing top 20 rows



289

In [81]:
# Write countries_table to parquet file:
countries_table_path = output_data + "countries_table.parquet" + "_" + start_time
print(f"OUTPUT: {countries_table_path}")
countries_table.write.mode("overwrite").parquet(countries_table_path)
print("Writing DONE.")

# Read parquet file back to Spark:
countries_table_df = spark.read.parquet(countries_table_path)


OUTPUT: data/output_data/countries_table.parquet_2019-08-14-10-33-00-204832
Writing DONE.


#### 4.1.3 Create airports table + write to parquet file

In [103]:

airport_codes_i94_df_spark.printSchema()
airport_codes_i94_df_spark.show(15, truncate=False)

root
 |-- i94port: string (nullable = false)
 |-- i94_airport_name: string (nullable = false)
 |-- i94_airport_state: string (nullable = false)

+-------+------------------------+-----------------+
|i94port|i94_airport_name        |i94_airport_state|
+-------+------------------------+-----------------+
|ALC    |ALCAN                   |AK               |
|ANC    |ANCHORAGE               |AK               |
|BAR    |BAKER AAF - BAKER ISLAND|AK               |
|DAC    |DALTONS CACHE           |AK               |
|PIZ    |DEW STATION PT LAY DEW  |AK               |
|DTH    |DUTCH HARBOR            |AK               |
|EGL    |EAGLE                   |AK               |
|FRB    |FAIRBANKS               |AK               |
|HOM    |HOMER                   |AK               |
|HYD    |HYDER                   |AK               |
|JUN    |JUNEAU                  |AK               |
|5KE    |KETCHIKAN               |AK               |
|KET    |KETCHIKAN               |AK               |
|MOS   

In [104]:
# Create table
airport_codes_i94_df_spark.createOrReplaceTempView("airports_table_DF")
airports_table = spark.sql("""
    SELECT DISTINCT  i94port AS airport_id, 
                     i94_airport_name AS airport_name,
                     i94_airport_state AS airport_state
    FROM airports_table_DF AS airports
    ORDER BY airport_name
""")

airports_table.printSchema()
airports_table.show(20)
airports_table.count()

root
 |-- airport_id: string (nullable = false)
 |-- airport_name: string (nullable = false)
 |-- airport_state: string (nullable = false)

+----------+--------------------+-------------+
|airport_id|        airport_name|airport_state|
+----------+--------------------+-------------+
|       ABE|            ABERDEEN|           WA|
|       ADS|ADDISON AIRPORT- ...|           TX|
|       AGA|               AGANA|           GU|
|       AGU|           AGUADILLA|           PR|
|       BOI|AIR TERM. (GOWEN ...|           ID|
|       CAK|               AKRON|           OH|
|       AKR|               AKRON|           OH|
|       ALA|          ALAMAGORDO|     NM (BPS)|
|       ALB|              ALBANY|           NY|
|       CHO|ALBEMARLE CHARLOT...|           VA|
|       ABQ|         ALBUQUERQUE|           NM|
|       ABG|              ALBURG|           VT|
|       ABS|      ALBURG SPRINGS|           VT|
|       ALC|               ALCAN|           AK|
|       AXB|      ALEXANDRIA BAY|           

660

In [105]:
airports_table.head()

Row(airport_id='ABE', airport_name='ABERDEEN', airport_state='WA')

In [106]:
# Write airports_table to parquet file:
airports_table_path = output_data + "airports_table.parquet" + "_" + start_time
print(f"OUTPUT: {airports_table_path}")
airports_table.write.mode("overwrite").parquet(airports_table_path)
print("Writing DONE.")

# Read parquet file back to Spark:
airports_table_df = spark.read.parquet(airports_table_path)

OUTPUT: data/output_data/airports_table.parquet_2019-08-14-10-33-00-204832
Writing DONE.


#### 4.1.4 Create time table + write to parquet file

#### 4.1.5 Create immigrations table + write to parquet file

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.