# Project Title
### Data Engineering Capstone Project

#### Project Summary
The project aims to build a cloud-based data warehouse in Redshift for analysing international visitor statistics by the US National Tourism and Trade Office. They would like to understand the distribution of visitors across the airports in the US and the seasonal change in the number of visitors. This would enable them to assign an appropriate number of personnel so that visitor entry is handled efficiently.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd  
import configparser
from datetime import datetime
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
from pyspark.sql import functions as F

In [2]:
# Read AWS config to get access id
config = configparser.ConfigParser()

config.read('dw.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

The US Tourism and Trade Office would like to understand the following visitor statistics using the I-94 visitor arrival data.

1. Visitor arrival by month and year

2. Visitors by arrival City and by date

3. Visitors by Country and date
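
Each of the three questions above reduces to a grouped count over the arrival records. The following is a minimal plain-Python sketch of that idea, using a handful of rows copied from the sample output shown later in this notebook. The `sample` list and the result names are illustrative only, and month stands in for date to keep the example short; the real aggregation would run as SQL in Redshift.

```python
from collections import Counter

# A few (i94yr, i94mon, i94cit, i94port) tuples in the shape of the I-94 data.
sample = [
    (2016, 4, 101, "NYC"),
    (2016, 4, 103, "NYC"),
    (2016, 3, 103, "NYC"),
    (2016, 4, 103, "LOS"),
    (2016, 3, 107, "LOS"),
]

# 1. Visitor arrivals by year and month
by_month = Counter((yr, mon) for yr, mon, _cit, _port in sample)

# 2. Visitors by arrival port, year and month
by_port = Counter((port, yr, mon) for yr, mon, _cit, port in sample)

# 3. Visitors by country code, year and month
by_country = Counter((cit, yr, mon) for yr, mon, cit, _port in sample)

for key, visitors in sorted(by_month.items()):
    print(key, visitors)
```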

The proposed solution is to use AWS to stage and load data. The use of AWS cloud technology enables linear scalability as the data set grows. AWS also supports a variety of technologies such as S3, Redshift, EMR and Cassandra that are suitable for handling massive amounts of data and for providing fast response times to end-user queries.

Initial data analysis is done locally using PySpark.

The data will be loaded into fact and dimension tables in an Amazon Redshift data warehouse hosted in the cloud. Data is first copied to Amazon S3. Using S3 makes it possible to scale the data staging area when very large, multi-gigabyte files need to be processed. The storage can also be scaled back once the processing is done to save cost.

Data will first be loaded into staging tables in Amazon Redshift using COPY statements. Following this, data cleansing can be performed and the data loaded into fact and dimension tables using INSERT statements. The same SQL queries can be used to create ETL pipelines if required at a later stage.
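
The two-step load described above can be sketched as a pair of SQL statements rendered from Python. This is only an illustration under stated assumptions: the staging table name `staging_i94visitors`, the IAM role ARN, and the helper function names are hypothetical, and the S3 path uses the `s3://` scheme that Redshift expects rather than the `s3a://` scheme used by Spark.

```python
from typing import Sequence


def copy_statement(table: str, s3_path: str, iam_role: str) -> str:
    """Render a Redshift COPY that loads Parquet files from S3 into a staging table."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS PARQUET;"
    )


def insert_statement(target: str, staging: str, columns: Sequence[str], key: str) -> str:
    """Render an INSERT that de-duplicates staged rows and drops rows with no key."""
    cols = ", ".join(columns)
    return (
        f"INSERT INTO {target} ({cols})\n"
        f"SELECT DISTINCT {cols}\n"
        f"FROM {staging}\n"
        f"WHERE {key} IS NOT NULL;"
    )


copy_sql = copy_statement(
    table="staging_i94visitors",  # hypothetical staging table name
    s3_path="s3://ctsprojbucket/i94visitorstest/",
    iam_role="arn:aws:iam::123456789012:role/example-redshift-role",  # placeholder ARN
)
insert_sql = insert_statement(
    target="i94visitors_fact",
    staging="staging_i94visitors",
    columns=["cicid", "i94yr", "i94mon", "i94cit", "i94port", "arrdate"],
    key="cicid",
)
print(copy_sql)
print(insert_sql)
```

Keeping the cleansing logic in the INSERT means the same statements can later be dropped into a scheduled pipeline without change.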


#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

The project uses the following Udacity supplied data sets to build the data warehouse.

1. I94 immigration SAS data files
   This data is provided by the US travel department and is available here - https://travel.trade.gov/research/reports/i94/historical/2016.html
2. Country codes data in csv data format created using I94_SAS_Labels_description dictionary
3. Arrival port - City and State data in pipe delimited format created using I94_SAS_Labels_description dictionary

In [2]:
# Read in the data here
i94_fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
df_i94 = pd.read_sas(i94_fname, 'sas7bdat', encoding="ISO-8859-1")
df_i94.head()

In [3]:
# Initialize Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11,org.apache.hadoop:hadoop-aws:2.7.1").\
config("spark.hadoop.fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID']).\
config("spark.hadoop.fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY']).\
enableHiveSupport().getOrCreate()
    

In [4]:
# Read i94 sas data for 2 months - Mar 2016 and Apr 2016
sas_files=["../../data/18-83510-I94-Data-2016/i94_mar16_sub.sas7bdat","../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat"]
for sasfile in sas_files:
    df_spark_i94 = spark.read.format('com.github.saurfang.sas.spark').load(sasfile)
    print(f"Record count for {sasfile} is ", df_spark_i94.count())
    df_spark_i94.write.mode("append").parquet("sas_parquet/")
   

Record count for ../../data/18-83510-I94-Data-2016/i94_mar16_sub.sas7bdat is  3157072
Record count for ../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat is  3096313


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data
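
As a rough illustration of the two checks named above (missing values and duplicates), the sketch below runs plain SQL against an in-memory SQLite table. SQLite is only a stand-in here so the example is self-contained; the same queries could be passed to `spark.sql` against a temp view. The rows are adapted from the sample output in this notebook, and the repeated last row is added deliberately for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visitor_i94_table (cicid REAL, i94port TEXT, i94addr TEXT, gender TEXT)"
)
rows = [
    (48.0, "NYC", "NY", "M"),
    (270.0, "NYC", "NY", "M"),
    (746.0, "LOS", None, None),
    (1324.0, "LOS", "CA", None),
    (1324.0, "LOS", "CA", None),  # deliberately repeated row (not in the source data)
]
conn.executemany("INSERT INTO visitor_i94_table VALUES (?, ?, ?, ?)", rows)

# COUNT(*) counts all rows while COUNT(col) skips NULLs,
# so the difference is the number of missing values in that column.
null_counts = conn.execute(
    "SELECT COUNT(*) - COUNT(i94addr), COUNT(*) - COUNT(gender) FROM visitor_i94_table"
).fetchone()

# Any cicid that appears more than once is a candidate duplicate.
duplicate_ids = conn.execute(
    "SELECT cicid, COUNT(*) FROM visitor_i94_table GROUP BY cicid HAVING COUNT(*) > 1"
).fetchall()

print("null i94addr, null gender:", null_counts)
print("duplicated cicid values:", duplicate_ids)
```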

In [5]:
# Performing cleaning tasks here
# Read parquet data
df_all_i94 = spark.read.parquet("sas_parquet/")
df_all_i94.count()

6253385

In [6]:
# Clean i94 data
## Get distinct values for the columns selected - remove duplicates
df_all_i94.createOrReplaceTempView("visitor_i94_table")
i94_table = spark.sql("select distinct cicid, i94yr, i94mon,i94cit,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,visapost, \
gender,airline,fltno,visatype from visitor_i94_table limit 100000")
print("i94 table count:",i94_table.count())

i94 table count: 100000


In [7]:
i94_table.show()
#Write to AWS S3 
#i94_table.write.partitionBy("i94yr", "i94mon").mode("overwrite").parquet("s3a://ctsprojbucket/i94visitors/")
i94_table.write.mode("overwrite").parquet("s3a://ctsprojbucket/i94visitorstest/")

+------+------+------+------+-------+-------+-------+-------+-------+------+-------+--------+------+-------+-----+--------+
| cicid| i94yr|i94mon|i94cit|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|visapost|gender|airline|fltno|visatype|
+------+------+------+------+-------+-------+-------+-------+-------+------+-------+--------+------+-------+-----+--------+
|  48.0|2016.0|   4.0| 101.0|    NYC|20545.0|    1.0|     NY|20572.0|  68.0|    2.0|     FLR|     M|     AA|00199|      B2|
| 270.0|2016.0|   4.0| 103.0|    NYC|20545.0|    1.0|     NY|20560.0|  33.0|    2.0|    null|     M|     LH|00410|      WT|
| 424.0|2016.0|   3.0| 103.0|    NYC|20514.0|    1.0|     NY|20574.0|  37.0|    2.0|     VNN|     M|     OS|00087|      B2|
| 746.0|2016.0|   4.0| 103.0|    LOS|20545.0|    1.0|   null|20546.0|  12.0|    2.0|    null|  null|     AS|00291|      WT|
|1324.0|2016.0|   3.0| 107.0|    LOS|20514.0|    1.0|     CA|20538.0|  33.0|    2.0|    null|  null|     LH|00456|      B2|
|1372.0|

In [4]:
# Gather data from prepared airports file
airport_local_df=spark.read.csv('./airport.csv', header=True, sep='|')
airport_local_df.createOrReplaceTempView("local_airport_table")

#Clean airport data from supplied airports file
airport_df = spark.read.csv('./airport-codes_csv.csv', header=True)
airport_df.createOrReplaceTempView("airport_table")
orig_airport_df = spark.sql("SELECT * from airport_table")
print("Original airport count ",orig_airport_df.count())
clean_airport_df = spark.sql("SELECT distinct iata_code,ident,type,name,continent,iso_country,iso_region,\
                               municipality,gps_code,local_code,coordinates \
                               from airport_table where iso_country = 'US' and iata_code is not null\
                               union all \
                              select iata_code,null,null,null,null,null,null,municipality,null,null,null from local_airport_table\
                              where iata_code not in (select distinct iata_code from airport_table where iata_code is not null)\
                              ")

# Compare the cleaned row count with the original count
print("Clean airport count ",clean_airport_df.count())
clean_airport_df.show()



Original airport count  55075
Clean airport count  2133
+---------+-----+--------------+--------------------+---------+-----------+----------+------------------+--------+----------+--------------------+
|iata_code|ident|          type|                name|continent|iso_country|iso_region|      municipality|gps_code|local_code|         coordinates|
+---------+-----+--------------+--------------------+---------+-----------+----------+------------------+--------+----------+--------------------+
|      KCG|  KCG|        closed|Chignik Fisheries...|       NA|         US|     US-AK|           Chignik|    null|      null|-158.589706, 56.3...|
|      HLR| KHLR|medium_airport| Hood Army Air Field|       NA|         US|     US-TX|Fort Hood(Killeen)|    KHLR|       HLR|-97.7145004272000...|
|      LMS| KLMS| small_airport|Louisville Winsto...|       NA|         US|     US-MS|        Louisville|    KLMS|       LMS|-89.0625, 33.1461...|
|      PAO| KPAO| small_airport|Palo Alto Airport...|       NA

In [None]:
# Write the cleaned airport data to AWS S3
clean_airport_df.write.mode("overwrite").parquet("s3a://ctsprojbucket/airports/")


In [None]:
# Read locally prepared state file
local_state_df = spark.read.csv('states.txt',header=True, sep='|')
local_state_df.createOrReplaceTempView("local_state_table")
local_state_df.show()


In [None]:
# US cities state demographics data
us_city_demo_df = spark.read.csv('us-cities-demographics.csv',sep=';',header=True)
# Register the demographics data as a temp view for the SQL below
us_city_demo_df.createOrReplaceTempView("us_city_state_table")
us_state_table = spark.sql("select `State Code` as state_cd,State,sum(`Total Population`) as total_popln,\
                            sum(`Foreign-born`) as foreign_born_popln\
                           from us_city_state_table\
                           group by `State Code`,State\
                           union all\
                           select state_code,state,null,null\
                           from local_state_table \
                           where state_code not in (select distinct `State Code` from us_city_state_table)\
                           ")
print("State count is ",us_state_table.count())
us_state_table.show()

# Write to local parquet first, since the file is small and writing directly to S3 is very slow
us_state_table.write.mode("overwrite").parquet("state_parquet/")


In [16]:
# Upload to S3
df_states = spark.read.parquet("state_parquet/")
df_states.write.mode("overwrite").parquet("s3a://ctsprojbucket/states/")

In [10]:
# Write countries to S3
df_cc = spark.read.csv('countrycodes.csv',header=False,sep='|')
print ("Country code count is ",df_cc.count())
df_cc.write.mode("overwrite").parquet("s3a://ctsprojbucket/countries/")

Country code count is  288


### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

The data model consists of the following fact and dimension tables.

##### Fact table - i94visitors_fact
This fact table contains all the detailed i94 visitor data from the SAS files. The source data is loaded into a staging table. ETL is performed using SQL as data is loaded into the fact table.
The fact table contains most of the columns from the source SAS files so no detail is lost. All analytical queries can be run on this fact table. The data model can be enhanced by building summary fact tables to speed up queries if required.

##### airports_dim
This dimension table is loaded from the airport-codes csv file after removing non-US airports. The airports dimension provides details about each of the airports in the US and can help dissect the fact table by airport.

##### states_dim
The states dimension provides state level demographics such as total population and foreign born population and can provide insights into number of visitors arriving in a state and its foreign born population. 

##### dates_dim
The dates dimension is useful for summarizing the visitors data by arrival dates, month, year, etc. This will provide visibility into how the volume of visitors changes over the course of the year and months within that year.
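
The `arrdate` values in the sample output (for example `20545.0`) appear to be SAS date values, which count days from 1 January 1960. Treating them that way is an assumption based on the SAS source format. Under that assumption, a row for the dates dimension could be derived as in the sketch below; the function names and the set of output columns are hypothetical, not the actual `dates_dim` definition.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date


def sas_to_date(sas_days: float) -> date:
    """Convert a SAS numeric date (days since 1960-01-01) to a calendar date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


def date_dim_row(sas_days: float) -> dict:
    """Build one illustrative dates-dimension row from a SAS date value."""
    d = sas_to_date(sas_days)
    return {
        "arrdate": int(sas_days),
        "date": d.isoformat(),
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "week": d.isocalendar()[1],
    }


row = date_dim_row(20545.0)
print(row)
```

Here `sas_to_date(20545.0)` gives 2016-04-01, which is consistent with the `i94mon` value of 4 on the same sample rows.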


##### countries_dim
The countries dimension lists the countries of origin for the visitors. Data in this dimension can be enhanced further as required. This dimension helps in summarizing the data based on the visitors' countries and can be used to check where people are visiting from.


The data model would provide the US travel department with insights such as where people are arriving from and during which season. Although data for all air travellers is not captured here, it would still help various US travel departments to better provision their airport resources and personnel, and help the states involved with future capacity planning and upgrades of their airports. The US embassies in various countries can also adjust their staff and capacity based on this data.


#### 3.2 Mapping Out Data Pipelines

Data is first staged in Amazon S3 in parquet format. The data is then loaded into fact and dimension tables in Amazon Redshift for analysis by end users. Apache Airflow is used to build the data pipeline. The stage-to-Redshift operator is reused for every load, which eliminates redundant code. Any ETL while loading data is performed by SQL queries. The Airflow pipeline is called capstone dag and is available under the project submission folders.
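
The ordering implied by this pipeline (stage, then load, then check) can be sketched without Airflow as a small dependency graph. This is not the capstone DAG itself: the task names are hypothetical, the assumption that the dates dimension is derived from the staged visitor data is mine, and `graphlib` from the Python standard library (3.9+) is used only to show one valid execution order.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# Task names are hypothetical, not the ones used in the capstone DAG.
dependencies = {
    "stage_i94visitors": set(),
    "stage_airports": set(),
    "stage_states": set(),
    "stage_countries": set(),
    "load_airports_dim": {"stage_airports"},
    "load_states_dim": {"stage_states"},
    "load_countries_dim": {"stage_countries"},
    "load_dates_dim": {"stage_i94visitors"},  # assumed to derive from arrival dates
    "load_i94visitors_fact": {"stage_i94visitors"},
    "run_quality_checks": {
        "load_airports_dim",
        "load_states_dim",
        "load_countries_dim",
        "load_dates_dim",
        "load_i94visitors_fact",
    },
}

# static_order() yields every task after all of its prerequisites.
run_order = list(TopologicalSorter(dependencies).static_order())
for task in run_order:
    print(task)
```

Because every staging task has the same shape (an S3 prefix in, a staging table out), a single parametrized operator can serve all four of them.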



### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:

# The data pipeline for the ETL is implemented using Apache Airflow
# The DAGs are available under the dags folder.


#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
# Data quality
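
One possible source/count check, sketched here in plain Python, reconciles the per-file record counts printed earlier in this notebook with the combined parquet count. The `CountCheck` class and `run_checks` function are hypothetical helpers for illustration; in the pipeline itself the actual counts would come from queries against S3 or Redshift rather than from hard-coded numbers.

```python
from dataclasses import dataclass


@dataclass
class CountCheck:
    name: str
    expected: int
    actual: int

    def passed(self) -> bool:
        return self.expected == self.actual


def run_checks(checks):
    """Raise if any check fails; otherwise return the number of checks run."""
    failures = [c for c in checks if not c.passed()]
    if failures:
        details = ", ".join(
            f"{c.name}: expected {c.expected}, got {c.actual}" for c in failures
        )
        raise ValueError(f"Data quality check failed: {details}")
    return len(checks)


# Record counts reported when the SAS files were read
source_counts = {"i94_mar16_sub": 3157072, "i94_apr16_sub": 3096313}
# Row count reported after reading the combined parquet output
parquet_count = 6253385

checks = [
    CountCheck("sas_parquet row count", sum(source_counts.values()), parquet_count),
]
passed = run_checks(checks)
print(f"{passed} check(s) passed")
```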

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.