# Project Title
### Data Engineering Capstone Project

#### Project Summary
The project aims to build a cloud-based data warehouse in Redshift so that the US National Tourism and Trade Office can analyse international visitor statistics. The office would like to understand how visitors are distributed across airports in the US and how the number of visitors changes seasonally. This would enable it to assign an appropriate number of personnel so that visitor entry is handled efficiently.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd  
import configparser
from datetime import datetime
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
from pyspark.sql import functions as F

In [2]:
# Read AWS config to get access id
config = configparser.ConfigParser()

config.read('dw.cfg')

os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']
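
The cell above implies that `dw.cfg` contains an `[AWS]` section with the two keys it reads. A minimal sketch of that layout is shown below, parsed from an in-memory string. The values are placeholders rather than real credentials, and `SAMPLE_CFG`, `sample_config` and `missing_keys` are names introduced only for illustration.

```python
import configparser

# Hypothetical contents of dw.cfg: the section and key names come from the
# cell above, the values are placeholders.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = <your-access-key-id>
AWS_SECRET_ACCESS_KEY = <your-secret-access-key>
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_CFG)

# Collect any required key that is absent so the problem surfaces before Spark starts.
required_keys = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
missing_keys = [k for k in required_keys if k not in sample_config["AWS"]]
```

Checking for missing keys up front gives a clearer failure than a `KeyError` raised halfway through the session setup.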

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

The US Tourism and Trade Office would like to understand the following visitor statistics using the I-94 visitor arrival data.

1. Visitor arrival by month and year

2. Visitors by arrival City and by date

3. Visitors by Country and date
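
A minimal sketch of how these three counts could be derived from the I-94 columns `i94yr`, `i94mon`, `i94cit`, `i94port` and `arrdate`. It assumes `arrdate` is a SAS date, that is, a number of days since 1960-01-01. The rows are a tiny illustrative sample, not the full data set, and plain Python is used so the example runs without a Spark session.

```python
from collections import Counter
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS dates count days from this day


def sas_to_date(sas_days):
    """Convert a SAS numeric date (days since 1960-01-01) to a Python date."""
    return SAS_EPOCH + timedelta(days=int(sas_days))


# Illustrative rows only: (i94yr, i94mon, i94cit, i94port, arrdate)
rows = [
    (2016.0, 4.0, 692.0, "XXX", 20573.0),
    (2016.0, 4.0, 254.0, "ATL", 20551.0),
    (2016.0, 4.0, 254.0, "ATL", 20551.0),  # made-up repeat so one count exceeds 1
]

# 1. Visitor arrivals by year and month
by_month = Counter((int(yr), int(mon)) for yr, mon, _, _, _ in rows)
# 2. Visitors by arrival port and date
by_port_date = Counter((port, sas_to_date(arr)) for _, _, _, port, arr in rows)
# 3. Visitors by country code and date
by_country_date = Counter((int(cit), sas_to_date(arr)) for _, _, cit, _, arr in rows)
```

In the warehouse the same groupings would be expressed as `GROUP BY` queries over the fact table; the sketch only shows which columns drive each statistic.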

The proposed solution uses AWS to stage and load data. Using AWS cloud services allows the solution to scale as the data set grows. AWS also supports a variety of technologies, such as S3, Redshift, EMR and Cassandra, that are suited to handling massive amounts of data and to providing fast response times for end-user queries.

Initial data analysis is done locally using PySpark.

The data will be loaded into fact and dimension tables in an Amazon Redshift data warehouse hosted in the cloud. Data is first copied to Amazon S3. Using S3 allows the staging area to scale when very large (multi-gigabyte) files need to be processed, and the storage can be scaled back once processing is done to save cost.

Data will first be loaded into staging tables in Amazon Redshift using COPY statements. Data cleansing can then be performed and the data loaded into fact and dimension tables using INSERT statements. The same SQL queries can be reused to build ETL pipelines at a later stage if required.
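
The staging-then-insert flow described above might look like the following sketch. The table names (`staging_i94`, `fact_visits`), the column selection and the IAM role ARN are hypothetical placeholders. The S3 location is the one this notebook writes to, written with the `s3://` scheme that Redshift expects. The helpers only build SQL strings, so no database connection is needed to try them.

```python
def build_copy_sql(table, s3_path, iam_role_arn):
    """Return a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS PARQUET;"
    )


def build_insert_sql(target, source, columns):
    """Return an INSERT ... SELECT DISTINCT that moves rows from staging to a target table."""
    cols = ", ".join(columns)
    return f"INSERT INTO {target} ({cols})\nSELECT DISTINCT {cols}\nFROM {source};"


copy_sql = build_copy_sql(
    "staging_i94",  # hypothetical staging table
    "s3://ctsprojbucket/i94data-parquet/i94_visitors",
    "arn:aws:iam::<account-id>:role/<redshift-role>",  # placeholder, not a real role
)
insert_sql = build_insert_sql(
    "fact_visits",  # hypothetical fact table
    "staging_i94",
    ["i94yr", "i94mon", "i94port", "arrdate"],
)
```

Keeping the statements as plain strings means the same SQL can later be handed to a scheduler or ETL tool unchanged.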


#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

The project uses the following Udacity supplied data sets to build the data warehouse.

1. I94 immigration SAS data files.
   This data is provided by the US travel department and is available at https://travel.trade.gov/research/reports/i94/historical/2016.html
2. Country codes in CSV format, created from the I94_SAS_Labels_description dictionary
3. Arrival ports (city and state) in pipe-delimited format, created from the I94_SAS_Labels_description dictionary

In [2]:
# Read in the data here
i94_fname = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
df_i94 = pd.read_sas(i94_fname, 'sas7bdat', encoding="ISO-8859-1")

cc_fname = 'countrycodes.csv'
df_cc = pd.read_csv(cc_fname)

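arrival_port_fname = 'arrivalports.csv'
# The arrival port file is described above as pipe-delimited, so set the separator explicitly
df_ap = pd.read_csv(arrival_port_fname, sep='|')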

In [1]:
# Preview the I94 data loaded above (the DataFrame is named df_i94, not df)
df_i94.head()

In [3]:
	
from pyspark.sql import SparkSession

# Build a Spark session with the SAS reader and S3A support.
# The S3A connector takes its credentials from fs.s3a.access.key / fs.s3a.secret.key.
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11,org.apache.hadoop:hadoop-aws:2.7.1")
    .config("spark.hadoop.fs.s3a.access.key", os.environ['AWS_ACCESS_KEY_ID'])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ['AWS_SECRET_ACCESS_KEY'])
    .enableHiveSupport()
    .getOrCreate()
)
df_spark_i94 = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [4]:
df_spark_i94.count()

3096313

In [5]:
#write to parquet
df_spark_i94.write.partitionBy("i94yr", "i94mon").parquet("sas_parquet/")

In [6]:
# Write to AWS S3, replacing any existing output at this path
df_spark_i94.write.parquet("s3a://ctsprojbucket/i94data-parquet/i94_visitors", mode="overwrite")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [13]:
# Performing cleaning tasks here
## Check for duplicates
df_spark_i94.createOrReplaceTempView("visitor_i94_table")

In [17]:
i94_table = spark.sql("""
    select distinct i94yr, i94mon, i94cit, i94port, arrdate, i94mode, i94addr, depdate,
           i94bir, i94visa, visapost, biryear, gender, airline, fltno, visatype
    from visitor_i94_table
""")
print("i94 table count:", i94_table.count())

i94 table count: 2933899
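
Comparing this figure with the full row count reported earlier (3096313) gives a rough sense of how many rows repeat on the selected columns. The worked calculation below uses only those two counts. Note that the `distinct` query leaves out identifier columns such as `cicid` and `admnum`, so the difference counts rows that coincide on the 16 selected columns, which are not necessarily true duplicate records.

```python
total_rows = 3096313     # df_spark_i94.count()
distinct_rows = 2933899  # i94_table.count()

# Rows that are not distinct on the selected columns
repeated_rows = total_rows - distinct_rows
repeated_share = repeated_rows / total_rows

print(f"{repeated_rows} rows ({repeated_share:.2%}) repeat on the selected columns")
```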


In [7]:
df_spark_i94.show()

+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|        admnum|fltno|visatype|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|  6.0|2016.0|   4.0| 692.0| 692.0|    XXX|20573.0|   null|   null|   null|  37.0|    2.0|  1.0|    null|    null| null|      T|   null|      U|   null| 1979.0|10282016|  null|  null|   null| 1.897628485E9| null|      B2|
|  7.0|2016.0|   4.0| 254.0| 276.0|    ATL|20551.0|    1.0|     AL|   null|  25.0|    3.0|  1.0|20130811|     SE

In [19]:
# Airport data: keep US airports that have an IATA code
airport_df = spark.read.csv('./airport-codes_csv.csv', header=True)
not_null_iata_in_us_df = airport_df.where("iso_country = 'US' and iata_code is not null")
not_null_iata_in_us_df.show()


+-----+-------------+--------------------+------------+---------+-----------+----------+---------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|   municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+---------------+--------+---------+----------+--------------------+
| 07FA|small_airport|Ocean Reef Club A...|           8|       NA|         US|     US-FL|      Key Largo|    07FA|      OCA|      07FA|-80.274803161621,...|
|  0AK|small_airport|Pilot Station Air...|         305|       NA|         US|     US-AK|  Pilot Station|    null|      PQS|       0AK|-162.899994, 61.9...|
| 0CO2|small_airport|Crested Butte Air...|        8980|       NA|         US|     US-CO|  Crested Butte|    0CO2|      CSE|      0CO2|-106.928341, 38.8...|
| 0TE7|small_airport|   LBJ Ranch Airport|        1515|       NA

In [26]:
# US cities state demographics data
us_city_demo_df = spark.read.csv('us-cities-demographics.csv',sep=';',header=True)
us_city_demo_df


DataFrame[City: string, State: string, Median Age: string, Male Population: string, Female Population: string, Total Population: string, Number of Veterans: string, Foreign-born: string, Average Household Size: string, State Code: string, Race: string, Count: string]

In [29]:
us_city_demo_df.createOrReplaceTempView("us_city_state_table")
# Column names containing spaces need backticks; single quotes would make them string literals.
us_state_table = spark.sql("""
    select `State Code` as state_cd, State, sum(cast(`Total Population` as bigint)) as total_popln
    from us_city_state_table
    group by `State Code`, State
""")
us_state_table.count()


49

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
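
As one possible shape for the count and unique-key checks listed above, the sketch below runs them against an in-memory SQLite database standing in for the warehouse. The table name `fact_visits`, its columns and the two helper functions are hypothetical. Against Redshift, the same SQL would be issued through whatever client the pipeline uses.

```python
import sqlite3


def check_row_count(conn, table, minimum=1):
    """Completeness check: the table must hold at least `minimum` rows."""
    (n,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n >= minimum


def check_unique(conn, table, key):
    """Integrity check: no value of `key` may occur more than once."""
    dupes = conn.execute(
        f"SELECT {key} FROM {table} GROUP BY {key} HAVING COUNT(*) > 1"
    ).fetchall()
    return len(dupes) == 0


# Stand-in warehouse with a hypothetical fact table and two sample rows
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_visits (cicid INTEGER, i94port TEXT)")
conn.executemany("INSERT INTO fact_visits VALUES (?, ?)", [(6, "XXX"), (7, "ATL")])

results = {
    "fact_visits has rows": check_row_count(conn, "fact_visits"),
    "cicid is unique": check_unique(conn, "fact_visits", "cicid"),
}
```

Each check returns a boolean, so a pipeline step can fail as soon as any entry in `results` is `False`.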
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
  * The database needed to be accessed by 100+ people.