# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
#import libraries
import configparser
import re

import boto3
import pandas as pd

<h3>2 - Questions</h3>

<ol> 
    <li>Which cities do immigrants tend to move to, and where do they come from?</li>
    <li>Does temperature play a role in where people on temporary visas go?</li>
</ol>

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

The data sets used for this project are airport codes, immigration data, US cities demographics data, and temperature data.
<h4>Airport Codes</h4>
Airport codes may refer to either the IATA airport code, a three-letter code used in passenger reservation, ticketing, and baggage-handling systems, or the ICAO airport code, a four-letter code used by ATC systems and for airports that do not have an IATA airport code. The data was provided by Udacity, which obtained it from <a href="https://datahub.io/core/airport-codes#data">Data Hub</a>.
<h4>Immigration Data</h4>
The data comes from the US National Tourism and Trade Office and was provided by Udacity. A data dictionary is provided in the file I94_SAS_Labels_Descriptions.SAS.
The data set was taken from <a href="https://travel.trade.gov/research/reports/i94/historical/2016.html">this link</a>.
The dataset can be previewed in the immigration_data_sample.csv file. The full dataset consists of several SAS files, which are located in the SAS_data folder.
<h4>Temperature Data</h4>
The dataset is made available under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons BY-NC-SA 4.0</a> license.
<h4>US Cities Demographics</h4>
The data comes from <a href="https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/">OpenDataSoft</a>. The dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. This data comes from the US Census Bureau's 2015 American Community Survey and is <a href="https://www.census.gov/data/developers/about/terms-of-service.html">referenced in this link</a>.

In [2]:
#Get AWS credentials
config = configparser.ConfigParser()
config.read('dwh.cfg')
AWS_ACCESS_KEY_ID = config.get('AWS', 'AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = config.get('AWS', 'AWS_SECRET_ACCESS_KEY')
s3 = boto3.resource(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

In [3]:
# Show all the columns for the datasets
pd.set_option('display.max_columns', 30)

<h4>Airport Codes</h4>

In [4]:
#Read airport codes csv and preview the data
data_airport_codes = pd.read_csv('airport-codes_csv.csv')
data_airport_codes.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [5]:
#get shape of dataset
data_airport_codes.shape

(55075, 12)

<h4>Immigration Data</h4>

In [6]:
#Read immigration data csv and preview the data
data_immigration = pd.read_csv('immigration_data_sample.csv')
data_immigration.head()

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,20573.0,61.0,2.0,1.0,20160422,,,G,O,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,20568.0,26.0,2.0,1.0,20160423,MTR,,G,R,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,20571.0,76.0,2.0,1.0,20160407,,,G,O,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,20581.0,25.0,2.0,1.0,20160428,DOH,,G,O,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,20553.0,19.0,2.0,1.0,20160406,,,Z,K,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


In [7]:
#Read in SAS files
imm_fname = '../../data/18-83510-I94-Data-2016/i94_dec16_sub.sas7bdat'
data_imm_dec = pd.read_sas(imm_fname, 'sas7bdat', encoding="ISO-8859-1")

In [8]:
#preview the data
data_imm_dec.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,46.0,2016.0,12.0,129.0,129.0,HOU,20789.0,1.0,TX,20802.0,46.0,1.0,1.0,20161201,MDD,,H,O,,M,1970.0,05262018,M,,RS,97554140000.0,7715.0,E2
1,56.0,2016.0,12.0,245.0,245.0,NEW,20789.0,1.0,OH,20835.0,28.0,3.0,1.0,20161201,BEJ,,U,O,,M,1988.0,D/S,F,,CA,90623720000.0,819.0,F1
2,67.0,2016.0,12.0,512.0,512.0,PEV,20789.0,2.0,MD,20794.0,48.0,2.0,1.0,20161201,NAS,,A,D,,M,1968.0,06012017,M,5920.0,,80105030000.0,,B2
3,68.0,2016.0,12.0,512.0,512.0,PEV,20789.0,2.0,FL,20792.0,46.0,2.0,1.0,20161201,NAS,,A,D,,M,1970.0,06012017,F,5920.0,,80105110000.0,,B2
4,69.0,2016.0,12.0,512.0,512.0,PEV,20789.0,2.0,HI,20792.0,48.0,2.0,1.0,20161201,NAS,,A,D,,M,1968.0,06012017,M,5920.0,,80105110000.0,,B2


In [9]:
#get shape of dataset
data_imm_dec.shape

(3432990, 28)

In [10]:
#Read in SAS files
imm_fname = '../../data/18-83510-I94-Data-2016/i94_jul16_sub.sas7bdat'
data_imm_july = pd.read_sas(imm_fname, 'sas7bdat', encoding="ISO-8859-1")

In [11]:
#preview the data
data_imm_july.head()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,1.0,2016.0,7.0,254.0,276.0,LOS,20636.0,1.0,CA,20640.0,38.0,2.0,1.0,20160701,,,G,O,,M,1978.0,9282016,M,,OZ,63092900000.0,202,WT
1,2.0,2016.0,7.0,140.0,140.0,NYC,20636.0,1.0,NY,20657.0,45.0,2.0,1.0,20160701,,,G,O,,M,1971.0,9282016,F,,DL,63092900000.0,9858,WT
2,3.0,2016.0,7.0,135.0,135.0,ORL,20636.0,1.0,FL,20657.0,10.0,2.0,1.0,20160701,,,G,O,,M,2006.0,9282016,M,,VS,63092900000.0,71,WT
3,4.0,2016.0,7.0,124.0,124.0,TAM,20636.0,1.0,FL,20645.0,17.0,2.0,1.0,20160701,,,G,O,,M,1999.0,9282016,M,,LH,63092900000.0,482,WT
4,5.0,2016.0,7.0,130.0,130.0,LOS,20636.0,1.0,CA,20662.0,1.0,2.0,1.0,20160701,,,G,K,,M,2015.0,9282016,M,,SU,63092900000.0,106,WT


In [12]:
#get shape of dataset
data_imm_july.shape

(4265031, 28)

<h4>Combine Immigration Data Sets</h4>

In [13]:
#concatenate dataframes
data_imm = pd.concat([data_imm_dec, data_imm_july])

In [None]:
#reset indices
data_imm = data_imm.reset_index(drop=True)

In [None]:
data_imm.head()

In [None]:
#read in immigration data dictionary

with open('I94_SAS_Labels_Descriptions.SAS') as f:
    lines = f.readlines()
#keep only single-line comments of the form /* ... */
comment_lines = [line for line in lines if line.startswith('/*') and line.endswith('*/\n')]

In [None]:
#each comment line is expected to look like: /* CODE - description */
clpatt = re.compile(r'^/\*\s+(?P<code>.+?)\s+-\s+(?P<description>.+)\s+\*/$')
matches = [clpatt.match(line) for line in comment_lines]
#print the index of any comment line that does not fit the pattern
for i, m in enumerate(matches):
    if m is None:
        print(i)
print(f'CODE{"":16}', 'DESCRIPTION')
for m in matches:
    #skip unmatched lines so m.group() is never called on None
    if m is not None:
        print(f'{m.group("code"):20}', m.group('description'))
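
To make the pattern easier to read, here is a minimal sketch that applies it to a single comment line. The sample line is made up to match the `/* CODE - description */` shape the pattern expects; it is not copied from the data dictionary file.

```python
import re

# Same pattern as in the cell above.
clpatt = re.compile(r'^/\*\s+(?P<code>.+?)\s+-\s+(?P<description>.+)\s+\*/$')

# Hypothetical comment line, written only to illustrate the expected shape.
sample_line = '/* I94YR - 4 digit year */\n'

m = clpatt.match(sample_line)
if m is not None:
    # The named groups split the line into a field code and its description.
    parsed = {m.group('code'): m.group('description')}
    print(parsed)
```

The lazy `.+?` in the `code` group stops at the first ` - ` separator, so any later hyphens stay inside the description.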

<h4>Temperature Data</h4>

In [None]:
#Read temperature dataset from link provided by Udacity
fname = '../../data2/GlobalLandTemperaturesByCity.csv'
data_temp = pd.read_csv(fname)
data_temp.head()

In [None]:
#get shape of dataset
data_temp.shape

<h4>US Cities Demographics</h4>

In [None]:
#Read US cities demographics csv and preview the data
data_demo = pd.read_csv('us-cities-demographics.csv', sep=';')
data_demo.head()

In [None]:
#get shape of dataset
data_demo.shape

In [None]:
	
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.\
# config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
# .enableHiveSupport().getOrCreate()
# df_spark =spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


In [None]:
# #write to parquet
# df_spark.write.parquet("sas_data")
# df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [None]:
# Performing cleaning tasks here

<h4>Clean Combined Immigration Data</h4>

<h4>Drop Unnecessary Columns</h4>
The combined immigration dataset contains several columns with NaN values. The first step is to identify those columns and, if they are not needed, remove them.
<ul>
    <li>occup - occupation that will be performed in the US. This column does not seem relevant to my project, so I removed it.</li>
    <li>entdepu - Update Flag - either apprehended, overstayed, or adjusted to permanent residence. This column does not seem relevant to my project, so I removed it.</li>
    <li>insnum - INS number </li>
    </ul>
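
One way to identify such columns is to compute the share of missing values per column. The sketch below is an illustration only: the helper name `null_fraction`, the toy data, and the 0.5 cutoff are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def null_fraction(df):
    """Return the share of missing values per column, highest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy frame standing in for data_imm; the values are made up.
toy = pd.DataFrame({
    'cicid': [1.0, 2.0, 3.0, 4.0],
    'occup': [np.nan, np.nan, np.nan, 'STU'],
    'gender': ['M', 'F', np.nan, 'M'],
})

fractions = null_fraction(toy)
# Hypothetical cutoff: flag columns that are more than half empty.
mostly_empty = fractions[fractions > 0.5].index.tolist()
print(fractions)
print(mostly_empty)
```

Running `null_fraction(data_imm)` on the real data would give a ranked list to review before deciding which columns to drop.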

In [None]:
#columns to be dropped
col_drop = ['occup', 'entdepu', 'insnum']

In [None]:
#drop returns a new dataframe, so assign the result back
data_imm = data_imm.drop(columns=col_drop)

<h4>Clean Airport Codes</h4>

<h4>Drop Unnecessary Columns</h4>
The airport codes dataset contains columns with several NaNs. The first step is to identify those columns and, if they are not needed, remove them.
<ul>
    <li>local_code - local airport code. Column is not needed for my dataset</li>
    <li>gps_code - GPS codes</li>
    </ul>

In [None]:
#columns to be dropped
col_drop = ['local_code', 'gps_code']

In [None]:
#drop returns a new dataframe, so assign the result back
data_airport_codes = data_airport_codes.drop(columns=col_drop)

<h4>Clean Temperature Data</h4>

In [None]:
data_temp

Previewing the data shows many NaNs. Because the temperature data is listed by month, I can't make a useful inference about a missing value from the surrounding rows, so those rows will be removed.
The preview also shows dates that extend back to the 1700s. I don't think it's necessary to use data from that far in the past, so I will remove rows with dates earlier than 2000.

In [None]:
#drop dates earlier than 2000-01-01 (assign back so the filter persists)
data_temp = data_temp[data_temp['dt'] >= '2000-01-01']

In [None]:
#drop rows with NaNs in AverageTemperature column (assign back so the drop persists)
data_temp = data_temp.dropna(subset=['AverageTemperature'])

<h4>Clean US Cities Demographics</h4>

In [None]:
data_demo

First, we want to ensure that the number of males plus the number of females adds up to the total population.
Next, we want to ensure that the number of foreign-born residents and the number of veterans are each no greater than the total population.

In [None]:
#enumerate gives the row position, so no manual counter is needed
for i, (index, row) in enumerate(data_demo.iterrows()):
    if row['Male Population'] + row['Female Population'] != row['Total Population']:
        print("Issue with number of males or females at row: ", row)
        print(i)
    elif row['Number of Veterans'] > row['Total Population'] or row['Foreign-born'] > row['Total Population']:
        print("Issue with number of foreign born or number of veterans at row: ", row)
        print(i)

It appears that three rows list the number of males and females as NaN. Since so few rows have this issue, I decided that it is appropriate to remove them.

In [None]:
drop_cols = ['Male Population', 'Female Population']

In [None]:
#drop rows which contain NaNs in the Male Population or Female Population columns
data_demo = data_demo.dropna(subset=drop_cols)

In [None]:
#run loop again to ensure drop worked
for i, (index, row) in enumerate(data_demo.iterrows()):
    if row['Male Population'] + row['Female Population'] != row['Total Population']:
        print("Issue with number of males or females at row: ", row)
        print(i)
    elif row['Number of Veterans'] > row['Total Population'] or row['Foreign-born'] > row['Total Population']:
        print("Issue with number of foreign born or number of veterans at row: ", row)
        print(i)
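
The same consistency checks can also be expressed without a row-by-row loop. The following is a sketch under the assumption that the column names match those used in the loop above; the helper name `find_population_issues` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def find_population_issues(df):
    """Return the rows that fail the population consistency checks."""
    total = df['Total Population']
    # NaN + NaN is NaN, and NaN != total evaluates to True, so rows with
    # missing male/female counts are flagged here as well.
    sex_mismatch = (df['Male Population'] + df['Female Population']) != total
    too_many = (df['Number of Veterans'] > total) | (df['Foreign-born'] > total)
    return df[sex_mismatch | too_many]

# Toy frame standing in for data_demo; the values are made up.
toy = pd.DataFrame({
    'City': ['A', 'B', 'C'],
    'Male Population': [50.0, np.nan, 40.0],
    'Female Population': [50.0, np.nan, 60.0],
    'Total Population': [100, 80, 100],
    'Number of Veterans': [5.0, 3.0, 120.0],
    'Foreign-born': [10.0, 4.0, 20.0],
})

issues = find_population_issues(toy)
print(issues['City'].tolist())
```

An empty result from `find_population_issues(data_demo)` after the drop would indicate that the cleaning step worked.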

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model



I chose a star schema, with the immigration dataset as the fact table and the temperature, airport codes, and city demographics datasets as dimension tables.

The fact table imm_fact is shown below <br>
cicid                 INTEGER <br>
i94yr                 INTEGER<br>
i94mon                INTEGER<br>
i94cit                INTEGER<br>
i94res                INTEGER<br>
i94port               CHAR(3)<br>
arrdate               INTEGER<br>
i94mode               INTEGER<br>
i94addr               CHAR(3)<br>
depdate               INTEGER<br>
i94bir                INTEGER<br>
i94visa               INTEGER<br>
count                 INTEGER<br>
dtadfile              VARCHAR<br>
visapost              CHAR(3)<br>
entdepa               CHAR(1)<br>
entdepd               CHAR(1)<br>
matflag               CHAR(1)<br>
biryear               INTEGER<br>
dtaddto               VARCHAR<br>
gender                CHAR(1)<br>
airline               CHAR(2)<br>
admnum                BIGINT<br>
fltno                 VARCHAR<br>
visatype              VARCHAR<br>
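
The fact table keeps arrdate and depdate as integers. These appear to be SAS date values, that is, day counts from 1960-01-01; this is an assumption, but it is consistent with the previews above, where arrdate 20566 sits next to dtadfile 20160422. A minimal sketch of the conversion, with a hypothetical helper name `sas_days_to_date`:

```python
import pandas as pd

# SAS date values count days from this epoch.
SAS_EPOCH = pd.Timestamp('1960-01-01')

def sas_days_to_date(days):
    """Convert a Series of SAS day counts to datetimes (missing stays NaT)."""
    return pd.to_timedelta(days, unit='D') + SAS_EPOCH

# Two arrdate values taken from the previews above, plus a missing value.
arrdate = pd.Series([20566.0, 20789.0, None])
converted = sas_days_to_date(arrdate)
print(converted)
```

Storing a converted date alongside the raw integer would make it easier to join arrivals to the monthly temperature data.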

The dimension table dim_country is shown below<br>
country_code  INTEGER<br>
country_name  VARCHAR

The dimension table dim_destination is shown below<br>
state_abb  CHAR(2)<br>


The dimension table dim_origin is shown below<br>
city INT  <br>
res  INT

The dimension table visa_status is shown below<br>
visa_type VARCHAR  <br>
visa_description  VARCHAR
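
As a rough illustration of how the star schema could answer the first question, the sketch below joins a toy fact table to a toy dim_country table and counts arrivals by origin country and destination state. All rows are made up, and the country names are placeholders rather than values from the data dictionary.

```python
import pandas as pd

# Toy fact rows using imm_fact column names; the values are illustrative.
imm_fact = pd.DataFrame({
    'cicid': [1, 2, 3],
    'i94cit': [209, 582, 209],
    'i94addr': ['HI', 'TX', 'CA'],
    'visatype': ['WT', 'B2', 'WT'],
})

# Toy dimension rows; the names are placeholders, not real lookups.
dim_country = pd.DataFrame({
    'country_code': [209, 582],
    'country_name': ['Country A', 'Country B'],
})

# A left join keeps every fact row even if a country code has no match.
joined = imm_fact.merge(
    dim_country, how='left', left_on='i94cit', right_on='country_code'
)

# Count arrivals per (origin country, destination state) pair.
arrivals = (
    joined.groupby(['country_name', 'i94addr'])
    .size()
    .reset_index(name='arrivals')
    .sort_values('arrivals', ascending=False)
)
print(arrivals)
```

Keeping descriptive attributes in the dimension tables means the fact table only needs to store the compact codes.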

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
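
The checks listed above could be sketched as small helper functions that raise an error when a check fails. The function names and the choice of pandas DataFrames as input are assumptions for illustration; the same ideas apply to SQL tables.

```python
import pandas as pd

def check_not_empty(df, name):
    """Completeness check: fail if the table has no rows."""
    if len(df) == 0:
        raise ValueError(f'{name} is empty')

def check_unique_key(df, key, name):
    """Integrity check: fail if the key column has nulls or duplicates."""
    if df[key].isna().any():
        raise ValueError(f'{name}.{key} contains nulls')
    if df[key].duplicated().any():
        raise ValueError(f'{name}.{key} contains duplicates')

def check_row_count(source_rows, loaded_rows, name):
    """Source/count check: fail if rows were lost or added during the load."""
    if source_rows != loaded_rows:
        raise ValueError(
            f'{name}: expected {source_rows} rows, found {loaded_rows}'
        )

# Toy table standing in for imm_fact; the values are made up.
toy_fact = pd.DataFrame({'cicid': [1, 2, 3]})
check_not_empty(toy_fact, 'imm_fact')
check_unique_key(toy_fact, 'cicid', 'imm_fact')
check_row_count(3, len(toy_fact), 'imm_fact')
print('all checks passed')
```

Raising an exception rather than printing a warning lets a pipeline stop as soon as a check fails.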
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.