# Project Title
### Data Engineering Capstone Project

#### Project Summary
--describe your project at a high level--

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [2]:
# Do all imports and installs here
import re
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

In [3]:
import os, time

In [4]:
%ls

airport-codes_csv.csv            sas_data/
Capstone Project Template.ipynb  Untitled.ipynb
I94_SAS_Labels_Descriptions.SAS  us-cities-demographics.csv
immigration_data_sample.csv


### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data
-   **I94 Immigration Data:**  This data comes from the US National Tourism and Trade Office. A data dictionary is included in the workspace.  [This](https://travel.trade.gov/research/reports/i94/historical/2016.html)  is where the data comes from. There's a sample file so you can take a look at the data in csv format before reading it all in. You do not have to use the entire dataset, just use what you need to accomplish the goal you set at the beginning of the project.
-   **World Temperature Data:**  This dataset came from Kaggle. You can read more about it  [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

    -   dt: starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures
    -   AverageTemperature: global average land temperature in celsius
    -   AverageTemperatureUncertainty: the 95% confidence interval around the average
    -   City: Global Land Temperatures By City
    -   Country: Global Average Land Temperature by Country
    -   Latitude: Latitude
    -   Longitude: Longitude




-   **U.S. City Demographic Data:**  This data comes from OpenDataSoft. You can read more about it  [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).
-   **Airport Code Table:**  This is a simple table of airport codes and corresponding cities. It comes from  [here](https://datahub.io/core/airport-codes#data).






Describe the data sets you're using. Where did they come from? What type of information is included?

Primary key: cicid
 |-- cicid (double): Unique identifier for each traveller
 |-- citizenship_country (string): Traveller's country of citizenship
 |-- residence_country (string): Traveller's country of residence
 |-- city (string): City where the entry port of the traveller is located
 |-- state (string): State where the entry port of the traveller is located
 |-- arrival_date (date): Traveller's arrival date
 |-- departure_date (string): Traveller's departure date, if known
 |-- age (double):  Traveller's age
 |-- visa_type (string): aggregate visa type. Possible values are:
		Business,
		Pleasure,
		Student,
 |-- detailed_visa_type (string): Detailed visa types. Numerous values are available. Not all could be identified:
		B1: B1 visa is for business visits valid for up to a year
		B2: B2 visa is for pleasure visits valid for up to a year
		CP: could not find a definition
		E2: E2 investor visas allow foreign investors to enter and work in the United States based on a substantial investment
		F1: F1 visas are used by non-immigrant students for academic and language training courses
		F2: F2 visas are used by the dependents of F1 visa holders
		GMT: could not find a definition
		M1: M1 visas are for students enrolled in non-academic or "vocational" study (mechanical, language, cooking classes, etc.)
		WB: Visa Waiver Program, business: travel to the United States for business for stays of 90 days or less without obtaining a visa
		WT: Visa Waiver Program, tourism: travel to the United States for tourism for stays of 90 days or less without obtaining a visa
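The aggregate `visa_type` can be derived from the numeric `i94visa` code. The following is a minimal sketch, assuming the code-to-label mapping from the I94 label descriptions (1 = Business, 2 = Pleasure, 3 = Student). The name `VISA_TYPES` is illustrative, and the two sample rows reuse values from the immigration data sample.

```python
import pandas as pd

# Assumed mapping of i94visa codes to the aggregate visa types listed above.
VISA_TYPES = {1: "Business", 2: "Pleasure", 3: "Student"}

# Two rows shaped like the immigration data.
sample = pd.DataFrame({
    "cicid": [6.0, 7.0],
    "i94visa": [2.0, 3.0],
    "visatype": ["B2", "F1"],
})

# Rename the raw column to the name used in the schema above.
fact = sample.rename(columns={"visatype": "detailed_visa_type"})

# The codes are stored as doubles, so cast to int before the lookup.
fact["visa_type"] = fact["i94visa"].astype(int).map(VISA_TYPES)
print(fact)
```

Codes missing from the mapping would become `NaN`, which makes unexpected values easy to spot in a later quality check.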
 
 
 i94cit = 3 digit code of the traveller's country of citizenship (using i94cntyl to transform)
 i94res = 3 digit code of the traveller's country of residence (using i94cntyl to transform)
 i94port = 3 character code of the port of entry (using i94prtl to transform)



## read the mapping data


fn = '/home/workspace/label_mapping/i94addrl.txt'
df = pd.read_csv(fn , sep="=", header=None, engine='python',  names = ["state_code", "state"], skipinitialspace = False) 


In [28]:
df.head(2)

Unnamed: 0,state_code,state
0,\t'AL','ALABAMA'
1,\t'AK','ALASKA'
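The preview shows that the parsed values still carry a leading tab and single quotes. One way to get clean codes is to parse the lines by hand. This is a sketch only: the helper name `parse_label_lines` is hypothetical, and the line format (`<tab>'CODE'='LABEL'`) is inferred from the output above rather than from the file itself.

```python
import pandas as pd

def parse_label_lines(lines, names=("state_code", "state")):
    """Turn lines such as "\t'AL'='ALABAMA'" into a two-column DataFrame."""
    rows = []
    for line in lines:
        if "=" not in line:
            continue  # skip blank or malformed lines
        code, label = line.split("=", 1)
        # Remove surrounding whitespace, quotes and a trailing semicolon.
        code = code.strip().strip("'").strip()
        label = label.strip().rstrip(";").strip().strip("'").strip()
        rows.append((code, label))
    return pd.DataFrame(rows, columns=list(names))

# Sample lines in the assumed format; with the real file you would pass open(fn).
sample_lines = ["\t'AL'='ALABAMA'\n", "\t'AK'='ALASKA'\n"]
state_df = parse_label_lines(sample_lines)
print(state_df)
```

The same helper could be reused for the country mapping by passing `names=("country_code", "country")`.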


In [19]:
df = pd.read_csv(fn , sep=" =  ", header=None, engine='python',  names = ["country_code", "country"], skipinitialspace = True) 

In [5]:
# Read in the data here
start_time = time.time()

# April 2016 immigration data
fname_immi = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
df = pd.read_sas(fname_immi, 'sas7bdat', encoding="ISO-8859-1")

print("it took", time.time() - start_time, "to run")


it took 168.67981600761414 to run


In [7]:
from IPython.display import display, HTML
display(HTML(df.head().to_html()))
df.count()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,37.0,2.0,1.0,,,,T,,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,25.0,3.0,1.0,20130811.0,SEO,,G,,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
2,15.0,2016.0,4.0,101.0,101.0,WAS,20545.0,1.0,MI,20691.0,55.0,2.0,1.0,20160401.0,,,T,O,,M,1961.0,09302016,M,,OS,666643200.0,93.0,B2
3,16.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,28.0,2.0,1.0,20160401.0,,,O,O,,M,1988.0,09302016,,,AA,92468460000.0,199.0,B2
4,17.0,2016.0,4.0,101.0,101.0,NYC,20545.0,1.0,MA,20567.0,4.0,2.0,1.0,20160401.0,,,O,O,,M,2012.0,09302016,,,AA,92468460000.0,199.0,B2


cicid       3096313
i94yr       3096313
i94mon      3096313
i94cit      3096313
i94res      3096313
i94port     3096313
arrdate     3096313
i94mode     3096074
i94addr     2943941
depdate     2953856
i94bir      3095511
i94visa     3096313
count       3096313
dtadfile    3096312
visapost    1215063
occup          8126
entdepa     3096075
entdepd     2957884
entdepu         392
matflag     2957884
biryear     3095511
dtaddto     3095836
gender      2682044
insnum       113708
airline     3012686
admnum      3096313
fltno       3076764
visatype    3096313
dtype: int64

In [39]:
# keep only the columns that have no missing values
df_no_missing = df.dropna(axis=1)

In [40]:
df_no_missing.isna().mean().round(4) * 100

cicid       0.0
i94yr       0.0
i94mon      0.0
i94cit      0.0
i94res      0.0
i94port     0.0
arrdate     0.0
i94visa     0.0
count       0.0
admnum      0.0
visatype    0.0
dtype: float64

i94yr = 4 digit year
i94mon = numeric month
i94cit = 3 digit code of the country of citizenship
i94port = 3 character code of destination USA city
arrdate = arrival date in the USA
i94mode = 1 digit travel code
depdate = departure date from the USA
i94visa = reason for immigration
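In the sample, `arrdate` and `depdate` appear as numbers such as 20545.0 rather than calendar dates. Assuming these are SAS date values (days counted from 1960-01-01), they can be converted as sketched below. The helper name `sas_to_date` and the output column names are illustrative.

```python
import pandas as pd

# SAS date values count days from this epoch.
SAS_EPOCH = pd.Timestamp("1960-01-01")

def sas_to_date(days):
    """Convert a Series of SAS day counts to timestamps (missing stays missing)."""
    return SAS_EPOCH + pd.to_timedelta(days, unit="D")

# Values taken from the sample rows; None stands for an unknown departure date.
sample = pd.DataFrame({
    "arrdate": [20545.0, 20573.0],
    "depdate": [20567.0, None],
})
sample["arrival_date"] = sas_to_date(sample["arrdate"])
sample["departure_date"] = sas_to_date(sample["depdate"])
print(sample)
```

Under this assumption 20545 maps to 2016-04-01, which agrees with the `i94yr` and `i94mon` values of the same rows.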

In [38]:
# percentage of missing values per column
df.isna().mean().round(4) * 100

cicid        0.00
i94yr        0.00
i94mon       0.00
i94cit       0.00
i94res       0.00
i94port      0.00
arrdate      0.00
i94mode      0.01
i94addr      4.92
depdate      4.60
i94bir       0.03
i94visa      0.00
count        0.00
dtadfile     0.00
visapost    60.76
occup       99.74
entdepa      0.01
entdepd      4.47
entdepu     99.99
matflag      4.47
biryear      0.03
dtaddto      0.02
gender      13.38
insnum      96.33
airline      2.70
admnum       0.00
fltno        0.63
visatype     0.00
dtype: float64
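Keeping only fully populated columns discards fields such as `depdate` and `i94addr` that are missing in under 5% of rows. A less strict alternative is to drop only the columns that are mostly empty. The sketch below assumes a cut-off of 90%; both the threshold and the helper name `drop_sparse_columns` are illustrative choices, not part of the project.

```python
import pandas as pd

def drop_sparse_columns(frame, max_missing_pct=90.0):
    """Keep the columns whose share of missing values is at most max_missing_pct."""
    missing_pct = frame.isna().mean() * 100
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return frame.loc[:, keep]

# Small stand-in for the immigration data: occup is entirely missing here.
sample = pd.DataFrame({
    "cicid": [6.0, 7.0, 15.0, 16.0],
    "gender": [None, "M", "M", None],
    "occup": [None, None, None, None],
})
trimmed = drop_sparse_columns(sample)
print(list(trimmed.columns))
```

Applied to the percentages above, a 90% cut-off would remove `occup`, `entdepu` and `insnum` while keeping the rest.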

In [29]:
start_time = time.time()
fname_tmp = '../../data2/GlobalLandTemperaturesByCity.csv'
df = pd.read_csv(fname_tmp)
print("it took", time.time() - start_time, "to run")
from IPython.display import display, HTML

display(HTML(df.head().to_html()))
df.count()

it took 12.382538318634033 to run


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


dt                               8599212
AverageTemperature               8235082
AverageTemperatureUncertainty    8235082
City                             8599212
Country                          8599212
Latitude                         8599212
Longitude                        8599212
dtype: int64

In [26]:
# percentage of missing values per column
df.isna().mean().round(4) * 100

dt                               0.00
AverageTemperature               4.23
AverageTemperatureUncertainty    4.23
City                             0.00
Country                          0.00
Latitude                         0.00
Longitude                        0.00
dtype: float64
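The `Latitude` and `Longitude` columns are stored as text with a hemisphere suffix (for example `57.05N`), which is awkward for joins or distance calculations. A possible conversion to signed decimal degrees is sketched here; `parse_coordinate` is a hypothetical helper and assumes every value ends in one of N, S, E or W.

```python
def parse_coordinate(value):
    """Convert a string like '57.05N' or '10.33E' to signed decimal degrees."""
    text = value.strip()
    hemisphere = text[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unexpected hemisphere suffix in {value!r}")
    magnitude = float(text[:-1])
    # South and west are negative by convention.
    return -magnitude if hemisphere in ("S", "W") else magnitude

print(parse_coordinate("57.05N"), parse_coordinate("10.33E"))
```

With pandas this could be applied column-wise, for example `df["Latitude"].map(parse_coordinate)`.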

In [30]:
start_time = time.time()
fn_airport = 'airport-codes_csv.csv'
df = pd.read_csv(fn_airport)
print("it took", time.time() - start_time, "to run")
from IPython.display import display, HTML

display(HTML(df.head().to_html()))
df.count()

it took 0.430330753326416 to run


Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


ident           55075
type            55075
name            55075
elevation_ft    48069
continent       27356
iso_country     54828
iso_region      55075
municipality    49399
gps_code        41030
iata_code        9189
local_code      28686
coordinates     55075
dtype: int64

In [31]:
# percentage of missing values per column
df.isna().mean().round(4) * 100

ident            0.00
type             0.00
name             0.00
elevation_ft    12.72
continent       50.33
iso_country      0.45
iso_region       0.00
municipality    10.31
gps_code        25.50
iata_code       83.32
local_code      47.91
coordinates      0.00
dtype: float64
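To relate airports to the state codes used elsewhere in the project, the state can be taken from `iso_region`, and `coordinates` can be split into numeric columns. The sketch below assumes that `iso_region` has the form `US-XX` and that `coordinates` is ordered longitude, latitude (the sample rows are consistent with that, but it is an assumption). The new column names are illustrative.

```python
import pandas as pd

# Two rows copied from the airport sample above.
sample = pd.DataFrame({
    "ident": ["00A", "00AK"],
    "iso_region": ["US-PA", "US-AK"],
    "coordinates": ["-74.93360137939453, 40.07080078125",
                    "-151.695999146, 59.94919968"],
})

# 'US-PA' -> 'PA'
sample["state_code"] = sample["iso_region"].str.split("-", n=1).str[1]

# Split the text pair and convert each part to a float.
parts = sample["coordinates"].str.split(",", expand=True)
sample["longitude"] = pd.to_numeric(parts[0].str.strip())
sample["latitude"] = pd.to_numeric(parts[1].str.strip())
print(sample[["ident", "state_code", "longitude", "latitude"]])
```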

In [35]:
start_time = time.time()
fname_city = 'us-cities-demographics.csv'
df = pd.read_csv(fname_city, sep =";" )
print("it took", time.time() - start_time, "to run")
from IPython.display import display, HTML

display(HTML(df.head().to_html()))
df.count()

it took 0.019350051879882812 to run


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


City                      2891
State                     2891
Median Age                2891
Male Population           2888
Female Population         2888
Total Population          2891
Number of Veterans        2878
Foreign-born              2878
Average Household Size    2875
State Code                2891
Race                      2891
Count                     2891
dtype: int64

In [36]:
# percentage of missing values per column
df.isna().mean().round(4) * 100

City                      0.00
State                     0.00
Median Age                0.00
Male Population           0.10
Female Population         0.10
Total Population          0.00
Number of Veterans        0.45
Foreign-born              0.45
Average Household Size    0.55
State Code                0.00
Race                      0.00
Count                     0.00
dtype: float64
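The demographics file has one row per city and race, so city-level figures such as `Total Population` repeat on every row of the same city. One way to get a single row per city is to pivot `Race` into columns. This is a sketch on a tiny sample; the Hoover/White count is made up for illustration, while the other values come from the preview above.

```python
import pandas as pd

sample = pd.DataFrame({
    "City": ["Hoover", "Hoover", "Quincy"],
    "State Code": ["AL", "AL", "MA"],
    "Total Population": [84839, 84839, 93629],
    "Race": ["Asian", "White", "White"],
    "Count": [4759, 61000, 58723],  # 61000 is an illustrative value
})

# One row per city, one column per race; missing combinations become 0.
wide = sample.pivot_table(
    index=["City", "State Code", "Total Population"],
    columns="Race",
    values="Count",
    aggfunc="sum",
    fill_value=0,
).reset_index()
wide.columns.name = None
print(wide)
```

Summing `Total Population` by state is only meaningful after a step like this, because the long format would count each city once per race.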

In [18]:
# df.count()
%cd \home


[Errno 2] No such file or directory: 'home'
/root


In [7]:
from pyspark.sql import SparkSession

# Build a session with the spark-sas7bdat package so Spark can read SAS files
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


pyspark.sql.dataframe.DataFrame

In [21]:
#write to parquet
# df_spark.write.parquet("sas_data")
# df_spark=spark.read.parquet("sas_data")

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

In [23]:
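# Perform cleaning tasks here
# Count missing values per column

# Alias Spark's sum so it does not shadow the Python builtin
from pyspark.sql.functions import col, sum as spark_sum
df_spark.select(*(spark_sum(col(c).isNull().cast("int")).alias(c) for c in df_spark.columns)).show()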


+-----+-----+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-------+-------+-------+-------+-------+-------+-------+------+-------+-------+------+-----+--------+
|cicid|i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|  occup|entdepa|entdepd|entdepu|matflag|biryear|dtaddto|gender| insnum|airline|admnum|fltno|visatype|
+-----+-----+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-------+-------+-------+-------+-------+-------+-------+------+-------+-------+------+-----+--------+
|    0|    0|     0|     0|     0|      0|      0|    239| 152592| 142457|   802|      0|    0|       1| 1881250|3088187|    238| 138429|3095921| 138429|    802|    477|414269|2982605|  83627|     0|19549|       0|
+-----+-----+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-------+-------+-------+-------+-------+-------+-------+------+-------+-------+------+-----+--------+

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
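The checks listed above could be sketched as a small function that returns the problems it finds. This is an illustration in pandas, not the project's actual checks: the name `run_quality_checks`, its parameters and the sample tables are all hypothetical.

```python
import pandas as pd

def run_quality_checks(table, key, expected_rows=None):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if table.empty:
        problems.append("table is empty")
    if table[key].isna().any():
        problems.append(f"{key} contains nulls")
    if table[key].duplicated().any():
        problems.append(f"{key} is not unique")
    # Source/count check: compare against the row count of the source, if known.
    if expected_rows is not None and len(table) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(table)}")
    return problems

good = pd.DataFrame({"cicid": [6.0, 7.0, 15.0]})
bad = pd.DataFrame({"cicid": [6.0, 6.0, None]})
print(run_quality_checks(good, "cicid", expected_rows=3))
print(run_quality_checks(bad, "cicid", expected_rows=3))
```

Returning a list instead of raising on the first failure lets a single run report every failed check at once.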
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

fact_immigration




### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    * The database needed to be accessed by 100+ people.