# Project Title
### Data Engineering Capstone Project

#### Project Summary
* The goal of this project is to enable data analysts and similar users to analyze various aspects of US immigration data by combining four data sets: I94 immigration, airport codes, US city demographics, and global temperature. The project builds a schema from these data sets that lets analysts answer questions such as which origin countries send the most visitors to the US, how long visitors stay, what the demographics are of the state where they arrive, and what the average temperature of the city is.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [60]:
# import necessary libraries and user-defined utilities
import configparser
import datetime as dt
import os
import time

import pandas as pd
from pyspark.sql import Row, SparkSession
from pyspark.sql import functions as F
# explicit imports instead of wildcards, so Python built-ins such as sum, min and max are not shadowed
from pyspark.sql.functions import (
    avg, col, count, date_format, dayofmonth, dayofweek, isnan, lit,
    monotonically_increasing_id, month, udf, weekofyear, when, year,
)
from pyspark.sql.functions import date_add as d_add
from pyspark.sql.types import (
    DoubleType, FloatType, IntegerType, StringType, StructField, StructType,
)

import table_helper as helper
import util

In [2]:
spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
df_spark.show(2)

+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
|cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|       admnum|fltno|visatype|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+-------------+-----+--------+
|  6.0|2016.0|   4.0| 692.0| 692.0|    XXX|20573.0|   null|   null|   null|  37.0|    2.0|  1.0|    null|    null| null|      T|   null|      U|   null| 1979.0|10282016|  null|  null|   null|1.897628485E9| null|      B2|
|  7.0|2016.0|   4.0| 254.0| 276.0|    ATL|20551.0|    1.0|     AL|   null|  25.0|    3.0|  1.0|20130811|     SEO| n

In [3]:
print(df_spark.count(),len(df_spark.columns))

3096313 28


In [5]:
# dt = util.convert_sas_to_date(20573.0)
# dt.minute

0

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

* The purpose of this project is to facilitate meaningful analysis of US immigration data by data analysts and similar users.
    
* Clean the raw data obtained from the sources described below
* Write the cleaned data to parquet files
* Perform ETL using a Spark cluster
* Create the necessary fact and dimension tables in a star schema
* Load data into those tables

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included?

* i94 Immigration Sample Data <br>
    * This data set includes information about people entering and leaving the US, such as origin country, arrival date, departure date, port of entry and visa type, to name a few.
* US City Demographic Data from  https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/ _(provided in udacity workspace)_<br>
    * This data set contains total population, median age, race, average household size, number of veterans, etc.
* Airport Codes from https://datahub.io/core/airport-codes#data. _(provided in udacity workspace)_<br>
    * This data includes details of country code, airport code, corresponding city code, elevation, type of airport (small, heliport, large etc).
* Global Temperature Data is available in udacity workspace _(GlobalLandTemperaturesByCity.csv)_ 

In [4]:
immigration_data = 'immigration_data_sample.csv'
immigration_df = pd.read_csv(immigration_data)
immigration_df.head(1)

Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,782,WT


In [65]:
# use "column" as the loop variable so pyspark's col() is not shadowed
for column in immigration_df.columns:
    print(column)

Unnamed: 0
cicid
i94yr
i94mon
i94cit
i94res
i94port
arrdate
i94mode
i94addr
depdate
i94bir
i94visa
count
dtadfile
visapost
occup
entdepa
entdepd
entdepu
matflag
biryear
dtaddto
gender
insnum
airline
admnum
fltno
visatype


In [None]:
# #write to parquet
# df_spark.write.parquet("sas_data")
# df_spark=spark.read.parquet("sas_data")

|column|Description|
|-------|-----------|
|cicid|Unique Identifier|
|i94yr|Year of i94|
|i94mon|Month of i94|
|i94cit|3 digit code of immigrant's country of birth|
|i94res|3 digit code of immigrant's country of residence|
|i94port|Immigrant's port of entry|
|arrdate|Date of arrival in the US|
|i94mode|Mode of arrival (1-Air; 2-Sea; 3-Land; 9-Unknown)|
|i94addr|State code of arrival|
|depdate|Date of departure from the US|
|i94bir|Age in years|
|i94visa|Visa code|
|dtadfile|Date when added to I-94 file|
|visapost|Place of issuance of visa|
|occup|Occupation in US|
|entdepa|Arrival Flag|
|entdepd|Departure Flag|
|entdepu|Update Flag - extended stay, changed status to permanent res|
|matflag|Match flag  (arrival and departure records)|
|biryear|4 digit year of birth|
|dtaddto|Date until stay is allowed|
|gender|Immigrant's gender|
|insnum|INS number|
|airline|Airline by which immigrant entered the US|
|admnum|Admission number|
|fltno|Flight number|
|visatype|Visa class with which immigrant entered the US|


#### Global Temperature Data

In [5]:
file_name = '../../data2/GlobalLandTemperaturesByCity.csv'
global_temperature_df = pd.read_csv(file_name)

In [6]:
global_temperature_df.shape

(8599212, 7)

In [7]:
global_temperature_df.head(1)

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E


##### Data Dictionary

|Column|Description|
|-------|----------|
|dt|Date (first day of the month)|
|AverageTemperature|Average Temperature for the month in Celsius|
|AverageTemperatureUncertainty|95% confidence interval around the average temperature|
|City|Name of city|
|Country|Name of country|
|Latitude|Latitude of city|
|Longitude|Longitude of city|

In [69]:
dt_begin = "2000-01-01"
dt_end = "2019-01-01"
after_dt_begin = global_temperature_df["dt"] >= dt_begin
before_dt_end = global_temperature_df["dt"] < dt_end
dt_range = after_dt_begin & before_dt_end
global_temperature_df = global_temperature_df.loc[dt_range]
global_temperature_df.shape

(579150, 7)
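The filter above compares `dt` values as plain strings. That works because zero-padded `YYYY-MM-DD` strings sort in the same order as the dates they represent. A minimal sketch of the same half-open range check, using only the standard library; the function name `in_range` and the sample date `2013-09-01` are illustrative assumptions, not taken from the notebook:

```python
from datetime import date


def in_range(dt_str, begin="2000-01-01", end="2019-01-01"):
    """Half-open range check [begin, end) on ISO 8601 date strings."""
    return begin <= dt_str < end


# "1743-11-01" appears in the data set; the other values are made up for illustration.
samples = ["2013-09-01", "1743-11-01", "2000-01-01", "1999-12-01"]

# String order and calendar order agree for this format.
by_string = sorted(samples)
by_date = sorted(samples, key=date.fromisoformat)

kept = [s for s in samples if in_range(s)]
print(by_string == by_date, kept)
```

If the column ever holds dates in another format (for example `1/1/2000`), this shortcut no longer holds and the values should be parsed first.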

In [12]:
global_temperature_df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
3074,2000-01-01,3.065,0.372,Århus,Denmark,57.05N,10.33E
3075,2000-02-01,3.724,0.241,Århus,Denmark,57.05N,10.33E
3076,2000-03-01,3.976,0.296,Århus,Denmark,57.05N,10.33E
3077,2000-04-01,8.321,0.221,Århus,Denmark,57.05N,10.33E
3078,2000-05-01,13.567,0.253,Århus,Denmark,57.05N,10.33E


In [58]:
for column in global_temperature_df:
    # isnull().any() is True when at least one value in the column is missing
    if global_temperature_df[column].isnull().any():
        print(f"column {column} has null value")
print("finished checking for null")

finished checking for null


In [59]:
global_temperature_df.dtypes

dt                                object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                              object
Country                           object
Latitude                          object
Longitude                         object
dtype: object
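`Latitude` and `Longitude` are stored as strings with a hemisphere suffix (`57.05N`, `10.33E`), which is why they show up as `object`. If numeric coordinates are needed later, one possible conversion is sketched below. The helper name `parse_coordinate` is hypothetical, and treating `S` and `W` as negative is an assumption based on the usual convention; only `N` and `E` values are visible in the rows shown here.

```python
def parse_coordinate(value):
    """Convert a string such as '57.05N' or '10.33E' to a signed float.

    Assumption: a trailing S or W means a negative value.
    """
    text = value.strip()
    hemisphere = text[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unexpected hemisphere suffix in {value!r}")
    magnitude = float(text[:-1])
    return -magnitude if hemisphere in ("S", "W") else magnitude


print(parse_coordinate("57.05N"), parse_coordinate("10.33E"))
```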

#### AIRPORT CODES

In [8]:
airport_codes_csv = 'airport-codes_csv.csv'
airport_codes_df = pd.read_csv(airport_codes_csv)
airport_codes_df.shape

(55075, 12)

In [9]:
column_list = ['iata_code', 'local_code']
airport_codes_df = util.cleanup_missing_column_values(airport_codes_df,column_list)

removing rows with null values for ['iata_code', 'local_code']
total rows before clean up 55075
total rows after clean up 2987


In [10]:
airport_codes_df.head(1)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
223,03N,small_airport,Utirik Airport,4.0,OC,MH,MH-UTI,Utirik Island,K03N,UTK,03N,"169.852005, 11.222"


##### Data Dictionary

|Column|Description|
|-------|----------|
|ident|Unique Identifier|
|type|Type of airport viz., heliport, small/medium/large airport|
|name|Name of airport|
|elevation_ft|Elevation above sea level, in feet|
|continent|Continent code|
|iso_country|ISO Country Code|
|iso_region|ISO Region Code|
|municipality|Municipality of airport city|
|gps_code|GPS Code for the Airport|
|iata_code|Airport IATA Code|
|local_code|Airport local code|
|coordinates|Longitude and latitude of airport, comma-separated|

In [11]:
airport_codes_df.shape

(2987, 12)

#### US CITY DEMOGRAPHICS

In [12]:
us_cities_dem_csv = "us-cities-demographics.csv"
demographics_df = pd.read_csv(us_cities_dem_csv, delimiter=';')
demographics_df.head(1)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924


In [13]:
demographics_df.shape

(2891, 12)

##### Data Dictionary

|Column|Description|
|---------|---------|
|City|Name of the city|
|State|Name of the state|
|Median Age|Median age of population|
|Male Population|Male Population count|
|Female Population|Female Population count|
|Total Population|Total Population(Male+Female)|
|Number of Veterans|Veteran population count|
|Foreign-born|Count of foreign-born people|
|Average Household Size|Average no.of persons in a house|
|State Code|State Code|
|Race|Racial Information|
|Count|Population count for the given race|

In [14]:
demographics_df = demographics_df.drop_duplicates()
demographics_df.shape

for column in demographics_df:
    # isnull().any() is True when at least one value in the column is missing
    if demographics_df[column].isnull().any():
        print(f"column {column} has null value")
print("finished checking for null")


column Male Population has null value
column Female Population has null value
column Number of Veterans has null value
column Foreign-born has null value
column Average Household Size has null value
finished checking for null


#### Build country code to country name data frame 
_(The following code is used based on a direction from a mentor)_

In [17]:
#extract country name and country code from I94_SAS_Labels_Descriptions.SAS

with open("I94_SAS_Labels_Descriptions.SAS") as f:
    contents = f.readlines()
    country_code = {}
    for countries in contents[10:298]:
        pair = countries.split('=')
        code,country = pair[0].strip(), pair[1].strip().strip("'")
        country_code[code] = country
country_code_df = pd.DataFrame(list(country_code.items()),columns=['code','country'])
country_code_df.head(5)

Unnamed: 0,code,country
0,236,AFGHANISTAN
1,101,ALBANIA
2,316,ALGERIA
3,102,ANDORRA
4,324,ANGOLA
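The cell above depends on the country block sitting at fixed line offsets (`contents[10:298]`). A possible alternative is to match `code = 'NAME'` pairs with a regular expression, so that comment lines and separators are skipped automatically. This is only a sketch: the function name `parse_code_lines` and the sample lines are assumptions about the file layout, and the lines passed in are assumed to come from the country section only, since other sections of the labels file may use a similar pattern.

```python
import re

# digits, an equals sign, then a single-quoted label
PAIR_RE = re.compile(r"^\s*(\d+)\s*=\s*'([^']*)'")


def parse_code_lines(lines):
    """Return {code: label} for every line that looks like  236 = 'AFGHANISTAN'."""
    mapping = {}
    for line in lines:
        match = PAIR_RE.match(line)
        if match:
            mapping[match.group(1)] = match.group(2).strip()
    return mapping


# Hypothetical excerpt in the assumed layout of the labels file.
sample = [
    "/* country codes */",
    "   236 =  'AFGHANISTAN'",
    "   101 =  'ALBANIA'",
    ";",
]
codes = parse_code_lines(sample)
print(codes)
```

The resulting dictionary can be turned into a data frame the same way as `country_code` in the cell above.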


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
* Clean up data with null values in important columns
* Drop duplicate rows

In [18]:
# clean up Immigration data
column_list = ["arrdate","depdate"]
immigration_df = util.cleanup_missing_column_values(immigration_df,column_list)
immigration_df = immigration_df.drop_duplicates()

#dropping columns insnum and occup as there are null values for majority of the rows
# immigration_df = immigration_df.drop('insnum', axis=1)
# immigration_df = immigration_df.drop('occup', axis=1)
util.report_unique_info(immigration_df, "immigration_df", ["visapost"])

removing rows with null values for ['arrdate', 'depdate']
total rows before clean up 1000
total rows after clean up 951
Displaying number of unique values for given columns in  immigration_df dataframe
There are 97 unique values in column visapost


In [19]:
# clean up Global Temperature data
column_list = ["AverageTemperature", "Country"]
global_temperature_df = util.cleanup_missing_column_values(global_temperature_df,column_list)
global_temperature_df.shape

removing rows with null values for ['AverageTemperature', 'Country']
total rows before clean up 8599212
total rows after clean up 8235082


(8235082, 7)

In [20]:
#discard historical data and work with data for past 20 years
print("Discarding historical data and work with data for recent 20 years")
dt_begin = "2000-01-01"
dt_end = "2019-01-01"
after_dt_begin = global_temperature_df["dt"] >= dt_begin
before_dt_end = global_temperature_df["dt"] < dt_end
dt_range = after_dt_begin & before_dt_end
global_temperature_df = global_temperature_df.loc[dt_range]
global_temperature_df.shape

Discarding historical data and work with data for recent 20 years


(576080, 7)

In [21]:
global_temperature_df.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
3074,2000-01-01,3.065,0.372,Århus,Denmark,57.05N,10.33E
3075,2000-02-01,3.724,0.241,Århus,Denmark,57.05N,10.33E
3076,2000-03-01,3.976,0.296,Århus,Denmark,57.05N,10.33E
3077,2000-04-01,8.321,0.221,Århus,Denmark,57.05N,10.33E
3078,2000-05-01,13.567,0.253,Århus,Denmark,57.05N,10.33E


In [22]:
# clean up Airport codes
column_list = ['iata_code', 'local_code']
airport_codes_df = util.cleanup_missing_column_values(airport_codes_df,column_list)
airport_codes_df.shape

removing rows with null values for ['iata_code', 'local_code']
total rows before clean up 2987
total rows after clean up 2987


(2987, 12)

In [23]:
demographics_df = demographics_df.drop_duplicates()
# need to remove rows that have null values for the following columns
# Male Population, Female Population
column_list = ['Male Population', 'Female Population']
demographics_df = util.cleanup_missing_column_values(demographics_df,column_list)
demographics_df.shape

removing rows with null values for ['Male Population', 'Female Population']
total rows before clean up 2891
total rows after clean up 2888


(2888, 12)

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

* The process of admitting a visitor to the US triggers events that can be classified as facts. In this project we create an immigration fact table.
* Derive dimension tables
    * airport
    * time
    * status
    * visa
    * temperature
    * country
    * state


#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

* Load data into staging environment
* Create fact and dimension tables
* Write table data into parquet files
* Run data quality tests

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [24]:

# not including insnum, occup columns as majority of rows contain null value for these columns
# also these columns are dropped from the data frame
immigration_df.drop(["insnum","occup"], inplace=True, axis=1)
immig_schema = StructType([StructField("0", IntegerType(), True)\
                          ,StructField("cicid", FloatType(), True)\
                          ,StructField("i94yr", FloatType(), True)\
                          ,StructField("i94mon", FloatType(), True)\
                          ,StructField("i94cit", FloatType(), True)\
                          ,StructField("i94res", FloatType(), True)\
                          ,StructField("i94port", StringType(), True)\
                          ,StructField("arrdate", FloatType(), True)\
                          ,StructField("i94mode", FloatType(), True)\
                          ,StructField("i94addr", StringType(), True)\
                          ,StructField("depdate", FloatType(), True)\
                          ,StructField("i94bir", FloatType(), True)\
                          ,StructField("i94visa", FloatType(), True)\
                          ,StructField("count", FloatType(), True)\
                          ,StructField("dtadfile", StringType(), True)\
                          ,StructField("visapost", StringType(), True)\
                          ,StructField("entdepa", StringType(), True)\
                          ,StructField("entdepd", StringType(), True)\
                          ,StructField("entdepu", StringType(), True)\
                          ,StructField("matflag", StringType(), True)\
                          ,StructField("biryear", FloatType(), True)\
                          ,StructField("dtaddto", StringType(), True)\
                          ,StructField("gender", StringType(), True)\
                          ,StructField("airline", StringType(), True)\
                          ,StructField("admnum", FloatType(), True)\
                          ,StructField("fltno", StringType(), True)\
                          ,StructField("visatype", StringType(), True)])

In [25]:
immigration_spark = spark.createDataFrame(immigration_df, schema=immig_schema)
print(immigration_spark.count(),len(immigration_spark.columns))

951 27


In [26]:
immigration_spark.toPandas().head(3)

Unnamed: 0,0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepd,entdepu,matflag,biryear,dtaddto,gender,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,O,,M,1955.0,7202016,F,JL,56582680000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,R,,M,1990.0,10222016,M,*GA,94361990000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,O,,M,1940.0,7052016,M,LH,55780470000.0,00464,WT
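Note that `admnum` for the first row reads 56582680000.0 here, while the pandas sample earlier showed 56582670000.0. A likely explanation is that `FloatType` is a 32-bit float, which holds only about 7 significant decimal digits; `DoubleType` (64-bit) would be a safer choice for identifier-like columns such as `cicid` and `admnum`. The sketch below reproduces the effect with the standard library only. The helper `as_float32` and the 11-digit value are hypothetical and used purely for illustration.

```python
import struct


def as_float32(value):
    """Round-trip a Python float (64-bit) through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


admnum = 56582674633.0          # hypothetical 11-digit admission number
stored = as_float32(admnum)     # what a 32-bit float column would hold

print(admnum, stored, stored - admnum)
print(as_float32(33.8))         # a value like Median Age 33.8 also changes slightly
```

Small whole numbers such as `i94mon` survive the round trip unchanged, so the loss only matters for large or fractional values.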


In [27]:
globaltemp_schema = StructType([StructField("dt", StringType(), True)\
                          ,StructField("AverageTemperature", FloatType(), True)\
                          ,StructField("AverageTemperatureUncertainty", FloatType(), True)\
                          ,StructField("City", StringType(), True)\
                          ,StructField("Country", StringType(), True)\
                          ,StructField("Latitude", StringType(), True)\
                          ,StructField("Longitude", StringType(), True)])

temp_spark = spark.createDataFrame(global_temperature_df, schema=globaltemp_schema)

temp_spark.toPandas().head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,2000-01-01,3.065,0.372,Århus,Denmark,57.05N,10.33E
1,2000-02-01,3.724,0.241,Århus,Denmark,57.05N,10.33E
2,2000-03-01,3.976,0.296,Århus,Denmark,57.05N,10.33E
3,2000-04-01,8.321,0.221,Århus,Denmark,57.05N,10.33E
4,2000-05-01,13.567,0.253,Århus,Denmark,57.05N,10.33E


In [28]:
dem_schema = StructType([StructField("City", StringType(), True)\
                        ,StructField("State", StringType(), True)\
                        ,StructField("Median Age", FloatType(), True)\
                        ,StructField("Male Population", FloatType(), True)\
                        ,StructField("Female Population", FloatType(), True)\
                        ,StructField("Total Population", IntegerType(), True)\
                        ,StructField("Number of Veterans", FloatType(), True)\
                        ,StructField("Foreign-born", FloatType(), True)\
                        ,StructField("Average Household Size", FloatType(), True)\
                        ,StructField("State Code", StringType(), True)\
                        ,StructField("Race", StringType(), True)\
                        ,StructField("Count", IntegerType(), True)])

dem_spark = spark.createDataFrame(demographics_df, schema=dem_schema)

dem_spark.toPandas().head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.799999,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.599998,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [29]:
airport_schema =  StructType([StructField("ident", StringType(), True)\
                        ,StructField("type", StringType(), True)\
                        ,StructField("name", StringType(), True)\
                        ,StructField("elevation_ft", FloatType(), True)\
                        ,StructField("continent", StringType(), True)\
                        ,StructField("iso_country", StringType(), True)\
                        ,StructField("iso_region", StringType(), True)\
                        ,StructField("municipality", StringType(), True)\
                        ,StructField("gps_code", StringType(), True)\
                        ,StructField("iata_code", StringType(), True)\
                        ,StructField("local_code", StringType(), True)\
                        ,StructField("coordinates", StringType(), True)])
airport_spark = spark.createDataFrame(airport_codes_df, schema=airport_schema)

airport_spark.toPandas().head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,03N,small_airport,Utirik Airport,4.0,OC,MH,MH-UTI,Utirik Island,K03N,UTK,03N,"169.852005, 11.222"
1,07FA,small_airport,Ocean Reef Club Airport,8.0,,US,US-FL,Key Largo,07FA,OCA,07FA,"-80.274803161621, 25.325399398804"
2,0AK,small_airport,Pilot Station Airport,305.0,,US,US-AK,Pilot Station,,PQS,0AK,"-162.899994, 61.934601"
3,0CO2,small_airport,Crested Butte Airpark,8980.0,,US,US-CO,Crested Butte,0CO2,CSE,0CO2,"-106.928341, 38.851918"
4,0TE7,small_airport,LBJ Ranch Airport,1515.0,,US,US-TX,Johnson City,0TE7,JCY,0TE7,"-98.62249755859999, 30.251800537100003"


In [30]:
output_path="table_data/"

##### 1.Create airport dimension

In [31]:
airport_spark = helper.create_airport(airport_spark, output_path)

Writing table airport to table_data/airport
Write complete!


In [32]:
airport = spark.read.parquet(output_path+"airport")
airport.toPandas().head()

Unnamed: 0,ident,type,iata_code,name,iso_country,iso_region,municipality,gps_code,coordinates,elevation_ft
0,BNM,small_airport,BNM,Bodinumu Airport,PG,PG-CPM,Bodinumu,AYBD,"147.666722222, -9.107777777779999",3700.0
1,CFF4,small_airport,DAS,Great Bear Lake Airport,CA,CA-NT,Great Bear Lake,CFF4,"-119.707000732, 66.7031021118",562.0
2,CFQ4,small_airport,TIL,Cheadle Airport,CA,CA-AB,Cheadle,CFQ4,"-113.62400054932, 51.057498931885",3300.0
3,KAPN,medium_airport,APN,Alpena County Regional Airport,US,US-MI,Alpena,KAPN,"-83.56030273, 45.0780983",690.0
4,KBFI,large_airport,BFI,Boeing Field King County International Airport,US,US-WA,Seattle,KBFI,"-122.302001953125, 47.529998779296875",21.0


##### 2.Create time dimension

In [33]:
time = helper.create_time(immigration_spark,output_path)

creating time table
Writing table time to table_data/time
Write complete!


In [34]:
time = spark.read.parquet(output_path+"time")
time.toPandas().head()

Unnamed: 0,arrdate,arrival_date,day,month,year,week,weekday
0,20566.0,2016-04-22,22,4,2016,16,6
1,20567.0,2016-04-23,23,4,2016,16,7
2,20551.0,2016-04-07,7,4,2016,14,5
3,20572.0,2016-04-28,28,4,2016,17,5
4,20550.0,2016-04-06,6,4,2016,14,4
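The `arrdate` values are SAS numeric dates, which count days from 1 January 1960; the rows above are consistent with that (20566 maps to 2016-04-22). A minimal sketch of the conversion follows. It is not necessarily how `util.convert_sas_to_date` or `helper.create_time` implement it, and the name `sas_to_date` is an assumption.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # day 0 of the SAS date scale


def sas_to_date(sas_days):
    """Convert a SAS numeric date (days since 1960-01-01) to a datetime.date.

    Returns None for missing values (None or NaN).
    """
    if sas_days is None or sas_days != sas_days:  # NaN is not equal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_to_date(20566.0), sas_to_date(20551.0))
```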


##### 3.Create status dimension

In [35]:
status = helper.create_status(immigration_spark,output_path)

Writing table status to table_data/status
Write complete!


In [36]:
status = spark.read.parquet(output_path+"status")
status.toPandas().head()

Unnamed: 0,status_flag_id,arrival_flag,departure_flag,match_flag
0,8,O,O,M
1,35,A,D,M
2,101,Z,O,M
3,1,G,R,M
4,14,G,K,M


##### 4.Create visa dimension

In [37]:
visa = helper.create_visa(immigration_spark,output_path)

Writing table visa to table_data/visa
Write complete!


In [38]:
visa = spark.read.parquet(output_path+"visa")
visa.toPandas().head()

Unnamed: 0,visa_id,i94visa,visatype,visapost
0,32,2.0,B2,CPT
1,78,2.0,B2,BGT
2,223,2.0,B2,KEV
3,425,2.0,B2,ABD
4,154,2.0,B2,BRA


##### 5.Create temperature dimension

In [61]:
temperature = helper.create_temperature(temp_spark,output_path)

TypeError: cannot perform reduce with flexible type

In [None]:
temperature = spark.read.parquet(output_path+"temperature")
temperature.toPandas().head()

##### 6.Create country dimension

In [40]:
country_spark = spark.createDataFrame(country_code_df)
country_spark.toPandas().head()

Unnamed: 0,code,country
0,236,AFGHANISTAN
1,101,ALBANIA
2,316,ALGERIA
3,102,ANDORRA
4,324,ANGOLA


In [41]:
country = helper.create_country(country_spark,output_path)

Writing table country to table_data/country
Write complete!


In [42]:
country = spark.read.parquet(output_path+"country")
country.toPandas().head()

Unnamed: 0,code,country
0,236,AFGHANISTAN
1,101,ALBANIA
2,316,ALGERIA
3,102,ANDORRA
4,324,ANGOLA


##### 7.Create state dimension

In [43]:
state = helper.create_state(dem_spark,output_path)

TypeError: cannot perform reduce with flexible type

In [None]:
state = spark.read.parquet(output_path+"state")
state.toPandas().head()

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here
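
The quality-check cell is still empty. One way to cover the count and unique-key checks listed above is sketched here with plain Python helpers that take values already pulled from the tables. The function names are hypothetical, and the Spark calls in the trailing comments (`count()`, `select()`, `collect()`) only indicate how the inputs might be obtained from the parquet-backed data frames.

```python
def check_not_empty(table_name, row_count):
    """Completeness check: fail if a table was written with no rows."""
    if row_count <= 0:
        raise ValueError(f"quality check failed: {table_name} has no rows")
    return f"{table_name}: {row_count} rows"


def check_unique_key(table_name, keys):
    """Integrity check: fail if the key column has nulls or duplicates."""
    keys = list(keys)
    if any(key is None for key in keys):
        raise ValueError(f"quality check failed: {table_name} has null keys")
    duplicates = len(keys) - len(set(keys))
    if duplicates:
        raise ValueError(
            f"quality check failed: {table_name} has {duplicates} duplicate keys"
        )
    return f"{table_name}: {len(keys)} unique keys"


# Possible usage with the tables created above (assumes they were written successfully):
#   check_not_empty("airport", airport.count())
#   check_unique_key("status", [r["status_flag_id"] for r in status.select("status_flag_id").collect()])
print(check_not_empty("airport", 2987))
print(check_unique_key("status", [8, 35, 101, 1, 14]))
```

Raising an exception rather than printing makes a failed check stop the pipeline run instead of passing unnoticed.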

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.