# US Immigration Data Engineering Project
### Data Engineering Capstone Project

#### Project Summary
The purpose of this project is to build data lakes for the Analytics team, so that they can use it further for the data analysis tasks in  fast and efficient manner. Data used in this project come from a variety of sources. Mainly 4 datasets are used which are as follows:
*  I94 Immigration Data
*  World Temperature Data
*  U.S. City Demographic Data
*  Airport Code Data

A detailed project overview can be found in Step 1.

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
from pyspark.sql import SparkSession
from datetime import datetime,timedelta
from pyspark.sql.functions import from_unixtime, unix_timestamp, to_date, expr,date_add,udf,col
from pyspark.sql.types import StringType, DateType

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What is your end solution look like? What tools did you use? etc>

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

Mainly 4 datasets are used which are as follows:
>  
   *  I94 Immigration Data
   *  World Temperature Data
   *  U.S. City Demographic Data
   *  Airport Code Table

Let's have a detailed look on each dataset:
*  **I94 Immigration Data:** This data comes from the US National Tourism and Trade Office. It contains international visitor arrival statistics by world regions and selected countries, type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry. Details of the columns like their names and meanings can be seen in the data dictionary included at the end of this document.

*  **World Temperature Data:**  This dataset came from Kaggle. This dataset contains temperature data of US from 1850 to 2013. A detailed data dictionary is provided at the end of this documents.

* **U.S. City Demographic Data:** This data comes from Opensoft. This data comes from the US Census Bureau's 2015 American Community Survey. This dataset contains information about the demographics of all US cities and census-designated places with a population greater or equal to 65,000. A detailed data dictionary is provided at the end of this documents.

* **Airport Code Table:** This is a simple table of airport codes and corresponding cities. The airport codes may refer to either IATA airport code, a three-letter code which is used in passenger reservation, ticketing and baggage-handling systems, or the ICAO airport code which is a four letter code used by ATC systems and for airports that do not have an IATA airport code. 
A detailed data dictionary is provided at the end of this documents.

**Why using data lakes over data warehouse?**
> *The idea here is to use these raw data files to build refined data lakes tables. The benefit of data lakes over data ware house is that it has more flexibility in terms of data availability, storage and data management. In the data lakes provides the data in more manageable way, at the same time it does not modify the raw data too much, by having this it allows the analaytics team more flexilbility and they can mold the data as per their requirement.*  

**How end solution look like?**
> *So final product is a cloud based data lake having following tables:*

**What tool do I use?**
> *I used following tools in this project:*
  * Apache Spark
  * Amazon S3
  * *

#### 1. Importing Immigration dataset:
*Immigration dataset if quite large in size so its better to use **Apache Spark** to read the files in this case.*

In [2]:
%%time
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()
df_immig =spark.read.format('com.github.saurfang.sas.spark').option("inferSchema", "true").option("dateFormat", "yyyyMMdd").\
load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

CPU times: user 40.8 ms, sys: 14.1 ms, total: 54.9 ms
Wall time: 16.3 s


In [3]:
#write to parquet
#df_spark.write.parquet("sas_data")
#df_spark=spark.read.parquet("sas_data")

In [4]:
df_immig.take(2)

[Row(cicid=6.0, i94yr=2016.0, i94mon=4.0, i94cit=692.0, i94res=692.0, i94port='XXX', arrdate=20573.0, i94mode=None, i94addr=None, depdate=None, i94bir=37.0, i94visa=2.0, count=1.0, dtadfile=None, visapost=None, occup=None, entdepa='T', entdepd=None, entdepu='U', matflag=None, biryear=1979.0, dtaddto='10282016', gender=None, insnum=None, airline=None, admnum=1897628485.0, fltno=None, visatype='B2'),
 Row(cicid=7.0, i94yr=2016.0, i94mon=4.0, i94cit=254.0, i94res=276.0, i94port='ATL', arrdate=20551.0, i94mode=1.0, i94addr='AL', depdate=None, i94bir=25.0, i94visa=3.0, count=1.0, dtadfile='20130811', visapost='SEO', occup=None, entdepa='G', entdepd=None, entdepu='Y', matflag=None, biryear=1991.0, dtaddto='D/S', gender='M', insnum=None, airline=None, admnum=3736796330.0, fltno='00296', visatype='F1')]

In [5]:
#df_immig.filter('i94yr="2016.0"').show(2)

#### 2. Importing world_temperature dataset:
*World temperature dataset if quite large in size so its better to use **Apache Spark** to read the files in this case.*

In [6]:
# %%time
# Read in the data here
# df_temp=pd.read_csv('../../data2/GlobalLandTemperaturesByCity.csv')

In [7]:
%%time
df_temp=spark.read.format('csv').option('header','True').\
load('../../data2/GlobalLandTemperaturesByCity.csv')

CPU times: user 1.89 ms, sys: 0 ns, total: 1.89 ms
Wall time: 2.34 s


In [8]:
df_temp.show(2)

+----------+------------------+-----------------------------+-----+-------+--------+---------+
|        dt|AverageTemperature|AverageTemperatureUncertainty| City|Country|Latitude|Longitude|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
|1743-11-01|             6.068|           1.7369999999999999|Århus|Denmark|  57.05N|   10.33E|
|1743-12-01|              null|                         null|Århus|Denmark|  57.05N|   10.33E|
+----------+------------------+-----------------------------+-----+-------+--------+---------+
only showing top 2 rows



#### 3. Importing airport_code dataset:
*airport_code dataset is not that much large in size its better to use **Pandas** to read the files in this case.*

In [9]:
# Read in the data here
#df_port=pd.read_csv('airport-codes_csv.csv')

In [10]:
%%time
df_port=spark.read.format('csv').option('header','True').load('airport-codes_csv.csv')

CPU times: user 2.14 ms, sys: 0 ns, total: 2.14 ms
Wall time: 510 ms


In [11]:
df_port.show(2)

+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|ident|         type|                name|elevation_ft|continent|iso_country|iso_region|municipality|gps_code|iata_code|local_code|         coordinates|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
|  00A|     heliport|   Total Rf Heliport|          11|       NA|         US|     US-PA|    Bensalem|     00A|     null|       00A|-74.9336013793945...|
| 00AA|small_airport|Aero B Ranch Airport|        3435|       NA|         US|     US-KS|       Leoti|    00AA|     null|      00AA|-101.473911, 38.7...|
+-----+-------------+--------------------+------------+---------+-----------+----------+------------+--------+---------+----------+--------------------+
only showing top 2 rows



In [12]:
#df_port.query("iata_code=='MAA'")

#### 4. Importing us_cities_demographics dataset:
*us_cities_demographics dataset is not that much large in size its better to use **Pandas** to read the files in this case.*

In [13]:
# Read in the data here
df_demog=spark.read.format('csv').option('header','True').option('delimiter',';').\
load('us-cities-demographics.csv')

In [14]:
df_demog.show(2)

+-------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+------------------+-----+
|         City|        State|Median Age|Male Population|Female Population|Total Population|Number of Veterans|Foreign-born|Average Household Size|State Code|              Race|Count|
+-------------+-------------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+------------------+-----+
|Silver Spring|     Maryland|      33.8|          40601|            41862|           82463|              1562|       30908|                   2.6|        MD|Hispanic or Latino|25924|
|       Quincy|Massachusetts|      41.0|          44129|            49500|           93629|              4147|       32935|                  2.39|        MA|             White|58723|
+-------------+-------------+----------+---------------+-----------------+-----------

### Step 2: Explore and Assess the Data




### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### 1. Exploring Immigration dataset:

In [15]:
# Checking for missing values
print("Number of Columns: {}".format(len(df_immig.columns)))
print("Number of Rows: {}".format(df_immig.count()))

Number of Columns: 28
Number of Rows: 3096313


In [16]:
# Drop duplicates value
df_immig=df_immig.dropDuplicates()

In [17]:
# Number of rows after removing duplicates
print("Number of Rows: {}".format(df_immig.count()))

Number of Rows: 3096313


In [18]:
# Describe
#df_immig.describe().show()

In [19]:
df_immig.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

#### 2. Exploring world_temperature dataset:

In [20]:
#Shape of the Dataset
print("Number of Columns: {}".format(len(df_temp.columns)))
print("Number of Rows: {}".format(df_temp.count()))

Number of Columns: 7
Number of Rows: 8599212


In [21]:
#Shape of the dataset
df_temp.printSchema()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: string (nullable = true)
 |-- AverageTemperatureUncertainty: string (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)



In [22]:
# Checking missing values
# df_temp.describe()

In [23]:
# Drop Duplicates Values
# df_port.drop_duplicates(inplace=True)

#### 3. Exploring airport_code dataset:

In [24]:
# Shape of the dataset
# df_port.shape

In [25]:
# Checking missing values
df_port.printSchema()

root
 |-- ident: string (nullable = true)
 |-- type: string (nullable = true)
 |-- name: string (nullable = true)
 |-- elevation_ft: string (nullable = true)
 |-- continent: string (nullable = true)
 |-- iso_country: string (nullable = true)
 |-- iso_region: string (nullable = true)
 |-- municipality: string (nullable = true)
 |-- gps_code: string (nullable = true)
 |-- iata_code: string (nullable = true)
 |-- local_code: string (nullable = true)
 |-- coordinates: string (nullable = true)



In [26]:
# Drop Duplicates Values
# df_port.drop_duplicates(inplace=True)

#### 4. Exploring us_cities_demographics dataset:

In [27]:
# Shape of the dataset
#df_demog.shape

In [28]:
# Checking for missing values
df_demog.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: string (nullable = true)
 |-- Male Population: string (nullable = true)
 |-- Female Population: string (nullable = true)
 |-- Total Population: string (nullable = true)
 |-- Number of Veterans: string (nullable = true)
 |-- Foreign-born: string (nullable = true)
 |-- Average Household Size: string (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: string (nullable = true)



In [29]:
# Dropping duplicates
#df_demog.drop_duplicates(inplace=True)

#### Cleaning Steps
Document steps necessary to clean the data

#### 1. Cleaning Immigration dataset:

In [30]:
#Cleaning of Immigration dataset
df_immig=df_immig.withColumn("cicid", df_immig["cicid"].cast('integer'))
df_immig=df_immig.withColumn("i94yr", df_immig["i94yr"].cast('integer'))
df_immig=df_immig.withColumn("i94mon", df_immig["i94mon"].cast('integer'))
df_immig=df_immig.withColumn("i94cit", df_immig["i94cit"].cast('integer'))
df_immig=df_immig.withColumn("i94res", df_immig["i94res"].cast('integer'))
df_immig=df_immig.withColumn("i94mode", df_immig["i94mode"].cast('integer'))
df_immig=df_immig.withColumn("i94bir", df_immig["i94bir"].cast('integer'))
df_immig=df_immig.withColumn("i94visa", df_immig["i94visa"].cast('integer'))
df_immig=df_immig.withColumn("biryear", df_immig["biryear"].cast('integer'))
df_immig=df_immig.withColumn("admnum", df_immig["admnum"].cast('integer'))
df_immig=df_immig.withColumn("count", df_immig["count"].cast('integer'))
df_immig=df_immig.withColumn("arrdate", df_immig["arrdate"].cast('integer'))
df_immig=df_immig.withColumn("depdate", df_immig["depdate"].cast('integer'))

In [31]:
#df_immig=df_immig.select("arrdate", timedelta(days=df_immig["arrdate"])+datetime(1960,1,1))

In [32]:
# convert_date = udf(lambda x: date_add(to_date('1960-01-01'),x))
# df_immig = df_immig.withColumn('arrdate',convert_date(df_immig.arrdate) )
#df_immig = df_immig.withColumn('arrdate', date_add(to_date('1960-01-01'),'arrdate'))

In [33]:
# convert_date_udf = udf(lambda x: datetime.datetime(1960, 1, 1)+datetime.timedelta(days=x))
# df_immig = df_immig.withColumn('arrdate', convert_date_udf('arrdate').alias('arrdate'))

In [34]:
# import pyspark.sql.functions as F
# df_dc = spark.createDataFrame([['1960-01-01']], ['report_date'])
# df_dc.withColumn('report_date_10', F.date_add(df_dc['report_date'],20566)).show()

In [35]:
# def convert_date(x):
#     mDt = datetime(1960, 1, 1)
#     dlt = mDt + timedelta(days=x)
#     return dlt.strftime("%Y-%m-%d")

# convert_date_udf = udf(lambda z: convert_date(z), StringType())
# df_immig = df_immig.withColumn('arrdate', convert_date_udf('arrdate').alias('arrdate')).collect()

In [36]:
#df_immig.show(2)

In [37]:
df_immig.printSchema()

root
 |-- cicid: integer (nullable = true)
 |-- i94yr: integer (nullable = true)
 |-- i94mon: integer (nullable = true)
 |-- i94cit: integer (nullable = true)
 |-- i94res: integer (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: integer (nullable = true)
 |-- i94mode: integer (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: integer (nullable = true)
 |-- i94bir: integer (nullable = true)
 |-- i94visa: integer (nullable = true)
 |-- count: integer (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: integer (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: integer

#### 2. Cleaning world_temperature dataset:

In [38]:
#Cleaning of Immigration dataset
df_temp=df_temp.withColumn("AverageTemperature", df_temp["AverageTemperature"].cast('float'))
df_temp=df_temp.withColumn("AverageTemperatureUncertainty", df_temp["AverageTemperatureUncertainty"].cast('float'))
df_temp=df_temp.withColumn("Latitude", df_temp["Latitude"].cast('float'))
df_temp=df_temp.withColumn("Longitude", df_temp["Longitude"].cast('float'))

In [39]:
df_temp.printSchema()

root
 |-- dt: string (nullable = true)
 |-- AverageTemperature: float (nullable = true)
 |-- AverageTemperatureUncertainty: float (nullable = true)
 |-- City: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Latitude: float (nullable = true)
 |-- Longitude: float (nullable = true)



In [40]:
df_temp=df_temp.filter('Country="United States"')

In [41]:
# df_temp.filter('dt>="2013-01-01"').show(2)

In [42]:
#Shape of the Dataset
print("Number of Columns: {}".format(len(df_temp.columns)))
print("Number of Rows: {}".format(df_temp.count()))

Number of Columns: 7
Number of Rows: 687289


In [43]:
# Performing cleaning tasks here
# import pycountry
# df_temp=df_temp.withColumn("AverageTemperature", df_temp["AverageTemperature"].cast('float'))


#### 3. Cleaning airport_code dataset:

#### 4. Cleaning us_demographic dataset:

In [44]:
df_demog.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: string (nullable = true)
 |-- Male Population: string (nullable = true)
 |-- Female Population: string (nullable = true)
 |-- Total Population: string (nullable = true)
 |-- Number of Veterans: string (nullable = true)
 |-- Foreign-born: string (nullable = true)
 |-- Average Household Size: string (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: string (nullable = true)



In [45]:
df_demog=df_demog.drop('Number of Veterans','Foreign-born','Race','Count')

In [46]:
df_demog.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: string (nullable = true)
 |-- Male Population: string (nullable = true)
 |-- Female Population: string (nullable = true)
 |-- Total Population: string (nullable = true)
 |-- Average Household Size: string (nullable = true)
 |-- State Code: string (nullable = true)



In [47]:
#Cleaning of Demographic dataset
df_demog=df_demog.withColumn("Median Age", df_demog["Median Age"].cast('float'))
df_demog=df_demog.withColumn("Male Population", df_demog["Male Population"].cast('integer'))
df_demog=df_demog.withColumn("Female Population", df_demog["Female Population"].cast('integer'))
df_demog=df_demog.withColumn("Total Population", df_demog["Total Population"].cast('integer'))
df_demog=df_demog.withColumn("Average Household Size", df_demog["Average Household Size"].cast('float'))

In [48]:
df_demog=df_demog.select(col("Median Age").alias("median_age"),col("Male Population").alias("male"),\
                col("Female Population").alias("female"),col("Total Population").alias("total"),\
                col("Average Household Size").alias("avg_household_size"),\
                col("City").alias("city"),col("State").alias("state"),\
                col("State Code").alias("state_code"))

In [49]:
df_demog.groupBy('city','state').avg()

DataFrame[city: string, state: string, avg(median_age): double, avg(male): double, avg(female): double, avg(total): double, avg(avg_household_size): double]

In [50]:
df_demog.show(2)

+----------+-----+------+-----+------------------+-------------+-------------+----------+
|median_age| male|female|total|avg_household_size|         city|        state|state_code|
+----------+-----+------+-----+------------------+-------------+-------------+----------+
|      33.8|40601| 41862|82463|               2.6|Silver Spring|     Maryland|        MD|
|      41.0|44129| 49500|93629|              2.39|       Quincy|Massachusetts|        MA|
+----------+-----+------+-----+------------------+-------------+-------------+----------+
only showing top 2 rows



### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

#### Using Star Schema:
##### Dimension Tables:
> 1. Immigrant Table
    * Columns: gender, biryear, occup, i94bir
##### Fact Table:

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

#### Dimension Tables : 

##### 1. Immigrant Table

**Immigrant Table Schema:**
> *CREATE TABLE IF NOT EXISTS 
      i94( cicid INTEGER PRIMARY KEY, 
           gender STRING,
           biryear INTEGER,
           occup STRING,
           i94bir INTEGER,
           visatype STRING,
           dtaddto STRING)*

In [51]:
# Write code here
immigrant_table=df_immig.select(['cicid','gender','biryear','occup','i94bir','visatype','dtaddto'])

In [52]:
immigrant_table.write.mode('overwrite').option('compression','snappy').parquet("immigrant")

In [53]:
df_imm=spark.read.option('compression','snappy').parquet("immigrant")

In [54]:
# Checking for missing values
print("Number of Columns: {}".format(len(df_imm.columns)))
print("Number of Rows: {}".format(df_imm.count()))

Number of Columns: 7
Number of Rows: 3096313


In [55]:
df_imm.show(5)

+-----+------+-------+-----+------+--------+--------+
|cicid|gender|biryear|occup|i94bir|visatype| dtaddto|
+-----+------+-------+-----+------+--------+--------+
|  118|     F|   1963| null|    53|      WT|06292016|
|  165|     F|   1956| null|    60|      WT|06292016|
|  188|     M|   1970| null|    46|      WT|06292016|
|  225|     F|   1958| null|    58|      WT|06292016|
|  326|     M|   1943| null|    73|      WT|06292016|
+-----+------+-------+-----+------+--------+--------+
only showing top 5 rows



##### 2. Airline Table

In [56]:
df_airline.printSchema()

NameError: name 'df_airline' is not defined

**I94 Table Schema:**
> *CREATE TABLE IF NOT EXISTS 
      i94( cicid INTEGER PRIMARY KEY, 
           i94yr INTEGER,
           i94mon INTEGER,
           i94res INTEGER,
           i94bir INTEGER,
           i94port INTEGER,
           i94mode INTEGER,
           i94addr STRING,
           i94visa INTEGER,
           dtafile STRING,
           dtaddto STRING)*

In [None]:
# Write code here
airline_table=df_immig.select(['cicid','airline','fltno','occup','i94bir','visatype'])

In [None]:
airline_table.write.mode('overwrite').option('compression','snappy').parquet("airline")

In [None]:
df_airline=spark.read.option('compression','snappy').parquet("airline")

In [None]:
# Checking for missing values
print("Number of Columns: {}".format(len(df_airline.columns)))
print("Number of Rows: {}".format(df_airline.count()))

In [None]:
df_airline.show(5)

##### 3. i94 Table

**I94 Table Schema:**
> *CREATE TABLE IF NOT EXISTS 
      i94( cicid INTEGER PRIMARY KEY, 
           i94yr INTEGER,
           i94mon INTEGER,
           i94res INTEGER,
           i94bir INTEGER,
           i94port INTEGER,
           i94mode INTEGER,
           i94addr STRING,
           i94visa INTEGER,
           dtafile STRING,
           dtaddto STRING)*

In [None]:
# Write code here
i94_table=df_immig.select(['cicid','i94yr','i94mon','i94res','i94bir','i94port','i94mode','i94addr',\
                           'i94visa','dtadfile','dtaddto'])

In [None]:
i94_table.write.mode('overwrite').option('compression','snappy').parquet("i94")

In [None]:
df_i94=spark.read.option('compression','snappy').parquet("i94")

In [None]:
# Checking for missing values
print("Number of Columns: {}".format(len(df_i94.columns)))
print("Number of Rows: {}".format(df_i94.count()))

In [None]:
df_i94.show(5)

##### 4. Population Table

**Population Table Schema:**
> *CREATE TABLE IF NOT EXISTS 
      population( median_age FLOAT, 
                  male INTEGER,
                  female INTEGER,
                  total INTEGER,
                  avg_household_size FLOAT,
                  city VARCHAR NOT NULL, 
                  state VARCHAR NOT NULL,
                  state_code VARCHAR NOT NULL,
                  PRIMARY KEY (city,state,state_code))*
                          


In [None]:
df_demog.printSchema()

In [None]:
df_demog.write.mode('overwrite').option('compression','snappy').parquet("population")

In [None]:
df_pop=spark.read.option('compression','snappy').parquet("population")

In [None]:
# Checking for missing values
print("Number of Columns: {}".format(len(df_pop.columns)))
print("Number of Rows: {}".format(df_pop.count()))

In [None]:
df_pop.filter('state="Maryland" and city="Columbia"').show(5)

##### 5. Temperature Table

**Temperature Table Schema:**
> *CREATE TABLE IF NOT EXISTS 
      temperature(date date NOT NULL, 
                  city VARCHAR NOT NULL,
                  country VARCHAR NOT NULL, 
                  avg_temp FLOAT, 
                  avg_temp_uncertainty FLOAT,
                  latitude FLOAT,
                  longitude FLOAT,
                  PRIMARY KEY (date,city,country))*
                          


In [None]:
df_temp.printSchema()

In [None]:
df_temp.write.mode('overwrite').option('compression','snappy').parquet("temperature")

In [None]:
df_temp=spark.read.option('compression','snappy').parquet("temperature")

In [None]:
# Checking for missing values
print("Number of Columns: {}".format(len(df_temp.columns)))
print("Number of Rows: {}".format(df_temp.count()))

In [None]:
df_temp.show(5)

#### Fact Table

##### Immigration

In [None]:
df_immig.createOrReplaceTempView("immig")
df_airline.createOrReplaceTempView("airline")
df_i94.createOrReplaceTempView("i94")
df_port.createOrReplaceTempView("port")

In [None]:
immigration_table = spark.sql("""SELECT i.cicid,i.gender,i.visatype,i.dtaddto,b.i94port,b.i94mode,p.name FROM immig  i
                                 LEFT OUTER JOIN i94 b ON b.cicid=i.cicid
                                 LEFT OUTER JOIN airline a ON i.cicid=a.cicid
                                 LEFT OUTER JOIN port p ON b.i94port=p.iata_code""")

In [None]:
immigration_table.write.mode('overwrite').option('compression','snappy').parquet("immigration")

In [None]:
df_airline=spark.read.option('compression','snappy').parquet("immigration")

In [None]:
# Checking for missing values
print("Number of Columns: {}".format(len(df_airline.columns)))
print("Number of Rows: {}".format(df_airline.count()))

In [None]:
df_airline.show(5)

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
_Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file._

#### Immigration Data:
~~~
  * cicid    -  
  * i94yr    - 4 digit year
  * i94mon   - Numeric month
  * i94cit   - This format shows all the valid and invalid codes for processing
  * i94res   - This format shows all the valid and invalid codes for processing
  * i94port  - This format shows all the valid and invalid codes for processing
  * arrdate  - is the Arrival Date in the USA. It is a SAS date numeric field that a 
               permament format has not been applied.  Please apply whichever date format 
               works for you.
  * i94mode  - There are missing values as well as not reported (9)
               	1 = 'Air'
                2 = 'Sea'
                3 = 'Land'
                9 = 'Not reported' ;   
  * i94addr  - There is lots of invalid codes in this variable and the list below 
               shows what we have found to be valid, everything else goes into 'other'
  * depdate  - is the Departure Date from the USA. It is a SAS date numeric field that 
               a permament format has not been applied.  Please apply whichever date format 
               works for you.
  * i94bir   - Age of Respondent in Years
  * i94visa  - Visa codes collapsed into three categories:
               1. 1 = Business
               2. 2 = Pleasure
               3. 3 = Student
  * count    - Used for summary statistics
  * dtadfile - Character Date Field - Date added to I-94 Files
  * visapost - Department of State where where Visa was issued
  * occup    - Occupation that will be performed in U.S.
  * entdepa  - Arrival Flag - admitted or paroled into the U.S.
  * entdepd  - Departure Flag - Departed, lost I-94 or is deceased
  * entdepu  - Update Flag - Either apprehended, overstayed, adjusted to perm residence
  * matflag  - Match flag - Match of arrival and departure records
  * biryear  - 4 digit year of birth
  * dtaddto  - Character Date Field - Date to which admitted to U.S. (allowed to stay until)
  * gender   - Non-immigrant sex
  * insnum   - INS number
  * airline  - Airline used to arrive in U.S.
  * admnum   - Admission Number
  * fltno    - Flight number of Airline used to arrive in U.S.
  * visatype - Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
  
 ~~~

#### World Temperature Data:
```
* dt
* AverageTemperature
* AverageTemperatureUncertainty
* City
* Country
* Latitude
* Longitude
```

#### U.S. City Demographic Data:
```
* City
* State
* Median Age
* Male Population
* Female Population
* Total Population
* Number of Veterans
* Foreign-born
* Average Household Size
* State Code
* Race
* Count
```

#### Airport Codes:
```
* ident         - 
* type          - Type of airport e.g. 'small_airport', 'medium_airport', 'large_airport' etc
* name          - Airport full name.
* elevation_ft  - Elevation of the port measured in feet.  
* continent     - Continet code (e.g 'AS' for Asia,'NA' for North America)
* iso_country   - Two letter ISO country code assigned by International Organization for Standardization
* iso_region    - iso region code assigned by ISO.
* municipality  - Locality Municipality.  
* gps_code      - Global Positioning System code.
* iata_code     - IATA(Internation Air Transport Association) airport code, also known as IATA 
                  location identifier is a three letter code assigned to airports around the world.         
* local_code    - Similar to gps_code.
* coordinates   - (x,y) co-ordinates of the location.
```       

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

### References:
1. https://travel.trade.gov/research/reports/i94/historical/2016.html
2. https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data
3. https://public.opendatasoft.com/explore/dataset/us-cities-demographics/information/
4. https://stackoverflow.com/questions/51830697/convert-date-from-integer-to-date-format