# ETL Pipeline for Immigration Data
## Project Summary
The goal of this project is to build an ETL pipeline that combines "I94 immigration data", "US city demographic data" and "World Temperature Data" into a data warehouse, stored in Parquet format, that is optimized for queries about immigration behavior.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
import pandas as pd
import os
import re

from pyspark.sql.functions import udf

## Scope of the project and dataset description

### Scope of the project
In this project, we combine the "US city demographic data" with the "World Temperature Data" to form the first dimension table, `city_us_table`. The second dimension table, `immigrants_table`, is built from the per-person attributes in the I94 immigration data. The immigration events are then joined with the city data on the destination port code (`i94port`) to form the fact table, `immigration_table`. The final database is optimized for queries on immigration events, for example to determine whether temperature affects the choice of destination city. Spark is used to process the data.
This data model helps us answer questions such as:
* Do immigrants prefer places with a higher or lower foreign-born population?
* Do immigrants prefer regions with higher or lower temperatures?
* Do immigrants prefer big or small cities?
* What are the general characteristics of the destination cities?

The datasets used in this project are described in the next section.

PySpark is used to process the data, but we first use pandas to perform exploratory analysis.

### Datasets Used
The I94 immigration data comes from the US National Tourism and Trade Office. It is provided in SAS7BDAT format, a binary database storage format. Some relevant attributes include:

- i94yr: 4-digit year
- i94mon: numeric month
- i94cit: 3-digit code of origin city
- i94port: 3-character code of destination city in the US
- arrdate: arrival date in the USA
- i94mode: numeric code for mode of travel (air, land, sea or not reported)
- depdate: departure date from the USA
- i94visa: numeric visa code (business, pleasure or student)
- occup: occupation that will be performed in the US
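
The `arrdate` and `depdate` columns hold numbers such as `20545.0` rather than calendar dates, because SAS stores a date as a count of days since 1 January 1960. The following is a minimal sketch of that conversion; the helper name `sas_days_to_date` is hypothetical and does not appear in the notebook.

```python
from datetime import date, timedelta

# SAS stores dates as the number of days since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)

def sas_days_to_date(days):
    """Convert a SAS numeric date to a datetime.date; missing values give None."""
    # None and NaN (NaN is the only value not equal to itself) mean "no date".
    if days is None or days != days:
        return None
    return SAS_EPOCH + timedelta(days=int(days))

print(sas_days_to_date(20545.0))  # 2016-04-01
```

The same function could be wrapped in a Spark UDF if readable dates are wanted in the final tables.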

The "US city demographic data" comes from Opensoft. It is provided in CSV format. Some relevant attributes include:
- city: city name
- state: state name
- total population: population of the city
- race: primary race of the population living in the city
- average_household_size: average household size in the city
- foreign_born: number of foreign-born residents in the city

The "World Temperature Data" comes from Kaggle. It is also provided in CSV format. Some relevant attributes include:
- AverageTemperature: average temperature of city
- City: city name
- Country: country name
- Latitude: latitude
- Longitude: longitude

### Reading the data & Preliminary check for NaN values

In [2]:
demographics_data = pd.read_csv("us-cities-demographics.csv",sep=";")
temperature_data = pd.read_csv('../../data2/GlobalLandTemperaturesByCity.csv')
# First read a sample of the immigration data to see what kind of data the dataset contains
immigration_data = pd.read_csv("immigration_data_sample.csv")

In [3]:
print(demographics_data.columns)
print("NAN values containing columns")
print(demographics_data.isna().any()[lambda x: x])
demographics_data.head()

Index(['City', 'State', 'Median Age', 'Male Population', 'Female Population',
       'Total Population', 'Number of Veterans', 'Foreign-born',
       'Average Household Size', 'State Code', 'Race', 'Count'],
      dtype='object')
NAN values containing columns
Male Population           True
Female Population         True
Number of Veterans        True
Foreign-born              True
Average Household Size    True
dtype: bool


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [4]:
len(demographics_data.State.unique())

49

In [5]:
print(temperature_data.columns)
print("NAN values containing columns")
print(temperature_data.isna().any()[lambda x: x])
temperature_data.head()

Index(['dt', 'AverageTemperature', 'AverageTemperatureUncertainty', 'City',
       'Country', 'Latitude', 'Longitude'],
      dtype='object')
NAN values containing columns
AverageTemperature               True
AverageTemperatureUncertainty    True
dtype: bool


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [6]:
print(len(temperature_data.Country.unique()))

159


In [7]:
print(immigration_data.columns)
print("NAN values containing columns")
print(immigration_data.isna().any()[lambda x: x])
immigration_data.head()

Index(['Unnamed: 0', 'cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port',
       'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa',
       'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd',
       'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum',
       'airline', 'admnum', 'fltno', 'visatype'],
      dtype='object')
NAN values containing columns
i94addr     True
depdate     True
visapost    True
occup       True
entdepd     True
entdepu     True
matflag     True
gender      True
insnum      True
airline     True
fltno       True
dtype: bool


Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,...,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,...,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT


We plan to use the `i94port` column as the join key between the datasets, and the data description mentions that its values are not consistent, so we list the values the sample contains.

In [8]:
immigration_data.i94port.unique()

array(['HHW', 'MCA', 'OGG', 'LOS', 'CHM', 'ATL', 'SFR', 'NYC', 'CHI',
       'PHI', 'FTL', 'BOS', 'SAI', 'NAS', 'SEA', 'ORL', 'PSP', 'HOU',
       'NEW', 'BAL', 'SNJ', 'DET', 'AGA', 'LVG', 'MIA', 'SDP', 'VCV',
       'DUB', 'PEM', 'TAM', 'BLA', 'WAS', 'KOA', 'DAL', 'SHA', 'SPM',
       'NIA', 'PHR', 'MIL', 'SLC', 'CLT', 'EPI', 'SNA', 'MON', 'DLR',
       'SFB', 'OPF', 'X96', 'CLM', 'LIH', 'DEN', 'PHO', 'POO', 'NOL',
       'WPB', 'PBB', 'TOR', 'MAA', 'RNO', 'FMY', 'HIG', 'OAK', 'OTM',
       'ONT', 'SRQ', 'LLB', 'NCA', 'SUM', 'STR', 'HAM'], dtype=object)

All the values appear well formed except "X96", but that may simply reflect the small sample size. We list the unique values in a larger sample to explore the problem properly.

In [9]:
immigration_data_segment = pd.read_sas('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat', 'sas7bdat', encoding="ISO-8859-1")

In [10]:
immigration_data_segment.i94port.unique()

array(['XXX', 'ATL', 'WAS', 'NYC', 'TOR', 'BOS', 'HOU', 'MIA', 'CHI',
       'LOS', 'CLT', 'DEN', 'DAL', 'DET', 'NEW', 'FTL', 'LVG', 'ORL',
       'NOL', 'PIT', 'SFR', 'SPM', 'POO', 'PHI', 'SEA', 'SLC', 'TAM',
       'HAM', 'NAS', 'VCV', 'MAA', 'AUS', 'HHW', 'OGG', 'PHO', 'SDP',
       'SFB', 'EDA', 'MON', 'CLG', 'DUB', 'FMY', 'YGF', 'SAJ', 'CIN',
       'BAL', 'RDU', 'WPB', 'STT', 'OAK', 'NSV', 'SNA', 'OTT', 'X96',
       '5KE', 'CLE', 'HAR', 'PSP', 'CHR', 'HAL', 'SAA', 'KOA', 'SHA',
       'WIN', 'BGM', 'NCA', 'OPF', 'SAI', 'JFA', 'AGA', 'ONT', 'CLM',
       'STL', 'W55', 'CHS', 'SNJ', 'SRQ', 'ANC', 'LNB', 'LIH', 'MIL',
       'INP', 'KAN', 'ROC', 'SAC', 'BRO', 'LAR', 'RNO', 'SGR', 'ELP',
       'MCA', 'MDT', 'SPE', 'FPR', 'SYR', 'ICT', 'MLB', 'ADS', 'TUC',
       'DLR', 'CAE', 'CHA', 'HSV', 'WIL', 'HPN', 'HEF', 'BRG', 'BED',
       'DAB', 'JAC', 'FRB', 'SWF', 'KEY', 'PTK', 'MWH', 'X44', 'MYR',
       'APF', 'ATW', 'PVD', 'BUF', 'PIE', 'MHT', 'BDL', 'NYL', 'VNY',
       '5T6', 'LEX',

The dataset contains ill-formed values such as 'XXX', 'X96', '5T6', 'ML8' and 'NC8' in the `i94port` column. We will have to clean them before modeling the data.

We also need a unique id for each immigration event, and the `cicid` column looks promising. We can check its uniqueness by testing whether `len(df) == len(df.cicid.unique())`.

In [11]:
len(immigration_data_segment)

3096313

In [12]:
len(immigration_data_segment.cicid.unique())

3096313

Next, we create a dictionary called `code_city_dict` from the SAS label description file. It maps every valid port code to its city name, and the reverse mapping is stored in `city_code_dict`.

In [13]:
# Create dictionaries of valid i94port codes: code -> city and city -> code
code_city_dict = {}
city_code_dict = {}
r_exp = re.compile(r'\'(.*)\'.*\'(.*)\'')
with open('portcode_city.txt') as f:
    for line in f:
        match = r_exp.search(line)
        # Skip blank or malformed lines instead of failing on match[1]
        if match is None:
            continue
        code_city_dict[match[1]] = match[2].strip()
        city_code_dict[match[2].strip()] = match[1]


# Testing the dictionary format
#code_city_dict
#city_code_dict
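
To make the expected file format concrete, here is a small sketch that applies the same regular expression to a few sample lines. The line layout (`'CODE' = 'CITY, STATE'` with padding inside the quotes) is an assumption about `portcode_city.txt`, and the helper name `parse_port_lines` is hypothetical.

```python
import re

# Same pattern as the notebook: first quoted token is the code, second is the city.
r_exp = re.compile(r"'(.*)'.*'(.*)'")

# Hypothetical sample lines in the assumed "'CODE' = 'CITY, STATE'" layout.
sample_lines = [
    "'SEA'\t=\t'SEATTLE, WA            '",
    "'NYC'\t=\t'NEW YORK, NY           '",
    "",  # blank line: no match, so it is skipped
]

def parse_port_lines(lines):
    """Return {code: city} for lines that match the pattern; skip the rest."""
    mapping = {}
    for line in lines:
        match = r_exp.search(line)
        if match is None:
            continue
        mapping[match.group(1)] = match.group(2).strip()
    return mapping

print(parse_port_lines(sample_lines))
# {'SEA': 'SEATTLE, WA', 'NYC': 'NEW YORK, NY'}
```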

In [14]:
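@udf()
def get_portcode_from_city(city):
    '''
    Input: City name
    Output: Corresponding i94port, or None if no port label contains the city name
    '''
    # Null city names cannot be matched
    if city is None:
        return None
    # Note: substring matching returns the first label that contains the name,
    # so short city names can match unrelated ports
    for key in code_city_dict:
        if city.lower() in code_city_dict[key].lower():
            return key
    return None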
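
Because the lookup tests whether the city name is a substring of the port label, a short name can match an unrelated port; the city table further down pairs "Ica" with `CHI`, which appears to be such a case. The sketch below contrasts the substring rule with a stricter exact match on the part of the label before the comma. The two-entry dictionary and both function names are hypothetical and for illustration only.

```python
# Hypothetical two-entry excerpt of code_city_dict, for illustration only.
sample_code_city = {"CHI": "CHICAGO, IL", "SEA": "SEATTLE, WA"}

def substring_lookup(city, code_city):
    """Mirror of the notebook's rule: city is a substring of the port label."""
    for code, label in code_city.items():
        if city.lower() in label.lower():
            return code
    return None

def exact_lookup(city, code_city):
    """Stricter rule: city must equal the label text before the first comma."""
    if city is None:
        return None
    wanted = city.strip().lower()
    for code, label in code_city.items():
        if label.split(",")[0].strip().lower() == wanted:
            return code
    return None

print(substring_lookup("Ica", sample_code_city))  # CHI (false positive)
print(exact_lookup("Ica", sample_code_city))      # None
print(exact_lookup("Chicago", sample_code_city))  # CHI
```

The stricter rule trades recall for precision: it would miss cities whose port label is spelled differently, so which rule fits best depends on how clean the label file is.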

In [35]:
# Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()
df_spark =spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')


### Step 2: Explore and Assess the Data
#### Cleaning Steps
For the I94 immigration data, we drop all entries where the destination city code `i94port` is not a valid value (e.g., XXX, 99, etc.) as described in I94_SAS_Labels_Description.SAS.
For the data used in the project, we drop duplicate entries and null values, and then add the `i94port` code of the location to each entry. Each dataframe is cleaned and processed individually, as shown below.

##### Cleaning Immigration data

In [17]:
def remove_invalid_ports(file_path):
    '''
    Args: 
    file_path: Path to I94 immigration data
    Returns: 
    Spark dataframe of the data with valid i94port
    '''
    imm_dataframe = spark.read.format('com.github.saurfang.sas.spark').load(file_path)
    # Filter entries where i94port is invalid
    valid_imm_dataframe = imm_dataframe.filter(imm_dataframe.i94port.isin(list(code_city_dict.keys())))

    return valid_imm_dataframe

immigration_path = '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
immigration_df = remove_invalid_ports(immigration_path)
# Keep one row per immigration event
immigration_df = immigration_df.dropDuplicates(['cicid'])
# Drop rows with a missing port code
immigration_df = immigration_df.filter(immigration_df.i94port.isNotNull())
immigration_df.show(3)

+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+
|cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|         admnum|fltno|visatype|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+---------------+-----+--------+
|299.0|2016.0|   4.0| 103.0| 103.0|    NYC|20545.0|    1.0|     NY|20550.0|  54.0|    2.0|  1.0|20160401|    null| null|      O|      O|   null|      M| 1962.0|06292016|  null|  null|     OS|5.5425872433E10|00087|      WT|
|305.0|2016.0|   4.0| 103.0| 103.0|    NYC|20545.0|    1.0|     NY|20555.0|  63.0|    2.0|  1.0|20160401|   

In [18]:
temperature_df = spark.read.format("csv").option("header", "true").load("../../data2/GlobalLandTemperaturesByCity.csv")
# Drop rows without a temperature reading; rows where the comparison
# evaluates to null (missing values) are dropped by filter() as well
temperature_df = temperature_df.filter(temperature_df.AverageTemperature != 'NaN')
# Remove duplicate locations
temperature_df = temperature_df.dropDuplicates(['City', 'Country'])
# Get corresponding port code
temperature_df = temperature_df.withColumn("i94port", get_portcode_from_city(temperature_df.City))
# Remove entries with no i94port code
temperature_df = temperature_df.filter(temperature_df.i94port.isNotNull())
# Show results
temperature_df.show(4)

+----------+------------------+-----------------------------+--------+-------------+--------+---------+-------+
|        dt|AverageTemperature|AverageTemperatureUncertainty|    City|      Country|Latitude|Longitude|i94port|
+----------+------------------+-----------------------------+--------+-------------+--------+---------+-------+
|1856-01-01|            26.901|                        1.359|     Ife|      Nigeria|   7.23N|    4.05E|    888|
|1852-07-01|            15.488|                        1.395|   Perth|    Australia|  31.35S|  114.97E|    PER|
|1828-01-01|            -1.977|                        2.551| Seattle|United States|  47.42N|  121.97W|    SEA|
|1743-11-01|             2.767|                        1.905|Hamilton|       Canada|  42.59N|   80.73W|    HAM|
+----------+------------------+-----------------------------+--------+-------------+--------+---------+-------+
only showing top 4 rows
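
Because `dropDuplicates(['City', 'Country'])` keeps a single arbitrary reading per city, the `AverageTemperature` carried forward is one month's measurement (the Seattle row above, for example, is dated 1828-01-01). If a long-run average per city is wanted instead, the readings can be aggregated. The sketch below shows the idea in pandas on made-up values; in Spark the equivalent would be a `groupBy` on city and country followed by an `avg` aggregation.

```python
import pandas as pd

# Made-up readings for illustration; the real file has one row per city per month.
readings = pd.DataFrame({
    "City": ["Seattle", "Seattle", "Seattle", "Perth"],
    "Country": ["United States"] * 3 + ["Australia"],
    "AverageTemperature": [-2.0, 10.0, None, 15.0],
})

# mean() skips missing values by default, so NaN rows need no separate filter.
city_avg = (
    readings.groupby(["City", "Country"], as_index=False)["AverageTemperature"]
    .mean()
)
print(city_avg)
```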



In [19]:
demographics_df=spark.read.format('com.databricks.spark.csv').options(header='true', inferschema='true', quote='"', delimiter=';').load('us-cities-demographics.csv')
# Remove duplicate locations
demographics_df=demographics_df.dropDuplicates(['City', 'State'])
# Get corresponding port name
demographics_df=demographics_df.withColumn("i94port", get_portcode_from_city(demographics_df.City))
# Remove entries with no i94port code
demographics_df=demographics_df.filter(demographics_df.i94port.isNotNull())
# Show results
demographics_df.show(4)

+-----------+------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+------+-------+
|       City| State|Median Age|Male Population|Female Population|Total Population|Number of Veterans|Foreign-born|Average Household Size|State Code|                Race| Count|i94port|
+-----------+------+----------+---------------+-----------------+----------------+------------------+------------+----------------------+----------+--------------------+------+-------+
| Cincinnati|  Ohio|      32.7|         143654|           154883|          298537|             13699|       16896|                  2.08|        OH|               White|162245|    CIN|
|Kansas City|Kansas|      33.4|          74606|            76655|          151261|              8139|       25507|                  2.71|        KS|Black or African-...| 40177|    KAN|
|     Dayton|  Ohio|      32.8|          66631|            73966|          

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Our schema contains three tables: one fact table and two dimension tables. The details of the tables are listed below:

Fact Table: `immigration_table`

Dimension Tables: `city_us_table` and `immigrants_table`

~~~~
city_us_table
 |-- city: name of city
 |-- state: name of state
 |-- port_code: code for city
 |-- total_population: total population of the city
 |-- no_of_veterans: no of veterans in the city
 |-- no_of_foreignborns: no of foreign born residents in the city
 |-- average_household_size: average household size in the city
 |-- race: dominant racial group in the city 
 |-- average_temperature: average temperature of the city (joined from temperature data)
~~~~

~~~~
immigrants_table
 |-- cicid: unique identifier for immigration/immigrants
 |-- birthdate: birth year of immigrant
 |-- gender: gender of immigrant
 |-- occupation: occupation immigrant adopts in US (preferably)
 |-- visa_mode: business, pleasure or student
 |-- mode_of_arrival: mode of arrival to US eg: air, land etc
 |-- arrival_date: arrival date of immigrant in US
~~~~

~~~~
immigration_table:
 |-- year: year of immigration
 |-- month: month of immigration
 |-- source_city: source city port
 |-- destination_city: destination city
 |-- mode_of_arrival: mode of arrival to US eg: air, land etc
 |-- average_temperature: average_temperature of US city
 |-- race: dominant racial group of destination city
 |-- foreign_born_no: total no of people in the city who were foreign born
 
~~~~

#### 3.2 Mapping Out Data Pipelines
The ETL process consists of the following steps:
1. Select the relevant columns from the immigration data for `immigrants_table`
2. Join `demographics_df` and `temperature_df` on `i94port` to get the average temperature
3. Create a new dataframe `city_us_table` by selecting the relevant columns from the result of step 2
4. Join the immigration data with the result of step 2 on the `i94port` column
5. Select the relevant columns from that join to create the fact table `immigration_table`
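
The join logic of the pipeline can be illustrated on tiny in-memory tables. This is only a sketch in pandas with made-up values (the notebook itself uses Spark dataframes), and the variable names `immigration`, `demographics`, `temperature`, `city` and `fact` are hypothetical.

```python
import pandas as pd

# Tiny made-up stand-ins for the three cleaned inputs.
immigration = pd.DataFrame({"cicid": [1.0, 2.0, 3.0], "i94port": ["SEA", "SEA", "XXX"]})
demographics = pd.DataFrame({"i94port": ["SEA"], "State": ["Washington"], "Foreign-born": [119840]})
temperature = pd.DataFrame({"i94port": ["SEA"], "City": ["Seattle"], "AverageTemperature": [11.0]})

# Build the city dimension: one row per port with demographic and temperature attributes.
city = demographics.merge(temperature, on="i94port", how="inner")

# Build the fact table: attach city attributes to each immigration event.
# The inner join drops events whose port has no city row (here "XXX").
fact = immigration.merge(city, on="i94port", how="inner")[
    ["cicid", "i94port", "Foreign-born", "AverageTemperature"]
]
print(fact)
```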


### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model

In [20]:
# immigrants_table
immigrants_table = immigration_df.select(["cicid","biryear", "gender", "occup", "i94visa", "i94mode", "arrdate"])

# drop city to resolve ambiguity
demographics_df = demographics_df.drop("City")
#joining demographics_df and temperature_df for average temperature
city_temperature_full= demographics_df.join(temperature_df, demographics_df.i94port == temperature_df.i94port).drop(temperature_df.i94port)
#select columns for city_us_table
city_us_table = city_temperature_full.select(["City","State","i94port","Total Population","Number of Veterans","Foreign-born","Average Household Size","Race","AverageTemperature"])

#joining demographics_df and immigration_df
immigration_demographics_full= immigration_df.join(city_temperature_full, city_temperature_full.i94port == immigration_df.i94port).drop(immigration_df.i94port)
#selecting relevant columns for immigration table
immigration_table = immigration_demographics_full.select(["cicid","i94yr","i94mon","i94cit","i94port","i94mode","Race","Foreign-born","AverageTemperature"])

Rename the columns for consistency with the data model.

In [25]:
immigrants_table = immigrants_table.withColumnRenamed("biryear", "birthdate")\
        .withColumnRenamed("occup", "occupation")\
        .withColumnRenamed("i94visa", "visa_mode")\
        .withColumnRenamed("i94mode", "mode_of_arrival")\
        .withColumnRenamed("arrdate", "arrival_date")

city_us_table = city_us_table.withColumnRenamed("City", "city")\
        .withColumnRenamed("State", "state")\
        .withColumnRenamed("i94port", "port_code")\
        .withColumnRenamed("Total Population", "total_population")\
        .withColumnRenamed("Number of Veterans", "no_of_veterans")\
        .withColumnRenamed("Foreign-born", "no_of_foreignborns")\
        .withColumnRenamed("Average Household Size", "average_household_size")\
        .withColumnRenamed("Race", "race")\
        .withColumnRenamed("AverageTemperature", "average_temperature")

# Rename to the fact-table column names defined in the data model
immigration_table = immigration_table.withColumnRenamed("i94yr", "year")\
        .withColumnRenamed("i94mon", "month")\
        .withColumnRenamed("i94cit", "source_city")\
        .withColumnRenamed("i94port", "destination_city")\
        .withColumnRenamed("i94mode", "mode_of_arrival")\
        .withColumnRenamed("Race", "race")\
        .withColumnRenamed("Foreign-born", "foreign_born_no")\
        .withColumnRenamed("AverageTemperature", "average_temperature")

In [26]:
immigrants_table.show(5)

+-----+---------+------+----------+---------+---------------+------------+
|cicid|birthdate|gender|occupation|visa_mode|mode_of_arrival|arrival_date|
+-----+---------+------+----------+---------+---------------+------------+
|299.0|   1962.0|  null|      null|      2.0|            1.0|     20545.0|
|305.0|   1953.0|  null|      null|      2.0|            1.0|     20545.0|
|496.0|   1952.0|  null|      null|      1.0|            1.0|     20545.0|
|558.0|   1974.0|     M|      null|      1.0|            1.0|     20545.0|
|596.0|   1992.0|     M|      null|      2.0|            1.0|     20545.0|
+-----+---------+------+----------+---------+---------------+------------+
only showing top 5 rows



In [27]:
city_us_table.show(5)

+-------+----------+---------+----------------+--------------+------------------+----------------------+--------------------+-------------------+
|   city|     state|port_code|total_population|no_of_veterans|no_of_foreignborns|average_household_size|                race|average_temperature|
+-------+----------+---------+----------------+--------------+------------------+----------------------+--------------------+-------------------+
|Seattle|Washington|      SEA|          684443|         29364|            119840|                  2.13|  Hispanic or Latino|             -1.977|
|Ontario|California|      ONT|          171200|          4816|             48557|                  3.52|Black or African-...|  7.399999999999999|
|Spokane|Washington|      SPO|          213267|         18044|             13253|                  2.34|               Asian|              2.322|
|    Ica|  Illinois|      CHI|         2720556|         72042|            573463|                  2.53|Black or African-...

In [28]:
immigration_table.show(5)

+--------+------+-----+-----------+----------------+---------------+-----+---------------+-------------------+
|   cicid|  year|month|source_city|destination_city|mode_of_arrival| race|foreign_born_no|average_temperature|
+--------+------+-----+-----------+----------------+---------------+-----+---------------+-------------------+
|158829.0|2016.0|  4.0|      582.0|             SNA|            1.0|White|         208046|  7.168999999999999|
|576038.0|2016.0|  4.0|      582.0|             SNA|            1.0|White|         208046|  7.168999999999999|
|775234.0|2016.0|  4.0|      582.0|             SNA|            1.0|White|         208046|  7.168999999999999|
|924449.0|2016.0|  4.0|      582.0|             SNA|            1.0|White|         208046|  7.168999999999999|
|928755.0|2016.0|  4.0|      582.0|             SNA|            1.0|White|         208046|  7.168999999999999|
+--------+------+-----+-----------+----------------+---------------+-----+---------------+-------------------+
only showing top 5 rows

#### 4.2 Data Quality Checks
To ensure the pipeline ran as intended, we run the following checks:
 * Source/count checks to ensure completeness
 * Counting the NaN values in every column to ensure reliability
 * Printing a sample of the data with show()

Source/count checks to ensure completeness: we can compare the results with the original dataset.

In [None]:
def completeness_test(df, description):
    '''
    Source/Count checks to ensure completeness  
    
    Args:
    df: Spark dataframe to check
    description: table name used in the output message
    
    Returns: None
    '''
    result = df.count()
    if result == 0:
        print("Test Failed for {} with zero records".format(description))
    else:
        print("Test Passed for {} with {} records".format(description, result))
    return None

# Perform data quality check
completeness_test(immigration_table, "immigration table")
completeness_test(city_us_table, "city table")
completeness_test(immigrants_table, "immigrants table")

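Next, we count the NaN values in every column to check reliability. All counts are zero, as expected, because NaN values were removed during preprocessing. Note that `isnan` only flags floating-point NaN values; SQL nulls, such as those visible in the `gender` and `occupation` columns of `immigrants_table`, are not counted by this check.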

In [29]:
from pyspark.sql.functions import isnan, when, count, col
immigration_table.select([count(when(isnan(c), c)).alias(c) for c in immigration_table.columns]).show()

+-----+----+-----+-----------+----------------+---------------+----+---------------+-------------------+
|cicid|year|month|source_city|destination_city|mode_of_arrival|race|foreign_born_no|average_temperature|
+-----+----+-----+-----------+----------------+---------------+----+---------------+-------------------+
|    0|   0|    0|          0|               0|              0|   0|              0|                  0|
+-----+----+-----+-----------+----------------+---------------+----+---------------+-------------------+



In [32]:
from pyspark.sql.functions import isnan, when, count, col
city_us_table.select([count(when(isnan(c), c)).alias(c) for c in city_us_table.columns]).show()

+----+-----+---------+----------------+--------------+------------------+----------------------+----+-------------------+
|city|state|port_code|total_population|no_of_veterans|no_of_foreignborns|average_household_size|race|average_temperature|
+----+-----+---------+----------------+--------------+------------------+----------------------+----+-------------------+
|   0|    0|        0|               0|             0|                 0|                     0|   0|                  0|
+----+-----+---------+----------------+--------------+------------------+----------------------+----+-------------------+



In [31]:
from pyspark.sql.functions import isnan, when, count, col
immigrants_table.select([count(when(isnan(c), c)).alias(c) for c in immigrants_table.columns]).show()

+-----+---------+------+----------+---------+---------------+------------+
|cicid|birthdate|gender|occupation|visa_mode|mode_of_arrival|arrival_date|
+-----+---------+------+----------+---------+---------------+------------+
|    0|        0|     0|         0|        0|              0|           0|
+-----+---------+------+----------+---------+---------------+------------+



#### 4.3 Data dictionary 
~~~~
city_us_table
 |-- city: name of city
 |-- state: name of state
 |-- port_code: code for city
 |-- total_population: total population of the city
 |-- no_of_veterans: no of veterans in the city
 |-- no_of_foreignborns: no of foreign born residents in the city
 |-- average_household_size: average household size in the city
 |-- race: dominant racial group in the city 
 |-- average_temperature: average temperature of the city (joined from temperature data)
~~~~

~~~~
immigrants_table
 |-- cicid: unique identifier for immigration/immigrants
 |-- birthdate: birth year of immigrant
 |-- gender: gender of immigrant
 |-- occupation: occupation immigrant adopts in US (preferably)
 |-- visa_mode: business, pleasure or student
 |-- mode_of_arrival: mode of arrival to US eg: air, land etc
 |-- arrival_date: arrival date of immigrant in US
~~~~

~~~~
immigration_table:
 |-- year: year of immigration
 |-- month: month of immigration
 |-- source_city: source city port
 |-- destination_city: destination city
 |-- mode_of_arrival: mode of arrival to US eg: air, land etc
 |-- average_temperature: average_temperature of US city
 |-- race: dominant racial group of destination city
 |-- foreign_born_no: total no of people in the city who were foreign born
 
~~~~

### Justification of Technology and Update of Data

pandas was used to explore the data because it offers an intuitive interface for cleaning, transforming and analyzing data. For the production side (building the data pipeline), Spark (specifically PySpark) was chosen for the ETL processing. Because it distributes work across a cluster, Spark is typically faster than traditional single-machine data processing tools on large datasets. PySpark also supports multiple file formats, such as CSV and Parquet, and integrates easily with Amazon S3, Amazon's cloud storage service.

How often the data should be updated depends entirely on the business requirement. The project does not deal with urgent analysis (for example, finding the nearest ride), so there is no obvious update schedule. The most probable use case is a study of immigrant behaviour, for which the data can be updated at query time. For projects like this one, however, updating the data once a month or once every three months is a reasonable default.

## Possible Scenarios

If the data were increased by 100x, we would need far more storage and compute, and a range of networking and security concerns would arise. The most straightforward way to handle this would be to use a managed service such as Amazon EMR (Elastic MapReduce) from AWS. Amazon EMR can also help us transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

If the data needs to populate a dashboard daily, we could use a scheduling tool such as Airflow to run the ETL pipeline. Airflow can call third-party Python APIs or Amazon services to run the extract, transform and load steps as a pipeline that feeds the dashboard.

Having 100 users access the dataset should not be a problem once it is moved to Amazon S3. Issues can arise when those users start updating and deleting files, so they should be given read-only access. It is also a good idea to maintain two buckets, one for production and one for development, for further reliability.