# Project Title
### Data Engineering Capstone Project

#### Project Summary
* The goal of this project is to enable data analysts and similar users to analyze various aspects of immigration data by combining four datasets: immigration, airport codes, US city demographics, and global temperature. The project builds a schema from these datasets that lets analysts answer questions such as which origin countries send the most visitors to the US, how long visitors stay in the US, what the demographics are of the state where they arrive, and what the average temperature of the country is.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# import necessary libraries and user defined utilities
import pandas as pd

import os
import configparser
import datetime as dt
import time
from pyspark.sql.functions import isnan, when, count, col, udf, dayofmonth, dayofweek, month, year, weekofyear, avg, monotonically_increasing_id
from pyspark.sql.types import *
from pyspark.sql.functions import year, month, dayofmonth, weekofyear, date_format
from pyspark.sql import SparkSession, SQLContext, GroupedData, HiveContext
from pyspark.sql.functions import *
from pyspark.sql.functions import date_add as d_add
from pyspark.sql.types import DoubleType, StringType, IntegerType, FloatType
from pyspark.sql import functions as F
from pyspark.sql.functions import lit
from pyspark.sql import Row
import util as util
import table_helper as helper

In [2]:
spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

immigration_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
immigration_spark.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

#### Global Temperature Data

In [3]:
file_name = '../../data2/GlobalLandTemperaturesByCity.csv'
global_temperature_df = pd.read_csv(file_name)

In [None]:
global_temperature_df.dtypes

In [None]:
global_temperature_df.shape

In [None]:
global_temperature_df.head()

#### AIRPORT CODES

In [4]:
airport_codes_csv = 'airport-codes_csv.csv'
airport_codes_df = pd.read_csv(airport_codes_csv)
airport_codes_df.shape

(55075, 12)

In [5]:
column_list = ['iata_code', 'local_code']
airport_codes_df = util.cleanup_missing_column_values(airport_codes_df,column_list)

removing rows with null values for ['iata_code', 'local_code']
total rows before clean up 55075
total rows after clean up 2987


In [None]:
airport_codes_df.head(1)

In [None]:
airport_codes_df.shape

#### US CITY DEMOGRAPHICS

In [6]:
us_cities_dem_csv = "us-cities-demographics.csv"
demographics_df = pd.read_csv(us_cities_dem_csv, delimiter=';')
demographics_df.head(1)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924


In [None]:
demographics_df.shape

In [7]:

for column in demographics_df:
    values = demographics_df[column].unique()
    if(True in pd.isnull(values)):
        print(f"column {column} has null value")
print("finished checking for null")        


column Male Population has null value
column Female Population has null value
column Number of Veterans has null value
column Foreign-born has null value
column Average Household Size has null value
finished checking for null


#### Build country code to country name data frame 
_(The following code is based on guidance from a mentor.)_

In [8]:
#extract country name and country code from I94_SAS_Labels_Descriptions.SAS

with open("I94_SAS_Labels_Descriptions.SAS") as f:
    contents = f.readlines()
    country_code = {}
    for countries in contents[10:298]:
        pair = countries.split('=')
        code,country = pair[0].strip(), pair[1].strip().strip("'")
        country_code[code] = country
country_code_df = pd.DataFrame(list(country_code.items()),columns=['code','country'])
country_code_df.head(5)

Unnamed: 0,code,country
0,236,AFGHANISTAN
1,101,ALBANIA
2,316,ALGERIA
3,102,ANDORRA
4,324,ANGOLA
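
The slice `contents[10:298]` above depends on the exact line positions in the labels file. A minimal sketch of the same parsing step on an inline sample, assuming each entry has the form `code = 'NAME'`; the sample text and the function name `parse_code_labels` are illustrative and not taken from the project:

```python
def parse_code_labels(lines):
    """Parse lines shaped like "  236 =  'AFGHANISTAN'" into {code: name}."""
    mapping = {}
    for line in lines:
        if "=" not in line:
            continue  # skip comments, blank lines, and block headers
        code, _, label = line.partition("=")
        code = code.strip()
        if not code.isdigit():
            continue  # keep only entries keyed by a numeric code
        # Drop surrounding whitespace, a trailing semicolon, and the quotes.
        mapping[code] = label.strip().rstrip(";").strip().strip("'").strip()
    return mapping


# Hypothetical excerpt in the style of I94_SAS_Labels_Descriptions.SAS
sample = [
    "/* I94CIT & I94RES */",
    "  value i94cntyl",
    "   236 =  'AFGHANISTAN'",
    "   101 =  'ALBANIA'",
    "   316 =  'ALGERIA' ;",
]
country_code = parse_code_labels(sample)
print(country_code)
```

Filtering on the line shape rather than on fixed line numbers would keep the extraction working if the labels file gains or loses header lines, at the cost of also picking up other numeric-coded blocks unless the input is limited to the country section.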


### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
* Clean up data with null values in important columns
* Drop duplicate rows

In [9]:
immigration_spark = immigration_spark.drop("insnum","occup")

In [10]:
# clean up Immigration data
immigration_spark = immigration_spark.where(immigration_spark.arrdate.isNotNull())

In [11]:
immigration_spark = immigration_spark.where(immigration_spark.depdate.isNotNull())
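
`arrdate` and `depdate` appear as doubles in the schema. Assuming they are SAS date values, which count days from 1960-01-01, the length of stay mentioned in the project summary can be derived from them. A minimal sketch in plain Python; the helper names and sample values are illustrative and not part of `util` or `table_helper`:

```python
import datetime as dt

SAS_EPOCH = dt.date(1960, 1, 1)  # SAS date values count days from this date


def sas_to_date(sas_days):
    """Convert a SAS date value (possibly a float) to a datetime.date."""
    if sas_days is None:
        return None
    return SAS_EPOCH + dt.timedelta(days=int(sas_days))


def stay_in_days(arrdate, depdate):
    """Length of stay in days; both arguments are SAS date values."""
    return int(depdate) - int(arrdate)


arrival = sas_to_date(20545.0)    # illustrative arrdate value
departure = sas_to_date(20552.0)  # illustrative depdate value
print(arrival, departure, stay_in_days(20545.0, 20552.0))
```

In Spark the same conversion could be done with a UDF or with `date_add`, which the notebook imports as `d_add`.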

In [12]:
immigration_spark.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = true)
 |-- fltno: string (nullable = true)
 |-- visatype: string (nullable 

In [13]:
# clean up Global Temperature data
global_temperature_df = global_temperature_df.dropna()

In [None]:
global_temperature_df.shape

In [14]:
# discard historical data and keep records dated 2000-01-01 through 2018-12-31
print("Discarding historical data and work with data for recent 20 years")
dt_begin = "2000-01-01"
dt_end = "2019-01-01"
after_dt_begin = global_temperature_df["dt"] >= dt_begin
before_dt_end = global_temperature_df["dt"] < dt_end
dt_range = after_dt_begin & before_dt_end
global_temperature_df = global_temperature_df.loc[dt_range]
global_temperature_df.shape

Discarding historical data and work with data for recent 20 years


(576080, 7)
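
The filter above compares the `dt` column, which holds strings, against string bounds. This works because ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in the same order as the dates they represent. A small sketch with made-up values, using plain Python instead of pandas:

```python
import datetime as dt

dt_begin = "2000-01-01"
dt_end = "2019-01-01"

# Made-up sample dates in the same YYYY-MM-DD format as the `dt` column.
samples = ["1999-12-01", "2000-01-01", "2013-09-01", "2019-01-01"]

# String comparison: lower bound inclusive, upper bound exclusive.
kept_as_strings = [s for s in samples if dt_begin <= s < dt_end]

# The same filter after parsing to real dates gives the same result.
begin_date = dt.date.fromisoformat(dt_begin)
end_date = dt.date.fromisoformat(dt_end)
kept_as_dates = [
    s for s in samples
    if begin_date <= dt.date.fromisoformat(s) < end_date
]

print(kept_as_strings)
```

Note that the bounds select 19 calendar years (2000 through 2018). The comparison would stop being reliable if the column used a format such as `MM/DD/YYYY`, in which case parsing to dates first is the safer route.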

In [None]:
global_temperature_df.head()

In [15]:
# clean up Airport codes
column_list = ['iata_code', 'local_code']
airport_codes_df = util.cleanup_missing_column_values(airport_codes_df,column_list)
airport_codes_df.shape

removing rows with null values for ['iata_code', 'local_code']
total rows before clean up 2987
total rows after clean up 2987


(2987, 12)

In [16]:
demographics_df = demographics_df.drop_duplicates()
demographics_df = demographics_df.dropna()
# need to remove rows that have null values for the following columns
# Male Population, Female Population
# column_list = ['Male Population', 'Female Population',]
# demographics_df = util.cleanup_missing_column_values(demographics_df,column_list)
demographics_df.shape

(2875, 12)

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

* The process of admitting a visitor to the US triggers events that can be classified as facts. In this project we create an immigration fact table.
* Derive dimension tables
    * airport
    * time
    * status
    * visa
    * temperature
    * country
    * state


#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

* Load data into staging environment
* Create fact and dimension tables
* Write table data into parquet files
* Run data quality tests

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:

# not including insnum, occup columns as majority of rows contain null value for these columns
# also these columns are dropped from the data frame
# immig_schema = StructType([StructField("0", IntegerType(), True)\
#                           ,StructField("cicid", FloatType(), True)\
#                           ,StructField("i94yr", FloatType(), True)\
#                           ,StructField("i94mon", FloatType(), True)\
#                           ,StructField("i94cit", FloatType(), True)\
#                           ,StructField("i94res", FloatType(), True)\
#                           ,StructField("i94port", StringType(), True)\
#                           ,StructField("arrdate", FloatType(), True)\
#                           ,StructField("i94mode", FloatType(), True)\
#                           ,StructField("i94addr", StringType(), True)\
#                           ,StructField("depdate", FloatType(), True)\
#                           ,StructField("i94bir", FloatType(), True)\
#                           ,StructField("i94visa", FloatType(), True)\
#                           ,StructField("count", FloatType(), True)\
#                           ,StructField("dtadfile", StringType(), True)\
#                           ,StructField("visapost", StringType(), True)\
#                           ,StructField("entdepa", StringType(), True)\
#                           ,StructField("entdepd", StringType(), True)\
#                           ,StructField("entdepu", StringType(), True)\
#                           ,StructField("matflag", StringType(), True)\
#                           ,StructField("biryear", FloatType(), True)\
#                           ,StructField("dtaddto", StringType(), True)\
#                           ,StructField("gender", StringType(), True)\
#                           ,StructField("airline", StringType(), True)\
#                           ,StructField("admnum", FloatType(), True)\
#                           ,StructField("fltno", StringType(), True)\
#                           ,StructField("visatype", StringType(), True)])

In [None]:
# immigration_spark = spark.createDataFrame(immigration_df, schema=immig_schema)
print(immigration_spark.count(),len(immigration_spark.columns))

In [17]:
globaltemp_schema = StructType([StructField("dt", StringType(), True)\
                          ,StructField("average_temperature", FloatType(), True)\
                          ,StructField("average_temperature_uncertainty", FloatType(), True)\
                          ,StructField("city", StringType(), True)\
                          ,StructField("country", StringType(), True)\
                          ,StructField("latitude", StringType(), True)\
                          ,StructField("longitude", StringType(), True)])
global_temperature_df.rename(columns={'AverageTemperature':'average_temperature'}, inplace=True)
global_temperature_df.rename(columns={'AverageTemperatureUncertainty':'average_temperature_uncertainty'}, inplace=True)
global_temperature_df.rename(columns={'City':'city'}, inplace=True)
global_temperature_df.rename(columns={'Country':'country'}, inplace=True)
global_temperature_df.rename(columns={'Latitude':'latitude'}, inplace=True)
global_temperature_df.rename(columns={'Longitude':'longitude'}, inplace=True)

temp_spark = spark.createDataFrame(global_temperature_df, schema=globaltemp_schema)

temp_spark.toPandas().head()

Unnamed: 0,dt,average_temperature,average_temperature_uncertainty,city,country,latitude,longitude
0,2000-01-01,3.065,0.372,Århus,Denmark,57.05N,10.33E
1,2000-02-01,3.724,0.241,Århus,Denmark,57.05N,10.33E
2,2000-03-01,3.976,0.296,Århus,Denmark,57.05N,10.33E
3,2000-04-01,8.321,0.221,Århus,Denmark,57.05N,10.33E
4,2000-05-01,13.567,0.253,Århus,Denmark,57.05N,10.33E


In [None]:
temp_spark.toPandas().shape

In [18]:
dem_schema = StructType([StructField("City", StringType(), True)\
                        ,StructField("State", StringType(), True)\
                        ,StructField("Median Age", FloatType(), True)\
                        ,StructField("Male Population", FloatType(), True)\
                        ,StructField("Female Population", FloatType(), True)\
                        ,StructField("Total Population", IntegerType(), True)\
                        ,StructField("Number of Veterans", FloatType(), True)\
                        ,StructField("Foreign-born", FloatType(), True)\
                        ,StructField("Average Household Size", FloatType(), True)\
                        ,StructField("State Code", StringType(), True)\
                        ,StructField("Race", StringType(), True)\
                        ,StructField("Count", IntegerType(), True)])

dem_spark = spark.createDataFrame(demographics_df, schema=dem_schema)

dem_spark.toPandas().head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.799999,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.599998,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [19]:
airport_schema =  StructType([StructField("ident", StringType(), True)\
                        ,StructField("type", StringType(), True)\
                        ,StructField("name", StringType(), True)\
                        ,StructField("elevation_ft", FloatType(), True)\
                        ,StructField("continent", StringType(), True)\
                        ,StructField("iso_country", StringType(), True)\
                        ,StructField("iso_region", StringType(), True)\
                        ,StructField("municipality", StringType(), True)\
                        ,StructField("gps_code", StringType(), True)\
                        ,StructField("iata_code", StringType(), True)\
                        ,StructField("local_code", StringType(), True)\
                        ,StructField("coordinates", StringType(), True)])
airport_spark = spark.createDataFrame(airport_codes_df, schema=airport_schema)

airport_spark.toPandas().head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,03N,small_airport,Utirik Airport,4.0,OC,MH,MH-UTI,Utirik Island,K03N,UTK,03N,"169.852005, 11.222"
1,07FA,small_airport,Ocean Reef Club Airport,8.0,,US,US-FL,Key Largo,07FA,OCA,07FA,"-80.274803161621, 25.325399398804"
2,0AK,small_airport,Pilot Station Airport,305.0,,US,US-AK,Pilot Station,,PQS,0AK,"-162.899994, 61.934601"
3,0CO2,small_airport,Crested Butte Airpark,8980.0,,US,US-CO,Crested Butte,0CO2,CSE,0CO2,"-106.928341, 38.851918"
4,0TE7,small_airport,LBJ Ranch Airport,1515.0,,US,US-TX,Johnson City,0TE7,JCY,0TE7,"-98.62249755859999, 30.251800537100003"


##### Define output path

In [20]:
output_path="table_data2/"

##### 1.Create airport dimension

In [21]:
airport_spark = helper.create_airport(airport_spark, output_path)

Writing table airport to table_data2/airport
Write complete!


##### 2.Create time dimension

In [22]:
time = helper.create_time(immigration_spark,output_path)

Writing table time to table_data2/time
Write complete!


##### 3.Create status dimension

In [23]:
status = helper.create_status(immigration_spark,output_path)

Writing table status to table_data2/status
Write complete!


##### 4.Create visa dimension

In [24]:
visa = helper.create_visa(immigration_spark,output_path)

Writing table visa to table_data2/visa
Write complete!


In [25]:
def create_temperature(input_df, output_path):
    """
        Gather temperature data, create dataframe and write data into parquet files.
        
        :param input_df: dataframe of input data.
        :param output_path: path to write data to.
        :return: dataframe representing temperature dimension
    """
    print("creating temperature table data")
    output_df = input_df.groupBy("country").agg(
                round(mean('average_temperature'),1).alias("average_temperature"),\
                round(mean("average_temperature_uncertainty"),1).alias("average_temperature_uncertainty")
            ).dropna()\
            .withColumn("temperature_id", monotonically_increasing_id()) \
            .select(["temperature_id", "country", "average_temperature", "average_temperature_uncertainty"])
    
    util.output_to_parquet_file(output_df, output_path, "temperature")
    
    return output_df


##### 5.Create temperature dimension

In [26]:
temperature = create_temperature(temp_spark,output_path)

creating temperature table data
Writing table temperature to table_data2/temperature
Write complete!


##### 6.Create country dimension

In [27]:
country_spark = spark.createDataFrame(country_code_df)

In [28]:
country = helper.create_country(country_spark,output_path)

Writing table country to table_data2/country
Write complete!


##### 7.Create state dimension

In [29]:
def create_state(input_df, output_path):
    """
        Get state specific data and create dataframe and write data into parquet files.
        Here we will group the information by state code
        Rename the columns from Xxxx Yyyy to xxx_yyy
        Drop rows with null values
        
        :param input_df: dataframe of input data.
        :param output_path: path to write data to.
        :return: dataframe representing state dimension
    """
    
    output_df = input_df.select(["State Code", "State", "Median Age", "Male Population", "Female Population", "Total Population", "Average Household Size",\
                          "Foreign-born", "Race", "Count"])\
                .withColumnRenamed("State","state")\
                .withColumnRenamed("State Code", "state_code")\
                .withColumnRenamed("Median Age", "median_age")\
                .withColumnRenamed("Male Population", "male_population")\
                .withColumnRenamed("Female Population", "female_population")\
                .withColumnRenamed("Total Population", "total_population")\
                .withColumnRenamed("Average Household Size", "avg_household_size")\
                .withColumnRenamed("Foreign-born", "foreign_born")\
                .withColumnRenamed("Race", "race")\
                .withColumnRenamed("Count", "count")
    print(output_df.show(2))
    output_df = output_df.groupBy("state_code","state").agg(\
                round(mean('median_age'),0).alias("median_age"),\
                sum("total_population").alias("total_population"),\
                sum("male_population").alias("male_population"),\
                sum("female_population").alias("female_population"),\
                sum("foreign_born").alias("foreign_born"), \
                mean("avg_household_size").alias("average_household_size")  # household size is an average, so take the mean, not the sum
                ).dropna()
    
    util.output_to_parquet_file(output_df, output_path, "state")
    
    return output_df

In [30]:
state = create_state(dem_spark,output_path)

+----------+-------------+----------+---------------+-----------------+----------------+------------------+------------+------------------+-----+
|state_code|        state|median_age|male_population|female_population|total_population|avg_household_size|foreign_born|              race|count|
+----------+-------------+----------+---------------+-----------------+----------------+------------------+------------+------------------+-----+
|        MD|     Maryland|      33.8|        40601.0|          41862.0|           82463|               2.6|     30908.0|Hispanic or Latino|25924|
|        MA|Massachusetts|      41.0|        44129.0|          49500.0|           93629|              2.39|     32935.0|             White|58723|
+----------+-------------+----------+---------------+-----------------+----------------+------------------+------------+------------------+-----+
only showing top 2 rows

None
Writing table state to table_data2/state
Write complete!
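
The two rows printed above carry `race` and `count` columns, which suggests the source file has one row per city and race. If so, city-level columns such as `total_population` repeat on every race row, and summing them per state counts each city several times. A minimal sketch of deduplicating per city before summing, on made-up rows and in plain Python; the second Silver Spring row is an assumption added for illustration:

```python
from collections import defaultdict

# Made-up rows: (city, state_code, total_population, race, race_count).
rows = [
    ("Silver Spring", "MD", 82463, "Hispanic or Latino", 25924),
    ("Silver Spring", "MD", 82463, "White", 37756),  # hypothetical second race row
    ("Quincy", "MA", 93629, "White", 58723),
]

# Naive per-state sum: Silver Spring is counted once per race row.
naive = defaultdict(int)
for city, state, population, _race, _count in rows:
    naive[state] += population

# Keep one population figure per (city, state) before summing.
per_city = {(city, state): population for city, state, population, _r, _c in rows}
deduped = defaultdict(int)
for (_city, state), population in per_city.items():
    deduped[state] += population

print(dict(naive))
print(dict(deduped))
```

In Spark, one way to express this is a `dropDuplicates` on the city and state columns before the `groupBy`. Whether it applies here depends on the actual shape of the demographics file.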


##### Create Immigration Fact

In [31]:
# join country dimension and temperature dimension
country_temperature = country.select(["*"])\
            .join(temperature, (country.country == upper(temperature.country)), how='full')\
            .select([country.code, country.country, temperature.temperature_id, temperature.average_temperature, temperature.average_temperature_uncertainty])

country_temperature.write.mode("overwrite").parquet(output_path+"country_temperature_mapping")
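
The join above matches on `upper(temperature.country)` because the I94 labels are upper case (for example `AFGHANISTAN`) while the temperature data uses mixed case (for example `Denmark`). With `how='full'`, names present on only one side still produce a row, with nulls for the missing side. A plain-Python sketch of that behaviour on made-up inputs:

```python
# Made-up inputs: I94 code -> upper-case name, and a temperature value by country.
countries = {"236": "AFGHANISTAN", "101": "ALBANIA", "999": "EXAMPLE ONLY IN I94"}
temperatures = {"Afghanistan": 14.0, "Albania": 15.5, "Denmark": 8.2}

# Normalise the temperature side the same way the join does: upper-case names.
temp_by_upper = {name.upper(): value for name, value in temperatures.items()}

joined = []
# Rows from the country side, whether or not a temperature matches.
for code, name in countries.items():
    joined.append((code, name, temp_by_upper.get(name)))
# Rows that exist only on the temperature side (the "full" part of the join).
known = set(countries.values())
for name, value in temp_by_upper.items():
    if name not in known:
        joined.append((None, None, value))

for row in joined:
    print(row)
```

Rows with a missing code or a missing temperature are worth counting after the real join, since they show how many country names fail to match between the two sources.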

In [32]:
immigration = helper.create_immigration(immigration_spark, output_path, spark)

In [33]:
#create a temp view - immigration_view
immigration.createOrReplaceTempView("immigration_view")

In [34]:
immigration.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- fltno: string (nullable = true)
 |-- ident: string (nullable = true)
 |-- code: string (nullable = true)
 |-- temperature_id: long (nullable = true)
 |-- status_flag_id: long (nullable = true)
 |-- visa_id: long (nullable = true)
 |-- state_code: string (nullable = true)
 |-- foreign_born: double (nullable = true)
 |-- country: string (nullable = true)
 |-- arrdate: double (nullable = true)



In [35]:
immigration_states = immigration.groupBy("state_code").count().persist()

In [36]:
immigration_states.printSchema()

root
 |-- state_code: string (nullable = true)
 |-- count: long (nullable = false)



In [41]:
immigration_state_foreign_born = spark.sql("select distinct state_code, foreign_born from immigration_view where foreign_born > 100 order by foreign_born desc").persist()

In [42]:
immigration_state_foreign_born.printSchema()

root
 |-- state_code: string (nullable = true)
 |-- foreign_born: double (nullable = true)



In [None]:
immigration_state_foreign_born.first()

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
  


##### 1. Generic data check

_In this section, each dimension and fact table is read back from parquet and its column types (or first rows) are inspected_

In [None]:
airport = spark.read.parquet(output_path+"airport")
#airport.toPandas().head()
airport.toPandas().dtypes


In [None]:
time = spark.read.parquet(output_path+"time")
#time.toPandas().head()
time.toPandas().dtypes

In [None]:
status = spark.read.parquet(output_path+"status")
#status.toPandas().head()
status.toPandas().dtypes

In [None]:
visa = spark.read.parquet(output_path+"visa")
#visa.toPandas().head()
visa.toPandas().dtypes

In [None]:
temperature = spark.read.parquet(output_path+"temperature")
#temperature.toPandas().head()
temperature.toPandas().dtypes

In [None]:
country = spark.read.parquet(output_path+"country")
#country.toPandas().head()
country.toPandas().dtypes

In [None]:
state = spark.read.parquet(output_path+"state")
#state.toPandas().head()
state.toPandas().dtypes

In [None]:
country_temperature = spark.read.parquet(output_path+"country_temperature_mapping")
country_temperature.toPandas().head()

In [None]:
immigration = spark.read.parquet(output_path+"immigration")
#immigration.toPandas().head()
immigration.toPandas().dtypes

##### 2. Record count check
_In this section, a generic record count is done for each table_

In [None]:
util.run_record_count_check(airport, "airport")
util.run_record_count_check(time, "time")
util.run_record_count_check(status, "status")
util.run_record_count_check(visa, "visa")
util.run_record_count_check(state, "state")
util.run_record_count_check(temperature, "temperature")
util.run_record_count_check(immigration, "immigration")
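
`util.run_record_count_check` lives in the project's `util` module and is not shown in this notebook. A minimal sketch of what such a check might look like, assuming it only needs the table's `count()`; the function body, the raised exception, and the `FakeTable` stand-in are all hypothetical:

```python
def run_record_count_check(table, table_name):
    """Fail if the table is empty; otherwise report and return its row count."""
    total = table.count()  # a Spark DataFrame exposes count() the same way
    if total == 0:
        raise ValueError(f"Data quality check failed: {table_name} has no records")
    print(f"Data quality check passed: {table_name} has {total} records")
    return total


class FakeTable:
    """Stand-in for a Spark DataFrame so the sketch runs without Spark."""

    def __init__(self, rows):
        self._rows = rows

    def count(self):
        return len(self._rows)


checked = run_record_count_check(FakeTable([("03N",), ("07FA",)]), "airport")
```

Raising on an empty table makes a failed check stop the pipeline instead of only printing a message, which matters once the notebook is run unattended.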

##### 3. Custom data check
_In this section, results for custom queries are printed_

* Get total count of immigrants for each country

In [None]:
imm_df = immigration.toPandas()
imm_df = imm_df.groupby(['country'])['country'].count()
imm_df.head()

#### Data Dictionary

__Please refer to the Data Dictionary.ipynb file__

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

##### 5.1 Rationale for the choice of tools and technologies for the project.

* Apache Spark was chosen for its in-memory caching and parallel processing.
* Pandas data frames were used for convenient manipulation of the CSV datasets.

##### 5.2 How often the data should be updated/refreshed, why?

* From the project's perspective, it depends on how often the immigration dataset is updated, since the data in the other dimension tables depends on the immigration data. If the immigration data is updated monthly, then all other data should be updated monthly as well.

##### 5.3 Approach in different scenarios

###### a. If the data is increased by 100x
* I would use services such as Amazon EMR (to run Spark on a cluster) and Amazon Redshift to handle data of that magnitude.

###### b. If the data populates a dashboard that needs to be updated by 7am daily
* In this case, I would use an Apache Airflow DAG to schedule and manage the data pipeline so that it completes before 7am each day.

###### c. If 100+ people are going to access the system
* In this case, I would use a proven setup such as Amazon Redshift, Snowflake, or a similar data warehouse known to handle many concurrent users.