# US Immigration Statistics
### Data Engineering Capstone Project

#### Project Summary
Based on the 2016 data on immigration into the US, we want to analyse which US cities the immigration went to and what characteristics these cities have in terms of population and airport infrastructure.



##### The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [83]:
# Do all imports and installs here
import pandas as pd
import glob
from pyspark.sql import SparkSession
from pyspark.sql import functions as sf
from pyspark.sql.types import StringType, IntegerType, DoubleType, BooleanType, DateType
from pyspark.sql.functions import concat, col, lit,udf
import re
import datetime as dt
import quality_check_functions
import etl
import importlib
from CodeUtilities import CodeUtilities
importlib.reload(quality_check_functions)
from quality_check_functions import performQualityChecks,check_existence
importlib.reload(etl)
from etl import process_city_data,process_country_codes,process_airport_data,process_state_codes,etl_immigrations_data, etl_demographics_data


In [9]:
spark = SparkSession.builder.\
config("spark.jars.repositories", "https://repos.spark-packages.org/").\
config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
enableHiveSupport().getOrCreate()

# Step 1: Scope the Project and Gather Data

## Scope 
Create an ETL pipeline for US immigration data, demographic data of US cities and airports to allow queries on the relationship of these data sources and enable deeper analytics on the data.

### Describe and Gather Data 
I94 Immigration Data: This data comes from the US National Tourism and Trade Office. It serves as the fact table in this project.

U.S. City Demographic Data (us-cities-demographics): This dataset contains population details of US cities and census-designated places, including gender and race information. This data came from OpenSoft. The table is grouped by state to get aggregated statistics.

Airport Codes: A simple table of airport codes and corresponding cities. Only the rows with an IATA code are selected for this project.

# Step 2: Explore and Assess the Data

## Immigration Data

In [3]:
print(glob.glob("../../data/*/*"))

['../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_sep16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_nov16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_mar16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_aug16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_may16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_jan16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_oct16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_jul16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_feb16_sub.sas7bdat', '../../data/18-83510-I94-Data-2016/i94_dec16_sub.sas7bdat']


-> 12 files available, one for each month in 2016.
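Because `glob` returns the files in arbitrary order, it can help to sort them by month before loading. The following is a minimal sketch that relies only on the file naming pattern visible in the listing above; the helper names `month_of` and `sort_by_month` are illustrative and not part of the project code.

```python
import re

# Month abbreviations as they appear in the file names, e.g. "i94_apr16_sub.sas7bdat".
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def month_of(path):
    """Return the month number (1-12) encoded in an I94 file name, or None."""
    match = re.search(r"i94_([a-z]{3})\d{2}_sub\.sas7bdat$", path)
    if match is None or match.group(1) not in MONTHS:
        return None
    return MONTHS.index(match.group(1)) + 1


def sort_by_month(paths):
    """Keep only recognised monthly files and order them January to December."""
    monthly = [p for p in paths if month_of(p) is not None]
    return sorted(monthly, key=month_of)


files = [
    "../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat",
    "../../data/18-83510-I94-Data-2016/i94_sep16_sub.sas7bdat",
    "../../data/18-83510-I94-Data-2016/i94_jan16_sub.sas7bdat",
]
ordered = sort_by_month(files)
print([month_of(p) for p in ordered])  # [1, 4, 9]
```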

In [86]:
df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat').drop('validres','delete_days','delete_mexl','delete_dup','delete_visa','delete_recdup')
df_spark.printSchema()
print(f"Rows: {df_spark.count()}")

for file in glob.glob("../../data/*/*"):
    print('Reading file '+file)
    if  '18-83510-I94-Data-2016/i94_jun16_sub.sas7bdat' not in file:
        df_spark = df_spark.unionByName(spark.read.format('com.github.saurfang.sas.spark').load(file))
        print(f"Rows: {df_spark.count()}")
    
#df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016')
columns_in_file=len(df_spark.columns)
rows_in_file=df_spark.count()
distinct_rows_data=df_spark.distinct().count()
print(f"The total number of columns in the data file are {columns_in_file} and number of rows {rows_in_file} ")
print(f"Count of distinct number of rows {distinct_rows_data} ")

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

In [None]:
#write to parquet
df_spark.write.parquet("all_sas_data")
df_spark=spark.read.parquet("all_sas_data")

The immigration data files contain more than 40m entries.
The June file contains additional columns that are not present in the other files. These columns were removed because they are not needed and would prevent the monthly files from being combined.
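Instead of hard-coding the columns to drop, the shared columns could be derived from the schemas themselves. This is a small sketch of that idea in plain Python; `common_columns` is a hypothetical helper, and the column lists are shortened for illustration.

```python
def common_columns(schemas):
    """Return the columns present in every schema, in the order of the first one."""
    if not schemas:
        return []
    shared = set(schemas[0]).intersection(*schemas[1:])
    return [c for c in schemas[0] if c in shared]


# Shortened, illustrative column lists; the real files have many more columns.
june_columns = ["cicid", "i94yr", "i94mon", "validres", "delete_days", "i94port"]
other_columns = ["cicid", "i94yr", "i94mon", "i94port"]

keep = common_columns([june_columns, other_columns])
print(keep)  # ['cicid', 'i94yr', 'i94mon', 'i94port']
```

With Spark, each monthly dataframe could then be reduced with `df.select(*keep)` before calling `unionByName`, assuming `df.columns` is used to collect the schemas.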

## US Cities demographic data

In [23]:
file_name = "us-cities-demographics.csv"
demographics_df = spark.read.csv(file_name, inferSchema=True, header=True, sep=';')

performQualityChecks(demographics_df, file_name, True)
demographics_df = demographics_df.withColumnRenamed('State Code', 'state_code')
check_existence(demographics_df, '((City=\'Silver Spring\') AND (state_code=\'MD\'))')
demographics_df.createOrReplaceTempView('demographics')
result = spark.sql("SELECT count(*) as result FROM demographics where City = 'Silver Spring' and state_code = 'MD'")
result.collect()[0][0]
# display the first five records
demographics_df.limit(5).toPandas()


<Not Empty> quality check passed for us-cities-demographics.csv with 2,891 records.
<All lines loaded> quality check passed!
<Check existence> quality check succeeeded


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,state_code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129,49500,93629,4147,32935,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040,46799,84839,4819,8229,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127,87105,175232,5821,33878,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040,143873,281913,5829,86253,2.73,NJ,White,76402


Let's look at the data for one city:

In [24]:
demographics_df.filter('City == \'Silver Spring\'').toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,state_code,Race,Count
0,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Hispanic or Latino,25924
1,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,White,37756
2,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Black or African-American,21330
3,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,American Indian and Alaska Native,1084
4,Silver Spring,Maryland,33.8,40601,41862,82463,1562,30908,2.6,MD,Asian,8841


The data set contains one row per city and race.
Let's check how many different races are available:

In [25]:
demographics_df.select('Race').distinct().show()

+--------------------+
|                Race|
+--------------------+
|Black or African-...|
|  Hispanic or Latino|
|               White|
|               Asian|
|American Indian a...|
+--------------------+



The data is available for five different races.
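Since the city-level columns are repeated on every race row, the data can be collapsed to one record per city. The sketch below shows the idea in plain Python, using the Silver Spring rows from the output above; `pivot_by_city` is an illustrative name, not a function of the project.

```python
from collections import defaultdict

# Rows copied from the Silver Spring output above: (city, state_code, race, count).
rows = [
    ("Silver Spring", "MD", "Hispanic or Latino", 25924),
    ("Silver Spring", "MD", "White", 37756),
    ("Silver Spring", "MD", "Black or African-American", 21330),
    ("Silver Spring", "MD", "American Indian and Alaska Native", 1084),
    ("Silver Spring", "MD", "Asian", 8841),
]


def pivot_by_city(rows):
    """Collapse one-row-per-race records into one dict of race counts per city."""
    result = defaultdict(dict)
    for city, state_code, race, count in rows:
        result[(city, state_code)][race] = count
    return dict(result)


per_city = pivot_by_city(rows)
print(per_city[("Silver Spring", "MD")]["Asian"])  # 8841
```

On the Spark dataframe, a comparable result could probably be obtained with `groupBy("City", "state_code").pivot("Race").sum("Count")`.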

## Airport Codes (incl. IATA codes)

Filter US airports with IATA code

In [26]:
airports_df = process_airport_data(spark,"airport-codes_csv.csv")

# display the first five records
airports_df.limit(5).toPandas()


<Not Empty> quality check passed for airport-codes_csv.csv with 55,075 records.
<All lines loaded> quality check passed!
<Not Empty> quality check passed for city_codes.txt with 526 records.
<All lines loaded> quality check NOT passed! Number of lines: 533, number of rows: 526


Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,state_code,city_code
0,07FA,small_airport,Ocean Reef Club Airport,8,,US,US-FL,Key Largo,07FA,OCA,07FA,"-80.274803161621, 25.325399398804",FL,0
1,0AK,small_airport,Pilot Station Airport,305,,US,US-AK,Pilot Station,,PQS,0AK,"-162.899994, 61.934601",AK,0
2,0CO2,small_airport,Crested Butte Airpark,8980,,US,US-CO,Crested Butte,0CO2,CSE,0CO2,"-106.928341, 38.851918",CO,0
3,0TE7,small_airport,LBJ Ranch Airport,1515,,US,US-TX,Johnson City,0TE7,JCY,0TE7,"-98.62249755859999, 30.251800537100003",TX,0
4,13MA,small_airport,Metropolitan Airport,418,,US,US-MA,Palmer,13MA,PMX,13MA,"-72.31140136719999, 42.223300933800004",MA,0


Let's have a look at the different types of airports available:

In [27]:
airports_df.select('type').distinct().collect()

[Row(type='large_airport'),
 Row(type='seaplane_base'),
 Row(type='heliport'),
 Row(type='closed'),
 Row(type='medium_airport'),
 Row(type='small_airport')]

And the number of airports of each type:

In [28]:
airports_df.groupBy("type").count().orderBy("type").show()

+--------------+-----+
|          type|count|
+--------------+-----+
|        closed|   63|
|      heliport|   19|
| large_airport|  167|
|medium_airport|  653|
| seaplane_base|   72|
| small_airport| 1045|
+--------------+-----+



In [29]:
airports_df.count()

2019

Since we want to combine the airport data with the other data sets on a geographic basis, let's look at what this data set provides for the US (we are only interested in airports with an IATA code):

In [30]:
airports_df.filter("iso_country='US'").filter("iata_code != 'none'").groupBy("iso_country","iso_region").count().orderBy("iso_region").show(5)

+-----------+----------+-----+
|iso_country|iso_region|count|
+-----------+----------+-----+
|         US|     US-AK|  334|
|         US|     US-AL|   30|
|         US|     US-AR|   29|
|         US|     US-AZ|   46|
|         US|     US-CA|  157|
+-----------+----------+-----+
only showing top 5 rows
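The airport rows shown earlier carry a `state_code` column next to `iso_region` (for example `US-FL` and `FL`). How `process_airport_data` derives it is not shown here; one plausible way, sketched with a hypothetical helper, is to split the region string:

```python
def state_code_from_region(iso_region):
    """Return the two-letter state part of a region such as "US-FL", else None."""
    if not iso_region or "-" not in iso_region:
        return None
    country, _, state = iso_region.partition("-")
    # Only accept US regions with a two-character state part.
    if country != "US" or len(state) != 2:
        return None
    return state.upper()


print(state_code_from_region("US-FL"))  # FL
print(state_code_from_region("CA-ON"))  # None
```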



## Country codes extracted from I94_SAS_Labels_Descriptions.SAS

In [31]:
countries_df = process_country_codes(spark, 'country_codes.txt')

<Not Empty> quality check passed for country_codes.txt with 236 records.
<All lines loaded> quality check passed!


In [32]:
countries_df.head(5)

[Row(country_code=582, country_name='MEXICO Air Sea, and Not Reported (I-94, no land arrivals)'),
 Row(country_code=236, country_name='AFGHANISTAN'),
 Row(country_code=101, country_name='ALBANIA'),
 Row(country_code=316, country_name='ALGERIA'),
 Row(country_code=102, country_name='ANDORRA')]

## City codes extracted from I94_SAS_Labels_Descriptions.SAS

In [90]:
us_cities_df = process_city_data(spark, 'city_codes.txt')

<Not Empty> quality check passed for city_codes.txt with 526 records.
<All lines loaded> quality check NOT passed! Number of lines: 533, number of rows: 526


## US state codes extracted from I94_SAS_Labels_Descriptions.SAS

In [40]:
us_states_df = process_state_codes(spark, 'us_state_codes.txt')


<Not Empty> quality check passed for us_state_codes.txt with 55 records.
<All lines loaded> quality check passed!


## Demographics Data

In [34]:
demographics_pd = demographics_df.toPandas()

for column in demographics_pd.columns:
    print(f"Null data in {column}? : {any(demographics_pd[column].isnull())}")

Null data in City? : False
Null data in State? : False
Null data in Median Age? : False
Null data in Male Population? : True
Null data in Female Population? : True
Null data in Total Population? : False
Null data in Number of Veterans? : True
Null data in Foreign-born? : True
Null data in Average Household Size? : True
Null data in state_code? : False
Null data in Race? : False
Null data in Count? : False


Null values occur only in columns that are not relevant for this analysis.

In [35]:
demographics_pd.shape[0] - demographics_pd.dropna().shape[0]

16

In [36]:
demographics_pd.shape

(2891, 12)

Only 16 of 2,891 rows contain missing values, so this is not a problem.


In [92]:
us_cities_pd = us_cities_df.toPandas() 
    
def getCityCode(s, sc):
    try:
        result = \
                us_cities_pd.loc[(us_cities_pd['name'] == s.upper()) & (us_cities_pd['state_code'] == sc)]['code'].iloc[0]
    except Exception:
        result = '000'
    return result

    
getCityCodeUDF = udf(lambda c,sc: getCityCode(c,sc), StringType())

demographics_df = demographics_df.withColumn("city_code", getCityCodeUDF(sf.col("City"), sf.col("state_code")))
demographics_df.filter("city_code != '000'").count()

571

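571 rows of the demographic data could be matched to a city code from the SAS data. Because the data set contains one row per city and race, this corresponds to fewer distinct cities. Let's focus on these cities!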
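The lookup above scans a pandas dataframe for every row. As an alternative sketch, the same mapping can be held in a dictionary keyed by upper-case city name and state code, with `'000'` as the fallback. The function names and the sample entries are illustrative assumptions and are not taken from `city_codes.txt`.

```python
def build_city_index(cities):
    """Index city codes by (upper-case name, state code)."""
    return {(name.strip().upper(), state): code for code, name, state in cities}


def city_code(index, name, state_code, default="000"):
    """Look up a city code; unknown or missing cities map to the default."""
    if name is None or state_code is None:
        return default
    return index.get((name.strip().upper(), state_code), default)


# Illustrative (code, name, state_code) entries, not taken from city_codes.txt.
sample_cities = [("NYC", "NEW YORK", "NY"), ("CHI", "CHICAGO", "IL")]
index = build_city_index(sample_cities)

print(city_code(index, "Chicago", "IL"))      # CHI
print(city_code(index, "Springfield", "IL"))  # 000
```

A dictionary like this could be wrapped in the same kind of UDF. For larger reference tables, a Spark join on the normalised name and state columns would likely scale better than any UDF.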

## US airports

In [93]:
getCityCodeUDF = udf(lambda c,sc: getCityCode(c,sc), StringType())

airports_df = airports_df.withColumn("city_code", getCityCodeUDF(sf.col("municipality"), sf.col("state_code")))
airports_df.filter("city_code != '000'").count()


324

Found 324 airports located in cities of the cities data set.

In [63]:
airports_df.filter("city_code != '000'").head(5)

[Row(ident='4Z7', type='seaplane_base', name='Hyder Seaplane Base', elevation_ft=None, continent='NA', iso_country='US', iso_region='US-AK', municipality='Hyder', gps_code=None, iata_code='WHD', local_code='4Z7', coordinates='-130.009975, 55.903324', state_code='AK', city_code='HYD'),
 Row(ident='57A', type='seaplane_base', name='Tokeen Seaplane Base', elevation_ft=None, continent='NA', iso_country='US', iso_region='US-AK', municipality='Tokeen', gps_code='57A', iata_code='TKI', local_code='57A', coordinates='-133.32699585, 55.9370994568', state_code='AK', city_code='TKI'),
 Row(ident='5KE', type='seaplane_base', name='Ketchikan Harbor Seaplane Base', elevation_ft=None, continent='NA', iso_country='US', iso_region='US-AK', municipality='Ketchikan', gps_code=None, iata_code='WFB', local_code='5KE', coordinates='-131.677002, 55.349899', state_code='AK', city_code='5KE'),
 Row(ident='6CA3', type='heliport', name='Catalina Air-Sea Terminal Heliport', elevation_ft=5, continent='NA', iso_cou

# Step 3: Define the Data Model
## 3.1 Conceptual Data Model
The relationship between demographic data and immigration data is evaluated based on the city code. The code must be derived and added to the demographic data set and the airport data set. To make the mapping more robust, the city code is looked up using both the city's name and the US state code.

## 3.2 Mapping Out Data Pipelines
Build Spark views from the explored dataframes to allow easy querying of the data.

# Step 4: Run Pipelines to Model the Data 
## 4.1 Create the data model
Build the data pipelines to create the data model.

### US Airports

In [64]:
us_airports_df = process_airport_data(spark, 'airport-codes_csv.csv')

<Not Empty> quality check passed for airport-codes_csv.csv with 55,075 records.
<All lines loaded> quality check passed!
<Not Empty> quality check passed for city_codes.txt with 526 records.
<All lines loaded> quality check NOT passed! Number of lines: 533, number of rows: 526


### Immigration Data

In [67]:
immigrations_table = etl_immigrations_data(spark, 'sas_data')
immigrations_table.limit(20).toPandas()
visatype = spark.sql(""" select visatype, count(*) from immigrations_table group by (visatype)""")
visatype.toPandas()

Unnamed: 0,visatype,count(1)
0,F2,2984
1,GMB,150
2,B2,1117897
3,F1,39016
4,CPL,10
5,I1,234
6,WB,282983
7,M1,1317
8,B1,212410
9,WT,1309059


In [68]:
immigrations_table.groupBy("visa","visatype").count().orderBy("visa","visatype").show(truncate=False)


+--------+--------+-------+
|visa    |visatype|count  |
+--------+--------+-------+
|business|B1      |212410 |
|business|E1      |3743   |
|business|E2      |19383  |
|business|GMB     |150    |
|business|I       |3176   |
|business|I1      |234    |
|business|WB      |282983 |
|pleasure|B2      |1117897|
|pleasure|CP      |14758  |
|pleasure|CPL     |10     |
|pleasure|GMT     |89133  |
|pleasure|SBP     |11     |
|pleasure|WT      |1309059|
|student |F1      |39016  |
|student |F2      |2984   |
|student |M1      |1317   |
|student |M2      |49     |
+--------+--------+-------+



### Demographic Data

In [84]:
demographics_table = etl_demographics_data(spark,"us-cities-demographics.csv")

<Not Empty> quality check passed for us-cities-demographics.csv with 2,891 records.
<All lines loaded> quality check passed!
<Check existence> quality check succeeeded
<Not Empty> quality check passed for city_codes.txt with 526 records.
<All lines loaded> quality check NOT passed! Number of lines: 533, number of rows: 526


Join immigration data with demographics data and countries table

In [85]:
countries_df.createOrReplaceTempView('countries')

demographics_immigration = spark.sql("""
    SELECT 
        i94yr as year,
        i94mon as month,
        i94cit as origin_city,
        country_name,
        i94port as destination_city,
        arrdate as arrival_date,
        visatype as visatype,
        (CASE
            WHEN i94visa = '1.0' THEN 'business'
            WHEN i94visa = '2.0' THEN 'pleasure'
            WHEN i94visa = '3.0' THEN 'student'
            ELSE 'unknown'
        END) as visa,
        total_population,
        City as city,
        State as state
    FROM immigrations_table
    LEFT JOIN demographics_table ON demographics_table.city_code = immigrations_table.i94port
    LEFT JOIN countries ON countries.country_code = immigrations_table.i94cit
""")
demographics_immigration.limit(20).toPandas()

Unnamed: 0,year,month,origin_city,country_name,destination_city,arrival_date,visatype,visa,total_population,city,state
0,2016.0,4.0,296,UNITED ARAB EMIRATES,NYC,2016-04-30,B2,pleasure,,,
1,2016.0,4.0,296,UNITED ARAB EMIRATES,DAL,2016-04-30,B2,pleasure,,,
2,2016.0,4.0,296,UNITED ARAB EMIRATES,DAL,2016-04-30,B2,pleasure,,,
3,2016.0,4.0,296,UNITED ARAB EMIRATES,DAL,2016-04-30,B2,pleasure,,,
4,2016.0,4.0,296,UNITED ARAB EMIRATES,HOU,2016-04-30,B1,business,,,
5,2016.0,4.0,296,UNITED ARAB EMIRATES,NYC,2016-04-29,B2,pleasure,,,
6,2016.0,4.0,296,UNITED ARAB EMIRATES,NYC,2016-04-30,B2,pleasure,,,
7,2016.0,4.0,296,UNITED ARAB EMIRATES,NYC,2016-04-30,B2,pleasure,,,
8,2016.0,4.0,296,UNITED ARAB EMIRATES,MAA,2016-04-30,B2,pleasure,,,
9,2016.0,4.0,296,UNITED ARAB EMIRATES,CHI,2016-04-30,B2,pleasure,,,


## 4.2 Data Quality Checks

Quality Checks have been executed during loading and exploring the data and in the ETL pipeline functions. 

* <u>Demographic Data</u>\
Check for empty cells. Check that all lines of the source file were loaded.

* <u>Immigration data</u>\
Aggregate all monthly files into one dataframe and check that the result contains all of the more than 40m records. Remove columns that are not available in all files.

* <u>Airport Codes</u>\
Filter for US airports with a valid state code.
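The module `quality_check_functions` is not reproduced in this notebook. The following is a hypothetical sketch of how the two checks seen in the outputs above might be written. The function names and the decision to return the message instead of printing it are assumptions; only the message formats are taken from the notebook output.

```python
def check_not_empty(row_count, source):
    """Report whether a source produced at least one record."""
    status = "passed" if row_count > 0 else "NOT passed"
    return f"<Not Empty> quality check {status} for {source} with {row_count:,} records."


def check_all_lines_loaded(line_count, row_count):
    """Compare the number of lines in the file with the number of loaded rows."""
    if line_count == row_count:
        return "<All lines loaded> quality check passed!"
    return (f"<All lines loaded> quality check NOT passed! "
            f"Number of lines: {line_count}, number of rows: {row_count}")


print(check_not_empty(2891, "us-cities-demographics.csv"))
print(check_all_lines_loaded(533, 526))
```

Returning the message keeps the checks easy to unit test. A pipeline could additionally raise an exception when a check fails, so that a scheduled run stops early.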

## 4.3 Data dictionary

### 4.3.1 Immigration Data



<table align="left">
<tr>
    <th>Column</th><th>Description</th>
    </tr>
<tr>
    <td>cicid</td><td>Unique record ID</td>
</tr>
<tr><td>i94yr</td><td>4 digit year</td></tr>
<tr><td>i94mon</td><td>Numeric month</td></tr>
<tr><td>i94cit</td><td>3 digit code for immigrant country of birth</td></tr>
<tr><td>i94res</td><td>3 digit code for immigrant country of residence</td></tr>
<tr><td>i94port</td><td>Port of admission</td></tr>
<tr><td>arrdate</td><td>Arrival Date in the USA</td></tr>
<tr><td>i94mode</td><td>Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported)</td></tr>
<tr><td>i94addr</td><td>USA State of arrival</td></tr>
<tr><td>depdate</td><td>Departure Date from the USA</td></tr>
<tr><td>i94bir</td><td>Age of Respondent in Years</td></tr>
<tr><td>i94visa</td><td>Visa codes collapsed into three categories</td></tr>
<tr><td>count</td><td>Field used for summary statistics</td></tr>
<tr><td>dtadfile</td><td>Character Date Field - Date added to I-94 Files</td></tr>
<tr><td>visapost</td><td>Department of State where Visa was issued</td></tr>
<tr><td>occup</td><td>Occupation that will be performed in U.S.</td></tr>
<tr><td>entdepa</td><td>Arrival Flag - admitted or paroled into the U.S.</td></tr>
<tr><td>entdepd</td><td>Departure Flag - Departed, lost I-94 or is deceased</td></tr>
<tr><td>entdepu</td><td>Update Flag - Either apprehended, overstayed, adjusted to perm residence</td></tr>
<tr><td>matflag</td><td>Match flag - Match of arrival and departure records</td></tr>
<tr><td>biryear</td><td>4 digit year of birth</td></tr>
<tr><td>dtaddto</td><td>Character Date Field - Date to which admitted to U.S. (allowed to stay until)</td></tr>
<tr><td>gender</td><td>Non-immigrant sex</td></tr>
<tr><td>insnum</td><td>INS number</td></tr>
<tr><td>airline</td><td>Airline used to arrive in U.S.</td></tr>
<tr><td>admnum</td><td>Admission Number</td></tr>
<tr><td>fltno</td><td>Flight number of Airline used to arrive in U.S.</td></tr>
<tr><td>visatype</td><td>Class of admission legally admitting the non-immigrant to temporarily stay in U.S.</td></tr>
</table>
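The raw schema stores `arrdate` and `depdate` as doubles, while the joined result shows calendar dates. Assuming the usual SAS convention of counting days since 1960-01-01, the conversion can be sketched as follows; `sas_to_date` is an illustrative helper and not necessarily how `etl.py` implements it.

```python
from datetime import date, timedelta

# SAS date values count days from this reference date (assumed convention).
SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(days):
    """Convert a SAS day count (int or float) to a date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_to_date(20574.0))  # 2016-04-30
```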


### 4.3.2 Demographic Data

<table align="left">
      <tr>
        <th>Column</th>
        <th>Description</th>
      </tr>
     <tr><td>City</td><td>City Name</td></tr>
     <tr><td>State</td><td>US State where the city is located</td></tr>
     <tr><td>Median Age</td><td>Median age of the population</td></tr>
     <tr><td>Male Population</td><td>Count of male population</td></tr>
     <tr><td>Female Population</td><td>Count of female population</td></tr>
     <tr><td>Total Population</td><td>Count of total population</td></tr>
     <tr><td>Number of Veterans</td><td>Count of total veterans</td></tr>
     <tr><td>Foreign-born</td><td>Count of residents of the city who were born outside the US</td></tr>
     <tr><td>Average Household Size</td><td>Average household size in the city</td></tr>
     <tr><td>State Code</td><td>2 character US State code</td></tr>
     <tr><td>Race</td><td>Respondent race</td></tr>
     <tr><td>Count</td><td>Count of people of race</td></tr>
         
</table>


# Step 5: Complete Project Write Up
* <u>Choice of tools and technologies:</u>

To get an overview of the data and run a few queries, it seems sufficient to set up a Jupyter notebook and use Python to explore the datasets. Spark is used to load the data and store it in Parquet format, which allows fast loading and saving.

* <u>Update of data:</u>

Immigration data should be updated every month because it is collected monthly.
For demographic data, an update once a year or even less often seems sufficient.


#### If the data was increased by 100x:
Spark scales very well horizontally. Set up a Spark cluster sized according to the processing-time requirements.

#### The data populates a dashboard that must be updated on a daily basis by 7am every day.
Use a scheduler like Airflow. Use the code from this notebook to create a DAG and schedule it accordingly.

#### The database needed to be accessed by 100+ people.

Set up a distributed database like Redshift to store the data permanently in fact and dimension tables.