# Data Engineering Capstone Project

#### Project Summary

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
import pandas as pd

### Step 1: Scope the Project and Gather Data

#### Scope 
This project builds data pipelines that maintain an analytics database for querying I94 immigration, world temperature, city demographic, and airport code data. The database can be used to find patterns in immigration to the US. For example, we might ask whether people from countries with warmer or colder climates immigrate to the US in larger numbers. The analytics tables are hosted in AWS Redshift, and the data pipelines are implemented with Apache Airflow.

#### Datasets 

The following datasets were used to create the analytics database:

- **I94 Immigration Data**: This data comes from the US National Tourism and Trade Office and can be found [here](https://travel.trade.gov/research/reports/i94/historical/2016.html). Each report contains international visitor arrival statistics by world regions and select countries (including top 20), type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries).
- **World Temperature Data**: This dataset comes from Kaggle and can be found [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).
- **U.S. City Demographic Data**: This dataset contains information about the demographics of all US cities and census-designated places with a population greater than or equal to 65,000. The dataset comes from OpenDataSoft and can be found [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).
- **Airport Code Table**: This is a simple table of airport codes and corresponding cities. The airport codes may be either IATA codes (three-letter codes used in passenger reservation, ticketing, and baggage-handling systems) or ICAO codes (four-letter codes used by ATC systems and for airports that do not have an IATA code) (from Wikipedia). It comes from [here](https://datahub.io/core/airport-codes#data).

In [3]:
df = pd.read_csv("data/immigration_data_sample.csv")

In [4]:
print(df.columns)
df.head()

Index(['Unnamed: 0', 'cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port',
       'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa',
       'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd',
       'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum',
       'airline', 'admnum', 'fltno', 'visatype'],
      dtype='object')


Unnamed: 0.1,Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,2027561,4084316.0,2016.0,4.0,209.0,209.0,HHW,20566.0,1.0,HI,...,,M,1955.0,7202016,F,,JL,56582670000.0,00782,WT
1,2171295,4422636.0,2016.0,4.0,582.0,582.0,MCA,20567.0,1.0,TX,...,,M,1990.0,10222016,M,,*GA,94362000000.0,XBLNG,B2
2,589494,1195600.0,2016.0,4.0,148.0,112.0,OGG,20551.0,1.0,FL,...,,M,1940.0,7052016,M,,LH,55780470000.0,00464,WT
3,2631158,5291768.0,2016.0,4.0,297.0,297.0,LOS,20572.0,1.0,CA,...,,M,1991.0,10272016,M,,QR,94789700000.0,00739,B2
4,3032257,985523.0,2016.0,4.0,111.0,111.0,CHM,20550.0,3.0,NY,...,,M,1997.0,7042016,F,,,42322570000.0,LAND,WT
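
The `arrdate` and `depdate` columns above hold numbers such as `20566.0` rather than calendar dates. Assuming they are SAS date values (days counted from 1960-01-01, which is how SAS stores dates), a minimal conversion sketch could look like this. The helper name `sas_days_to_date` is hypothetical and not part of the project code.

```python
from datetime import date, timedelta

# SAS stores dates as the number of days since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(days):
    """Convert a SAS date value to a datetime.date.

    Missing values (None or NaN) are returned as None.
    """
    if days is None or days != days:  # NaN is the only value not equal to itself
        return None
    return SAS_EPOCH + timedelta(days=int(days))


# Value taken from the first sample row above (assumed to be a SAS date).
print(sas_days_to_date(20566.0))  # 2016-04-22
```

The result falls in April 2016, which is consistent with the `i94yr` and `i94mon` values in the same row.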


### Step 2: Explore and Assess the Data

#### Explore the Data 

Use pandas, plus Spark for the SAS-format immigration file, for exploratory data analysis to get an overview of these datasets.

#### I94 Immigration Data

In [5]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.\
        config("spark.jars.repositories", "https://repos.spark-packages.org/").\
        config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11").\
        enableHiveSupport().getOrCreate()

df_spark = spark.read.format('com.github.saurfang.sas.spark').load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')

In [7]:
df_spark.write.parquet("sas_data")

In [8]:
df_immigration = spark.read.parquet("sas_data")
print(df_immigration.count())
df_immigration.limit(10).toPandas()

3096313


Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,5748517.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,CA,20582.0,...,,M,1976.0,10292016,F,,QF,94953870000.0,11,B1
1,5748518.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,NV,20591.0,...,,M,1984.0,10292016,F,,VA,94955620000.0,7,B1
2,5748519.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20582.0,...,,M,1987.0,10292016,M,,DL,94956410000.0,40,B1
3,5748520.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1987.0,10292016,F,,DL,94956450000.0,40,B1
4,5748521.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,...,,M,1988.0,10292016,M,,DL,94956390000.0,40,B1
5,5748522.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20579.0,...,,M,1959.0,10292016,M,,NZ,94981800000.0,10,B2
6,5748523.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,...,,M,1950.0,10292016,F,,NZ,94979690000.0,10,B2
7,5748524.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,...,,M,1975.0,10292016,F,,NZ,94979750000.0,10,B2
8,5748525.0,2016.0,4.0,245.0,464.0,HOU,20574.0,1.0,FL,20581.0,...,,M,1989.0,10292016,M,,NZ,94973250000.0,28,B2
9,5748526.0,2016.0,4.0,245.0,464.0,LOS,20574.0,1.0,CA,20581.0,...,,M,1990.0,10292016,F,,NZ,95013550000.0,2,B2


#### U.S. City Demographic Data

In [9]:
df_city_demographic = pd.read_csv('data/us-cities-demographics.csv', sep=';')
print(df_city_demographic.info())
df_city_demographic.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2891 entries, 0 to 2890
Data columns (total 12 columns):
City                      2891 non-null object
State                     2891 non-null object
Median Age                2891 non-null float64
Male Population           2888 non-null float64
Female Population         2888 non-null float64
Total Population          2891 non-null int64
Number of Veterans        2878 non-null float64
Foreign-born              2878 non-null float64
Average Household Size    2875 non-null float64
State Code                2891 non-null object
Race                      2891 non-null object
Count                     2891 non-null int64
dtypes: float64(6), int64(2), object(4)
memory usage: 271.1+ KB
None


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


#### Airport Code Data

In [10]:
df_airport_code = pd.read_csv('data/airport-codes_csv.csv')
print(df_airport_code.info())
df_airport_code.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55075 entries, 0 to 55074
Data columns (total 12 columns):
ident           55075 non-null object
type            55075 non-null object
name            55075 non-null object
elevation_ft    48069 non-null float64
continent       27356 non-null object
iso_country     54828 non-null object
iso_region      55075 non-null object
municipality    49399 non-null object
gps_code        41030 non-null object
iata_code       9189 non-null object
local_code      28686 non-null object
coordinates     55075 non-null object
dtypes: float64(1), object(11)
memory usage: 5.0+ MB
None


Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"
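
The `coordinates` column above packs two numbers into one string. A small sketch for splitting it follows, assuming the order is longitude first and latitude second (the Bensalem, PA sample row suggests this, since roughly 40°N, 75°W matches Pennsylvania). The function name is hypothetical.

```python
def split_coordinates(coordinates):
    """Split a 'longitude, latitude' string into a (longitude, latitude) pair of floats.

    The longitude-first order is an assumption inferred from the sample rows.
    """
    longitude, latitude = (float(part) for part in coordinates.split(","))
    return longitude, latitude


# Value copied from the first sample row above.
lon, lat = split_coordinates("-74.93360137939453, 40.07080078125")
print(lon, lat)
```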


#### World Temperature Data

In [11]:
df_temperature = pd.read_csv('data/GlobalLandTemperaturesByCity.csv')
print(df_temperature.info())
df_temperature.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
dt                               object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                             object
Country                          object
Latitude                         object
Longitude                        object
dtypes: float64(2), object(5)
memory usage: 459.2+ MB
None


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E
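
`Latitude` and `Longitude` are loaded as strings such as `57.05N` and `10.33E`. If numeric coordinates are needed, one possible approach is sketched below. It assumes the last character is always a hemisphere letter (N, S, E, or W) and treats south and west as negative. The function name is hypothetical.

```python
def parse_hemisphere_coordinate(text):
    """Convert a string like '57.05N' or '10.33W' to a signed float.

    South and west become negative; north and east stay positive.
    """
    value = float(text[:-1])
    hemisphere = text[-1].upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"Unexpected hemisphere suffix in {text!r}")
    return -value if hemisphere in ("S", "W") else value


# Values copied from the sample rows above.
print(parse_hemisphere_coordinate("57.05N"))  # 57.05
print(parse_hemisphere_coordinate("10.33E"))  # 10.33
```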


#### Data preparation

Parse the SAS program file (`I94_SAS_Labels_Descriptions.SAS`) into pandas data frames, one per code-to-label mapping.

In [3]:
def sas_program_file_value_parser(sas_source_file, value, columns):
    """Parses SAS Program file to return value as pandas dataframe
    Args:
        sas_source_file (string): SAS source code file.
        value (string): sas value to extract.
        columns (list): list of 2 containing column names.
    Return:
        dataframe: pandas dataframe.
    """
    file_string = ''
    
    with open(sas_source_file) as f:
        file_string = f.read()
    
    file_string = file_string[file_string.index(value):]
    file_string = file_string[:file_string.index(';')]
    
    line_list = file_string.split('\n')[1:]
    codes = []
    values = []
    
    for line in line_list:
        
        if '=' in line:
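            # Split on the first '=' only, so labels containing '=' do not break unpacking
            code, val = line.split('=', 1)
            code = code.strip()
            val = val.strip()

            # Drop the surrounding single quotes, if present
            if code.startswith("'"):
                code = code[1:-1]

            if val.startswith("'"):
                val = val[1:-1]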

            codes.append(code)
            values.append(val)
        
            
    return pd.DataFrame(list(zip(codes,values)), columns=columns)
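
To make the expected input format concrete, here is a pandas-free sketch of the same parsing steps applied to an inline snippet. The snippet is an assumed excerpt written in the style of the labels file (a `value` block terminated by `;`), and `parse_sas_value_block` is a hypothetical stand-in for the function above that returns plain tuples.

```python
def parse_sas_value_block(text, value):
    """Return (code, label) pairs from the SAS value block that starts at `value`."""
    block = text[text.index(value):]      # cut everything before the block name
    block = block[:block.index(';')]      # the block ends at the first ';'
    pairs = []
    for line in block.split('\n')[1:]:    # skip the line holding the block name
        if '=' not in line:
            continue
        code, label = line.split('=', 1)
        pairs.append((code.strip().strip("'"), label.strip().strip("'").strip()))
    return pairs


# Assumed excerpt, for illustration only.
SAMPLE = """value i94model
    1 = 'Air'
    2 = 'Sea'
    3 = 'Land'
    9 = 'Not reported' ;
"""

print(parse_sas_value_block(SAMPLE, 'i94model'))
```

Stripping whitespace after removing the quotes also trims padding inside quoted labels, which the slice-based version above keeps.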

In [23]:
i94cit_res = sas_program_file_value_parser('data/I94_SAS_Labels_Descriptions.SAS', 'i94cntyl', ['code', 'country'])
i94cit_res.head()

Unnamed: 0,code,country
0,582,"MEXICO Air Sea, and Not Reported (I-94, no lan..."
1,236,AFGHANISTAN
2,101,ALBANIA
3,316,ALGERIA
4,102,ANDORRA


In [24]:
i94port = sas_program_file_value_parser('data/I94_SAS_Labels_Descriptions.SAS', 'i94prtl', ['code', 'port'])
i94port.head()

Unnamed: 0,code,port
0,ALC,"ALCAN, AK"
1,ANC,"ANCHORAGE, AK"
2,BAR,"BAKER AAF - BAKER ISLAND, AK"
3,DAC,"DALTONS CACHE, AK"
4,PIZ,"DEW STATION PT LAY DEW, AK"


In [25]:
i94mode = sas_program_file_value_parser('data/I94_SAS_Labels_Descriptions.SAS', 'i94model', ['code', 'mode'])
i94mode.head()

Unnamed: 0,code,mode
0,1,Air
1,2,Sea
2,3,Land
3,9,Not reported


In [26]:
i94addr = sas_program_file_value_parser('data/I94_SAS_Labels_Descriptions.SAS', 'i94addrl', ['code', 'addr'])
i94addr.head()

Unnamed: 0,code,addr
0,AL,ALABAMA
1,AK,ALASKA
2,AZ,ARIZONA
3,AR,ARKANSAS
4,CA,CALIFORNIA


In [5]:
i94visa = sas_program_file_value_parser('data/I94_SAS_Labels_Descriptions.SAS', 'I94VISA', ['code', 'type'])
i94visa.head()

Unnamed: 0,code,type
0,1,Business
1,2,Pleasure
2,3,Student


### Step 3: Define the Data Model

#### 3.1 Conceptual Data Model

The data model consists of the following tables:
- immigration
- us_cities_demographics
- airport_codes
- world_temperature
- i94cit_res
- i94port
- i94mode
- i94addr
- i94visa

Notes:
1. In the `immigration` table, the `i94mon` column is used as the DISTKEY and `i94yr` as the SORTKEY.
2. The following tables are distributed across all nodes (`DISTSTYLE ALL`): `us_cities_demographics`, `i94cit_res`, `i94port`, `i94mode`, `i94addr`, `i94visa`
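
As a rough illustration of the two notes above, the Redshift DDL might be kept as Python string constants, in the style of a `create_tables.py` script. This is a sketch under assumptions: the column types and the constant names are hypothetical, only a few columns are shown, and the sort key column is written as `i94yr` to match the data dictionary.

```python
# Fact table: rows are distributed by month and sorted by year (types are assumed).
IMMIGRATION_DDL = """
CREATE TABLE IF NOT EXISTS immigration (
    cicid   BIGINT,
    i94yr   INTEGER,
    i94mon  INTEGER
    -- remaining columns omitted in this sketch
)
DISTKEY (i94mon)
SORTKEY (i94yr);
"""

# Small lookup table: a full copy is kept on every node.
I94PORT_DDL = """
CREATE TABLE IF NOT EXISTS i94port (
    code VARCHAR(256),
    port VARCHAR(256)
)
DISTSTYLE ALL;
"""

for statement in (IMMIGRATION_DDL, I94PORT_DDL):
    print(statement.strip())
```

`DISTSTYLE ALL` suits the small lookup tables because joins against them then need no data movement between nodes, at the cost of storing one copy per node.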

#### 3.2 Mapping Out Data Pipelines

Steps necessary to pipeline the data into the chosen data model:

1. Begin Dummy Operator.
2. Extract the following tables from the I94 label mapping file and stage them to S3 as CSV:
    - i94cit_res
    - i94port
    - i94mode
    - i94addr
    - i94visa
3. Copy the above CSV files from S3 to Redshift.
4. Transform the immigration data files on S3 and write the results to the `immigration` Redshift table.
5. Copy CSV files from S3 to create the following tables in Redshift:
    - us_cities_demographics
    - airport_codes
    - world_temperature
6. Perform data quality checks for the tables above.
7. End Dummy Operator.
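
The write-up does not say which data quality checks are run. One common choice is to fail the pipeline when a loaded table is empty. The sketch below shows that idea with a generic DB-API cursor, using an in-memory SQLite database as a stand-in for Redshift so that it runs anywhere. The function name and the check itself are assumptions, not the project's actual operator.

```python
import sqlite3


def check_table_has_rows(cursor, table):
    """Return the row count of `table`, raising ValueError if it is empty.

    `table` is interpolated into the SQL text, so pass trusted names only.
    """
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    (count,) = cursor.fetchone()
    if count < 1:
        raise ValueError(f"Data quality check failed: {table} has no rows")
    return count


# Stand-in database with one populated table.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE airport_codes (ident TEXT)")
cur.execute("INSERT INTO airport_codes VALUES ('00A')")

print(check_table_has_rows(cur, "airport_codes"))  # 1
```

In an Airflow task, raising an exception marks the task as failed, which is what stops downstream steps from running on bad data.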

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

Create tables by running ```create_tables.py```:
```
python3 create_tables.py
```

#### 4.2 Run the data pipelines

Launch Airflow UI:

1. Initialize Airflow & Run Webserver

```
mkdir -p ./dags ./logs ./plugins
echo -e "AIRFLOW_UID=$(id -u)" > .env
docker compose up airflow-init
docker compose up
```

2. Access Airflow UI at `http://localhost:8080`

3. Run `capstone_dag` in Airflow UI

#### 4.3 Data dictionary

##### Table `immigration`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column Name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">cicid</td><td class="tg-0pky">Primary key</td></tr>
 <tr><td class="tg-0pky">i94yr</td><td class="tg-0pky">Numeric year</td></tr>
 <tr><td class="tg-0pky">i94mon</td><td class="tg-0pky">Numeric month</td></tr>
 <tr><td class="tg-0pky">i94cit</td><td class="tg-0pky">Country code where visitor was born. This is FK to the i94cit_res table</td></tr>
 <tr><td class="tg-0pky">i94res</td><td class="tg-0pky">Country code where visitor resides in. This is FK to the i94cit_res table</td></tr>
 <tr><td class="tg-0pky">i94port</td><td class="tg-0pky">Port of admission. This is FK to the i94port table</td></tr>
 <tr><td class="tg-0pky">arrdate</td><td class="tg-0pky">Arrival Date in the USA</td></tr>
 <tr><td class="tg-0pky">i94mode</td><td class="tg-0pky">Mode of transportation (1 = Air; 2 = Sea; 3 = Land; 9 = Not reported). This is FK to the i94mode table</td></tr>
 <tr><td class="tg-0pky">i94addr</td><td class="tg-0pky">USA State of arrival. This is FK to the i94addr table</td></tr>
 <tr><td class="tg-0pky">depdate</td><td class="tg-0pky">Departure Date from the USA</td></tr>
 <tr><td class="tg-0pky">i94bir</td><td class="tg-0pky">Age of Respondent in Years</td></tr>
 <tr><td class="tg-0pky">i94visa</td><td class="tg-0pky">Visa codes collapsed into three categories. This is FK to the i94visa table</td></tr>
 <tr><td class="tg-0pky">count</td><td class="tg-0pky">Field used for summary statistics</td></tr>
 <tr><td class="tg-0pky">dtadfile</td><td class="tg-0pky">Character Date Field - Date added to I-94 Files</td></tr>
 <tr><td class="tg-0pky">visapost</td><td class="tg-0pky">Department of State where Visa was issued</td></tr>
 <tr><td class="tg-0pky">occup</td><td class="tg-0pky">Occupation that will be performed in U.S</td></tr>
 <tr><td class="tg-0pky">entdepa</td><td class="tg-0pky">Arrival Flag - admitted or paroled into the U.S.</td></tr>
 <tr><td class="tg-0pky">entdepd</td><td class="tg-0pky">Departure Flag - Departed, lost I-94 or is deceased</td></tr>
 <tr><td class="tg-0pky">entdepu</td><td class="tg-0pky">Update Flag - Either apprehended, overstayed, adjusted to perm residence</td></tr>
 <tr><td class="tg-0pky">matflag</td><td class="tg-0pky">Match flag - Match of arrival and departure records</td></tr>
 <tr><td class="tg-0pky">biryear</td><td class="tg-0pky">4 digit year of birth</td></tr>
 <tr><td class="tg-0pky">dtaddto</td><td class="tg-0pky">Character Date Field - Date to which admitted to U.S. (allowed to stay until)</td></tr>
 <tr><td class="tg-0pky">gender</td><td class="tg-0pky">Non-immigrant sex</td></tr>
 <tr><td class="tg-0pky">insnum</td><td class="tg-0pky">INS number</td></tr>
 <tr><td class="tg-0pky">airline</td><td class="tg-0pky">Airline used to arrive in U.S</td></tr>
 <tr><td class="tg-0pky">admnum</td><td class="tg-0pky">Admission Number</td></tr>
 <tr><td class="tg-0pky">flightno</td><td class="tg-0pky">Flight number of Airline used to arrive in U.S</td></tr>
 <tr><td class="tg-0pky">visa_type</td><td class="tg-0pky">Class of admission legally admitting the non-immigrant to temporarily stay in U.S</td></tr>
</table>
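
The foreign keys described above are what make the coded columns readable. As a sketch, the query below counts arrivals per country of residence by joining `immigration.i94res` to `i94cit_res.code`. It runs against an in-memory SQLite database as a stand-in for Redshift. The three `immigration` rows are toy values invented for the example (not real records), and the two country codes are taken from the lookup output shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal versions of the two tables; only the columns used by the join are created.
cur.execute("CREATE TABLE i94cit_res (code INTEGER, country TEXT)")
cur.execute("CREATE TABLE immigration (cicid INTEGER, i94res INTEGER)")
cur.executemany("INSERT INTO i94cit_res VALUES (?, ?)",
                [(236, "AFGHANISTAN"), (101, "ALBANIA")])
# Toy rows for illustration only.
cur.executemany("INSERT INTO immigration VALUES (?, ?)",
                [(1, 236), (2, 236), (3, 101)])

ARRIVALS_BY_COUNTRY = """
SELECT c.country, COUNT(*) AS arrivals
FROM immigration AS i
JOIN i94cit_res AS c ON c.code = i.i94res
GROUP BY c.country
ORDER BY arrivals DESC, c.country
"""

rows = cur.execute(ARRIVALS_BY_COUNTRY).fetchall()
print(rows)  # [('AFGHANISTAN', 2), ('ALBANIA', 1)]
```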

##### Table `i94cit_res`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column Name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">code</td><td class="tg-0pky">Unique country code</td></tr>
 <tr><td class="tg-0pky">country</td><td class="tg-0pky">Name of country</td></tr>
</table>

##### Table `i94port`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column Name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">code</td><td class="tg-0pky">Unique port code</td></tr>
 <tr><td class="tg-0pky">port</td><td class="tg-0pky">Name of port</td></tr>
</table>

##### Table `i94mode`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column Name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">code</td><td class="tg-0pky">Unique mode code</td></tr>
 <tr><td class="tg-0pky">mode</td><td class="tg-0pky">Name of mode</td></tr>
</table>

##### Table `i94addr`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column Name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">code</td><td class="tg-0pky">Unique address code</td></tr>
 <tr><td class="tg-0pky">addr</td><td class="tg-0pky">Name of address</td></tr>
</table>

##### Table `i94visa`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column Name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">code</td><td class="tg-0pky">Unique visa type code</td></tr>
 <tr><td class="tg-0pky">type</td><td class="tg-0pky">Name of visa type</td></tr>
</table>

##### Table `us_cities_demographics`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">city</td><td class="tg-0pky">City Name</td></tr>
 <tr><td class="tg-0pky">state</td><td class="tg-0pky">US State where city is located</td></tr>
 <tr><td class="tg-0pky">median_age</td><td class="tg-0pky">Median age of the population</td></tr>
 <tr><td class="tg-0pky">male_population</td><td class="tg-0pky">Count of male population</td></tr>
 <tr><td class="tg-0pky">female_population</td><td class="tg-0pky">Count of female population</td></tr>
 <tr><td class="tg-0pky">total_population</td><td class="tg-0pky">Count of total population</td></tr>
 <tr><td class="tg-0pky">count</td><td class="tg-0pky">Count of the city's residents of the given race</td></tr>
 <tr><td class="tg-0pky">race</td><td class="tg-0pky">Respondent race</td></tr>
 <tr><td class="tg-0pky">state_code</td><td class="tg-0pky">US state code</td></tr>
 <tr><td class="tg-0pky">average_household_size</td><td class="tg-0pky">Average city household size</td></tr>
 <tr><td class="tg-0pky">foreign_born</td><td class="tg-0pky">Count of residents of the city who were not born in the United States</td></tr>
 <tr><td class="tg-0pky">number_of_veterans</td><td class="tg-0pky">Count of total Veterans</td></tr>
</table>

##### Table `world_temperature`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">dt</td><td class="tg-0pky">Starts in 1750 for average land temperature and 1850 for max and min land temperatures and global ocean and land temperatures</td></tr>
 <tr><td class="tg-0pky">averagetemperature</td><td class="tg-0pky">Global average land temperature in Celsius</td></tr>
 <tr><td class="tg-0pky">averagetemperatureuncertainty</td><td class="tg-0pky">The 95% confidence interval around the average</td></tr>
 <tr><td class="tg-0pky">city</td><td class="tg-0pky">Name of the city</td></tr>
 <tr><td class="tg-0pky">country</td><td class="tg-0pky">Name of the country</td></tr>
 <tr><td class="tg-0pky">latitude</td><td class="tg-0pky">Latitude of the location</td></tr>
 <tr><td class="tg-0pky">longitude</td><td class="tg-0pky">Longitude of the location</td></tr>
</table>

##### Table `airport_codes`

<table class="tg" align="left">
  <tr>
    <th class="tg-0pky">Column name</th>
    <th class="tg-0pky">Description</th>
  </tr>
 <tr><td class="tg-0pky">ident</td><td class="tg-0pky">The text identifier used in the OurAirports URL</td></tr>
 <tr><td class="tg-0pky">type</td><td class="tg-0pky">The type of the airport. Allowed values are "closed_airport", "heliport", "large_airport", "medium_airport", "seaplane_base", and "small_airport"</td></tr>
 <tr><td class="tg-0pky">name</td><td class="tg-0pky">The official airport name</td></tr>
 <tr><td class="tg-0pky">elevation_ft</td><td class="tg-0pky">The airport elevation MSL in feet</td></tr>
 <tr><td class="tg-0pky">continent</td><td class="tg-0pky">The code for the continent where the airport is (primarily) located</td></tr>
 <tr><td class="tg-0pky">iso_country</td><td class="tg-0pky">The two-character ISO 3166:1-alpha2 code for the country where the airport is (primarily) located</td></tr>
 <tr><td class="tg-0pky">iso_region</td><td class="tg-0pky">An alphanumeric code for the high-level administrative subdivision of a country where the airport is primarily located</td></tr>
 <tr><td class="tg-0pky">municipality</td><td class="tg-0pky">The primary municipality that the airport serves</td></tr>
 <tr><td class="tg-0pky">gps_code</td><td class="tg-0pky">The code that an aviation GPS database (such as Jeppesen's or Garmin's) would normally use for the airport</td></tr>
 <tr><td class="tg-0pky">iata_code</td><td class="tg-0pky">The three-letter IATA code for the airport</td></tr>
 <tr><td class="tg-0pky">local_code</td><td class="tg-0pky">The local country code for the airport</td></tr>
</table>

### Step 5: Complete Project Write Up

##### Technologies

- Apache Airflow: scheduling and monitoring ETL pipelines to keep the analytics database up to date
- Redshift: storing analytics tables in a distributed manner

##### Data Schedule Proposal
The pipeline will be scheduled monthly because the immigration data, the primary data source, has monthly granularity.

##### Possible Scenarios, Changes and Approach

 * **The data was increased by 100x:** use the partitioning functionality in the DAG. It may also be necessary to use cloud services such as AWS EMR to run Spark for more efficient processing of big data.

 * **The data populates a dashboard that must be updated on a daily basis by 7am every day:** update the DAG schedule accordingly and make sure the data needed for the dashboard is available in time.

 * **The database needed to be accessed by 100+ people:** implement authorization and scale the cluster horizontally or vertically to accommodate the traffic.