# Single Source of Truth Database for Cons&Analytics Inc.
### Data Engineering Capstone Project

#### Project Summary
The purpose of this project is to establish a single source of truth database around **I94 US immigration data** for **Cons&Analytics Inc.**, a startup consulting company in the area of tourism and trade. The database is meant to provide insights into business opportunities arising from immigration.

A cloud-based solution enables **data analysts** to implement customer-specific use cases faster and gives **business consultants** a flexible database for answering day-to-day analytical questions about trends in immigration patterns, considering basic immigration profiles, purpose of travel, visa status and weather impacts.

The technologies used in this project are **AWS S3, EMR, Glue, Athena and Apache Airflow**. Python, PySpark and SQL are the main programming languages used to build the data pipeline for cleansing and loading the source data into the AWS S3 data lake.


The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# imports and installs
import pandas as pd

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

This is the relevant scope:
- **Data understanding:**
    - Perform exploratory analysis
    - Understand business meaning of datasets
    - Identify data quality issues
    - Find potential new datasets for enrichment
    - Decide on scope of data
- **Data Architecture:**
    - Define single source of truth database providing </br>
        1) **data analysts** a data lake for advanced and individual customer use cases </br>
        2) **business consultants** a structured and flexible data model for day to day business analytics and consulting work </br>
    - Derive conceptual datamodel: Dimension and Fact tables
    - Decide on Technology
- **ETL-Pipeline:**
    - Define Cleansing & Loading Strategy:
        - Local cleansing vs cloud cleansing using EMR
        - Semi vs fully-automated dataloading using Airflow
        - Incremental vs Full Loading of Facts and Dimensions
        - Update cycle of data (e. g. monthly)
        - Decide on Technologies
    - Decide on Data Quality Measurements
        - Automated Tests e.g. using Record Counts, Null Counts
        - Unit tests e. g. test transformation functions
- **Develop and test the ETL pipeline**

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

#### 1) I94 Immigration Data
This dataset came from the US National Tourism and Trade Office and documents daily arrivals of immigrants from countries all over the world into the US and additional information such as type of visa, mode of transportation, age groups, states visited (first intended address only), and the top ports of entry (for select countries). </br>
https://travel.trade.gov/research/reports/i94/historical/2016.html

In [2]:
# Load sample data
df = pd.read_csv('./data/immigration_data_sample.csv', index_col=[0])
df.head().T

| | 2027561 | 2171295 | 589494 | 2631158 | 3032257 |
|---|---|---|---|---|---|
| cicid | 4084316.0 | 4422636.0 | 1195600.0 | 5291768.0 | 985523.0 |
| i94yr | 2016.0 | 2016.0 | 2016.0 | 2016.0 | 2016.0 |
| i94mon | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| i94cit | 209.0 | 582.0 | 148.0 | 297.0 | 111.0 |
| i94res | 209.0 | 582.0 | 112.0 | 297.0 | 111.0 |
| i94port | HHW | MCA | OGG | LOS | CHM |
| arrdate | 20566.0 | 20567.0 | 20551.0 | 20572.0 | 20550.0 |
| i94mode | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 |
| i94addr | HI | TX | FL | CA | NY |
| depdate | 20573.0 | 20568.0 | 20571.0 | 20581.0 | 20553.0 |


- Note the full dataset is available in SAS data format by month (2016 Jan to Dec)
- Only the most relevant fields will be used
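The `arrdate` and `depdate` values in the sample look like SAS date numbers, which count days since 1960-01-01. The following is a minimal sketch of how such values could be converted and how a length of stay could be derived; it assumes both columns are SAS dates, and the helper names `sas_to_date` and `length_of_stay` are hypothetical, not taken from the project's ETL scripts.

```python
import math
from datetime import date, timedelta

# SAS stores dates as the number of days since 1960-01-01.
SAS_EPOCH = date(1960, 1, 1)


def _is_missing(value):
    """True for None and for NaN (how pandas represents missing numbers)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sas_to_date(sas_days):
    """Convert a SAS date number to a datetime.date; missing stays None."""
    if _is_missing(sas_days):
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


def length_of_stay(arrdate, depdate):
    """Days between arrival and departure, or None if either is missing."""
    if _is_missing(arrdate) or _is_missing(depdate):
        return None
    return int(depdate) - int(arrdate)


# First record of the sample above: arrdate=20566.0, depdate=20573.0
arrival = sas_to_date(20566.0)
stay = length_of_stay(20566.0, 20573.0)
print(arrival.isoformat(), stay)  # prints: 2016-04-22 7
```

The converted date falls in April 2016, which is consistent with `i94yr` and `i94mon` in the same record.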

#### 2) Temperature Data
This dataset came from Kaggle and includes monthly average temperatures since 1743 across the globe. It can therefore be used to establish correlations between US immigration and weather trends. </br>
https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data

In [3]:
# Load sample data
df_temp = pd.read_csv("./data/GlobalLandTemperaturesByCity.csv")
df_temp.head()

| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1743-11-01 | 6.068 | 1.737 | Århus | Denmark | 57.05N | 10.33E |
| 1 | 1743-12-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 2 | 1744-01-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 3 | 1744-02-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 4 | 1744-03-01 | | | Århus | Denmark | 57.05N | 10.33E |


- the data used for the project is limited to the last 20 years of data and aggregated by country and month
- the detailed extraction and cleaning process is described in **temperature.ipynb**
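The restriction to the last 20 years and the aggregation by country and month can be sketched as follows. This is a dependency-free illustration rather than the pandas code in **temperature.ipynb**; the function name `monthly_means` is hypothetical, "last 20 years" is assumed to be counted back from the latest year with a measurement, and the 2012/2013 rows are made up for the example.

```python
import csv
import io
from collections import defaultdict


def monthly_means(rows, years=20):
    """Mean temperature per (country, month) over the last `years` years."""
    parsed = []
    for row in rows:
        if not row["AverageTemperature"]:  # skip missing measurements
            continue
        year, month = int(row["dt"][:4]), int(row["dt"][5:7])
        parsed.append((row["Country"], year, month, float(row["AverageTemperature"])))
    if not parsed:
        return {}
    # Assumption: the window ends at the latest year that has a measurement.
    first_year = max(p[1] for p in parsed) - years + 1
    sums = defaultdict(lambda: [0.0, 0])
    for country, year, month, temp in parsed:
        if year >= first_year:
            sums[(country, month)][0] += temp
            sums[(country, month)][1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


# The 1743 row comes from the sample above; the later rows are illustrative.
sample = io.StringIO(
    "dt,AverageTemperature,Country\n"
    "1743-11-01,6.068,Denmark\n"
    "2012-11-01,5.0,Denmark\n"
    "2013-11-01,7.0,Denmark\n"
    "2013-12-01,,Denmark\n"
)
means = monthly_means(csv.DictReader(sample))
print(means)  # prints: {('Denmark', 11): 6.0}
```

The 1743 measurement falls outside the 20-year window and the empty December value is skipped, so only the two November rows contribute to the mean.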

#### 3) Visa Categories
The immigration dataset contains a visatype (e. g. E1). A more detailed description of the visa categories allows for clearer insights into the purpose of travel (in combination with i94visa). The dataset is scraped from the website of the "Bureau of Consular Affairs" using Beautiful Soup.
</br>
https://travel.state.gov/content/travel/en/us-visas/visa-information-resources/all-visa-categories.html

- this is a sample of the cleaned version
- the detailed extraction and cleaning process is described in **visa-categories.ipynb**

In [4]:
df_visa = pd.read_csv("../staging/visa_categories.csv", sep=";")
df_visa.head()

| | visa_category | visa_group | visa_desc | visa | visa_id |
|---|---|---|---|---|---|
| 0 | Unknown Visa Categories | Unknown Visa | Unknown | | 1 |
| 1 | Immigrant Visa Categories | Immediate Relative & Family Sponsored | Spouse of a U.S. Citizen | IR1 | 2 |
| 2 | Immigrant Visa Categories | Immediate Relative & Family Sponsored | Spouse of a U.S. Citizen | CR1 | 3 |
| 3 | Immigrant Visa Categories | Immediate Relative & Family Sponsored | Spouse of a U.S. Citizen awaiting approval of ... | K3 | 4 |
| 4 | Immigrant Visa Categories | Immediate Relative & Family Sponsored | Fiancé(e) to marry U.S. Citizen & live in U.S. | K1 | 5 |


#### 4) Countries and Continents
This dataset contains ISO codes for identifying countries and continents. It enriches the I94 immigration data and allows for better drill-down opportunities (e. g. by continent and country).

https://datahub.io/JohnSnowLabs/country-and-continent-codes-list#resource-country-and-continent-codes-list-csv

In [5]:
df_country = pd.read_csv("../staging/countries.csv", sep=";")
df_country.head()

| | country_id | country_name | continent_name |
|---|---|---|---|
| 0 | DZ | Algeria, People's Democratic Republic of | Africa |
| 1 | AO | Angola, Republic of | Africa |
| 2 | AO | Angola, Republic of | Africa |
| 3 | BJ | Benin, Republic of | Africa |
| 4 | BW | Botswana, Republic of | Africa |


- this is a sample of the cleaned version
- the detailed extraction and cleaning process is described in **countries.ipynb**

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

#### Cleaning Steps
Document steps necessary to clean the data

### Exploratory Analysis and Cleansing Steps are documented in the following notebooks by dataset:
- i94-exploratory-analysis.ipynb
- temperature.ipynb 
- visa-categories.ipynb
- countries.ipynb

#### List of Analytical Questions derived from the exploratory analysis helping to establish the datamodel:

- How many people arrive at a given state in a given month / throughout the year?
- What is the purpose for arriving to the US (business or travel)?
- What is the age and gender distribution of immigrants?
- Which continents/countries do people come from?
- What is the proportion of immigrants with US citizenship?
- What visa types are used frequently?
- How is weather impacting the immigration numbers?
- How long are people staying on average in the US?
- Are most people arriving by air?
- Which is the first home address after arriving to the US?

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model





<img src="../imgs/datamodel.png" style="width: 1000px;">

- The Datamodel follows mostly a **STAR SCHEMA**
    - Dimension Tables:
        - countries
        - us_states
        - dates
        - mode
        - purpose
        - visa categories
    - Fact Tables:
        - i94
</br>
- The temperature table is implemented in **SNOWFLAKE** form because
    - it has an optional relationship to i94 facts and
    - the country dimension can be directly used with the i94 facts when temperature information is irrelevant
</br></br>
- note that country_id (from countries and temperature) can be used to join to cit_country_id as well as to res_country_id
- data_quality is a separate table with data quality metadata for the overall data model

- Benefits of the **STAR SCHEMA**
    - Simplifies Queries (less joins needed)
    - Fast Aggregation (e. g. sums, counts)
    - Easy to understand and query
</br></br>
- Benefits of the **SNOWFLAKE SCHEMA**
    - Needs less disk space (not a concern here)
    - Protection from data integrity issues
    - Easier to maintain

- The data model is available in **AWS S3 (parquet) and can be directly queried via Athena**
- Example Query:
    - Which continents/countries do people come from?

In [6]:
"""
SELECT 
 co.continent_name
,co.country_name
,sum(count) as number_immigrants
FROM "model_db"."i94_parquet" im
LEFT OUTER JOIN "model_db"."countries_parquet" co
ON co.country_id = im.cit_country_id
GROUP BY
 co.continent_name
,co.country_name
order by co.continent_name, co.country_name
;
""";
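The shape of this query, and the point that `country_id` can be joined to either `cit_country_id` or `res_country_id`, can be tried locally without Athena. The sketch below uses an in-memory SQLite database as a stand-in; the table layouts are reduced to the columns the query needs, and all rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE countries (country_id TEXT, country_name TEXT, continent_name TEXT);
    CREATE TABLE i94 (cit_country_id TEXT, res_country_id TEXT, "count" INTEGER);
    -- Illustrative rows only, not taken from the real datasets.
    INSERT INTO countries VALUES
        ('DE', 'Germany', 'Europe'),
        ('DZ', 'Algeria', 'Africa');
    INSERT INTO i94 VALUES
        ('DE', 'DE', 1),
        ('DE', 'DZ', 1),
        ('DZ', 'DZ', 1);
    """
)

# Same structure as the Athena example; {key} selects the fact-table column
# that the single countries dimension is joined to.
query = """
    SELECT co.continent_name, co.country_name, SUM(im."count") AS number_immigrants
    FROM i94 im
    LEFT OUTER JOIN countries co ON co.country_id = im.{key}
    GROUP BY co.continent_name, co.country_name
    ORDER BY co.continent_name, co.country_name
"""

by_citizenship = conn.execute(query.format(key="cit_country_id")).fetchall()
by_residence = conn.execute(query.format(key="res_country_id")).fetchall()
print(by_citizenship)  # [('Africa', 'Algeria', 1), ('Europe', 'Germany', 2)]
print(by_residence)    # [('Africa', 'Algeria', 2), ('Europe', 'Germany', 1)]
```

One dimension table serves both roles, so no second copy of the country data is needed to analyze citizenship and residence separately.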

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

<img src="../imgs/architecture.png" style="width: 1000px;">

- **Airflow is used to manage the execution and monitoring of the overall data pipeline**
    - A **monthly schedule** picks up the locally stored scripts (ETL and Data Quality) and moves the partially prepared data (see Data Transformation Details) from local Staging to the Amazon S3 Staging Bucket
    - The Airflow Dag spins up an Amazon EMR Cluster and executes the above mentioned scripts as steps:</br>
        **1) Staging to Model** - Extracts data from S3 (Staging), transforms and loads data into the SCHEMA implemented on S3 (Model) </br>
        **2) Data Quality** - Reads current dataload from S3 (Model), performs data quality tests and saves results in "data_quality_parquet" folder on S3 (Model)
    - The EMR Cluster is terminated automatically on success.
    - The AWS Glue Crawler is started to update the Athena Database schema to be ready for querying

- **Data Transformation Details**
    - Dimension tables ***countries***, ***us_states***, ***visa_categories*** and ***temperature*** are prepared locally using python pandas: 
        - the data is small and rather static (e. g. continents / countries and related iso-codes will rarely change over time)
        - additionally, the cleansing of the visa categories is semi-automated due to the unstructured html-datasource and needs a visual examination before providing it to data consumers
        - the data is cleansed and stored locally in csv-file format
    - ***i94 immigration facts*** are transformed using AWS EMR and pyspark:
        - these monthly SAS files are rather large (~500MB per file) which makes local transformation inefficient and difficult
        - therefore the data is moved to S3 in its raw-form and transformed by the EMR cluster
    - ***data "ready-for-upload"*** to S3 staging is stored in the local ***staging folder***

- **Loading Strategy**
    - Dimension tables:
        - ***countries***, ***us_states***, ***visa_categories*** </br> are overwritten with each (monthly) loading; even though a monthly update is not required, the data is uploaded along the i94 immigration data to ensure that potential updates are not missed
        - ***dates*** depends on the i94 immigration facts and therefore is appended monthly to the corresponding parquet folder partitioned by year/month
        - ***modes*** is defined in the etl-script and loaded with every run, previous data is overwritten
        - ***purpose*** is defined in the etl-script and loaded with every run, previous data is overwritten
        - ***temperature*** is overwritten with each (monthly) loading
    - Fact tables:
        - ***i94*** is monthly data and requires at most a monthly provisioning, therefore the data is appended monthly to the corresponding parquet folder partitioned by year/month

- **Data Consumption** </br>
    - Data Analysts can access the Datalake directly and enrich the data for customer specific use cases and build machine learning models e. g. predictive model considering temperature
    - Business Consultants will be enabled to access the data via Amazon Athena since it is very easy to use

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

- Local transformation pipelines are stored in the following notebooks:
    - **countries.ipynb** outputs ***countries_mapping.csv***, ***countries.csv*** and ***us_states.csv*** to **local staging**
    - **visa-categories.ipynb** outputs ***visa_categories.csv*** to **local staging**
    - **temperature.ipynb** outputs ***temperature.csv*** to **local staging**

### Airflow Graph

<img src="../imgs/airflow_graph.png" style="width: 1500px;">

- **spark_submit.py** contains the dag for the pipeline
- ***load_data to S3*** </br>
    - flag *load_sas*
        - False: i94 sas files must be uploaded to AWS S3 staging prior to starting the dag either
            - manually or 
            - with the script ***transfer-files-to-S3.py*** (recommended way)
        - True: picks up the file based on the airflow schedule (local docker environment/large filesize can cause timeouts)
    - *.csv files are always loaded
- **etl-prod.py** Staging to model Step 1
- **dq-prod.py** Data Quality Step 2
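The two EMR steps can be described as plain dictionaries in the step format that EMR accepts (a `HadoopJarStep` that runs `spark-submit` through `command-runner.jar`). The sketch below is an assumption about how such definitions could be built, not a copy of **spark_submit.py**; the helper names `spark_step` and `build_steps`, the bucket name `my-bucket` and the chosen `ActionOnFailure` value are hypothetical. The script paths, arguments and Spark package follow the manual commands listed under Installation.

```python
def spark_step(name, script, args, packages=None):
    """Build one EMR step that runs spark-submit through command-runner.jar."""
    command = ["/usr/bin/spark-submit"]
    if packages:
        command += ["--packages", packages]
    command += [script] + [str(a) for a in args]
    return {
        "Name": name,
        "ActionOnFailure": "CANCEL_AND_WAIT",  # assumption: stop at the first failing step
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": command},
    }


def build_steps(bucket, year, month):
    """Step 1 loads staging into the model, step 2 runs the data quality tests."""
    sas_file = f"i94_{year}-{month:02d}_sub.sas7bdat"
    scripts = f"s3://{bucket}/scripts"
    return [
        spark_step("Staging to Model", f"{scripts}/etl-prod.py", [bucket, sas_file],
                   packages="saurfang:spark-sas7bdat:2.1.0-s_2.11"),
        spark_step("Data Quality", f"{scripts}/dq-prod.py", [bucket, year, month]),
    ]


# "my-bucket" is a placeholder bucket name.
for step in build_steps("my-bucket", 2016, 1):
    print(step["Name"], "->", " ".join(step["HadoopJarStep"]["Args"]))
```

Keeping the step definitions in a small function makes it easy to derive the file name and the year/month arguments from the scheduled execution date of the DAG run.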

### Airflow Gantt

<img src="../imgs/airflow_gant.png" style="width: 1500px;">

### AWS Overview

- AWS S3 Datalake

<img src="../imgs/s3_datalake.png" style="width: 1000px;">

- AWS S3 Staging

<img src="../imgs/s3_staging.png" style="width: 1000px;">

- AWS S3 I94

<img src="../imgs/s3_staging_i94.png" style="width: 1000px;">

- AWS S3 Scripts

<img src="../imgs/s3_scripts.png" style="width: 1000px;">

- AWS EMR Steps

<img src="../imgs/emr_steps.png" style="width: 1000px;">

- AWS S3 Model

<img src="../imgs/s3_model.png" style="width: 1000px;">

- AWS S3 model i94

<img src="../imgs/s3_model_i94.png" style="width: 1000px;">

- AWS S3 i94 parquet files

<img src="../imgs/s3_model_i94_parquet.png" style="width: 1000px;">

- AWS Glue Crawler

<img src="../imgs/glue_crawler.png" style="width: 1000px;">

- AWS Glue Tables

<img src="../imgs/glue_tables.png" style="width: 1000px;">

- AWS Athena Sample Query: Total number of immigrants in 2016 by month

<img src="../imgs/athena_number_immigrants.png" style="width: 1000px;">

- AWS Athena Sample Query: Tourists vs Business People from Germany to California/Florida by month

In [7]:
"""
    with t1 as (
    SELECT 
     co.country_name
    ,st.state_name
    ,im.month
    ,round(tp.temperature_mean,1) as mean_temperature
    ,sum(case when pp.purpose_desc='Pleasure' then count else 0 end) as number_tourists
    ,sum(case when pp.purpose_desc='Business' then count else 0 end) as number_business
    ,sum(count) as number_immigrants
    FROM "model_db"."i94_parquet" im
    LEFT OUTER JOIN "model_db"."countries_parquet" co
    ON co.country_id = im.res_country_id
    LEFT OUTER JOIN "model_db"."us_states_parquet" st
    ON st.state_id = im.state_id
    LEFT OUTER JOIN "model_db"."purpose_parquet" pp
    ON pp.purpose_id = im.purpose_id
    LEFT OUTER JOIN "model_db"."temperature_parquet" tp
    ON tp.country_id = im.res_country_id
    and tp.month = im.month
    where 
        co.country_id = 'DE'
    and pp.purpose_desc in ('Business', 'Pleasure')
    and im.state_id in ('FL','CA')
    GROUP BY
     co.country_name
    ,im.month
    ,st.state_name
    ,tp.temperature_mean
    order by co.country_name, st.state_name, cast(im.month as integer)
    )
    select
     t1.country_name
    ,t1.state_name
    ,t1.month
    ,t1.mean_temperature
    ,t1.number_immigrants
    , round(cast(number_tourists as double) / cast(number_immigrants as double) *100,0) as rel_tourists
    , round(cast(number_business as double) / cast(number_immigrants as double) *100,0) as rel_business
    from t1
    ;
""";

<img src="../imgs/athena_germany_to_CA_FL.png" style="width: 1200px;">

- The share of tourists travelling to Florida stays consistently at ~90% throughout the year
- The most popular months for visiting Florida are October and March, when it is rather cold in Germany
- In contrast, the share of business travel to California fluctuates between 11% and 53%
- August and September are the busiest months for travel from Germany to California, with most visitors being tourists

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

- Integrity is ensured by testing completeness of foreign key columns (no-nulls) although constraints are not enforced
- The final data model is checked whether it received records (record count > 0)
- A data quality report is created with each load which can be reviewed in more depth by a data engineer
- The data pipeline will fail if one of the data quality tests fails
- Cleansing steps are documented, tested and examined in the corresponding development scripts (folder dev)
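The checks described above (record count > 0, no nulls in foreign key columns, and a failing pipeline when any test fails) can be sketched in plain Python. The production checks run in PySpark inside **dq-prod.py**; the function names, the report layout and the sample rows below are assumptions made for illustration only.

```python
def run_quality_checks(table, rows, key_columns):
    """Return one report entry per check: record count > 0 and no nulls in keys."""
    report = [{"table": table, "check": "record_count",
               "value": len(rows), "passed": len(rows) > 0}]
    for column in key_columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        report.append({"table": table, "check": f"null_count:{column}",
                       "value": nulls, "passed": nulls == 0})
    return report


def assert_all_passed(report):
    """Raise so that the calling pipeline task fails when any check failed."""
    failed = [entry["check"] for entry in report if not entry["passed"]]
    if failed:
        raise ValueError(f"data quality checks failed: {failed}")


# Made-up rows: the second record lacks a citizenship country.
rows = [
    {"i94_id": 1, "cit_country_id": "DE", "state_id": "FL"},
    {"i94_id": 2, "cit_country_id": None, "state_id": "CA"},
]
report = run_quality_checks("i94", rows, ["cit_country_id", "state_id"])
try:
    assert_all_passed(report)
except ValueError as error:
    print(error)  # data quality checks failed: ['null_count:cit_country_id']
```

Separating the report from the pass/fail decision allows the full report to be stored for later review even when the run is marked as failed.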

- Athena data_quality report

<img src="../imgs/athena_data_quality_report.png" style="width: 1000px;">

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

In [8]:
pd.set_option('display.max_colwidth', 250)
df_dict = pd.read_csv("data-dict.csv", sep=";")

In [9]:
df_dict

| | Table | Field | Description |
|---|---|---|---|
| 0 | i94 | i94_id | monotonically increasing id used as a primary key |
| 1 | i94 | cit_country_id | Country Codes (2-digit-ISO-Norm) represent the country of citizenship. |
| 2 | i94 | res_country_id | Country Codes (2-digit-ISO-Norm) represent the country of residence. |
| 3 | i94 | state_id | US-State Codes (2-digit), first address (state) after arrival |
| 4 | i94 | mode_id | Arrival mode Integer see mode table |
| 5 | i94 | purpose_id | Travel purpose Integer see purpose table |
| 6 | i94 | visa_id | Type of Visa, Integer see visa_categories |
| 7 | i94 | arrival_date | date of arrival to the US |
| 8 | i94 | cic_id | Attribute: identifies the underlying immigration record |
| 9 | i94 | gender | Attribute: M=Male, F=Female, U=Unknown |


#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
    - **Amazon Simple Storage Service (S3)** is storage for the Internet. It is designed to make web-scale computing easier for developers.
        </br>In this project S3 is used as a Datalake for the following reasons:
        - Scalability:
            - Once the demand for more storage increases, S3 can easily scale up at low cost rates
            - A huge upfront investment into local hardware and maintenance of the infrastructure would be a risk
            - Integration with other data (cloud-native) will be simple
            - Data that no longer needs to be readily available can be archived at even lower cost rates
        - Availability:
            - Data is available around the clock and accessible from anywhere, which is a huge benefit for business consultants who meet customers across the US or work from home
        - Security:
            - Data is secured in the company's Virtual Private Cloud
            - Data can be encrypted if needed
        - Integration:
            - S3 Integrates easily with other AWS Services, e. g. EMR and Athena
            - Advanced services such as AWS Forecasting can easily access the data without the need for further data movement
    - **Amazon Elastic Map Reduce (EMR)** is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark
        - The Immigration SAS datafiles are rather huge (~500MB per file) and are at the core of the current datamodel
        - Since Apache Spark is designed to process huge amounts of data in parallel it is a perfect choice for the data transformation
        - The EMR Cluster is used very efficiently for that purpose: the cluster spins up once needed and shuts down once the job is completed
        - This means it is very resource efficient and therefore good for the environment and cost effective (since it is not running around the clock)
    - **Amazon Athena** is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
        - Integrates seamlessly with the S3 model bucket and parquet files
        - Since the parquet files already provide a schema (as defined in the transformation process), it can be re-used for Athena
        - A simple AWS Glue crawler can update Athena with new partitions once they arrive
        - Athena uses standard SQL and therefore on-demand queries can be executed without needing to maintain a separate infrastructure / advanced databases such as Redshift
        - Also ideal for a startup with a small number of users; cost and resource efficient
    - **Apache Airflow** is an open-source workflow management platform.
        - Because Airflow monitors the overall pipeline, it is not only resource efficient but also safeguards data quality
        - Scheduling capabilities lead to full automation of workflows
        - Clear transparency about which steps completed successfully and which did not (easier debugging)
        - The graphical interface shows the completion time of each task so that long-running statements can be spotted and improved
       
</br>

* Propose how often the data should be updated and why.
    - The immigration data files each cover a whole month of data, so it is reasonable to update the data model once a month. This helps to keep track of the latest developments in immigration and to discover trends
    - The dimension tables are partially static so an update is reasonable once major changes happen (see Data Transformation Details above)
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
       - AWS S3 will be able to handle the additional data with no effort.
       - The data is stored in a columnar file format (parquet) which has the benefit of fast data aggregations, so no change required.
       - The Apache Spark cluster is also scalable, it would be just a matter of adding additional resources to the cluster.
       - Depending on the major use cases AWS Athena can still make sense. The data is partitioned by year and month and therefore does not need to read through all the data if e. g. analytics are based on the more recent data. 
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
    - Due to the use of Apache Airflow it is just a matter of re-scheduling the workflow. 
    - I would generally recommend migrating the local Airflow solution to AWS as well. There is a relatively new managed service (MWAA) for that, which allows for better scalability. 
    - A dashboard such as Tableau can be connected to Athena to feed the data.
 * The database needed to be accessed by 100+ people.
   - Athena would still work; however, it would make sense to build data views that are specific to the use cases and to manage the access rights through them. 
   - If there are many queries running against the database, a better choice might be to replace Athena with a high-end database which is permanently available (such as Redshift). 
   - The transition would be rather seamless in my view. It is just a matter of adding another task to the workflow which copies the data to Redshift. 
   - This could be done either from the AWS EMR Cluster or from the AWS S3 data model. The Redshift COPY statement is very efficient and parquet is a supported file format.

### Folder Structure

##### Repository

```
etl-capstone
|-- dags
|	|-- config_variables.json
|	|-- dq-prod.py
|	|-- etl-prod.py
|	|-- job_flow_overrides.json
|	|-- spark_submit.py
|-- imgs
|-- plugins
|	|-- __init__.py
|	|-- operators
|		|-- __init__.py
|		|-- glue_crawler.py
|		|-- stage_s3.py
|-- scripts
|	|-- dev
|	|	|-- dq-dev.ipynb
|	|	|-- etl-dev.ipynb
|	|	|-- glue-crawler-dev.ipynb
|	|-- Capstone Project Template.ipynb
|	|-- countries.ipynb
|	|-- data-dict.csv
|	|-- i94-exploratory-analysis.ipynb
|	|-- temperature.ipynb
|	|-- transfer-files-to-S3.py
|	|-- visa-categories.ipynb
|-- staging (not available in git)
	|-- i94
	|	|-- i94_apr16_sub.sas7bdat
	|	|-- i94_aug16_sub.sas7bdat
	|	|-- xxx.sas7bdat
	|-- countries_mapping.csv
	|-- countries.csv
	|-- temperature.csv
	|-- us_states.csv
	|-- visa_categories.csv
```

##### AWS S3 Bucket

```
etl-capstone
|-- athena/
|-- model/
|	|-- countries.parquet/
|	|-- data_quality.parquet/
|	|-- dates.parquet/
|	|-- i94.parquet/
|	|	|--2016/
|	|		|--01/
|	|		|--02/
|	|-- mode.parquet/
|	|-- purpose.parquet/
|	|-- states.parquet/
|	|-- visa_categories.parquet/
|-- staging/
|	|-- countries_mapping.csv
|	|-- countries.csv
|	|-- i94
|	|	|-- i94_2016-01_sub.sas7bdat
|	|	|-- i94_2016-02_sub.sas7bdat
|	|	|-- xxx.sas7bdat
|	|-- temperature.csv
|	|-- us_states.csv
|	|-- visa_categories.csv
|-- scripts/
	|-- etl-prod.py
	|-- dq-prod.py
```
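Comparing the two listings, the local SAS files are named by month abbreviation and two-digit year (`i94_apr16_sub.sas7bdat`), while the S3 staging keys use `year-month` (`i94_2016-01_sub.sas7bdat`). A minimal sketch of such a renaming rule follows; whether **transfer-files-to-S3.py** performs the renaming this way is an assumption, the function name `staging_key` is hypothetical, and two-digit years are assumed to mean 20xx.

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Local naming scheme as shown in the repository listing, e.g. i94_apr16_sub.sas7bdat
LOCAL_PATTERN = re.compile(r"^i94_([a-z]{3})(\d{2})_sub\.sas7bdat$")


def staging_key(local_name):
    """Map a local SAS file name to the key layout shown for the S3 staging folder."""
    match = LOCAL_PATTERN.match(local_name)
    if match is None or match.group(1) not in MONTHS:
        raise ValueError(f"unexpected file name: {local_name}")
    month = MONTHS[match.group(1)]
    year = 2000 + int(match.group(2))  # assumption: two-digit years are 20xx
    return f"staging/i94/i94_{year}-{month:02d}_sub.sas7bdat"


print(staging_key("i94_apr16_sub.sas7bdat"))  # staging/i94/i94_2016-04_sub.sas7bdat
```

A sortable `year-month` key makes it straightforward for a monthly schedule to derive the file name from the execution date.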

### Installation

##### Prerequisites
- AWS Account
- AWS CLI 2 locally installed / credentials configured (accessed by docker-airflow)
- Docker
- python 3
- jupyter lab
- pandas
- pyspark
- pycountry
- beautiful soup

##### Steps to setup Pipeline

- create aws bucket
- setup roles for EMR cluster
- run local ipynb scripts
- transfer files to s3 (or set load_sas = True)
- start docker airflow
- import config_variables to airflow
- create aws glue crawler**
- run dag

** to make use of the parquet-schema, the model needs to be loaded up to step "Staging to Model"

- run docker airflow

```docker-compose -f docker-compose-LocalExecutor.yml up -d```

- manual execution of the ETL pipeline on the EMR cluster (ssh into cluster) for debugging

```/usr/bin/spark-submit --packages "saurfang:spark-sas7bdat:2.1.0-s_2.11" "s3://<bucket_name>/scripts/etl-prod.py" "<bucket_name>" "i94_2016-01_sub.sas7bdat"``` 

```/usr/bin/spark-submit "s3://<bucket_name>/scripts/dq-prod.py" "<bucket_name>" "2016" "1"```

### Resources

- https://github.com/puckel/docker-airflow
- https://programmaticponderings.com/2020/12/24/running-spark-jobs-on-amazon-emr-with-apache-airflow-using-the-new-amazon-managed-workflows-for-apache-airflow-amazon-mwaa-service-on-aws/
- https://www.startdataengineering.com/post/how-to-submit-spark-jobs-to-emr-cluster-from-airflow/
- https://stackoverflow.com/questions/52996591/wait-until-aws-glue-crawler-has-finished-running