# Project 08 - Analysis of U.S. Immigration (I-94) Data
### Udacity Data Engineer - Capstone Project
> by Peter Wissel | 2021-05-05

## Project Overview
This project works with a dataset on immigration to the United States. Supplementary datasets cover airport codes,
U.S. city demographics, and temperature data.

The following process is divided into five sub-steps to illustrate how to answer the questions set by the business
analytics team.

This file covers the following step:
* Step 5: Complete Project Write Up


### Step 5: Complete Project Write Up
* Outline of the steps taken in the project:
    * Definition of the project scope: Four project questions had to be answered.
    * The four given data sets from different areas were examined and aligned with the project questions. Based on these
      questions, the data model was built step by step.
    * Examination of the data provided important insights. Pandas was used to take a quick look at small data sets to
      gain these insights.
    * Transformation of the data within the ETL (Extract, Transform, Load) pipeline to build the star schema data model.
    * Automatic creation of a data dictionary. The only manual part was filling in the table and column descriptions.
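
The automated data dictionary step could look roughly like the sketch below. It is an illustration only: the
`build_data_dictionary` function, the `(table, column, dtype)` input shape, and the example table and column names are
hypothetical and not taken from the project code. The description cells are left empty so they can be filled in by hand.

```python
from typing import Iterable, List, Tuple

# (table name, column name, data type); this shape is an assumption for the sketch.
ColumnSpec = Tuple[str, str, str]


def build_data_dictionary(columns: Iterable[ColumnSpec]) -> str:
    """Render column metadata as a Markdown table with blank descriptions."""
    lines: List[str] = [
        "| Table | Column | Type | Description |",
        "| --- | --- | --- | --- |",
    ]
    for table, column, dtype in columns:
        # The description cell stays empty: it is the manual part.
        lines.append(f"| {table} | {column} | {dtype} |  |")
    return "\n".join(lines)


# Hypothetical entries; with PySpark, (column, type) pairs could come from df.dtypes.
example_columns = [
    ("f_i94_immigrations", "arrival_date", "date"),
    ("d_airports", "iata_code", "string"),
]
print(build_data_dictionary(example_columns))
```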


* The purpose of the final data model is made explicit.
    * At the beginning there were 4 project questions that had to be answered. Based on these questions, the data model
      was built step by step to the final star data model.

* Clearly state the rationale for the choice of tools and technologies for the project.
* Used technologies and tools:
    * This project uses Python, Pandas, Jupyter Notebook and Apache Spark (PySpark) in local mode to process 2016 U.S. immigration data.
      There are four project questions to answer, and the ETL pipeline is aligned with them throughout, so the data
      model evolves piece by piece to the final version. These tools were selected because they are well suited to
      data analysis and preparation. If the requirements grow and the amount of data increases, a switch to cloud
      technologies, e.g. based on AWS, is possible at any time. However, this is outside the scope of this project.

* The write-up describes a logical approach to this project under the following scenarios:
* Propose how often the data should be updated and why.
    * The ETL process should run on a monthly basis, because the SAS source data is only provided monthly.
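
Because the source arrives once per month, each run only has to pick up one file. A minimal sketch of that selection
follows. It assumes the monthly files follow a naming pattern like `i94_apr16_sub.sas7bdat`; the pattern, the
`i94_file_for` helper, and the `data/i94` base directory are assumptions and are not confirmed by this write-up.

```python
from datetime import date
from pathlib import PurePosixPath

# Explicit list instead of strftime("%b") so the result does not depend on the locale.
MONTH_ABBREVIATIONS = [
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
]


def i94_file_for(run_month: date, base_dir: str = "data/i94") -> str:
    """Return the (assumed) path of the I-94 SAS file for the given month."""
    month = MONTH_ABBREVIATIONS[run_month.month - 1]
    year = run_month.strftime("%y")  # two-digit year, e.g. "16"
    return str(PurePosixPath(base_dir) / f"i94_{month}{year}_sub.sas7bdat")


print(i94_file_for(date(2016, 4, 1)))  # data/i94/i94_apr16_sub.sas7bdat
```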

* Write a description of how you would approach the problem differently under the following scenarios:
    * The data was increased by 100x.
        * Source data should be stored in cloud storage such as AWS S3.
        * To process all data in parallel, use clustered Spark nodes (AWS EMR).
        * The calculated data can be stored as a star schema in a cloud-based data warehouse (DWH) such as
          AWS Redshift. Alternatively, the star schema can be stored as Parquet files in S3 cloud storage for
          further analysis.
        
    * The data populates a dashboard that must be updated on a daily basis by 7am every day.
        * The I94 source data should be loaded daily. This reduces the amount of data per run. Note that not every
          project dataset (e.g. US Cities Demographics or Airport Codes) needs to be loaded daily.
        * Apache Airflow could be used to schedule the daily data load.

    * The database needed to be accessed by 100+ people.
        * Output data should be stored in a cloud DWH such as AWS Redshift so that it is always available. In addition,
          the star schema data can be made available to users through self-service BI, using tools such as QlikSense
          or similar.
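
For the larger-volume and daily-load scenarios, one option is to write the fact table as partitioned Parquet, so that
each run only adds the partitions it has just processed. The sketch below is a rough illustration under stated
assumptions: `table_path`, `write_partitioned`, the `star_schema` prefix, and the partition columns `arrival_year` and
`arrival_month` are hypothetical names, and `df` is expected to be a PySpark DataFrame. The `s3a://` scheme is also an
assumption; on AWS EMR, `s3://` is typically used instead.

```python
from typing import Sequence


def table_path(bucket: str, table: str) -> str:
    """Build the (assumed) S3 location of one star schema table."""
    return f"s3a://{bucket}/star_schema/{table}"


def write_partitioned(
    df,
    path: str,
    partition_cols: Sequence[str] = ("arrival_year", "arrival_month"),
) -> None:
    """Append a PySpark DataFrame to a partitioned Parquet dataset.

    Uses DataFrameWriter.mode, partitionBy and parquet. "append" adds new
    files without touching existing partitions, so re-running the same
    period would duplicate rows unless that partition is removed first.
    """
    df.write.mode("append").partitionBy(*partition_cols).parquet(path)
```

A call such as `write_partitioned(fact_df, table_path("my-bucket", "f_i94_immigrations"))` would then be one task in the
scheduled load; `fact_df`, the bucket, and the table name are placeholders.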

## Summary
Project-Capstone provides tools to automatically process, clean, and analyze U.S. I94 immigration data in a flexible way
and helps to answer questions like the four project questions.

--------------------------
#### Hint: Call the script on a cluster with the given package:

        !spark-submit --packages saurfang:spark-sas7bdat:2.1.0-s_2.11 script.py
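
Inside `script.py`, the package loaded via `--packages` provides a Spark data source for SAS files. A minimal sketch of
how the script might use it, assuming the data source name given in the spark-sas7bdat documentation
(`com.github.saurfang.sas.spark`); the `read_i94` helper is hypothetical and not taken from the project code:

```python
# Data source name as documented by the spark-sas7bdat package.
SAS_FORMAT = "com.github.saurfang.sas.spark"


def read_i94(spark, path: str):
    """Load one I-94 SAS file into a Spark DataFrame.

    `spark` is expected to be an active SparkSession that was started
    with the spark-sas7bdat package on its classpath.
    """
    return spark.read.format(SAS_FORMAT).load(path)
```

In the script, `spark` would typically come from `SparkSession.builder.getOrCreate()`, and `path` would point to one of
the monthly `.sas7bdat` files.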