# Project 08 - Analysis of U.S. Immigration (I-94) Data
### Udacity Data Engineer - Capstone Project
> by Peter Wissel | 2021-04-03

## Project Overview
This project analyzes a dataset on immigration to the United States. Supplementary datasets cover airport codes,
U.S. city demographics and temperature data.

The following process is divided into five sub-steps to illustrate how to answer the questions set by the business
analytics team.

This project file covers the following steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data



### Step 1: Scope the Project and Gather Data

##### Scope of the Project
Based on the given dataset, the business analytics team poses the following four project questions (PQ), which this
project needs to answer. The data pipeline and the star data model are fully aligned with these questions.

1. From which country do immigrants come to the U.S. and how many?
2. At what airports do foreign persons arrive for immigration to the U.S.?
3. At what times do foreign persons arrive for immigration to the U.S.?
4. To which states in the U.S. do immigrants want to continue their travel after their initial arrival and what
   demographics can immigrants expect when they arrive in the destination state, such as average temperature, population
   numbers or population density?


##### Gather Data
The project works primarily with a dataset based on immigration data (I94) to the United States.

- Gathering Data (given data sets):
    1. [Immigration data '18-83510-I94-Data-2016' to the U.S.](https://travel.trade.gov/research/programs/i94/description.asp)
    2. [airport-codes_csv.csv: Airports around the world](https://datahub.io/core/airport-codes#data)
    3. [us-cities-demographics.csv: US cities and their demographic information](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/)
    4. [GlobalLandTemperaturesByCity.csv: Temperature grouped by City and Country](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)

### Step 2: Explore and Assess the Data
This step explores the given data to gain first insights.

#### Summary for Immigration data `18-83510-I94-Data-2016` to the U.S.:
* **Source**: [Visitor Arrivals Program (I-94 Form)](https://travel.trade.gov/research/programs/i94/description.asp)
* **Description**: [I94_SAS_Labels_Descriptions.SAS](../P8_capstone_resource_files/I94_SAS_Labels_Descriptions.SAS) file
contains descriptions for the I94 data
* **Data**: Monthly datasets for the year 2016
* **Format**: SAS (SAS7BDAT - e.g. `i94_apr16_sub.sas7bdat`)
* **Rows**: Roughly 3 million rows per monthly file (the February file read below contains 2,570,543 rows); about 40 million rows in total.
* **Data description**: Data has 29 columns containing information about event date, arriving person, airport, airline, etc.
![I94-immigration-data example](../P8_capstone_documentation/10_P8_immigration_data_sample.png)
NOTE: The data must be purchased. The year 2016 is included and available for the Udacity DEND course.

##### Immigration data '18-83510-I94-Data-2016' to the U.S.
   The descriptions for the listed columns were taken from file [I94_SAS_Labels_Descriptions.SAS](../P8_capstone_resource_files/I94_SAS_Labels_Descriptions.SAS).

    - **i94yr:** 4 digit year
    - **i94mon:** numeric month
    - **i94cit + i94res:** Country where the immigrants come from - `Country code, country name`
    Look at file [I94_SAS_Labels_I94CIT_I94RES.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94CIT_I94RES.txt) for more details.

            438 =  'AUSTRALIA'
            112 =  'GERMANY'
    ! Note that the I94 country codes are different from the ISO country numbers.

   - **i94port:** arrival airport - `Airport code, Airport city, State of Airport`. Note that the airport code is **not** the same as the [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) code.
     [IATA-Code Search Engine](https://www.iatacodes.de/)

   The port codes in the I-94 table do not follow the IATA standard. For example, `SFR` is used for San
   Francisco Airport rather than the more common `SFO` designation.

            'SFR'	=	'SAN FRANCISCO, CA     '
            'LOS'	=	'LOS ANGELES, CA       '
            'NYC'	=	'NEW YORK, NY          '

    Look at file [I94_SAS_Labels_I94PORT.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt) for more details.

   - **arrdate:** Arrival date in the U.S. (SAS Date format)

            SAS: Start Date is 01.01.1960 (SAS - Days since 1/1/1960: 0)
            Example:
            01.01.1960: (SAS: Days since 1/1/1960: 0)
            01.01.1970: (SAS: Days since 1/1/1960: 3653)

        Take a look at [Free SAS Date Calculator](https://www.sastipsbyhal.com)
       

    - **i94mode:** Mode of arrival in the U.S.
    Look at file [I94_SAS_Labels_I94MODE.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94MODE.txt) for more details.

            1 = 'Air'
            2 = 'Sea'
            3 = 'Land'
            9 = 'Not reported'

    - **i94addr:** U.S. state to which the immigrants want to travel.
      Look at file [I94_SAS_Labels_I94ADDR.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt) for more details.

            'AL'='ALABAMA'
            'IN'='INDIANA'

    - **depdate:** Departure date from USA (SAS Date format) -> look at `arrdate` for calculation

    - **i94bir:** Age of respondent in years
    - **i94visa:** Visa codes collapsed into three categories:
      Look at file [I94_SAS_Labels_I94VISA.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94VISA.txt) for more details.

            1 = Business
            2 = Pleasure
            3 = Student

    - **count:** value is for summary statistics
    - **dtadfile:** Date added to I-94 Files - Character date field as YYYYMMDD (represents `arrdate`)
    - **visapost:** Department of state where Visa was issued
    - **occup:** Occupation that will be performed in U.S.
    - **entdepa:** Arrival Flag - admitted or paroled into the U.S.
    - **entdepd:** Departure Flag - Departed, lost I-94 or is deceased
    - **entdepu:** Update Flag - Either apprehended, overstayed, adjusted to perm residence
    - **matflag:** Match flag - Match of arrival and departure records
    - **biryear:** 4 digit year of birth
    - **dtaddto:** Date to which admitted to U.S. (allowed to stay until) - Character date field as MMDDYYYY (represents `depdate`)
    - **gender:** Gender - Non-immigrant sex
    - **insnum:** INS number
    - **airline:** Airline used to arrive in U.S.
    - **admnum:** Admission Number
    - **fltno:** Flight number of Airline used to arrive in U.S.
    - **visatype:** Class of admission legally admitting the non-immigrant to temporarily stay in U.S.
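
The SAS date convention described for `arrdate` and `depdate` can be illustrated with a small sketch. The helper name `sas_days_to_date` is an assumption made for this example and is not taken from the project code; it relies only on the epoch 1960-01-01 stated above and treats missing values as `None`.

```python
from datetime import date, timedelta
from typing import Optional

# SAS counts days from this epoch (day 0), as described for `arrdate`.
SAS_EPOCH = date(1960, 1, 1)


def sas_days_to_date(sas_days: Optional[float]) -> Optional[date]:
    """Convert a SAS day count into a calendar date; missing values stay None."""
    # NaN is the only float that is not equal to itself.
    if sas_days is None or sas_days != sas_days:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


print(sas_days_to_date(0))        # 1960-01-01
print(sas_days_to_date(3653))     # 1970-01-01
print(sas_days_to_date(20498.0))  # 2016-02-14
```

The last call uses an `arrdate` value of the kind found in the February 2016 file, which maps to a date in February 2016 as expected.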


##### Imports and Installs section

In [1]:
import shutil
import pandas as pd
import pyspark.sql.functions as F
# import spark as spark
from pyspark.sql.types import StructType, StructField, DoubleType, StringType, IntegerType, LongType, TimestampType, DateType
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, DataFrameNaFunctions
from pyspark.sql.functions import when, count, col, to_date, datediff, date_format, month
import re
import json
from os import path

##### Create Pandas and SparkSession to create data frames from source data

In [2]:
# If code will be executed in Udacity workbench --> use the following config(...)
#spark = SparkSession.builder.config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11").enableHiveSupport().getOrCreate()

# The version number for "saurfang:spark-sas7bdat" had to be updated for the local installation
MAX_MEMORY = "5g"

# Note: appName is set only once; a second .appName(...) call would override the first one.
spark = SparkSession \
    .builder \
    .appName("etl pipeline for project 8 - I94 data") \
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:3.0.0-s_2.12") \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .config("spark.executor.memory", MAX_MEMORY) \
    .config("spark.driver.memory", MAX_MEMORY) \
    .enableHiveSupport() \
    .getOrCreate()

# setting the current LOG-Level
spark.sparkContext.setLogLevel('ERROR')


In [3]:
# Read data from Immigration data '18-83510-I94-Data-2016' to the U.S.
filepath = '../P8_capstone_resource_files/immigration_data/18-83510-I94-Data-2016/i94_feb16_sub.sas7bdat'
df_pd_i94 = pd.read_sas(filepath, format=None, index=None, encoding=None, chunksize=None, iterator=False)

In [4]:
# Show data (1st 5 rows)
df_pd_i94.head()

| | cicid | i94yr | i94mon | i94cit | i94res | i94port | arrdate | i94mode | i94addr | depdate | ... | entdepu | matflag | biryear | dtaddto | gender | insnum | airline | admnum | fltno | visatype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2016.0 | 2.0 | 101.0 | 101.0 | b'ATL' | 20498.0 | 1.0 | b'MI' | | ... | | | 1995.0 | b'D/S' | b'F' | | b'DL' | 491319785.0 | b'241' | b'F1' |
| 1 | 5.0 | 2016.0 | 2.0 | 101.0 | 101.0 | b'CHI' | 20492.0 | 1.0 | b'IL' | | ... | | | 1961.0 | b'08072016' | b'F' | | b'TK' | 470581085.0 | b'5' | b'B2' |
| 2 | 6.0 | 2016.0 | 2.0 | 101.0 | 101.0 | b'CHI' | 20492.0 | 1.0 | b'IL' | | ... | | | 2010.0 | b'08072016' | b'M' | | b'TK' | 470572885.0 | b'5' | b'B2' |
| 3 | 7.0 | 2016.0 | 2.0 | 101.0 | 101.0 | b'CHI' | 20500.0 | 1.0 | b'AZ' | 20527.0 | ... | | b'M' | 1978.0 | b'08152016' | b'F' | | b'LH' | 497400985.0 | b'434' | b'B2' |
| 4 | 8.0 | 2016.0 | 2.0 | 101.0 | 101.0 | b'CHI' | 20503.0 | 1.0 | b'IL' | 20518.0 | ... | | b'M' | 1979.0 | b'08182016' | b'M' | | b'AA' | 507772085.0 | b'87' | b'B2' |


In [5]:
# Show data (last 5 rows)
df_pd_i94.tail()

| | cicid | i94yr | i94mon | i94cit | i94res | i94port | arrdate | i94mode | i94addr | depdate | ... | entdepu | matflag | biryear | dtaddto | gender | insnum | airline | admnum | fltno | visatype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2570538 | 4053984.0 | 2016.0 | 2.0 | 745.0 | 745.0 | b'SYS' | 20506.0 | 3.0 | b'CA' | 20508.0 | ... | | b'M' | 1957.0 | b'07242016' | | | | 86428530000.0 | b'00066' | b'B2' |
| 2570539 | 1863071.0 | 2016.0 | 2.0 | 745.0 | 745.0 | b'SYS' | 20494.0 | 3.0 | b'CA' | | ... | | | 1975.0 | b'08092016' | b'F' | | | 87855340000.0 | b'LAND' | b'B2' |
| 2570540 | 5313112.0 | 2016.0 | 2.0 | 745.0 | 745.0 | b'SYS' | 20513.0 | 3.0 | b'NV' | 20524.0 | ... | | b'M' | 1989.0 | b'D/S' | b'F' | | | 83326890000.0 | b'00088' | b'F1' |
| 2570541 | 2834382.0 | 2016.0 | 2.0 | 745.0 | 745.0 | b'THO' | 20499.0 | 3.0 | b'MD' | 20501.0 | ... | | b'M' | 1983.0 | b'08142016' | b'M' | | | 88282840000.0 | b'LAND' | b'B1' |
| 2570542 | 4387479.0 | 2016.0 | 2.0 | 135.0 | 749.0 | b'PBB' | 20509.0 | 3.0 | | 20510.0 | ... | | b'M' | 1967.0 | b'05192016' | b'F' | | | 53205730000.0 | b'00878' | b'WT' |
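
The sample rows show that `dtaddto` arrives as a byte string, holding either an `MMDDYYYY` date or the marker `D/S`. A minimal sketch for parsing such values could look like the following. The function name `parse_dtaddto` is hypothetical, and mapping `D/S` (as well as any other value that is not a valid date) to `None` is an assumption of this example, not a rule taken from the I-94 documentation.

```python
from datetime import date, datetime
from typing import Optional, Union


def parse_dtaddto(raw: Union[bytes, str, None]) -> Optional[date]:
    """Parse an MMDDYYYY value; return None when no fixed date is given."""
    if raw is None:
        return None
    # pandas.read_sas returns byte strings unless an encoding is passed.
    text = raw.decode("utf-8", errors="replace") if isinstance(raw, bytes) else str(raw)
    try:
        return datetime.strptime(text.strip(), "%m%d%Y").date()
    except ValueError:
        # Covers markers such as 'D/S' and malformed entries.
        return None


print(parse_dtaddto(b"08072016"))  # 2016-08-07
print(parse_dtaddto(b"D/S"))       # None
```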


In [6]:
# Get an overview about filled fields (not null)
df_pd_i94.count()

cicid       2570543
i94yr       2570543
i94mon      2570543
i94cit      2570543
i94res      2570543
i94port     2570543
arrdate     2570543
i94mode     2570521
i94addr     2421292
depdate     2276656
i94bir      2569642
i94visa     2570543
count       2570543
dtadfile    2529996
visapost    1032575
occup          7921
entdepa     2570521
entdepd     2277645
entdepu          63
matflag     2277645
biryear     2569642
dtaddto     2570157
gender      2274862
insnum       136041
airline     2500643
admnum      2570543
fltno       2562240
visatype    2570543
dtype: int64
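
The non-null counts above are easier to compare when expressed as the share of missing values per column. The sketch below reuses a few of the counts printed above; the names `TOTAL_ROWS`, `NON_NULL_COUNTS` and `missing_share` are introduced only for this illustration.

```python
# Row count of the February 2016 file and selected non-null counts,
# copied from the output of df_pd_i94.count() above.
TOTAL_ROWS = 2570543
NON_NULL_COUNTS = {
    "i94addr": 2421292,
    "depdate": 2276656,
    "gender": 2274862,
    "visapost": 1032575,
    "insnum": 136041,
    "occup": 7921,
    "entdepu": 63,
}


def missing_share(non_null: int, total: int) -> float:
    """Fraction of rows in which a column has no value."""
    return 1.0 - non_null / total


# Print the sparsest columns first.
for column, non_null in sorted(NON_NULL_COUNTS.items(), key=lambda item: item[1]):
    print(f"{column:<9} {missing_share(non_null, TOTAL_ROWS):6.1%} missing")
```

Columns such as `occup`, `entdepu` and `insnum` are almost entirely empty in this file, which is worth keeping in mind when choosing columns for the data model.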

#### Summary for Airport Codes [`airport-codes_csv.csv`](../P8_capstone_resource_files/airport-codes_csv.csv):
* **Source**: [datahub.io - Airport codes](https://datahub.io/core/airport-codes#data)
* **Description**: Airport codes from around the world contain codes that may refer to either IATA airport code, a
  three-letter code which is used in passenger reservation, ticketing and baggage-handling systems, or the ICAO airport
  code which is a four letter code used by ATC systems and for airports that do not have an IATA airport code.
* **Data**: Large file, containing information about all airports from [this site](https://ourairports.com/data/)
* **Format**: CSV File - Comma separated text file format
* **Rows**: over 55k
* **Data description**: Detailed information about each listed airport is displayed in 12 columns.
  ![08_P8_airport-codes_csv.png](../P8_capstone_documentation/08_P8_airport-codes_csv.png)


##### Read data from file Airport Codes: `airport-codes_csv.csv`

In [7]:
filepath = '../P8_capstone_resource_files/airport-codes_csv.csv'
df_pd_airport = pd.read_csv(filepath)

In [8]:
# Show data (1st 5 rows)
df_pd_airport.head()

| | ident | type | name | elevation_ft | continent | iso_country | iso_region | municipality | gps_code | iata_code | local_code | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00A | heliport | Total Rf Heliport | 11.0 | | US | US-PA | Bensalem | 00A | | 00A | -74.93360137939453, 40.07080078125 |
| 1 | 00AA | small_airport | Aero B Ranch Airport | 3435.0 | | US | US-KS | Leoti | 00AA | | 00AA | -101.473911, 38.704022 |
| 2 | 00AK | small_airport | Lowell Field | 450.0 | | US | US-AK | Anchor Point | 00AK | | 00AK | -151.695999146, 59.94919968 |
| 3 | 00AL | small_airport | Epps Airpark | 820.0 | | US | US-AL | Harvest | 00AL | | 00AL | -86.77030181884766, 34.86479949951172 |
| 4 | 00AR | closed | Newport Hospital & Clinic Heliport | 237.0 | | US | US-AR | Newport | | | | -91.254898, 35.6087 |


In [9]:
# Show data (last 5 rows)
df_pd_airport.tail()

| | ident | type | name | elevation_ft | continent | iso_country | iso_region | municipality | gps_code | iata_code | local_code | coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 55070 | ZYYK | medium_airport | Yingkou Lanqi Airport | 0.0 | AS | CN | CN-21 | Yingkou | ZYYK | YKH | | 122.3586, 40.542524 |
| 55071 | ZYYY | medium_airport | Shenyang Dongta Airport | | AS | CN | CN-21 | Shenyang | ZYYY | | | 123.49600219726562, 41.784400939941406 |
| 55072 | ZZ-0001 | heliport | Sealand Helipad | 40.0 | EU | GB | GB-ENG | Sealand | | | | 1.4825, 51.894444 |
| 55073 | ZZ-0002 | small_airport | Glorioso Islands Airstrip | 11.0 | AF | TF | TF-U-A | Grande Glorieuse | | | | 47.296388888900005, -11.584277777799999 |
| 55074 | ZZZZ | small_airport | Satsuma Iōjima Airport | 338.0 | AS | JP | JP-46 | Mishima-Mura | RJX7 | | | 130.270556, 30.784722 |


In [10]:
# Get an overview about filled fields
df_pd_airport.count()

ident           55075
type            55075
name            55075
elevation_ft    48069
continent       27356
iso_country     54828
iso_region      55075
municipality    49399
gps_code        41030
iata_code        9189
local_code      28686
coordinates     55075
dtype: int64
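
The `coordinates` column stores two numbers in a single string. Judging from the sample rows (for example Bensalem, US-PA with `-74.93..., 40.07...`), the order appears to be longitude first and latitude second; this ordering is inferred from the samples and should be verified against the data source. A small sketch for splitting the value, with the hypothetical names `Coordinates` and `split_coordinates`:

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    longitude: float
    latitude: float


def split_coordinates(raw: str) -> Coordinates:
    """Split a 'longitude, latitude' string into two floats."""
    lon_text, lat_text = raw.split(",")
    # float() tolerates the blank that follows the comma.
    return Coordinates(float(lon_text), float(lat_text))


print(split_coordinates("-74.93360137939453, 40.07080078125"))
```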

#### Summary for US Cities: Demographics [`us-cities-demographics.json`](../P8_capstone_resource_files/us-cities-demographics.json):
* **Source:** [US Cities: Demographics ](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/information/)
* **Description:** This dataset contains information about the demographics of all US cities and census-designated places
  with a population greater than or equal to 65,000. This data comes from the [US Census Bureau's 2015 American Community Survey](https://www.census.gov/en.html).
* **Data:** Structured data about City, State, Age, Population, etc.
* **Format:** JSON File - Structured data
* **Rows:** About 2.9k (2,891 records)
* **Data description:** 12 columns describing facts from cities across the U.S. about demographics.
  ![12_P8_us-cities-demographics.png](../P8_capstone_documentation/12_P8_us-cities-demographics.png)


##### Read data from file US Cities: Demographics: `us-cities-demographics.json`

In [11]:
filepath = '../P8_capstone_resource_files/us-cities-demographics.json'
df_pd_us_cities = pd.read_json(filepath, orient='columns')

In [12]:
# Show data (1st 5 rows)
df_pd_us_cities.head()

| | datasetid | recordid | fields | record_timestamp |
|---|---|---|---|---|
| 0 | us-cities-demographics | 0074451cff52969855654d21497e9459f1108d8d | {'count': 8791, 'city': 'Wichita', 'number_of_... | 1970-01-01T01:00:00+01:00 |
| 1 | us-cities-demographics | 54b201cac9c7523363eb0cfeadc352a04fe016af | {'count': 22304, 'city': 'Allen', 'number_of_v... | 1970-01-01T01:00:00+01:00 |
| 2 | us-cities-demographics | 9dc3d4a59d7e3e2ad31ec5a6d3bab5fac67ee462 | {'count': 8454, 'city': 'Danbury', 'number_of_... | 1970-01-01T01:00:00+01:00 |
| 3 | us-cities-demographics | 630ac8078919e7c8c96b861a336c66af27ffcc88 | {'count': 67526, 'city': 'Nashville', 'number_... | 1970-01-01T01:00:00+01:00 |
| 4 | us-cities-demographics | ae093b0dc0b8b9116176092b731533f5b008c75b | {'count': 11013, 'city': 'Stamford', 'number_o... | 1970-01-01T01:00:00+01:00 |


In [13]:
# Show data (last 5 rows)
df_pd_us_cities.tail()

| | datasetid | recordid | fields | record_timestamp |
|---|---|---|---|---|
| 2886 | us-cities-demographics | 2cd192d37e3cf2e922b5c993e083f13a9dbab57e | {'count': 624, 'city': 'Caguas', 'male_populat... | 1970-01-01T01:00:00+01:00 |
| 2887 | us-cities-demographics | bd94fa78923358abf4b5f10da11460e58d635415 | {'count': 406, 'city': 'West Palm Beach', 'num... | 1970-01-01T01:00:00+01:00 |
| 2888 | us-cities-demographics | 1582817d1ec5cca79f347c3c69b8efc32a88f242 | {'count': 3434, 'city': 'Clovis', 'number_of_v... | 1970-01-01T01:00:00+01:00 |
| 2889 | us-cities-demographics | 61b94f1482025536ad191db2fb3b76e46df798ed | {'count': 80975, 'city': 'Beaverton', 'number_... | 1970-01-01T01:00:00+01:00 |
| 2890 | us-cities-demographics | 75ea951cf36060d95b87878da7d8edfe15f79a6f | {'count': 18601, 'city': 'Louisville/Jefferson... | 1970-01-01T01:00:00+01:00 |


In [14]:
# Get an overview about filled fields
df_pd_us_cities.count()

datasetid           2891
recordid            2891
fields              2891
record_timestamp    2891
dtype: int64
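
The output above shows that the actual demographic attributes are nested inside the `fields` column, so the frame has only four top-level columns. One possible way to flatten such records is sketched below with the standard library. The helper `flatten_record` is hypothetical, and the sample records contain only the keys visible in the truncated output above (`count` and `city`); the remaining keys are deliberately left out rather than guessed.

```python
import json

# Two shortened sample records; only keys visible in the output above are used.
RAW_JSON = """
[
  {"datasetid": "us-cities-demographics",
   "recordid": "0074451cff52969855654d21497e9459f1108d8d",
   "fields": {"count": 8791, "city": "Wichita"}},
  {"datasetid": "us-cities-demographics",
   "recordid": "54b201cac9c7523363eb0cfeadc352a04fe016af",
   "fields": {"count": 22304, "city": "Allen"}}
]
"""


def flatten_record(record: dict) -> dict:
    """Lift the keys of the nested 'fields' dict to the top level."""
    flat = {key: value for key, value in record.items() if key != "fields"}
    flat.update(record.get("fields", {}))
    return flat


rows = [flatten_record(record) for record in json.loads(RAW_JSON)]
for row in rows:
    print(row)
```

A list of flat dictionaries like `rows` can then be passed to `pd.DataFrame(rows)` to obtain one column per attribute.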

#### Summary for World Temperature Data [`GlobalLandTemperaturesByCity.csv`](../P8_capstone_resource_files/GlobalLandTemperaturesByCity.csv):
* **Source:** [World Temperature Data: Temperature grouped by City and Country](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data)
* **Description:** Climate Change: Earth Surface Temperature Data. Global temperatures since 1750.
* **Data:**  Structured data about Average Temperature, City, Country, Location (Latitude and Longitude)
* **Format:** CSV File - Comma separated text file format
* **Rows:** About 8.6 million entries (8,599,212)
* **Data description:** Temperature record as time series information since 1750.
  ![09_P8_GlobalLandTemperaturesByCity.png](../P8_capstone_documentation/09_P8_GlobalLandTemperaturesByCity.png)
* **Note:** Temperature data must be formatted correctly

##### Read data from World Temperature Data where Temperature is grouped by City and Country: `GlobalLandTemperaturesByCity.csv`

In [15]:
filepath = '../P8_capstone_resource_files/GlobalLandTemperaturesByCity.csv'
df_pd_temperature = pd.read_csv(filepath)

In [16]:
# Show data (1st 5 rows)
df_pd_temperature.head()

| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 1743-11-01 | 6.068 | 1.737 | Århus | Denmark | 57.05N | 10.33E |
| 1 | 1743-12-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 2 | 1744-01-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 3 | 1744-02-01 | | | Århus | Denmark | 57.05N | 10.33E |
| 4 | 1744-03-01 | | | Århus | Denmark | 57.05N | 10.33E |


In [17]:
# Show data (last 5 rows)
df_pd_temperature.tail()

| | dt | AverageTemperature | AverageTemperatureUncertainty | City | Country | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 8599207 | 2013-05-01 | 11.464 | 0.236 | Zwolle | Netherlands | 52.24N | 5.26E |
| 8599208 | 2013-06-01 | 15.043 | 0.261 | Zwolle | Netherlands | 52.24N | 5.26E |
| 8599209 | 2013-07-01 | 18.775 | 0.193 | Zwolle | Netherlands | 52.24N | 5.26E |
| 8599210 | 2013-08-01 | 18.025 | 0.298 | Zwolle | Netherlands | 52.24N | 5.26E |
| 8599211 | 2013-09-01 | | | Zwolle | Netherlands | 52.24N | 5.26E |


In [18]:
# Get an overview about filled fields
df_pd_temperature.count()


dt                               8599212
AverageTemperature               8235082
AverageTemperatureUncertainty    8235082
City                             8599212
Country                          8599212
Latitude                         8599212
Longitude                        8599212
dtype: int64
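
The `Latitude` and `Longitude` columns are stored as text with a hemisphere suffix (for example `57.05N`, `10.33E`), which is one aspect of the formatting work mentioned in the note above. A minimal sketch for turning them into signed decimal degrees follows. The function name `parse_coordinate` is hypothetical, and the sign convention (south and west negative) is the usual one but is still an assumption with respect to this project.

```python
def parse_coordinate(text: str) -> float:
    """Convert a value such as '57.05N' or '5.26W' into signed decimal degrees."""
    text = text.strip()
    value = float(text[:-1])
    hemisphere = text[-1].upper()
    if hemisphere in ("N", "E"):
        return value
    if hemisphere in ("S", "W"):
        return -value
    raise ValueError(f"Unknown hemisphere suffix in {text!r}")


print(parse_coordinate("57.05N"))
print(parse_coordinate("10.33E"))
print(parse_coordinate("5.26W"))
```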

#### Findings from Immigration data `18-83510-I94-Data-2016` to the U.S.:

###### 1. `df_spark_i94.i94cit`:
- The country code does not match the ISO 3166 country code needed for further analysis
- Null values in column `i94cit`

###### 2. `df_spark_i94.i94port`:
- The airport codes in `i94port`, listed in file [I94_SAS_Labels_I94PORT.txt](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94PORT.txt),
do not correspond to the three-letter [IATA](https://en.wikipedia.org/wiki/International_Air_Transport_Association) airport codes.
**Project decision**: Only the given I-94 airport codes are used.
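
Since the project relies on the I-94 port codes, the label lines from `I94_SAS_Labels_I94PORT.txt` have to be split into code, city and state. The following sketch shows one way to do this for lines of the form shown earlier (`'SFR' = 'SAN FRANCISCO, CA     '`). The regular expression `PORT_LINE` and the function `parse_port_line` are assumptions of this example, and entries without a comma are returned with an empty state.

```python
import re
from typing import Optional, Tuple

# Matches lines such as:  'SFR'	=	'SAN FRANCISCO, CA     '
PORT_LINE = re.compile(r"^\s*'(?P<code>[^']+)'\s*=\s*'(?P<label>[^']*)'")


def parse_port_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (port code, city, state) or None if the line is not a label line."""
    match = PORT_LINE.match(line)
    if match is None:
        return None
    code = match.group("code").strip()
    label = match.group("label").strip()
    # Split at the last comma so that city names containing commas survive.
    city, _, state = label.rpartition(",")
    if not city:
        return code, label, ""
    return code, city.strip(), state.strip()


print(parse_port_line("'SFR'\t=\t'SAN FRANCISCO, CA     '"))
print(parse_port_line("'NYC'\t=\t'NEW YORK, NY          '"))
```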

###### 3. `df_spark_i94.arrdate` / `df_spark_i94.depdate`:
- `arrdate` and `depdate` are in SAS date format (String), whose epoch starts on 1960-01-01. These date values will be converted into a date type.

###### 4. `df_spark_i94.i94addr`:
- Null values in column `i94addr`

###### 5. [I94_SAS_Labels_I94ADDR.txt.I94ADDR](../P8_capstone_resource_files/I94_sas_labels_descriptions_extracted_data/I94_SAS_Labels_I94ADDR.txt):
- The `I94ADDR` state description contains a spelling error: 'WI'='WISCONS**O**N' instead of 'WI'='WISCONS**I**N'. **Project decision:**
This is the only incorrect U.S. state name, and it will be corrected manually.
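
The manual correction can be applied while the state labels are parsed. The sketch below reads lines of the form `'AL'='ALABAMA'` and replaces the misspelled name through a small lookup table. The names `STATE_LINE`, `CORRECTIONS` and `parse_state_labels` are hypothetical and only illustrate the idea; they are not taken from the project code.

```python
import re
from typing import Dict, Iterable

# Matches label lines such as 'AL'='ALABAMA'.
STATE_LINE = re.compile(r"'(?P<code>[A-Z0-9]{2})'\s*=\s*'(?P<name>[^']+)'")

# Known spelling errors in the label file and their corrections.
CORRECTIONS = {"WISCONSON": "WISCONSIN"}


def parse_state_labels(lines: Iterable[str]) -> Dict[str, str]:
    """Build a mapping from state code to (corrected) state name."""
    states = {}
    for line in lines:
        match = STATE_LINE.search(line)
        if match is None:
            continue
        name = match.group("name").strip()
        states[match.group("code")] = CORRECTIONS.get(name, name)
    return states


sample_lines = ["'AL'='ALABAMA'", "'IN'='INDIANA'", "'WI'='WISCONSON'"]
states = parse_state_labels(sample_lines)
print(states)
```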
