# Project i94
##### (Step1_Exploration.ipynb)
**Note:** This notebook includes the following work steps:
* Step 1: Defining the project and gathering data
* Step 2: Exploring & assessing datasets

---
## Project Scope & Data Sources
- This project aims to build a data warehouse with US immigration and demographic data.

- The end goal is to create analytical tables for organizations that are affected by the influx of US visitors and immigrants. For example:
    - Tour agencies in Hawaii may benefit from knowing, "Which countries should we target with travel deals and promotions?"
    - Tech companies may want to know, "For which countries should we prepare visa sponsorships for summer internships?"
    - Hotel managers may benefit from understanding, "In which months of the year should we consider increasing staff size?"
    - etc.
    
    
- Data Sources:
    - I94 Immigration Data from the US National Tourism and Trade Office
    - U.S. City Demographic Data from OpenDataSoft



---

##### Importing libraries

In [23]:
import pandas as pd
import numpy as np
import os
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import IntegerType, FloatType

from utility_functions import count_nulls

##### Loading data and initial exploration

In [2]:
# Start spark session
spark = (SparkSession 
            .builder 
            .config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11") 
            .enableHiveSupport().getOrCreate()
        )

In [3]:
# Load the raw SAS file
df_immigration = (spark.read.format('com.github.saurfang.sas.spark') 
                    .load('../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
                 )
# One-time step: write the data to parquet (left commented out once the copy exists)
# df_immigration.write.parquet("sas_data")
# From here on, work with the parquet copy
df_immigration = spark.read.parquet("sas_data")

print('total rows in raw data:',df_immigration.count())
df_immigration.printSchema()

total rows in raw data: 3096313
root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)


### Explore and Assess Immigration Data
- Identify data quality issues, like missing values, duplicate data, etc.
- Document steps necessary to clean the data

In [4]:
# list of column names
print(df_immigration.columns)

['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline', 'admnum', 'fltno', 'visatype']


In [5]:
# See some examples
df_immigration.limit(2).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,5748517.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,CA,20582.0,...,,M,1976.0,10292016,F,,QF,94953870000.0,11,B1
1,5748518.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,NV,20591.0,...,,M,1984.0,10292016,F,,VA,94955620000.0,7,B1


##### Check null values

In [6]:
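# Count nulls per column.
# The loop variable is named column_name so that it does not shadow
# the `col` function imported from pyspark.sql.functions.
num_nulls = dict()
for column_name in df_immigration.columns:
    total = df_immigration.filter(df_immigration[column_name].isNull()).count()
    num_nulls[column_name] = total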

In [7]:
# examine null values
num_nulls

{'cicid': 0,
 'i94yr': 0,
 'i94mon': 0,
 'i94cit': 0,
 'i94res': 0,
 'i94port': 0,
 'arrdate': 0,
 'i94mode': 239,
 'i94addr': 152592,
 'depdate': 142457,
 'i94bir': 802,
 'i94visa': 0,
 'count': 0,
 'dtadfile': 1,
 'visapost': 1881250,
 'occup': 3088187,
 'entdepa': 238,
 'entdepd': 138429,
 'entdepu': 3095921,
 'matflag': 138429,
 'biryear': 802,
 'dtaddto': 477,
 'gender': 414269,
 'insnum': 2982605,
 'airline': 83627,
 'admnum': 0,
 'fltno': 19549,
 'visatype': 0}
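
The counts are easier to judge as a share of the 3,096,313 raw rows. The following is a minimal sketch in plain Python that uses a subset of the counts printed above. The 90% cutoff and the names `null_share` and `mostly_null` are illustrative assumptions, not part of the original notebook.

```python
# Null counts copied from the output above (subset of columns).
total_rows = 3096313
num_nulls = {
    'i94addr': 152592,
    'depdate': 142457,
    'visapost': 1881250,
    'occup': 3088187,
    'entdepu': 3095921,
    'gender': 414269,
    'insnum': 2982605,
    'airline': 83627,
}

# Share of rows that are null, per column.
null_share = {name: n / total_rows for name, n in num_nulls.items()}

# Assumed cutoff: flag columns that are more than 90% null.
threshold = 0.9
mostly_null = [name for name, share in null_share.items() if share > threshold]

for name, share in sorted(null_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} {share:6.1%}")
print("mostly null:", mostly_null)
```

With this cutoff, `occup`, `entdepu`, and `insnum` stand out as nearly empty. None of them is used as a grouping column later in this notebook.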

##### Investigate the summary statistics of age 

In [8]:
df_immigration.select('i94bir').describe().toPandas()

Unnamed: 0,summary,i94bir
0,count,3095511.0
1,mean,41.767614458485205
2,stddev,17.42026053458727
3,min,-3.0
4,max,114.0


**Notes:**  
- Why is the minimum age -3? A negative age cannot be valid.
- A maximum age of 114 is still believable.

##### Check to see if there are other visitors over 100 yrs old

In [9]:
# Check rows where age (i94bir) > 100
df_immigration.filter(df_immigration['i94bir']>100).select('i94bir','i94port','i94addr','visatype','i94cit','i94visa').toPandas()

Unnamed: 0,i94bir,i94port,i94addr,visatype,i94cit,i94visa
0,109.0,HHW,HI,WT,438.0,2.0
1,108.0,SDP,,WT,116.0,2.0
2,107.0,HHW,HI,WT,117.0,2.0
3,101.0,HHW,HI,WT,180.0,2.0
4,105.0,HHW,,WT,438.0,2.0
5,102.0,HHW,,WT,438.0,2.0
6,103.0,HHW,HI,WT,438.0,2.0
7,102.0,NYC,,WT,999.0,2.0
8,102.0,XXX,HI,WT,112.0,2.0
9,102.0,KOA,HI,WT,180.0,2.0


##### Examine Null values in age

In [10]:
df_immigration.filter(df_immigration['i94bir'].isNull()).select('i94bir','i94port','i94addr','visatype','i94cit','i94visa').limit(6).toPandas()

Unnamed: 0,i94bir,i94port,i94addr,visatype,i94cit,i94visa
0,,CHM,NY,WT,111.0,2.0
1,,CHM,NY,WT,111.0,2.0
2,,CHM,,WT,111.0,2.0
3,,CHM,,WT,111.0,2.0
4,,HHW,,WT,112.0,2.0
5,,HHW,HI,WT,112.0,2.0


#### Prepare columns and define column types
- keep only entries with age > 0 (this also drops entries with a null age)
- replace Null in 'i94addr' with 'unspecified'
- rename columns to make them more intuitive to read
- cast numeric values to integer type
- drop duplicate entries
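
The age filter and the null replacement can be illustrated on a handful of rows in plain Python. This is only a sketch of the semantics, not the Spark code: the sample rows and the helper `clean_row` are made up for illustration. In Spark SQL a comparison against null is not true, so the `i94bir > 0` filter in the next cell drops rows with a missing age as well as rows with an age of 0 or below.

```python
def clean_row(row):
    """Return a cleaned copy of row, or None if the row should be dropped."""
    age = row['i94bir']
    # Mirrors a filter on i94bir > 0: null, zero, and negative ages are dropped.
    if age is None or not age > 0:
        return None
    cleaned = dict(row)
    # A missing destination state becomes the literal 'unspecified'.
    if cleaned['i94addr'] is None:
        cleaned['i94addr'] = 'unspecified'
    # Ages arrive as doubles; keep an integer copy in a new 'age' field.
    cleaned['age'] = int(age)
    return cleaned

# Hypothetical rows covering each case.
rows = [
    {'i94bir': 40.0, 'i94addr': 'CA'},
    {'i94bir': 108.0, 'i94addr': None},
    {'i94bir': -3.0, 'i94addr': 'HI'},
    {'i94bir': 0.0, 'i94addr': 'NY'},
    {'i94bir': None, 'i94addr': 'NY'},
]

kept = [c for c in map(clean_row, rows) if c is not None]
print(kept)
```

Only the first two rows survive, and the second one ends up with `'unspecified'` as its destination state.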

In [11]:
df_imm= (df_immigration.filter(df_immigration['i94bir']>0)
                .withColumn("year", df_immigration['i94yr'].cast(IntegerType()))
                .withColumn('i94addr', 
                            when(df_immigration["i94addr"].isNull(), 'unspecified')
                             .otherwise(df_immigration["i94addr"]))
                .withColumn("month", df_immigration['i94mon'].cast(IntegerType()))
                .withColumn("purpose", df_immigration['i94visa'].cast(IntegerType()))
                .withColumn("citizenship", df_immigration['i94cit'].cast(IntegerType())) 
                .withColumn("age", df_immigration['i94bir'].cast(IntegerType()))
                .withColumn("count", df_immigration['count'].cast(IntegerType()))
                .withColumnRenamed('i94port','entry_port')
                .withColumnRenamed('i94addr','destination_state')
                .withColumnRenamed('visatype', 'visa_type')
        ).drop_duplicates()

In [12]:
# Count number of rows
df_imm.count()

3094745
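
The row count can be reconciled against the raw data with a little arithmetic. This sketch uses only numbers printed earlier in this notebook; the variable names are illustrative.

```python
raw_rows = 3096313      # total rows in raw data
cleaned_rows = 3094745  # rows left after filtering and drop_duplicates()
null_age_rows = 802     # null count reported for i94bir

removed = raw_rows - cleaned_rows
# Rows removed for reasons other than a null age:
# ages of 0 or below, plus any exact duplicate rows.
other_removed = removed - null_age_rows

print(removed, other_removed)  # 1568 766
```

The notebook output does not say how the remaining 766 rows split between non-positive ages and duplicates.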

#### See examples of groupby result

In [None]:
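# Sum the visitor count for each combination of the grouping columns,
# then look at a small random sample of the aggregated rows.
(df_imm.groupBy(['year', 'month', 'entry_port', 'destination_state',
                 'citizenship', 'age', 'purpose', 'visa_type'])
       .agg({'count': 'sum'})
       .withColumnRenamed("sum(count)", "count")
).sample(0.00002).toPandas()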
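
For readers less familiar with Spark, the aggregation above can be mimicked in plain Python. This sketch is not the Spark implementation: the three input rows and their `count` values are invented for illustration, and only a subset of the grouping keys is used to keep it short.

```python
from collections import defaultdict

# Hypothetical cleaned rows (values are assumptions, not taken from the data).
rows = [
    {'entry_port': 'LOS', 'destination_state': 'CA', 'visa_type': 'B1', 'count': 1},
    {'entry_port': 'LOS', 'destination_state': 'CA', 'visa_type': 'B1', 'count': 1},
    {'entry_port': 'LOS', 'destination_state': 'NV', 'visa_type': 'B1', 'count': 1},
]

group_keys = ('entry_port', 'destination_state', 'visa_type')

# Sum 'count' for every distinct combination of the grouping keys.
totals = defaultdict(int)
for row in rows:
    key = tuple(row[k] for k in group_keys)
    totals[key] += row['count']

for key, total in totals.items():
    print(key, total)
```

Each distinct key combination becomes one output row, which is the shape the analytical tables are meant to have.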


---

### Explore and Assess Demographic Data

In [13]:
df_demographic = (spark.read.format('csv')
                         .option("header","true")
                         .option("inferSchema","true")
                         .option("sep",";")
                         .load('./us-cities-demographics.csv')
                    )


df_demographic.printSchema()

root
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Median Age: double (nullable = true)
 |-- Male Population: integer (nullable = true)
 |-- Female Population: integer (nullable = true)
 |-- Total Population: integer (nullable = true)
 |-- Number of Veterans: integer (nullable = true)
 |-- Foreign-born: integer (nullable = true)
 |-- Average Household Size: double (nullable = true)
 |-- State Code: string (nullable = true)
 |-- Race: string (nullable = true)
 |-- Count: integer (nullable = true)



#### Check for Nulls

In [14]:
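# Count nulls per column (column_name avoids shadowing the imported `col` function)
num_nulls = dict()
for column_name in df_demographic.columns:
    total = df_demographic.filter(df_demographic[column_name].isNull()).count()
    num_nulls[column_name] = total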

In [15]:
num_nulls

{'City': 0,
 'State': 0,
 'Median Age': 0,
 'Male Population': 3,
 'Female Population': 3,
 'Total Population': 0,
 'Number of Veterans': 13,
 'Foreign-born': 13,
 'Average Household Size': 16,
 'State Code': 0,
 'Race': 0,
 'Count': 0}

In [16]:
# See example of Null values
df_demographic.filter(df_demographic['Foreign-born'].isNull()).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,San Juan,Puerto Rico,41.4,155408,186829,342237,,,,PR,Hispanic or Latino,335559
1,Caguas,Puerto Rico,40.4,34743,42265,77008,,,,PR,Hispanic or Latino,76349
2,Carolina,Puerto Rico,42.0,64758,77308,142066,,,,PR,American Indian and Alaska Native,12143
3,Carolina,Puerto Rico,42.0,64758,77308,142066,,,,PR,Hispanic or Latino,139967
4,San Juan,Puerto Rico,41.4,155408,186829,342237,,,,PR,American Indian and Alaska Native,4031
5,Mayagüez,Puerto Rico,38.1,30799,35782,66581,,,,PR,Asian,235
6,Ponce,Puerto Rico,40.5,56968,64615,121583,,,,PR,Hispanic or Latino,120705
7,Bayamón,Puerto Rico,39.4,80128,90131,170259,,,,PR,Hispanic or Latino,169155
8,San Juan,Puerto Rico,41.4,155408,186829,342237,,,,PR,Asian,2452
9,Guaynabo,Puerto Rico,42.2,33066,37426,70492,,,,PR,Hispanic or Latino,69936


In [17]:
# See examples 
df_demographic.sample(0.002).toPandas()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Killeen,Texas,29.2,69442,71367,140809,24281,15769,2.72,TX,Black or African-American,54601
1,Plano,Texas,38.1,138565,145054,283619,11719,74030,2.65,TX,White,195220
2,Bend,Oregon,37.3,42294,44723,87017,6199,3032,2.39,OR,Asian,2726
3,Menifee,California,37.1,42866,44297,87163,6821,12481,3.06,CA,White,62331


In [18]:
# See examples of a particular city, Lynchburg

df_demographic.filter(df_demographic['City']=='Lynchburg').toPandas()


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Lynchburg,Virginia,28.7,38614,41198,79812,4322,4364,2.48,VA,Black or African-American,23271
1,Lynchburg,Virginia,28.7,38614,41198,79812,4322,4364,2.48,VA,Hispanic or Latino,2689
2,Lynchburg,Virginia,28.7,38614,41198,79812,4322,4364,2.48,VA,Asian,2910
3,Lynchburg,Virginia,28.7,38614,41198,79812,4322,4364,2.48,VA,American Indian and Alaska Native,1024
4,Lynchburg,Virginia,28.7,38614,41198,79812,4322,4364,2.48,VA,White,53727
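
The Lynchburg rows show the grain of this dataset: one row per city and race, with the city-level columns repeated on every row. A quick check in plain Python, using the values printed above, shows that the per-race counts add up to more than the total population, so `Count` should not be summed across races to recover a city total. A likely explanation, which is an assumption here and not something the data states, is that a person can be counted under more than one category.

```python
# Values copied from the Lynchburg output above.
total_population = 79812
race_counts = {
    'Black or African-American': 23271,
    'Hispanic or Latino': 2689,
    'Asian': 2910,
    'American Indian and Alaska Native': 1024,
    'White': 53727,
}

summed = sum(race_counts.values())
print(summed, summed - total_population)  # 83621 3809
```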


#### Combine all preparation and cleaning tasks
- Rename columns to make them more intuitive to read
- Convert the city column to upper case
- Cast the median age column to a float type

In [24]:
# Create a null-safe uppercase function
uppercase_string = udf(lambda s: s.upper() if s is not None else None)

df_demo = (df_demographic.withColumnRenamed('State Code', 'state_code')
                        .withColumn('city', uppercase_string('City'))
                        .withColumn('median_age', col('Median Age').cast(FloatType())) 
                        .withColumnRenamed('Foreign-born','foreign_born')
                        .withColumnRenamed('Total Population', 'total_population')
                        .withColumnRenamed('Race','race')
                        .withColumnRenamed('Count', 'race_count')
                ).drop_duplicates().dropna()


In [25]:
dim_demographic_table = df_demo.select(['state_code','city','median_age','foreign_born',
                                       'total_population','race','race_count'])

In [26]:
dim_demographic_table.count()

2875
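
`dropna()` with no arguments drops a row if any of its columns is null, so the Puerto Rico rows shown earlier (null veterans, foreign-born, and household size) do not reach the dimension table. The following is a plain-Python sketch of that rule, with two rows abridged from the outputs above; `has_null` is an illustrative helper, not a Spark function.

```python
# Two rows abridged from the outputs above; None stands for a null value.
rows = [
    {'City': 'San Juan', 'State Code': 'PR', 'Foreign-born': None,
     'Race': 'Asian', 'Count': 2452},
    {'City': 'Lynchburg', 'State Code': 'VA', 'Foreign-born': 4364,
     'Race': 'Asian', 'Count': 2910},
]

def has_null(row):
    """True if any value in the row is missing."""
    return any(value is None for value in row.values())

kept = [row for row in rows if not has_null(row)]
print([row['City'] for row in kept])  # ['Lynchburg']
```

Because the check runs before the final `select`, a null in a column that is not part of the dimension table (for example `Male Population`) also removes the row.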

---

**Next Steps**
- Consolidate the data cleaning steps into functions and store them in [utility_functions.py](./utility_functions.py)
- Import this file as a module in the next [ETL notebook](./Step3_ETL.ipynb)