# Project i94
##### (Step1_Exploration.ipynb)
**Note:** This notebook includes the following work steps:
* Step 1: Defining the project and gathering data
* Step 2: Exploring & assessing datasets

---
## Project Scope & Data Sources
- This project aims to build a data warehouse of US immigration and demographic data.

- The end goal is to create analytical tables for organizations affected by the influx of US visitors and immigrants. For example:
    - Tour agencies in Hawaii may benefit from knowing, "Which countries should we target with travel deals & promotions?"
    - Tech companies may want to know, "For which countries should we prepare visa sponsorships for summer internships?"
    - Hotel managers may benefit from understanding, "In which months of the year should we consider increasing staff size?"
    - etc.
    
    
- Data Sources:
    - I94 Immigration Data from the US National Tourism and Trade Office
    - U.S. City Demographic Data from OpenSoft



---

##### Importing libraries and initiating spark session

In [1]:
import pandas as pd
import numpy as np
import os
import re

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import IntegerType, FloatType

from utility_functions import count_nulls

In [2]:
# Start spark session
spark = (SparkSession 
            .builder 
            .config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11") 
            .enableHiveSupport().getOrCreate()
        )

##### Loading data and initial exploration
- Due to size limitations, the SAS files are not included in this repo.

In [3]:
# Load files
df_immigration = (spark.read.format('com.github.saurfang.sas.spark') 
                    .load('../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat')
                 )
# write to parquet
# df_immigration.write.parquet("sas_data")
# df_immigration=spark.read.parquet("sas_data")

print('total rows in raw data:',df_immigration.count())
df_immigration.printSchema()

total rows in raw data: 3096313
root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)


### Explore and Assess Immigration Data
- Identify data quality issues, like missing values, duplicate data, etc.
- Document steps necessary to clean the data

In [4]:
# list of column names
print(df_immigration.columns)

['cicid', 'i94yr', 'i94mon', 'i94cit', 'i94res', 'i94port', 'arrdate', 'i94mode', 'i94addr', 'depdate', 'i94bir', 'i94visa', 'count', 'dtadfile', 'visapost', 'occup', 'entdepa', 'entdepd', 'entdepu', 'matflag', 'biryear', 'dtaddto', 'gender', 'insnum', 'airline', 'admnum', 'fltno', 'visatype']


In [5]:
# See some examples
df_immigration.limit(2).toPandas()

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,...,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,6.0,2016.0,4.0,692.0,692.0,XXX,20573.0,,,,...,U,,1979.0,10282016,,,,1897628000.0,,B2
1,7.0,2016.0,4.0,254.0,276.0,ATL,20551.0,1.0,AL,,...,Y,,1991.0,D/S,M,,,3736796000.0,296.0,F1
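
The `arrdate` and `depdate` columns are stored as doubles rather than dates. Assuming they are SAS date values (days elapsed since 1960-01-01, the SAS epoch), which this notebook does not confirm, the two sample values above decode to days in April 2016, consistent with the file name. A minimal sketch, with `sas_to_date` as a hypothetical helper name:

```python
from datetime import date, timedelta

# SAS stores dates as the number of days since 1960-01-01.
# Assumption: arrdate/depdate follow that convention.
SAS_EPOCH = date(1960, 1, 1)


def sas_to_date(days):
    """Convert a SAS day count to a date; None stays None."""
    if days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(days))


# arrdate values taken from the two sample rows above
print(sas_to_date(20573.0))  # 2016-04-29
print(sas_to_date(20551.0))  # 2016-04-07
print(sas_to_date(None))     # None
```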


##### Check null values

In [6]:
# examine and count null values

count_nulls(df_immigration)

{'cicid': 0,
 'i94yr': 0,
 'i94mon': 0,
 'i94cit': 0,
 'i94res': 0,
 'i94port': 0,
 'arrdate': 0,
 'i94mode': 239,
 'i94addr': 152592,
 'depdate': 142457,
 'i94bir': 802,
 'i94visa': 0,
 'count': 0,
 'dtadfile': 1,
 'visapost': 1881250,
 'occup': 3088187,
 'entdepa': 238,
 'entdepd': 138429,
 'entdepu': 3095921,
 'matflag': 138429,
 'biryear': 802,
 'dtaddto': 477,
 'gender': 414269,
 'insnum': 2982605,
 'airline': 83627,
 'admnum': 0,
 'fltno': 19549,
 'visatype': 0}
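
Raw null counts are easier to judge as a share of the 3,096,313 rows loaded. The sketch below reuses a subset of the counts printed above; the 50% cut-off for calling a column "mostly null" is an arbitrary assumption, not a rule used elsewhere in this notebook.

```python
# Null counts copied from the count_nulls output above (subset of columns)
null_counts = {
    'i94addr': 152592,
    'depdate': 142457,
    'visapost': 1881250,
    'occup': 3088187,
    'entdepu': 3095921,
    'gender': 414269,
    'insnum': 2982605,
    'airline': 83627,
}
TOTAL_ROWS = 3096313  # row count printed when the file was loaded

# Assumption: 50% is an arbitrary cut-off for "mostly null"
THRESHOLD = 0.5

null_share = {name: n / TOTAL_ROWS for name, n in null_counts.items()}
mostly_null = [name for name, share in null_share.items() if share > THRESHOLD]

for name, share in null_share.items():
    print(f"{name:10s} {share:7.2%}")
print("mostly null:", mostly_null)
```

Columns flagged this way (`visapost`, `occup`, `entdepu`, `insnum`) are not used in the grouped tables built later in this notebook.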

##### Investigate the summary statistics of age 

In [7]:
df_immigration.select('i94bir').describe().toPandas()

Unnamed: 0,summary,i94bir
0,count,3095511.0
1,mean,41.767614458485205
2,stddev,17.42026053458826
3,min,-3.0
4,max,114.0


**Notes:**  
- The minimum age is -3, which is not a valid age and needs investigating.
- A maximum age of 114 is still believable.
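
One hypothesis for the negative minimum age, not confirmed anywhere in this notebook: `i94bir` may simply be `i94yr - biryear`, in which case an age of -3 would correspond to a recorded birth year of 2019. The identical null counts for `i94bir` and `biryear` (802 each) are consistent with that idea. The sketch below shows the check on made-up rows; `age_mismatches` is a hypothetical helper and the row values are illustrative, not taken from the data.

```python
def age_mismatches(rows):
    """Return rows whose age differs from arrival year minus birth year.

    Rows with a missing age or birth year are skipped.
    """
    mismatches = []
    for row in rows:
        age, birth_year = row['i94bir'], row['biryear']
        if age is None or birth_year is None:
            continue
        if age != row['i94yr'] - birth_year:
            mismatches.append(row)
    return mismatches


# Illustrative rows only (not real records)
rows = [
    {'i94yr': 2016, 'biryear': 1979, 'i94bir': 37},
    {'i94yr': 2016, 'biryear': 2019, 'i94bir': -3},   # future birth year
    {'i94yr': 2016, 'biryear': None, 'i94bir': None},
    {'i94yr': 2016, 'biryear': 1990, 'i94bir': 30},   # inconsistent on purpose
]

print(age_mismatches(rows))
```

If the same check on the real data returned no mismatches, the negative ages would trace back to birth years later than the arrival year.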

##### Check whether there are other visitors over 100 years old

In [8]:
# Check i94bir > 100
(df_immigration.filter(df_immigration['i94bir'] > 100)
               .select('i94bir', 'i94port', 'i94addr', 'visatype', 'i94cit', 'i94visa')
               .toPandas())

Unnamed: 0,i94bir,i94port,i94addr,visatype,i94cit,i94visa
0,109.0,HHW,HI,WT,438.0,2.0
1,108.0,SDP,,WT,116.0,2.0
2,107.0,HHW,HI,WT,117.0,2.0
3,101.0,HHW,HI,WT,180.0,2.0
4,105.0,HHW,,WT,438.0,2.0
5,102.0,HHW,,WT,438.0,2.0
6,103.0,HHW,HI,WT,438.0,2.0
7,102.0,NYC,,WT,999.0,2.0
8,102.0,XXX,HI,WT,112.0,2.0
9,102.0,KOA,HI,WT,180.0,2.0


##### Examine Null values in age

In [9]:
df_immigration.filter(df_immigration['i94bir'].isNull()).select('i94bir','i94port','i94addr','visatype','i94cit','i94visa').limit(6).toPandas()

Unnamed: 0,i94bir,i94port,i94addr,visatype,i94cit,i94visa
0,,X96,AZ,B2,518.0,2.0
1,,HHW,,WT,180.0,2.0
2,,HHW,HI,WT,180.0,2.0
3,,HHW,HI,WT,180.0,2.0
4,,HHW,HI,WT,180.0,2.0
5,,HHW,HI,WT,180.0,2.0


#### Prepare columns and define column types
- Remove entries with a missing or non-positive age (the `i94bir > 0` filter drops nulls, 0, and negative values)
- Replace nulls in 'i94addr' with 'unspecified'
- Rename columns to make them more intuitive to read
- Cast numeric columns to integer type
- Drop duplicate entries

In [10]:
df_imm = (df_immigration
                # keep rows with a strictly positive age; null ages are dropped as well
                .filter(df_immigration['i94bir'] > 0)
                .withColumn("year", df_immigration['i94yr'].cast(IntegerType()))
                # fill missing destination states before the column is renamed
                .withColumn('i94addr',
                            when(df_immigration["i94addr"].isNull(), 'unspecified')
                             .otherwise(df_immigration["i94addr"]))
                .withColumn("month", df_immigration['i94mon'].cast(IntegerType()))
                .withColumn("purpose", df_immigration['i94visa'].cast(IntegerType()))
                .withColumn("citizenship", df_immigration['i94cit'].cast(IntegerType()))
                .withColumn("age", df_immigration['i94bir'].cast(IntegerType()))
                .withColumn("count", df_immigration['count'].cast(IntegerType()))
                # give the remaining columns readable names
                .withColumnRenamed('i94port', 'entry_port')
                .withColumnRenamed('i94addr', 'destination_state')
                .withColumnRenamed('visatype', 'visa_type')
        ).drop_duplicates()
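
In Spark SQL, a comparison against a null value evaluates to null, and `filter` keeps only the rows where the condition is true. The `i94bir > 0` filter therefore drops rows with a missing age as well as ages of 0 or below. The plain-Python sketch below emulates that behaviour on a handful of illustrative values; `spark_style_gt` and `spark_style_filter` are hypothetical helpers, not Spark functions.

```python
def spark_style_gt(value, threshold):
    """Emulate Spark's `value > threshold`: null in, null out."""
    if value is None:
        return None
    return value > threshold


def spark_style_filter(values, threshold):
    """Keep only values whose comparison result is exactly True."""
    return [v for v in values if spark_style_gt(v, threshold) is True]


ages = [-3.0, 0.0, None, 1.0, 41.0, 114.0]  # illustrative values
print(spark_style_filter(ages, 0))  # [1.0, 41.0, 114.0]
```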

In [11]:
# Count number of rows
df_imm.count()

3094745
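
The counts reported so far allow a quick reconciliation of how many rows the cleaning step removed. This is plain arithmetic on the numbers printed above; how the remainder splits between non-positive ages and exact duplicates cannot be derived from these outputs alone.

```python
raw_rows = 3096313      # count printed after loading the SAS file
clean_rows = 3094745    # df_imm.count() after filtering and de-duplication
null_ages = 802         # null count reported for i94bir

removed = raw_rows - clean_rows
# Rows removed for reasons other than a missing age:
# ages of 0 or below, plus exact duplicates
other_removed = removed - null_ages

print(removed)         # 1568
print(other_removed)   # 766
print(f"{removed / raw_rows:.4%} of rows removed")
```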

---

#### See examples of groupby result

In [12]:
(df_imm.groupBy(['year', 'month','entry_port','destination_state',
                                        'citizenship','age','purpose',
                                        'visa_type']).agg({'count':'sum'})
                                        .withColumnRenamed("sum(count)", "count")
).sample(0.00001).toPandas()


Unnamed: 0,year,month,entry_port,destination_state,citizenship,age,purpose,visa_type,count
0,2016,4,OPF,CO,582,34,2,B2,1
1,2016,4,MIA,FL,438,40,1,WB,2
2,2016,4,ATL,FL,575,62,2,B2,1
3,2016,4,OGG,CA,254,31,2,WT,3
4,2016,4,FTL,unspecified,111,45,2,WT,2
5,2016,4,LVG,TX,135,42,2,WT,1
6,2016,4,NEW,IL,148,41,1,WB,3
7,2016,4,HOU,CA,129,32,1,WB,1
8,2016,4,FTL,MA,126,75,2,WT,1
9,2016,4,ORL,FL,575,21,2,B2,12


In [13]:
(df_imm.groupBy(['year', 'month','entry_port','destination_state',
                                        'citizenship','age','purpose',
                                        'visa_type']).agg({'count':'sum'})
                                        .withColumnRenamed("sum(count)", "count")
).dropna().count()

856233
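
Comparing this figure with the cleaned row count shows how much the grouping condenses the data. The sketch assumes `dropna()` discards no groups, which the null counts above suggest (the grouping columns are free of nulls, filtered, or filled) but which this notebook does not verify directly.

```python
clean_rows = 3094745     # rows in df_imm
grouped_rows = 856233    # rows left after grouping

rows_per_group = clean_rows / grouped_rows
reduction = 1 - grouped_rows / clean_rows

print(f"average rows per group: {rows_per_group:.2f}")
print(f"row reduction: {reduction:.1%}")
```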

In [14]:
from Misc import *

In [15]:
junk2= valid_data_bycode(df_imm)

the number of entries with valid port codes:  3094745


In [None]:
code = list(i94cit_dict.keys())
pattern= re.compile(r'\'(\w\w)\'')
key= [pattern.search(i).groups()[0] for i in code]
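
`pattern.search(i)` returns `None` when a key contains no quoted two-character code, so the list comprehension above would raise `AttributeError` on such a key. A minimal defensive sketch follows. The sample keys are assumptions about what the dictionary keys might look like, since the contents of `i94cit_dict` are not shown here, and `extract_codes` is a hypothetical helper.

```python
import re

# Same pattern as in the cell above: two word characters in single quotes
PATTERN = re.compile(r"'(\w\w)'")


def extract_codes(keys):
    """Return the quoted two-character code from each key that has one."""
    codes = []
    for key in keys:
        match = PATTERN.search(str(key))
        if match is not None:          # skip keys without a quoted code
            codes.append(match.group(1))
    return codes


# Hypothetical keys for illustration
sample_keys = ["'AL'", "'HI'", "99", "'CA'"]
print(extract_codes(sample_keys))  # ['AL', 'HI', 'CA']
```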


In [None]:
junk = df_imm.filter(df_imm['destination_state'].isin(key))

In [None]:
junk.count()

In [None]:
# df_imm.filter(df_imm['destination_state']=='.I').limit(5).toPandas()

---

### Explore and Assess Demographic Data

In [None]:
df_demographic = (spark.read.format('csv')
                         .option("header","true")
                         .option("inferSchema","true")
                         .option("sep",";")
                         .load('../data/us-cities-demographics.csv')
                    )


df_demographic.printSchema()

#### Check for Nulls

In [None]:
# Count null values 

count_nulls(df_demographic)

In [None]:
# See example of Null values
df_demographic.filter(df_demographic['Foreign-born'].isNull()).toPandas()

In [None]:
# See examples 
df_demographic.sample(0.002).toPandas()

In [None]:
# See examples of a particular city, Lynchburg

df_demographic.filter(df_demographic['City']=='Lynchburg').toPandas()


#### Combine all preparation and cleaning tasks
- Rename columns to make them more intuitive to read
- Convert the city column to upper case
- Cast the median age column to a float type

In [None]:
# Create a None-safe uppercase UDF (returns a string column by default)
uppercase_string = udf(lambda s: s.upper() if s is not None else None)

df_demo = (df_demographic.withColumnRenamed('State Code', 'state_code')
                        .withColumn('city', uppercase_string('City'))
                        .withColumn('median_age', col('Median Age').cast(FloatType())) 
                        .withColumnRenamed('Foreign-born','foreign_born')
                        .withColumnRenamed('Total Population', 'total_population')
                        .withColumnRenamed('Race','race')
                        .withColumnRenamed('count', 'race_count')
                ).drop_duplicates().dropna()
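
Most of the renames above follow one rule: lower-case the name and replace spaces or hyphens with underscores. A small helper can apply that rule to any column name. The sketch below is plain Python, and `to_snake_case` is a hypothetical name. Semantic renames such as `count` to `race_count` would still need to be listed by hand.

```python
import re


def to_snake_case(name):
    """Lower-case a column name and join its words with underscores."""
    return re.sub(r'[^0-9A-Za-z]+', '_', name).strip('_').lower()


# Column names used in the cell above
columns = ['State Code', 'Median Age', 'Foreign-born', 'Total Population', 'Race']
print([to_snake_case(c) for c in columns])
# ['state_code', 'median_age', 'foreign_born', 'total_population', 'race']
```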


In [None]:
# Prepare a dim table
dim_demographic_table = df_demo.select(['state_code','city','median_age','foreign_born',
                                       'total_population','race','race_count'])

In [None]:
# Count the number of rows in dim table
dim_demographic_table.count()

---

**Next Step**
- Consolidate the data cleaning steps into functions and store them in [utility_functions.py](./utility_functions.py)
- Import that file as a module in the next [ETL notebook](./Step3_ETL.ipynb)