# Project Title
## Data Engineering Capstone Project

### Project Summary
--describe your project at a high level--

The project follows the follow steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [4]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, when, count, col
from datetime import datetime
from datetime import timedelta

In [2]:
spark = SparkSession.builder.\
config("spark.jars.packages","saurfang:spark-sas7bdat:2.0.0-s_2.11")\
.enableHiveSupport().getOrCreate()

In [13]:
import pyspark.sql.functions as f
import pyspark.sql.types as t

## Step 1: Scope the Project and Gather Data

### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What is your end solution look like? What tools did you use? etc>

### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

In [5]:
# Due to its size, I process the 
df_sas = spark.read.parquet("sas_data")
df_airport = pd.read_csv("data/airport-codes_csv.csv")
df_demo = pd.read_csv("data/us-cities-demographics.csv", sep=';')
df_weather = pd.read_csv("temperature_data/GlobalLandTemperaturesByCountry.csv")

In [6]:
with open('data/I94_SAS_Labels_Descriptions.SAS') as f:
    
    def clean_field(df, col, regex):
        df[col] = df[col].str.extract(regex)
        df[col] = df[col].str.strip()
        df[col] = df[col].str.upper()
        return df[col]
    
    lines=f.readlines()
    
    df_cntyl = pd.DataFrame(lines[9:297])
    df_cntyl = df_cntyl[0].str.split("=", n=1, expand= True)
    df_cntyl.columns = ['i94cntyl','country']
    df_cntyl['country'] = clean_field(df_cntyl, 'country', r'\'([^\']+)\'')
    df_cntyl['i94cntyl'] = df_cntyl['i94cntyl'].astype(int)
    
    df_port = pd.DataFrame(lines[302:962])
    df_port = df_port[0].str.split("=", n=1, expand= True)
    df_port_comma_split = df_port[1].str.split(",", n=1, expand= True)
    df_port[1] = df_port_comma_split[0]
    df_port[2] = df_port_comma_split[1]
    df_port.columns = ['i94port','port','addr']
    df_port['i94port'] = clean_field(df_port, 'i94port', r'\'([^\']+)\'')
    df_port['port'] = clean_field(df_port, 'port', r'\'([^\']+)')
    df_port['addr'] = clean_field(df_port, 'addr', r'([^\']+)\'')
  
    df_mode = pd.DataFrame(lines[972:976])
    df_mode = df_mode[0].str.split("=", n=1, expand= True)
    df_mode.columns = ['i94mode','mode']
    df_mode['mode'] = clean_field(df_mode, 'mode', r'\'([^\']+)\'')
    df_mode['i94mode'] = clean_field(df_mode, 'i94mode', r'\s+([^\']+)')
    
    df_addr = pd.DataFrame(lines[981:1036])
    df_addr = df_addr[0].str.split("=", n=1, expand= True)
    df_addr.columns = ['i94addr','state']
    df_addr['i94addr'] = clean_field(df_addr, 'i94addr', r'\'([^\']+)\'')
    df_addr['state'] = clean_field(df_addr, 'state', r'\'([^\']+)\'')
    
    df_visa = pd.DataFrame(lines[1046:1049])
    df_visa = df_visa[0].str.split("=", n=1, expand= True)
    df_visa.columns = ['i94visa','visa']
    df_visa['visa'] = clean_field(df_visa, 'visa', r'([^\']+)\n')
    

In [7]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df_sas.limit(10).toPandas())

Unnamed: 0,cicid,i94yr,i94mon,i94cit,i94res,i94port,arrdate,i94mode,i94addr,depdate,i94bir,i94visa,count,dtadfile,visapost,occup,entdepa,entdepd,entdepu,matflag,biryear,dtaddto,gender,insnum,airline,admnum,fltno,visatype
0,5748517.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,CA,20582.0,40.0,1.0,1.0,20160430,SYD,,G,O,,M,1976.0,10292016,F,,QF,94953870000.0,11,B1
1,5748518.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,NV,20591.0,32.0,1.0,1.0,20160430,SYD,,G,O,,M,1984.0,10292016,F,,VA,94955620000.0,7,B1
2,5748519.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20582.0,29.0,1.0,1.0,20160430,SYD,,G,O,,M,1987.0,10292016,M,,DL,94956410000.0,40,B1
3,5748520.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,29.0,1.0,1.0,20160430,SYD,,G,O,,M,1987.0,10292016,F,,DL,94956450000.0,40,B1
4,5748521.0,2016.0,4.0,245.0,438.0,LOS,20574.0,1.0,WA,20588.0,28.0,1.0,1.0,20160430,SYD,,G,O,,M,1988.0,10292016,M,,DL,94956390000.0,40,B1
5,5748522.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20579.0,57.0,2.0,1.0,20160430,ACK,,G,O,,M,1959.0,10292016,M,,NZ,94981800000.0,10,B2
6,5748523.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,66.0,2.0,1.0,20160430,ACK,,G,O,,M,1950.0,10292016,F,,NZ,94979690000.0,10,B2
7,5748524.0,2016.0,4.0,245.0,464.0,HHW,20574.0,1.0,HI,20586.0,41.0,2.0,1.0,20160430,ACK,,G,O,,M,1975.0,10292016,F,,NZ,94979750000.0,10,B2
8,5748525.0,2016.0,4.0,245.0,464.0,HOU,20574.0,1.0,FL,20581.0,27.0,2.0,1.0,20160430,ACK,,G,O,,M,1989.0,10292016,M,,NZ,94973250000.0,28,B2
9,5748526.0,2016.0,4.0,245.0,464.0,LOS,20574.0,1.0,CA,20581.0,26.0,2.0,1.0,20160430,ACK,,G,O,,M,1990.0,10292016,F,,NZ,95013550000.0,2,B2


## Step 2: Explore and Assess the Data
### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

### Cleaning Steps
Document steps necessary to clean the data

### Immigration data

In [8]:
df_sas.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

In [9]:
# Count the relative number of null values 
df_sas_total_rows = df_sas.count()
df_sas_nulls = df_sas.select([(count(when(isnan(c) | col(c).isNull(), c))/df_sas_total_rows).alias(c) for c in df_sas.columns]).toPandas()

# Drop columns with over 90% null values. 
# Note: This step is for demonstration purposes; in a real project I would leave
# this decision to a data scientist.
empty_cols = []
for c in df_sas_nulls.columns:
    if df_sas_nulls[c][0] > 0.9:
        empty_cols.append(c)
print(empty_cols)
df_sas_clean_a = df_sas.drop(*empty_cols)
       

['occup', 'entdepu', 'insnum']


In [10]:
# Drop columns with duplicate ids
df_sas_clean_b = df_sas_clean_a.dropna(how='all', subset=['cicid'])

In [11]:
# Convert double columns to the original format (integer)
df_sas_clean_c = df_sas_clean_b.\
withColumn("cicid", df_sas_clean_b["cicid"].cast('integer')).\
withColumn("i94yr", df_sas_clean_b["i94yr"].cast('integer')).\
withColumn("i94mon", df_sas_clean_b["i94mon"].cast('integer')).\
withColumn("i94cit", df_sas_clean_b["i94cit"].cast('integer')).\
withColumn("i94res", df_sas_clean_b["i94res"].cast('integer')).\
withColumn("arrdate", df_sas_clean_b["arrdate"].cast('integer')).\
withColumn("i94mode", df_sas_clean_b["i94mode"].cast('integer')).\
withColumn("i94bir", df_sas_clean_b["i94bir"].cast('integer')).\
withColumn("count", df_sas_clean_b["count"].cast('integer')).\
withColumn("i94visa", df_sas_clean_b["i94visa"].cast('integer')).\
withColumn("depdate", df_sas_clean_b["depdate"].cast('integer')).\
withColumn("biryear", df_sas_clean_b["biryear"].cast('integer')).\
withColumn("admnum", df_sas_clean_b["admnum"].cast('integer'))

In [14]:
# Convert SAS date format to datetime:
def date_add_(days):
    date = datetime.strptime('1960-01-01', "%Y-%m-%d")
    return date + timedelta(days)

date_add_udf = f.udf(date_add_, t.DateType())

df_sas_clean_d = df_sas_clean_c.withColumn('arrdate', date_add_udf('arrdate'))\
    .withColumn('depdate', date_add_udf('depdate'))

# Drop year and mon columns
df_sas_clean_e = df_sas_clean_d.drop('i94yr','i94mon')

Here, I have decided against keeping year and month columns (or even generating an additional day column), since we don't have weather data for these dates available, and hence a direct join would not make much sense. Instead, I leave it up to the data scientist on the receiving end of the data to process the date values and join them as desired.

### Remaining data

In [15]:
# For the weather and cyntl data, the country column is capitalized 
# to enable joins. I also convert the weather date string to datetime format.
df_weather.columns=['date','average_temperature','average_temperature_uncertainty','country']
df_weather['country'] = df_weather['country'].str.upper().astype(str)
df_weather['date'] = pd.to_datetime(df_weather['date'])
df_weather=df_weather[df_weather['average_temperature'].notnull()]

df_cntyl['country'] = df_cntyl['country'].str.upper().astype(str)


In [16]:
# The demographic column names have many spaces and capitalization, so I adjust them to be DWH-friendly
df_demo.columns=['city', 'state', 'median_age', 'male_population', 'female_population',
       'total_population', 'number_of_veterans', 'foreign_born',
       'average_household_size', 'state_code', 'race', 'count']

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

## Step 4: Run Pipelines to Model the Data 

Here, I build the data pipelines to create the data model. We also define a testing function to perform some basic data quality checks on a dataframe

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks

In [17]:
def dataframe_quality_check(df):
    
    if(df.index.is_unique):
        print("The dataframe has a unique index.")
    else:
        print("Warning: The dataframe does not have a unique index.")
    
    col_summary = dict()
    for c in df.columns:
        col_attributes = dict()
        col_attributes['dtype'] = df[c].dtype
        col_attributes['count'] = df[c].count()
        col_attributes['count_null'] = df[c].size - col_attributes['count']
        col_attributes['unique_values'] = df[c].nunique()
    
        col_summary[c] = col_attributes
    return pd.DataFrame(col_summary).transpose()

### Immigration data

With the cleanups we already did, the immigration data should actually be fine as-is.

In [18]:
df_immigration_dwh = df_sas_clean_e

### Weather data: augment weather data with cntyl code

In order to be able to join the weather data to the immigration data, the cntyl country code needs to be available in the weather data. This is done via a join on the country field. Ideally, this join would be fuzzy, but for now I will just perform a rigid join.

We leave the actual aggreagation of weather data over time to the data scientist. This implies that country names will have multiple appearances, and hence cannot be used as an index column.

In [19]:
df_weather_dwh = pd.merge(left=df_weather, right=df_cntyl, 
                          left_on='country', right_on='country',
                          how='left')

In [20]:
df_weather_dwh.head(5)

Unnamed: 0,date,average_temperature,average_temperature_uncertainty,country,i94cntyl
0,1743-11-01,4.384,2.294,ÅLAND,
1,1744-04-01,1.53,4.68,ÅLAND,
2,1744-05-01,6.702,1.789,ÅLAND,
3,1744-06-01,11.609,1.577,ÅLAND,
4,1744-07-01,15.342,1.41,ÅLAND,


In [21]:
# Only keep countries that are in the imigration data
i94cntyl_in_sas = list(set(df_immigration_dwh.select("i94cit").distinct().toPandas()['i94cit'] \
+ df_immigration_dwh.select("i94res").distinct().toPandas()['i94res']))
i94cntyl_in_sas = [int(x) for x in i94cntyl_in_sas if str(x) != 'nan']

df_weather_dwh = df_weather_dwh[df_weather_dwh['i94cntyl'].notnull()]
df_weather_dwh = df_weather_dwh[df_weather_dwh['i94cntyl'].isin(i94cntyl_in_sas)]
df_weather_dwh = df_weather_dwh.reset_index(drop=True)

In [22]:
df_weather_dwh.head(5)

Unnamed: 0,date,average_temperature,average_temperature_uncertainty,country,i94cntyl
0,1824-01-01,25.146,0.874,BARBADOS,513.0
1,1824-02-01,24.806,2.374,BARBADOS,513.0
2,1824-03-01,25.318,1.09,BARBADOS,513.0
3,1824-04-01,26.43,2.173,BARBADOS,513.0
4,1824-05-01,26.553,1.217,BARBADOS,513.0


In [23]:
dataframe_quality_check(df_weather_dwh)

The dataframe has a unique index.


Unnamed: 0,count,count_null,dtype,unique_values
date,40018,0,datetime64[ns],2457
average_temperature,40018,0,float64,16448
average_temperature_uncertainty,40018,0,float64,2726
country,40018,0,object,20
i94cntyl,40018,0,float64,20


The data looks fine.

##### State data: aggregate demographic data on state level

We aggregate the available numeric data on a city level for each state. Since we don't have access to total state demographics in this data set, we express the male population, female population, veteran number and foreign born number as fractions of total pupolation.

In [24]:
df_demo_dwh = df_demo[['state_code', 'state']].drop_duplicates().set_index('state_code')\
.join(df_demo.groupby(['state_code'])['male_population', 'female_population',\
                                      'total_population', 'number_of_veterans', 'foreign_born'].agg('sum'))\
.join(df_demo.groupby(['state_code'])['median_age', 'average_household_size'].agg('median'))

df_demo_dwh['male_population'] = df_demo_dwh['male_population']/df_demo_dwh['total_population']
df_demo_dwh['female_population'] = df_demo_dwh['female_population']/df_demo_dwh['total_population']
df_demo_dwh['number_of_veterans'] = df_demo_dwh['number_of_veterans']/df_demo_dwh['total_population']
df_demo_dwh['foreign_born'] = df_demo_dwh['foreign_born']/df_demo_dwh['total_population']

df_demo_dwh = df_demo_dwh.drop(['total_population'], axis=1)

In [25]:
df_demo_dwh.head(5)

Unnamed: 0_level_0,state,male_population,female_population,number_of_veterans,foreign_born,median_age,average_household_size
state_code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
MD,Maryland,0.478574,0.521426,0.048885,0.175131,35.9,2.64
MA,Massachusetts,0.484253,0.515747,0.032929,0.257458,34.9,2.43
AL,Alabama,0.474154,0.525846,0.068347,0.048911,38.0,2.41
CA,California,0.494601,0.505399,0.037402,0.300214,35.8,3.06
NJ,New Jersey,0.493871,0.506129,0.021156,0.335845,34.6,2.85


In [26]:
dataframe_quality_check(df_demo_dwh)

The dataframe has a unique index.


Unnamed: 0,count,count_null,dtype,unique_values
state,49,0,object,49
male_population,49,0,float64,49
female_population,49,0,float64,49
number_of_veterans,49,0,float64,49
foreign_born,49,0,float64,49
median_age,49,0,float64,39
average_household_size,48,1,float64,41


In addition to the data above, we can also extract the "race distribution" of each state in a similar fashion. This table acts as an additinal dimension table for each state code.

In [27]:
df_demo_race_dwh = pd.DataFrame(df_demo.groupby(['state_code', 'race'])['count'].agg('sum'))\
.join(df_demo.groupby(['state_code'])['count'].agg('sum'), rsuffix='_total')

df_demo_race_dwh['fraction']=df_demo_race_dwh['count']/df_demo_race_dwh['count_total']
df_demo_race_dwh = pd.DataFrame(df_demo_race_dwh['fraction'])

In [28]:
df_demo_race_dwh.head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,fraction
state_code,race,Unnamed: 2_level_1
AK,American Indian and Alaska Native,0.108078
AK,Asian,0.109524
AK,Black or African-American,0.068724
AK,Hispanic or Latino,0.081079
AK,White,0.632595
AL,American Indian and Alaska Native,0.007375
AL,Asian,0.026245
AL,Black or African-American,0.47536
AL,Hispanic or Latino,0.035864
AL,White,0.455155


In [29]:
dataframe_quality_check(df_demo_race_dwh)

The dataframe has a unique index.


Unnamed: 0,count,count_null,dtype,unique_values
fraction,243,0,float64,243


##### State data from sas: not used

This data only contains information which is already available in the state data above:

In [30]:
df_addr.head()

Unnamed: 0,i94addr,state
0,AL,ALABAMA
1,AK,ALASKA
2,AZ,ARIZONA
3,AR,ARKANSAS
4,CA,CALIFORNIA


##### Mode data

The mode data can be taken as-is with the correct index.

In [31]:
df_mode_dwh=df_mode.set_index('i94mode')

In [32]:
df_mode_dwh.head()

Unnamed: 0_level_0,mode
i94mode,Unnamed: 1_level_1
1,AIR
2,SEA
3,LAND
9,NOT REPORTED


##### Visa data

The visa data can be taken as-is with the correct index.

In [33]:
df_visa_dwh = df_visa.set_index('i94visa')

In [34]:
df_visa_dwh.head()

Unnamed: 0_level_0,visa
i94visa,Unnamed: 1_level_1
1,BUSINESS
2,PLEASURE
3,STUDENT


##### Airport data

Previously, we extracted df_port from the sas data file:

In [35]:
df_port.head()

Unnamed: 0,i94port,port,addr
0,ALC,ALCAN,AK
1,ANC,ANCHORAGE,AK
2,BAR,BAKER AAF - BAKER ISLAND,AK
3,DAC,DALTONS CACHE,AK
4,PIZ,DEW STATION PT LAY DEW,AK


We can attempt to combine this information with the available airport information:

In [36]:
df_airport_dwh = pd.merge(left=df_port, right=df_airport, 
                          left_on='i94port', right_on='ident',
                          how='left')

The join is acctually sucesfful in some occasions:

In [37]:
df_airport_dwh[df_airport_dwh['type'].notnull()].head(3)

Unnamed: 0,i94port,port,addr,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
11,5KE,KETCHIKAN,AK,5KE,seaplane_base,Ketchikan Harbor Seaplane Base,,,US,US-AK,Ketchikan,,WFB,5KE,"-131.677002, 55.349899"
13,MOS,MOSES POINT INTERMEDIATE,AK,MOS,small_airport,Moses Point Airport,14.0,,US,US-AK,Elim,MOS,,MOS,"-162.0570068359375, 64.69819641113281"
15,NOM,NOM,AK,NOM,small_airport,Nomad River Airport,305.0,OC,PG,PG-WPD,Nomad River,,NOM,NDR,"142.234166667, -6.294"


In [38]:
dataframe_quality_check(df_airport_dwh)

The dataframe has a unique index.


Unnamed: 0,count,count_null,dtype,unique_values
i94port,660,0,object,660
port,660,0,object,634
addr,583,77,object,112
ident,37,623,object,37
type,37,623,object,4
name,37,623,object,37
elevation_ft,30,630,float64,28
continent,17,643,object,5
iso_country,37,623,object,12
iso_region,37,623,object,22


We can use an additional check: The iso_region field from df_airport should match with the addr field from df_port. Let us check the cases where this is _not_ true.

In [39]:
df_airport_dwh['iso_region_state'] = clean_field(df_airport_dwh, 'iso_region', r'-([^-]+)')
df_airport_dwh_outliers = df_airport_dwh[\
    (df_airport_dwh['ident'].notnull())\
    & (df_airport_dwh['iso_region_state'] != df_airport_dwh['addr'])]

print(len(df_airport_dwh_outliers))
df_airport_dwh_outliers.head(10)

30


Unnamed: 0,i94port,port,addr,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates,iso_region_state
15,NOM,NOM,AK,NOM,small_airport,Nomad River Airport,305.0,OC,PG,WPD,Nomad River,,NOM,NDR,"142.234166667, -6.294",WPD
28,MAP,MARIPOSA AZ,,MAP,small_airport,Mamai Airport,90.0,OC,PG,CPM,Mamai,,MAP,,"149.519166667, -10.290833333299998",CPM
34,SAS,SASABE,AZ,SAS,small_airport,Salton Sea Airport,-84.0,,US,CA,Salton City,KSAS,SAS,SAS,"-115.952003479, 33.2414016724",CA
78,DVD,DOVER-AFB,DE,DVD,small_airport,Andavadoaka Airport,30.0,AF,MG,U,Andavadoaka,,DVD,,"43.259573, -22.06608",U
140,LKC,LAKE CHARLES,LA,LKC,small_airport,Lekana Airport,2634.0,AF,CG,14,Lekana,,LKC,,"14.606, -2.313",14
167,HTM,HOULTON,ME,HTM,small_airport,Khatgal Airport,5500.0,AS,MN,041,Hatgal,ZMHG,HTM,,"100.139532, 50.435988",041
176,SRL,ST AURELIE,ME,SRL,small_airport,Palo Verde Airport,127.0,,MX,BCS,Santa Rosalia,,SRL,CIB,"-112.0985, 27.0927",BCS
192,SAG,SAGINAW,MI,SAG,closed,Sagwon Airport,650.0,,US,AK,Sagwon,,,,"-148.7114, 69.3596",AK
217,WSB,WARROAD INTL,"SPB, MN",WSB,seaplane_base,Steamboat Bay Seaplane Base,,,US,AK,Steamboat Bay,WSB,WSB,WSB,"-133.641998291, 55.5295982361",AK
249,SWE,SWEETGTASS,MT,SWE,small_airport,Siwea Airport,5960.0,OC,PG,MPL,Siwea,AYEW,SWE,SIW,"147.580833, -6.284639",MPL


Many of these are not even in the US, which clearly indicates a false join. This makes it hard to trust the data we generated with the join. We should _at least_ remove these case, even though they will leave very little data to work with. I will leave this as an option question an simple flag the data with the improper join.

In [40]:
df_airport_dwh['false_join'] = df_airport_dwh.index.isin(df_airport_dwh_outliers.index.tolist())

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

The **bold** values link to other fact/dimension tables.

#### df_immigration_dwh

This is the fact table with the immigration data. 

| field | type | description | origin |
| --- | --- | --- | --- |
| cicid | int | unique id | original parquet/sas files
| **i94cit** | int | country code (birth) | original parquet/sas files
| **i94res** | int | country code (residence) | original parquet/sas files
| i94port | string | arrival airport | original parquet/sas files
| arrdate | date | arrival date | original parquet/sas files
| i94mode | int | mode of transportation | original parquet/sas files
| i94addr | string | arrival state code | original parquet/sas files
| depdate | int | departure date | original parquet/sas files
| i94bir | int | age of respondent in years | original parquet/sas files
| **i94visa** | int | visa type | original parquet/sas files
| count | int | summary statistics | original parquet/sas files
| dtadfile | string | character date field | original parquet/sas files
| visapost | string | Department of State where where Visa was issued | original parquet/sas files
| entdepa | string | Arrival Flag | original parquet/sas files
| entdepd | string | Departure Flag | original parquet/sas files
| matflag | string | Match flag | original parquet/sas files
| biryear | int | 4 digit year of birth | original parquet/sas files
| dtaddto | string | character date field | original parquet/sas files
| gender | string | Non-immigrant sex | original parquet/sas files
| airline | string | Airline used to arrive in US | original parquet/sas files
| admnum | int | Admission number | original parquet/sas files
| fltno | string | Flight number | original parquet/sas files
| **visatype** | string | class of admission | original parquet/sas files

#### df_weather_dwh

This is a fact table with weather data by country and date. 

| field | type | description | origin |
| --- | --- | --- | --- |
| index | int | unique id | generated |
| date | date | date of record | weather data |
| average_temperature | numeric |average temperature |weather data |
| average_temperature_uncertainty |  numeric | average temperature uncertainty  | weather data |
| country | string | country | weather data |
| **i94cntyl** | int | cntyl country code | SAS description via join | 

#### df_demo_dwh

This is a dimension table that provides demographic data for the states in the us.

| field | type | description | origin |
| --- | --- | --- | --- |
| **state_code** | string | us state code | demographic data |
| state | string | us state | demographic data |
| male_population | numeric | fraction of males |demographic data |
| female_population | numeric | fraction of females |demographic data |
| number_of_veterans | numeric | fraction of veterans | demographic data |
| foreign_born | numeric | fraction of foreign borns | demographic data |
| median_age | numeric | median age | demographic data |
| average_household_size | numeric | average household size | demographic data |

#### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
 * The data was increased by 100x.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
 * The database needed to be accessed by 100+ people.

In [81]:
df_immigration_dwh.printSchema()

root
 |-- cicid: integer (nullable = true)
 |-- i94cit: integer (nullable = true)
 |-- i94res: integer (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: date (nullable = true)
 |-- i94mode: integer (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: date (nullable = true)
 |-- i94bir: integer (nullable = true)
 |-- i94visa: integer (nullable = true)
 |-- count: integer (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: integer (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: integer (nullable = true)
 |-- fltno: string (nullable = true)
 |-- visatype: string (nullable = true)



In [11]:
#write to parquet
df_spark.write.parquet("sas_data")
