# Data Engineering Capstone Project

#### Project Summary

The goal of this project is to build an ETL pipeline that produces a star schema whose fact table describes immigration events in the US. The dimension tables cover time (immigration date) and several locations (arrival state, arrival airport, country of origin), and are enriched with additional data of interest.

Based on the data aggregated in the star schema, one can run analytical queries to gain insight into, for example:
* temporal distribution of immigration events
* connections between arrival state 
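
As a rough illustration of the first kind of query, the sketch below builds a tiny in-memory star schema with Python's built-in `sqlite3` module and counts immigration events per month. The table and column names (`fact_immigration`, `dim_time`, `date_id`) and the three sample rows are hypothetical; they only stand in for the final model.

```python
import sqlite3

# Hypothetical miniature star schema: one fact table, one time dimension.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_time (
        date_id INTEGER PRIMARY KEY,
        year    INTEGER NOT NULL,
        month   INTEGER NOT NULL
    );
    CREATE TABLE fact_immigration (
        cicid   INTEGER PRIMARY KEY,
        date_id INTEGER NOT NULL REFERENCES dim_time (date_id)
    );
    INSERT INTO dim_time VALUES (1, 2016, 3), (2, 2016, 4);
    INSERT INTO fact_immigration VALUES (6, 2), (7, 2), (8, 1);
    """
)

# Temporal distribution: number of immigration events per month.
events_per_month = conn.execute(
    """
    SELECT t.year, t.month, COUNT(*) AS n_events
    FROM fact_immigration AS f
    JOIN dim_time AS t ON t.date_id = f.date_id
    GROUP BY t.year, t.month
    ORDER BY t.year, t.month
    """
).fetchall()
conn.close()

print(events_per_month)
```

The same pattern (join the fact table to one dimension, then group by a dimension attribute) would apply to the location dimensions.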

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

In [1]:
# Do all imports and installs here
import pandas as pd
import numpy as np

from pathlib import Path
from datetime import datetime
from typing import Sequence
from pyspark.sql import SparkSession
from pyspark.sql.functions import isnan, when, count, col, to_date

### Helper Functions

In [2]:
def get_nan_percent(df: pd.DataFrame, cols: Sequence[str]) -> pd.DataFrame:
    """Return the percentage of missing values for each column in `cols`."""
    # isna().mean() gives the fraction of missing entries per column
    na_percent = df[list(cols)].isna().mean() * 100.0

    df_out = pd.DataFrame({"Column": list(na_percent.index), "Percent_NaN": list(na_percent.values)})
    return df_out

___

### Step 1: Scope the Project and Gather Data

#### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use?

#### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

---

#### 1. Weather Data

**Description**: contains average temperatures per city over time, together with each city's country and geographic coordinates.

Table structure:

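| Column                        | Type     | Comment                                        |
| ----------------------------- | -------- | ---------------------------------------------- |
| dt                            | datetime | record time (read as string, converted later)  |
| AverageTemperature            | float    | degrees Celsius                                |
| AverageTemperatureUncertainty | float    | uncertainty of the average temperature         |
| City                          | string   |                                                |
| Country                       | string   |                                                |
| Latitude                      | string   | e.g. `57.05N`                                  |
| Longitude                     | string   | e.g. `10.33E`                                  |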
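
Latitude and longitude are stored as strings with a hemisphere suffix (for example `57.05N`, `10.33E`), so they cannot be used numerically as they are. The following is a minimal sketch of one possible conversion to signed decimal degrees; the function name `parse_coordinate` and the sign convention (south and west negative) are assumptions, not part of the original notebook.

```python
def parse_coordinate(value: str) -> float:
    """Convert a string such as '57.05N' or '10.33E' to signed decimal degrees."""
    value = value.strip()
    magnitude, hemisphere = float(value[:-1]), value[-1].upper()
    if hemisphere not in "NSEW":
        raise ValueError(f"unexpected hemisphere suffix in {value!r}")
    # Assumed convention: south and west are negative.
    return -magnitude if hemisphere in "SW" else magnitude


print(parse_coordinate("57.05N"), parse_coordinate("10.33E"))
```

With pandas, such a function could be applied column-wise, e.g. via `Series.map`.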


In [3]:
weather_data = Path("../../data2/GlobalLandTemperaturesByCity.csv")
df_wtr = pd.read_csv(weather_data)

---

#### 2. Airport Data

**Description**: contains identification characteristics and geographical information about airports.
    
Table structure:

| Column               | Type     | Comment                                   |
| -------------------- | -------- | ----------------------------------------- |
| ident                | string   | unique identifier                         |
| type                 | string   |                                           |
| name                 | string   |                                           |
| elevation_ft         | float    |                                           |
| continent            | string   |                                           |
| iso_country          | string   |                                           |
| iso_region           | string   |                                           |
| municipality         | string   |                                           |
| gps_code             | string   |                                           |
| iata_code            | string   |                                           |
| local_code           | string   |                                           |
| coordinates          | string   | two comma-separated floats                |
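
The `coordinates` field is read as a single string holding two comma-separated numbers. Judging from the sample rows (e.g. `-74.93..., 40.07...` for an airport in Pennsylvania), the order appears to be longitude first, then latitude; that ordering is an inference, not something the data set documents here. A small sketch of splitting the field, with the hypothetical names `Coordinates` and `parse_airport_coordinates`:

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    longitude: float
    latitude: float


def parse_airport_coordinates(raw: str) -> Coordinates:
    """Split a 'longitude, latitude' string into two floats."""
    # Assumes exactly one comma; float() tolerates surrounding whitespace.
    lon_text, lat_text = raw.split(",")
    return Coordinates(float(lon_text), float(lat_text))


print(parse_airport_coordinates("-101.473911, 38.704022"))
```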

In [4]:
airport_data = Path("airport-codes_csv.csv")
df_apt = pd.read_csv(airport_data)

---

#### 3. Immigration Data

**Description**: contains information about individual immigration processes. Each row is the record of one individual's arrival in the US. All data in the set is from the year 2016 and is partitioned by month.
    
Table structure:

| Column               | Type     | Comment      |
| -------------------- | -------- | ------------ |
| cicid                | float   |      record ID        |
| i94yr                | float   |      year        |
| i94mon               | float    |         month     |
| i94cit               | float   |      birth country ID        |
| i94res               | float   |      residence country ID        |
| i94port              | string   |     arrival port in US        |
| arrdate              | float   |      arrival date in US      |
| i94mode              | float   |      transportation mode (air: 1, sea: 2, land: 3, else: 9)        |
| i94addr              | string   |     arrival state         |
| depdate              | float   |      departure date        |
| i94bir               | float  | age       |
| i94visa              | float   |   visa code           |
| count                | float   |   auxiliary field           |
| dtadfile             | string    |  auxiliary date field            |
| visapost             | string   |   state of visa grant           |
| occup                | string   |   occupation in US           |
| entdepa              | string   |   arrival code           |
| entdepd              | string   |   departure code           |
| entdepu              | string   |   update code           |
| matflag              | string   |   matching code            |
| biryear              | float   |    birth year          |
| dtaddto              | string  | residence allowance date        |
| gender               | string   |  gender            |
| insnum               | string    |  INS number            |
| airline              | string   |  airline for arrival            |
| admnum               | float   |   admission number           |
| fltno                | string   |  flight number            |
| visatype             | string   |  type of visa            |
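
`arrdate` and `depdate` are stored as floats rather than dates. Assuming they are SAS numeric dates (days counted from 1960-01-01), which this notebook does not state explicitly but which is consistent with the sample value `20573.0` falling in April 2016, they could be converted as sketched below. The helper name `sas_days_to_date` is hypothetical.

```python
from datetime import date, timedelta
from typing import Optional

SAS_EPOCH = date(1960, 1, 1)  # assumed reference day for SAS numeric dates


def sas_days_to_date(days: Optional[float]) -> Optional[date]:
    """Convert a SAS numeric date (days since 1960-01-01) to a datetime.date."""
    if days is None or days != days:  # None or NaN (NaN is not equal to itself)
        return None
    return SAS_EPOCH + timedelta(days=int(days))


print(sas_days_to_date(20573.0))  # 2016-04-29
```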

In [4]:
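# NOTE: the "-s_2.11" suffix marks a package build for Scala 2.11; it has to match
# the Scala version of the installed Spark distribution, otherwise load() can fail.
spark = (
    SparkSession.builder
    .config("spark.jars.repositories", "https://repos.spark-packages.org/")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .enableHiveSupport()
    .getOrCreate()
)

# Original data location: '../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat'
dfsp_imgn_apr = spark.read.format("com.github.saurfang.sas.spark").load("i94_jan16_sub.sas7bdat")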

Py4JJavaError: An error occurred while calling o40.load.
: java.lang.NoClassDefFoundError: scala/Product$class
	at com.github.saurfang.sas.spark.SasRelation.<init>(SasRelation.scala:48)
	at com.github.saurfang.sas.spark.SasRelation$.apply(SasRelation.scala:42)
	at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:50)
	at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:39)
	at com.github.saurfang.sas.spark.DefaultSource.createRelation(DefaultSource.scala:27)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
	at java.base/java.lang.reflect.Method.invoke(Method.java:577)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
	... 21 more
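
A `NoClassDefFoundError` for `scala/Product$class` usually indicates a Scala binary mismatch: the package was compiled for one Scala version (the coordinate used here ends in `s_2.11`) while Spark runs on another. The sketch below is a hypothetical pre-flight check that compares the suffix of a package coordinate with a Scala version string. The helper names and the version strings passed in are illustrative only; how the running Scala version is obtained is left open.

```python
import re


def scala_suffix(coordinate: str) -> str:
    """Return the Scala binary version encoded at the end of a package coordinate."""
    match = re.search(r"s_(\d+\.\d+)$", coordinate)
    if match is None:
        raise ValueError(f"no Scala suffix found in {coordinate!r}")
    return match.group(1)


def is_compatible(coordinate: str, scala_version: str) -> bool:
    """Compare the package suffix with the major.minor part of a Scala version."""
    return scala_version.split(".")[:2] == scala_suffix(coordinate).split(".")


package = "saurfang:spark-sas7bdat:2.0.0-s_2.11"
print(is_compatible(package, "2.11.12"))  # True
print(is_compatible(package, "2.12.15"))  # False
```

If the check fails, a build of the package whose suffix matches the running Scala version would be needed.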


In [6]:
# write to parquet
# df_spark.write.parquet("sas_data")
# df_spark=spark.read.parquet("sas_data")

---

#### 4. City demographic data

**Description**: contains demographic data about US cities, where each city/state entry is split into one row per race.
    
Table structure:

| Column                 | Type   | Comment                                         |
| ---------------------- | ------ | ----------------------------------------------- |
| City                   | string | city name                                       |
| State                  | string | state name                                      |
| Median Age             | float  | median resident age                             |
| Male Population        | float  | number of male residents                        |
| Female Population      | float  | number of female residents                      |
| Total Population       | int    | total number of residents                       |
| Number of Veterans     | float  | number of veterans                              |
| Foreign-born           | float  | number of foreign-born residents                |
| Average Household Size | float  | average number of people per household          |
| State Code             | string | 2-character US state code                       |
| Race                   | string | race that the `Count` column refers to          |
| Count                  | int    | number of residents of the race given in `Race` |

In [7]:
city_data = Path("us-cities-demographics.csv")

df_cities = pd.read_csv(city_data, sep=";")

---

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc.

##### 1. Weather Data

In [8]:
df_wtr.head()

Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E


In [9]:
df_wtr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8599212 entries, 0 to 8599211
Data columns (total 7 columns):
dt                               object
AverageTemperature               float64
AverageTemperatureUncertainty    float64
City                             object
Country                          object
Latitude                         object
Longitude                        object
dtypes: float64(2), object(5)
memory usage: 459.2+ MB


In [10]:
# Check for unique city/country fields just out of interest
len(df_wtr["City"].unique()), len(df_wtr["Country"].unique())

(3448, 159)

In [41]:
# Check for duplicate rows (with respect to relevant tuples) - there are many duplicates!
len(df_wtr[df_wtr.duplicated(subset=["dt", "City", "Country"])])

46034

In [11]:
# Check for NaNs/NULLs - only the two temperature columns have gaps (about 4.2%). Rows without a temperature are useless, though.
print(get_nan_percent(df_wtr, list(df_wtr.keys())))

                          Column  Percent_NaN
0                             dt     0.000000
1             AverageTemperature     4.234458
2  AverageTemperatureUncertainty     4.234458
3                           City     0.000000
4                        Country     0.000000
5                       Latitude     0.000000
6                      Longitude     0.000000


---

##### 2. Airport Data

In [12]:
df_apt.head()

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"


In [13]:
df_apt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55075 entries, 0 to 55074
Data columns (total 12 columns):
ident           55075 non-null object
type            55075 non-null object
name            55075 non-null object
elevation_ft    48069 non-null float64
continent       27356 non-null object
iso_country     54828 non-null object
iso_region      55075 non-null object
municipality    49399 non-null object
gps_code        41030 non-null object
iata_code       9189 non-null object
local_code      28686 non-null object
coordinates     55075 non-null object
dtypes: float64(1), object(11)
memory usage: 5.0+ MB


In [14]:
# Check for duplicates - ident is a unique identifier for each row
len(df_apt["ident"].unique()), len(df_apt["municipality"].unique())

(55075, 27134)

In [15]:
# Check for NaNs/NULLs - continent, iata_code, local code have over 40% missing values, rest is quite OK!
print(get_nan_percent(df_apt, list(df_apt.keys())))

          Column  Percent_NaN
0          ident     0.000000
1           type     0.000000
2           name     0.000000
3   elevation_ft    12.720835
4      continent    50.329551
5    iso_country     0.448479
6     iso_region     0.000000
7   municipality    10.305946
8       gps_code    25.501589
9      iata_code    83.315479
10    local_code    47.914662
11   coordinates     0.000000


---

##### 3. Immigration Data

In [16]:
col_parts = np.array_split(dfsp_imgn_apr.columns, 3)

In [30]:
dfsp_imgn_apr.show(5)

+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|cicid| i94yr|i94mon|i94cit|i94res|i94port|arrdate|i94mode|i94addr|depdate|i94bir|i94visa|count|dtadfile|visapost|occup|entdepa|entdepd|entdepu|matflag|biryear| dtaddto|gender|insnum|airline|        admnum|fltno|visatype|
+-----+------+------+------+------+-------+-------+-------+-------+-------+------+-------+-----+--------+--------+-----+-------+-------+-------+-------+-------+--------+------+------+-------+--------------+-----+--------+
|  6.0|2016.0|   4.0| 692.0| 692.0|    XXX|20573.0|   null|   null|   null|  37.0|    2.0|  1.0|    null|    null| null|      T|   null|      U|   null| 1979.0|10282016|  null|  null|   null| 1.897628485E9| null|      B2|
|  7.0|2016.0|   4.0| 254.0| 276.0|    ATL|20551.0|    1.0|     AL|   null|  25.0|    3.0|  1.0|20130811|     SE

In [8]:
dfsp_imgn_apr.printSchema()

root
 |-- cicid: double (nullable = true)
 |-- i94yr: double (nullable = true)
 |-- i94mon: double (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: double (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullable = 

In [29]:
dfsp_imgn_apr.select([col(c) for c in col_parts[0]]).summary().show()

+-------+-----------------+--------------------+-------+------------------+------------------+-------+-----------------+------------------+------------------+------------------+
|summary|            cicid|               i94yr| i94mon|            i94cit|            i94res|i94port|          arrdate|           i94mode|           i94addr|           depdate|
+-------+-----------------+--------------------+-------+------------------+------------------+-------+-----------------+------------------+------------------+------------------+
|  count|          3096313|             3096313|3096313|           3096313|           3096313|3096313|          3096313|           3096074|           2943721|           2953856|
|   mean|3078651.879075533|              2016.0|    4.0| 304.9069344733559|303.28381949757664|   null|20559.84854179794|1.0736897761487614|51.652482269503544| 20573.95283554784|
| stddev|1763278.099749858|4.282829613261096...|    0.0|210.02688853063327| 208.5832129278886|   null|8.777339

In [30]:
dfsp_imgn_apr.select([col(c) for c in col_parts[1]]).summary().show()

+-------+------------------+-------------------+-------+--------------------+--------+-----------------+-------+-------+-------+
|summary|            i94bir|            i94visa|  count|            dtadfile|visapost|            occup|entdepa|entdepd|entdepu|
+-------+------------------+-------------------+-------+--------------------+--------+-----------------+-------+-------+-------+
|  count|           3095511|            3096313|3096313|             3096312| 1215063|             8126|3096075|2957884|    392|
|   mean|41.767614458485205| 1.8453925685161674|    1.0|2.0160424766168267E7|   999.0|          885.675|   null|   null|   null|
| stddev| 17.42026053458826|0.39839102005409577|    0.0|   50.01513449489737|     0.0|264.6551105950961|   null|   null|   null|
|    min|              -3.0|                1.0|    1.0|            20130811|     999|              049|      A|      D|      U|
|    25%|              30.0|                2.0|    1.0|         2.0160409E7|   999.0|           

In [31]:
dfsp_imgn_apr.select([col(c) for c in col_parts[2]]).summary().show()

+-------+-------+------------------+------------------+-------+-----------------+------------------+--------------------+------------------+--------+
|summary|matflag|           biryear|           dtaddto| gender|           insnum|           airline|              admnum|             fltno|visatype|
+-------+-------+------------------+------------------+-------+-----------------+------------------+--------------------+------------------+--------+
|  count|2957884|           3095511|           3095836|2682044|           113708|           3012686|             3096313|           3076764| 3096313|
|   mean|   null|1974.2323855415148| 8291120.333841449|   null|4131.050016327899|59.477601493233784|7.082885011090295E10|1360.2463696420555|    null|
| stddev|   null|17.420260534588262|1656502.4244925014|   null|8821.743471773656|172.63339952061747|2.215441594755763...| 5852.676345633782|    null|
|    min|      M|            1902.0|          /   183D|      F|                0|               *FF|

In [7]:
n_rows = dfsp_imgn_apr.count()
print(n_rows)

3096313


In [6]:
# Check for duplicate identifiers - cicid is a unique row identifier!
dfsp_imgn_apr.select("cicid").distinct().count()

(299, 3096313)

In [25]:
# Check for NaNs/NULLs - quite OK in this selection!
dfsp_imgn_apr.select([(count(when(isnan(c) | col(c).isNull(), c)) / n_rows * 100).alias(c) for c in col_parts[0]]).show()

+-----+-----+------+------+------+-------+-------+--------------------+-----------------+---------------+
|cicid|i94yr|i94mon|i94cit|i94res|i94port|arrdate|             i94mode|          i94addr|        depdate|
+-----+-----+------+------+------+-------+-------+--------------------+-----------------+---------------+
|  0.0|  0.0|   0.0|   0.0|   0.0|    0.0|    0.0|0.007718857880324114|4.928183940060324|4.6008591508675|
+-----+-----+------+------+------+-------+-------+--------------------+-----------------+---------------+



In [26]:
# Check for NaNs/NULLs - visapost, occup, entdepu have very high amounts of invalid entries!
dfsp_imgn_apr.select([(count(when(isnan(c) | col(c).isNull(), c)) / n_rows * 100).alias(c) for c in col_parts[1]]).show()

+--------------------+-------+-----+--------------------+-----------------+-----------------+--------------------+-----------------+-----------------+
|              i94bir|i94visa|count|            dtadfile|         visapost|            occup|             entdepa|          entdepd|          entdepu|
+--------------------+-------+-----+--------------------+-----------------+-----------------+--------------------+-----------------+-----------------+
|0.025901774142342845|    0.0|  0.0|3.229647648671177E-5|60.75774639062653|99.73755883206898|0.007686561403837...|4.470768943579024|99.98733978121722|
+--------------------+-------+-----+--------------------+-----------------+-----------------+--------------------+-----------------+-----------------+



In [27]:
# Check for NaNs/NULLs - insnum has very high amounts of invalid entries!
dfsp_imgn_apr.select([(count(when(isnan(c) | col(c).isNull(), c)) / n_rows * 100).alias(c) for c in col_parts[2]]).show()

+-----------------+--------------------+--------------------+----------------+-----------------+-----------------+------+------------------+--------+
|          matflag|             biryear|             dtaddto|          gender|           insnum|          airline|admnum|             fltno|visatype|
+-----------------+--------------------+--------------------+----------------+-----------------+-----------------+------+------------------+--------+
|4.470768943579024|0.025901774142342845|0.015405419284161517|13.3794290176736|96.32763225164898|2.700857439154246|   0.0|0.6313638188387285|     0.0|
+-----------------+--------------------+--------------------+----------------+-----------------+-----------------+------+------------------+--------+



---

##### 4. City Demographic Data

In [65]:
df_cities.head()

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402


In [39]:
df_cities.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2891 entries, 0 to 2890
Data columns (total 12 columns):
City                      2891 non-null object
State                     2891 non-null object
Median Age                2891 non-null float64
Male Population           2888 non-null float64
Female Population         2888 non-null float64
Total Population          2891 non-null int64
Number of Veterans        2878 non-null float64
Foreign-born              2878 non-null float64
Average Household Size    2875 non-null float64
State Code                2891 non-null object
Race                      2891 non-null object
Count                     2891 non-null int64
dtypes: float64(6), int64(2), object(4)
memory usage: 271.1+ KB


In [72]:
# Check for relevant partitions - city+state+race correspond to unique rows
len(df_cities[["City", "State", "Race"]].drop_duplicates())

2891

In [75]:
df_cities["Race"].unique(), len(df_cities["Count"].unique())

(array(['Hispanic or Latino', 'White', 'Asian', 'Black or African-American',
        'American Indian and Alaska Native'], dtype=object), 2785)

In [76]:
df_cities["Median Age"].max(), df_cities["Median Age"].min()

(70.5, 22.899999999999999)

In [77]:
df_cities[df_cities["Median Age"] == df_cities["Median Age"].max()]

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
333,The Villages,Florida,70.5,,,72590,15231.0,4034.0,,FL,Hispanic or Latino,1066
449,The Villages,Florida,70.5,,,72590,15231.0,4034.0,,FL,Black or African-American,331
1437,The Villages,Florida,70.5,,,72590,15231.0,4034.0,,FL,White,72211


In [40]:
# Check for NaNs/NULLs - everything is quite fine!
print(get_nan_percent(df_cities, list(df_cities.keys())))

                    Column  Percent_NaN
0                     City     0.000000
1                    State     0.000000
2               Median Age     0.000000
3          Male Population     0.103770
4        Female Population     0.103770
5         Total Population     0.000000
6       Number of Veterans     0.449671
7             Foreign-born     0.449671
8   Average Household Size     0.553442
9               State Code     0.000000
10                    Race     0.000000
11                   Count     0.000000


---

#### Cleaning Steps
Document steps necessary to clean the data

##### 1. Weather Data

Check and fix NaN values

In [44]:
# Drop rows with invalid temperature fields
df_wtr = df_wtr.dropna(subset=["AverageTemperature"])

In [45]:
len(df_wtr)

8235082

Check and possibly fix other invalid entries

In [34]:
# Check for invalid temps - none found!
df_wtr["AverageTemperature"].min(), df_wtr["AverageTemperature"].max()

(-42.703999999999994, 39.650999999999996)

In [35]:
# Check for empty city names - none found!
len(df_wtr[df_wtr["City"].str.strip() == ""])

0

In [19]:
# Convert 'dt' column to datetime format
df_wtr["dt"] = pd.to_datetime(df_wtr["dt"])

In [20]:
# Check for unexpected dates within 'dt' column - none found!
len(df_wtr[(df_wtr["dt"] < pd.Timestamp(1700, 1, 1)) | (df_wtr["dt"] > pd.Timestamp(2016, 1, 1))])

0

In [47]:
# Remove duplicate rows
df_wtr = df_wtr.drop_duplicates(subset=["dt", "City", "Country"], keep="first")

---

##### 2. Airport Data

---

##### 3. Immigration Data

In [35]:
# Cast some columns to more appropriate types
for c, t in {"cicid": "int", "i94yr": "int", "i94mon": "int", "biryear": "int"}.items():
    dfsp_imgn_apr = dfsp_imgn_apr.withColumn(c, col(c).cast(t))

dfsp_imgn_apr.printSchema()

root
 |-- cicid: integer (nullable = true)
 |-- i94yr: integer (nullable = true)
 |-- i94mon: integer (nullable = true)
 |-- i94cit: double (nullable = true)
 |-- i94res: double (nullable = true)
 |-- i94port: string (nullable = true)
 |-- arrdate: double (nullable = true)
 |-- i94mode: double (nullable = true)
 |-- i94addr: string (nullable = true)
 |-- depdate: double (nullable = true)
 |-- i94bir: double (nullable = true)
 |-- i94visa: double (nullable = true)
 |-- count: double (nullable = true)
 |-- dtadfile: string (nullable = true)
 |-- visapost: string (nullable = true)
 |-- occup: string (nullable = true)
 |-- entdepa: string (nullable = true)
 |-- entdepd: string (nullable = true)
 |-- entdepu: string (nullable = true)
 |-- matflag: string (nullable = true)
 |-- biryear: integer (nullable = true)
 |-- dtaddto: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- insnum: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- admnum: double (nullabl

In [38]:
# Drop columns with high amounts of invalid data
invalid_cols = ["occup", "entdepu", "insnum"]

dfsp_imgn_apr = dfsp_imgn_apr.drop(*invalid_cols)
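
The list of dropped columns can also be derived from the null percentages instead of being typed by hand. The sketch below uses a few percentages copied from the outputs in Step 2 (rounded to two decimals) and a cut-off of 90%. The cut-off is an assumption chosen so that the result matches the three columns dropped above; the notebook itself does not state a threshold.

```python
# Null percentages copied from the Step 2 outputs, rounded to two decimals.
null_percent = {
    "visapost": 60.76,
    "occup": 99.74,
    "entdepd": 4.47,
    "entdepu": 99.99,
    "gender": 13.38,
    "insnum": 96.33,
}

THRESHOLD = 90.0  # assumed cut-off in percent, not stated in the notebook

# Keep the names of all columns whose share of missing values exceeds the cut-off.
invalid_cols = [name for name, pct in null_percent.items() if pct > THRESHOLD]
print(invalid_cols)  # ['occup', 'entdepu', 'insnum']
```

Note that `visapost` (about 61% missing) stays below this cut-off and is therefore kept, as in the cell above.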

---

##### 4. City Demographic Data

---

### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model.

In [None]:
# Write code here

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
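
Two of the checks listed above, a unique-key constraint and a source/count check, can be sketched with Python's built-in `sqlite3` module. The table `fact_immigration`, its columns, and the sample rows are hypothetical placeholders for the final data model, and the helper names are made up for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical fact table; the primary key enforces unique record IDs.
conn.execute("CREATE TABLE fact_immigration (cicid INTEGER PRIMARY KEY, i94addr TEXT)")

source_rows = [(6, None), (7, "AL"), (15, "MI")]  # made-up sample rows
conn.executemany("INSERT INTO fact_immigration VALUES (?, ?)", source_rows)
conn.commit()


def check_row_count(connection: sqlite3.Connection, table: str, expected: int) -> bool:
    """Source/count check: the table must hold exactly `expected` rows."""
    # The table name is interpolated, so only pass trusted identifiers.
    (actual,) = connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return actual == expected


def rejects_duplicate_key(connection: sqlite3.Connection) -> bool:
    """Integrity check: inserting an already existing cicid must fail."""
    try:
        connection.execute("INSERT INTO fact_immigration VALUES (6, 'NY')")
    except sqlite3.IntegrityError:
        return True
    connection.rollback()  # undo the insert if it unexpectedly succeeded
    return False


count_ok = check_row_count(conn, "fact_immigration", len(source_rows))
unique_ok = rejects_duplicate_key(conn)
print(count_ok, unique_ok)
```

In the real pipeline, the expected count would come from the cleaned source data rather than from a hard-coded list.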
 
Run Quality Checks

In [None]:
# Perform quality checks here

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
* Clearly state the rationale for the choice of tools and technologies for the project.
* Propose how often the data should be updated and why.
* Write a description of how you would approach the problem differently under the following scenarios:
  * The data was increased by 100x.
  * The data populates a dashboard that must be updated on a daily basis by 7am every day.
  * The database needed to be accessed by 100+ people.