# Travely Data Lake
## Data Engineering Capstone Project

### Project Summary
This project implements a Data Lake on S3 in parquet and json format, using mainly pandas, PySpark, and the AWS CLI. The data used includes datasets on US immigration, US demographics, worldwide daily temperatures, and airports.

The project follows these steps:
* Step 1: Scope the Project and Gather Data
* Step 2: Explore and Assess the Data
* Step 3: Define the Data Model
* Step 4: Run ETL to Model the Data
* Step 5: Complete Project Write Up

## Prerequisites

Before running the Jupyter notebook or the Python script, please
- make sure the Python packages required are installed - all imports are in the first cell below.
- make sure you have the AWS CLI installed. If you are on a Linux machine (like the VM provided in the Udacity workspace), you can use the `aws_cli_linux_install.sh` script (by running the command `bash aws_cli_linux_install.sh`). You can test whether your AWS CLI works by running `aws --version` in a terminal.
- insert your AWS credentials, i.e. access key and secret key, in the `aws.cfg` file for the corresponding variables (without quotes around them). The IAM user you use needs to have full S3 permissions.
- create an S3 bucket you would like to write the parquet and json files to, and insert its name as the `DEST_BUCKET` variable in the `aws.cfg` file. Example name: `s3://travely-data-lake`
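As a rough illustration of the file layout these steps describe, the sketch below parses a sample configuration from a string and checks that the expected entries are present. The section and key names mirror those read in the configuration cell of this notebook; the placeholder values and the `missing_keys` helper are hypothetical and not part of the project.

```python
import configparser

# Placeholder values only; real credentials belong in aws.cfg, never in code.
SAMPLE_CFG = """
[AWS]
AWS_ACCESS_KEY_ID = YOUR_ACCESS_KEY
AWS_SECRET_ACCESS_KEY = YOUR_SECRET_KEY

[S3]
DEST_BUCKET = s3://travely-data-lake
IMM_KEY = immigration/
DESC_KEY = desc/
TEMPERATURE_KEY = temperature/
DEMOGRAPHICS_KEY = demographics/
AIRPORT_KEY = airport_codes/
"""

# Sections and keys that the configuration cell of the notebook reads.
REQUIRED = {
    "AWS": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "S3": ["DEST_BUCKET", "IMM_KEY", "DESC_KEY",
           "TEMPERATURE_KEY", "DEMOGRAPHICS_KEY", "AIRPORT_KEY"],
}


def missing_keys(config):
    """Return a list of 'SECTION.KEY' entries that are absent or empty."""
    missing = []
    for section, keys in REQUIRED.items():
        for key in keys:
            # fallback="" also covers the case of a missing section
            if not config.get(section, key, fallback="").strip():
                missing.append(f"{section}.{key}")
    return missing


config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)
print(missing_keys(config))
```

Running a check like this before the ETL starts would surface an incomplete `aws.cfg` early instead of failing midway through an upload.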

## Preparations: Imports, installs, configuration, path variables

In [1]:
# imports and installs here
!pip install sh

import os, time, json, sh, configparser
from sh import aws
import numpy as np
import pandas as pd
from datetime import datetime
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T

Collecting sh
  Downloading https://files.pythonhosted.org/packages/fa/9c/796934ee6d990d504c600056aa435e31bd49dbfba37e81d2045d37c8bdaf/sh-1.13.1-py2.py3-none-any.whl (40kB)
Installing collected packages: sh
Successfully installed sh-1.13.1


In [2]:
# read configuration file with AWS credentials
config = configparser.ConfigParser()
config.read('aws.cfg')

# make AWS credentials accessible to the AWS CLI
os.environ['AWS_ACCESS_KEY_ID']=config['AWS']['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY']=config['AWS']['AWS_SECRET_ACCESS_KEY']

# make bucket and folder names accessible to the AWS CLI in the shell
# not strictly necessary for this script, but helpful if you want to 
#    execute some commands directly in the command line
os.environ['DEST_BUCKET']=config['S3']['DEST_BUCKET']
os.environ['IMM_KEY']=config['S3']['IMM_KEY']
os.environ['DESC_KEY']=config['S3']['DESC_KEY']
os.environ['TEMPERATURE_KEY']=config['S3']['TEMPERATURE_KEY']
os.environ['DEMOGRAPHICS_KEY']=config['S3']['DEMOGRAPHICS_KEY']
os.environ['AIRPORT_KEY']=config['S3']['AIRPORT_KEY']

In [3]:
# define the paths where the raw data is
imm_folder_loc = "../../data/18-83510-I94-Data-2016"
airport_file_loc = "./airport-codes_csv.csv"
demographics_file_loc = "./us-cities-demographics.csv"
temperature_file_loc = "../../data2/GlobalLandTemperaturesByCity.csv"

# define the destination paths where the data should go, e.g. on S3
# the destination S3 bucket
bucket = "s3://udacity-dend-capstone"
# names of folders where data is written to
imm_key = "immigration/" 
airport_key = "airport_codes/"
demographics_key = "demographics/"
temperature_key = "temperature/"
desc_key = "desc/"
# base folder for local data writing
local = "."

## Step 1: Scope the Project and Gather Data

### Scope 
Explain what you plan to do in the project in more detail. What data do you use? What does your end solution look like? What tools did you use? etc.

### Describe and Gather Data 
Describe the data sets you're using. Where did it come from? What type of information is included? 

### Scope of the project
Travely, a US-based tour provider, would like to analyze data on people traveling to the US by plane. They would like to improve their offerings of day tours and longer guided trips to meet their potential customers' needs. Additionally, they would like to know more about where people arrive and which months see the most arrivals, so they can target their advertising accordingly.

For this, they would like to have a data lake including data on people flying into the US, the length of their stay, the airports and their respective cities, and the weather on days with many arrivals. They want it to be accessible mainly to their data science team, who are all well-versed in Spark SQL, and it should be as inexpensive as possible for the time being.

For this project, I will use PySpark to process the data and AWS S3 to store it in parquet format.

### Describe and Gather Data
I am using the datasets provided by Udacity. These include:

**US immigration data**: A dataset that includes flight passenger data collected at immigration, such as the airport, arrival and departure, birthyear, gender, airline, etc.

**Airport codes**: A dataset about airports, including their international and local codes, country, municipality, coordinates etc.

**City temperature data**: A dataset about temperature in global cities, including data from the 18th to the 21st century, such as city, country, coordinates, temperature and temperature uncertainty.

**US cities demographic**: A dataset about US cities, including the state, total population, and other factors such as average household size.

### Step 2: Explore and Assess the Data
#### Explore the Data 
Identify data quality issues, like missing values, duplicate data, etc. - done separately for datasets, in the ETL process. See "Explanation" part below.

#### Cleaning Steps
Document steps necessary to clean the data - done separately for datasets, in the ETL process. See "Explanation" part below.



### Step 3: Define the Data Model
#### 3.1 Conceptual Data Model
Map out the conceptual data model and explain why you chose that model

#### 3.2 Mapping Out Data Pipelines
List the steps necessary to pipeline the data into the chosen data model

For each of the datasets, the pipeline follows one of two paths:

**SAS-format data (immigration dataset)**
* read data into Spark DataFrame
* explore data, identify data quality issues
* clean data, select relevant columns, rename columns to match snakecase_convention and to ease JOINing tables together
* write data to parquet files locally
* copy data to S3 using the AWS CLI

**CSV-format data (demographics, airport, temperature datasets)**
* read data into pandas DataFrame
* explore data, identify data quality issues
* clean data, select relevant columns, rename columns to match snakecase_convention and to ease JOINing tables together
* read the prepared data from the pandas DataFrame into a Spark DataFrame
* write data to parquet files locally
* copy data to S3 using the AWS CLI
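The renaming step in both paths can be sketched with a small helper. This is a generic illustration rather than the notebook's approach: the notebook renames columns by hand (and shortens some of them, e.g. to `avg_temperature`), while the hypothetical `to_snake_case` function below only shows the mechanical part of the convention. The sample column names are illustrative.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase or space-separated column name to snake_case."""
    # insert an underscore between a lowercase letter or digit and an uppercase letter
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # replace runs of whitespace and hyphens with a single underscore
    name = re.sub(r"[\s\-]+", "_", name)
    return name.lower()


columns = ["AverageTemperature", "City", "Median Age", "iata_code"]
print([to_snake_case(c) for c in columns])
```

Consistent names across tables make the later JOINs easier to write, which is the main reason for this step.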

**Why did I choose these steps?**
see Step 5 - rationale

### Step 4: Run Pipelines to Model the Data 
#### 4.1 Create the data model
Build the data pipelines to create the data model - done separately for datasets, in the ETL process. See "Explanation" part below.

#### 4.2 Data Quality Checks
Explain the data quality checks you'll perform to ensure the pipeline ran as expected. These could include:
 * Integrity constraints on the relational database (e.g., unique key, data type, etc.)
 * Unit tests for the scripts to ensure they are doing the right thing
 * Source/Count checks to ensure completeness
 
Run Quality Checks - done separately for datasets, in the ETL process. See "Explanation" part below.

#### 4.3 Data dictionary 
Create a data dictionary for your data model. For each field, provide a brief description of what the data is and where it came from. You can include the data dictionary in the notebook or in a separate file.

### Step 5: Complete Project Write Up
#### Clearly state the rationale for the choice of tools and technologies for the project.

* **pandas DataFrames** offer very fast data processing in tables, especially with csv as data source, as well as integrated table-displaying abilities.
* **PySpark** offers great Big Data processing abilities, a SQL API and a great range of options to adapt to many data formats. At scale, it also provides amazing distributed computing capabilities, which makes this solution easily scalable.
* **AWS S3** is a very inexpensive Cloud storage service with high-availability and no big up-front costs due to its pay-only-what-you-use model. It is well integrated with other AWS services, such as EMR, which could be a good option for scaling the project up (as discussed below in the part about the three scenarios).
* **Apache Parquet file format** is part of the Hadoop ecosystem, a very important Big Data ecosystem that Spark is part of as well. Parquet is a columnar storage format with several advantages over row-based formats like CSV: queries read only the columns they need, the files support compression, and they are smaller as a result. This often leads to faster queries and computations, less storage used, and thus lower costs. This is ideal for data scientists using Spark to query the datasets. (Source and more details: https://databricks.com/glossary/what-is-parquet)
* **writing locally and then copying to S3**: While doing this project, I found that writing directly to S3 a) was very slow and b) used a lot of PUT, LIST etc. requests (which are more expensive than GET requests). Thus I decided to write the parquet files (which are small compared to the CSV and SAS files) to the local Udacity workspace and then copy them to S3. The difference was striking, as the immigration data example shows: writing the parquet files directly to S3 had not finished even after 6 hours, while first writing them locally and then copying them to S3 took around 15 min for writing plus 2 min for copying (always under 20 min in total).

For the reasons listed above I think that my technology choices are very well adapted to the problem presented. For scaling options, see the part about the three scenarios below.
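The "write locally, then copy" choice comes down to a single `aws s3 sync` call per folder. The sketch below only assembles that command so it can be inspected without AWS access; `build_sync_command` is a hypothetical helper, and the paths are the example folder and bucket names used elsewhere in this notebook.

```python
import shlex


def build_sync_command(local_dir, s3_uri, exclude=None):
    """Build the argument list for `aws s3 sync` from a local folder to S3."""
    cmd = ["aws", "s3", "sync", local_dir, s3_uri]
    if exclude:
        # skip files matching the pattern, e.g. notebook checkpoints
        cmd += ["--exclude", exclude]
    # suppress per-file progress output and print errors only
    cmd.append("--only-show-errors")
    return cmd


cmd = build_sync_command("./immigration/", "s3://udacity-dend-capstone/immigration/")
print(shlex.join(cmd))

# To actually run it (requires the AWS CLI and valid credentials):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues with patterns such as `*ipynb*`.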

#### Propose how often the data should be updated and why.
The data I used in this project does not have immediate updates. For Travely's immediate needs, no update is necessary for the time being.

Depending on the data source, one might consider different options as listed below:
* **Immigration data from the US National Tourism and Trade Office**: has different (paid!) dataset subscriptions
* **Temperature data**: The dataset comes from https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data, which apparently is not updated anymore. The original data comes from http://berkeleyearth.org/data/, where the data seems to be updated at least yearly.
* **Demographics data**: comes from https://public.opendatasoft.com/explore/dataset/us-cities-demographics/information/ and has not been updated since 2017. Thus, this might need to be replaced with or enriched with a more current and/or regularly updated dataset.
* **Airport data**: According to the source (https://datahub.io/core/airport-codes#readme), this data is updated on a (near-)daily basis.

Overall, if every source were refreshed as soon as new data arrives, the airport dataset with its (near-)daily updates would determine the shortest update interval. If updates are needed, the best option would be to move the project to Airflow, where update tasks can be scheduled separately according to each data source's update frequency.


#### Write a description of how you would approach the problem differently under the following scenarios:
 * **The data was increased by 100x**:
 In this case, I would consider moving the entire project to AWS EMR. EMR provides high-powered clusters for technologies like Hadoop and Spark, with the respective packages already installed. EMR offers integrated support for Jupyter notebooks as well as other AWS services, such as S3. You can submit a Spark job to the cluster and choose whether the cluster should keep running or terminate itself after the Spark job has finished.
 
 * **The data populates a dashboard that must be updated on a daily basis by 7am every day**:
 In this case, I would migrate the project to an Apache Airflow pipeline, which offers the option of regularly running certain steps, e.g. data ingestion, and also has the option to retry multiple times if a step fails.
 
 * **The database needed to be accessed by 100+ people**:
 In this case, I would consider different options of Data Warehouses or Database Services, such as AWS's Redshift or RDS (Relational Database Service), since probably some of these people would not be as well-versed in Spark.
 Another option would be ingesting the data into a dashboard or a BI application, depending on who these 100+ people are and in what format they can work best with the data.

## Explanation
Some of these steps are done for each dataset as it is processed - this is to avoid "hopping around" between the datasets.

#### Immigration Data

Here we work with Spark, which has a package supporting the SAS format (.sas7bdat). Spark is a bit slow, so when I ran the cells below, each one took up to 30 s.

(In this case, pandas was a lot slower, so I chose Spark, although pandas also has an option to read in SAS files.)

**Please note**: For your convenience, I have commented out some of the exploration steps since they took 5-15 min each. Where there was a result to be observed, I included it in a comment, along with the wall time measured with %time. If you would like to see these steps, please feel free to uncomment them.

In [4]:
# create a Spark session with the necessary packages for the project
spark = SparkSession.builder \
        .config("spark.jars.packages",\
                "org.apache.hadoop:hadoop-aws:2.7.1,com.amazonaws:aws-java-sdk:1.7.4,saurfang:spark-sas7bdat:2.0.0-s_2.11") \
        .enableHiveSupport() \
        .getOrCreate()
print("SparkSession created for Immigration data")

# read one of the files into a dataframe to observe some characteristics
df_imm =spark.read.format('com.github.saurfang.sas.spark')\
        .load(os.path.join(imm_folder_loc, 'i94_apr16_sub.sas7bdat'))
print("Immigration data read into a Spark DataFrame")


SparkSession created for Immigration data
Immigration data read into a Spark DataFrame


In [5]:
# look at the first few rows to observe some characteristics
df_imm.head(5)

# observations from this and the data source's data dictionary:
# - columns like cicid, visapost, dtadfile, dentdepa, entdepd, ... are not relevant
#   or even completely deprecated
# - year and month are given in intuitive numbers
# - gender has some None

[Row(cicid=6.0, i94yr=2016.0, i94mon=4.0, i94cit=692.0, i94res=692.0, i94port='XXX', arrdate=20573.0, i94mode=None, i94addr=None, depdate=None, i94bir=37.0, i94visa=2.0, count=1.0, dtadfile=None, visapost=None, occup=None, entdepa='T', entdepd=None, entdepu='U', matflag=None, biryear=1979.0, dtaddto='10282016', gender=None, insnum=None, airline=None, admnum=1897628485.0, fltno=None, visatype='B2'),
 Row(cicid=7.0, i94yr=2016.0, i94mon=4.0, i94cit=254.0, i94res=276.0, i94port='ATL', arrdate=20551.0, i94mode=1.0, i94addr='AL', depdate=None, i94bir=25.0, i94visa=3.0, count=1.0, dtadfile='20130811', visapost='SEO', occup=None, entdepa='G', entdepd=None, entdepu='Y', matflag=None, biryear=1991.0, dtaddto='D/S', gender='M', insnum=None, airline=None, admnum=3736796330.0, fltno='00296', visatype='F1'),
 Row(cicid=15.0, i94yr=2016.0, i94mon=4.0, i94cit=101.0, i94res=101.0, i94port='WAS', arrdate=20545.0, i94mode=1.0, i94addr='MI', depdate=20691.0, i94bir=55.0, i94visa=2.0, count=1.0, dtadfile=

After some observing, we would like to have all of this dataset in one Spark DataFrame.

In [6]:
# create dataframe by reading in january data
df_imm =spark.read.format('com.github.saurfang.sas.spark')\
        .load(os.path.join(imm_folder_loc, 'i94_jan16_sub.sas7bdat'))
print("Immigration for jan loaded")

# list of months
months = ["feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]

# columns the dataframe has in the end = columns of january file
imm_columns = df_imm.columns

# For some reason, the file for June has more columns, all starting with 'delete', probably deprecated.
# I will thus select only the columns present in all of the dataframes.
# load and append the remaining 11 months:
for month in months:
    df_imm_next = spark.read.format('com.github.saurfang.sas.spark')\
                        .load(os.path.join(imm_folder_loc, f'i94_{month}16_sub.sas7bdat'))
    print(f"Immigration for {month} loaded")
    df_imm = df_imm.union(df_imm_next[imm_columns])
    print(f"Immigration for {month} appended")

print("Immigration: All months loaded")


Immigration for jan loaded
Immigration for feb loaded
Immigration for feb appended
Immigration for mar loaded
Immigration for mar appended
Immigration for apr loaded
Immigration for apr appended
Immigration for may loaded
Immigration for may appended
Immigration for jun loaded
Immigration for jun appended
Immigration for jul loaded
Immigration for jul appended
Immigration for aug loaded
Immigration for aug appended
Immigration for sep loaded
Immigration for sep appended
Immigration for oct loaded
Immigration for oct appended
Immigration for nov loaded
Immigration for nov appended
Immigration for dec loaded
Immigration for dec appended
Immigration: All months loaded
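The column selection in the loop above relies on January's columns being present in every other month. A more general way to find the shared columns is sketched below in plain Python; `common_columns` is a hypothetical helper, and the `delete_...` names are made-up stand-ins for the extra June columns.

```python
def common_columns(reference, *others):
    """Return the columns of `reference` that occur in every other list, in order."""
    shared = set(reference).intersection(*map(set, others))
    return [col for col in reference if col in shared]


jan = ["cicid", "i94yr", "i94mon", "i94port"]
# hypothetical extra columns, standing in for the ones that start with 'delete'
jun = ["cicid", "i94yr", "i94mon", "i94port", "delete_days", "delete_mexl"]
print(common_columns(jan, jun))
```

Keeping the reference order matters here because `union` in Spark matches columns by position, not by name.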


Let's see how many rows we have in this dataset (result: 40,790,529 rows). There are no duplicates (number of rows stays the same when dropping duplicates).

In [7]:
# %time df_imm.count() # wall time 8min 50s - result: 40,790,529 rows

In [8]:
df_imm = df_imm.dropDuplicates()
# %time df_imm.count() # wall time 13min 38s - result: 40,790,529 rows

In [9]:
# select only people who came by plane -> i94mode==1
df_imm = df_imm[df_imm['i94mode']==1]

# %time df_imm.count() # -> reduced to 39,166,088
# wall time 7min 58s

In [10]:
df_imm = df_imm.dropna(subset=['i94yr', 'i94mon', 'i94port', 'arrdate', 'i94bir']) # wall time < 1s

# %time df_imm.count() # wall time 7min 53s - result: 39,1458,783 rows

In [11]:
# select the most relevant columns for Travely:
# year and month, airport, address, visa code and type,
# arrival and departure dates, birth year, gender
%time df_imm = df_imm['i94yr', 'i94mon', 'i94port', 'i94addr', 'i94visa', 'arrdate', 'depdate', 'biryear', 'gender', 'visatype']
# wall time < 1s

CPU times: user 8.26 ms, sys: 64 µs, total: 8.32 ms
Wall time: 67.9 ms


In [12]:
# rename columns, e.g. i94yr -> year, i94port -> airport_code
df_imm = df_imm.withColumnRenamed("i94yr", "year") \
                .withColumnRenamed("i94mon", "month") \
                .withColumnRenamed("i94port", "airport_code") \
                .withColumnRenamed("i94addr", "address") \
                .withColumnRenamed("i94visa", "visacode")
df_imm.schema

StructType(List(StructField(year,DoubleType,true),StructField(month,DoubleType,true),StructField(airport_code,StringType,true),StructField(address,StringType,true),StructField(visacode,DoubleType,true),StructField(arrdate,DoubleType,true),StructField(depdate,DoubleType,true),StructField(biryear,DoubleType,true),StructField(gender,StringType,true),StructField(visatype,StringType,true)))

In [13]:
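# add and compute the stay duration column in days; arrdate and depdate are SAS numeric dates
# (days since 1960-01-01), so their difference is already a number of days
# date_converter is not applied below; it is a helper for turning SAS dates into calendar dates
date_converter = F.udf(lambda x: None if x is None else
                       datetime.fromordinal(datetime(1960, 1, 1).toordinal() + int(x)).date(),
                       T.DateType())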
df_imm = df_imm.withColumn("stay_duration", (F.col("depdate") - F.col("arrdate")))
df_imm = df_imm.drop("depdate").drop("arrdate")
df_imm.schema


StructType(List(StructField(year,DoubleType,true),StructField(month,DoubleType,true),StructField(airport_code,StringType,true),StructField(address,StringType,true),StructField(visacode,DoubleType,true),StructField(biryear,DoubleType,true),StructField(gender,StringType,true),StructField(visatype,StringType,true),StructField(stay_duration,DoubleType,true)))
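To make the date arithmetic concrete, here is a minimal sketch in plain Python, assuming that `arrdate` and `depdate` are SAS numeric dates, i.e. days counted from 1960-01-01. The helper names `sas_to_date` and `stay_duration` are hypothetical; the sample values are taken from the rows displayed earlier.

```python
from datetime import date, timedelta

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this date


def sas_to_date(sas_days):
    """Convert a SAS numeric date (days since 1960-01-01) to a date, or None."""
    if sas_days is None:
        return None
    return SAS_EPOCH + timedelta(days=int(sas_days))


def stay_duration(arrdate, depdate):
    """Length of stay in days, or None if either date is missing."""
    if arrdate is None or depdate is None:
        return None
    return int(depdate - arrdate)


# values from the third row shown in the head() output above
print(sas_to_date(20545.0), sas_to_date(20691.0), stay_duration(20545.0, 20691.0))
```

Because both columns use the same epoch, subtracting them directly gives the stay in days without converting either one first.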

In [14]:
# df_imm.head(10) # controlling that stay_duration is correctly implemented


In [15]:
# write data to parquet files

# Here I chose to write the parquet files locally and then use the aws cli to upload them to the S3 bucket.
# Time comparison: 15-20 min writing + 1-2 min uploading 
# vs. writing directly to S3: not finished even after 6 hours

%time df_imm.write \
            .partitionBy('month').mode('overwrite') \
            .parquet(os.path.join(local, imm_key))

# Here, I only partitioned by month, since all the data is from the year 2016 
# and it would thus not make sense to include the year in the partitioning process.
# However, should Travely decide to include immigration data from other years, 
# year would definitely be the first partitioning key, followed by month.


CPU times: user 125 ms, sys: 12.3 ms, total: 138 ms
Wall time: 15min 11s


The I94 immigration data also had some annotations in the file I94_SAS_Labels_Descriptions.SAS. From this, I have created the respective JSON files (such as address_desc.json, in the desc folder), which will generate more tables for the final data model. (This is not included as code since I had to manually fix the files at some points.) Joining the immigration data and these tables will give more information about the encoded information, such as the state/country the immigrating person comes from.

In [16]:
# define function for regular S3 upload where nothing needs to be excluded,
# and variables that simplify it
onlyerrors = "--only-show-errors"
s3 = "s3"
sync = "sync"

def aws_upload(key, bucket):
    """Uploads files from a local folder with name key to a folder of the same name in an S3 Bucket named bucket."""
    aws(s3, sync, os.path.join(local, key), os.path.join(bucket, key), onlyerrors, _fg=True)

In [17]:
# upload description json files to S3
# command in bash: aws s3 sync ./desc/ s3://udacity-dend-capstone/desc/ --exclude "*ipynb*" --only-show-errors
# this uses the sh module and the `from sh import aws` import 

aws(s3, sync, os.path.join(local, desc_key), os.path.join(bucket, desc_key), '--exclude', '*ipynb*', onlyerrors, _fg=True)
print("Immigration description data: json files uploaded to S3")

Immigration description data: json files uploaded to S3


In [18]:
# upload immigration parquet files to S3
# command in bash: aws s3 sync ./immigration/ s3://udacity-dend-capstone/immigration/ --only-show-errors
aws_upload(imm_key, bucket)
print("Immigration data: parquet files uploaded to S3")

Immigration data: parquet files uploaded to S3


In [19]:
print("Immigration data: PROCESSING FINISHED")

Immigration data: PROCESSING FINISHED


#### Temperature Data

Since the raw data is in CSV format, I will use pandas for the exploration and cleaning because it is faster and provides an easy overview through its automatic table-formatting. Each step with pandas should take under 10 s.

To write the data to parquet files in S3, I will use PySpark. Unlike pandas, it does not require additional dependencies for that task.

In [20]:
# read temperature into pandas DataFrame
%time pd_temperature = pd.read_csv(temperature_file_loc)

CPU times: user 8.6 s, sys: 1.26 s, total: 9.86 s
Wall time: 11.1 s


In [21]:
# get some information about the data
print(pd_temperature.shape)
pd_temperature.head(10)

(8599212, 7)


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
1,1743-12-01,,,Århus,Denmark,57.05N,10.33E
2,1744-01-01,,,Århus,Denmark,57.05N,10.33E
3,1744-02-01,,,Århus,Denmark,57.05N,10.33E
4,1744-03-01,,,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
9,1744-08-01,,,Århus,Denmark,57.05N,10.33E


Here we can already see there are some NaN values in the dataset. In the next step, we will determine how many values in which column are NaN.

In [22]:
# determine the number of nulls/NaNs in the data
pd_temperature.isnull().sum()

dt                                    0
AverageTemperature               364130
AverageTemperatureUncertainty    364130
City                                  0
Country                               0
Latitude                              0
Longitude                             0
dtype: int64

As we can see, only AverageTemperature and AverageTemperatureUncertainty have NaN values, and they have the same number of NaN values.
Thus, we assume they are always either both NaN or both have a non-null value (as seen above in the first few lines of data).
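Rather than assuming this, the claim can be checked directly. The sketch below does so on a tiny hand-made sample with the same two column names; it assumes pandas and numpy are installed, as in the notebook's import cell. On the real data, the same comparison would be run on `pd_temperature`.

```python
import numpy as np
import pandas as pd

# small illustrative sample, not the real dataset
sample = pd.DataFrame({
    "AverageTemperature": [6.068, np.nan, 5.788],
    "AverageTemperatureUncertainty": [1.737, np.nan, 3.624],
})

temp_missing = sample["AverageTemperature"].isnull()
uncert_missing = sample["AverageTemperatureUncertainty"].isnull()

# True only if both columns are missing in exactly the same rows
same_rows = bool((temp_missing == uncert_missing).all())
print(same_rows)
```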

In the next step, we will drop the NaN values from the dataframe and verify there are no more NaN values in it.

In [23]:
# drop NaN values
pd_temperature = pd_temperature.dropna()

# check that NaN values have been dropped
print(pd_temperature.isnull().sum())
pd_temperature.head(10)

dt                               0
AverageTemperature               0
AverageTemperatureUncertainty    0
City                             0
Country                          0
Latitude                         0
Longitude                        0
dtype: int64


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
0,1743-11-01,6.068,1.737,Århus,Denmark,57.05N,10.33E
5,1744-04-01,5.788,3.624,Århus,Denmark,57.05N,10.33E
6,1744-05-01,10.644,1.283,Århus,Denmark,57.05N,10.33E
7,1744-06-01,14.051,1.347,Århus,Denmark,57.05N,10.33E
8,1744-07-01,16.082,1.396,Århus,Denmark,57.05N,10.33E
10,1744-09-01,12.781,1.454,Århus,Denmark,57.05N,10.33E
11,1744-10-01,7.95,1.63,Århus,Denmark,57.05N,10.33E
12,1744-11-01,4.639,1.302,Århus,Denmark,57.05N,10.33E
13,1744-12-01,0.122,1.756,Århus,Denmark,57.05N,10.33E
14,1745-01-01,-1.333,1.642,Århus,Denmark,57.05N,10.33E


Since Travely currently only operates in the US, we will select the temperature values for Country='United States' and keep only data from 2000-01-01 onward.

In [24]:
# select only US
pd_temperature = pd_temperature[pd_temperature['Country']=='United States']
# select only timestamps from 2000-01-01 onward (inclusive)
pd_temperature = pd_temperature[pd_temperature['dt']>='2000-01-01']

# get information about dataframe
print(pd_temperature.shape)
pd_temperature.head(10)

(42404, 7)


Unnamed: 0,dt,AverageTemperature,AverageTemperatureUncertainty,City,Country,Latitude,Longitude
49715,2000-01-01,8.039,0.18,Abilene,United States,32.95N,100.53W
49716,2000-02-01,11.908,0.306,Abilene,United States,32.95N,100.53W
49717,2000-03-01,14.423,0.385,Abilene,United States,32.95N,100.53W
49718,2000-04-01,18.274,0.262,Abilene,United States,32.95N,100.53W
49719,2000-05-01,25.358,0.358,Abilene,United States,32.95N,100.53W
49720,2000-06-01,25.264,0.412,Abilene,United States,32.95N,100.53W
49721,2000-07-01,29.421,0.345,Abilene,United States,32.95N,100.53W
49722,2000-08-01,29.733,0.354,Abilene,United States,32.95N,100.53W
49723,2000-09-01,25.446,0.305,Abilene,United States,32.95N,100.53W
49724,2000-10-01,18.477,0.262,Abilene,United States,32.95N,100.53W
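The date filter above compares the `dt` values as strings. This works because ISO-formatted dates (`YYYY-MM-DD`) sort lexicographically in the same order as the dates they represent. A small self-contained check on made-up sample values:

```python
from datetime import date

dates = ["1999-12-01", "2000-01-01", "2013-09-01", "1743-11-01"]
cutoff = "2000-01-01"

# string comparison, as done on the dt column
by_string = [d for d in dates if d >= cutoff]

# the same filter on parsed dates
by_date = [d for d in dates if date.fromisoformat(d) >= date.fromisoformat(cutoff)]

print(by_string == by_date, by_string)
```

The shortcut would not hold for other layouts such as `DD/MM/YYYY`, in which case the column should be parsed into dates first.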


Before writing the table to file, I will rename the columns so they are a bit shorter (in the case of the temperature columns) and match the usual snake-case convention (e.g. AverageTemperature -> avg_temperature).

In [25]:
# rename columns
pd_temperature.rename(columns = {"AverageTemperature": "avg_temperature",
                                "AverageTemperatureUncertainty": "avg_temperature_uncert",
                                "City": "city",
                                "Country": "country",
                                "Latitude": "latitude",
                                "Longitude": "longitude"},
                     inplace = True)

# verify successful rename by looking at the header of the table
pd_temperature.head(10)

Unnamed: 0,dt,avg_temperature,avg_temperature_uncert,city,country,latitude,longitude
49715,2000-01-01,8.039,0.18,Abilene,United States,32.95N,100.53W
49716,2000-02-01,11.908,0.306,Abilene,United States,32.95N,100.53W
49717,2000-03-01,14.423,0.385,Abilene,United States,32.95N,100.53W
49718,2000-04-01,18.274,0.262,Abilene,United States,32.95N,100.53W
49719,2000-05-01,25.358,0.358,Abilene,United States,32.95N,100.53W
49720,2000-06-01,25.264,0.412,Abilene,United States,32.95N,100.53W
49721,2000-07-01,29.421,0.345,Abilene,United States,32.95N,100.53W
49722,2000-08-01,29.733,0.354,Abilene,United States,32.95N,100.53W
49723,2000-09-01,25.446,0.305,Abilene,United States,32.95N,100.53W
49724,2000-10-01,18.477,0.262,Abilene,United States,32.95N,100.53W


Finally, we write the data into parquet files using Spark - after a small data quality check. The transformation of the pandas DataFrame to the Spark DataFrame, the file writing, and the S3 upload should each only take a few seconds.

To avoid high charges for S3 LIST etc. operations and to make the process faster, I will first write the files to the workspace and then copy them to the S3 bucket.

In [26]:
# create a SparkSession
spark = SparkSession \
        .builder \
        .config("spark.jars.packages",\
                "org.apache.hadoop:hadoop-aws:2.7.1,com.amazonaws:aws-java-sdk:1.7.4,saurfang:spark-sas7bdat:2.0.0-s_2.11") \
        .getOrCreate()

# read data from pandas DataFrame into Spark DataFrame
df_temperature = spark.createDataFrame(pd_temperature)

In [27]:
# check that dataframe is not empty
# (head(1) returns a list of rows, which is empty for an empty DataFrame)
if len(df_temperature.head(1)) > 0:
    print("Data Quality Check: data frame not empty, passed")
else:
    print("Data Quality Check FAILED: data frame empty")

# check that there are multiple cities and dates
if df_temperature.select("city").distinct().count() > 1:
    print("Data Quality Check: multiple cities, passed")
else:
    print("Data Quality Check FAILED: missing cities")

if df_temperature.select("dt").distinct().count() > 1:
    print("Data Quality Check: multiple dates, passed")
else:
    print("Data Quality Check FAILED: missing dates")

Data Quality Check: data frame not empty, passed
Data Quality Check: multiple cities, passed
Data Quality Check: multiple dates, passed


In [28]:
# write data to parquet
%time df_temperature.write \
                .partitionBy('city') \
                .mode('overwrite') \
                .parquet(os.path.join(local, temperature_key))

CPU times: user 2.68 ms, sys: 325 µs, total: 3 ms
Wall time: 4.37 s


In [29]:
# upload files to S3 using the AWS CLI
# bash version: !aws s3 sync ./temperature/ s3://udacity-dend-capstone/temperature/ --only-show-errors
aws_upload(temperature_key, bucket)
print("Temperature data: uploaded to S3")

Temperature data: uploaded to S3


In [30]:
print("Temperature data: PROCESSING FINISHED")

Temperature data: PROCESSING FINISHED


#### Airport Data

In [31]:
# read data into pandas DataFrame
pd_airport = pd.read_csv(airport_file_loc, delimiter=",")

# display a few lines of data to get to know it
pd_airport.head(10)

Unnamed: 0,ident,type,name,elevation_ft,continent,iso_country,iso_region,municipality,gps_code,iata_code,local_code,coordinates
0,00A,heliport,Total Rf Heliport,11.0,,US,US-PA,Bensalem,00A,,00A,"-74.93360137939453, 40.07080078125"
1,00AA,small_airport,Aero B Ranch Airport,3435.0,,US,US-KS,Leoti,00AA,,00AA,"-101.473911, 38.704022"
2,00AK,small_airport,Lowell Field,450.0,,US,US-AK,Anchor Point,00AK,,00AK,"-151.695999146, 59.94919968"
3,00AL,small_airport,Epps Airpark,820.0,,US,US-AL,Harvest,00AL,,00AL,"-86.77030181884766, 34.86479949951172"
4,00AR,closed,Newport Hospital & Clinic Heliport,237.0,,US,US-AR,Newport,,,,"-91.254898, 35.6087"
5,00AS,small_airport,Fulton Airport,1100.0,,US,US-OK,Alex,00AS,,00AS,"-97.8180194, 34.9428028"
6,00AZ,small_airport,Cordes Airport,3810.0,,US,US-AZ,Cordes,00AZ,,00AZ,"-112.16500091552734, 34.305599212646484"
7,00CA,small_airport,Goldstone /Gts/ Airport,3038.0,,US,US-CA,Barstow,00CA,,00CA,"-116.888000488, 35.350498199499995"
8,00CL,small_airport,Williams Ag Airport,87.0,,US,US-CA,Biggs,00CL,,00CL,"-121.763427, 39.427188"
9,00CN,heliport,Kitchen Creek Helibase Heliport,3350.0,,US,US-CA,Pine Valley,00CN,,00CN,"-116.4597417, 32.7273736"


In [32]:
# determine the shape of the data, i.e. the number of rows and columns
pd_airport.shape

(55075, 12)

In [33]:
# get information about data
pd_airport.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55075 entries, 0 to 55074
Data columns (total 12 columns):
ident           55075 non-null object
type            55075 non-null object
name            55075 non-null object
elevation_ft    48069 non-null float64
continent       27356 non-null object
iso_country     54828 non-null object
iso_region      55075 non-null object
municipality    49399 non-null object
gps_code        41030 non-null object
iata_code       9189 non-null object
local_code      28686 non-null object
coordinates     55075 non-null object
dtypes: float64(1), object(11)
memory usage: 5.0+ MB


The immigration data identifies airports only by their IATA code (the three-letter international airport code), so we remove every row without an IATA code from the airport table. Then we check how many rows are left in our dataset.

In [34]:
# drop all rows without an IATA code
pd_airport = pd_airport.dropna(subset=["iata_code"])

# see how the DataFrame has changed in size
print(pd_airport.shape)
pd_airport.info()

(9189, 12)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9189 entries, 223 to 55070
Data columns (total 12 columns):
ident           9189 non-null object
type            9189 non-null object
name            9189 non-null object
elevation_ft    8819 non-null float64
continent       6222 non-null object
iso_country     9158 non-null object
iso_region      9189 non-null object
municipality    8423 non-null object
gps_code        8538 non-null object
iata_code       9189 non-null object
local_code      2987 non-null object
coordinates     9189 non-null object
dtypes: float64(1), object(11)
memory usage: 933.3+ KB
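
The effect of the `subset` argument can be seen on a small example. This is a minimal sketch on invented toy data, not on the airport file: only a missing `iata_code` removes a row, while missing values in other columns are left alone.

```python
import numpy as np
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "name": ["Alpha Field", "Beta Heliport", "Gamma Airport"],
    "iata_code": ["AAA", np.nan, "GGG"],
    "local_code": [np.nan, "B1", np.nan],
})

# only a missing iata_code removes a row; NaN in local_code is tolerated
with_iata = toy.dropna(subset=["iata_code"])

print(with_iata.index.tolist())
print(int(with_iata["local_code"].isna().sum()))
```

The surviving rows keep their original index labels, which is why the index in the output above runs from 223 to 55070 instead of starting at 0.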


Columns such as ident, local_code, gps_code, and continent are not needed for joining with the immigration table, so we will drop them.

In [35]:
# drop less relevant columns and verify
pd_airport.drop(columns = ["ident", "local_code", "continent", "gps_code"], inplace=True)
pd_airport.head(10)

Unnamed: 0,type,name,elevation_ft,iso_country,iso_region,municipality,iata_code,coordinates
223,small_airport,Utirik Airport,4.0,MH,MH-UTI,Utirik Island,UTK,"169.852005, 11.222"
440,small_airport,Ocean Reef Club Airport,8.0,US,US-FL,Key Largo,OCA,"-80.274803161621, 25.325399398804"
594,small_airport,Pilot Station Airport,305.0,US,US-AK,Pilot Station,PQS,"-162.899994, 61.934601"
673,small_airport,Crested Butte Airpark,8980.0,US,US-CO,Crested Butte,CSE,"-106.928341, 38.851918"
1088,small_airport,LBJ Ranch Airport,1515.0,US,US-TX,Johnson City,JCY,"-98.62249755859999, 30.251800537100003"
1402,small_airport,Metropolitan Airport,418.0,US,US-MA,Palmer,PMX,"-72.31140136719999, 42.223300933800004"
1438,seaplane_base,Loring Seaplane Base,0.0,US,US-AK,Loring,WLR,"-131.636993408, 55.6012992859"
1555,small_airport,Nunapitchuk Airport,12.0,US,US-AK,Nunapitchuk,NUP,"-162.440454, 60.905591"
1574,seaplane_base,Port Alice Seaplane Base,0.0,US,US-AK,Port Alice,PTC,"-133.597, 55.803"
1722,small_airport,Icy Bay Airport,50.0,US,US-AK,Icy Bay,ICY,"-141.662002563, 59.96900177"


To make joins more intuitive, we will rename some of the columns.

In [36]:
# rename columns
pd_airport.rename(columns = {"type": "airport_type",
                            "name": "airport_name",
                            "iata_code": "airport_code"},
                 inplace = True)
pd_airport.head(5)

Unnamed: 0,airport_type,airport_name,elevation_ft,iso_country,iso_region,municipality,airport_code,coordinates
223,small_airport,Utirik Airport,4.0,MH,MH-UTI,Utirik Island,UTK,"169.852005, 11.222"
440,small_airport,Ocean Reef Club Airport,8.0,US,US-FL,Key Largo,OCA,"-80.274803161621, 25.325399398804"
594,small_airport,Pilot Station Airport,305.0,US,US-AK,Pilot Station,PQS,"-162.899994, 61.934601"
673,small_airport,Crested Butte Airpark,8980.0,US,US-CO,Crested Butte,CSE,"-106.928341, 38.851918"
1088,small_airport,LBJ Ranch Airport,1515.0,US,US-TX,Johnson City,JCY,"-98.62249755859999, 30.251800537100003"


In [37]:
# data quality check: relevant columns do not contain null/NaN
if (pd_airport.isna().sum()["airport_name"] == 0) and (pd_airport.isna().sum()["airport_code"] == 0):
    print("Data Quality Check - passed: No missing airport names or airport codes")
else:
    print("Data Quality Check - FAILED: Missing airport names or airport codes")

Data Quality Check - passed: No missing airport names or airport codes
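
If the same kind of null check is needed for several tables, it could be wrapped in a small function. The following is a sketch under that assumption; the helper name `missing_counts` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_counts(df, required_columns):
    """Return {column: number of null values} for the required columns that contain nulls."""
    counts = df[required_columns].isna().sum()
    return {col: int(n) for col, n in counts.items() if n > 0}

# toy data, invented for illustration only
toy = pd.DataFrame({
    "airport_name": ["Alpha Field", None],
    "airport_code": ["AAA", "BBB"],
    "municipality": [np.nan, np.nan],
})

# municipality is not a required column, so its NaN values are ignored
problems = missing_counts(toy, ["airport_name", "airport_code"])
if problems:
    print(f"Data Quality Check - FAILED: {problems}")
else:
    print("Data Quality Check - passed")
```

Returning the counts per column, rather than a single pass/fail flag, shows directly which column caused a failure.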


The last step is to write this data to parquet files with Spark and copy them to S3 with the AWS CLI.

In [38]:
# create SparkSession
spark = SparkSession \
        .builder \
        .config("spark.jars.packages",\
                "org.apache.hadoop:hadoop-aws:2.7.1,com.amazonaws:aws-java-sdk:1.7.4,saurfang:spark-sas7bdat:2.0.0-s_2.11") \
        .getOrCreate()

# create schema for data
schema = T.StructType([T.StructField("airport_type", T.StringType()),
                      T.StructField("airport_name", T.StringType()),
                      T.StructField("elevation_ft", T.DoubleType()),
                      T.StructField("iso_country", T.StringType()),
                      T.StructField("iso_region", T.StringType()),
                      T.StructField("municipality", T.StringType()),
                      T.StructField("airport_code", T.StringType()),
                      T.StructField("coordinates", T.StringType())])

# read data from pandas DataFrame into Spark DataFrame
df_airport = spark.createDataFrame(pd_airport, schema)

In [40]:
# write data to parquet
%time df_airport.write \
                .partitionBy('iso_country') \
                .mode('overwrite') \
                .parquet(os.path.join(local, airport_key))

CPU times: user 2.66 ms, sys: 0 ns, total: 2.66 ms
Wall time: 2.64 s
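
Spark's `partitionBy` writes one subdirectory per distinct value, named `column=value`. As a quick sanity check before uploading, the partition values present in a local output folder could be listed as sketched below. The helper `list_partition_values` is hypothetical, and the demo builds a stand-in directory tree in a temporary folder instead of reading the notebook's real output.

```python
import os
import tempfile

def list_partition_values(parquet_dir, column):
    """Return the sorted partition values found as `column=value` subdirectories."""
    prefix = f"{column}="
    return sorted(
        entry[len(prefix):]
        for entry in os.listdir(parquet_dir)
        if entry.startswith(prefix) and os.path.isdir(os.path.join(parquet_dir, entry))
    )

# stand-in for a Spark output folder; the directory names are made up for the demo
with tempfile.TemporaryDirectory() as tmp:
    for name in ["iso_country=US", "iso_country=MH"]:
        os.makedirs(os.path.join(tmp, name))
    # marker files such as _SUCCESS are not partitions and are skipped
    open(os.path.join(tmp, "_SUCCESS"), "w").close()
    values = list_partition_values(tmp, "iso_country")

print(values)
```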


In [41]:
# upload files to S3 using the AWS CLI
# bash version: !aws s3 sync ./airport_codes/ s3://udacity-dend-capstone/airport_codes/ --only-show-errors
aws_upload(airport_key, bucket)
print("Airport data: uploaded to S3")
print("Airport data: PROCESSING FINISHED")

Airport data: uploaded to S3
Airport data: PROCESSING FINISHED


#### Demographics Data

In [42]:
# read data into pandas DataFrame
pd_demographics = pd.read_csv(demographics_file_loc, delimiter=";")

# look at a few rows to get to know the data
pd_demographics.head(10)

Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code,Race,Count
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD,Hispanic or Latino,25924
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA,White,58723
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL,Asian,4759
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA,Black or African-American,24437
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ,White,76402
5,Peoria,Illinois,33.1,56229.0,62432.0,118661,6634.0,7517.0,2.4,IL,American Indian and Alaska Native,1343
6,Avondale,Arizona,29.1,38712.0,41971.0,80683,4815.0,8355.0,3.18,AZ,Black or African-American,11592
7,West Covina,California,39.8,51629.0,56860.0,108489,3800.0,37038.0,3.56,CA,Asian,32716
8,O'Fallon,Missouri,36.0,41762.0,43270.0,85032,5783.0,3269.0,2.77,MO,Hispanic or Latino,2583
9,High Point,North Carolina,35.5,51751.0,58077.0,109828,5204.0,16315.0,2.65,NC,Asian,11060


In [43]:
# determine the size of the DataFrame
pd_demographics.shape

(2891, 12)

In [44]:
# look at the number of null/NaN values
print(pd_demographics.isna().sum())

# drop NaN values
pd_demographics.dropna(inplace = True)

# verify
print(pd_demographics.isna().sum())

City                       0
State                      0
Median Age                 0
Male Population            3
Female Population          3
Total Population           0
Number of Veterans        13
Foreign-born              13
Average Household Size    16
State Code                 0
Race                       0
Count                      0
dtype: int64
City                      0
State                     0
Median Age                0
Male Population           0
Female Population         0
Total Population          0
Number of Veterans        0
Foreign-born              0
Average Household Size    0
State Code                0
Race                      0
Count                     0
dtype: int64


Race is not relevant for Travely, so the columns Race and Count can be omitted. The values in the other columns are identical across all Race rows of a city, so no aggregation is necessary here.

In [45]:
# drop irrelevant columns
pd_demographics.drop(columns = ["Race", "Count"], inplace = True)

# each city now appears in several identical rows (one per former Race entry),
# so drop the repeated rows
pd_demographics.drop_duplicates(inplace = True)

# verify and look at size of DataFrame
print(pd_demographics.shape)
pd_demographics.head(5)

(588, 10)


Unnamed: 0,City,State,Median Age,Male Population,Female Population,Total Population,Number of Veterans,Foreign-born,Average Household Size,State Code
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ
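
That the remaining columns really are identical across the former Race rows of a city can be checked rather than assumed. A minimal sketch on invented toy data: after dropping the two columns and the duplicates, every (City, State) pair should be left with exactly one row.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "City": ["Springfield", "Springfield", "Shelbyville"],
    "State": ["Oregon", "Oregon", "Oregon"],
    "Total Population": [1000, 1000, 500],
    "Race": ["White", "Asian", "White"],
    "Count": [700, 300, 500],
})

deduplicated = toy.drop(columns=["Race", "Count"]).drop_duplicates()

# rows left per city; more than 1 would mean conflicting values for that city
rows_per_city = deduplicated.groupby(["City", "State"]).size()
consistent = bool((rows_per_city == 1).all())

print(consistent, deduplicated.shape)
```

If `consistent` were False for the real data, the conflicting cities would need an explicit aggregation instead of `drop_duplicates`.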


In [46]:
# rename columns for easier joining
# and to match snake_case
pd_demographics.rename(columns = {"City": "city",
                                  "State": "state",
                                  "Median Age": "median_age",
                                  "Male Population": "male_population",
                                  "Female Population": "female_population",
                                  "Total Population": "population",
                                  "Number of Veterans": "veterans",
                                  "Foreign-born": "foreign_born",
                                  "Average Household Size": "avg_household_size",
                                  "State Code": "state_code"},
                      inplace = True)
pd_demographics.head(5)

Unnamed: 0,city,state,median_age,male_population,female_population,population,veterans,foreign_born,avg_household_size,state_code
0,Silver Spring,Maryland,33.8,40601.0,41862.0,82463,1562.0,30908.0,2.6,MD
1,Quincy,Massachusetts,41.0,44129.0,49500.0,93629,4147.0,32935.0,2.39,MA
2,Hoover,Alabama,38.5,38040.0,46799.0,84839,4819.0,8229.0,2.58,AL
3,Rancho Cucamonga,California,34.5,88127.0,87105.0,175232,5821.0,33878.0,3.18,CA
4,Newark,New Jersey,34.6,138040.0,143873.0,281913,5829.0,86253.0,2.73,NJ


In [47]:
# check datatypes of columns
pd_demographics.dtypes

city                   object
state                  object
median_age            float64
male_population       float64
female_population     float64
population              int64
veterans              float64
foreign_born          float64
avg_household_size    float64
state_code             object
dtype: object

In [48]:
# people are counted in whole numbers -> change the type to int64 for the columns male_population, female_population, veterans, foreign_born
pd_demographics = pd_demographics.astype({"male_population": "int64",
                                          "female_population": "int64",
                                          "veterans": "int64",
                                          "foreign_born": "int64"})

In [49]:
# verify
pd_demographics.dtypes

city                   object
state                  object
median_age            float64
male_population         int64
female_population       int64
population              int64
veterans                int64
foreign_born            int64
avg_household_size    float64
state_code             object
dtype: object
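
The cast to `int64` only works because the NaN values were dropped earlier: pandas cannot represent NaN in an `int64` column. A small sketch with invented values shows the failure, and a check that no fractional part would be truncated by the cast.

```python
import numpy as np
import pandas as pd

# toy values, invented for illustration only
veterans = pd.Series([1562.0, np.nan, 4819.0])

try:
    veterans.astype("int64")
    cast_failed = False
except ValueError:
    # pandas refuses to turn NaN into an integer
    cast_failed = True

# after removing the NaN, check that nothing would be truncated, then cast
clean = veterans.dropna()
all_whole = bool((clean % 1 == 0).all())
as_int = clean.astype("int64")

print(cast_failed, all_whole, as_int.tolist())
```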

Data Quality Checks

We would like the following conditions to be true:

* population >= male_population + female_population (some people may identify as neither male nor female)
* population > foreign_born
* population > avg_household_size
* male_population > veterans

In [56]:
# errors = 0

# for index, row in pd_demographics.iterrows():
#     if not (row.population >= (row.male_population + row.female_population)):
#         # since some people might self-identify as neither male nor female, there could be more, but not fewer people than males+females
#         errors += 1
# print(f"Data quality issues in population count: {errors}")

# errors = 0
# for index, row in pd_demographics.iterrows():
#     if row.population < row.foreign_born:
#         errors += 1
# print(f"Data quality issues in foreign_born: {errors}")
# errors = 0

# for index, row in pd_demographics.iterrows():
#     if row.population < row.avg_household_size:
#         errors += 1
# print(f"Data quality issues in avg_household_size: {errors}")
# errors = 0

# for index, row in pd_demographics.iterrows():
#     if row.male_population < row.veterans:
#         errors += 1
# print(f"Data quality issues in veterans: {errors}")

def data_quality_check(condition_wanted, label):
    """Count the rows of pd_demographics that violate condition_wanted and print the result."""
    # condition_wanted is a function that takes a row and returns True if the row is valid
    errors = 0
    for index, row in pd_demographics.iterrows():
        if not condition_wanted(row):
            errors += 1
    # build the message inside the function, so that it reports the count of this check
    # (an f-string at the call site would read a stale global variable `errors`)
    print(f"Data quality issues in {label}: {errors}")

data_quality_check(lambda row: row.population >= (row.male_population + row.female_population),
                   "population count")
data_quality_check(lambda row: row.population > row.foreign_born,
                   "foreign_born")
data_quality_check(lambda row: row.population > row.avg_household_size,
                   "avg_household_size")
data_quality_check(lambda row: row.male_population > row.veterans,
                   "veterans")

Data quality issues in population count: 0
Data quality issues in foreign_born: 0
Data quality issues in avg_household_size: 0
Data quality issues in veterans: 0
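
The row-by-row loop can also be expressed with boolean masks, which avoids `iterrows` and scales better to larger tables. This is a sketch on invented toy data in which the second row deliberately breaks two of the rules; the column names follow the renamed demographics table.

```python
import pandas as pd

# toy data, invented for illustration only; the second row breaks two rules
toy = pd.DataFrame({
    "population": [1000, 500],
    "male_population": [480, 300],
    "female_population": [520, 250],
    "foreign_born": [100, 50],
    "veterans": [40, 400],
})

# each entry maps a label to a boolean Series that is True for valid rows
checks = {
    "population count": toy["population"] >= toy["male_population"] + toy["female_population"],
    "foreign_born": toy["population"] > toy["foreign_born"],
    "veterans": toy["male_population"] > toy["veterans"],
}

# counting the False entries gives the number of violating rows per rule
violations = {label: int((~valid).sum()) for label, valid in checks.items()}
for label, count in violations.items():
    print(f"Data quality issues in {label}: {count}")
```

The masks can also be used to select the offending rows, for example `toy[~checks["veterans"]]`, which helps when a check fails and the cause has to be inspected.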


In [51]:
# create a Spark session
spark = SparkSession \
        .builder \
        .config("spark.jars.packages",\
                "org.apache.hadoop:hadoop-aws:2.7.1,com.amazonaws:aws-java-sdk:1.7.4,saurfang:spark-sas7bdat:2.0.0-s_2.11") \
        .getOrCreate()

In [52]:
# read data from pandas DataFrame into Spark DataFrame
df_demographics = spark.createDataFrame(pd_demographics)

# write data to parquet locally
%time df_demographics.write \
                .partitionBy('state_code') \
                .mode('overwrite') \
                .parquet(os.path.join(local, demographics_key))

CPU times: user 1.47 ms, sys: 181 µs, total: 1.65 ms
Wall time: 1.14 s


In [53]:
# upload files to S3 using the AWS CLI
# bash version: !aws s3 sync ./demographics/ s3://udacity-dend-capstone/demographics/ --only-show-errors
aws_upload(demographics_key, bucket)
print("Demographics data: uploaded to S3")

Demographics data: uploaded to S3


In [54]:
print("Demographics data: PROCESSING FINISHED")

Demographics data: PROCESSING FINISHED
