# ETL Project
---

## Extract

U.S. border-crossing data was extracted from the Bureau of Transportation Statistics (BTS) Border Crossing Entry Data dataset (https://data.transportation.gov/Research-and-Statistics/Border-Crossing-Entry-Data/keg4-3bc2). Because this dataset is served through the Socrata Open Data API, the sodapy client library was used instead of requesting and parsing the JSON directly.
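For comparison, a direct JSON request would target the dataset's resource endpoint. The sketch below only builds that URL and sends nothing. It assumes the usual Socrata endpoint form `https://<domain>/resource/<dataset id>.json` with `$`-prefixed query parameters; `build_soda_url` is a hypothetical helper, not part of sodapy.

```python
from urllib.parse import urlencode


def build_soda_url(domain, dataset_id, **soql):
    """Return the JSON endpoint URL for a Socrata dataset (hypothetical helper)."""
    # SoQL options travel as $-prefixed query-string keys, e.g. $where, $limit
    params = {f"${key}": value for key, value in soql.items()}
    # Keep "$" unescaped so the URL stays readable
    query = urlencode(params, safe="$")
    base = f"https://{domain}/resource/{dataset_id}.json"
    return f"{base}?{query}" if query else base


url = build_soda_url("data.transportation.gov", "keg4-3bc2", limit=5)
print(url)
```

sodapy hides this URL construction and the JSON decoding behind `client.get`.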

In [44]:
# Dependencies
import pandas as pd
from sodapy import Socrata
import datetime as dt
from datetime import datetime

In [43]:
# Create an unauthenticated client for the Socrata Open Data API on the transportation data portal
client = Socrata("data.transportation.gov", None)

# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.transportation.gov",
#                  MyAppToken,
#                  username="user@example.com",
#                  password="AFakePassword")



In [3]:
# Filter conditions so that most of the requested rows are usable in the final database
test_conditions = (
    "date >= '2008-01-01' and value > 0 and ("
    "measure = 'Personal Vehicles' or "
    "measure = 'Personal Vehicles Passengers' or "
    "measure = 'Bus Passengers' or "
    "measure = 'Train Passengers')"
)

conditions = "date >= '2009-01-01' and value > 0"
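A filter over several measures can also be generated rather than typed by hand. This is a minimal sketch, assuming the API accepts the SoQL `in (...)` operator and escapes a single quote by doubling it; `measures_filter` is a hypothetical helper, and the measure names are the ones used in this notebook.

```python
def measures_filter(start_date, measures):
    """Build a SoQL where clause for a start date and a list of measures (hypothetical helper)."""
    # Double any single quote so each value stays a valid SoQL string literal
    quoted = ", ".join("'" + m.replace("'", "''") + "'" for m in measures)
    return f"date >= '{start_date}' and value > 0 and measure in ({quoted})"


where_clause = measures_filter(
    "2008-01-01",
    ["Personal Vehicles", "Personal Vehicles Passengers", "Bus Passengers", "Train Passengers"],
)
print(where_clause)
```

Keeping the measures in a list makes it easy to add or remove one without editing a long string.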

In [7]:
# Request Border Entry data from specified API including additional conditions
results = client.get("keg4-3bc2", limit=23000, border="US-Canada Border", where=test_conditions)
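The request above relies on a fixed `limit` large enough to cover every matching row. As an alternative, the rows could be collected page by page. This is a sketch under assumptions: `fetch_all` is a hypothetical helper, the page size is arbitrary, and ordering by the system field `:id` is assumed to give stable pages. It only needs a client object with a sodapy-style `get` method.

```python
def fetch_all(client, dataset_id, page_size=10000, **params):
    """Collect every matching row by paging with limit/offset (hypothetical helper)."""
    rows = []
    offset = 0
    while True:
        # A fixed order keeps consecutive pages from overlapping or skipping rows
        page = client.get(dataset_id, limit=page_size, offset=offset,
                          order=":id", **params)
        rows.extend(page)
        # A short (or empty) page means the last page has been reached
        if len(page) < page_size:
            return rows
        offset += page_size


# Possible usage with the client and filter defined in this notebook:
# results = fetch_all(client, "keg4-3bc2",
#                     border="US-Canada Border", where=test_conditions)
```

With paging, the row count no longer has to be estimated in advance, at the cost of several smaller requests.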

In [8]:
# Create a Pandas Dataframe
results_df = pd.DataFrame.from_records(results)

In [45]:
# Sort the DataFrame by date
results_df = results_df.sort_values(by=['date'])

In [47]:
# Create a copy of the dataframe to manipulate the columns
clean_df = results_df.copy()
# Remove border and port code columns
clean_df = clean_df[['date','measure','port_name','state','value']]
# Change data type for date column
clean_df['date'] = pd.to_datetime(clean_df['date'])
# Set date as index
clean_df = clean_df.set_index('date')
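The cleaning step converts `date` but leaves `value` as returned by the API. If the API delivers numbers as strings (an assumption here, worth checking with `clean_df.dtypes`), a numeric conversion is needed before any arithmetic. The sketch below uses a few rows copied from the preview output in this notebook, with `value` deliberately written as strings.

```python
import pandas as pd

# Illustrative rows copied from the notebook's preview; string values are an assumption
sample = pd.DataFrame.from_records([
    {"date": "2008-01-01", "measure": "Personal Vehicles", "port_name": "Oroville", "state": "Washington", "value": "14091"},
    {"date": "2008-01-01", "measure": "Bus Passengers", "port_name": "Eastport", "state": "Idaho", "value": "227"},
    {"date": "2008-01-01", "measure": "Personal Vehicles", "port_name": "Limestone", "state": "Maine", "value": "3030"},
    {"date": "2008-01-01", "measure": "Bus Passengers", "port_name": "Calais", "state": "Maine", "value": "405"},
])

# Parse dates and turn the counts into numbers; unparseable entries become NaN
sample["date"] = pd.to_datetime(sample["date"])
sample["value"] = pd.to_numeric(sample["value"], errors="coerce")
sample = sample.set_index("date")

# With numeric values, totals per measure can be computed directly
totals = sample.groupby("measure")["value"].sum()
print(totals)
```

Without the conversion, summing a string column would concatenate the text instead of adding the counts.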

In [48]:
clean_df

date,measure,port_name,state,value
2008-01-01,Personal Vehicles,Oroville,Washington,14091
2008-01-01,Bus Passengers,Eastport,Idaho,227
2008-01-01,Personal Vehicles,Limestone,Maine,3030
2008-01-01,Bus Passengers,Calais,Maine,405
2008-01-01,Bus Passengers,Metaline Falls,Washington,183
2008-01-01,Personal Vehicles,Laurier,Washington,2870
2008-01-01,Personal Vehicles,Sarles,North Dakota,263
2008-01-01,Bus Passengers,Porthill,Idaho,457
2008-01-01,Bus Passengers,Laurier,Washington,134
2008-01-01,Personal Vehicles,Porthill,Idaho,9917


In [36]:
clean_df

,date,measure,port_name,state,value
0,2008-01-01,Personal Vehicles,Oroville,Washington,14091
1,2008-01-01,Personal Vehicles,Nighthawk,Washington,323
2,2008-01-01,Personal Vehicles,Turner,Montana,221
3,2008-01-01,Bus Passengers,Champlain-Rouses Point,New York,14691
4,2008-01-01,Bus Passengers,Sault Sainte Marie,Michigan,20912
5,2008-01-01,Personal Vehicles,Ambrose,North Dakota,93
6,2008-01-01,Bus Passengers,Massena,New York,2473
7,2008-01-01,Train Passengers,Vanceboro,Maine,55
8,2008-01-01,Train Passengers,Eastport,Idaho,172
9,2008-01-01,Bus Passengers,Sweetgrass,Montana,184


In [10]:
results_df['measure'].value_counts()

Personal Vehicles    11568
Bus Passengers        6542
Train Passengers      3344
Name: measure, dtype: int64
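
These counts support two quick checks: whether the request hit its row limit, and whether every requested measure came back. The sketch below uses only numbers and names copied from this notebook; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above
measure_counts = {
    "Personal Vehicles": 11568,
    "Bus Passengers": 6542,
    "Train Passengers": 3344,
}
# Measures named in test_conditions and the limit passed to client.get
requested = {"Personal Vehicles", "Personal Vehicles Passengers",
             "Bus Passengers", "Train Passengers"}
request_limit = 23000

total_rows = sum(measure_counts.values())
# If the total had reached the limit, rows may have been cut off
truncated = total_rows >= request_limit
# Requested measures that returned no rows at all
missing = sorted(requested - set(measure_counts))

print(total_rows, truncated, missing)
```

The total of 21454 rows is below the limit of 23000, so the response does not appear truncated. One requested measure, `Personal Vehicles Passengers`, returned no rows. A possible cause is that the dataset spells this measure differently, which could be checked by listing the distinct `measure` values.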