## Bike Sharing Station Data Collection & Preparation
In this notebook, we will download data on the bicycle sharing stations that are referenced by the trip data.
The station data is publicly available as GeoJSON [here](http://bikeshare.metro.net/stations/json/).

In [1]:
import urllib.request
import json
import pandas as pd
import folium
import re

In [2]:
# disable SSL certificate verification (workaround for Python on Windows)
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

First, we emulate a real browser by sending the headers defined below with the request for the station data. Without these
headers, Metro Bike Share would block the request with a 403 Forbidden error.

In [3]:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}

req = urllib.request.Request('http://bikeshare.metro.net/stations/json/', headers=headers)

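# Check whether the data was already downloaded in this session,
# so that re-running the cell does not trigger another request.
try:
    stations_data_json
    request_complete = True
except NameError:
    request_complete = False

if not request_complete:
    print("Downloading station data")
    with urllib.request.urlopen(req) as url:
        stations_data_json = json.loads(url.read().decode())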

Downloading station data
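
The request above fails with an unhandled exception if the server rejects it. As a minimal sketch, the download could be wrapped in a helper that reports the HTTP status code. The function name `fetch_json` and its `opener` parameter are hypothetical and not part of the original notebook; `opener` exists only so the helper can be tried without network access.

```python
import json
import urllib.error
import urllib.request


def fetch_json(url, headers, opener=urllib.request.urlopen):
    """Request `url` with the given headers and decode the JSON body.

    `opener` is a hypothetical injection point (defaults to urlopen).
    """
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as response:
            return json.loads(response.read().decode())
    except urllib.error.HTTPError as err:
        # In this notebook, a 403 is the status reported when the
        # browser-like headers are missing from the request.
        raise RuntimeError(
            f"Request to {url} failed with HTTP {err.code}"
        ) from err
```

With the `headers` dictionary from the cell above, a call such as `fetch_json('http://bikeshare.metro.net/stations/json/', headers)` would return the same parsed dictionary as `stations_data_json`, assuming the endpoint still responds as described.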


In [4]:
stations_df = pd.json_normalize(stations_data_json['features'])
stations_df.head(2)

Unnamed: 0,type,geometry.coordinates,geometry.type,properties.addressStreet,properties.addressCity,properties.addressState,properties.addressZipCode,properties.bikesAvailable,properties.closeTime,properties.docksAvailable,...,properties.kioskConnectionStatus,properties.kioskType,properties.latitude,properties.longitude,properties.hasGeofence,properties.classicBikesAvailable,properties.smartBikesAvailable,properties.electricBikesAvailable,properties.isArchived,properties.clientVersion
0,Feature,"[-118.25854, 34.0485]",Point,700 Flower St,DTLA,CA,90017,3,05:39:00,27,...,Active,1,34.0485,-118.25854,False,3,0,0,False,2.174.113
1,Feature,"[-118.25667, 34.04554]",Point,729 S Olive Street,DTLA,CA,90014,1,05:39:00,29,...,Active,1,34.04554,-118.25667,False,1,0,0,False,5.2.2.26


In [5]:
# next, we will only keep the columns relevant to our project.
stations_df = stations_df[
    [
        "properties.latitude",
        "properties.longitude",
        "properties.addressZipCode",
        "properties.kioskId"
    ]
]
stations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216 entries, 0 to 215
Data columns (total 4 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   properties.latitude        216 non-null    float64
 1   properties.longitude       216 non-null    float64
 2   properties.addressZipCode  216 non-null    object 
 3   properties.kioskId         216 non-null    int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 6.9+ KB


As we can see, our selected columns do not contain null values. Below, we rename the columns to follow our naming convention: the `properties.` prefix is dropped and camelCase names are converted to snake_case.

In [6]:
stations_df.columns = stations_df.columns.str.replace(
    "properties.", "", regex=False
).map(lambda name: re.sub("(?!^)([A-Z]+)", r"_\1", name).lower())
stations_df.head(2)

Unnamed: 0,latitude,longitude,address_zip_code,kiosk_id
0,34.0485,-118.25854,90017,3005
1,34.04554,-118.25667,90014,3006
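
To make the renaming step easier to follow, here is the same transformation applied to single column names, using only the standard library. The helper name `to_snake_case` is hypothetical; the regular expression is the one used in the cell above.

```python
import re


def to_snake_case(name, prefix="properties."):
    """Drop the prefix and convert a camelCase name to snake_case."""
    name = name.replace(prefix, "")
    # Insert an underscore before every run of uppercase letters
    # that does not start at the beginning of the name.
    return re.sub(r"(?!^)([A-Z]+)", r"_\1", name).lower()


for column in ["properties.addressZipCode", "properties.kioskId", "properties.latitude"]:
    print(column, "->", to_snake_case(column))
```

Note that the pattern treats each run of uppercase letters as one word, so names that begin with an uppercase acronym would probably not be split cleanly. None of the columns kept here are affected.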


Now we can plot the stations on a map to check for potential outliers.

In [7]:
la_map = folium.Map(
    location=(34.052235, -118.243683),  # map center: the original mean coordinate values
    zoom_start=11,
    control_scale=True,
    max_zoom=20,
)

for _, row in stations_df.iterrows():
    folium.Marker(
        location=[row["latitude"], row["longitude"]],
        popup=row["kiosk_id"],
        icon=folium.Icon(color="red"),
    ).add_to(la_map)


la_map

All stations are located in Los Angeles, so there are no obvious outliers.
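
The visual check could be complemented by a simple coordinate range test. The following is a sketch under assumptions: the function name `find_outliers` is hypothetical, and the bounding box is a rough, hand-picked rectangle around the Los Angeles area rather than an official boundary. The first two sample rows are taken from the output above; the third is a made-up station added only to show a flagged case.

```python
def find_outliers(stations, lat_range=(33.6, 34.9), lon_range=(-119.0, -117.6)):
    """Return the ids of stations whose coordinates fall outside the box.

    `stations` is an iterable of (station_id, latitude, longitude) tuples.
    The default ranges are assumed, approximate values.
    """
    return [
        station_id
        for station_id, lat, lon in stations
        if not (
            lat_range[0] <= lat <= lat_range[1]
            and lon_range[0] <= lon <= lon_range[1]
        )
    ]


sample = [
    (3005, 34.0485, -118.25854),
    (3006, 34.04554, -118.25667),
    (9999, 0.0, 0.0),  # made-up station with placeholder coordinates
]
print(find_outliers(sample))
```

The rows of the DataFrame could be passed in the same shape with `stations_df[["kiosk_id", "latitude", "longitude"]].itertuples(index=False, name=None)`.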
Lastly, we rename some of the columns and save the data as a pickle file.

In [8]:
stations_df = stations_df.rename(
    columns={"address_zip_code": "zip_code", "kiosk_id": "station_id"}
)

In [9]:
stations_df.to_pickle('../00_data/stations.pkl')