### Bike Sharing Station Data Collection & Preparation
In this notebook, we will download data on the bicycle sharing stations that are referenced by the trip data.
The station data is publicly available as GeoJSON [here](http://bikeshare.metro.net/stations/json/).

In [1]:
import urllib.request
import json
import pandas as pd
import folium
import re

In [2]:
# Disable SSL certificate verification (workaround for certificate errors with Python on Windows)
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

First, we emulate a real browser by sending the headers defined below with the request for the station data. Without these
headers, Metro Bike Share blocks the request with a 403 Forbidden error.

In [3]:
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

req = urllib.request.Request('http://bikeshare.metro.net/stations/json/', headers=headers)

# Skip the download if the data is already in memory from an earlier run of this cell.
try:
    stations_data_json
    request_complete = True
except NameError:
    request_complete = False

if not request_complete:
    print("Downloading station data")
    with urllib.request.urlopen(req) as response:
        stations_data_json = json.loads(response.read().decode())

Downloading station data


In [17]:
stations_df = pd.json_normalize(stations_data_json['features'])
stations_df.head()

Unnamed: 0,type,geometry.coordinates,geometry.type,properties.addressStreet,properties.addressCity,properties.addressState,properties.addressZipCode,properties.bikesAvailable,properties.closeTime,properties.docksAvailable,...,properties.kioskConnectionStatus,properties.kioskType,properties.latitude,properties.longitude,properties.hasGeofence,properties.classicBikesAvailable,properties.smartBikesAvailable,properties.electricBikesAvailable,properties.isArchived,properties.clientVersion
0,Feature,"[-118.25854, 34.0485]",Point,700 Flower St,DTLA,CA,90017,6,05:39:00,24,...,Active,1,34.0485,-118.25854,False,5,0,0,False,5.2.2.26
1,Feature,"[-118.25667, 34.04554]",Point,729 S Olive Street,DTLA,CA,90014,9,05:39:00,22,...,Active,1,34.04554,-118.25667,False,9,0,0,False,5.2.2.26
2,Feature,"[-118.25459, 34.05048]",Point,557 S 5th Street,DTLA,CA,90071,7,05:39:00,16,...,Active,1,34.05048,-118.25459,False,6,0,0,False,5.2.2.26
3,Feature,"[-118.26273, 34.04661]",Point,865 S Figueroa Street,DTLA,CA,90017,4,05:39:00,11,...,Active,1,34.04661,-118.26273,False,4,0,0,False,5.2.2.36
4,Feature,"[-118.25487, 34.03705]",Point,401 East 11th Street,DTLA,CA,90015,3,05:39:00,12,...,Active,1,34.03705,-118.25487,False,3,0,0,False,5.2.2.26
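
The dotted column names above come from how `pd.json_normalize` flattens nested dictionaries. The following is a minimal sketch on a hypothetical, heavily trimmed feature list (only a few keys, with values taken from the outputs in this notebook); the real feed contains many more properties.

```python
import pandas as pd

# Hypothetical, trimmed-down GeoJSON features for illustration only.
sample_features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-118.25854, 34.0485]},
        "properties": {"kioskId": 3005, "totalDocks": 31},
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-118.25667, 34.04554]},
        "properties": {"kioskId": 3006, "totalDocks": 31},
    },
]

# Nested dictionaries become dotted column names; lists stay as single cell values.
sample_df = pd.json_normalize(sample_features)
print(sorted(sample_df.columns))
```

Each nested key is joined to its parent with a dot, which is why every station attribute ends up under a `properties.` prefix.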


Now we will only keep the columns relevant to our project.

In [18]:
stations_df = stations_df[
    [
        "properties.latitude", # used
        "properties.longitude", # used
        "properties.addressZipCode",
        "properties.totalDocks",
        "properties.isEventBased",
        "properties.isVirtual",
        "properties.isVisible",
        "properties.kioskId",
        "properties.openTime", # used
        "properties.closeTime", # used
        "properties.hasGeofence"
    ]
].copy()  # copy to avoid SettingWithCopyWarning when adding columns later
stations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   properties.latitude        225 non-null    float64
 1   properties.longitude       225 non-null    float64
 2   properties.addressZipCode  225 non-null    object 
 3   properties.totalDocks      225 non-null    int64  
 4   properties.isEventBased    225 non-null    bool   
 5   properties.isVirtual       225 non-null    bool   
 6   properties.isVisible       225 non-null    bool   
 7   properties.kioskId         225 non-null    int64  
 8   properties.openTime        225 non-null    object 
 9   properties.closeTime       225 non-null    object 
 10  properties.hasGeofence     225 non-null    bool   
dtypes: bool(4), float64(2), int64(2), object(3)
memory usage: 13.3+ KB


As we can see, none of the selected columns contain null values. Below, we rename the columns to match our naming convention.

In [33]:
stations_df.columns = stations_df.columns.str.replace(
    "properties.", "", regex=False
).map(lambda name: re.sub("(?!^)([A-Z]+)", r"_\1", name).lower())
stations_df.head(2)

Unnamed: 0,latitude,longitude,address_zip_code,total_docks,is_event_based,is_virtual,is_visible,kiosk_id,open_time,close_time,has_geofence
0,34.0485,-118.25854,90017,31,False,False,False,3005,05:45:00,05:39:00,False
1,34.04554,-118.25667,90014,31,False,False,False,3006,05:45:00,05:39:00,False
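
To make the renaming rule easier to follow, here is a small sketch that applies the same prefix removal and regular expression to individual names. The helper name `to_snake_case` is hypothetical and only used for illustration.

```python
import re


def to_snake_case(name: str) -> str:
    """Drop the 'properties.' prefix and convert camelCase to snake_case."""
    name = name.replace("properties.", "")
    # Insert an underscore before each run of capitals, except at the start of the name.
    return re.sub(r"(?!^)([A-Z]+)", r"_\1", name).lower()


for original in [
    "properties.addressZipCode",
    "properties.kioskId",
    "properties.isEventBased",
]:
    print(original, "->", to_snake_case(original))
```

The `(?!^)` lookahead prevents a leading underscore when a name starts with a capital letter. Because `[A-Z]+` matches a whole run of capitals, an acronym inside a name receives a single underscore (for example, `addressURL` would become `address_url`).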


Now we can draw the stations on a map to look for potential outliers.

In [34]:
la_map = folium.Map(
    location=(34.052235, -118.243683),  # the orig mean values as location coordinates
    zoom_start=11,
    control_scale=True,
    max_zoom=20,
)

for _, row in stations_df.iterrows():
    folium.Marker(
        location=[row["latitude"], row["longitude"]],
        popup=row["kiosk_id"],
        icon=folium.Icon(color="red"),
    ).add_to(la_map)


la_map
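
The visual check can be complemented by a programmatic one. The sketch below flags rows whose coordinates fall outside a rough bounding box. The box limits, the helper name `find_geo_outliers`, and the toy data (including the made-up station `9999`) are assumptions for illustration and not part of the original analysis.

```python
import pandas as pd

# Assumed, approximate bounding box around the Los Angeles area (not from the data source).
LAT_MIN, LAT_MAX = 33.7, 34.4
LON_MIN, LON_MAX = -118.7, -118.1


def find_geo_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose coordinates lie outside the assumed bounding box."""
    inside = df["latitude"].between(LAT_MIN, LAT_MAX) & df["longitude"].between(
        LON_MIN, LON_MAX
    )
    return df[~inside]


# Toy data: two real-looking stations and one hypothetical station at (0, 0).
toy_stations = pd.DataFrame(
    {
        "kiosk_id": [3005, 3006, 9999],
        "latitude": [34.0485, 34.04554, 0.0],
        "longitude": [-118.25854, -118.25667, 0.0],
    }
)

outliers = find_geo_outliers(toy_stations)
print(outliers["kiosk_id"].tolist())
```

Calling `find_geo_outliers(stations_df)` on the renamed station table would return an empty DataFrame if every station lies inside the chosen box.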


All stations are located in Los Angeles, so there are no geographic outliers.
Lastly, we split the opening and closing times into hour and minute features, rename some of the columns, and save the
data as a pickle file.

In [7]:
# Parse the "HH:MM:SS" strings and split them into hour and minute features.
for prefix in ("open_time", "close_time"):
    parsed = pd.to_datetime(stations_df[prefix], format="%H:%M:%S")
    stations_df[f"{prefix}_hour"] = parsed.dt.hour
    stations_df[f"{prefix}_minute"] = parsed.dt.minute


In [8]:
stations_df.rename(
    columns={"address_zip_code": "zip_code", "kiosk_id": "station_id"}, inplace=True
)


In [9]:
stations_df.to_pickle('../00_data/stations.pkl')
