# Streetcar Delay Prediction - Data Preparation Geocode Specific

Use a dataset covering Toronto Transit Commission (TTC) streetcar delays from 2014 to the present to predict future delays and to recommend ways of avoiding them.

Source dataset: https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/#e8f359f0-2f47-3058-bf64-6ec488de52da

This notebook contains the data preparation steps specific to mapping free-form location descriptions to latitude and longitude:

- use the Google Maps API Web Services for Python (https://github.com/googlemaps/google-maps-services-python)
- generate the latitude and longitude values for each location and add them as new columns in the output dataset

# Streetcar routes

From https://www.ttc.ca/Routes/Streetcars.jsp

<table style="border: none" align="left">
   <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/ryanmark1867/streetcarnov3/master/streetcar%20routes.jpg" width="600" alt="Icon"> </th>
   </tr>
</table>

# Get path and load dataframe saved from previous data preparation step

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# import seaborn as sns
import datetime
import os

remove_bad_values = False
city_name = 'Toronto'


In [2]:
# get the directory that this notebook is in
rawpath = os.getcwd()
print("raw path is",rawpath)

raw path is t:\Documents\DataProjects\tutorial_keras\py_dl_for_structured_data\notebooks


In [3]:
# data is in a directory called "data" that is a sibling to the directory containing the notebook
path = os.path.abspath(os.path.join(rawpath, '..', 'data'))
print("path is", path)

path is t:\Documents\DataProjects\tutorial_keras\py_dl_for_structured_data\data


In [4]:
# constants for the streetcar problem
# same values saved in data_preparation notebook: pickled_input_dataframe, pickled_output_dataframe
pickled_data_file = '2014_2018.pkl'
#pickled_dataframe = '2014_2018_df.pkl'
pickled_dataframe = '2014_2018_df_cleaned_keep_bad_apr23.pkl'
pickled_output_dataframe = '2014_2018_df_cleaned_keep_bad_loc_geocoded.pkl'

In [5]:
file_name = os.path.join(path,pickled_dataframe)
df = pd.read_pickle(file_name)
df.head()

Report Date Time,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
2016-01-01 00:00:00,2016-01-01,505,00:00:00,Friday,dundas west stationt to broadview station,General Delay,7.0,14.0,w,4028,2016-01-01 00:00:00
2016-01-01 02:14:00,2016-01-01,511,02:14:00,Friday,fleet st. and strachan,Mechanical,10.0,20.0,e,4018,2016-01-01 02:14:00
2016-01-01 02:22:00,2016-01-01,301,02:22:00,Friday,queen st. west and roncesvalles,Mechanical,9.0,18.0,w,4201,2016-01-01 02:22:00
2016-01-01 03:28:00,2016-01-01,301,03:28:00,Friday,lake shore blvd. and superior st.,Mechanical,20.0,40.0,e,4251,2016-01-01 03:28:00
2016-01-01 14:28:00,2016-01-01,501,14:28:00,Friday,roncesvalles to neville park,Mechanical,6.0,12.0,e,4242,2016-01-01 14:28:00


In [6]:
df.shape

(69603, 11)

In [7]:
# create a dataframe containing just the unique Location values
loc_unique = df['Location'].unique().tolist()
print("loc_unique", loc_unique[0])
df_unique = pd.DataFrame(loc_unique, columns=['Location'])
df_unique.head()

loc_unique dundas west stationt to broadview station


Unnamed: 0,Location
0,dundas west stationt to broadview station
1,fleet st. and strachan
2,queen st. west and roncesvalles
3,lake shore blvd. and superior st.
4,roncesvalles to neville park


In [8]:
df_unique.shape

(10074, 1)

# Set up geocode

In [9]:
! pip install -U googlemaps

Collecting googlemaps
  Using cached googlemaps-4.4.2.tar.gz (29 kB)
Building wheels for collected packages: googlemaps
  Building wheel for googlemaps (setup.py): started
  Building wheel for googlemaps (setup.py): finished with status 'done'
  Created wheel for googlemaps: filename=googlemaps-4.4.2-py3-none-any.whl size=37864 sha256=b506762a8700027d6b45923fb4fe2f50e2c5745df1ac0b033f59f5f1561b4a5c
  Stored in directory: c:\users\user\appdata\local\pip\cache\wheels\7e\30\c7\07c30ff7be3c000ed5f8b2aad1083c8697a2afde133f58b5ca
Successfully built googlemaps
Installing collected packages: googlemaps
Successfully installed googlemaps-4.4.2


In [10]:
import googlemaps

# API key comes from https://developers.google.com/maps/documentation/embed/get-api-key
# NOTE: to run this code you will need to generate your own API key and enter it as the key value in the line below
gmaps = googlemaps.Client(key='')

# Geocoding an address
geocode_result = gmaps.geocode('lake shore blvd. and superior st., Toronto')

print("geocode result",geocode_result[0]["geometry"]["location"])

ValueError: Must provide API key or enterprise credentials when creating client.
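
The `ValueError` above is what the client raises when the key is left empty. One way to avoid pasting a key into the notebook is to read it from an environment variable. The sketch below assumes a variable named `GOOGLE_MAPS_API_KEY` and a helper called `load_api_key`; both names are hypothetical and not part of the original notebook.

```python
import os


def load_api_key(env_var="GOOGLE_MAPS_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # fail early with a clear message instead of creating a client with an empty key
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Google Maps API key"
        )
    return key


# Usage in the notebook (needs the googlemaps package and a valid key):
# gmaps = googlemaps.Client(key=load_api_key())
```

With this in place the key stays out of the saved notebook, and a missing key fails with a message that says what to set.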

In [11]:
# given an address / junction, return a list containing the latitude and longitude values returned by the geocode API

def get_geocode_result(junction):
    
    geo_string = junction+", "+city_name
    # print("geo_string is", geo_string)
    geocode_result = gmaps.geocode(geo_string)
    # check to see if the result is empty and if so return zeros to indicate unparseable junction value
    if len(geocode_result) > 0:
        locs = geocode_result[0]["geometry"]["location"]
        return [locs["lat"], locs["lng"]]
    else:
        return [0.0,0.0]
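
The parsing step in `get_geocode_result` can be tried without an API key by separating it from the network call. This is a minimal sketch: `extract_lat_lng` is a hypothetical helper, and `sample` is hand-written data in the same nested shape the function above indexes into, not real API output.

```python
def extract_lat_lng(geocode_result, missing=(0.0, 0.0)):
    """Pull [lat, lng] out of a geocode-style response list.

    Returns the `missing` sentinel (zeros, as in the notebook) when the
    response is empty or lacks the expected keys.
    """
    if not geocode_result:
        return list(missing)
    location = geocode_result[0].get("geometry", {}).get("location", {})
    if "lat" not in location or "lng" not in location:
        return list(missing)
    return [location["lat"], location["lng"]]


# hand-written sample in the shape the notebook expects (not real API output)
sample = [{"geometry": {"location": {"lat": 43.6, "lng": -79.4}}}]
print(extract_lat_lng(sample))
print(extract_lat_lng([]))
```

Keeping the zero sentinel matches the notebook, which later counts rows with a latitude of 0.0 to find locations that could not be geocoded.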



In [12]:
# test geocode api with value that will return empty result

locs = get_geocode_result("roncesvalles to longbranch")
print("locs ",locs)

NameError: name 'gmaps' is not defined

In [13]:
# test geocode api with value that will return non-empty result
get_geocode_result("queen and bathurst")[0]

NameError: name 'gmaps' is not defined

In [14]:
df.shape

(69603, 11)

In [15]:

# to avoid calling the geocode API more than once for the same location, geocode only the
# unique location values and store the result as a single list column; later cells split
# that column and merge the values back into the overall dataframe
df_unique['lat_long'] = df_unique.Location.apply(lambda s: get_geocode_result(s))



NameError: name 'gmaps' is not defined
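
The point of geocoding `df_unique` rather than `df` is that the 69603 delay rows contain only 10074 distinct locations, so each distinct string needs just one API call. The sketch below shows the idea with a stand-in geocoder; `geocode_unique` and `fake_geocode` are hypothetical names, and the stand-in records calls instead of contacting the API.

```python
def geocode_unique(locations, geocode_fn):
    """Call geocode_fn once per distinct location and return a lookup dict."""
    lookup = {}
    for loc in locations:
        if loc not in lookup:
            lookup[loc] = geocode_fn(loc)
    return lookup


calls = []


def fake_geocode(junction):
    # stand-in for get_geocode_result: records each call, returns the zero sentinel
    calls.append(junction)
    return [0.0, 0.0]


delays = ["queen and bathurst", "main station", "queen and bathurst"]
lookup = geocode_unique(delays, fake_geocode)
print(len(delays), "rows,", len(calls), "geocode calls")
```

In the notebook the same effect comes from applying the function to the unique-location dataframe and merging the result back on `Location`.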

In [16]:
df_unique.head()

Unnamed: 0,Location
0,dundas west stationt to broadview station
1,fleet st. and strachan
2,queen st. west and roncesvalles
3,lake shore blvd. and superior st.
4,roncesvalles to neville park


In [17]:
df_unique.shape

(10074, 1)

In [18]:
# derive latitude and longitude columns from the lat_long list column
df_unique["latitude"] = df_unique["lat_long"].str[0]
df_unique["longitude"] = df_unique["lat_long"].str[1]
df_unique.head()

KeyError: 'lat_long'

In [19]:
df_unique.shape

(10074, 1)

In [20]:
# left-join df_unique onto the original df on the Location column so that every row in df gets latitude and longitude columns
df_out = pd.merge(df, df_unique, on="Location", how='left')
df_out.head()

Unnamed: 0,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
0,2016-01-01,505,00:00:00,Friday,dundas west stationt to broadview station,General Delay,7.0,14.0,w,4028,2016-01-01 00:00:00
1,2016-01-01,511,02:14:00,Friday,fleet st. and strachan,Mechanical,10.0,20.0,e,4018,2016-01-01 02:14:00
2,2016-01-01,301,02:22:00,Friday,queen st. west and roncesvalles,Mechanical,9.0,18.0,w,4201,2016-01-01 02:22:00
3,2016-01-01,301,03:28:00,Friday,lake shore blvd. and superior st.,Mechanical,20.0,40.0,e,4251,2016-01-01 03:28:00
4,2016-01-01,501,14:28:00,Friday,roncesvalles to neville park,Mechanical,6.0,12.0,e,4242,2016-01-01 14:28:00


In [21]:
df_out.head(30)

Unnamed: 0,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
0,2016-01-01,505,00:00:00,Friday,dundas west stationt to broadview station,General Delay,7.0,14.0,w,4028,2016-01-01 00:00:00
1,2016-01-01,511,02:14:00,Friday,fleet st. and strachan,Mechanical,10.0,20.0,e,4018,2016-01-01 02:14:00
2,2016-01-01,301,02:22:00,Friday,queen st. west and roncesvalles,Mechanical,9.0,18.0,w,4201,2016-01-01 02:22:00
3,2016-01-01,301,03:28:00,Friday,lake shore blvd. and superior st.,Mechanical,20.0,40.0,e,4251,2016-01-01 03:28:00
4,2016-01-01,501,14:28:00,Friday,roncesvalles to neville park,Mechanical,6.0,12.0,e,4242,2016-01-01 14:28:00
5,2016-01-01,505,15:42:00,Friday,broadview station loop,Investigation,4.0,10.0,w,4187,2016-01-01 15:42:00
6,2016-01-01,504,15:54:00,Friday,broadview and queen,Mechanical,6.0,12.0,e,4181,2016-01-01 15:54:00
7,2016-01-01,501,16:05:00,Friday,roncesvalles to humber loop,Mechanical,6.0,12.0,w,4245,2016-01-01 16:05:00
8,2016-01-01,506,16:27:00,Friday,main station,Mechanical,8.0,16.0,w,4092,2016-01-01 16:27:00
9,2016-01-01,510,16:34:00,Friday,richmond st. and spadina,Diversion,41.0,46.0,s,bad vehicle,2016-01-01 16:34:00


In [22]:
df_out.shape

(69603, 11)

In [23]:
print("Bad latitude count:",df_out[df_out.latitude == 0.0].shape[0])

AttributeError: 'DataFrame' object has no attribute 'latitude'
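
Because the geocoding cell failed without an API key, the split, merge and count cells above had no `lat_long` column to work with. The sketch below runs the same three steps on a tiny made-up dataframe so the intended flow is visible. The `_demo` names and all coordinate values are invented for illustration.

```python
import pandas as pd

# made-up stand-ins for df and df_unique
df_demo = pd.DataFrame({"Location": ["a and b", "c and d", "a and b"]})
df_unique_demo = pd.DataFrame({
    "Location": ["a and b", "c and d"],
    "lat_long": [[43.65, -79.38], [0.0, 0.0]],  # second entry is the "not found" sentinel
})

# split the list column into two scalar columns, as in the notebook
df_unique_demo["latitude"] = df_unique_demo["lat_long"].str[0]
df_unique_demo["longitude"] = df_unique_demo["lat_long"].str[1]

# left merge keeps every delay row and attaches the coordinates for its location
merged = pd.merge(
    df_demo,
    df_unique_demo.drop(columns="lat_long"),
    on="Location",
    how="left",
)

# rows whose location could not be geocoded carry the 0.0 sentinel
bad = int((merged["latitude"] == 0.0).sum())
print(merged)
print("rows with sentinel latitude:", bad)
```

The row count of `merged` equals that of `df_demo` because each location appears once in the unique table, which is the same property the notebook relies on when it compares `df_out.shape` with `df.shape`.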

# Remove bad rows

In [24]:
print("Location count post cleanup:",df['Location'].nunique())
print("Route count post cleanup:",df['Route'].nunique())
print("Direction count post cleanup:",df['Direction'].nunique())
print("Vehicle count post cleanup:",df['Vehicle'].nunique())
# print("Bad Location count":df[df.Vehicle == 'bad vehicle'].shape[0])
print("Bad route count:",df[df.Route == 'bad route'].shape[0])
print("Bad direction count:",df[df.Direction == 'bad direction'].shape[0])
print("Bad vehicle count:",df[df.Vehicle == 'bad vehicle'].shape[0])

Location count post cleanup: 10074
Route count post cleanup: 15
Direction count post cleanup: 6
Vehicle count post cleanup: 1017
Bad route count: 2370
Bad direction count: 302
Bad vehicle count: 11221


In [25]:
# optionally remove rows with bad vehicle, direction, or route values
if remove_bad_values:
    df = df[df.Vehicle != 'bad vehicle']
    df = df[df.Direction != 'bad direction']
    df = df[df.Route != 'bad route']
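
The three filters above can also be expressed as one boolean mask built from a mapping of column to sentinel value. This is a sketch on made-up rows, not the notebook's data; `demo` and `sentinels` are hypothetical names, while the sentinel strings are the ones used in the cells above.

```python
import pandas as pd

# made-up rows; only the first has no sentinel value
demo = pd.DataFrame({
    "Route": ["501", "bad route", "504"],
    "Direction": ["e", "w", "bad direction"],
    "Vehicle": ["4242", "4018", "bad vehicle"],
})

sentinels = {
    "Route": "bad route",
    "Direction": "bad direction",
    "Vehicle": "bad vehicle",
}

# start with "keep everything" and drop a row as soon as any column holds its sentinel
keep = pd.Series(True, index=demo.index)
for column, bad_value in sentinels.items():
    keep &= demo[column] != bad_value

cleaned = demo[keep]
print(cleaned)
```

Note that in the notebook this filtering is applied to `df`, while the dataframe that gets pickled later is `df_out`, so the filter would need to be applied to `df_out` as well to affect the saved file.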

In [26]:
df.shape

(69603, 11)

In [27]:
pickled_output_dataframe

'2014_2018_df_cleaned_keep_bad_loc_geocoded.pkl'

In [28]:
# pickle the cleansed dataframe
file_name = os.path.join(path, pickled_output_dataframe)
df_out.to_pickle(file_name)
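
The `path` printed earlier has no trailing separator, so building an output file name by plain string concatenation would fuse the directory name and the file name. A small sketch of the difference, using a hypothetical directory and file name:

```python
import os

data_dir = os.path.join("projects", "data")  # hypothetical directory, no trailing separator
out_name = "example_output.pkl"              # hypothetical file name

concatenated = data_dir + out_name           # separator is missing
joined = os.path.join(data_dir, out_name)    # separator inserted for the current OS

print(concatenated)
print(joined)
```

With concatenation the file lands next to the `data` directory under a fused name, while `os.path.join` places it inside the directory.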

In [29]:
dfn = pd.read_pickle(file_name)
dfn.head()

Unnamed: 0,Report Date,Route,Time,Day,Location,Incident,Min Delay,Min Gap,Direction,Vehicle,Report Date Time
0,2016-01-01,505,00:00:00,Friday,dundas west stationt to broadview station,General Delay,7.0,14.0,w,4028,2016-01-01 00:00:00
1,2016-01-01,511,02:14:00,Friday,fleet st. and strachan,Mechanical,10.0,20.0,e,4018,2016-01-01 02:14:00
2,2016-01-01,301,02:22:00,Friday,queen st. west and roncesvalles,Mechanical,9.0,18.0,w,4201,2016-01-01 02:22:00
3,2016-01-01,301,03:28:00,Friday,lake shore blvd. and superior st.,Mechanical,20.0,40.0,e,4251,2016-01-01 03:28:00
4,2016-01-01,501,14:28:00,Friday,roncesvalles to neville park,Mechanical,6.0,12.0,e,4242,2016-01-01 14:28:00


In [30]:
dfn.shape

(69603, 11)

In [31]:
file_outname = "2014_2018_df_cleaned_keep_bad_loc_geocoded_apr29.csv"
dfn.to_csv(os.path.join(path, file_outname))

# Visualize cleaned data

In [32]:
!pip install pixiedust

Collecting pixiedust
  Using cached pixiedust-1.1.18.tar.gz (197 kB)
Collecting mpld3
  Using cached mpld3-0.5.1.tar.gz (1.0 MB)
Collecting lxml
  Downloading lxml-4.5.2-cp37-cp37m-win_amd64.whl (3.5 MB)
Collecting geojson
  Using cached geojson-2.5.0-py2.py3-none-any.whl (14 kB)
Collecting colour
  Using cached colour-0.1.5-py2.py3-none-any.whl (23 kB)
Collecting jinja2
  Downloading Jinja2-2.11.2-py2.py3-none-any.whl (125 kB)
Collecting MarkupSafe>=0.23
  Downloading MarkupSafe-1.1.1-cp37-cp37m-win_amd64.whl (16 kB)
Building wheels for collected packages: pixiedust, mpld3
  Building wheel for pixiedust (setup.py): started
  Building wheel for pixiedust (setup.py): finished with status 'done'
  Created wheel for pixiedust: filename=pixiedust-1.1.18-py3-none-any.whl size=321731 sha256=1894ee6e7cc387722947385a50a8166e83962f5a9ef8a87f45f5aa2756e77023
  Stored in directory: c:\users\user\appdata\local\pip\cache\wheels\41\4c\20\08a843440aaeffc976c1848c9eb44be6ec68dcd964421ec6f7
  Building 

In [33]:
import pixiedust

Pixiedust database opened successfully


In [34]:
display(df)

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>