## Optimization of Employee Shuttle Stops

by Armin, Jan 2017

With the explosion of user location data, data science can be used to optimize many of the services cities offer to their citizens. Transportation optimization is one example, and there are many other possible applications. This often goes under the buzzword "smart city", and it is one of the most interesting future applications of data science. Here we explore one example.

It is common practice for tech companies to use shuttle buses to ferry their employees from home to the office. The goal of this work is to figure out the optimal stops for a shuttle bus. The company is based in Mountain View and the shuttle provides transportation for employees based in San Francisco.

The city of San Francisco has given the company a list of potential bus stop locations to choose from and asked it to use no more than 10 stops within the city. We have been given the home addresses of all employees interested in taking the shuttle and asked to come up with the ten most efficient stops. While we have some freedom in defining what "efficient" means, the general consensus within the company is that the best way to select the bus stops is to minimize the overall walking distance between employee homes and the closest bus stop.
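
One way to make this criterion concrete is to score a candidate set of stops by the summed distance from each home to its nearest stop. The sketch below is an illustration, not part of the original analysis: it assumes straight-line (haversine, spherical Earth) distance as a stand-in for real walking distance, and the names `haversine_km` and `total_walking_km` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def total_walking_km(homes, chosen_stops):
    """Sum, over all homes, of the distance to the nearest chosen stop."""
    return sum(min(haversine_km(home, stop) for stop in chosen_stops)
               for home in homes)
```

A lower score means a better set of stops under this definition, so two candidate sets of ten stops can be compared directly by calling `total_walking_km` on each.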

Here I look for 10 stops that are likely to be close to the best choice. Let's first load and explore the data.


In [29]:
import pandas as pd
import numpy as np

employees = pd.read_csv('data/Bus/Employee_Addresses.csv')
stops = pd.read_csv('data/Bus/Potentail_Bust_Stops.csv')


In [30]:
employees.head(1)

Unnamed: 0,address,employee_id
0,"98 Edinburgh St, San Francisco, CA 94112, USA",206


In [31]:
employees.isnull().sum()

address        0
employee_id    0
dtype: int64

In [32]:
stops.head(1)

Unnamed: 0,Street_One,Street_Two
0,MISSION ST,ITALY AVE


In [33]:
stops.isnull().sum()

Street_One    0
Street_Two    0
dtype: int64

In [63]:
# number of employee addresses to geocode
len(employees)

2191

This particular dataset is clean. To calculate distances, we have to geocode the employee home addresses and the potential shuttle stop locations. We're limited to 2,500 requests per day for the Google Geocoding API, and sending many requests at the same time would get us blocked. So I'll send requests in batches, with a 1-2 second delay between requests.




In [209]:
from geopy.geocoders import GoogleV3
from geopy.distance import vincenty
from time import sleep
from geopy.exc import GeocoderTimedOut
import random


# for each employee geocode their address 

geolocator = GoogleV3()

#using apply will result in too many requests for google maps api so will have to 
#loop through the list and introduce a delay between each request 

#employees['coords'] = employees["address"].apply(lambda x: geolocator.geocode(x) )

def getGeoCode(i):
    #geocode one employee address; retry if the request times out
    try:
        coords = geolocator.geocode(employees.iloc[i].address)
        print("{} {}".format(coords.latitude, coords.longitude))
        sleep(random.randint(1,2))  #pause 1-2 seconds between requests

    except GeocoderTimedOut:
        print("GeocoderTimedOut")
        coords = getGeoCode(i)

    return coords

#work with small sample for now, just enough to make clustering work
for i in range(0,1):
    coords = getGeoCode(i)
    #.loc replaces the deprecated .ix indexer
    employees.loc[i,'lat'] = coords.latitude
    employees.loc[i,'lon'] = coords.longitude


GeocoderTimedOut
37.7274747 -122.4273257
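
The recursive retry above keeps calling itself for as long as the service times out. A bounded retry with a growing delay is one possible alternative. The sketch below is hypothetical and independent of geopy: `call_with_retries` and its parameters are illustrative names, and the `sleep` argument exists only so the delay can be stubbed out.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception, wait and retry up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With the objects defined in this notebook, a call might look like `call_with_retries(lambda: geolocator.geocode(address), retry_on=(GeocoderTimedOut,))`, where `address` is a single address string.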


Next, geocode the potential shuttle stop locations, each of which is given as the intersection of two streets.

In [132]:
def getGeocodeForIntersections(i):
    #intersection = geolocator.geocode("MISSION ST at ITALY AVE")

    #geocode one intersection; retry if the request times out
    try:
        intersection = geolocator.geocode(stops.iloc[i].Street_One+" at "+stops.iloc[i].Street_Two)
        print("{} {}".format(intersection.latitude, intersection.longitude))
        sleep(random.randint(1,2))  #pause 1-2 seconds between requests

    except GeocoderTimedOut:
        print("GeocoderTimedOut")
        intersection = getGeocodeForIntersections(i)

    return intersection



# for each potential bus stop geocode the intersection
for i in range(0,len(stops)):
    intersection = getGeocodeForIntersections(i)
    #.loc replaces the deprecated .ix indexer
    stops.loc[i,'lat'] = intersection.latitude
    stops.loc[i,'lon'] = intersection.longitude

37.7184779 -122.4395356


Save the geocoded data to avoid re-geocoding in the future.

In [202]:
import pickle 

with open('data/Bus/employees_geocoded.pickle', 'wb') as handle:
    pickle.dump(employees, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open('data/Bus/stops_geocoded.pickle', 'wb') as handle:
    pickle.dump(stops, handle, protocol=pickle.HIGHEST_PROTOCOL)
     

In [208]:
## ONLY if the data is already geocoded
## (the files were written in binary mode, so they must be read with 'rb')

# with open('data/Bus/employees_geocoded.pickle', 'rb') as handle:
#     t1 = pickle.load(handle)
#
# with open('data/Bus/stops_geocoded.pickle', 'rb') as handle:
#     t2 = pickle.load(handle)

Plot the employee locations and the candidate intersections on a map.

In [194]:
import folium



emp_subset_with_geo = employees[employees.lon.notnull()]

#center the map on the first geocoded employee (.loc replaces the deprecated .ix indexer)
map_osm = folium.Map(location=[emp_subset_with_geo.loc[0,'lat'], emp_subset_with_geo.loc[0,'lon']],zoom_start=13)


#add employee markers to the map
for index, row in emp_subset_with_geo.iterrows():
    #print row.lat
    folium.CircleMarker(location=[row.lat, row.lon], radius=50,
                    color='#3186cc',
                    fill_color='#3186cc',popup="Employee House").add_to(map_osm)



#add stop markers to the map
for index, row in stops[stops.lon.notnull()].iterrows():
#     folium.Marker([row.lat, row.lon]).add_to(map_osm)
    folium.CircleMarker(location=[row.lat, row.lon], radius=150,
                    color='#BA2121',
                    fill_color='#BA2121',popup="Potential Shuttle Stop").add_to(map_osm)

map_osm
# map_osm.save('osm.html')


#### Clustering employee houses

One way to approach the problem is to group employees by how close together they live, and then assign a shuttle stop to each group. Let's use K-means to cluster the employee data by latitude and longitude.

In [192]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=10)
kmeans.fit(emp_subset_with_geo[["lat","lon"]])

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=10, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
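
To show what the fit is doing, here is a minimal sketch of the basic K-means iteration (Lloyd's algorithm) in plain Python. It is a simplification and not scikit-learn's implementation: as the printed parameters above indicate, `KMeans` uses `k-means++` seeding and 10 restarts, whereas this hypothetical `kmeans_sketch` function seeds with the first `k` points and runs once. Like the cell above, it treats latitude and longitude as plain 2-D coordinates.

```python
def kmeans_sketch(points, k, iterations=20):
    """Plain Lloyd's algorithm on 2-D points, seeded with the first k points."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: (p[0] - centers[j][0]) ** 2 + (p[1] - centers[j][1]) ** 2,
            )
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for j, group in enumerate(groups):
            if group:
                new_centers.append((
                    sum(p[0] for p in group) / float(len(group)),
                    sum(p[1] for p in group) / float(len(group)),
                ))
            else:
                new_centers.append(centers[j])  # keep the center of an empty group
        if new_centers == centers:
            break  # assignments are stable, so the algorithm has converged
        centers = new_centers
    return centers
```

Because the result depends on the starting centers, different seeds can give different clusters. That is the reason scikit-learn runs several initializations and keeps the best one.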

In [195]:
for c in kmeans.cluster_centers_:
    folium.CircleMarker(location=[c[0], c[1]], radius=100,
                    color='#008000',popup="Cluster Center").add_to(map_osm)    
map_osm

In the map above, cluster centers are marked in green.

#### Finding the right stop for each cluster center

Now that we have grouped the employees, we have to find the closest potential shuttle stop for each group. There may be more efficient ways to do this, but for now I compute the distance between each cluster center and every potential shuttle stop and then choose the closest one.

In [149]:
#find the closest bus stop to each cluster center

df1 = pd.DataFrame({'key':[1]*len(stops), 'potential_lat':stops.lat,'potential_lon':stops.lon})
df2 = pd.DataFrame({'key':[1]*10, 'cluster_lat':kmeans.cluster_centers_[:,0],'cluster_lon':kmeans.cluster_centers_[:,1]})

#cartesian product 
stops_n_clusters= pd.merge(df1, df2,on='key')[['cluster_lat',"cluster_lon",'potential_lat', 'potential_lon']]

In [150]:
print("{} {} {}".format(len(df1), len(df2), len(stops_n_clusters)))

119 10 1190


In [166]:
stops_n_clusters['distance'] = stops_n_clusters.apply(lambda row:  
                                                      vincenty((row.cluster_lat,row.cluster_lon), 
                                                               (row.potential_lat,row.potential_lon)).kilometers, 
                                                      axis=1)

In [167]:
#distance is in kilometers
stops_n_clusters.head()

Unnamed: 0,cluster_lat,cluster_lon,potential_lat,potential_lon,distance
0,37.769754,-122.41442,37.718478,-122.439536,6.106509
1,37.728282,-122.427884,37.718478,-122.439536,1.496433
2,37.793707,-122.441888,37.718478,-122.439536,8.352388
3,37.74093,-122.425326,37.718478,-122.439536,2.789111
4,37.709849,-122.412542,37.718478,-122.439536,2.565569


In [181]:
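stops_n_clusters.sort_values(by="distance", axis=0,inplace=True)
#rows are sorted by distance and nth is zero-based, so nth(0) picks the closest stop for each cluster
recommended_markers = stops_n_clusters.groupby(["cluster_lat","cluster_lon"]).nth(0)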
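
Choosing the closest stop for each cluster amounts to a per-group minimum over the distance table. The plain-Python sketch below shows that logic on a small made-up table. The function names and the `(cluster_id, stop_id, distance_km)` row layout are assumptions for illustration, not the notebook's actual columns.

```python
def closest_stop_per_cluster(rows):
    """rows: iterable of (cluster_id, stop_id, distance_km) tuples.

    Returns {cluster_id: (stop_id, distance_km)} for the nearest stop.
    """
    best = {}
    for cluster_id, stop_id, distance_km in rows:
        if cluster_id not in best or distance_km < best[cluster_id][1]:
            best[cluster_id] = (stop_id, distance_km)
    return best


def distinct_stops(best):
    """Set of stops actually selected; may be smaller than the cluster count."""
    return set(stop_id for stop_id, _ in best.values())


# Made-up example: both clusters are nearest to stop "s2".
example_rows = [("c1", "s1", 2.0), ("c1", "s2", 0.5),
                ("c2", "s2", 0.7), ("c2", "s3", 1.5)]
example_best = closest_stop_per_cluster(example_rows)
```

In this example `distinct_stops(example_best)` contains a single stop, which illustrates a caveat of the approach: two neighboring clusters can select the same intersection, so the number of distinct recommended stops can fall below ten.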


In [196]:
for index, row in recommended_markers.iterrows():
     folium.Marker([row.potential_lat, row.potential_lon],popup='Recommended Shuttle Stop').add_to(map_osm)

# map_osm.save('recommended_shuttle_stops.html')


In [197]:
map_osm        


Here we have our shuttle stop recommendations; they're marked as pins.