<img src="https://upload.wikimedia.org/wikipedia/commons/5/53/Bixi_logo.svg" alt="Drawing" align="left" style="width: 800px;"/>

# [BIXI Montreal](https://montreal.bixi.com) - Clustering Stations Based on Nearby Venues


This is the capstone project for course 9 of the [IBM Data Science Professional Certificate](https://www.coursera.org/specializations/ibm-data-science-professional-certificate) specialization.
In this notebook we acquire [trips and stations data](https://montreal.bixi.com/en/open-data) from the BIXI Montreal bike-sharing system. We then choose a station and cluster its destination stations based on the surrounding venues, which we collect through the Foursquare API.

Let's start by loading all the necessary libraries:

In [1]:
import numpy as np 

import pandas as pd 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import matplotlib.cm as cm
import matplotlib.colors as colors

import json # library to handle JSON files

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

from sklearn.cluster import KMeans # import k-means from clustering stage

import folium # map rendering library  #!conda install -c conda-forge folium=0.5.0 --yes

print('Libraries imported.')

Libraries imported.


## Loading and Preprocessing Data

#### Merging the monthly trips


The [open data](https://montreal.bixi.com/en/open-data) on the BIXI website is published as one file per month. Here I merge the monthly files into a single dataframe holding all of the 2018 trips:

In [2]:
%%time
# read the monthly files (April to November 2018) and concatenate them in one pass;
# DataFrame.append is no longer available in pandas 2.x, so pd.concat is used instead
monthly_trips = [pd.read_csv('OD_2018-{:02d}.csv'.format(m)) for m in range(4, 12)]
trips2018 = pd.concat(monthly_trips, ignore_index=True)
trips2018.to_csv('trips2018.csv', index=False)


Wall time: 35.1 s
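The stacking step can be tried on toy data before running it on the real files. The sketch below is only an illustration: the two in-memory CSV strings are made-up stand-ins for the monthly `OD_2018-*.csv` files, and `pd.concat` is used because it works in both old and recent pandas versions.

```python
import io
import pandas as pd

# Two tiny in-memory "monthly" files (made-up rows, same kind of columns as the trips data)
april = io.StringIO("start_station_code,end_station_code,duration_sec\n6175,6009,300\n")
may = io.StringIO("start_station_code,end_station_code,duration_sec\n6175,6902,420\n6154,6211,924\n")

# Read every file first, then concatenate once; ignore_index renumbers the rows from 0
frames = [pd.read_csv(f) for f in (april, may)]
trips = pd.concat(frames, ignore_index=True)
print(trips.shape)
```

Concatenating a list once also avoids copying the growing dataframe at every iteration, which is what a loop of repeated appends does.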


In [3]:
trips2018.tail()

Unnamed: 0,start_date,start_station_code,end_date,end_station_code,duration_sec,is_member
5277536,2018-11-15 23:55,6154,2018-11-16 00:11,6211,924,1
5277537,2018-11-15 23:56,6184,2018-11-16 00:00,6164,258,1
5277538,2018-11-15 23:56,6734,2018-11-15 23:59,6324,197,1
5277539,2018-11-15 23:57,6255,2018-11-16 00:00,6270,207,1
5277540,2018-11-15 23:57,6248,2018-11-16 00:07,6017,565,1


So there were more than 5 million trips in 2018. Next, we load the stations data:

In [4]:
station2018 = pd.read_csv('Stations_2018.csv')
station2018.head(3)

Unnamed: 0,code,name,latitude,longitude
0,7030,de Bordeaux / Marie-Anne,45.533409,-73.570657
1,6141,de Bordeaux / Rachel,45.53227,-73.56828
2,6100,Mackay / de Maisonneuve,45.49659,-73.57851


#### Combining trips and stations data and keeping the necessary columns

Now we need to add the stations' names and locations to the trips data. To do so, we join the two dataframes on the station codes, once for the start station and once for the end station. We won't use the dates or the membership status, so we keep only the necessary columns of the result. The following _merge_ function does exactly that:

In [5]:
def merge(trip, station):
    """ merge the trips with the stations data and keep only the columns necessary for our task """

    # merge trip with station information
    aux1 = pd.merge(left=trip, right=station, how='left',
                          left_on='start_station_code', right_on='code')
    trip_station = pd.merge(left=aux1, right=station, how='left',
                            left_on='end_station_code', right_on='code',
                            suffixes=('_start', '_end'))

    columns = [ 'start_station_code', 'name_start',
               'latitude_start', 'longitude_start',
               'end_station_code', 'name_end', 'latitude_end',
               'longitude_end','duration_sec']
    trip_station = trip_station[columns]

    return trip_station

Let's check what the final dataframe looks like:

In [6]:
data = merge(trips2018, station2018)
data.tail(3)

Unnamed: 0,start_station_code,name_start,latitude_start,longitude_start,end_station_code,name_end,latitude_end,longitude_end,duration_sec
5277538,6734,Lajeunesse / Villeray (place Tapéo),45.542119,-73.622547,6324,de Liège / Lajeunesse,45.545604,-73.63474,197
5277539,6255,Boyer / St-Zotique,45.53848,-73.60556,6270,Fabre / St-Zotique,45.543452,-73.60101,207
5277540,6248,St-Dominique / Rachel,45.518593,-73.581566,6017,du Square Ahmerst / Wolfe,45.521015,-73.563745,565
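To see why the second join needs `suffixes`, here is a minimal sketch of the same double left join on toy data. The station code `9999` is made up to show what happens when a trip references a station missing from the stations file; everything else mirrors the column names used above.

```python
import pandas as pd

trips = pd.DataFrame({'start_station_code': [6175, 6175],
                      'end_station_code': [6009, 9999],   # 9999: hypothetical unknown station
                      'duration_sec': [300, 420]})
stations = pd.DataFrame({'code': [6175, 6009],
                         'name': ['St-André / Cherrier', 'Ste-Catherine / Labelle'],
                         'latitude': [45.520458, 45.515038],
                         'longitude': [-73.567575, -73.559201]})

# First join: no column names clash yet, so the station columns keep their plain names
start = pd.merge(trips, stations, how='left',
                 left_on='start_station_code', right_on='code')
# Second join: code/name/latitude/longitude now exist on both sides, so they get suffixed
both = pd.merge(start, stations, how='left',
                left_on='end_station_code', right_on='code',
                suffixes=('_start', '_end'))

# A left join keeps every trip; unmatched end stations show up as NaN
missing = both['name_end'].isna().sum()
print(both[['name_start', 'name_end']])
print('trips with unknown end station:', missing)
```

Counting the NaN values after the join is a cheap way to check that every station code in the trips file exists in the stations file.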


## Analysis of venues at destination station

I chose to focus the analysis on one particular station, the one closest to where I live. The station is near the corner _St-André / Cherrier_:

In [7]:
station2018[station2018['name']=='St-André / Cherrier']

Unnamed: 0,code,name,latitude,longitude
287,6175,St-André / Cherrier,45.520458,-73.567575


In [8]:
studied_station_code = station2018[station2018['name']=='St-André / Cherrier'].code.values[0] # 6175

Let's keep only the trips that started from our station:

In [9]:
new_data = data[data['start_station_code']==studied_station_code].reset_index(drop=True)

To analyse the trip patterns, we focus on the most popular destinations. We select the trips that ended at one of the 100 most frequent destination stations reached from our station:

In [10]:
# codes of the 100 most frequent destination stations
top_100_end_codes = new_data['end_station_code'].value_counts()[:100].index
top_100_destinations_trips = new_data[new_data['end_station_code'].isin(top_100_end_codes)]
top_100_destinations_trips = top_100_destinations_trips[['name_start','start_station_code', 'name_end', 'end_station_code', 'latitude_end', 'longitude_end']]
# ignore trips that started and ended at the same station
top_100_destinations_trips = top_100_destinations_trips[top_100_destinations_trips.end_station_code!=studied_station_code]
top_100_destinations_trips = top_100_destinations_trips.reset_index(drop=True)
top_100_destinations_trips.tail()

Unnamed: 0,name_start,start_station_code,name_end,end_station_code,latitude_end,longitude_end
13918,St-André / Cherrier,6175,Ste-Catherine / Labelle,6009,45.515038,-73.559201
13919,St-André / Cherrier,6175,Beaudry / Ontario,6902,45.521556,-73.562264
13920,St-André / Cherrier,6175,du Square Ahmerst / Wolfe,6017,45.521015,-73.563745
13921,St-André / Cherrier,6175,Terrasse Mercure / Fullum,6133,45.53569,-73.565923
13922,St-André / Cherrier,6175,du Mont-Royal / Clark,6221,45.51941,-73.58685
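The selection of the most popular destinations can be sketched on a handful of made-up trips. This variant drops the round trips before ranking, which is an assumption on my part rather than what the cell above does: ranking first and filtering afterwards can leave one station fewer than requested when the origin itself is among the top destinations.

```python
import pandas as pd

# Toy trips leaving station 6175 (made-up counts); the last one is a round trip
trips = pd.DataFrame({'end_station_code': [6009, 6009, 6009, 6902, 6902, 6017, 6175]})
origin = 6175
n_top = 2

# Drop round trips first, then rank the remaining destinations by number of trips
one_way = trips[trips['end_station_code'] != origin]
top_codes = one_way['end_station_code'].value_counts().head(n_top).index
top_trips = one_way[one_way['end_station_code'].isin(top_codes)].reset_index(drop=True)

print(sorted(top_codes))
print(len(top_trips), 'trips kept')
```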


In [11]:
# select the top 100 destination stations
top_100_codes = top_100_destinations_trips.end_station_code.unique()
top_100_stations = station2018[station2018.code.isin(top_100_codes)]

Let's visualize the distribution of those end stations on a map of Montreal:

In [12]:
# create map of end stations using latitude and longitude values
station_lat = data[data['start_station_code']==studied_station_code].iloc[0].latitude_start
station_long = data[data['start_station_code']==studied_station_code].iloc[0].longitude_start
station_name = data[data['start_station_code']==studied_station_code].iloc[0].name_start


end_stations_map = folium.Map(location=[station_lat, station_long], zoom_start=13)

# add markers to map
for lat, lng, station, code in zip(top_100_stations['latitude'], top_100_stations['longitude'],
                             top_100_stations['name'],top_100_stations['code']):
    label = '{}, {}'.format(station,code)
    label = folium.Popup(label,parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(end_stations_map)  
folium.CircleMarker(
    [station_lat,station_long],
    radius=5,
    popup=folium.Popup(station_name, parse_html=True),
    color='red',
    fill=True,
    fill_color='red',
    fill_opacity=0.7).add_to(end_stations_map)  
end_stations_map

#### Nearby venues - Foursquare API

The idea now is to use the [Foursquare API](https://developer.foursquare.com/) to explore the venues surrounding each end station.

In [13]:
# Foursquare Credentials
CLIENT_ID = 'add_your_client_ID_here'
CLIENT_SECRET = 'add_your_client_secret_code_here'
VERSION = '20181111' # Foursquare API version

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=100, LIMIT=100):
    """ returns the nearby venues' names, locations and categories """

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
#         print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()
        
        if 'groups' not in results['response']:
            print('No groups field in response. No venues near {} within {} meters.'.format(name, radius))
            continue
        else:
            results = results["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Station', 
                  'Station Latitude', 
                  'Station Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
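The two fragile spots of such a request are the hand-built query string and the assumption that every venue carries at least one category. The sketch below shows one defensive way to handle both, using only the standard library. The helper names `build_explore_url` and `extract_venues` are hypothetical, the placeholder credentials are not real, and the payload shape is assumed from the fields the function above reads, not taken from the Foursquare documentation.

```python
from urllib.parse import urlencode

def build_explore_url(lat, lng, radius=100, limit=100,
                      client_id='YOUR_ID', client_secret='YOUR_SECRET', version='20181111'):
    """Build the explore URL; urlencode escapes every parameter value."""
    params = {'client_id': client_id, 'client_secret': client_secret, 'v': version,
              'll': '{},{}'.format(lat, lng), 'radius': radius, 'limit': limit}
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params)

def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; tolerate missing groups or categories."""
    groups = payload.get('response', {}).get('groups', [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get('items', []):
        venue = item['venue']
        # fall back to a placeholder when a venue has an empty category list
        categories = venue.get('categories') or [{'name': 'Unknown'}]
        rows.append((venue['name'], venue['location']['lat'],
                     venue['location']['lng'], categories[0]['name']))
    return rows

# Made-up payload with the assumed shape: one normal venue, one without categories
fake = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Café Myriade', 'location': {'lat': 45.4961, 'lng': -73.5779},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Nameless spot', 'location': {'lat': 45.4960, 'lng': -73.5780},
               'categories': []}}]}]}}

url = build_explore_url(45.49659, -73.57851)
venues = extract_venues(fake)
print(venues)
```

Keeping the parsing separate from the network call also makes it possible to test the parsing without spending API quota.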

In [17]:
%%time
mtl_venues = getNearbyVenues(names=top_100_stations['name'],
                                   latitudes=top_100_stations['latitude'],
                                   longitudes=top_100_stations['longitude'])
print('Venues data retrieved!')

Venues data retrieved!
Wall time: 30.5 s


In [18]:
# save the venues data
mtl_venues.to_csv('mtl_venues_6175.csv', index=False)

In [19]:
mtl_venues = pd.read_csv('mtl_venues_6175.csv')
print('There are {} venue categories'.format(mtl_venues['Venue Category'].nunique()))

There are 139 venue categories


In [20]:
print(mtl_venues.shape)
mtl_venues.head()

(525, 7)


Unnamed: 0,Station,Station Latitude,Station Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,de Bordeaux / Marie-Anne,45.533409,-73.570657,Arena Mt-Royal,45.533194,-73.571881,Hockey Arena
1,de Bordeaux / Rachel,45.53227,-73.56828,Provi-soir,45.532434,-73.568001,Convenience Store
2,Mackay / de Maisonneuve,45.49659,-73.57851,Café Myriade,45.496103,-73.577927,Café
3,Mackay / de Maisonneuve,45.49659,-73.57851,Thé Kiosque,45.496151,-73.577563,Tea Room
4,Mackay / de Maisonneuve,45.49659,-73.57851,Cinéma J.A. De Sève,45.496527,-73.577756,College Theater


Let's now one-hot encode all those categories and then group by end station, taking the mean of each column, which gives the frequency of occurrence of each category around each station:

In [21]:
# one hot encoding
mtl_onehot = pd.get_dummies(mtl_venues[['Venue Category']], prefix="", prefix_sep="")

# add station column back to dataframe
mtl_onehot['Station'] = mtl_venues['Station'] 

# move station column to the first column
fixed_columns = ['Station'] + [c for c in mtl_onehot.columns if c!='Station']
mtl_onehot = mtl_onehot[fixed_columns]

mtl_grouped = mtl_onehot.groupby('Station').mean().reset_index()
mtl_grouped.head(7)

Unnamed: 0,Station,Afghan Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Bagel Shop,Bakery,Bank,Bar,Beer Bar,Bike Rental / Bike Share,Bistro,Bookstore,Boutique,Breakfast Spot,Brewery,Building,Burger Joint,Bus Station,Butcher,Café,Caribbean Restaurant,Cheese Shop,Chinese Restaurant,Chocolate Shop,Church,Clothing Store,Cocktail Bar,Coffee Shop,College Theater,Concert Hall,Convenience Store,Cosmetics Shop,Costume Shop,Creperie,Cuban Restaurant,Cupcake Shop,Dance Studio,Deli / Bodega,Department Store,Dessert Shop,Diner,Discount Store,Dog Run,Dumpling Restaurant,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Flower Shop,Food & Drink Shop,Food Court,Food Service,Food Truck,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,General Entertainment,Gift Shop,Gourmet Shop,Greek Restaurant,Grocery Store,Gym,Gym / Fitness Center,Harbor / Marina,Hawaiian Restaurant,Health Food Store,History Museum,Hobby Shop,Hockey Arena,Hostel,Hot Dog Joint,Hot Spring,Hotel,Ice Cream Shop,Indian Restaurant,Intersection,Italian Restaurant,Japanese Restaurant,Jazz Club,Jewelry Store,Juice Bar,Karaoke Bar,Liquor Store,Lounge,Market,Mediterranean Restaurant,Metro Station,Mexican Restaurant,Middle Eastern Restaurant,Moroccan Restaurant,Motorcycle Shop,Movie Theater,Museum,Music Store,Music Venue,Nightclub,Park,Pastry Shop,Performing Arts Venue,Pet Café,Pet Store,Pharmacy,Pizza Place,Playground,Plaza,Pool,Pool Hall,Portuguese Restaurant,Poutine Place,Pub,Record Shop,Restaurant,Rock Climbing Spot,Rock Club,Salad Place,Sandwich Place,Seafood Restaurant,Shopping Mall,Snack Place,Soup Place,South American Restaurant,Spa,Spanish Restaurant,Sporting Goods Shop,Sports Bar,Steakhouse,Supermarket,Sushi Restaurant,Taco Place,Tea Room,Thai Restaurant,Theater,Tunnel,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Beaudry / Ontario,0.0,0.0,0.0,0.0,0.0,0.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Berri / Cherrier,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Berri / Rachel,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Boyer / du Mont-Royal,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Calixa-Lavallée / Rachel,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Chabot / du Mont-Royal,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.181818,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.090909,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Chambord / Laurier,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
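A small example makes the meaning of those numbers concrete. In this sketch the stations `A` and `B` and their venues are made up; the point is only that the mean of one-hot columns per station equals the share of each category among that station's venues, so every row sums to 1.

```python
import pandas as pd

# Station A has four venues (two cafés), station B has a single hotel
venues = pd.DataFrame({'Station': ['A', 'A', 'A', 'A', 'B'],
                       'Venue Category': ['Café', 'Café', 'Bakery', 'Bar', 'Hotel']})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Station', venues['Station'])

# Mean of the indicators per station = relative frequency of each category
grouped = onehot.groupby('Station').mean().reset_index()
print(grouped)
```

A station with a single venue, like `B` here, ends up with one category at 1.0 and zeros everywhere else, which is the pattern visible for several stations in the table above.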


Let's now rank the venue categories of each station by frequency and store the 20 most common ones in a new pandas dataframe:

In [22]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [23]:
num_top_venues = 20

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Station']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
stations_venues_sorted = pd.DataFrame(columns=columns)
stations_venues_sorted['Station'] = mtl_grouped['Station']

for ind in np.arange(mtl_grouped.shape[0]):
    stations_venues_sorted.iloc[ind, 1:] = return_most_common_venues(mtl_grouped.iloc[ind, :], num_top_venues)

stations_venues_sorted.head()

Unnamed: 0,Station,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,Beaudry / Ontario,Bakery,French Restaurant,Breakfast Spot,Café,Dessert Shop,Diner,Department Store,Discount Store,Dog Run,Food Truck,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Flower Shop,Deli / Bodega,Food & Drink Shop,Food Court,Food Service,Dumpling Restaurant,Yoga Studio
1,Berri / Cherrier,Dessert Shop,Yoga Studio,Dumpling Restaurant,Food Court,Food & Drink Shop,Flower Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Dog Run,Cuban Restaurant,Discount Store,Diner,Department Store,Deli / Bodega,Dance Studio,Food Service,Food Truck,French Restaurant,Fried Chicken Joint
2,Berri / Rachel,Coffee Shop,Health Food Store,Burger Joint,Bakery,Clothing Store,Cocktail Bar,Yoga Studio,Food & Drink Shop,Flower Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Food Service,Dumpling Restaurant,Dog Run,Discount Store,Diner,Dessert Shop,Department Store,Food Court
3,Boyer / du Mont-Royal,Liquor Store,Gourmet Shop,Flower Shop,Restaurant,Frozen Yogurt Shop,Breakfast Spot,Supermarket,Mediterranean Restaurant,Café,Bar,Diner,Asian Restaurant,Dessert Shop,Dumpling Restaurant,Dog Run,Discount Store,Farmers Market,Fast Food Restaurant,Department Store,Food & Drink Shop
4,Calixa-Lavallée / Rachel,Hotel,Yoga Studio,Dumpling Restaurant,Food Court,Food & Drink Shop,Flower Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Dog Run,Food Truck,Discount Store,Diner,Dessert Shop,Department Store,Deli / Bodega,Dance Studio,Food Service,French Restaurant,Hockey Arena
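Stations with few venues have fewer than 20 categories with a non-zero frequency, so the remaining slots get filled with categories that never occur nearby (for example, everything after the first venue for _Berri / Cherrier_). The sketch below shows one possible alternative that keeps only the categories actually observed; `top_categories` is a hypothetical helper and the toy frequencies are made up.

```python
import pandas as pd

def top_categories(row, n):
    """Return up to n category names with non-zero frequency, most frequent first."""
    freqs = row.drop('Station').astype(float)
    freqs = freqs[freqs > 0].sort_values(ascending=False)
    return list(freqs.index[:n])

# Toy frequency table with the same layout as mtl_grouped
grouped = pd.DataFrame({'Station': ['A', 'B'],
                        'Bakery': [0.25, 0.0], 'Bar': [0.25, 0.0],
                        'Café': [0.5, 0.0], 'Hotel': [0.0, 1.0]})

ranked = {row['Station']: top_categories(row, 3) for _, row in grouped.iterrows()}
print(ranked)
```

The lists now have variable length, so they no longer fit a fixed set of "n-th Most Common Venue" columns; whether that trade-off is worth it depends on how the ranking is used afterwards.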


#### Clustering End Stations

Let's run *k*-means to cluster the end stations into 5 clusters.

In [24]:
# set number of clusters
kclusters = 5

mtl_grouped_clustering = mtl_grouped.drop(columns='Station')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(mtl_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[:10] 

array([1, 2, 2, 2, 2, 2, 2, 1, 2, 2])
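The number of clusters is fixed by hand here. One common way to sanity-check such a choice is to compare the silhouette score for several values of *k*; the sketch below does this on synthetic data, not on the station features, so the blobs and the resulting best *k* say nothing about the right value for this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.RandomState(0)
# Synthetic stand-in: three well-separated blobs of 20 points in 4 dimensions
centers = np.array([[0, 0, 0, 0], [5, 5, 0, 0], [0, 5, 5, 5]], dtype=float)
X = np.vstack([c + 0.3 * rng.randn(20, 4) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each partition
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('best k:', best_k)
```

On sparse frequency vectors like the ones used here, the scores are usually much less clear-cut than on this toy example, so they are better read as a hint than as a verdict.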

In [25]:
# add clustering labels
stations_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

mtl_merged = top_100_stations

# merge top_100_stations with stations_venues_sorted to add latitude/longitude for each end station
mtl_merged = mtl_merged.join(stations_venues_sorted.set_index('Station'), on='name')
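# any station without venue data has no label after the join (NaN); nan_to_num maps it to cluster 0
mtl_merged['Cluster Labels'] = np.nan_to_num(mtl_merged['Cluster Labels']).astype(int)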
mtl_merged.head() 

Unnamed: 0,code,name,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,11th Most Common Venue,12th Most Common Venue,13th Most Common Venue,14th Most Common Venue,15th Most Common Venue,16th Most Common Venue,17th Most Common Venue,18th Most Common Venue,19th Most Common Venue,20th Most Common Venue
0,7030,de Bordeaux / Marie-Anne,45.533409,-73.570657,2,Hockey Arena,Dog Run,Food & Drink Shop,Flower Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Dumpling Restaurant,Discount Store,Food Service,Diner,Dessert Shop,Department Store,Deli / Bodega,Dance Studio,Cupcake Shop,Food Court,Yoga Studio,Cuban Restaurant,Grocery Store
1,6141,de Bordeaux / Rachel,45.53227,-73.56828,2,Convenience Store,Yoga Studio,Dog Run,Food & Drink Shop,Flower Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Dumpling Restaurant,Discount Store,Food Service,Diner,Dessert Shop,Department Store,Deli / Bodega,Dance Studio,Food Court,Food Truck,Cuban Restaurant,Grocery Store
2,6100,Mackay / de Maisonneuve,45.49659,-73.57851,1,Tea Room,Asian Restaurant,College Theater,Café,Department Store,Dessert Shop,Diner,Discount Store,Deli / Bodega,Food Service,Dog Run,Dumpling Restaurant,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Flower Shop,Food & Drink Shop,Food Court,Yoga Studio,French Restaurant
3,6064,Métro Peel (de Maisonneuve / Stanley),45.50038,-73.57507,2,Gym,Yoga Studio,Portuguese Restaurant,Steakhouse,Restaurant,Clothing Store,Café,Pizza Place,Dessert Shop,Department Store,Diner,Flower Shop,Deli / Bodega,Dance Studio,Dog Run,Dumpling Restaurant,Falafel Restaurant,Farmers Market,Fast Food Restaurant,Discount Store
14,6073,de Maisonneuve / Aylmer,45.50501,-73.57069,2,Department Store,Coffee Shop,Thai Restaurant,Clothing Store,Sushi Restaurant,Yoga Studio,Dog Run,Flower Shop,Fast Food Restaurant,Farmers Market,Falafel Restaurant,Dumpling Restaurant,Discount Store,Food Court,Diner,Dessert Shop,Deli / Bodega,Dance Studio,Food & Drink Shop,Food Truck
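One detail of this join is easy to miss: a station that has no row in the venues table gets a missing label, and `np.nan_to_num` turns that into 0, which is also a valid cluster id. The toy example below (made-up stations `A`, `B`, `C`) contrasts that behaviour with a sentinel value of -1, which is just one possible convention.

```python
import numpy as np
import pandas as pd

stations = pd.DataFrame({'name': ['A', 'B', 'C']})
# Station C has no venues, so it never received a cluster label
labels = pd.DataFrame({'Station': ['A', 'B'], 'Cluster Labels': [1, 2]})

merged = stations.join(labels.set_index('Station'), on='name')

# NaN becomes 0, so C is indistinguishable from a real member of cluster 0
as_cluster_zero = np.nan_to_num(merged['Cluster Labels']).astype(int)
# A sentinel keeps unlabeled stations apart from every real cluster
with_sentinel = merged['Cluster Labels'].fillna(-1).astype(int)

print(list(as_cluster_zero))
print(list(with_sentinel))
```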


Finally, let's visualize the resulting clusters:

In [94]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[station_lat, station_long], zoom_start=13)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [(i+1) + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i**2) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(mtl_merged['latitude'], mtl_merged['longitude'], mtl_merged['name'], mtl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
folium.CircleMarker(
    [station_lat,station_long],
    radius=7,
    popup=folium.Popup(station_name, parse_html=True),
    color='black',
    fill=True,
    fill_color='black',
    fill_opacity=0.7).add_to(map_clusters)



legend_html =   '''
                <div style="position: fixed; 
                            top: 50px; left: 50px; width: 100px; height: 100px; 
                            border:2px solid grey; z-index:9999; font-size:14px;
                            ">&nbsp; Cluster 0 &nbsp; <i class="fa fa-circle" style="color:red"></i><br>
                              &nbsp; Cluster 1 &nbsp; <i class="fa fa-circle" style="color:#4000ff"></i><br>
                              &nbsp; Cluster 2 &nbsp; <i class="fa fa-circle" style="color:#0080d9"></i><br>
                              &nbsp; Cluster 3 &nbsp; <i class="fa fa-circle" style="color:#41ff7f"></i><br>
                              &nbsp; Cluster 4 &nbsp; <i class="fa fa-circle" style="color:#ff7d25"></i>
                </div>
                ''' 

map_clusters.get_root().html.add_child(folium.Element(legend_html))      
map_clusters
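The legend above hard-codes one colour per cluster, so it has to be edited by hand whenever `kclusters` changes. As a rough sketch, the palette and the legend rows can be generated from the same list. Note that this version maps cluster `k` directly to `palette[k]` and does not square the colour values, so the colours differ from the ones on the map above.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5

# One hex colour per cluster, evenly spaced along the rainbow colormap
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Build one legend row per cluster from the same palette
rows = ['&nbsp; Cluster {} &nbsp; <i class="fa fa-circle" style="color:{}"></i>'.format(k, c)
        for k, c in enumerate(palette)]
legend_html = ('<div style="position: fixed; top: 50px; left: 50px; '
               'border:2px solid grey; z-index:9999; font-size:14px;">'
               + '<br>'.join(rows) + '</div>')
print(palette)
```

The resulting string can be attached to the map in the same way as the hand-written legend, and `palette[cluster]` can be passed as the marker colour.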

The number of stations in each cluster is as follows:

In [65]:
for i in np.arange(kclusters):
    print ('Cluster {}: {} stations'.format(i,mtl_merged['Cluster Labels'].value_counts()[i]))

Cluster 0: 15 stations
Cluster 1: 18 stations
Cluster 2: 62 stations
Cluster 3: 2 stations
Cluster 4: 2 stations
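Indexing the result of `value_counts()` by cluster id works as long as every cluster has at least one station. A slightly more defensive sketch, on made-up labels where cluster 3 is deliberately empty, reindexes the counts over all cluster ids first:

```python
import pandas as pd

kclusters = 5
labels = pd.Series([2, 2, 1, 2, 0, 4])   # toy labels; no station falls in cluster 3

# Reindex so that every cluster id appears, with 0 for the empty ones
counts = labels.value_counts().reindex(range(kclusters), fill_value=0)
for cluster, n in counts.items():
    print('Cluster {}: {} stations'.format(cluster, n))
```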


So we have one big cluster, two medium-sized ones and two very small ones. Let's check what each of them consists of:

In [66]:
for i in np.arange(kclusters):
    print('Most common venues in cluster {} are: {} '.format(i,mtl_merged[mtl_merged['Cluster Labels'] == i].iloc[:,5:7].stack().value_counts(normalize=True).index.values[:4]))

Most common venues in cluster 0 are: ['Yoga Studio' 'Hot Spring' 'Dog Run' 'Intersection'] 
Most common venues in cluster 1 are: ['Bakery' 'Supermarket' 'Gourmet Shop' 'Liquor Store' 'Grocery Store'] 
Most common venues in cluster 2 are: ['Restaurant' 'Bar' 'Coffee Shop' 'Food & Drink Shop' 'Sandwich Place'] 
Most common venues in cluster 3 are: ['Yoga Studio' 'Bus Station'] 
Most common venues in cluster 4 are: ['French Restaurant' 'Dumpling Restaurant'] 
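The per-cluster summary relies on column positions (`iloc[:, 5:7]`), which silently changes meaning if a column is added or moved. A minimal sketch of the same idea with explicit column names, on a made-up three-station table, could look like this:

```python
import pandas as pd

# Toy stand-in for mtl_merged with only the two top-ranked venue columns
merged = pd.DataFrame({
    'Cluster Labels': [0, 0, 1],
    '1st Most Common Venue': ['Bakery', 'Bakery', 'Bar'],
    '2nd Most Common Venue': ['Café', 'Supermarket', 'Restaurant'],
})
venue_cols = ['1st Most Common Venue', '2nd Most Common Venue']

summary = {}
for cluster, group in merged.groupby('Cluster Labels'):
    # stack() flattens the selected columns into one series of category names
    counts = group[venue_cols].stack().value_counts()
    summary[cluster] = list(counts.index[:2])

print(summary)
```

Categories with the same count come out in no particular order, so only the clear leaders of each cluster should be read as meaningful.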


As we can see, the biggest cluster, cluster 2, is dominated by food and drink venues such as restaurants, bars and coffee shops. Cluster 1 groups markets and stores. Yoga studios and hot springs appear mostly in cluster 0. The last two clusters are harder to interpret: with only two stations each, they are too small and their venues look essentially random.