Segmenting and Clustering Neighborhoods in Toronto

1. Importing packages

In [424]:
import pandas as pd
import numpy as np
import geocoder
import requests
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import folium
from geopy.geocoders import Nominatim

2. Importing and preprocessing of Toronto neighborhoods data

Our postal code dataset comes from Wikipedia, so we need the page's URL. We use 'read_html' from pandas to import the dataset.

In [2]:
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response=pd.read_html(url)
response

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

The 'response' variable holds a list of three dataframes, one for each of the three tables on the Wikipedia page. Only the first table is of interest to us.

In [3]:
df=response[0]
df.head(n=10)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
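
Selecting the table by position works only as long as the page keeps its tables in the same order. A minimal sketch of a more defensive selection is shown here, assuming the column names printed above. `pick_table` is a hypothetical helper, not part of pandas, and the two small frames merely stand in for the list that `read_html` returns.

```python
import pandas as pd

def pick_table(tables, required_columns):
    """Return the first DataFrame that contains every required column."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError('no table with columns {}'.format(required_columns))

# Stand-ins for the list of tables returned by pd.read_html (illustrative data only)
tables = [
    pd.DataFrame({'Code': ['A'], 'Region': ['Somewhere']}),
    pd.DataFrame({'Postal Code': ['M3A'],
                  'Borough': ['North York'],
                  'Neighbourhood': ['Parkwoods']}),
]

postal = pick_table(tables, ['Postal Code', 'Borough', 'Neighbourhood'])
print(postal)
```

In the notebook, the call would presumably be `pick_table(response, [...])` in place of `response[0]`.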


We drop the rows whose borough is 'Not assigned', and for any remaining row whose neighborhood is 'Not assigned' we use the borough name as the neighborhood. This leads us to our cleaned dataset below.

In [4]:
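# keep only the rows that have an assigned borough
df=df[df['Borough']!='Not assigned'].reset_index(drop=True)
# a 'Not assigned' neighbourhood takes the name of its borough
# (.loc avoids the chained assignment df[col][i]=..., which may not write back)
unassigned=df['Neighbourhood']=='Not assigned'
df.loc[unassigned,'Neighbourhood']=df.loc[unassigned,'Borough']
df.head(n=10)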

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


We now check the shape of our final dataset to confirm that it includes all 103 forward sortation areas (FSAs) in Toronto.

In [5]:
df.shape

(103, 3)
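
The row count alone does not show that the cleaning rules held. A small sketch of two extra checks follows: postal codes should be unique, and no 'Not assigned' value should remain. `check_cleaned` is a hypothetical helper and both frames are hand-made for illustration.

```python
import pandas as pd

def check_cleaned(frame):
    """Return a list of problems found in a cleaned postal-code table."""
    problems = []
    if frame['Postal Code'].duplicated().any():
        problems.append('duplicate postal codes')
    if (frame[['Borough', 'Neighbourhood']] == 'Not assigned').any().any():
        problems.append("'Not assigned' values remain")
    return problems

good = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village'],
})
bad = pd.DataFrame({
    'Postal Code': ['M3A', 'M3A'],
    'Borough': ['North York', 'Not assigned'],
    'Neighbourhood': ['Parkwoods', 'Not assigned'],
})

print(check_cleaned(good))
print(check_cleaned(bad))
```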

3. Importing and preprocessing of Toronto neighborhoods geographical locations

For this part, we use the geocoder Python package (through its ArcGIS provider) to get the geographical location of each of Toronto's postal codes. These locations will then be used to explore venues with the Foursquare API.

In [117]:
lats=[]
longs=[]
for i in df['Postal Code']:
    lat_lng_coords=None
    while(lat_lng_coords is None):
        g=geocoder.arcgis('{}, Toronto, Ontario'.format(i))
        lat_lng_coords=g.latlng
    latitude=lat_lng_coords[0]
    longitude=lat_lng_coords[1]
    lats.append(latitude)
    longs.append(longitude)
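
The `while` loop above keeps retrying for as long as the geocoder returns nothing, so a postal code that never resolves would hang the cell. A minimal sketch of a bounded retry is given here. `geocode_with_retry` is a hypothetical helper, and the fake lookup stands in for the real geocoder call so the example runs offline.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)
    return None

# Fake lookup standing in for the geocoder call: fails twice, then succeeds
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    return None if calls['count'] < 3 else [43.75245, -79.32991]

coords = geocode_with_retry(flaky_lookup, 'M3A, Toronto, Ontario')
print(coords, calls['count'])
```

In the notebook, the lookup could be `lambda q: geocoder.arcgis(q).latlng`, with a `None` result recorded as a missing coordinate rather than retried indefinitely.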

In [164]:
latlong=pd.concat([df['Postal Code'],pd.DataFrame(lats,columns=['Latitude']),pd.DataFrame(longs,columns=['Longitude'])],axis=1)
latlong.head(n=10)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M3A,43.75245,-79.32991
1,M4A,43.73057,-79.31306
2,M5A,43.65512,-79.36264
3,M6A,43.72327,-79.45042
4,M7A,43.66253,-79.39188
5,M9A,43.66263,-79.52831
6,M1B,43.81139,-79.19662
7,M3B,43.74923,-79.36186
8,M4B,43.70718,-79.31192
9,M5B,43.65739,-79.37804


We now concatenate this data with our original dataset.

In [172]:
df_ll=latlong.drop(columns=['Postal Code'])
df_aug=pd.concat([df,df_ll],axis=1)
df_aug.head(n=10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.66263,-79.52831
6,M1B,Scarborough,"Malvern, Rouge",43.81139,-79.19662
7,M3B,North York,Don Mills,43.74923,-79.36186
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.70718,-79.31192
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
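
Concatenating along `axis=1` aligns rows by index position, so it relies on both frames keeping the same order. A sketch of a key-based alternative using `merge` on 'Postal Code' is shown below, with three rows copied from the tables above and the coordinates deliberately listed in a different order.

```python
import pandas as pd

areas = pd.DataFrame({
    'Postal Code': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
})
# Same coordinates as in the table above, but in a different row order
coords = pd.DataFrame({
    'Postal Code': ['M5A', 'M3A', 'M4A'],
    'Latitude': [43.65512, 43.75245, 43.73057],
    'Longitude': [-79.36264, -79.32991, -79.31306],
})

# validate='one_to_one' raises if a postal code appears twice on either side
joined = areas.merge(coords, on='Postal Code', how='left', validate='one_to_one')
print(joined)
```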


4. Exploring and clustering the neighborhoods in Toronto

For this part, I chose to analyze only the neighborhoods in boroughs with "Toronto" in the name, since the full dataset has more than 100 rows to consider. As can be seen below, there are 39 such neighborhoods, which significantly decreases the number of exploration calls we make to Foursquare.

In [173]:
# True for every row whose borough name contains "Toronto"
mask=df_aug['Borough'].str.contains('Toronto')
df_final=df_aug[mask].reset_index(drop=True)
print('{} neighborhoods with "Toronto" in the borough name'.format(len(df_final)))
df_final.head(n=10)

39 neighborhoods with "Toronto" in the borough name


Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
1,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66253,-79.39188
2,M5B,Downtown Toronto,"Garden District, Ryerson",43.65739,-79.37804
3,M5C,Downtown Toronto,St. James Town,43.65215,-79.37587
4,M4E,East Toronto,The Beaches,43.67709,-79.29547
5,M5E,Downtown Toronto,Berczy Park,43.64536,-79.37306
6,M5G,Downtown Toronto,Central Bay Street,43.65609,-79.38493
7,M6G,Downtown Toronto,Christie,43.66869,-79.42071
8,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.6497,-79.38258
9,M6H,West Toronto,"Dufferin, Dovercourt Village",43.66505,-79.43891


We explore venues within 500 meters of each point in our dataset, requesting up to 100 of the top venues per neighborhood.

In [141]:
RADIUS=500
# Foursquare credentials: substitute your own and never publish real keys
CLIENT_ID='YOUR_CLIENT_ID'
CLIENT_SECRET='YOUR_CLIENT_SECRET'
VERSION='20180605'
LIMIT=100

In [142]:
def getNearbyVenues(names,latitudes,longitudes,radius=500):
    venues_list=[]
    super_counter=0
    for name,lat,lng in zip(names,latitudes,longitudes):
        
        # venue counter for each neighborhood
        counter=0
        
        # create the API request URL
        url='https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID,CLIENT_SECRET,VERSION,lat,lng,radius,LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        for v in results:
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name'])])
            counter+=1
        print('{} venues found near {}'.format(counter,name))
        super_counter+=counter
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    print('{} total venues found.'.format(super_counter))
    return(nearby_venues)
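
The function above indexes straight into the response, so a venue without categories or a response without groups would raise an exception partway through the loop. A sketch of a more tolerant parsing step follows. `extract_venues` is a hypothetical helper, and the payload is hand-written, modelled only on the keys the function above reads, not on the full Foursquare schema.

```python
def extract_venues(neighbourhood, payload):
    """Flatten an explore-style response into (neighbourhood, name, lat, lng, category) rows."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Venues without a category are kept, with None as the category
        category = categories[0]['name'] if categories else None
        rows.append((neighbourhood, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written payload with the same nesting the function above reads (illustrative values)
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = extract_venues('Regent Park, Harbourfront', sample)
print(rows)
print(extract_venues('Nowhere', {'response': {}}))
```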

In [143]:
toronto_venues=getNearbyVenues(names=df_final['Neighbourhood'],latitudes=df_final['Latitude'],longitudes=df_final['Longitude'])

22 venues found near Regent Park, Harbourfront
17 venues found near Queen's Park, Ontario Provincial Government
100 venues found near Garden District, Ryerson
76 venues found near St. James Town
4 venues found near The Beaches
64 venues found near Berczy Park
61 venues found near Central Bay Street
11 venues found near Christie
100 venues found near Richmond, Adelaide, King
17 venues found near Dufferin, Dovercourt Village
48 venues found near Harbourfront East, Union Station, Toronto Islands
42 venues found near Little Portugal, Trinity
7 venues found near The Danforth West, Riverdale
100 venues found near Toronto Dominion Centre, Design Exchange
85 venues found near Brockton, Parkdale Village, Exhibition Place
20 venues found near India Bazaar, The Beaches West
100 venues found near Commerce Court, Victoria Hotel
52 venues found near Studio District
3 venues found near Lawrence Park
2 venues found near Roselawn
7 venues found near Davisville North
2 venues found near Forest Hill Nort

Note that some neighborhoods returned only a handful of venues, so each venue there would carry an inflated weight, especially since we work with percentage data rather than raw counts. To remedy this, we only consider neighborhoods that returned more than 20 venues. The mask below is computed from the one-hot table 'toronto_onehot', which is built in a later cell.

In [210]:
toronto_total=toronto_onehot.groupby(['Neighbourhood']).sum().reset_index()
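# sum only the venue-category columns; the text column 'Neighbourhood' cannot be summed
total_mask=toronto_total.drop(columns=['Neighbourhood']).sum(axis=1)>20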
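
To see why the threshold matters, the weight of a single venue in the percentage data is one over the number of venues returned. The sketch below uses a few of the counts printed above; the variable names are illustrative.

```python
# Venue counts copied from the output printed above
counts = {
    'Regent Park, Harbourfront': 22,
    "Queen's Park, Ontario Provincial Government": 17,
    'Garden District, Ryerson': 100,
    'The Beaches': 4,
    'India Bazaar, The Beaches West': 20,
}

# Share of the neighbourhood's profile contributed by one venue
weights = {name: 1 / n for name, n in counts.items()}

# Same rule as the mask: strictly more than 20 venues
kept = [name for name, n in counts.items() if n > 20]

print(weights['The Beaches'], weights['Garden District, Ryerson'])
print(kept)
```

A single venue accounts for a quarter of The Beaches' profile but only a hundredth of Garden District's, which is why sparse neighbourhoods are left out.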

Let's now take a look at the venues returned by Foursquare.

In [144]:
toronto_venues

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Regent Park, Harbourfront",43.65512,-79.36264,Roselle Desserts,43.653447,-79.362017,Bakery
1,"Regent Park, Harbourfront",43.65512,-79.36264,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,"Regent Park, Harbourfront",43.65512,-79.36264,Figs Breakfast & Lunch,43.655675,-79.364503,Breakfast Spot
3,"Regent Park, Harbourfront",43.65512,-79.36264,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot
4,"Regent Park, Harbourfront",43.65512,-79.36264,The Yoga Lounge,43.655515,-79.364955,Yoga Studio
...,...,...,...,...,...,...,...
1708,"Business reply mail Processing Centre, South C...",43.64869,-79.38544,Red Eye Espresso,43.651150,-79.390146,Café
1709,"Business reply mail Processing Centre, South C...",43.64869,-79.38544,Kanga,43.649955,-79.389352,Pie Shop
1710,"Business reply mail Processing Centre, South C...",43.64869,-79.38544,Condom Shack,43.650542,-79.388138,Hobby Shop
1711,"Business reply mail Processing Centre, South C...",43.64869,-79.38544,Druxy's,43.648015,-79.379907,Deli / Bodega


As the results above show, we have found around 1,700 venues within our selected neighborhoods. We now turn these into a workable dataframe of one-hot encoded venue categories.

In [376]:
toronto_onehot=pd.get_dummies(toronto_venues['Venue Category'],prefix='',prefix_sep='')
toronto_onehot=pd.concat([toronto_venues['Neighbourhood'],toronto_onehot],axis=1)
toronto_onehot

Unnamed: 0,Neighbourhood,Accessories Store,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Regent Park, Harbourfront",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1708,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1709,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1710,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1711,"Business reply mail Processing Centre, South C...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Taking the mean of the one-hot columns per neighborhood gives the following dataframe. Note that it is filtered to the neighborhoods that returned more than 20 venues. This is the data we use later for clustering.

In [377]:
toronto_grouped=toronto_onehot.groupby(['Neighbourhood']).mean().reset_index()[total_mask].reset_index(drop=True)
toronto_grouped

Unnamed: 0,Neighbourhood,Accessories Store,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Yoga Studio
0,Berczy Park,0.0,0.0,0.015625,0.0,0.015625,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015625,0.0,0.0,0.0,0.0,0.0,0.0,0.015625
1,"Brockton, Parkdale Village, Exhibition Place",0.011765,0.0,0.0,0.0,0.0,0.0,0.023529,0.0,0.0,...,0.0,0.0,0.011765,0.0,0.0,0.0,0.0,0.0,0.0,0.011765
2,"Business reply mail Processing Centre, South C...",0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.03,0.0,...,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.0
3,"CN Tower, King and Spadina, Railway Lands, Har...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0125,0.0,...,0.0,0.0,0.0,0.0125,0.0,0.0,0.0,0.0,0.0,0.0125
4,Central Bay Street,0.0,0.0,0.0,0.0,0.016393,0.016393,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.016393,0.016393,0.016393,0.0,0.0,0.0
5,Church and Wellesley,0.0,0.011628,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.011628,0.0,0.0,0.0,0.011628
6,"Commerce Court, Victoria Hotel",0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.01,0.0,...,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0
7,Davisville,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"First Canadian Place, Underground city",0.0,0.03,0.0,0.0,0.01,0.0,0.0,0.03,0.0,...,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0
9,"Garden District, Ryerson",0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0
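
The mean of a one-hot column within a group equals the share of that category among the group's venues, which is why each row above sums to 1. A toy illustration with made-up venues:

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Park', 'Gym', 'Park', 'Park'],
})

# One column per category, 1 where the venue belongs to it
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# Mean per neighbourhood = fraction of its venues in each category
grouped = onehot.groupby('Neighbourhood').mean().reset_index()
print(grouped)
```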


To better view our resulting data, we look at the 10 most common venues for each neighborhood. We will use this view to analyse the results of our clustering later.

In [378]:
def return_most_common_venues(row,num_top_venues):
    row_categories=row.iloc[1:]
    row_categories_sorted=row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [380]:
num_top_venues=10

indicators=['st','nd','rd']

# create columns according to number of top venues
columns=['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1,indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted=pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood']=toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind,1:]=return_most_common_venues(toronto_grouped.iloc[ind,:],num_top_venues)

neighborhoods_venues_sorted.head(n=25)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Coffee Shop,Seafood Restaurant,Cocktail Bar,Restaurant,Cheese Shop,Beer Bar,Breakfast Spot,Bakery,Farmers Market,Park
1,"Brockton, Parkdale Village, Exhibition Place",Bar,Coffee Shop,Café,Restaurant,Gift Shop,Sandwich Place,Supermarket,Japanese Restaurant,Furniture / Home Store,French Restaurant
2,"Business reply mail Processing Centre, South C...",Coffee Shop,Hotel,Café,Gym,Asian Restaurant,Restaurant,Thai Restaurant,Steakhouse,Japanese Restaurant,Pizza Place
3,"CN Tower, King and Spadina, Railway Lands, Har...",Italian Restaurant,Coffee Shop,Gym / Fitness Center,Café,French Restaurant,Park,Restaurant,Bar,Bakery,Speakeasy
4,Central Bay Street,Coffee Shop,Clothing Store,Plaza,Bubble Tea Shop,Restaurant,Middle Eastern Restaurant,Sandwich Place,Hotel,Cosmetics Shop,Shoe Store
5,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Fast Food Restaurant,Café,Gay Bar,Dance Studio,Hotel,Pub
6,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Hotel,Café,Italian Restaurant,Japanese Restaurant,Gym,Beer Bar,American Restaurant,Deli / Bodega
7,Davisville,Dessert Shop,Park,Café,Sandwich Place,Pizza Place,Italian Restaurant,Coffee Shop,Tennis Court,Seafood Restaurant,Restaurant
8,"First Canadian Place, Underground city",Coffee Shop,Hotel,Café,Restaurant,Gym,Asian Restaurant,Deli / Bodega,Japanese Restaurant,Seafood Restaurant,American Restaurant
9,"Garden District, Ryerson",Coffee Shop,Clothing Store,Middle Eastern Restaurant,Cosmetics Shop,Café,Japanese Restaurant,Hotel,Bubble Tea Shop,Diner,Ramen Restaurant
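
The column labels above fall back to 'th' for every position after the third, which is fine for ten columns but would produce "21th" if `num_top_venues` grew. A small sketch of a general ordinal helper follows; `ordinal` is a hypothetical name, not something the notebook defines.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

columns = ['Neighbourhood'] + [
    '{} Most Common Venue'.format(ordinal(k)) for k in range(1, 11)
]
print(columns)
```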


Ideally, we would determine the optimal number of clusters to use, and there are methods in place to do that. However, for simplicity, let's use only two clusters in this example.
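
One such method is to compare silhouette scores across candidate values of k. The sketch below runs on synthetic data (three well-separated blobs standing in for the neighbourhood feature matrix), so the chosen k says nothing about the Toronto data itself; it only illustrates the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
# Synthetic stand-in for the feature matrix: 30 points around each centre
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

With the notebook's data, `X` would presumably be `toronto_grouped.drop(columns=['Neighbourhood'])`.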

Using two clusters, we now perform our k-means clustering process. We can see our clustering result below.

In [367]:
n_clusters=2
kmeans=KMeans(n_clusters=n_clusters,random_state=0)
kmeans.fit(toronto_grouped.drop(columns=['Neighbourhood']))

KMeans(n_clusters=2, random_state=0)

In [368]:
df_clustered=pd.concat([pd.DataFrame(kmeans.labels_,columns=['Cluster Label']),neighborhoods_venues_sorted],axis=1)
df_clustered.head(n=20)

Unnamed: 0,Cluster Label,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Berczy Park,Coffee Shop,Seafood Restaurant,Cocktail Bar,Restaurant,Cheese Shop,Beer Bar,Breakfast Spot,Bakery,Farmers Market,Park
1,0,"Brockton, Parkdale Village, Exhibition Place",Bar,Coffee Shop,Café,Restaurant,Gift Shop,Sandwich Place,Supermarket,Japanese Restaurant,Furniture / Home Store,French Restaurant
2,1,"Business reply mail Processing Centre, South C...",Coffee Shop,Hotel,Café,Gym,Asian Restaurant,Restaurant,Thai Restaurant,Steakhouse,Japanese Restaurant,Pizza Place
3,0,"CN Tower, King and Spadina, Railway Lands, Har...",Italian Restaurant,Coffee Shop,Gym / Fitness Center,Café,French Restaurant,Park,Restaurant,Bar,Bakery,Speakeasy
4,1,Central Bay Street,Coffee Shop,Clothing Store,Plaza,Bubble Tea Shop,Restaurant,Middle Eastern Restaurant,Sandwich Place,Hotel,Cosmetics Shop,Shoe Store
5,1,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Fast Food Restaurant,Café,Gay Bar,Dance Studio,Hotel,Pub
6,1,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Hotel,Café,Italian Restaurant,Japanese Restaurant,Gym,Beer Bar,American Restaurant,Deli / Bodega
7,0,Davisville,Dessert Shop,Park,Café,Sandwich Place,Pizza Place,Italian Restaurant,Coffee Shop,Tennis Court,Seafood Restaurant,Restaurant
8,1,"First Canadian Place, Underground city",Coffee Shop,Hotel,Café,Restaurant,Gym,Asian Restaurant,Deli / Bodega,Japanese Restaurant,Seafood Restaurant,American Restaurant
9,1,"Garden District, Ryerson",Coffee Shop,Clothing Store,Middle Eastern Restaurant,Cosmetics Shop,Café,Japanese Restaurant,Hotel,Bubble Tea Shop,Diner,Ramen Restaurant


Let's further examine our resulting clusters. First, we check how many neighborhoods belong to each cluster.

In [369]:
n_count=0
for i in range(0,n_clusters):
    print('Cluster {} has {} neighbourhoods.'.format(i,df_clustered[df_clustered['Cluster Label']==i].shape[0]))
    n_count+=df_clustered[df_clustered['Cluster Label']==i].shape[0]
print('{} neighborhoods accounted for.'.format(n_count))

Cluster 0 has 10 neighbourhoods.
Cluster 1 has 14 neighbourhoods.
24 neighborhoods accounted for.
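
The same tally can be obtained in one step with `value_counts`. The labels below are those of the first ten rows shown in the clustered table above, so the counts cover only those rows.

```python
import pandas as pd

# Cluster labels of the first ten neighbourhoods in the table above
labels = [1, 0, 1, 0, 1, 1, 1, 0, 1, 1]

sizes = pd.Series(labels, name='Cluster Label').value_counts().sort_index()
print(sizes)
```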


It seems we have divided our neighborhoods fairly evenly (10 and 14). Let's also obtain the respective percentage frequencies of our top venue categories.

In [371]:
# look up the frequency of each top venue category for every neighborhood
values=np.zeros(df_clustered.drop(columns=['Cluster Label','Neighbourhood']).shape)
for a,i in enumerate(df_clustered['Neighbourhood']):
    top_categories=df_clustered[df_clustered['Neighbourhood']==i].drop(columns=['Cluster Label','Neighbourhood']).values[0]
    for b,j in enumerate(top_categories):
        # .iloc[0] extracts the scalar from the one-row selection
        values[a,b]=toronto_grouped[toronto_grouped['Neighbourhood']==i][j].iloc[0]
col=[]
for ind in np.arange(num_top_venues):
    try:
        col.append('{}{} Most Common Venue'.format(ind+1,indicators[ind]))
    except IndexError:
        col.append('{}th Most Common Venue'.format(ind+1))
values=pd.DataFrame(values,columns=col)
df_clustered_values=pd.concat([df_clustered[['Cluster Label','Neighbourhood']],values],axis=1)
df_clustered_values

Unnamed: 0,Cluster Label,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Berczy Park,0.078125,0.046875,0.046875,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.015625
1,0,"Brockton, Parkdale Village, Exhibition Place",0.070588,0.058824,0.058824,0.047059,0.035294,0.035294,0.023529,0.023529,0.023529,0.023529
2,1,"Business reply mail Processing Centre, South C...",0.07,0.05,0.04,0.03,0.03,0.02,0.02,0.02,0.02,0.02
3,0,"CN Tower, King and Spadina, Railway Lands, Har...",0.075,0.0625,0.05,0.05,0.0375,0.0375,0.0375,0.0375,0.025,0.025
4,1,Central Bay Street,0.131148,0.065574,0.032787,0.032787,0.032787,0.032787,0.032787,0.032787,0.032787,0.016393
5,1,Church and Wellesley,0.104651,0.05814,0.046512,0.046512,0.034884,0.034884,0.034884,0.023256,0.023256,0.023256
6,1,"Commerce Court, Victoria Hotel",0.11,0.07,0.06,0.05,0.05,0.04,0.04,0.03,0.03,0.03
7,0,Davisville,0.107143,0.071429,0.071429,0.071429,0.071429,0.071429,0.071429,0.035714,0.035714,0.035714
8,1,"First Canadian Place, Underground city",0.12,0.07,0.06,0.05,0.04,0.03,0.03,0.03,0.03,0.03
9,1,"Garden District, Ryerson",0.11,0.06,0.03,0.03,0.03,0.03,0.03,0.02,0.02,0.02


We are now ready to try and analyse what separates the two clusters. Let's look at each cluster.

In [372]:
i=0
print('Neighborhoods under cluster {}'.format(i))
df_clustered[df_clustered['Cluster Label']==i].head(n=10)

Neighborhoods under cluster 0


Unnamed: 0,Cluster Label,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,0,"Brockton, Parkdale Village, Exhibition Place",Bar,Coffee Shop,Café,Restaurant,Gift Shop,Sandwich Place,Supermarket,Japanese Restaurant,Furniture / Home Store,French Restaurant
3,0,"CN Tower, King and Spadina, Railway Lands, Har...",Italian Restaurant,Coffee Shop,Gym / Fitness Center,Café,French Restaurant,Park,Restaurant,Bar,Bakery,Speakeasy
7,0,Davisville,Dessert Shop,Park,Café,Sandwich Place,Pizza Place,Italian Restaurant,Coffee Shop,Tennis Court,Seafood Restaurant,Restaurant
11,0,"Kensington Market, Chinatown, Grange Park",Café,Coffee Shop,Arts & Crafts Store,Vegetarian / Vegan Restaurant,Farmers Market,Mexican Restaurant,Gaming Cafe,Caribbean Restaurant,Vietnamese Restaurant,Grocery Store
13,0,"Parkdale, Roncesvalles",Coffee Shop,Sushi Restaurant,Bakery,Eastern European Restaurant,Restaurant,Thai Restaurant,American Restaurant,Grocery Store,Bookstore,Gift Shop
16,0,"Runnymede, Swansea",Café,Coffee Shop,Bakery,Pizza Place,Bank,Pharmacy,Dance Studio,Shoe Store,Restaurant,Pub
18,0,"St. James Town, Cabbagetown",Coffee Shop,Café,Pizza Place,Park,Pub,Italian Restaurant,Bakery,Restaurant,Grocery Store,Sandwich Place
20,0,Studio District,Pizza Place,Diner,Bakery,Italian Restaurant,Brewery,Sushi Restaurant,Gastropub,Coffee Shop,Bar,Arts & Crafts Store
21,0,"The Annex, North Midtown, Yorkville",Café,Sandwich Place,Pharmacy,Indian Restaurant,Modern European Restaurant,Burger Joint,Liquor Store,French Restaurant,Furniture / Home Store,Coffee Shop
23,0,"University of Toronto, Harbord",Café,Bakery,Coffee Shop,Japanese Restaurant,Bubble Tea Shop,Bar,Sushi Restaurant,Gym,Italian Restaurant,Restaurant


In [382]:
print('Percentage Frequencies under cluster {}'.format(i))
df_clustered_values[df_clustered_values['Cluster Label']==i].head(n=10)


In [381]:
i=1
print('Neighborhoods under cluster {}'.format(i))
df_clustered[df_clustered['Cluster Label']==i].head(n=10)

Neighborhoods under cluster 1


Unnamed: 0,Cluster Label,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Berczy Park,Coffee Shop,Seafood Restaurant,Cocktail Bar,Restaurant,Cheese Shop,Beer Bar,Breakfast Spot,Bakery,Farmers Market,Park
2,1,"Business reply mail Processing Centre, South C...",Coffee Shop,Hotel,Café,Gym,Asian Restaurant,Restaurant,Thai Restaurant,Steakhouse,Japanese Restaurant,Pizza Place
4,1,Central Bay Street,Coffee Shop,Clothing Store,Plaza,Bubble Tea Shop,Restaurant,Middle Eastern Restaurant,Sandwich Place,Hotel,Cosmetics Shop,Shoe Store
5,1,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Restaurant,Fast Food Restaurant,Café,Gay Bar,Dance Studio,Hotel,Pub
6,1,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Hotel,Café,Italian Restaurant,Japanese Restaurant,Gym,Beer Bar,American Restaurant,Deli / Bodega
8,1,"First Canadian Place, Underground city",Coffee Shop,Hotel,Café,Restaurant,Gym,Asian Restaurant,Deli / Bodega,Japanese Restaurant,Seafood Restaurant,American Restaurant
9,1,"Garden District, Ryerson",Coffee Shop,Clothing Store,Middle Eastern Restaurant,Cosmetics Shop,Café,Japanese Restaurant,Hotel,Bubble Tea Shop,Diner,Ramen Restaurant
10,1,"Harbourfront East, Union Station, Toronto Islands",Coffee Shop,Hotel,Japanese Restaurant,Aquarium,Park,Boat or Ferry,Plaza,Shopping Mall,Salad Place,Sports Bar
12,1,"Little Portugal, Trinity",Cocktail Bar,Coffee Shop,Bar,Restaurant,Vietnamese Restaurant,Asian Restaurant,Yoga Studio,New American Restaurant,Seafood Restaurant,Record Shop
14,1,"Regent Park, Harbourfront",Coffee Shop,Breakfast Spot,Yoga Studio,Theater,Spa,Food Truck,Event Space,Restaurant,Electronics Store,Pub


In [383]:
print('Percentage Frequencies under cluster {}'.format(i))
df_clustered_values[df_clustered_values['Cluster Label']==i].head(n=10)

Percentage Frequencies under cluster 1


Unnamed: 0,Cluster Label,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1,Berczy Park,0.078125,0.046875,0.046875,0.03125,0.03125,0.03125,0.03125,0.03125,0.03125,0.015625
2,1,"Business reply mail Processing Centre, South C...",0.07,0.05,0.04,0.03,0.03,0.02,0.02,0.02,0.02,0.02
4,1,Central Bay Street,0.131148,0.065574,0.032787,0.032787,0.032787,0.032787,0.032787,0.032787,0.032787,0.016393
5,1,Church and Wellesley,0.104651,0.05814,0.046512,0.046512,0.034884,0.034884,0.034884,0.023256,0.023256,0.023256
6,1,"Commerce Court, Victoria Hotel",0.11,0.07,0.06,0.05,0.05,0.04,0.04,0.03,0.03,0.03
8,1,"First Canadian Place, Underground city",0.12,0.07,0.06,0.05,0.04,0.03,0.03,0.03,0.03,0.03
9,1,"Garden District, Ryerson",0.11,0.06,0.03,0.03,0.03,0.03,0.03,0.02,0.02,0.02
10,1,"Harbourfront East, Union Station, Toronto Islands",0.125,0.083333,0.0625,0.041667,0.041667,0.041667,0.041667,0.020833,0.020833,0.020833
12,1,"Little Portugal, Trinity",0.071429,0.071429,0.071429,0.047619,0.047619,0.047619,0.02381,0.02381,0.02381,0.02381
14,1,"Regent Park, Harbourfront",0.227273,0.090909,0.045455,0.045455,0.045455,0.045455,0.045455,0.045455,0.045455,0.045455


Based on observation, we can give the two clusters the following initial characteristics/distinctions:

Cluster 1 is dominated by coffee shops (almost 10% of its venues are coffee shops), and most of the hotels are also in this cluster.

Cluster 0 offers a relatively wider range of activities, with venues such as parks, gyms, and food places.
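
These distinctions come from reading the top-10 tables by eye. One way to check them numerically is to average the category frequencies within each cluster and look at the largest gaps. The sketch below uses a hand-made frequency table whose values are illustrative, not taken from the notebook.

```python
import pandas as pd

# Illustrative per-neighbourhood category frequencies with their cluster labels
freq = pd.DataFrame({
    'Cluster Label': [1, 1, 0, 0],
    'Coffee Shop': [0.12, 0.10, 0.04, 0.06],
    'Hotel': [0.06, 0.04, 0.00, 0.00],
    'Park': [0.00, 0.02, 0.06, 0.04],
})

# Average frequency of each category within each cluster
profile = freq.groupby('Cluster Label').mean()

# Positive gap: more common in cluster 1; negative gap: more common in cluster 0
gap = (profile.loc[1] - profile.loc[0]).sort_values(ascending=False)
print(profile)
print(gap)
```

With the notebook's data, `freq` would presumably be `toronto_grouped` with the `kmeans.labels_` column attached.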

Let's now visualize these clusters on a map of Toronto.

In [418]:
i='The Beaches'
df_final[df_final['Neighbourhood']==i]['Latitude'].reset_index(drop=True)[0]

43.67709000000008

In [438]:
# attach the borough and coordinates of each clustered neighborhood
# (drop_duplicates keeps the first match per name, as the per-row lookup did)
df_vis=df_clustered[['Cluster Label','Neighbourhood']].merge(
    df_final[['Neighbourhood','Borough','Latitude','Longitude']].drop_duplicates(subset='Neighbourhood'),
    on='Neighbourhood',how='left')
df_vis

Unnamed: 0,Cluster Label,Neighbourhood,Borough,Latitude,Longitude
0,1,Berczy Park,Downtown Toronto,43.64536,-79.37306
1,0,"Brockton, Parkdale Village, Exhibition Place",West Toronto,43.63941,-79.42676
2,1,"Business reply mail Processing Centre, South C...",East Toronto,43.64869,-79.38544
3,0,"CN Tower, King and Spadina, Railway Lands, Har...",Downtown Toronto,43.64082,-79.39818
4,1,Central Bay Street,Downtown Toronto,43.65609,-79.38493
5,1,Church and Wellesley,Downtown Toronto,43.66659,-79.38133
6,1,"Commerce Court, Victoria Hotel",Downtown Toronto,43.6484,-79.37914
7,0,Davisville,Central Toronto,43.7034,-79.38659
8,1,"First Canadian Place, Underground city",Downtown Toronto,43.64828,-79.38146
9,1,"Garden District, Ryerson",Downtown Toronto,43.65739,-79.37804


In [425]:
address='Toronto,ON'

geolocator=Nominatim(user_agent="tor_explorer")
location=geolocator.geocode(address)
latitude=location.latitude
longitude=location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [440]:
map_toronto=folium.Map(location=[latitude,longitude],zoom_start=13)

# add markers to map: blue for cluster 0, red for cluster 1
cluster_colors={0:'blue',1:'red'}
for lat,lng,name,cluster in zip(df_vis['Latitude'],df_vis['Longitude'],df_vis['Neighbourhood'],df_vis['Cluster Label']):
    label = folium.Popup(name, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=cluster_colors[cluster],
        fill=True,
        fill_color=cluster_colors[cluster],
        fill_opacity=0.7).add_to(map_toronto)
map_toronto

The map above shows that cluster 0 tends to lie in the outer areas of Toronto while cluster 1 lies in the inner areas. This makes sense, since the cluster 1 neighborhoods (dominated by coffee shops and hotels) are mostly in the Downtown area.
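
The inner-versus-outer impression can be checked roughly by measuring each neighborhood's distance from the city-centre coordinates printed earlier. The sketch below applies the haversine formula to four rows copied from the 'df_vis' table; with so few points it is only an illustration of the check, not a result.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlng = radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

centre = (43.6534817, -79.3839347)  # Toronto coordinates printed earlier

# name -> (cluster label, latitude, longitude), copied from the df_vis rows shown earlier
points = {
    'Berczy Park': (1, 43.64536, -79.37306),
    'Central Bay Street': (1, 43.65609, -79.38493),
    'Davisville': (0, 43.7034, -79.38659),
    'Brockton, Parkdale Village, Exhibition Place': (0, 43.63941, -79.42676),
}

# Collect distances from the centre per cluster, then average them
by_cluster = {}
for name, (cluster, lat, lng) in points.items():
    by_cluster.setdefault(cluster, []).append(haversine_km(centre[0], centre[1], lat, lng))

mean_km = {cluster: sum(d) / len(d) for cluster, d in by_cluster.items()}
for cluster in sorted(mean_km):
    print(cluster, round(mean_km[cluster], 2))
```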