 # Segmenting and Clustering Neighborhoods in Toronto!

First, I'm importing everything I will need for this assignment.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import json
from geopy.geocoders import Nominatim
import requests
from pandas.io.json import json_normalize
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
from sklearn.cluster import MeanShift
import folium


Now, I get the data from Wikipedia using using BeautifulSoup

In [2]:
req = requests.get("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050")

soup = BeautifulSoup(req.content,'lxml')

table = soup.find_all('table')[0]

df = pd.read_html(str(table))

nbh=pd.DataFrame(df[0])

In [3]:
nbh.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


Notice that the last column is called NeighboUrhood. So, to make things easier ahead, I decided to rename it to Neighborhood, without the "U".

In [4]:
nbh.rename(columns={"Neighbourhood": "Neighborhood"}, inplace=True)
nbh.head()

Unnamed: 0,Postcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


 Let's see how many rows and columns exists in this table before processing it's data.

In [6]:
nbh.shape

(287, 3)

 Now i need to process the data, removing the rows that contains "Not assigned" in the Borough column and concatenating the neighborhoods from the same borough.

 First, let's remove the "Not Assigned" rows.

In [7]:
nbh.set_index('Borough', inplace=True) 
nbh.drop('Not assigned', axis=0, inplace=True)
nbh.reset_index(inplace=True)
nbh.head()



Unnamed: 0,Borough,Postcode,Neighborhood
0,North York,M3A,Parkwoods
1,North York,M4A,Victoria Village
2,Downtown Toronto,M5A,Harbourfront
3,North York,M6A,Lawrence Heights
4,North York,M6A,Lawrence Manor


In [8]:
nbh = nbh[['Postcode','Borough','Neighborhood']]
nbh.sort_values(by='Postcode', inplace=True)
nbh.head()


Unnamed: 0,Postcode,Borough,Neighborhood
7,M1B,Scarborough,Rouge
8,M1B,Scarborough,Malvern
22,M1C,Scarborough,Port Union
21,M1C,Scarborough,Rouge Hill
20,M1C,Scarborough,Highland Creek


In [9]:
nbh.shape


(210, 3)

 Now, let's concatenate the neighborhoods from the same Borough.

In [10]:
nbh2=nbh.groupby(['Postcode','Borough']).apply(lambda x: ','.join(x['Neighborhood']))
nbh2 = nbh2.reset_index()
nbh2.columns = ['Postcode','Borough','Neighborhood']
nbh2.head(10)



Unnamed: 0,Postcode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park,Ionview,Kennedy Park"
7,M1L,Scarborough,"Golden Mile,Oakridge,Clairlea"
8,M1M,Scarborough,"Cliffcrest,Scarborough Village West,Cliffside"
9,M1N,Scarborough,"Cliffside West,Birch Cliff"


In [11]:
nbh2.shape

(103, 3)

Now, I get the latitude and longitude data from the provided CSV file and assign it to the Post Codes from the neighborhood dataframe.

In [12]:
geodata = pd.read_csv('http://cocl.us/Geospatial_data')
geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
geodata.shape

(103, 3)

In the next step, I create two empty columns in my nbh2 pandas dataframe to populate with the latitude and longitude data later on.

In [14]:
nbh2['Latitude']=""
nbh2['Longitude']=""
nbh2.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",,
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",,
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",,
3,M1G,Scarborough,Woburn,,
4,M1H,Scarborough,Cedarbrae,,


In [15]:
for i in range(len(nbh2['Postcode'])):
    for j in range(len(nbh2['Postcode'])):
        if nbh2.loc[i,'Postcode']==geodata.loc[j,'Postal Code']:
            nbh2.loc[i,'Latitude']=geodata.loc[j, 'Latitude']
            nbh2.loc[i,'Longitude']=geodata.loc[j,'Longitude']

In [16]:
nbh2.tail()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude
98,M9N,York,Weston,43.7069,-79.5182
99,M9P,Etobicoke,Westmount,43.6963,-79.5322
100,M9R,Etobicoke,"Richview Gardens,Kingsview Village,St. Phillip...",43.6889,-79.5547
101,M9V,Etobicoke,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",43.7394,-79.5884
102,M9W,Etobicoke,Northwest,43.7067,-79.5941


In [17]:
nbh2.shape

(103, 5)

Checking the type of the data in my dataframe. Notice that the Latitude and Longitude data are of the object type.

In [18]:
nbh2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Postcode      103 non-null    object
 1   Borough       103 non-null    object
 2   Neighborhood  103 non-null    object
 3   Latitude      103 non-null    object
 4   Longitude     103 non-null    object
dtypes: object(5)
memory usage: 2.1+ KB


So, to work with the data and plot the neighborhoods on a map, i transform the types of the Latitude and Longitude columns to float using the "to_numeric()" function.

In [19]:
pd.to_numeric(nbh2['Latitude'])
pd.to_numeric(nbh2['Longitude'])

0     -79.194353
1     -79.160497
2     -79.188711
3     -79.216917
4     -79.239476
         ...    
98    -79.518188
99    -79.532242
100   -79.554724
101   -79.588437
102   -79.594054
Name: Longitude, Length: 103, dtype: float64

And now I plot in the map using folium.

In [20]:
address = 'Toronto, ON, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Toronto are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Toronto are 43.6534817, -79.3839347.


In [22]:
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=9)

for lat, lng, borough, neighborhood in zip(nbh2['Latitude'], nbh2['Longitude'], nbh2['Borough'], nbh2['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(toronto_map)  
    
toronto_map

The next step is to define the credentials to Foursquare and get the nearby venues for the neighborhoods in the dataframe.

In [23]:
CLIENT_ID = 'NC4TRHBM4MGO1PLWOCEBEIRLU2KQFISD00NISNDOYE3PVHWQ'
CLIENT_SECRET = 'JVVDOMNE4PPSOM2MCLRBVM1IAEBCFE4O2D13UYX5IZ1FDZJS'
VERSION = '20200416'

In [24]:
LIMIT = 100

In [25]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)            
        
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
                    
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [32]:
Toronto_venues = getNearbyVenues(names=nbh2['Neighborhood'],
latitudes=nbh2['Latitude'],
longitudes=nbh2['Longitude'])

Rouge,Malvern
Port Union,Rouge Hill,Highland Creek
Guildwood,Morningside,West Hill
Woburn
Cedarbrae
Scarborough Village
East Birchmount Park,Ionview,Kennedy Park
Golden Mile,Oakridge,Clairlea
Cliffcrest,Scarborough Village West,Cliffside
Cliffside West,Birch Cliff
Wexford Heights,Dorset Park,Scarborough Town Centre
Maryvale,Wexford
Agincourt
Sullivan,Clarks Corners,Tam O'Shanter
Steeles East,Milliken,L'Amoreaux East,Agincourt North
L'Amoreaux West
Upper Rouge
Hillcrest Village
Henry Farm,Fairview,Oriole
Bayview Village
Silver Hills,York Mills
Willowdale,Newtonbrook
Willowdale South
York Mills West
Willowdale West
Parkwoods
Don Mills North
Don Mills South,Flemingdon Park
Bathurst Manor,Wilson Heights,Downsview North
Northwood Park,York University
Downsview East,CFB Toronto
Downsview West
Downsview Central
Downsview Northwest
Victoria Village
Woodbine Gardens,Parkview Hill
Woodbine Heights
The Beaches
Leaside
Thorncliffe Park
East Toronto
Riverdale,The Danforth West
The Beaches West,Indi

In [33]:
print(Toronto_venues.shape)
Toronto_venues.head()

(2115, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Rouge,Malvern",43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,"Port Union,Rouge Hill,Highland Creek",43.784535,-79.160497,RIGHT WAY TO GOLF,43.785177,-79.161108,Golf Course
2,"Port Union,Rouge Hill,Highland Creek",43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,"Guildwood,Morningside,West Hill",43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
4,"Guildwood,Morningside,West Hill",43.763573,-79.188711,Big Bite Burrito,43.766299,-79.19072,Mexican Restaurant


Let's see how many venues there are for each neighborhood.

In [34]:
Toronto_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide,Richmond,King",93,93,93,93,93,93
Agincourt,4,4,4,4,4,4
"Albion Gardens,Beaumond Heights,Humbergate,Jamestown,Mount Olive,South Steeles,Thistletown,Silverstone",10,10,10,10,10,10
"Bathurst Manor,Wilson Heights,Downsview North",21,21,21,21,21,21
Bayview Village,4,4,4,4,4,4
...,...,...,...,...,...,...
Woburn,4,4,4,4,4,4
"Woodbine Gardens,Parkview Hill",11,11,11,11,11,11
Woodbine Heights,10,10,10,10,10,10
York Mills West,3,3,3,3,3,3


And how many unique categories of venues there are.

In [35]:
print('There are {} uniques categories.'.format(len(Toronto_venues['Venue Category'].unique())))


There are 267 uniques categories.


The next step is to create a one hot dataframe with the Venue Categories so i can work with them later on.

In [36]:
Toronto_onehot = pd.get_dummies(Toronto_venues[['Venue Category']], prefix="", prefix_sep="")

Toronto_onehot['Neighborhood'] = Toronto_venues['Neighborhood'] 

Toronto_onehot.head()

Unnamed: 0,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
Toronto_onehot.shape

(2115, 267)

I notice that the Neighborhood column is not the first one in my dataframe, so in the next steps i locate it and move it to the first position.

In [38]:
neig_loc = Toronto_onehot.columns.get_loc('Neighborhood')
print(neig_loc)

189


In [39]:
fixed_columns = [Toronto_onehot.columns[neig_loc]] + list(Toronto_onehot.drop('Neighborhood',axis=1).columns)
Toronto_onehot = Toronto_onehot[fixed_columns]


And this is how the dataframe is after adjusting it.

In [40]:
print(Toronto_onehot.columns[:5])
print(Toronto_onehot.shape)

Index(['Neighborhood', 'Accessories Store', 'Airport', 'Airport Food Court',
       'Airport Lounge'],
      dtype='object')
(2115, 267)


And let's create a new dataframe by grouping the one hot df by neighborhood considering it's mean.

In [41]:
Toronto_grouped = Toronto_onehot.groupby('Neighborhood').mean().reset_index()
Toronto_grouped

Unnamed: 0,Neighborhood,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide,Richmond,King",0.0,0.0,0.0,0.0,0.0,0.0,0.032258,0.0,0.0,...,0.010753,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.010753,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
2,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
3,"Bathurst Manor,Wilson Heights,Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
4,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
94,Woburn,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
95,"Woodbine Gardens,Parkview Hill",0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
96,Woodbine Heights,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.100000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0
97,York Mills West,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,...,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0


Checking the number of neighborhoods in the original dataframe (called nbh2) and in the toronto_venues one, it's seen that there are more neighborhoods in the original. That means that there must be a few neighborhoods that did not get any venues from Foursquare. Let's check which neighborhoods had no venues.

In [43]:
print(nbh2['Neighborhood'].count())
print(Toronto_venues['Neighborhood'].drop_duplicates().count())

103
99


In [44]:
for hood in nbh2['Neighborhood']:
    try:
        Toronto_venues.set_index('Neighborhood').loc[hood]
    except:
        print('The neighborhood "{}" does not have any venues located on FourSquare'.format(hood))

The neighborhood "Upper Rouge" does not have any venues located on FourSquare
The neighborhood "Willowdale,Newtonbrook" does not have any venues located on FourSquare
The neighborhood "Islington Avenue" does not have any venues located on FourSquare
The neighborhood "Cloverdale,West Deane Park,Princess Gardens,Martin Grove,Islington" does not have any venues located on FourSquare


Now, I'm creating a new dataframe with the top 10 venues from each neighborhood.

In [46]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [47]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = Toronto_grouped['Neighborhood']

for ind in np.arange(Toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(Toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,Richmond,King",Coffee Shop,Café,Restaurant,Thai Restaurant,Hotel,American Restaurant,Gym,Clothing Store,Deli / Bodega,Pizza Place
1,Agincourt,Latin American Restaurant,Lounge,Breakfast Spot,Chinese Restaurant,Drugstore,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
2,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Grocery Store,Coffee Shop,Fast Food Restaurant,Beer Store,Sandwich Place,Fried Chicken Joint,Liquor Store,Pharmacy,Pizza Place,Construction & Landscaping
3,"Bathurst Manor,Wilson Heights,Downsview North",Coffee Shop,Bank,Frozen Yogurt Shop,Bridal Shop,Sandwich Place,Diner,Restaurant,Deli / Bodega,Supermarket,Ice Cream Shop
4,Bayview Village,Japanese Restaurant,Café,Bank,Chinese Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio


With all of that done, it's time to cluster the neighborhoods. For that, i'm using KMeans with a number of clusters iquals to 5.

In [48]:
kclusters = 5

Toronto_grouped_clustering = Toronto_grouped.drop('Neighborhood', 1)

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_grouped_clustering)

kmeans.labels_[0:10] 

array([4, 0, 4, 4, 4, 4, 4, 0, 1, 4])

Now, let's merge the fitted dataframe with the sorted neighborhood venues dataframe and add the cluster labels.

In [49]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

Toronto_merged = nbh2

Toronto_merged = Toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

Toronto_merged.head()


Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.8067,-79.1944,0.0,Fast Food Restaurant,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",43.7845,-79.1605,0.0,Golf Course,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.7636,-79.1887,0.0,Medical Center,Intersection,Rental Car Location,Breakfast Spot,Electronics Store,Mexican Restaurant,Bank,Distribution Center,Dog Run,Doner Restaurant
3,M1G,Scarborough,Woburn,43.771,-79.2169,4.0,Coffee Shop,Korean Restaurant,Convenience Store,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
4,M1H,Scarborough,Cedarbrae,43.7731,-79.2395,0.0,Caribbean Restaurant,Bakery,Fried Chicken Joint,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Hakka Restaurant,Eastern European Restaurant,Dumpling Restaurant


In [50]:
Toronto_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Postcode                103 non-null    object 
 1   Borough                 103 non-null    object 
 2   Neighborhood            103 non-null    object 
 3   Latitude                103 non-null    object 
 4   Longitude               103 non-null    object 
 5   Cluster Labels          99 non-null     float64
 6   1st Most Common Venue   99 non-null     object 
 7   2nd Most Common Venue   99 non-null     object 
 8   3rd Most Common Venue   99 non-null     object 
 9   4th Most Common Venue   99 non-null     object 
 10  5th Most Common Venue   99 non-null     object 
 11  6th Most Common Venue   99 non-null     object 
 12  7th Most Common Venue   99 non-null     object 
 13  8th Most Common Venue   99 non-null     object 
 14  9th Most Common Venue   99 non-null     ob

Notice that there are 103 rows in this dataframe, since there were 103 neighborhoods, however there are only 99 lines of venues. To fix this, I'm droping the rows wiith no venues using the dropna function.

In [51]:
Toronto_merged.dropna(inplace = True)

Also, the 'Cluster Labels' column is of the float dtype, so it can't be used in the folium parameters later, so I'm converting it to an integer type.

In [52]:
Toronto_merged['Cluster Labels'] = Toronto_merged['Cluster Labels'].astype(int)
Toronto_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 0 to 102
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Postcode                99 non-null     object
 1   Borough                 99 non-null     object
 2   Neighborhood            99 non-null     object
 3   Latitude                99 non-null     object
 4   Longitude               99 non-null     object
 5   Cluster Labels          99 non-null     int32 
 6   1st Most Common Venue   99 non-null     object
 7   2nd Most Common Venue   99 non-null     object
 8   3rd Most Common Venue   99 non-null     object
 9   4th Most Common Venue   99 non-null     object
 10  5th Most Common Venue   99 non-null     object
 11  6th Most Common Venue   99 non-null     object
 12  7th Most Common Venue   99 non-null     object
 13  8th Most Common Venue   99 non-null     object
 14  9th Most Common Venue   99 non-null     object
 15  10th Mo

Now, let's plot the clustered neighborhood on the map using folium.

In [53]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged['Latitude'], Toronto_merged['Longitude'], Toronto_merged['Neighborhood'], Toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters

And let's check what each of the clusters look like.

In [54]:
cluster0 = Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 0, Toronto_merged.columns[[1] + [2] + list(range(5, Toronto_merged.shape[1]))]]
print(cluster0.shape)
cluster0.head()

(29, 13)


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,"Rouge,Malvern",0,Fast Food Restaurant,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
1,Scarborough,"Port Union,Rouge Hill,Highland Creek",0,Golf Course,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
2,Scarborough,"Guildwood,Morningside,West Hill",0,Medical Center,Intersection,Rental Car Location,Breakfast Spot,Electronics Store,Mexican Restaurant,Bank,Distribution Center,Dog Run,Doner Restaurant
4,Scarborough,Cedarbrae,0,Caribbean Restaurant,Bakery,Fried Chicken Joint,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Hakka Restaurant,Eastern European Restaurant,Dumpling Restaurant
7,Scarborough,"Golden Mile,Oakridge,Clairlea",0,Bakery,Bus Line,Ice Cream Shop,Metro Station,Bus Station,Intersection,Soccer Field,Park,Dumpling Restaurant,Drugstore


In [56]:
cluster1 = Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 1, Toronto_merged.columns[[1] + [2] + list(range(5, Toronto_merged.shape[1]))]]
print(cluster1.shape)
cluster1.head()

(9, 13)


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
23,North York,York Mills West,1,Park,Bank,Convenience Store,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
25,North York,Parkwoods,1,Park,Food & Drink Shop,Dessert Shop,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore,Donut Shop
30,North York,"Downsview East,CFB Toronto",1,Park,Airport,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
31,North York,Downsview West,1,Grocery Store,Bank,Park,Comfort Food Restaurant,Diner,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant
40,East York,East Toronto,1,Park,Convenience Store,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant


In [55]:
cluster2 = Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 2, Toronto_merged.columns[[1] + [2] + list(range(5, Toronto_merged.shape[1]))]]
print(cluster2.shape)
cluster2.head()

(3, 13)


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
32,North York,Downsview Central,2,Food Truck,Korean Restaurant,Baseball Field,Yoga Studio,Drugstore,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
91,Etobicoke,"Old Mill South,King's Mill Park,Humber Bay,The...",2,Baseball Field,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Fast Food Restaurant
97,North York,"Humberlea,Emery",2,Baseball Field,Food Service,Yoga Studio,Donut Shop,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore,Farmers Market


In [57]:
cluster3 = Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 3, Toronto_merged.columns[[1] + [2] + list(range(5, Toronto_merged.shape[1]))]]
print(cluster3.shape)
cluster3.head()

(3, 13)


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Scarborough,Scarborough Village,3,Playground,Health & Beauty Service,Yoga Studio,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
14,Scarborough,"Steeles East,Milliken,L'Amoreaux East,Agincour...",3,Park,Playground,Drugstore,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
48,Central Toronto,"Summerhill East,Moore Park",3,Playground,Yoga Studio,Donut Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Drugstore


In [58]:
cluster4 = Toronto_merged.loc[Toronto_merged['Cluster Labels'] == 4, Toronto_merged.columns[[1] + [2] + list(range(5, Toronto_merged.shape[1]))]]
print(cluster4.shape)
cluster4.head()

(55, 13)


Unnamed: 0,Borough,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Scarborough,Woburn,4,Coffee Shop,Korean Restaurant,Convenience Store,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
6,Scarborough,"East Birchmount Park,Ionview,Kennedy Park",4,Department Store,Train Station,Discount Store,Coffee Shop,Chinese Restaurant,Dim Sum Restaurant,Diner,Distribution Center,Dog Run,Doner Restaurant
9,Scarborough,"Cliffside West,Birch Cliff",4,Café,General Entertainment,Skating Rink,College Stadium,Concert Hall,Construction & Landscaping,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Comfort Food Restaurant
13,Scarborough,"Sullivan,Clarks Corners,Tam O'Shanter",4,Pharmacy,Pizza Place,Italian Restaurant,Bank,Noodle House,Intersection,Fast Food Restaurant,Gas Station,Coffee Shop,Chinese Restaurant
15,Scarborough,L'Amoreaux West,4,Chinese Restaurant,Fast Food Restaurant,Coffee Shop,Gym,Grocery Store,Breakfast Spot,Pharmacy,Pizza Place,Supermarket,Bank


Looking at the clusters, it seems to me that the clustering wasn't very good since there is one cluster with 55 rows, another one with 29 and the rest with very few. 
The ones with few rows share many of the venues between one another, however the ones with many rows don't seem to have a clear division.

Based on that, I decided to try and cluster them again, but using MeanShift to let the algorithm decide the number of clusters and check if it's any better.

In [59]:
ms = MeanShift(max_iter = 10000)
ms.fit(Toronto_grouped_clustering)

MeanShift(bandwidth=None, bin_seeding=False, cluster_all=True, max_iter=10000,
          min_bin_freq=1, n_jobs=None, seeds=None)

In [60]:
ms.labels_

array([ 0, 13,  0,  0,  0,  0,  0,  0, 26,  0,  0,  0,  0,  0,  4, 16,  0,
        0,  0,  0,  0,  0, 11,  3, 18,  6,  5,  0,  0,  1,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0, 19, 20, 10, 22,  0,  0,  0,  0,  0, 14,  0,
        0,  0,  0,  0,  0,  8,  0,  0, 27,  9, 23,  7,  0,  0,  0,  0,  2,
       24, 21,  0, 25, 15,  0,  0,  0,  0,  2,  0,  0,  0, 28,  0,  0, 12,
        0,  0,  0,  0,  0, 17,  0,  0,  0,  0,  0,  0,  1,  0],
      dtype=int32)

As it's seen above, there are many more labels this time. So, let's get the merged dataframe again to see how the clusters look.

In [61]:
neighborhoods_venues_sorted2 = neighborhoods_venues_sorted.copy()
neighborhoods_venues_sorted2.drop('Cluster Labels', axis=1, inplace=True)
neighborhoods_venues_sorted2.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,Richmond,King",Coffee Shop,Café,Restaurant,Thai Restaurant,Hotel,American Restaurant,Gym,Clothing Store,Deli / Bodega,Pizza Place
1,Agincourt,Latin American Restaurant,Lounge,Breakfast Spot,Chinese Restaurant,Drugstore,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Eastern European Restaurant
2,"Albion Gardens,Beaumond Heights,Humbergate,Jam...",Grocery Store,Coffee Shop,Fast Food Restaurant,Beer Store,Sandwich Place,Fried Chicken Joint,Liquor Store,Pharmacy,Pizza Place,Construction & Landscaping
3,"Bathurst Manor,Wilson Heights,Downsview North",Coffee Shop,Bank,Frozen Yogurt Shop,Bridal Shop,Sandwich Place,Diner,Restaurant,Deli / Bodega,Supermarket,Ice Cream Shop
4,Bayview Village,Japanese Restaurant,Café,Bank,Chinese Restaurant,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio


In [62]:
neighborhoods_venues_sorted2.insert(0, 'Cluster Labels', ms.labels_)

Toronto_merged2 = nbh2

Toronto_merged2 = Toronto_merged2.join(neighborhoods_venues_sorted2.set_index('Neighborhood'), on='Neighborhood')

Toronto_merged2.head()

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge,Malvern",43.8067,-79.1944,21.0,Fast Food Restaurant,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Empanada Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Drugstore
1,M1C,Scarborough,"Port Union,Rouge Hill,Highland Creek",43.7845,-79.1605,7.0,Golf Course,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.7636,-79.1887,0.0,Medical Center,Intersection,Rental Car Location,Breakfast Spot,Electronics Store,Mexican Restaurant,Bank,Distribution Center,Dog Run,Doner Restaurant
3,M1G,Scarborough,Woburn,43.771,-79.2169,0.0,Coffee Shop,Korean Restaurant,Convenience Store,Yoga Studio,Drugstore,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Donut Shop
4,M1H,Scarborough,Cedarbrae,43.7731,-79.2395,0.0,Caribbean Restaurant,Bakery,Fried Chicken Joint,Thai Restaurant,Athletics & Sports,Gas Station,Bank,Hakka Restaurant,Eastern European Restaurant,Dumpling Restaurant


In [63]:
Toronto_merged2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103 entries, 0 to 102
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Postcode                103 non-null    object 
 1   Borough                 103 non-null    object 
 2   Neighborhood            103 non-null    object 
 3   Latitude                103 non-null    object 
 4   Longitude               103 non-null    object 
 5   Cluster Labels          99 non-null     float64
 6   1st Most Common Venue   99 non-null     object 
 7   2nd Most Common Venue   99 non-null     object 
 8   3rd Most Common Venue   99 non-null     object 
 9   4th Most Common Venue   99 non-null     object 
 10  5th Most Common Venue   99 non-null     object 
 11  6th Most Common Venue   99 non-null     object 
 12  7th Most Common Venue   99 non-null     object 
 13  8th Most Common Venue   99 non-null     object 
 14  9th Most Common Venue   99 non-null     ob

In [64]:
Toronto_merged2.dropna(inplace = True)

In [65]:
Toronto_merged2['Cluster Labels'] = Toronto_merged2['Cluster Labels'].astype(int)
Toronto_merged2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 99 entries, 0 to 102
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Postcode                99 non-null     object
 1   Borough                 99 non-null     object
 2   Neighborhood            99 non-null     object
 3   Latitude                99 non-null     object
 4   Longitude               99 non-null     object
 5   Cluster Labels          99 non-null     int32 
 6   1st Most Common Venue   99 non-null     object
 7   2nd Most Common Venue   99 non-null     object
 8   3rd Most Common Venue   99 non-null     object
 9   4th Most Common Venue   99 non-null     object
 10  5th Most Common Venue   99 non-null     object
 11  6th Most Common Venue   99 non-null     object
 12  7th Most Common Venue   99 non-null     object
 13  8th Most Common Venue   99 non-null     object
 14  9th Most Common Venue   99 non-null     object
 15  10th Mo

In [66]:
map_clusters2 = folium.Map(location=[latitude, longitude], zoom_start=10)

x = np.arange(neighborhoods_venues_sorted2['Cluster Labels'].drop_duplicates().count())
ys = [i + x + (i*x)**2 for i in range(len(x))]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_merged2['Latitude'], Toronto_merged2['Longitude'], Toronto_merged2['Neighborhood'], Toronto_merged2['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters2)

map_clusters2

Now, looking at the map above and the labels that the clustering algorithm made, seems to me that it's not better then the previous one.
The majority of the neighborhood fell under the label 0 and, again, the logic for this is not clear to me.

With that, I conclude that this was not a particularly usefull method of exploring this data. Maybe if there were more information about other aspects of the venues it could lead to better results.