# Clustering neighborhoods in Toronto on the basis of venues found on Foursquare

Import the relevant packages:

In [1]:
import pandas as pd
import numpy as np

## Data import

Scrape the list of Canadian postal codes beginning with M (the Toronto area) from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [2]:
scrape = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

Check import:

In [3]:
scrape[0].head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [4]:
scrape[0].tail()

Unnamed: 0,0,1,2
285,M8Z,Etobicoke,Mimico NW
286,M8Z,Etobicoke,The Queensway West
287,M8Z,Etobicoke,Royal York South West
288,M8Z,Etobicoke,South of Bloor
289,M9Z,Not assigned,Not assigned


## Data wrangling

Some wrangling is necessary: the first row is promoted to column labels, and postal codes without an assigned borough are dropped.

In [5]:
df_postals = scrape[0].copy()
# Promote the first row to column labels
df_postals.columns = df_postals.iloc[0].values
# Mark unassigned boroughs as missing and blank out unassigned neighbourhoods
df_postals['Borough'] = df_postals['Borough'].replace('Not assigned', np.nan)
df_postals['Neighbourhood'] = df_postals['Neighbourhood'].replace('Not assigned', '')
# Drop rows without a borough, then the former header row (index 0)
df_postals = df_postals.dropna(subset=['Borough']).drop(index=0)
df_postals = df_postals.reset_index(drop=True)

Check the resulting data frame:

In [6]:
df_postals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 212 entries, 0 to 211
Data columns (total 3 columns):
Postcode         212 non-null object
Borough          212 non-null object
Neighbourhood    212 non-null object
dtypes: object(3)
memory usage: 5.0+ KB


In [7]:
df_postals.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [8]:
df_postals.describe()

Unnamed: 0,Postcode,Borough,Neighbourhood
count,212,212,212
unique,103,11,210
top,M8Y,Etobicoke,St. James Town
freq,8,45,2


It is still necessary to group the different neighbourhoods with the same postal code together:

In [9]:
df_postals = df_postals.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(lambda x: ', '.join(x)).to_frame().reset_index()

In [10]:
df_postals.shape

(103, 3)
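
As a small illustration of the grouping step, the following sketch applies the same `groupby` and `join` pattern to three rows copied from the head of the table above. The names `toy` and `toy_grouped` are introduced only for this example.

```python
import pandas as pd

# Three rows from the cleaned table: M5A appears twice, M3A once
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# One row per (Postcode, Borough); neighbourhood names joined by ', '
toy_grouped = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .apply(', '.join)
       .reset_index()
)
print(toy_grouped)
```

The two M5A rows collapse into a single row whose neighbourhood field holds both names.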

## Getting geographical coordinates for the postcodes

Necessary imports:

In [11]:
import geocoder

The coordinates could in principle be fetched with `geocoder` in a while loop. In my case the loop hung for more than five minutes without returning anything, so I am reading the coordinates from a file instead.
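
The lookup loop itself is not reproduced in this notebook. One way to avoid a loop that never terminates is to cap the number of attempts. The sketch below is an assumption about how such a loop might look, not the code that was actually run: `fetch_latlng`, `max_tries` and `pause` are hypothetical names, and the lookup function is passed in so the retry logic can be tried without network access.

```python
import time

def fetch_latlng(postal_code, lookup, max_tries=5, pause=1.0):
    """Return [lat, lng] for a postal code, or None after max_tries failures.

    `lookup` is any callable that takes a query string and returns
    coordinates, or a falsy value when the lookup fails.
    """
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_tries):
        coords = lookup(query)
        if coords:
            return list(coords)
        time.sleep(pause)  # brief pause before the next attempt
    return None

# Possible use with the geocoder package (assumption: the provider responds
# and `.latlng` is falsy when the lookup fails):
#   import geocoder
#   coords = fetch_latlng('M5A', lambda q: geocoder.google(q).latlng)
```

With a cap in place, a postcode that cannot be resolved yields `None` instead of blocking the notebook.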

Read in file:

In [12]:
df_latlong = pd.read_csv('https://cocl.us/Geospatial_data')

In [13]:
df_latlong.shape

(103, 3)

In [14]:
df_latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [15]:
df_latlong.set_index('Postal Code', inplace=True)
df_latlong.head()

Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476


In [16]:
latlongval = df_latlong.loc[df_postals.Postcode].values

In [17]:
df_postals['Latitude'] = latlongval[:,0]
df_postals['Longitude'] = latlongval[:,1]
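
The lookup above relies on every postcode being present in the coordinate file. As an alternative, a key-based merge makes the match explicit. This is a sketch on toy frames (`postals`, `coords` and `merged` are illustrative names), using two coordinate rows taken from the file shown above.

```python
import pandas as pd

postals = pd.DataFrame({'Postcode': ['M1B', 'M1C'],
                        'Borough': ['Scarborough', 'Scarborough']})
# Deliberately in a different row order than `postals`
coords = pd.DataFrame({'Postal Code': ['M1C', 'M1B'],
                       'Latitude': [43.784535, 43.806686],
                       'Longitude': [-79.160497, -79.194353]})

# Left merge on the postcode; validate raises if a key is duplicated
merged = (
    postals.merge(coords, left_on='Postcode', right_on='Postal Code',
                  how='left', validate='one_to_one')
           .drop(columns='Postal Code')
)
print(merged)
```

A postcode missing from the coordinate file would show up as `NaN` in the merged frame rather than raising an error.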

In [18]:
df_postals.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


## Visualise the neighbourhoods on a map

Necessary imports:

In [19]:
import folium

In [20]:
tor_lat = 43.66135
tor_long = -79.383087
map_toronto = folium.Map(location=[tor_lat, tor_long], zoom_start=11)

# Draw all neighbourhoods in boroughs without 'Toronto' in their name in blue:
for lat, lng, label in df_postals[df_postals.Borough.map(lambda x: x.count('Toronto')) == 0][['Latitude','Longitude','Neighbourhood']].values:
    label = folium.Popup(html=label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
# Draw all neighbourhoods in boroughs with 'Toronto' in their name in red:
for lat, lng, label in df_postals[df_postals.Borough.map(lambda x: x.count('Toronto')) != 0][['Latitude','Longitude','Neighbourhood']].values:
    label = folium.Popup(html=label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='lightcoral',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

## Load top 100 venues in each neighbourhood for all central boroughs in Toronto

Import relevant packages

In [21]:
import requests, json
import pickle

Check the borough names manually:

In [22]:
df_postals.Borough.unique()

array(['Scarborough', 'North York', 'East York', 'East Toronto',
       'Central Toronto', 'Downtown Toronto', 'York', 'West Toronto',
       "Queen's Park", 'Mississauga', 'Etobicoke'], dtype=object)

Manually create a list of all central boroughs. One could also use `df_postals.Borough.map(lambda x: x.count('Toronto')) != 0` to select all boroughs with 'Toronto' in their name, but this would exclude the central Queen's Park.

In [23]:
tor_borough = ['East Toronto', 'Central Toronto', 'Downtown Toronto', 'West Toronto', "Queen's Park"]

Create new dataframe that includes only the relevant rows

In [24]:
df_toronto = df_postals[df_postals.Borough.isin(tor_borough)]

Load venue data from Foursquare:

In [25]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

url = 'https://api.foursquare.com/v2/venues/explore'

# Query up to 100 venues within 500 m of each postcode's coordinates
result_list = []
for lat, lng in df_toronto[['Latitude','Longitude']].values:
    params = dict(
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
        v=VERSION,
        ll='{},{}'.format(lat, lng),
        radius=500,
        limit=100
    )
    result_list.append(requests.get(url=url, params=params).json())


Possibility to store/load data locally:
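
The storing and loading code is not shown in this notebook. A minimal sketch with `pickle` could look as follows, assuming the results are a plain list of dictionaries like `result_list`. The helper names and the file name are hypothetical.

```python
import os
import pickle
import tempfile

def save_results(results, path):
    """Write the raw API responses to disk."""
    with open(path, 'wb') as fh:
        pickle.dump(results, fh)

def load_results(path):
    """Read previously stored API responses back into memory."""
    with open(path, 'rb') as fh:
        return pickle.load(fh)

# Round trip with a small stand-in for result_list
example = [{'response': {'groups': [{'items': []}]}}]
cache_path = os.path.join(tempfile.gettempdir(), 'foursquare_results_example.pkl')
save_results(example, cache_path)
restored = load_results(cache_path)
```

Caching the responses this way avoids repeating the API calls each time the notebook is rerun.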

Create a dataframe df_venues with all venues:

In [26]:
venue_list = []
for ind, neigh_venues in enumerate(result_list):
    items = neigh_venues['response']['groups'][0]['items']
    for item in items:
        row = []
        row.append(df_toronto.iloc[ind, 0])
        row.append(df_toronto.iloc[ind, 1])
        row.append(df_toronto.iloc[ind, 2])
        row.append(df_toronto.iloc[ind, 3])
        row.append(df_toronto.iloc[ind, 4])
        row.append(item['venue']['name'])
        row.append(item['venue']['categories'][0]['name'])
        row.append(item['venue']['location']['lat'])
        row.append(item['venue']['location']['lng'])
        venue_list.append(row)
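
The loop above assumes that every response contains a group and that every venue has at least one category. A more defensive variant might look like the sketch below. `venue_rows` and the `'Unknown'` fallback label are assumptions made for this illustration, and the response layout is taken only from the keys used in the loop above.

```python
def venue_rows(response):
    """Yield (name, category, lat, lng) tuples from one explore response."""
    groups = response.get('response', {}).get('groups', [])
    if not groups:
        return  # nothing to yield for an empty or failed response
    for item in groups[0].get('items', []):
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to a placeholder when a venue has no category
        category = categories[0]['name'] if categories else 'Unknown'
        yield (venue['name'], category,
               venue['location']['lat'], venue['location']['lng'])

# Hand-made response with one categorised and one uncategorised venue
fake_response = {'response': {'groups': [{'items': [
    {'venue': {'name': 'A', 'categories': [{'name': 'Pub'}],
               'location': {'lat': 1.0, 'lng': 2.0}}},
    {'venue': {'name': 'B', 'categories': [],
               'location': {'lat': 3.0, 'lng': 4.0}}},
]}]}}
rows = list(venue_rows(fake_response))
```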

In [27]:
ven_columns = ['Postcode', 'Borough', 'Neighbourhood', 'Neighbourhood_Latitude', 'Neighbourhood_Longitude', 'Venue_Name', 'Venue_Category', 'Venue_Latitude', 'Venue_Longitude']
df_venues = pd.DataFrame(venue_list, columns=ven_columns)

Check the dataframe:

In [28]:
df_venues.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Neighbourhood_Latitude,Neighbourhood_Longitude,Venue_Name,Venue_Category,Venue_Latitude,Venue_Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,Grover Pub and Grub,Pub,43.679181,-79.297215
1,M4E,East Toronto,The Beaches,43.676357,-79.293031,Starbucks,Coffee Shop,43.678798,-79.298045
2,M4E,East Toronto,The Beaches,43.676357,-79.293031,Glen Manor Ravine,Trail,43.676821,-79.293942
3,M4E,East Toronto,The Beaches,43.676357,-79.293031,Upper Beaches,Neighborhood,43.680563,-79.292869
4,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,Greek Restaurant,43.677621,-79.351434


In [29]:
df_venues.shape

(1743, 9)

Get number of unique venue categories:

In [30]:
df_venues.Venue_Category.unique().shape

(238,)

## Analyse the venues

Get number of venues per neighbourhood:

In [31]:
df_venues.groupby('Neighbourhood').count()

Neighbourhood,Postcode,Borough,Neighbourhood_Latitude,Neighbourhood_Longitude,Venue_Name,Venue_Category,Venue_Latitude,Venue_Longitude
,42,42,42,42,42,42,42,42
"Adelaide, King, Richmond",100,100,100,100,100,100,100,100
Berczy Park,55,55,55,55,55,55,55,55
"Brockton, Exhibition Place, Parkdale Village",20,20,20,20,20,20,20,20
Business Reply Mail Processing Centre 969 Eastern,16,16,16,16,16,16,16,16
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",14,14,14,14,14,14,14,14
"Cabbagetown, St. James Town",49,49,49,49,49,49,49,49
Central Bay Street,81,81,81,81,81,81,81,81
"Chinatown, Grange Park, Kensington Market",100,100,100,100,100,100,100,100
Christie,16,16,16,16,16,16,16,16


Several neighbourhoods return very few (<5) venues, which will probably lead to poor clustering. One could either increase the search radius on Foursquare or drop these neighbourhoods from the table. For the sake of this exercise I will leave them in to see what happens.
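
For reference, dropping sparse neighbourhoods could be done with a grouped count, as in the sketch below. This step is not applied in this notebook. The toy frame and the threshold `min_venues` are assumptions for the example.

```python
import pandas as pd

# Toy venue table: M4E has 2 venues, M5H has 6
toy_venues = pd.DataFrame({
    'Postcode': ['M4E'] * 2 + ['M5H'] * 6,
    'Venue_Name': ['v{}'.format(i) for i in range(8)],
})

min_venues = 5  # threshold chosen for illustration
# Number of venues in each row's postcode, aligned with the original rows
counts = toy_venues.groupby('Postcode')['Venue_Name'].transform('count')
toy_filtered = toy_venues[counts >= min_venues]
```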

### Get category frequencies for each neighbourhood

In [32]:
df_category_counts = pd.get_dummies(df_venues.Venue_Category)
df_category_counts['Neighbourhood'] = df_venues['Neighbourhood']
df_category_counts['Postcode'] = df_venues['Postcode']
df_category_counts = df_category_counts.groupby(['Postcode','Neighbourhood']).mean().reset_index()

In [33]:
df_category_counts.head()

Unnamed: 0,Postcode,Neighbourhood,Adult Boutique,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M4E,The Beaches,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M4K,"The Danforth West, Riverdale",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381
2,M4L,"The Beaches West, India Bazaar",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M4M,Studio District,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641
4,M4N,Lawrence Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Get top five frequencies of each neighbourhood

In [34]:
stat_list = []
for num, row in df_category_counts.iterrows():
    code = row['Postcode']
    name = row['Neighbourhood']
    stat_row = [code, name]
    values = row[2:].astype(float).sort_values(ascending=False)
    values = values/values.sum()
    
    for ind in range(0,5):
        stat_row.append(values.index[ind])
        stat_row.append(values.iloc[ind])
    stat_list.append(stat_row)
    
df_stats = pd.DataFrame(stat_list, columns=['Postcode', 'Neighborhood', '1st name', '1st frequ', '2st name', '2st frequ', '3st name', '3st frequ', '4st name', '4st frequ', '5st name', '5st frequ'])
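
As an aside, pandas offers `Series.nlargest` for picking the top entries without sorting the whole row. The sketch below shows it on a made-up frequency series. The category values are invented for the example.

```python
import pandas as pd

# Hypothetical category frequencies for one neighbourhood (sum to 1)
freqs = pd.Series({'Coffee Shop': 0.25, 'Pub': 0.10, 'Trail': 0.40,
                   'Park': 0.05, 'Bakery': 0.15, 'Gym': 0.05})

# The three most frequent categories, in descending order
top_three = freqs.nlargest(3)
print(top_three)
```

When several categories share the same frequency, as in some rows of the table below, `nlargest` keeps them in their original order by default.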

In [35]:
df_stats

Unnamed: 0,Postcode,Neighborhood,1st name,1st frequ,2st name,2st frequ,3st name,3st frequ,4st name,4st frequ,5st name,5st frequ
0,M4E,The Beaches,Trail,0.25,Coffee Shop,0.25,Pub,0.25,Neighborhood,0.25,Falafel Restaurant,0.0
1,M4K,"The Danforth West, Riverdale",Greek Restaurant,0.238095,Coffee Shop,0.095238,Ice Cream Shop,0.071429,Bookstore,0.047619,Italian Restaurant,0.047619
2,M4L,"The Beaches West, India Bazaar",Park,0.142857,Gym,0.047619,Burger Joint,0.047619,Steakhouse,0.047619,Fish & Chips Shop,0.047619
3,M4M,Studio District,Café,0.102564,Coffee Shop,0.076923,Bakery,0.051282,Italian Restaurant,0.051282,American Restaurant,0.051282
4,M4N,Lawrence Park,Gym / Fitness Center,0.25,Park,0.25,Swim School,0.25,Bus Line,0.25,Yoga Studio,0.0
5,M4P,Davisville North,Food & Drink Shop,0.111111,Gym,0.111111,Grocery Store,0.111111,Sandwich Place,0.111111,Park,0.111111
6,M4R,North Toronto West,Sporting Goods Shop,0.1,Coffee Shop,0.1,Clothing Store,0.1,Yoga Studio,0.05,Dessert Shop,0.05
7,M4S,Davisville,Sandwich Place,0.083333,Dessert Shop,0.083333,Sushi Restaurant,0.055556,Seafood Restaurant,0.055556,Italian Restaurant,0.055556
8,M4T,"Moore Park, Summerhill East",Gym,0.25,Playground,0.25,Tennis Court,0.25,Park,0.25,Yoga Studio,0.0
9,M4V,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,0.142857,Pub,0.142857,American Restaurant,0.071429,Convenience Store,0.071429,Pizza Place,0.071429


## Cluster the neighbourhoods

In [36]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [37]:
clusters = 5
kmean = KMeans(n_clusters=clusters)
kmean.fit(df_category_counts.iloc[:, 2:])

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
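
Because `random_state=None`, the cluster labels can differ from run to run. Fixing the seed makes the assignment reproducible. The sketch below demonstrates this on a random toy matrix. The data, the seed 42 and the choice of three clusters are assumptions for the example and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Toy frequency matrix: 12 "neighbourhoods" x 4 "categories", rows sum to 1
X = rng.rand(12, 4)
X = X / X.sum(axis=1, keepdims=True)

# Two independent fits with the same seed give identical labels
labels_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X).labels_
labels_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X).labels_
print((labels_a == labels_b).all())
```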

In [38]:
df_toronto.insert(loc=2, column='Cluster_Label', value=kmean.labels_)
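
The insert above assigns labels by position, which works only if `df_toronto` and `df_category_counts` have the same rows in the same order. A key-based alternative is sketched below on toy frames. `features`, `areas` and `fake_labels` are illustrative names, with `fake_labels` standing in for `kmean.labels_`.

```python
import pandas as pd

# Feature table in the order the model saw it
features = pd.DataFrame({'Postcode': ['M4E', 'M4K', 'M4L'],
                         'Cafe': [0.5, 0.1, 0.0]})
fake_labels = [0, 1, 0]  # stand-in for kmean.labels_
labelled = features[['Postcode']].assign(Cluster_Label=fake_labels)

# Area table in a different order, with one postcode that had no venues
areas = pd.DataFrame({'Postcode': ['M4L', 'M4E', 'M4K', 'M4N'],
                      'Borough': ['East Toronto'] * 3 + ['Central Toronto']})

# Merge on the key so each label follows its postcode
areas_labelled = areas.merge(labelled, on='Postcode', how='left')
print(areas_labelled)
```

A postcode without any venues ends up with a missing label instead of shifting all labels by one row.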

In [46]:
df_toronto.head()

Unnamed: 0,Postcode,Borough,Cluster_Label,Neighbourhood,Latitude,Longitude
37,M4E,East Toronto,0,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,0,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,0,"The Beaches West, India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,0,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,4,Lawrence Park,43.72802,-79.38879


## Visualise clusters on the map

In [40]:
tor_lat = 43.66135
tor_long = -79.383087
map_toronto = folium.Map(location=[tor_lat, tor_long], zoom_start=11)
color_dict = {0 : 'blue',
              1 : 'red',
              2 : 'green',
              3 : 'yellow',
              4 : 'cyan'}
    
# Draw each neighbourhood in the colour of its cluster:
for lat, lng, neigh_label, cluster_label in df_toronto[['Latitude','Longitude','Neighbourhood', 'Cluster_Label']].values:
    # Build the popup from the current neighbourhood name
    label = folium.Popup(html=neigh_label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=color_dict[cluster_label],
        fill=True,
        fill_color=color_dict[cluster_label],
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

## Get statistics for the venues in the different clusters

### Cluster 0

In [41]:
df_stats[df_stats.Postcode.isin(df_toronto[df_toronto.Cluster_Label==0].Postcode)]

Unnamed: 0,Postcode,Neighborhood,1st name,1st frequ,2st name,2st frequ,3st name,3st frequ,4st name,4st frequ,5st name,5st frequ
0,M4E,The Beaches,Trail,0.25,Coffee Shop,0.25,Pub,0.25,Neighborhood,0.25,Falafel Restaurant,0.0
1,M4K,"The Danforth West, Riverdale",Greek Restaurant,0.238095,Coffee Shop,0.095238,Ice Cream Shop,0.071429,Bookstore,0.047619,Italian Restaurant,0.047619
2,M4L,"The Beaches West, India Bazaar",Park,0.142857,Gym,0.047619,Burger Joint,0.047619,Steakhouse,0.047619,Fish & Chips Shop,0.047619
3,M4M,Studio District,Café,0.102564,Coffee Shop,0.076923,Bakery,0.051282,Italian Restaurant,0.051282,American Restaurant,0.051282
5,M4P,Davisville North,Food & Drink Shop,0.111111,Gym,0.111111,Grocery Store,0.111111,Sandwich Place,0.111111,Park,0.111111
6,M4R,North Toronto West,Sporting Goods Shop,0.1,Coffee Shop,0.1,Clothing Store,0.1,Yoga Studio,0.05,Dessert Shop,0.05
7,M4S,Davisville,Sandwich Place,0.083333,Dessert Shop,0.083333,Sushi Restaurant,0.055556,Seafood Restaurant,0.055556,Italian Restaurant,0.055556
9,M4V,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Coffee Shop,0.142857,Pub,0.142857,American Restaurant,0.071429,Convenience Store,0.071429,Pizza Place,0.071429
11,M4X,"Cabbagetown, St. James Town",Coffee Shop,0.102041,Café,0.061224,Restaurant,0.061224,Pizza Place,0.040816,Bakery,0.040816
12,M4Y,Church and Wellesley,Japanese Restaurant,0.068966,Sushi Restaurant,0.057471,Coffee Shop,0.057471,Gay Bar,0.045977,Restaurant,0.034483


### Cluster 1

In [42]:
df_stats[df_stats.Postcode.isin(df_toronto[df_toronto.Cluster_Label==1].Postcode)]

Unnamed: 0,Postcode,Neighborhood,1st name,1st frequ,2st name,2st frequ,3st name,3st frequ,4st name,4st frequ,5st name,5st frequ
10,M4W,Rosedale,Park,0.5,Trail,0.25,Playground,0.25,Eastern European Restaurant,0.0,Discount Store,0.0
23,M5P,"Forest Hill North, Forest Hill West",Trail,0.25,Jewelry Store,0.25,Park,0.25,Sushi Restaurant,0.25,Electronics Store,0.0


### Cluster 2

In [43]:
df_stats[df_stats.Postcode.isin(df_toronto[df_toronto.Cluster_Label==2].Postcode)]

Unnamed: 0,Postcode,Neighborhood,1st name,1st frequ,2st name,2st frequ,3st name,3st frequ,4st name,4st frequ,5st name,5st frequ
8,M4T,"Moore Park, Summerhill East",Gym,0.25,Playground,0.25,Tennis Court,0.25,Park,0.25,Yoga Studio,0.0


### Cluster 3

In [44]:
df_stats[df_stats.Postcode.isin(df_toronto[df_toronto.Cluster_Label==3].Postcode)]

Unnamed: 0,Postcode,Neighborhood,1st name,1st frequ,2st name,2st frequ,3st name,3st frequ,4st name,4st frequ,5st name,5st frequ
22,M5N,Roselawn,Garden,1.0,Yoga Studio,0.0,Fish Market,0.0,Filipino Restaurant,0.0,Fast Food Restaurant,0.0


### Cluster 4

In [45]:
df_stats[df_stats.Postcode.isin(df_toronto[df_toronto.Cluster_Label==4].Postcode)]

Unnamed: 0,Postcode,Neighborhood,1st name,1st frequ,2st name,2st frequ,3st name,3st frequ,4st name,4st frequ,5st name,5st frequ
4,M4N,Lawrence Park,Gym / Fitness Center,0.25,Park,0.25,Swim School,0.25,Bus Line,0.25,Yoga Studio,0.0
