# Segmenting and Clustering Neighborhoods in Toronto

This notebook performs a clustering analysis to determine similarities in neighborhoods of Toronto, as part of the requirements of the IBM Applied Data Science Capstone Project.

## Part 1: Scraping Toronto Postal Code Data

We begin by importing all necessary libraries.

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

We scrape the raw data from the Wikipedia page on Toronto postal codes using BeautifulSoup.  The data is stored in the dataframe df.

In [3]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
results = requests.get(url)

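from io import StringIO

soup = BeautifulSoup(results.content, 'lxml')
table = soup.find_all('table')[0]
# Wrap the HTML in StringIO; recent pandas releases deprecate passing a literal HTML string.
df = pd.read_html(StringIO(str(table)))[0]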
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We drop all entries that do not correspond to a specific borough.  If the borough is assigned but the neighborhood is not, we set the neighborhood to be the same as the borough.

In [4]:
indexNames = df[df['Borough'] == 'Not assigned'].index
df_clean = df.drop(indexNames).reset_index(drop=True)
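# Assign through a boolean mask with .loc; chained indexing would write to a temporary copy and leave df_clean unchanged.
unassigned = df_clean['Neighbourhood'] == 'Not assigned'
df_clean.loc[unassigned, 'Neighbourhood'] = df_clean.loc[unassigned, 'Borough']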
df_clean.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
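
The fallback rule described above (use the borough name when no neighbourhood is assigned) can be checked on a small frame. This is a minimal sketch on made-up toy rows, not the scraped table; the names `toy`, `toy_clean` and `missing` are illustrative only.

```python
import pandas as pd

# Toy rows invented for illustration: one has a borough but no neighbourhood,
# and one has no borough at all.
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M7A', 'M8A'],
    'Borough': ['Downtown Toronto', "Queen's Park", 'Not assigned'],
    'Neighbourhood': ['Harbourfront', 'Not assigned', 'Not assigned'],
})

# Keep only rows with an assigned borough.
toy_clean = toy[toy['Borough'] != 'Not assigned'].reset_index(drop=True)

# Fill unassigned neighbourhoods from the borough column via a boolean mask.
missing = toy_clean['Neighbourhood'] == 'Not assigned'
toy_clean.loc[missing, 'Neighbourhood'] = toy_clean.loc[missing, 'Borough']

print(toy_clean)
```

Writing through `.loc` with a mask modifies the frame in place, whereas an assignment of the form `frame[mask][column] = ...` typically operates on a temporary copy.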


We combine all neighborhoods in a given postal code into a single row of a new dataframe, df_post.

In [16]:
grouped = df_clean.groupby(['Postcode','Borough'])
df_post = grouped.Neighbourhood.agg([('Neighbourhood', ', '.join)]).reset_index()
df_post.rename(columns={'Postcode':'Postal Code', 'Neighbourhood':'Neighborhood'},inplace=True)
df_post.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
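
To see what the grouping step does to a postal code that spans several neighbourhoods, here is a small sketch using three rows taken from the cleaned output shown earlier. The variable names `toy` and `combined` are illustrative.

```python
import pandas as pd

# Three rows from the cleaned table: M6A appears twice, M3A once.
toy = pd.DataFrame({
    'Postcode': ['M6A', 'M6A', 'M3A'],
    'Borough': ['North York', 'North York', 'North York'],
    'Neighbourhood': ['Lawrence Heights', 'Lawrence Manor', 'Parkwoods'],
})

# Join all neighbourhood names that share a postal code and borough.
combined = (
    toy.groupby(['Postcode', 'Borough'])['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)

print(combined)
```

The two M6A rows collapse into a single row whose neighbourhood field lists both names, while M3A is left as it was.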


In [6]:
df_post.shape

(103, 3)

## Part 2: Finding Latitude and Longitude

Since the geocoder package proved extremely unreliable, we obtain the latitude and longitude of each postal code directly from the provided csv file.

In [20]:
coords = pd.read_csv('Geospatial_Coordinates.csv')
coords.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


We merge these coordinates into our previous data to obtain the final dataframe, df_fin.

In [23]:
df_fin = df_post.merge(coords, on = 'Postal Code')
df_fin

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
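
Because `merge` performs an inner join by default, a postal code without coordinates would be dropped silently. The sketch below shows one way to surface such rows, assuming a left join with pandas' `indicator` and `validate` options; the code `M1X` is a made-up value used only to trigger the mismatch.

```python
import pandas as pd

# 'M1X' is a hypothetical postal code with no matching coordinates.
posts = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1X'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left join keeps every postal code; validate raises if a key is duplicated
# on either side, and indicator adds a '_merge' column describing each row.
checked = posts.merge(coords, on='Postal Code', how='left',
                      validate='one_to_one', indicator=True)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postal code received coordinates.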


## Part 3: Clustering Analysis

Import the necessary packages.

In [124]:
from geopy.geocoders import Nominatim
import folium
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

Set relevant parameters for calls to Foursquare API.

In [91]:
CLIENT_ID = '1IEN4PLSOERI14U34SSJWBW3Y50KNRC5UW0VK1EQET3BCSBO'
CLIENT_SECRET = 'T1NE5VSSWQGFXOYT5MRG1GIJFUNYGTCRCSJHCWD4DCTRAYRK'
VERSION = '20180605'
LIMIT = 100
RADIUS = 1500

Get the top venues in each postal code, within 1500 meters of its center.  (Increased radius from 500 to 1500 meters to get better data, since Toronto is less dense than New York City.)

In [159]:
venues_list=[]
for name, lat, lng in zip(df_fin['Postal Code'], df_fin['Latitude'], df_fin['Longitude']):
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, RADIUS, LIMIT)    
    results = requests.get(url).json()["response"]['groups'][0]['items']
    venues_list.append([(name, lat, lng, v['venue']['name'], v['venue']['location']['lat'], 
        v['venue']['location']['lng'], v['venue']['categories'][0]['name']) for v in results])

toronto_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
toronto_venues.columns = ['Postal Code', 'Postal Code Latitude', 'Postal Code Longitude', 'Venue', 
                  'Venue Latitude', 'Venue Longitude', 'Venue Category']
toronto_venues.head()

Unnamed: 0,Postal Code,Postal Code Latitude,Postal Code Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Canadiana exhibit,43.817962,-79.193374,Zoo Exhibit
1,M1B,43.806686,-79.194353,Wendy's,43.802008,-79.19808,Fast Food Restaurant
2,M1B,43.806686,-79.194353,LCBO,43.796671,-79.204586,Liquor Store
3,M1B,43.806686,-79.194353,Caribbean Wave,43.798558,-79.195777,Caribbean Restaurant
4,M1B,43.806686,-79.194353,Harvey's,43.80002,-79.198307,Restaurant
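
The flattening of each API response into rows can be isolated and tried without network access. The following is a sketch under assumptions: `flatten_items` is a hypothetical helper, and `sample_items` is a hand-written payload that only mirrors the fields the loop above indexes (`venue`, `name`, `location`, `categories`). It also guards against a venue with an empty category list, which would otherwise raise an `IndexError`.

```python
def flatten_items(code, lat, lng, items):
    """Turn one response's list of items into flat tuples, one per venue."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None when a venue carries no category instead of failing.
        category = categories[0]['name'] if categories else None
        rows.append((code, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'],
                     category))
    return rows


# Hypothetical payload; the second venue is invented to exercise the guard.
sample_items = [
    {'venue': {'name': "Wendy's",
               'location': {'lat': 43.802008, 'lng': -79.19808},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Uncategorized Spot',
               'location': {'lat': 43.8, 'lng': -79.2},
               'categories': []}},
]

rows = flatten_items('M1B', 43.806686, -79.194353, sample_items)
for row in rows:
    print(row)
```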


Convert the venue categories to one-hot encoding, then find the relative frequency of each type of venue.

In [160]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix='',prefix_sep='')
toronto_onehot['Postal Code'] = toronto_venues['Postal Code']
toronto_grouped = toronto_onehot.groupby('Postal Code').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Postal Code,Accessories Store,Afghan Restaurant,Airport,Airport Lounge,American Restaurant,Amphitheater,Animal Shelter,Antique Shop,Aquarium,...,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,M1B,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.285714
1,M1C,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M1E,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,M1G,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,M1H,0.0,0.0,0.0,0.0,0.015385,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.030769,0.015385,0.0,0.015385,0.0,0.0
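
Taking the mean of one-hot columns within a group yields the share of venues in each category, so every row of the grouped table should sum to 1. A minimal sketch on invented venue rows (the counts below are not real data):

```python
import pandas as pd

# Invented venue list: four venues in M1B, two in M1C.
toy_venues = pd.DataFrame({
    'Postal Code': ['M1B', 'M1B', 'M1B', 'M1B', 'M1C', 'M1C'],
    'Venue Category': ['Zoo Exhibit', 'Zoo Exhibit', 'Pizza Place', 'Park',
                       'Park', 'Gym'],
})

# One indicator column per category, then the per-postal-code mean.
onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
onehot.insert(0, 'Postal Code', toy_venues['Postal Code'])
freq = onehot.groupby('Postal Code').mean()

print(freq)
print(freq.sum(axis=1))
```

In this toy case two of the four M1B venues are zoo exhibits, so that entry is 0.5.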


Define a function to return the most common venues in a postal code, then apply it to the Toronto data.

In [78]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [162]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head(5)

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Zoo Exhibit,Fast Food Restaurant,Pizza Place,Restaurant,Liquor Store,Gas Station,Big Box Store,Fruit & Vegetable Store,Movie Theater,Financial or Legal Service
1,M1C,Park,Grocery Store,Hotel,Breakfast Spot,Burger Joint,Gym,Gym / Fitness Center,Neighborhood,Italian Restaurant,Farm
2,M1E,Pizza Place,Breakfast Spot,Juice Bar,Coffee Shop,Fast Food Restaurant,Fried Chicken Joint,Sandwich Place,Burger Joint,Bank,Discount Store
3,M1G,Coffee Shop,Indian Restaurant,Fast Food Restaurant,Sandwich Place,Pharmacy,Chinese Restaurant,Beer Store,Bar,Bank,Thrift / Vintage Store
4,M1H,Coffee Shop,Clothing Store,Indian Restaurant,Restaurant,Gas Station,Sandwich Place,Bakery,Intersection,Toy / Game Store,Sporting Goods Shop
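
The column labels above rely on a three-element suffix list, which is sufficient for ten venues but would mislabel ranks such as 21 or 22 if `num_top_venues` were raised. A possible alternative, sketched here with a hypothetical `ordinal` helper that is not part of the notebook:

```python
def ordinal(n):
    """Return n followed by its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers ending in 11, 12 or 13 take 'th' despite their last digit.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


num_top_venues = 10
columns = ['Postal Code'] + [
    f'{ordinal(i)} Most Common Venue' for i in range(1, num_top_venues + 1)
]
print(columns)
```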


Apply k-means clustering to the venue frequencies, assuming 5 clusters, then add the resulting cluster labels to the data.

In [163]:
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Postal Code')
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df_fin
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')
toronto_merged.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,1,Zoo Exhibit,Fast Food Restaurant,Pizza Place,Restaurant,Liquor Store,Gas Station,Big Box Store,Fruit & Vegetable Store,Movie Theater,Financial or Legal Service
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,0,Park,Grocery Store,Hotel,Breakfast Spot,Burger Joint,Gym,Gym / Fitness Center,Neighborhood,Italian Restaurant,Farm
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,1,Pizza Place,Breakfast Spot,Juice Bar,Coffee Shop,Fast Food Restaurant,Fried Chicken Joint,Sandwich Place,Burger Joint,Bank,Discount Store
3,M1G,Scarborough,Woburn,43.770992,-79.216917,1,Coffee Shop,Indian Restaurant,Fast Food Restaurant,Sandwich Place,Pharmacy,Chinese Restaurant,Beer Store,Bar,Bank,Thrift / Vintage Store
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,1,Coffee Shop,Clothing Store,Indian Restaurant,Restaurant,Gas Station,Sandwich Place,Bakery,Intersection,Toy / Game Store,Sporting Goods Shop
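
The number of clusters is fixed at 5 here without a stated selection procedure. One common way to sanity-check such a choice, which the notebook itself does not use, is to compare silhouette scores across several values of k. The sketch below runs on synthetic points (three well-separated groups generated for illustration), not on the Toronto data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in centers])

# Fit k-means for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On real venue-frequency data the scores are usually much less clear-cut, so the result is best treated as one input alongside inspection of the clusters.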


Construct a map to visualize the data, with each cluster shown in a different color.  The two largest clusters correspond roughly to downtown neighborhoods and to suburban areas.  There is also a second, smaller suburban cluster, plus two outlier neighborhoods that each form a cluster of their own.

In [158]:
geolocator = Nominatim(user_agent = 'toronto_explorer')
location = geolocator.geocode('Toronto')
latitude = location.latitude
longitude = location.longitude

map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# One distinct color per cluster, evenly spaced along the rainbow colormap.
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
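
The color assignment used for the markers can be tried on its own, without building a map. This is a small sketch; `cluster_palette` and `color_of` are hypothetical names, and the palette is indexed by the cluster label itself.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors


def cluster_palette(k):
    """Return k evenly spaced hex colors from the rainbow colormap."""
    return [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]


kclusters = 5
palette = cluster_palette(kclusters)

# Map each cluster label to its color so label 0 takes the first entry.
color_of = {label: palette[label] for label in range(kclusters)}
print(color_of)
```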

Cluster 0 consists of a single outlier neighborhood whose top venues are an unrelated mix.

In [153]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Scarborough,0,Park,Grocery Store,Hotel,Breakfast Spot,Burger Joint,Gym,Gym / Fitness Center,Neighborhood,Italian Restaurant,Farm


Cluster 1 contains many restaurants, with a wide variety of ethnic cuisines.

In [148]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Scarborough,1,Zoo Exhibit,Fast Food Restaurant,Pizza Place,Restaurant,Liquor Store,Gas Station,Big Box Store,Fruit & Vegetable Store,Movie Theater,Financial or Legal Service
2,Scarborough,1,Pizza Place,Breakfast Spot,Juice Bar,Coffee Shop,Fast Food Restaurant,Fried Chicken Joint,Sandwich Place,Burger Joint,Bank,Discount Store
3,Scarborough,1,Coffee Shop,Indian Restaurant,Fast Food Restaurant,Sandwich Place,Pharmacy,Chinese Restaurant,Beer Store,Bar,Bank,Thrift / Vintage Store
4,Scarborough,1,Coffee Shop,Clothing Store,Indian Restaurant,Restaurant,Gas Station,Sandwich Place,Bakery,Intersection,Toy / Game Store,Sporting Goods Shop
5,Scarborough,1,Sandwich Place,Pharmacy,Ice Cream Shop,Grocery Store,Pizza Place,Coffee Shop,Breakfast Spot,Optical Shop,Japanese Restaurant,Beer Store
6,Scarborough,1,Coffee Shop,Fast Food Restaurant,Pharmacy,Chinese Restaurant,Sandwich Place,Pizza Place,Grocery Store,Soccer Field,Light Rail Station,Liquor Store
7,Scarborough,1,Coffee Shop,Pizza Place,Park,Grocery Store,Burger Joint,Sandwich Place,Discount Store,Diner,Department Store,Beer Store
10,Scarborough,1,Coffee Shop,Pizza Place,Fast Food Restaurant,Pharmacy,Restaurant,Grocery Store,Chinese Restaurant,Indian Restaurant,Pet Store,Park
11,Scarborough,1,Middle Eastern Restaurant,Pizza Place,Coffee Shop,Grocery Store,Breakfast Spot,Restaurant,Asian Restaurant,Chinese Restaurant,Seafood Restaurant,Medical Center
12,Scarborough,1,Chinese Restaurant,Cantonese Restaurant,Park,Bakery,Pizza Place,Hong Kong Restaurant,Caribbean Restaurant,Coffee Shop,Gym / Fitness Center,Shopping Mall


Cluster 2 is another single-neighborhood outlier, again with a mix of unrelated venues.

In [155]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,Scarborough,2,Donut Shop,Farm,National Park,Zoo Exhibit,Field,Electronics Store,Ethiopian Restaurant,Event Space,Falafel Restaurant,Farmers Market


Cluster 3 contains many recreational activities, such as parks, gyms, ice cream stores, and entertainment venues.

In [156]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Scarborough,3,Park,Harbor / Marina,Ice Cream Shop,Sandwich Place,Coffee Shop,Pizza Place,Pharmacy,Beach,Fast Food Restaurant,Grocery Store
9,Scarborough,3,Park,Ice Cream Shop,Bus Stop,Café,Skating Rink,Gym Pool,Business Service,Restaurant,Gym,General Entertainment
19,North York,3,Japanese Restaurant,Gas Station,Trail,Park,Bank,Skating Rink,Café,Grocery Store,Restaurant,Chinese Restaurant
28,North York,3,Park,Coffee Shop,Gas Station,Pizza Place,Ice Cream Shop,Middle Eastern Restaurant,Frozen Yogurt Shop,Fried Chicken Joint,Sushi Restaurant,French Restaurant
31,North York,3,Park,Moving Target,Plaza,Pizza Place,Coffee Shop,Bank,Tea Room,Zoo Exhibit,Farmers Market,Egyptian Restaurant
68,Downtown Toronto,3,Park,Coffee Shop,Café,Gym,Harbor / Marina,Bar,Restaurant,Scenic Lookout,Brewery,Hotel
88,Etobicoke,3,Park,Café,Grocery Store,Bakery,Breakfast Spot,Indian Restaurant,Restaurant,Bank,Bar,General Entertainment
93,Etobicoke,3,Pharmacy,Ice Cream Shop,Park,Bus Line,Supermarket,Bus Stop,Café,Shopping Mall,Liquor Store,Golf Course
94,Etobicoke,3,Hotel,Park,Restaurant,Fish & Chips Shop,Farmers Market,Café,Electronics Store,Theater,Bank,Grocery Store
96,North York,3,Park,Skating Rink,Electronics Store,Asian Restaurant,Mexican Restaurant,Vietnamese Restaurant,Sports Bar,Shopping Mall,Latin American Restaurant,Bank


Cluster 4 is dominated by coffee shops and cafés, plus a mix of different ethnic restaurants.

In [157]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,North York,4,Athletics & Sports,Park,Coffee Shop,Bus Station,Turkish Restaurant,Gym / Fitness Center,Metro Station,Racetrack,Italian Restaurant,Beer Store
40,East York,4,Greek Restaurant,Café,Coffee Shop,Ethiopian Restaurant,Bakery,Thai Restaurant,Ice Cream Shop,Park,Pizza Place,Cosmetics Shop
41,East Toronto,4,Café,Greek Restaurant,Park,Coffee Shop,Pizza Place,Pub,Bakery,Vietnamese Restaurant,Grocery Store,Italian Restaurant
42,East Toronto,4,Coffee Shop,Park,Indian Restaurant,Brewery,Café,Bakery,Pub,Grocery Store,Italian Restaurant,Beach
43,East Toronto,4,Coffee Shop,Café,Bakery,Brewery,Park,Vietnamese Restaurant,Pizza Place,Bar,Thai Restaurant,Restaurant
44,Central Toronto,4,Coffee Shop,Sushi Restaurant,Italian Restaurant,Bakery,Pub,Hobby Shop,Park,Dog Run,Fast Food Restaurant,Café
45,Central Toronto,4,Italian Restaurant,Coffee Shop,Sushi Restaurant,Park,Restaurant,Pizza Place,Bakery,Indian Restaurant,Gym,Café
46,Central Toronto,4,Coffee Shop,Italian Restaurant,Fast Food Restaurant,Café,Japanese Restaurant,Bakery,Sushi Restaurant,Ice Cream Shop,Restaurant,Sporting Goods Shop
47,Central Toronto,4,Italian Restaurant,Coffee Shop,Bakery,Café,Gym,Indian Restaurant,Sushi Restaurant,Park,Restaurant,Japanese Restaurant
48,Central Toronto,4,Park,Italian Restaurant,Coffee Shop,Sushi Restaurant,Grocery Store,Restaurant,Spa,Bakery,Gas Station,Trail
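
The same selection expression is repeated for every cluster above. It could be wrapped in a helper, sketched here with the hypothetical name `cluster_view` and demonstrated on three rows copied from the merged table (with only the first venue column kept for brevity). Counting the labels also gives the cluster sizes directly.

```python
import pandas as pd


def cluster_view(df, label, label_col='Cluster Labels', keep=('Borough',)):
    """Rows of one cluster: the kept columns, the label, and all columns after it."""
    start = df.columns.get_loc(label_col)
    cols = list(keep) + list(df.columns[start:])
    return df.loc[df[label_col] == label, cols]


toy = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M1E'],
    'Borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern',
                     'Highland Creek, Rouge Hill, Port Union',
                     'Guildwood, Morningside, West Hill'],
    'Latitude': [43.806686, 43.784535, 43.763573],
    'Longitude': [-79.194353, -79.160497, -79.188711],
    'Cluster Labels': [1, 0, 1],
    '1st Most Common Venue': ['Zoo Exhibit', 'Park', 'Pizza Place'],
})

view = cluster_view(toy, 1)
sizes = toy['Cluster Labels'].value_counts().sort_index()
print(view)
print(sizes)
```

Locating the label column by name rather than by position keeps the helper working if columns are added before it.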
