# Clustering Toronto Neighborhoods

The purpose of this exercise is to scrape the Wikipedia page listing 
<a href="https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M">Canadian postal codes beginning with M</a> and use that data to cluster Toronto neighborhoods. Canadian postal codes beginning with 'M' belong to neighborhoods in Toronto.

## Scraping the Wikipedia Page

In [106]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In the next cell, we use Beautiful Soup to scrape the page and extract the sortable table of postal codes.

In [107]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('table', class_='sortable')

Here, we use pandas' read_html function to parse the table that Beautiful Soup extracted. It returns a list of data frames, one per table found, and we keep the first.

In [108]:
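from io import StringIO

# read_html returns a list of data frames; recent pandas versions deprecate
# passing literal HTML strings, so wrap the markup in StringIO
df = pd.read_html(StringIO(str(results)))[0]
df.head(10)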

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


In [109]:
df.shape

(180, 3)

## Cleaning the Data

The following loop creates a list of row indices where neither a borough nor a neighbourhood is assigned to a postal code. Using the list created in that loop, we use the drop method to get rid of those rows, and leave only postal codes that are in use in our data frame.

In [110]:
null_list = []
for i in range(len(df.index)):
    if df.loc[i, 'Borough'] == 'Not assigned' and df.loc[i,'Neighbourhood'] == 'Not assigned':
        null_list.append(i)
print(null_list)

[0, 1, 7, 10, 15, 16, 19, 24, 25, 28, 29, 33, 34, 35, 37, 38, 42, 43, 44, 51, 52, 53, 60, 61, 62, 69, 70, 71, 78, 79, 87, 88, 96, 97, 101, 105, 106, 110, 115, 118, 119, 123, 124, 125, 127, 128, 131, 132, 133, 134, 136, 137, 140, 141, 145, 146, 149, 150, 154, 155, 158, 159, 161, 162, 163, 164, 166, 167, 170, 171, 172, 173, 174, 175, 176, 177, 179]


In [111]:
df.drop(index= null_list, inplace=True)
df.head(20)

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"


In [112]:
df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
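
The loop, drop, and reset steps above can also be sketched as a single boolean-mask filter. The small `sample` frame below is a hypothetical stand-in for the scraped table, not the real data, and the names `unassigned` and `cleaned` are introduced only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the scraped table (not the real data)
sample = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M4A', 'M8A'],
    'Borough': ['Not assigned', 'North York', 'North York', 'Not assigned'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village', 'Not assigned'],
})

# True for rows where both the borough and the neighbourhood are unassigned
unassigned = (sample['Borough'] == 'Not assigned') & (sample['Neighbourhood'] == 'Not assigned')

# Keep the remaining rows and renumber the index from 0
cleaned = sample[~unassigned].reset_index(drop=True)
print(cleaned)
```

This avoids building an explicit list of row indices, at the cost of not printing which rows were removed.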


Check for any entries where the borough is assigned but a neighborhood isn't:

In [113]:
null_neighbourhood_list = []
for i in range(len(df.index)):
    if df.loc[i, 'Borough'] != 'Not assigned' and df.loc[i,'Neighbourhood'] == 'Not assigned':
        null_neighbourhood_list.append(i)
print(null_neighbourhood_list)

[]


Here we create a dictionary to keep track of the rows where there are multiple neighbourhoods assigned to a postal code. The keys in the dictionary correspond to the indices of these rows, and the values are lists of the neighbourhoods that have the postal code in that index's row.

In [114]:
multihood_dict = {}
for i in range(len(df.index)):
    if ',' in df.loc[i, 'Neighbourhood']:
        neighbourhood_list = df.loc[i, 'Neighbourhood'].split(", ")
        multihood_dict[i] = neighbourhood_list
        
#multihood_dict

Next, we will create an empty data frame and fill it so that each postal code with multiple neighbourhoods appears multiple times, with one row for each neighbourhood it covers.

In [115]:
expanded_df = pd.DataFrame(columns=['Postal Code', 'Borough', 'Neighbourhood'])

In [116]:
# DataFrame.append was removed in pandas 2.0, so collect the rows in a list
# and build the data frame once at the end
rows = []
for i in range(len(df.index)):
    # Postal codes with a single neighbourhood contribute exactly one row
    neighbourhoods = multihood_dict.get(i, [df.loc[i, 'Neighbourhood']])
    for neighbourhood in neighbourhoods:
        rows.append({'Postal Code': df.loc[i, 'Postal Code'], 'Borough': df.loc[i, 'Borough'], 'Neighbourhood': neighbourhood})
expanded_df = pd.DataFrame(rows, columns=expanded_df.columns)

In [117]:
expanded_df.head(30)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Manor
5,M6A,North York,Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park
7,M7A,Downtown Toronto,Ontario Provincial Government
8,M9A,Etobicoke,Islington Avenue
9,M9A,Etobicoke,Humber Valley Village


In [118]:
expanded_df.shape

(217, 3)
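
As an alternative to the dictionary and loop, the same one-row-per-neighbourhood expansion can be sketched with pandas' `str.split` and `explode`. The `sample` frame is a hypothetical two-row stand-in for `df`, and the sketch assumes the neighbourhoods are always separated by a comma followed by a space, as in the table above.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data frame
sample = pd.DataFrame({
    'Postal Code': ['M3A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Regent Park, Harbourfront'],
})

expanded = (
    sample
    # Turn each comma-separated string into a list of neighbourhood names
    .assign(Neighbourhood=sample['Neighbourhood'].str.split(', '))
    # Emit one row per list element, repeating the other columns
    .explode('Neighbourhood')
    .reset_index(drop=True)
)
print(expanded)
```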

Here we install and import the geocoder package so that we can look up latitude/longitude data for these neighbourhoods from <a href='https://www.openstreetmap.org'>OpenStreetMap</a> and store it in a dictionary.

In [119]:
pip install geocoder

Note: you may need to restart the kernel to use updated packages.


In [120]:
import geocoder

coords_dict = {}

# Look up each distinct neighbourhood once; g.latlng is None when nothing is found
for neighbourhood in expanded_df['Neighbourhood'].unique():
    g = geocoder.osm('{}, Toronto, Canada'.format(neighbourhood))
    coords_dict[neighbourhood] = g.latlng
        
#coords_dict
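
Because each lookup is a network request, it can help to query every distinct name only once and to pause between requests (the public Nominatim service asks clients to stay at roughly one request per second). The sketch below shows one way to do that. `geocode_unique` and `fake_lookup` are hypothetical names, and `fake_lookup` stands in for a real call such as `geocoder.osm(...).latlng` so that the example runs offline.

```python
import time

def geocode_unique(names, lookup, pause_seconds=1.0):
    """Call lookup once per distinct name, pausing between requests."""
    coords = {}
    for name in dict.fromkeys(names):  # drops duplicates, keeps first-seen order
        coords[name] = lookup(name)
        time.sleep(pause_seconds)
    return coords

# Hypothetical stand-in for a real geocoder call; returns None for unknown places
def fake_lookup(name):
    known = {'Parkwoods': [43.7587999, -79.3201966]}
    return known.get(name)

demo = geocode_unique(['Parkwoods', 'Parkwoods', 'Nowhere'], fake_lookup, pause_seconds=0)
print(demo)
```

With the real service, `lookup` could be something like `lambda n: geocoder.osm('{}, Toronto, Canada'.format(n)).latlng`, keeping the default one-second pause.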

This initial dictionary contains some neighbourhoods that Open Street Map can't find, so we'll remove them.

In [121]:
clean_coords_dict = {}
for neighbourhood in coords_dict:
    if coords_dict[neighbourhood] is not None:
        clean_coords_dict[neighbourhood] = coords_dict[neighbourhood]
        
#clean_coords_dict

Next, we will turn this dictionary into a dataframe and merge it with our previous data frame.

In [122]:
latlng_df = pd.DataFrame(list(clean_coords_dict.items()), columns = ['Neighbourhood', 'Coordinates'])
latlng_df.head()

Unnamed: 0,Neighbourhood,Coordinates
0,Parkwoods,"[43.7587999, -79.3201966]"
1,Victoria Village,"[43.732658, -79.3111892]"
2,Regent Park,"[43.6607056, -79.3604569]"
3,Harbourfront,"[43.6400801, -79.3801495]"
4,Lawrence Manor,"[43.7220788, -79.4375067]"


In [123]:
final_df = pd.merge(expanded_df, latlng_df, on = 'Neighbourhood')
final_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Coordinates
0,M3A,North York,Parkwoods,"[43.7587999, -79.3201966]"
1,M4A,North York,Victoria Village,"[43.732658, -79.3111892]"
2,M5A,Downtown Toronto,Regent Park,"[43.6607056, -79.3604569]"
3,M5A,Downtown Toronto,Harbourfront,"[43.6400801, -79.3801495]"
4,M6A,North York,Lawrence Manor,"[43.7220788, -79.4375067]"


In [124]:
final_df.shape

(196, 4)
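
The row count drops from 217 to 196 most likely because pd.merge performs an inner join by default, so neighbourhoods without coordinates are left out. To see which rows would be lost, a left join with `indicator=True` can be used, as in this sketch. The `left` and `right` frames are hypothetical miniatures of `expanded_df` and `latlng_df`.

```python
import pandas as pd

# Hypothetical miniatures of expanded_df and latlng_df
left = pd.DataFrame({'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Nowhere']})
right = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Harbourfront'],
    'Coordinates': [[43.7587999, -79.3201966], [43.6400801, -79.3801495]],
})

# A left join keeps every row of `left`; the _merge column records where each row matched
checked = pd.merge(left, right, on='Neighbourhood', how='left', indicator=True)

# Rows marked 'left_only' have no coordinates and would vanish in an inner join
missing = checked.loc[checked['_merge'] == 'left_only', 'Neighbourhood'].tolist()
print(missing)
```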

Finally, we'll split the coordinates column into a latitude and longitude column, and drop the coordinates column.

In [125]:
latitude = []
longitude = []

for coords_set in final_df.Coordinates:
    latitude.append(coords_set[0])
    longitude.append(coords_set[1])

In [126]:
final_df['Latitude'] = latitude
final_df['Longitude'] = longitude

final_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Coordinates,Latitude,Longitude
0,M3A,North York,Parkwoods,"[43.7587999, -79.3201966]",43.7588,-79.320197
1,M4A,North York,Victoria Village,"[43.732658, -79.3111892]",43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,"[43.6607056, -79.3604569]",43.660706,-79.360457
3,M5A,Downtown Toronto,Harbourfront,"[43.6400801, -79.3801495]",43.64008,-79.38015
4,M6A,North York,Lawrence Manor,"[43.7220788, -79.4375067]",43.722079,-79.437507


In [127]:
final_df.drop(columns=['Coordinates'], inplace = True)
final_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.7588,-79.320197
1,M4A,North York,Victoria Village,43.732658,-79.311189
2,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457
3,M5A,Downtown Toronto,Harbourfront,43.64008,-79.38015
4,M6A,North York,Lawrence Manor,43.722079,-79.437507
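
The split into latitude and longitude can also be sketched without an explicit loop, by turning the list-valued column into its own two-column data frame. The `sample` frame is a hypothetical stand-in for `final_df`, and the sketch assumes every entry in Coordinates is a two-element list.

```python
import pandas as pd

# Hypothetical stand-in for final_df before the split
sample = pd.DataFrame({
    'Neighbourhood': ['Parkwoods', 'Harbourfront'],
    'Coordinates': [[43.7587999, -79.3201966], [43.6400801, -79.3801495]],
})

# Each [lat, lng] list becomes one row of a two-column frame aligned on the same index
latlng = pd.DataFrame(sample['Coordinates'].tolist(),
                      columns=['Latitude', 'Longitude'],
                      index=sample.index)

# Replace the combined column with the two new ones
result = sample.drop(columns=['Coordinates']).join(latlng)
print(result)
```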


## Clustering the Neighbourhoods

Now, we will use our final_df to plot the neighbourhoods by latitude and longitude on a map of Toronto. We will fetch the latitude and longitude of Toronto with geopy's Nominatim geocoder, then use Folium to create a map and plot the neighbourhoods.

In [128]:
import folium
from geopy.geocoders import Nominatim
import numpy as np

In [129]:
address = 'Toronto, ON, CA'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude_Toronto = location.latitude
longitude_Toronto = location.longitude

In [130]:
#Create the map
map_toronto = folium.Map(location=[latitude_Toronto, longitude_Toronto], zoom_start=11)

#Add the markers for each neighbourhood
for lat, lng, label in zip(final_df['Latitude'], final_df['Longitude'], final_df['Neighbourhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

Now, we will set up the KMeans algorithm and cluster the neighbourhoods. Judging by eye from the map, there appear to be roughly 7 clusters, so we use 7 as the number of clusters.

In [131]:
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [132]:
latlng_df = final_df[['Latitude', 'Longitude']].copy()
latlng_df.head()

Unnamed: 0,Latitude,Longitude
0,43.7588,-79.320197
1,43.732658,-79.311189
2,43.660706,-79.360457
3,43.64008,-79.38015
4,43.722079,-79.437507


In [141]:
k_means = KMeans(init = "k-means++", n_clusters = 7, n_init = 12)
k_means.fit(latlng_df)

KMeans(n_clusters=7, n_init=12)
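
The choice of 7 clusters was made by eye. One common way to sanity-check such a choice is an elbow plot of the inertia (within-cluster sum of squared distances) over several values of k. The sketch below runs on synthetic points around three made-up centres, not on the Toronto data, and fixes `random_state` only so that the run is repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: 30 points around each of three made-up centres
rng = np.random.default_rng(0)
centres = np.array([[43.65, -79.38], [43.75, -79.25], [43.70, -79.55]])
points = np.vstack([c + rng.normal(scale=0.01, size=(30, 2)) for c in centres])

# Fit KMeans for k = 1..6 and record the inertia of each fit
ks = range(1, 7)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# The "elbow" is the k after which the inertia stops dropping sharply
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 4))
```

On the real data, `points` would be replaced by `latlng_df[['Latitude', 'Longitude']]`.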

In [142]:
latlng_df['Cluster Label'] = k_means.labels_
latlng_df.head()

Unnamed: 0,Latitude,Longitude,Cluster Label
0,43.7588,-79.320197,3
1,43.732658,-79.311189,6
2,43.660706,-79.360457,0
3,43.64008,-79.38015,0
4,43.722079,-79.437507,2


In [143]:
final_df['Cluster Label'] = k_means.labels_
final_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Label
0,M3A,North York,Parkwoods,43.7588,-79.320197,3
1,M4A,North York,Victoria Village,43.732658,-79.311189,6
2,M5A,Downtown Toronto,Regent Park,43.660706,-79.360457,0
3,M5A,Downtown Toronto,Harbourfront,43.64008,-79.38015,0
4,M6A,North York,Lawrence Manor,43.722079,-79.437507,2


In [163]:
map_clusters = folium.Map(location=[latitude_Toronto, longitude_Toronto], zoom_start=11)

#Set the color scheme for the clusters: one color per cluster
n_clusters = 7
colors_array = cm.Set1(np.linspace(0, 1, n_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#Add markers to the map- each cluster has its own color
for lat, lon, poi, cluster in zip(final_df['Latitude'], final_df['Longitude'], final_df['Neighbourhood'], final_df['Cluster Label']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #Cluster labels run from 0 to n_clusters - 1, so they index the color list directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters