# Segmenting and Clustering Neighborhoods in Toronto 

## Part 1: Getting postal codes

In [1]:
import pandas as pd

We don't actually need to import any parsing libraries ourselves for this job. `pd.read_html()` calls lxml or Beautiful Soup behind the scenes (one of them still has to be installed). As long as the page we're scraping uses HTML `<table>` elements, which Wikipedia thankfully does, we don't need to write any new parsing code.
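As a quick illustration of what `pd.read_html()` hands back, here is a minimal sketch that parses an inline HTML string instead of the live Wikipedia page. The `html` snippet and the `toy_table` name are made up for the example, and it assumes lxml (or Beautiful Soup with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# A tiny hand-written table standing in for the Wikipedia page (illustrative only).
html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element it finds,
# which is why the notebook takes element [0].
tables = pd.read_html(StringIO(html), header=0)
toy_table = tables[0]
print(toy_table)
```

Wrapping the string in `StringIO` avoids the deprecation warning newer pandas versions raise for literal HTML input.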

In [2]:
postal_codes = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M", header=0)[0]
postal_codes

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


Let's get rid of all those unassigned values...

In [3]:
postal_codes = postal_codes[postal_codes['Borough'] != 'Not assigned']

def replace_unassigned_neighbourhoods(row): # Function to replace unassigned Neighbourhood values with their Borough
    if(row['Neighbourhood'] == 'Not assigned'): # This could just be a lambda, really, but I think it's clearer this way
        row['Neighbourhood'] = row['Borough']
    return row

postal_codes = postal_codes.apply(replace_unassigned_neighbourhoods, axis=1) # apply() returns a new dataframe, so keep the result
postal_codes

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Queen's Park
10,M9A,Etobicoke,Islington Avenue
11,M1B,Scarborough,Rouge
12,M1B,Scarborough,Malvern
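For what it's worth, the same replacement can be done without a row-wise function. This is only a sketch on a two-row toy frame (the `toy` dataframe is invented for the example), using `Series.mask()` to swap in the Borough wherever the Neighbourhood is unassigned.

```python
import pandas as pd

# Toy data mirroring two rows from the table above (illustrative only).
toy = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Not assigned"],
})

# Boolean mask marking the rows that need fixing.
unassigned = toy["Neighbourhood"] == "Not assigned"

# mask() replaces values where the condition is True with the matching Borough value.
toy["Neighbourhood"] = toy["Neighbourhood"].mask(unassigned, toy["Borough"])
print(toy)
```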


...And now we can group and aggregate our values using the `groupby()` and `agg()` functions. I used `value_counts()` as a quick hack to get only the most common value for "Borough", which should also be the only value, assuming each postal code has one borough.

In [4]:
def comma_list(x):
    return ', '.join(x) # Function to make comma separated lists

postal_codes_grouped = postal_codes.groupby(by='Postcode').agg({
    'Neighbourhood': comma_list,
    'Borough':  lambda x: x.value_counts().index[0] # Assuming each postal code has one Borough, picking the most popular value means picking the only value.
})
postal_codes_grouped

Unnamed: 0_level_0,Neighbourhood,Borough
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,"Rouge, Malvern",Scarborough
M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough
M1E,"Guildwood, Morningside, West Hill",Scarborough
M1G,Woburn,Scarborough
M1H,Cedarbrae,Scarborough
M1J,Scarborough Village,Scarborough
M1K,"East Birchmount Park, Ionview, Kennedy Park",Scarborough
M1L,"Clairlea, Golden Mile, Oakridge",Scarborough
M1M,"Cliffcrest, Cliffside, Scarborough Village West",Scarborough
M1N,"Birch Cliff, Cliffside West",Scarborough


In [5]:
postal_codes_grouped.shape

(103, 2)
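To make the grouping step easier to follow, here is a small sketch on a three-row toy frame (the data and the names `toy`, `toy_grouped` and `boroughs_per_code` are made up for the example). It also checks the "one borough per postal code" assumption directly with `nunique()`, and uses `'first'` in place of the `value_counts()` hack, which is equivalent when that assumption holds.

```python
import pandas as pd

# Toy data: one postal code with two neighbourhoods, one with a single neighbourhood.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M1G"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Scarborough"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Woburn"],
})

# Join neighbourhood names per postal code and keep the first borough seen.
toy_grouped = toy.groupby("Postcode").agg({
    "Neighbourhood": ", ".join,
    "Borough": "first",
})

# Sanity check: how many distinct boroughs does each postal code have?
boroughs_per_code = toy.groupby("Postcode")["Borough"].nunique()

print(toy_grouped)
print(boroughs_per_code)
```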

## Part 2: Adding coordinates

In [22]:
import geocoder

Alright, let's try this swanky new geocoder!

In [27]:
for postal_code in postal_codes_grouped.index:
    print('Getting postal code ' + postal_code)
    coords = None
    attempts = 0
    while coords is None:
        g = geocoder.osm(postal_code + ', Toronto, Ontario') # OSM at least gave a couple results, Google did nothing for me
        coords = g.latlng
        attempts += 1
        print(attempts) # You'll see in a second why I'm not even bothering to store the data
        if attempts > 20:
            raise ValueError("This clearly isn't working, just give up and use the CSV")
    print(coords)

Getting postal code M1B
1
[43.653963, -79.387207]
Getting postal code M1C
1
[43.653963, -79.387207]
Getting postal code M1E
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21


ValueError: This clearly isn't working, just give up and use the CSV

...Oh. It doesn't work. At all. Well, let's follow the exception message's advice and just use the provided CSV.

(I also tried skipping the failed lookups, but that only got me about a 10% success rate.)
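The "skip the failed lookups" variant can be sketched roughly like this. To keep it runnable offline, the geocoder is replaced by a stand-in: `lookup_with_retries` and `flaky_lookup` are hypothetical helpers written for this example, not part of the geocoder library.

```python
def lookup_with_retries(lookup, query, max_attempts=3):
    """Call lookup(query) up to max_attempts times; return None if every attempt fails."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
    return None  # Give up on this query and let the caller skip it


# Stand-in for a geocoder call: fails twice for M1B, always fails for anything else.
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    if query.startswith("M1B") and calls["count"] >= 3:
        return [43.806686, -79.194353]
    return None


results = {}
for code in ["M1B", "M1C"]:
    coords = lookup_with_retries(flaky_lookup, code + ", Toronto, Ontario")
    if coords is not None:  # Skip postal codes that never resolved
        results[code] = coords

print(results)
```

With a real geocoder, the `lookup` argument would be something like a small wrapper that returns `g.latlng`, and a short pause between attempts would probably be polite.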

In [31]:
postal_coords = pd.read_csv("Toronto_Postal_Geospatial_Coordinates.csv", index_col=0)
postal_coords.head(10)

Unnamed: 0_level_0,Latitude,Longitude
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
M1E,43.763573,-79.188711
M1G,43.770992,-79.216917
M1H,43.773136,-79.239476
M1J,43.744734,-79.239476
M1K,43.727929,-79.262029
M1L,43.711112,-79.284577
M1M,43.716316,-79.239476
M1N,43.692657,-79.264848


Since we've made the index of each dataframe the postal code, we don't need to specify the columns to join on.

In [51]:
postal_codes_merged = postal_codes_grouped.join(postal_coords)
postal_codes_merged

Unnamed: 0_level_0,Neighbourhood,Borough,Latitude,Longitude
Postcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
M1B,"Rouge, Malvern",Scarborough,43.806686,-79.194353
M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough,43.784535,-79.160497
M1E,"Guildwood, Morningside, West Hill",Scarborough,43.763573,-79.188711
M1G,Woburn,Scarborough,43.770992,-79.216917
M1H,Cedarbrae,Scarborough,43.773136,-79.239476
M1J,Scarborough Village,Scarborough,43.744734,-79.239476
M1K,"East Birchmount Park, Ionview, Kennedy Park",Scarborough,43.727929,-79.262029
M1L,"Clairlea, Golden Mile, Oakridge",Scarborough,43.711112,-79.284577
M1M,"Cliffcrest, Cliffside, Scarborough Village West",Scarborough,43.716316,-79.239476
M1N,"Birch Cliff, Cliffside West",Scarborough,43.692657,-79.264848


Yes, I know the example dataframe shows the index as being an integer range (i.e. the default index), but there's really no reason to reset the index at the moment. Having it be the postal code works just fine.
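Here is a minimal sketch of that index-based join on two toy frames (the names `left`, `right` and `joined` are made up for the example). The rows of `right` are deliberately in a different order to show that `join()` aligns on the index labels rather than on position, and the last line counts postal codes that found no coordinates.

```python
import pandas as pd

# Left frame indexed by postal code, like postal_codes_grouped.
left = pd.DataFrame(
    {"Borough": ["Scarborough", "Scarborough"]},
    index=pd.Index(["M1B", "M1C"], name="Postcode"),
)

# Right frame indexed by postal code too, but in a different row order.
right = pd.DataFrame(
    {"Latitude": [43.784535, 43.806686], "Longitude": [-79.160497, -79.194353]},
    index=pd.Index(["M1C", "M1B"], name="Postal Code"),
)

# join() is a left join on the index by default.
joined = left.join(right)

# Any postal code missing from the CSV would show up as NaN here.
missing = joined["Latitude"].isna().sum()

print(joined)
print("Rows without coordinates:", missing)
```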

## Part 3: The fun part

In [52]:
import folium

First, let's take a quick look at where all these neighborhoods are...

In [53]:
map_neighbourhoods = folium.Map(location=[43.6532, -79.3832], zoom_start=11)
for index, row in postal_codes_merged.iterrows():
    folium.Marker([row['Latitude'], row['Longitude']],
        popup=row['Neighbourhood']
    ).add_to(map_neighbourhoods)
map_neighbourhoods

Yep, that's a map. Let's start clustering the postal codes using k-means.

In [54]:
from sklearn.cluster import KMeans

At this point, having integer row indexes is really handy, so let's do that:

In [55]:
postal_codes_merged.reset_index(inplace=True)
postal_codes_merged.head()

Unnamed: 0,Postcode,Neighbourhood,Borough,Latitude,Longitude
0,M1B,"Rouge, Malvern",Scarborough,43.806686,-79.194353
1,M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough,43.784535,-79.160497
2,M1E,"Guildwood, Morningside, West Hill",Scarborough,43.763573,-79.188711
3,M1G,Woburn,Scarborough,43.770992,-79.216917
4,M1H,Cedarbrae,Scarborough,43.773136,-79.239476


And now we can start clustering:

In [75]:
# Feature matrix for k-means: one [Latitude, Longitude] row per postal code
km_X = postal_codes_merged[['Latitude', 'Longitude']].values

kmeans = KMeans(n_clusters=4, random_state=1).fit(km_X)
postal_codes_merged['Cluster'] = kmeans.labels_
postal_codes_merged.head(20)

Unnamed: 0,Postcode,Neighbourhood,Borough,Latitude,Longitude,Cluster
0,M1B,"Rouge, Malvern",Scarborough,43.806686,-79.194353,1
1,M1C,"Highland Creek, Rouge Hill, Port Union",Scarborough,43.784535,-79.160497,1
2,M1E,"Guildwood, Morningside, West Hill",Scarborough,43.763573,-79.188711,1
3,M1G,Woburn,Scarborough,43.770992,-79.216917,1
4,M1H,Cedarbrae,Scarborough,43.773136,-79.239476,1
5,M1J,Scarborough Village,Scarborough,43.744734,-79.239476,1
6,M1K,"East Birchmount Park, Ionview, Kennedy Park",Scarborough,43.727929,-79.262029,1
7,M1L,"Clairlea, Golden Mile, Oakridge",Scarborough,43.711112,-79.284577,1
8,M1M,"Cliffcrest, Cliffside, Scarborough Village West",Scarborough,43.716316,-79.239476,1
9,M1N,"Birch Cliff, Cliffside West",Scarborough,43.692657,-79.264848,1
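To show what `labels_` contains, here is a small sketch on four made-up coordinate pairs that form two obvious groups (the points are invented for the example, not taken from the CSV). `n_init` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented [latitude, longitude] points: two in the east, two in the west.
points = np.array([
    [43.80, -79.19],
    [43.78, -79.16],
    [43.65, -79.55],
    [43.64, -79.57],
])

toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=1).fit(points)

# labels_ holds one cluster number per input row, in the same order as the rows.
labels = toy_kmeans.labels_
print(labels)

# cluster_centers_ holds the mean [latitude, longitude] of each cluster.
print(toy_kmeans.cluster_centers_)
```

The cluster numbers themselves are arbitrary; only which rows share a number is meaningful.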


And finally map out our clustered locations:

In [76]:
cluster_colors = ['red', 'blue', 'green', 'purple']

map_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=11)
for index, row in postal_codes_merged.iterrows():
    folium.Marker([row['Latitude'], row['Longitude']],
                    popup=row['Neighbourhood'] +'<br>'+ row['Borough'] +'<br>'+ row['Postcode'],
                    icon=folium.Icon(color=cluster_colors[row['Cluster']])
    ).add_to(map_clusters)
map_clusters

k-means has done a good job of separating the city into four regions, roughly matching the borders between Central Toronto (purple), North York (red), West Toronto and Etobicoke (green), and Scarborough (blue).
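One way to put a number on "roughly matching" would be to cross-tabulate boroughs against cluster labels. The sketch below does this on a handful of invented rows (the `toy` frame and its cluster assignments are made up, not the real results); on the notebook's data, the same call would take `postal_codes_merged['Borough']` and `postal_codes_merged['Cluster']`.

```python
import pandas as pd

# Invented borough/cluster pairs, just to show the shape of the output.
toy = pd.DataFrame({
    "Borough": ["Scarborough", "Scarborough", "North York", "North York", "Etobicoke"],
    "Cluster": [1, 1, 0, 1, 2],
})

# Rows are boroughs, columns are cluster labels, cells are postal code counts.
table = pd.crosstab(toy["Borough"], toy["Cluster"])
print(table)
```

A borough whose counts sit almost entirely in one column lines up well with a single cluster; counts spread across columns mean the cluster border cuts through it.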