# Capstone Project Week 3 by Loïc BRISSOT: Clustering Toronto Neighborhoods

Peer-graded Assignment: Segmenting and Clustering Neighborhoods in Toronto

https://www.coursera.org/learn/applied-data-science-capstone/peer/I1bDq/segmenting-and-clustering-neighborhoods-in-toronto/submit

## Part 1: Scraping neighborhood and postal code information from Wikipedia

We target the following page: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract the table containing the postal codes.

In [1]:
# Setting up the environment
import numpy as np
import pandas as pd
import urllib.request  # the submodule must be imported explicitly to use urllib.request.urlopen

In [2]:
## Retrieving the Wikipedia page with a request
page_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(page_url)


## Reading the page
raw = page.read()
source = raw.decode('UTF-8') # Decoding bytes to UTF-8

## Extracting the table
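# We target the postal code table
table_start = source.find('<table class="wikitable sortable">')
table_end = source.find('</table>', table_start) + len('</table>')  # first closing tag after the table start
table = source[table_start:table_end]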
# We import the table with read_html from Pandas
PC_data = pd.read_html(table, header = 0)[0]

PC_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We now clean the table according to the following instructions:

* The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
* Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.
* If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough. So for the 9th cell in the table on the Wikipedia page, the value of the Borough and the Neighborhood columns will be Queen's Park.
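
The four rules above can be sketched with pandas on a small toy table. This is only an illustrative sketch, not the code used in the cells of this notebook: the rows below are made up from the examples quoted in the rules, the `M7A` postal code for Queen's Park is an assumption added for illustration, and the variable names `raw`, `clean` and `result` are hypothetical.

```python
import pandas as pd

# Toy rows mimicking the scraped table (assumed sample, not the full Wikipedia data)
raw = pd.DataFrame({
    'Postcode': ['M1A', 'M5A', 'M5A', 'M7A', 'M3A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto',
                "Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Harbourfront', 'Regent Park',
                      'Not assigned', 'Parkwoods'],
})

# Rule 2: keep only the rows that have an assigned borough
clean = raw[raw['Borough'] != 'Not assigned'].copy()

# Rule 4: an unassigned neighbourhood falls back to the borough name
missing = clean['Neighbourhood'] == 'Not assigned'
clean.loc[missing, 'Neighbourhood'] = clean.loc[missing, 'Borough']

# Rule 3: one row per postal code, neighbourhoods joined with a comma
result = (
    clean.groupby(['Postcode', 'Borough'], as_index=False)['Neighbourhood']
         .agg(','.join)
)

# Rule 1: the three final column names
result.columns = ['PostalCode', 'Borough', 'Neighborhood']
print(result)
```

Grouping on both the postal code and the borough keeps the borough column in the result without a second lookup, under the assumption that each postal code belongs to a single borough.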

In [3]:
## Cleaning the data
# Removing unassigned boroughs
df1 = PC_data[PC_data.Borough != 'Not assigned']

# Cleanup and renaming
df1 = df1.sort_values(by=['Postcode','Borough'])
df1 = df1.rename(index=str, columns={'Postcode': 'Postal Code'})

# Resetting the index
df1.reset_index(inplace=True)
df1.drop('index',axis=1,inplace=True)

df1.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1B,Scarborough,Rouge
1,M1B,Scarborough,Malvern
2,M1C,Scarborough,Highland Creek
3,M1C,Scarborough,Rouge Hill
4,M1C,Scarborough,Port Union


In [4]:
## Concatenation of the neighborhood names
# Creating a new dataframe to host the concatenated result
df2 = pd.DataFrame(df1['Postal Code'].drop_duplicates())
df2['Borough'] = ''
df2['Neighborhood'] = ''

# Aligning the axis for df1 and df2
df2.reset_index(inplace=True)
df2.drop('index', axis=1, inplace=True)
df1.reset_index(inplace=True)
df1.drop('index', axis=1, inplace=True)

# We concatenate the neighborhood names
for i in df2.index:
    for j in df1.index:
        if df2.iloc[i, 0] == df1.iloc[j, 0]:
            df2.iloc[i, 1] = df1.iloc[j, 1]
            df2.iloc[i, 2] = df2.iloc[i, 2] + ',' + df1.iloc[j, 2]
            
for i in df2.index:
    s = df2.iloc[i, 2]
    if s[0] == ',':
        s = s[1:]
    df2.iloc[i, 2] = s

df2.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [5]:
# Shape of the result
df2.shape

(103, 3)

Our table contains 103 distinct postal codes covering the neighborhoods of Toronto.

## Part 2: Recovering geospatial data

In [6]:
# Import of the csv file containing the postal code area coordinates
df_geo = pd.read_csv('Geospatial_Coordinates.csv')
df_geo.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
# Merging the geospatial coordinates into the results of part 1
df_postal = pd.merge(df2, df_geo, how='inner', on='Postal Code')
df_postal.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [8]:
# We verify all the neighborhoods were correctly merged
if df_postal.shape[0] == df2.shape[0]:
    print('The ' + str(df2.shape[0]) + ' postal codes were correctly merged.')
else:
    print('Some postal codes from the results in "Part 1" were different from the postal codes in the file "Geospatial_Coordinates.csv"')

The 103 postal codes were correctly merged.
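
Comparing row counts tells us whether something went missing, but not which postal codes are concerned. One possible way to list them is a left merge with the `indicator` option of `pd.merge`. The sketch below uses two small toy frames, not the real data: `M9Z` is a made-up postal code, and the names `left`, `geo`, `checked` and `unmatched` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the Part 1 result and the coordinates file (assumed sample data)
left = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C', 'M9Z'],  # 'M9Z' is a made-up code with no coordinates
    'Borough': ['Scarborough', 'Scarborough', 'Hypothetical'],
})
geo = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# A left merge keeps every postal code from Part 1;
# indicator=True adds a '_merge' column telling where each row was found
checked = pd.merge(left, geo, how='left', on='Postal Code', indicator=True)

# Postal codes that have no match in the coordinates table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postal Code'].tolist()
print(unmatched)
```

An empty `unmatched` list would correspond to the "correctly merged" case reported above.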


In [9]:
# Export of the results to a CSV file
# df_postal.to_csv(path_or_buf='TorontoPostalCodes.csv')

## Part 3: Clustering the Toronto neighborhoods

## Required environment

In [10]:
# Setting up the environment
from sklearn.cluster import KMeans

import folium

import matplotlib.cm as cm
import matplotlib.colors as colors

## Limited clustering

We start with a first clustering limited to the boroughs in the heart of the city (i.e. the boroughs whose name contains the word 'Toronto').

In [11]:
# Filtering out the boroughs whose names do not contain 'Toronto'
neighborhoods = df_postal.copy()
neighborhoods = neighborhoods[df_postal.Borough.str.contains('Toronto')]
neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [12]:
# Clustering with kmeans
model = np.stack((neighborhoods['Latitude'], neighborhoods['Longitude']), axis=1)
kmeans = KMeans(n_clusters=5, random_state=0).fit(model)

clusters = kmeans.labels_
neighborhoods['Cluster'] = clusters

neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,3
41,M4K,East Toronto,"The Danforth West,Riverdale",43.679557,-79.352188,3
42,M4L,East Toronto,"The Beaches West,India Bazaar",43.668999,-79.315572,3
43,M4M,East Toronto,Studio District,43.659526,-79.340923,3
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1
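
The number of clusters is fixed at 5 in the cell above. One common way to examine that choice is to compare the k-means inertia (the sum of squared distances to the nearest cluster center) for several values of k. The sketch below is only an illustration under assumptions: it uses the five coordinates shown in the output above rather than the full table, and the names `coords` and `inertias` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude of the five postal codes shown in the output above
coords = np.array([
    [43.676357, -79.293031],
    [43.679557, -79.352188],
    [43.668999, -79.315572],
    [43.659526, -79.340923],
    [43.728020, -79.388790],
])

# Fit k-means for several values of k and record the inertia of each fit
inertias = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, value)
```

The inertia shrinks as k grows, so the value of interest is usually the k after which the improvement becomes small. Note that k-means treats latitude and longitude as plain Euclidean coordinates, which is only an approximation of real distances.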


In [13]:
# Displaying the map
toronto_map = folium.Map(location=[43.68, -79.35], zoom_start=12)

# Colors for the cluster points (named cluster_colors to avoid shadowing matplotlib.colors)
cluster_colors = ['red', 'green', 'blue', 'orange', 'yellow', 'purple']

for borough, latitude, longitude, cluster in zip(neighborhoods['Borough'],
                                                 neighborhoods['Latitude'],
                                                 neighborhoods['Longitude'],
                                                 neighborhoods['Cluster']):
    # The popup shows the borough name together with its cluster number
    label = folium.Popup('{} (cluster {})'.format(borough, cluster), parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=8,
        popup=label,
        color='black',
        fill=True,
        fill_color=cluster_colors[cluster],
        fill_opacity=1
    ).add_to(toronto_map)

toronto_map