# Clustering Toronto neighborhoods

In this notebook we cluster the neighborhoods of Toronto.

First, we have to import the pandas and numpy libraries.

In [1]:
import pandas as pd
import numpy as np


If the lxml library is not installed, install it now; otherwise the next cell will raise an error:

In [68]:
# uncomment below and run if you don't have lxml installed yet
#!conda install -y -c anaconda lxml

Next, we import the data from the Wikipedia URL. The read_html function parses the tables on the page and returns them as a list of dataframes, so little extra work is needed here. We need only the first one.

In [4]:
postalCodes = pd.read_html('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', header=0)
df = postalCodes[0]  # we only need the first dataframe
df

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
...,...,...,...
175,M5Z,Not assigned,
176,M6Z,Not assigned,
177,M7Z,Not assigned,
178,M8Z,Etobicoke,Mimico NW / The Queensway West / South of Bloo...
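Taking element 0 of the list works for the page as it was when this notebook ran. If the page layout changes, selecting the table by its column names is less fragile. The sketch below uses a hypothetical helper, `pick_table`, and a toy list of dataframes standing in for the result of `pd.read_html`; neither is part of the original notebook.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first dataframe that contains every required column.

    Hypothetical helper, not part of pandas.
    """
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"no table with columns {sorted(required)}")


# Toy stand-in for the list returned by pd.read_html (not the real page).
tables = [
    pd.DataFrame({"Navigation": ["A", "B"]}),
    pd.DataFrame({
        "Postal code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighborhood": [None, "Parkwoods"],
    }),
]

codes = pick_table(tables, ["Postal code", "Borough", "Neighborhood"])
print(codes)
```

In the notebook this would be called as `pick_table(postalCodes, [...])` instead of indexing with `[0]`.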


The data needs some cleaning. We drop the rows where the Borough value is "Not assigned", then check whether any NaN values remain in the dataframe.

In [5]:
df = df[df['Borough'] != 'Not assigned']
df.isna().any()

Postal code     False
Borough         False
Neighborhood    False
dtype: bool
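To make the filtering step concrete, here is a minimal sketch on a three-row toy dataframe (the `sample`, `mask`, and `cleaned` names are illustrative only). It shows that the comparison produces a boolean mask, and that the surviving rows keep their original index labels.

```python
import pandas as pd

sample = pd.DataFrame({
    "Postal code": ["M1A", "M3A", "M4A"],
    "Borough": ["Not assigned", "North York", "North York"],
    "Neighborhood": [None, "Parkwoods", "Victoria Village"],
})

# Boolean mask: True for the rows we want to keep.
mask = sample["Borough"] != "Not assigned"

# .copy() gives an independent dataframe, so later in-place changes
# do not trigger chained-assignment warnings.
cleaned = sample[mask].copy()

print(cleaned.index.tolist())  # original labels survive: [1, 2]
print(cleaned.isna().any())    # one boolean per column
```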

Filtering left gaps in the index, so we reset it.

In [8]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park / Harbourfront
3,M6A,North York,Lawrence Manor / Lawrence Heights
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing Centre
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...
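The `drop=True` argument matters here. A small sketch on toy data (the `filtered`, `kept`, and `dropped` names are illustrative) shows the difference: without it, the old labels are kept as an extra column.

```python
import pandas as pd

filtered = pd.DataFrame(
    {"Borough": ["North York", "Downtown Toronto"]},
    index=[2, 4],  # leftover labels, as after a row filter
)

kept = filtered.reset_index()              # old labels become an "index" column
dropped = filtered.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())   # ['index', 'Borough']
print(dropped.index.tolist())  # [0, 1]
```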


Checking the shape of the dataframe:

In [9]:
df.shape

(103, 3)

## Part 2 - Importing the coordinates

Let's fetch the geocoordinate data. I chose the simplest way: downloading the prepared CSV file with wget.

In [11]:
!wget -O gcoor.csv https://cocl.us/Geospatial_data

--2020-05-04 11:33:48--  https://cocl.us/Geospatial_data
Resolving cocl.us (cocl.us)... 158.85.108.83, 158.85.108.86, 169.48.113.194
Connecting to cocl.us (cocl.us)|158.85.108.83|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-04 11:33:51--  https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Resolving ibm.box.com (ibm.box.com)... 107.152.27.197
Connecting to ibm.box.com (ibm.box.com)|107.152.27.197|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv [following]
--2020-05-04 11:33:51--  https://ibm.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv
Reusing existing connection to ibm.box.com:443.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://ibm.ent.box.com/public/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.
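If wget is not available (for example on a machine without it installed), the same download can be sketched with the Python standard library. The `download` function below is a hypothetical helper, not part of the original notebook; `urlopen` follows HTTP redirects such as the ones in the log above by default.

```python
import shutil
import urllib.request


def download(url, dest):
    """Stream the resource at `url` into the local file `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


# Usage (needs network access; the URL is the one from the wget cell):
# download("https://cocl.us/Geospatial_data", "gcoor.csv")
```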

The column names differ in capitalization ('Postal Code' in the CSV, 'Postal code' in our dataframe), so we fix this after import. This matters because it is the column that occurs in both dataframes, so it will be the key of the join operation.

In [27]:
gcoor = pd.read_csv('gcoor.csv')
gcoor.rename(columns={'Postal Code': 'Postal code'}, inplace=True)


Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.654260,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,The Kingsway / Montgomery Road / Old Mill North,43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business reply mail Processing Centre,43.662744,-79.321558
101,M8Y,Etobicoke,Old Mill South / King's Mill Park / Sunnylea /...,43.636258,-79.498509
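The rename above hard-codes the one column that differs. As a sketch of a more general approach, the hypothetical helper `align_column_case` below renames any column that matches a target name except for capitalization. It assumes all column names are strings; the toy dataframes reuse one row from the outputs above.

```python
import pandas as pd


def align_column_case(frame, target_columns):
    """Rename columns of `frame` that match a target name except for case.

    Hypothetical helper; assumes every column name is a string.
    """
    lookup = {name.lower(): name for name in target_columns}
    mapping = {
        col: lookup[col.lower()]
        for col in frame.columns
        if col.lower() in lookup and col != lookup[col.lower()]
    }
    return frame.rename(columns=mapping)


left = pd.DataFrame({"Postal code": ["M3A"], "Borough": ["North York"]})
right = pd.DataFrame({
    "Postal Code": ["M3A"],
    "Latitude": [43.753259],
    "Longitude": [-79.329656],
})

right = align_column_case(right, left.columns)
print(right.columns.tolist())  # ['Postal code', 'Latitude', 'Longitude']
```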


The column shared by both dataframes is Postal code, so we merge on it. pd.merge matches rows by key value rather than by position, so the dataframes do not need to be sorted first (and sort_values returns a new dataframe instead of sorting in place, so it has no effect unless the result is assigned).

In [29]:
# inner join on the shared key; the row order of the inputs does not matter
df_inner = pd.merge(df, gcoor, on=['Postal code'], how='inner')

df_inner.head()



Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Regent Park / Harbourfront,43.65426,-79.360636
3,M6A,North York,Lawrence Manor / Lawrence Heights,43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government,43.662301,-79.389494
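An inner join silently drops rows whose key is missing on either side. One way to check for that, sketched here on toy data, is an outer merge with pandas' `indicator=True` and `validate="one_to_one"` options. The `M9Z` row and its 0.0 coordinates are made up purely to produce an unmatched key.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal code": ["M5A", "M3A", "M9Z"],  # different order, one unmatched key
    "Latitude": [43.654260, 43.753259, 0.0],
    "Longitude": [-79.360636, -79.329656, 0.0],
})

# validate raises MergeError if a key is duplicated on either side;
# indicator adds a "_merge" column telling where each row came from.
checked = pd.merge(left, right, on="Postal code", how="outer",
                   validate="one_to_one", indicator=True)
print(checked["_merge"].value_counts())

# Rows present on both sides are exactly what an inner join would return.
inner = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(inner["Postal code"].tolist())  # ['M3A', 'M5A']
```

If the `left_only` and `right_only` counts are both zero, the inner join loses nothing.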
