Segmenting and Clustering Neighborhoods in Toronto

by Nathan Rathge

1. Blank notebook created in Watson Studio.

2. Scrape the Wikipedia page using pandas.

In [1]:
# Find out the number of tables on the page
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
dfs = pd.read_html(url)

print(len(dfs))

3
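
Since the page yields three tables, choosing one by position can break if the page layout changes. A minimal sketch of selecting by column names instead, assuming the wanted table is the one with the Postal Code, Borough and Neighbourhood columns; `pick_table` is a hypothetical helper (not a pandas function) and `example_tables` is a made-up stand-in for `dfs`:

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required column."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table has columns {sorted(required)}")


# Stand-in for the list returned by pd.read_html(url); the real tables are larger.
example_tables = [
    pd.DataFrame({"Province": ["Ontario"], "Code": ["M"]}),
    pd.DataFrame({
        "Postal Code": ["M1A", "M3A"],
        "Borough": ["Not assigned", "North York"],
        "Neighbourhood": ["Not assigned", "Parkwoods"],
    }),
]

picked = pick_table(example_tables, ["Postal Code", "Borough", "Neighbourhood"])
print(picked.shape)
```

With the real data this would be called as `pick_table(dfs, [...])` in place of `dfs[0]`.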


In [3]:
# Try the first table.
df = dfs[0]
df.head()

,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
...,...,...,...
175,M5Z,Not assigned,Not assigned
176,M6Z,Not assigned,Not assigned
177,M7Z,Not assigned,Not assigned
178,M8Z,Etobicoke,"Mimico NW, The Queensway West, South of Bloor,..."


In [7]:
# Check the size of the dataframe
df.shape

(180, 3)

3. Clean the dataframe.

In [12]:
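# Remove rows where Borough is "Not assigned".
# First replace "Not assigned" with NaN. The result is assigned back to the
# column, because replace(..., inplace=True) on a selected column may not
# update df in newer pandas versions.
import numpy as np
df["Borough"] = df["Borough"].replace("Not assigned", np.nan)
# Drop the rows where Borough is NaN
df.dropna(subset=["Borough"], axis=0, inplace=True)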
# Reset the index
df.reset_index(drop=True, inplace=True)
df.head(5)

,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
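
The same cleaning can also be expressed as a single boolean filter, without the intermediate NaN step. This is a sketch on a small stand-in frame (`raw` is a made-up name, with a few rows copied from the output shown earlier), not a replacement for the cell above:

```python
import pandas as pd

# Small stand-in for the scraped table.
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough, then renumber the index from 0.
assigned = raw[raw["Borough"] != "Not assigned"].reset_index(drop=True)
print(assigned)
```

Because the filter returns a new frame, `raw` is left unchanged, which makes it easy to rerun the cell.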


In [13]:
df.shape

(103, 3)

In [14]:
# Check whether any postal code appears more than once; if so, its neighborhoods would need to be combined.
df['Postal Code'].value_counts()

M1P    1
M5S    1
M4W    1
M5T    1
M7R    1
      ..
M1E    1
M4A    1
M6N    1
M1L    1
M5K    1
Name: Postal Code, Length: 103, dtype: int64

In [15]:
# value_counts() returns a Series with 103 entries, the same as the number of rows in df,
# so every postal code is unique and no neighborhoods need to be combined.
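
If a postal code did appear on several rows, its neighborhoods could be joined into one comma-separated string. A minimal sketch on made-up rows (the frame `dup` and the way its rows are split are illustrative, not taken from the scraped table):

```python
import pandas as pd

# Hypothetical input in which one postal code appears on two rows.
dup = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# A quick uniqueness test, equivalent in spirit to comparing value_counts() with the row count.
print(dup["Postal Code"].is_unique)

# Join the neighbourhood names of each postal code, keeping first-seen order.
combined = (
    dup.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```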

In [36]:
# Check for "Not assigned" in the Neighbourhood column (str.find returns -1 where the text is absent).
NotA = df['Neighbourhood'].str.find("Not assigned")
NotA.value_counts()

-1    103
Name: Neighbourhood, dtype: int64

In [31]:
# str.find returned -1 for all 103 rows, so no row has a "Not assigned" neighborhood and no further cleaning is needed.
df.head()

,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
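
Had any row carried a borough but a "Not assigned" neighborhood, one convention (assumed here, not stated in this notebook) would be to fall back to the borough name. A sketch on made-up rows; `sample`, the postal code "M9Z" and "Example Borough" are placeholders:

```python
import pandas as pd

# Hypothetical rows; the second has a borough but no neighbourhood.
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M9Z"],
    "Borough": ["North York", "Example Borough"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# Boolean mask of the rows that still need a neighbourhood name.
missing = sample["Neighbourhood"] == "Not assigned"
print(missing.sum())

# Where the mask is True, take the value from Borough instead.
sample["Neighbourhood"] = sample["Neighbourhood"].mask(missing, sample["Borough"])
print(sample)
```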


In [37]:
# Display the final shape of the dataframe.
df.shape

(103, 3)

This is the end of part one of the peer-graded assignment.

This is the beginning of part two of the peer-graded assignment.

Getting the GPS coordinates of each neighborhood.

In [47]:
# Download the data file and read it into a dataframe.
!wget -q -O 'Geospatial_Coordinates.csv' https://cocl.us/Geospatial_data/Geospatial_Coordinates.csv
torontogps = pd.read_csv('Geospatial_Coordinates.csv')
torontogps.head()

,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [51]:
# Merge the neighborhood data with the coordinate data on postal code.
# A left join keeps every row of df, in its original order.
dfgps = pd.merge(df, torontogps, on='Postal Code', how='left')
dfgps.head()

,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
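
It can be worth confirming that every postal code actually received coordinates. A minimal sketch using the `validate` and `indicator` options of `pd.merge`; the frames `neighbourhoods` and `coords` are small made-up stand-ins, and one coordinate row is deliberately omitted to show what an unmatched code looks like:

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
# Coordinates for M5A are deliberately left out to produce an unmatched row.
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

checked = pd.merge(
    neighbourhoods, coords,
    on="Postal Code", how="left",
    validate="one_to_one",  # raises MergeError if a postal code repeats on either side
    indicator=True,         # adds a "_merge" column: "both" or "left_only"
)

# Postal codes that found no coordinates.
unmatched = checked.loc[checked["_merge"] != "both", "Postal Code"].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would indicate that all 103 postal codes were matched.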


This is the end of part two of the peer-graded assignment.