# Clustering neighborhoods in Toronto on the basis of venues found on Foursquare

Import the relevant packages:

In [1]:
import pandas as pd
import numpy as np

## Data import

Scrape the list of Canadian postal codes beginning with M (the Toronto area) from [Wikipedia](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

In [2]:
scrape = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

Check import:

In [3]:
scrape[0].head()

Unnamed: 0,0,1,2
0,Postcode,Borough,Neighbourhood
1,M1A,Not assigned,Not assigned
2,M2A,Not assigned,Not assigned
3,M3A,North York,Parkwoods
4,M4A,North York,Victoria Village


In [4]:
scrape[0].tail()

Unnamed: 0,0,1,2
285,M8Z,Etobicoke,Mimico NW
286,M8Z,Etobicoke,The Queensway West
287,M8Z,Etobicoke,Royal York South West
288,M8Z,Etobicoke,South of Bloor
289,M9Z,Not assigned,Not assigned


## Data wrangling

Some wrangling is necessary. The scraped table has integer column labels and carries the real header in its first row, so that row is promoted to the column names. Rows containing 'Not assigned' are then dropped.

In [5]:
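df_postals = scrape[0]
# Use the first row (Postcode, Borough, Neighbourhood) as column names
df_postals.columns = df_postals.iloc[0].values
# Treat 'Not assigned' as missing and drop every row with a missing value
df_postals.replace(to_replace='Not assigned', value=np.nan, inplace=True)
df_postals.dropna(axis=0, inplace=True)
# Remove the former header row and renumber the remaining rows from 0
df_postals.drop(index=0, axis=0, inplace=True)
df_postals.reset_index(drop=True, inplace=True)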
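The same steps can be tried on a small hand-made frame without touching the scraped data. This is a minimal sketch: the rows in `raw` are hypothetical stand-ins for the scrape, and the `lenient` variant (dropping only rows without a borough) is an assumption about an alternative, not what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw scrape: integer column labels,
# with the real header stored in row 0.
raw = pd.DataFrame([
    ["Postcode", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
])

# Promote row 0 to the header and keep only the data rows.
toy = raw.iloc[1:].copy()
toy.columns = raw.iloc[0].to_numpy()

# Mark the placeholder as missing.
toy = toy.replace("Not assigned", np.nan)

# Same rule as the notebook cell: drop a row if any column is missing.
strict = toy.dropna().reset_index(drop=True)

# Assumed alternative: keep rows that have a borough, even when the
# neighbourhood is missing.
lenient = toy.dropna(subset=["Borough"]).reset_index(drop=True)

print(strict)
print(lenient)
```

With the strict rule a row is lost whenever its neighbourhood is unassigned, even if the borough is known. Whether that is desirable depends on the analysis.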

Check the resulting data frame:

In [6]:
df_postals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 3 columns):
Postcode         211 non-null object
Borough          211 non-null object
Neighbourhood    211 non-null object
dtypes: object(3)
memory usage: 5.0+ KB


In [7]:
df_postals.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [8]:
df_postals.describe()

Unnamed: 0,Postcode,Borough,Neighbourhood
count,211,211,211
unique,102,10,209
top,M9V,Etobicoke,St. James Town
freq,8,45,2
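The summary reports 211 rows but only 102 unique postal codes, so some codes appear on several rows. A minimal sketch of how such repeats could be listed, using a toy frame built from the five rows shown by `head()` above (the names `toy`, `rows_per_code` and `shared` are illustrative only):

```python
import pandas as pd

# Toy frame with the five rows shown in the head() output.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights"],
})

# Number of rows per postal code.
rows_per_code = toy["Postcode"].value_counts()

# Postal codes that occur on more than one row.
shared = rows_per_code[rows_per_code > 1]
print(shared.to_dict())  # {'M5A': 2}
```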


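Several neighbourhoods can share one postal code. To get one row per postal code, the rows are grouped by postal code and borough, and the neighbourhood names of each group are joined into a single comma-separated string: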

In [9]:
df_postals = (
    df_postals.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .agg(', '.join)
    .reset_index()
)
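To see what the grouping does to an individual postal code, here is a minimal sketch on a toy frame built from the five rows shown earlier by `head()`. The names `toy` and `grouped` are illustrative only.

```python
import pandas as pd

# Toy frame: M5A appears twice with different neighbourhoods.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights"],
})

# One row per (Postcode, Borough) pair; neighbourhood names are
# concatenated in their original order, separated by ", ".
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(grouped)
```

The two M5A rows collapse into one row whose neighbourhood reads "Harbourfront, Regent Park"; the other postal codes are unchanged.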

In [10]:
df_postals.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [11]:
df_postals.shape

(102, 3)
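The 102 rows match the 102 unique postal codes reported by `describe()` earlier, which suggests that each postal code belongs to exactly one borough. Grouping on both columns would otherwise produce more than one row for a code. A minimal sketch of an explicit check for that assumption, again on a toy frame (all names are illustrative):

```python
import pandas as pd

# Toy frame with the five rows shown in the earlier head() output.
toy = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A", "M5A", "M6A"],
    "Borough": ["North York", "North York", "Downtown Toronto",
                "Downtown Toronto", "North York"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront",
                      "Regent Park", "Lawrence Heights"],
})

# Count the distinct boroughs recorded for each postal code.
boroughs_per_code = toy.groupby("Postcode")["Borough"].nunique()
one_borough_each = bool((boroughs_per_code == 1).all())

# If every code has a single borough, grouping by both columns yields
# exactly as many groups as there are unique postal codes.
n_groups = toy.groupby(["Postcode", "Borough"]).ngroups
print(one_borough_each, n_groups == toy["Postcode"].nunique())
```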