Segmenting and Clustering Neighbourhoods in Toronto

Import the necessary libraries

In [19]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests

Download the Wikipedia page and parse it with BeautifulSoup

In [20]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# Download the page once and reuse the response body for parsing
source = requests.get(url).content
content = BeautifulSoup(source, 'lxml')

Locate the first table on the page, collect its cells, and initialise one empty list per column

In [21]:
table = content.find('table')
td = table.find_all('td')
postcode = []
borough = []
neighbourhood = []
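
As a minimal offline sketch of what find and find_all return, the block below parses a small hand-written HTML fragment instead of the live page. The fragment and the names sample_html, soup and cells are assumptions made for illustration, and the built-in 'html.parser' is swapped in for 'lxml' so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the downloaded Wikipedia page
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; find_all() returns every match
first_table = soup.find("table")
cells = [td.text.strip() for td in first_table.find_all("td")]

# The cells come back as one flat list, three per table row
print(cells)
```

Because the cells arrive as a flat list, each row's values sit at positions i, i+1 and i+2.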

Fill the lists with the scraped data. Each table row has three cells, so step through the flat list of cells three at a time

In [22]:
for i in range(0, len(td), 3):
    postcode.append(td[i].text.strip())
    borough.append(td[i+1].text.strip())
    neighbourhood.append(td[i+2].text.strip())
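
Stepping through the cells three at a time assumes every row has exactly three cells. A possible alternative, shown here only as a sketch on a hand-written fragment, is to walk the table row by row and skip any row that does not have three cells. The fragment and the names sample_html and rows are assumptions for illustration, and 'html.parser' is used in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the middle data row is deliberately malformed
sample_html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M9Z</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if len(cells) == 3:  # skips the header row (th only) and short rows
        rows.append(cells)

# Transpose the list of rows into one list per column
postcode, borough, neighbourhood = (list(col) for col in zip(*rows))
print(postcode, borough, neighbourhood)
```

With this approach a malformed row is dropped instead of shifting every later value into the wrong column.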

Build a DataFrame from the scraped lists and name the columns

In [23]:
df_codes = pd.DataFrame(data=[postcode, borough, neighbourhood]).transpose()
df_codes.columns = ['Postal Code', 'Borough', 'Neighborhood']
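
The same frame can also be built from a dict that maps column names to lists, which names the columns in one step and avoids the transpose. This is a sketch on made-up sample values; df_example and the list contents are assumptions, not scraped data.

```python
import pandas as pd

# Hypothetical sample values standing in for the scraped lists
postcode = ["M3A", "M4A"]
borough = ["North York", "North York"]
neighbourhood = ["Parkwoods", "Victoria Village"]

# Each dict key becomes a column name, each list becomes that column
df_example = pd.DataFrame({
    "Postal Code": postcode,
    "Borough": borough,
    "Neighborhood": neighbourhood,
})
print(df_example)
```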

Cleaning - Drop rows whose borough is "Not assigned". If a row has a borough but a "Not assigned" neighborhood, set the neighborhood to the borough name

In [24]:
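# Treat "Not assigned" boroughs as missing and drop those rows
df_codes['Borough'] = df_codes['Borough'].replace('Not assigned', np.nan)
df_codes = df_codes.dropna(subset=['Borough']).copy()
# Where the neighborhood is "Not assigned", fall back to the borough name
unassigned = df_codes['Neighborhood'] == 'Not assigned'
df_codes['Neighborhood'] = df_codes['Neighborhood'].mask(unassigned, df_codes['Borough'])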
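
The two cleaning rules can be checked on a toy frame. The sketch below uses a boolean filter and .loc assignment, which is one of several ways to express the rules; the frame df_example and its three rows are made up for illustration.

```python
import pandas as pd

# Hypothetical rows covering the three cases the rules distinguish
df_example = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Harbourfront", "Not assigned"],
})

# Rule 1: drop rows whose borough is not assigned
df_example = df_example[df_example["Borough"] != "Not assigned"].copy()

# Rule 2: an unassigned neighborhood takes the name of its borough
missing = df_example["Neighborhood"] == "Not assigned"
df_example.loc[missing, "Neighborhood"] = df_example.loc[missing, "Borough"]

print(df_example)
```

Copying the borough name from the same row means the rule holds for any borough, not only for one known case.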

Cleaning - One postal code area can contain more than one neighborhood. Combine such rows into a single row, with the neighborhoods separated by commas

In [25]:
df_codes = df_codes.groupby(['Postal Code', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()
df_codes.columns = ['Postal Code', 'Borough', 'Neighborhood']
df_codes.head(12)

    Postal Code  Borough      Neighborhood
0   M1B          Scarborough  Rouge, Malvern
1   M1C          Scarborough  Highland Creek, Rouge Hill, Port Union
2   M1E          Scarborough  Guildwood, Morningside, West Hill
3   M1G          Scarborough  Woburn
4   M1H          Scarborough  Cedarbrae
5   M1J          Scarborough  Scarborough Village
6   M1K          Scarborough  East Birchmount Park, Ionview, Kennedy Park
7   M1L          Scarborough  Clairlea, Golden Mile, Oakridge
8   M1M          Scarborough  Cliffcrest, Cliffside, Scarborough Village West
9   M1N          Scarborough  Birch Cliff, Cliffside West
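
To see the grouping step in isolation, here is a small sketch on made-up rows in which one postal code appears twice. The frame df_example and the name merged are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: postal code M5A is listed with two neighborhoods
df_example = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Group on postal code and borough, then join each group's neighborhoods
merged = (
    df_example.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

The two M5A rows collapse into one row whose neighborhood is "Harbourfront, Regent Park", while M6A is left unchanged.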


Use the .shape attribute to print the number of rows and columns of the dataframe.

In [26]:
df_codes.shape

(103, 3)
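
Since .shape is a (rows, columns) tuple, it can be unpacked directly when only the row count is needed. A short sketch on a made-up two-row frame (df_example is an assumption for illustration):

```python
import pandas as pd

# Hypothetical two-row frame with the same three columns
df_example = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Borough": ["North York", "North York"],
    "Neighborhood": ["Parkwoods", "Victoria Village"],
})

# shape is an attribute, not a method, so there are no parentheses
n_rows, n_cols = df_example.shape
print(f"{n_rows} rows, {n_cols} columns")
```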