# Segmenting and Clustering Neighborhoods in Toronto

This notebook scrapes the list of Toronto neighborhoods from Wikipedia and transforms it into a pandas DataFrame.

Install BeautifulSoup, a library for scraping information from web pages, and related packages.

In [1]:
%pip install tabulate
%pip install lxml
%pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [3]:
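from io import StringIO

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from tabulate import tabulate

# Download the Wikipedia page that lists postal codes starting with "M"
res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
res.raise_for_status()
soup = BeautifulSoup(res.content, 'lxml')

# The first table on the page is taken to be the postal code list
table = soup.find_all('table')[0]

# read_html returns a list of DataFrames; wrap the HTML in StringIO because
# newer pandas versions deprecate passing literal HTML strings
toronto_df = pd.read_html(StringIO(str(table)))[0]
toronto_df.head()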

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


- Only process rows that have an assigned borough. Ignore rows whose borough is "Not assigned".
- If a row has a borough but its neighbourhood is "Not assigned", use the borough name as the neighbourhood.

In [4]:
# Replace "Not assigned" with NaN
toronto_df.replace('Not assigned', np.nan, inplace=True)

# Drop rows with a missing "Borough" assignment
toronto_df.dropna(subset=["Borough"], axis=0, inplace=True)

# If a row has a borough but no neighbourhood, use the borough as the neighbourhood.
# Assign the result back: fillna(inplace=True) on a selected column may not
# update the original frame in newer pandas versions.
toronto_df['Neighbourhood'] = toronto_df['Neighbourhood'].fillna(toronto_df['Borough'])

# Reset the index
toronto_df.reset_index(drop=True, inplace=True)
toronto_df.head(5)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
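
The two cleaning rules can be checked offline on a small hand-made frame. This is a minimal sketch, not the notebook's own cell: the `sample` rows and the name `cleaned` are made up for illustration and are not scraped data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows that mimic the layout of the scraped table.
sample = pd.DataFrame(
    {
        "Postcode": ["M1A", "M3A", "M7A"],
        "Borough": ["Not assigned", "North York", "Queen's Park"],
        "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
    }
)

# Rule 1: rows whose borough is "Not assigned" are dropped.
cleaned = sample.replace("Not assigned", np.nan).dropna(subset=["Borough"])

# Rule 2: a missing neighbourhood falls back to the borough name.
cleaned = cleaned.assign(
    Neighbourhood=cleaned["Neighbourhood"].fillna(cleaned["Borough"])
).reset_index(drop=True)

print(cleaned)
```

In this sample the `M1A` row is removed by the first rule, and the `M7A` row receives its borough name as the neighbourhood by the second.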


Combine neighbourhoods that share a postal code into a single row, separated by commas.

In [5]:
toronto_df = toronto_df.groupby(['Postcode','Borough'], sort=False)['Neighbourhood'].apply(", ".join).reset_index()
toronto_df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
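
To see what the grouping step does in isolation, the sketch below applies the same `groupby` and `", ".join` pattern to three hand-written rows. The `rows` and `merged` names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical rows: two neighbourhoods share the postal code M5A.
rows = pd.DataFrame(
    {
        "Postcode": ["M3A", "M5A", "M5A"],
        "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
        "Neighbourhood": ["Parkwoods", "Harbourfront", "Regent Park"],
    }
)

# Group by postal code and borough, then join the neighbourhood names.
# sort=False keeps the groups in their order of first appearance.
merged = (
    rows.groupby(["Postcode", "Borough"], sort=False)["Neighbourhood"]
    .apply(", ".join)
    .reset_index()
)

print(merged)
```

Grouping on both `Postcode` and `Borough` keeps the borough as a column in the result; grouping on `Postcode` alone would drop it from the output.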


In [6]:
toronto_df.shape

(103, 3)
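
The shape alone does not confirm that each postal code now sits on exactly one row. A possible sanity check, sketched here with a hypothetical helper `has_unique_postcodes` and made-up frames, is to test the column for uniqueness:

```python
import pandas as pd


def has_unique_postcodes(df: pd.DataFrame, column: str = "Postcode") -> bool:
    """Return True when every value in `column` appears exactly once."""
    return bool(df[column].is_unique)


# Hypothetical frames for illustration only.
combined = pd.DataFrame({"Postcode": ["M3A", "M4A", "M5A"]})
uncombined = pd.DataFrame({"Postcode": ["M3A", "M5A", "M5A"]})

print(has_unique_postcodes(combined))
print(has_unique_postcodes(uncombined))
```

The same helper could be called on `toronto_df` after the grouping step, assuming the column is still named `Postcode`.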