This notebook explores, segments, and clusters the neighborhoods in the city of Toronto.

<h4>Import Library and Parser</h4>

In [1]:
# Install beautifulsoup4, the lxml and html5lib parsers, and the requests library
# Note: For simplicity, the install outputs were cleared

In [None]:
pip install beautifulsoup4

In [None]:
pip install lxml

In [None]:
pip install html5lib

In [None]:
pip install requests

In [6]:
import pandas as pd
from bs4 import BeautifulSoup
import requests

<h4>Reading the data set from the URL</h4>

In [7]:
# Get the page source from Wikipedia using the 'requests' library
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

# Pass source code to Beautifulsoup4
soup = BeautifulSoup(source, 'lxml')

# print(soup.prettify()) - This will format our output
# Note: For simplicity, the long output was cleared out. However, you can see snippet below

In [8]:
# From the formatted output, take the first <table> on the page (<table class="wikitable sortable">) to build the dataframe
table = soup.find_all('table')[0]

<img src="Table.png" width="800" align="center">

In [9]:
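# Read the table into a dataframe (wrapping the HTML in StringIO avoids the literal-string deprecation warning in recent pandas)
from io import StringIO
df = pd.read_html(StringIO(str(table)))[0]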

In [10]:
# Now, let's see what our table looks like
df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
286,M8Z,Etobicoke,South of Bloor


<h4>Evaluating and Cleaning Data</h4>

In [11]:
# From our table, we can select rows with 'Not assigned' value
df_Na = df[df.Borough.isin(['Not assigned'])]

In [12]:
# Let's see all rows with a 'Not assigned' Borough
df_Na

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
9,M8A,Not assigned,Not assigned
13,M2B,Not assigned,Not assigned
20,M7B,Not assigned,Not assigned
...,...,...,...
278,M4Z,Not assigned,Not assigned
279,M5Z,Not assigned,Not assigned
280,M6Z,Not assigned,Not assigned
281,M7Z,Not assigned,Not assigned


In [13]:
# Having found the 'Not assigned' rows, keep only rows with an assigned Borough (df_Na is reused for the filtered frame)
df_Na = df[~df.Borough.isin(['Not assigned'])]

In [14]:
# Let's check that no rows still have a Borough of 'Not assigned'
df_Na

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
...,...,...,...
282,M8Z,Etobicoke,Kingsway Park South West
283,M8Z,Etobicoke,Mimico NW
284,M8Z,Etobicoke,The Queensway West
285,M8Z,Etobicoke,Royal York South West
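
The filter used above can be checked on a small frame. The sketch below uses invented sample rows (not the scraped data; `sample`, `kept_isin`, and `kept_ne` are illustrative names) and suggests that `~isin([...])` and a plain `!=` comparison select the same rows.

```python
import pandas as pd

# Hypothetical sample rows that mimic the layout of the scraped table
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M5A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Harbourfront', 'Regent Park'],
})

# Keep only rows whose Borough is assigned (same pattern as the cell above)
kept_isin = sample[~sample['Borough'].isin(['Not assigned'])]

# Equivalent filter written as a direct comparison
kept_ne = sample[sample['Borough'] != 'Not assigned']

print(kept_isin.equals(kept_ne))
print(len(kept_isin))
```

`isin` becomes the more convenient form once there are several placeholder values to exclude at the same time.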


In [15]:
# Group rows that share a Postcode and Borough into a single row, joining their Neighbourhood values with commas
df_same_postcode = df_Na.groupby(['Postcode','Borough'])['Neighbourhood'].apply(','.join).reset_index()
df_same_postcode.tail(18)

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Not assigned
86,M7R,Mississauga,Canada Post Gateway Processing Centre
87,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
88,M8V,Etobicoke,"Humber Bay Shores,Mimico South,New Toronto"
89,M8W,Etobicoke,"Alderwood,Long Branch"
90,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
91,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout..."
92,M8Z,Etobicoke,"Kingsway Park South West,Mimico NW,The Queensw..."
93,M9A,Etobicoke,Islington Avenue
94,M9B,Etobicoke,"Cloverdale,Islington,Martin Grove,Princess Gar..."
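
To make the grouping step easier to follow, here is a minimal sketch on a made-up three-row frame (the rows and the names `sample` and `grouped` are assumptions for illustration). Two neighbourhoods share postcode M5A, so they should collapse into one comma-separated row.

```python
import pandas as pd

# Hypothetical rows: two neighbourhoods share postcode M5A
sample = pd.DataFrame({
    'Postcode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})

# Same pattern as the notebook cell: one row per (Postcode, Borough) pair,
# with the Neighbourhood strings of each group joined by commas
grouped = (
    sample.groupby(['Postcode', 'Borough'])['Neighbourhood']
    .apply(','.join)
    .reset_index()
)

print(grouped)
```

`reset_index()` turns the group keys back into ordinary columns, so the result keeps the same three columns as the input.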


In [16]:
# Notice that the first row shown (M7A, Queen's Park) still has 'Not assigned' as its Neighbourhood

<h4>Identify and Handle Missing Values</h4>

In [17]:
# Where Neighbourhood is 'Not assigned', set it to the same value as the row's Borough
not_assigned = df_same_postcode['Neighbourhood'] == 'Not assigned'
df_same_postcode.loc[not_assigned, 'Neighbourhood'] = df_same_postcode.loc[not_assigned, 'Borough']

In [18]:
# Here we can see the newly assigned value in the first row
df_same_postcode.tail(18)

Unnamed: 0,Postcode,Borough,Neighbourhood
85,M7A,Queen's Park,Queen's Park
86,M7R,Mississauga,Canada Post Gateway Processing Centre
87,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
88,M8V,Etobicoke,"Humber Bay Shores,Mimico South,New Toronto"
89,M8W,Etobicoke,"Alderwood,Long Branch"
90,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
91,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout..."
92,M8Z,Etobicoke,"Kingsway Park South West,Mimico NW,The Queensw..."
93,M9A,Etobicoke,Islington Avenue
94,M9B,Etobicoke,"Cloverdale,Islington,Martin Grove,Princess Gar..."
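
As an alternative way to express the same replacement, the sketch below uses `Series.mask`, which swaps in a fallback value wherever a condition is true. It is not the method used in this notebook; the sample rows and the names `sample` and `unassigned` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows: the first has a placeholder Neighbourhood
sample = pd.DataFrame({
    'Postcode': ['M7A', 'M8W'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighbourhood': ['Not assigned', 'Alderwood,Long Branch'],
})

# True where the Neighbourhood is still the placeholder
unassigned = sample['Neighbourhood'] == 'Not assigned'

# Where the condition holds, take the value from Borough instead
sample['Neighbourhood'] = sample['Neighbourhood'].mask(unassigned, sample['Borough'])

print(sample['Neighbourhood'].tolist())
```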


<h4>Checking the Data</h4>

In [19]:
# Finally, use the dataframe's '.shape' attribute to check the number of rows and columns
df_same_postcode.shape

(103, 3)
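
Beyond the shape, a few extra sanity checks can help confirm the cleaning worked. The sketch below runs on a made-up two-row frame (`cleaned` stands in for the real result and is an assumption); the same checks could be applied to `df_same_postcode`.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned dataframe
cleaned = pd.DataFrame({
    'Postcode': ['M7A', 'M8W'],
    'Borough': ["Queen's Park", 'Etobicoke'],
    'Neighbourhood': ["Queen's Park", 'Alderwood,Long Branch'],
})

# After grouping, each postcode should appear exactly once
postcodes_unique = cleaned['Postcode'].is_unique

# No placeholder values should remain in either text column
no_placeholder = not (cleaned[['Borough', 'Neighbourhood']] == 'Not assigned').any().any()

# .shape is a (rows, columns) tuple
n_rows, n_cols = cleaned.shape

print(postcodes_unique, no_placeholder, n_rows, n_cols)
```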