<h1>Segmentation and Clustering of Neighbourhoods in Toronto</h1>

Importing the required Python libraries

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

<h3>Data Collection</h3>

Extracting Table Data from the Wikipedia Page using Beautiful Soup

In [2]:
# .text returns the HTML source of the page, so the variable is named for its content
page_html = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

soup = BeautifulSoup(page_html, 'lxml')
my_table = soup.find('table', {'class':'wikitable sortable'})
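
The lookup above can be tried offline on a small HTML fragment. This is only a sketch: `sample_html` is a hypothetical stand-in for the downloaded page, not the real Wikipedia markup, and it uses Python's built-in `html.parser` instead of `lxml` so that no extra parser is required. It also shows one way to guard against `find` returning `None` when the table class is not present.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Wikipedia markup)
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

# html.parser ships with Python, so this sketch does not depend on lxml
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_table = sample_soup.find('table', {'class': 'wikitable sortable'})

# find() returns None when nothing matches, so fail early with a clear message
if sample_table is None:
    raise ValueError('No table with class "wikitable sortable" was found')

# Walk the table row by row and keep the stripped text of every cell
rows = []
for tr in sample_table.find_all('tr'):
    rows.append([cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])])

print(rows)
```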

<h4>Parsing Table Data and storing in a Pandas Dataframe</h4>

Getting Table Header for Column Names

In [3]:
th = my_table.find_all('th')

# strip() removes surrounding whitespace, including the trailing '\n' on the last header cell
header = [t.get_text().strip() for t in th]
header

['Postcode', 'Borough', 'Neighbourhood']

Getting all Data Column-Wise using list slicing with a step of 3

In [4]:
td = my_table.find_all('td')

# Cells arrive row by row with three columns per row,
# so a slice with step 3 selects a single column
_postcode = [t.get_text().strip() for t in td[0::3]]
_borough = [t.get_text().strip() for t in td[1::3]]
# strip() also removes the trailing '\n' on the last cell of each row
_neighbourhood = [t.get_text().strip() for t in td[2::3]]
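
The column extraction relies on the cells being stored row by row, three per row. A minimal sketch of the same idea on a plain Python list, where `cells` is hypothetical sample data rather than the scraped result:

```python
# Hypothetical flattened cells: three columns per table row
cells = ['M3A', 'North York', 'Parkwoods',
         'M4A', 'North York', 'Victoria Village']

n_cols = 3
postcodes = cells[0::n_cols]        # items 0, 3, 6, ...
boroughs = cells[1::n_cols]         # items 1, 4, 7, ...
neighbourhoods = cells[2::n_cols]   # items 2, 5, 8, ...

# zip() puts the columns back together as one tuple per table row
table_rows = list(zip(postcodes, boroughs, neighbourhoods))
print(table_rows)
```

This approach silently misaligns the columns if any row has a different number of cells, so it is only safe when the table is known to be regular.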

Compiling all data and storing it in a Pandas Dataframe

In [5]:
df = pd.DataFrame(columns = header)

df['Postcode'] = _postcode
df['Borough'] = _borough
df['Neighbourhood'] = _neighbourhood

df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


<h3>Data Wrangling</h3>

Dropping rows where Borough is set to 'Not assigned'

In [6]:
# drop=True discards the old index instead of adding it back as an 'index' column
df = df[df.Borough != 'Not assigned'].reset_index(drop=True)
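
The filtering step can be checked on a small frame. This is a sketch with hypothetical sample rows; it uses `reset_index(drop=True)`, which renumbers the rows from 0 without keeping the old index as an extra column.

```python
import pandas as pd

# Hypothetical sample rows in the same layout as the scraped table
sample = pd.DataFrame({
    'Postcode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# The boolean mask keeps only rows whose Borough is assigned
kept = sample[sample['Borough'] != 'Not assigned'].reset_index(drop=True)
print(kept)
```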

Replacing unassigned Neighbourhoods with their Borough and grouping Neighbourhoods that share a Postal Code

In [7]:
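# Where the Neighbourhood is 'Not assigned', fall back to the Borough name
df['Neighbourhood'] = df['Neighbourhood'].mask(df['Neighbourhood'] == 'Not assigned', df['Borough'])
# One row per Postcode: keep the first Borough and join the Neighbourhood names
df = df.groupby('Postcode').agg({'Borough':'first', 'Neighbourhood':', '.join}).reset_index()
df = df[['Postcode', 'Borough', 'Neighbourhood']]
df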

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"
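
The fallback and grouping logic can be illustrated on three rows taken from the preview of the scraped table. This is a sketch, not the notebook's exact cell: it uses `Series.mask` for the fallback and `as_index=False` to keep `Postcode` as a regular column.

```python
import pandas as pd

# Rows taken from the preview of the scraped table
sample = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighbourhood': ['Harbourfront', 'Regent Park', 'Not assigned'],
})

# Where the Neighbourhood is unassigned, use the Borough name instead
unassigned = sample['Neighbourhood'] == 'Not assigned'
sample['Neighbourhood'] = sample['Neighbourhood'].mask(unassigned, sample['Borough'])

# Collapse to one row per Postcode and join the Neighbourhood names
grouped = (
    sample.groupby('Postcode', as_index=False)
          .agg({'Borough': 'first', 'Neighbourhood': ', '.join})
)
print(grouped)
```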


Checking the Dimensions of the Dataframe

In [8]:
df.shape

(103, 3)

<h3>Adding Coordinates to the Dataframe</h3>

Importing the provided CSV file for Geospatial Data

In [9]:
# The code was removed by Watson Studio for sharing.

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
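
The loading code for this cell is not available. As a hedged sketch of how such a file could be read, the example below parses two of the rows shown above from an in-memory string with `pd.read_csv`. The names `csv_text` and `df_geo_sample` are hypothetical, and the real file location is not known, so a path to the provided CSV would take the place of the `io.StringIO` object.

```python
import io
import pandas as pd

# Two rows copied from the output above; the real file path is unknown
csv_text = """Postal Code,Latitude,Longitude
M1B,43.806686,-79.194353
M1C,43.784535,-79.160497
"""

# read_csv accepts a path or any file-like object
df_geo_sample = pd.read_csv(io.StringIO(csv_text))
print(df_geo_sample.dtypes)
```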


Renaming the 'Postal Code' column so that both Dataframes share the Postcode key

In [10]:
# Only the column needs renaming; index=str would also convert the row labels to strings
df_geo.rename(columns={'Postal Code':'Postcode'}, inplace=True)
df_geo.head()

Unnamed: 0,Postcode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Merging the Dataframes on the shared Postcode key

In [11]:
df_final = pd.merge(df, df_geo, on='Postcode')
df_final = df_final[['Postcode', 'Borough', 'Neighbourhood', 'Latitude', 'Longitude']]
df_final.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
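
`pd.merge` performs an inner join by default, so any Postcode without a matching row in the coordinates table is dropped without a message. One possible check, sketched on hypothetical frames (`M9Z` is an invented code used to force a mismatch), is a left join with `indicator=True` and `validate='one_to_one'`:

```python
import pandas as pd

# Hypothetical frames; 'M9Z' is an invented code with no coordinates
left = pd.DataFrame({
    'Postcode': ['M1B', 'M1C', 'M9Z'],
    'Borough': ['Scarborough', 'Scarborough', 'Etobicoke'],
})
right = pd.DataFrame({
    'Postcode': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# validate raises MergeError if Postcode is duplicated on either side;
# indicator adds a '_merge' column recording where each row came from
checked = pd.merge(left, right, on='Postcode', how='left',
                   validate='one_to_one', indicator=True)

# Postcodes present in the left frame but missing from the coordinates
missing = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(missing)
```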
