# Segmenting and Clustering Neighborhoods in Toronto

## Question 1 - Scraping Wikipedia Page

We scrape the Wikipedia list of Canadian postal codes beginning with M (Toronto) and assume that the first table on the page is the one of interest.

In [11]:
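from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd
import requests

# Fetch the page and fail early on an HTTP error
response = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
# Assume the first <table> on the page holds the postal codes
table = soup.find_all('table')[0]
# Wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(table)))[0]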
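
Relying on the first table is fragile if the page layout changes. As a minimal sketch, assuming the tables have already been parsed into a list of DataFrames (which is what `pd.read_html` returns), the table could instead be chosen by its column names. The helper `pick_table`, the toy frames, and the column names used here are hypothetical and are not taken from the scraped page.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for candidate in tables:
        if required.issubset(map(str, candidate.columns)):
            return candidate
    raise ValueError(f"no table with columns {sorted(required)}")


# Toy stand-ins for the list that pd.read_html returns (hypothetical data)
tables = [
    pd.DataFrame({'Note': ['navigation box']}),
    pd.DataFrame({
        'Postal Code': ['M3A'],
        'Borough': ['North York'],
        'Neighbourhood': ['Parkwoods'],
    }),
]

chosen = pick_table(tables, ['Postal Code', 'Borough'])
print(chosen.columns.tolist())
```

Raising an error when no table matches makes a layout change visible immediately, rather than letting the wrong table flow into later cells.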

We rename the columns, drop rows whose borough is not assigned, and collapse the neighborhoods so that each postal code and borough pair appears on a single row.
Where a cell already lists several neighborhoods separated by a slash, we replace the slash with a comma.

In [30]:
# Rename columns as requested
df.columns = ['PostalCode', 'Borough', 'Neighborhood']

# Drop unassigned boroughs as requested
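df = df[df['Borough'] != 'Not assigned']

# Combine the neighborhoods of each (PostalCode, Borough) pair into one comma-separated string
pt = pd.pivot_table(df, index=['PostalCode', 'Borough'], values='Neighborhood', aggfunc=', '.join).reset_index()
# Some cells already list several neighborhoods separated by ' / '; normalize those to commas
pt['Neighborhood'] = pt['Neighborhood'].str.replace(' / ', ', ', regex=False)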
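
The filter-and-collapse step can also be sketched on a small toy frame, which makes its effect easy to inspect. This version uses `groupby` rather than `pivot_table`; it is an alternative formulation, not the cell above, and the rows in `toy` are made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical rows, not the scraped table)
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M1B', 'M1B', 'M1C'],
    'Borough': ['Not assigned', 'Scarborough', 'Scarborough', 'Scarborough'],
    'Neighborhood': ['Not assigned', 'Rouge', 'Malvern', 'Highland Creek / Rouge Hill'],
})

# Drop rows without an assigned borough
assigned = toy[toy['Borough'] != 'Not assigned']

# One row per (PostalCode, Borough); neighborhoods joined with ', '
collapsed = (
    assigned.groupby(['PostalCode', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

# Normalize slash-separated lists to comma-separated ones (literal match, no regex)
collapsed['Neighborhood'] = collapsed['Neighborhood'].str.replace(' / ', ', ', regex=False)
print(collapsed)
```

Both the rows that were joined by the aggregation and the cell that originally contained a slash end up in the same comma-separated form.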

In [32]:
# Display in the same order as requested
pt.set_index('PostalCode').loc[
    ['M5G', 'M2H', 'M4B', 'M1J', 'M4G', 'M4M', 'M1R', 'M9V', 'M9L', 'M5V', 'M1B', 'M5A']
].reset_index()

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M5G | Downtown Toronto | Central Bay Street |
| 1 | M2H | North York | Hillcrest Village |
| 2 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 3 | M1J | Scarborough | Scarborough Village |
| 4 | M4G | East York | Leaside |
| 5 | M4M | East Toronto | Studio District |
| 6 | M1R | Scarborough | Wexford, Maryvale |
| 7 | M9V | Etobicoke | South Steeles, Silverstone, Humbergate, Jamest... |
| 8 | M9L | North York | Humber Summit |
| 9 | M5V | Downtown Toronto | CN Tower, King and Spadina, Railway Lands, Har... |
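
Selecting rows with `.loc` and a list of labels raises a `KeyError` if any requested postal code is missing from the index. A hedged alternative, shown on hypothetical toy data, is `reindex`, which keeps the requested order and fills missing labels with `NaN`. The code `'M0X'` is deliberately absent from the toy frame.

```python
import pandas as pd

# Toy frame (hypothetical rows)
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M4B', 'M5G'],
    'Borough': ['Scarborough', 'East York', 'Downtown Toronto'],
})
wanted = ['M5G', 'M0X', 'M1B']  # 'M0X' does not exist in toy

indexed = toy.set_index('PostalCode')

# .loc raises KeyError when any requested label is missing
try:
    indexed.loc[wanted]
    loc_failed = False
except KeyError:
    loc_failed = True

# .reindex keeps the requested order and fills missing labels with NaN
ordered = indexed.reindex(wanted).reset_index()
print(loc_failed)
print(ordered)
```

Which behavior is preferable depends on the goal: `.loc` fails loudly on a missing code, while `reindex` keeps the display going and marks the gap.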


Check how many rows the cleaned table has:

In [33]:
print(pt.shape)

(103, 3)
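
Beyond the shape, a few integrity checks can confirm that the cleaning steps did what was intended. The sketch below runs on a hypothetical toy frame; `check_cleaned` is an illustrative helper, not part of the notebook.

```python
import pandas as pd


def check_cleaned(frame):
    """Return simple integrity checks for a cleaned postal-code table."""
    return {
        'rows': len(frame),
        'unique_postal_codes': bool(frame['PostalCode'].is_unique),
        'no_unassigned_boroughs': not (frame['Borough'] == 'Not assigned').any(),
    }


# Toy cleaned table (hypothetical rows)
toy = pd.DataFrame({
    'PostalCode': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Rouge, Malvern', 'Highland Creek, Rouge Hill'],
})

report = check_cleaned(toy)
print(report)
```

Calling `check_cleaned(pt)` on the real table would report the row count alongside whether each postal code appears exactly once.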
