# Segmenting and Clustering Neighborhoods in Toronto

Mario Ambrosino. 2019/05/04.

## Goal

Use this notebook to scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M, extract the table of postal codes, and transform it into a pandas dataframe with the columns Postcode, Borough and Neighbourhood.

In [1]:
# Libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# Import HTML table from Wikipedia
wiki_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
req = requests.get(wiki_url)
req.raise_for_status()  # stop early if the page could not be fetched
soup = BeautifulSoup(req.content, 'lxml')
# Select the first table carrying either of these CSS classes
table_classes = {"class": ["sortable", "plainrowheaders"]}
wiki_table = soup.find("table", table_classes)

In [3]:
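# Create raw data frame from HTML table
from io import StringIO

html_table = wiki_table.prettify()
# Wrap the markup in StringIO: passing literal HTML to read_html is deprecated since pandas 2.1
raw_data = pd.read_html(StringIO(html_table))
df = raw_data[0]
df.head(12)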

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights
7,M6A,North York,Lawrence Manor
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned


In [4]:
# Filter the dataframe as the assignment prescribes:

# Drop the rows whose Borough is "Not assigned":
f1_df = df[df['Borough'] != "Not assigned"]

# Join the neighbourhoods that share a postcode:
rows = []
# Unique postcodes, in order of first appearance, index the final table
unique_postcode = f1_df["Postcode"].unique()

for pc in unique_postcode:
    pc_rows = f1_df[f1_df["Postcode"] == pc]
    c_borough = pc_rows["Borough"].unique()[0]
    c_neigh = ", ".join(pc_rows["Neighbourhood"].values)
    # Use "Borough" when "Neighbourhood" is "Not assigned"
    if c_neigh == "Not assigned":
        c_neigh = c_borough
    rows.append({"Postcode": pc,
                 "Borough": c_borough,
                 "Neighbourhood": c_neigh})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
f2_df = pd.DataFrame(rows).set_index("Postcode")
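The per-postcode loop above can also be written as a single `groupby` aggregation. The following is a minimal sketch, not the notebook's own method: the helper name `merge_neighbourhoods` and the small `sample` frame are assumptions made for illustration, with rows taken from the table preview earlier in the notebook.

```python
import pandas as pd

# Small hand-made sample in the same shape as the scraped table (illustrative rows only)
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto",
                "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Harbourfront",
                      "Regent Park", "Not assigned"],
})

def merge_neighbourhoods(df):
    """Hypothetical helper: drop unassigned boroughs, join neighbourhoods per postcode."""
    assigned = df[df["Borough"] != "Not assigned"]
    # sort=False keeps postcodes in order of first appearance, like the loop version
    merged = assigned.groupby("Postcode", sort=False).agg(
        {"Borough": "first", "Neighbourhood": ", ".join}
    )
    # Fall back to the borough name when the neighbourhood is "Not assigned"
    unassigned = merged["Neighbourhood"] == "Not assigned"
    merged.loc[unassigned, "Neighbourhood"] = merged.loc[unassigned, "Borough"]
    return merged

result = merge_neighbourhoods(sample)
print(result)
```

On the full table this avoids filtering the frame once per postcode, which is the main reason to prefer it over the explicit loop.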


In [5]:
# Save the result to CSV and show it
f2_df.to_csv("toronto.csv")
f2_df

Postcode,Borough,Neighbourhood
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Harbourfront, Regent Park"
M6A,North York,"Lawrence Heights, Lawrence Manor"
M7A,Queen's Park,Queen's Park
M9A,Etobicoke,Islington Avenue
M1B,Scarborough,"Rouge, Malvern"
M3B,North York,Don Mills North
M4B,East York,"Woodbine Gardens, Parkview Hill"
M5B,Downtown Toronto,"Ryerson, Garden District"


In [6]:
f2_df.shape

(103, 2)
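Beyond the row count, the cleaned table can be checked against the two rules applied earlier: every postcode appears once, and no "Not assigned" placeholder survives. Below is a minimal sketch of such a check. The helper name `check_postcode_table` is hypothetical, and the `sample` frame only mimics the shape of `f2_df` using two rows from the output shown above.

```python
import pandas as pd

def check_postcode_table(df):
    """Hypothetical helper: list the problems found in a cleaned postcode table."""
    problems = []
    # Each postcode should appear exactly once in the index
    if not df.index.is_unique:
        problems.append("duplicate postcodes")
    # No placeholder values should survive the cleaning step
    for col in ("Borough", "Neighbourhood"):
        if (df[col] == "Not assigned").any():
            problems.append(f"'Not assigned' left in {col}")
    return problems

# Illustrative rows only, indexed by postcode like f2_df
sample = pd.DataFrame(
    {"Borough": ["North York", "Queen's Park"],
     "Neighbourhood": ["Parkwoods", "Queen's Park"]},
    index=pd.Index(["M3A", "M7A"], name="Postcode"),
)
print(check_postcode_table(sample))  # prints []
```

In the notebook itself the same call would be `check_postcode_table(f2_df)`, where an empty list means both rules hold.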