### Segmenting and Clustering Neighborhoods in Toronto

#### In this assignment, you will explore, segment, and cluster the neighborhoods in the city of Toronto. Unlike New York, however, Toronto's neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own way, so you need to be agile and able to learn new libraries and tools quickly as the project demands.

#### For the Toronto neighborhood data, a Wikipedia page has all the information we need to explore and cluster the neighborhoods in Toronto. You will need to scrape the Wikipedia page, wrangle and clean the data, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

#### Once the data is in a structured format, you can replicate the analysis that we did on the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

#### Your submission will be a link to your Jupyter Notebook in your GitHub repository.

In [1]:
import pandas as pd
import requests # library to handle requests
from bs4 import BeautifulSoup # library to parse html page
import re

In [2]:
# scrape wiki page
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
r = requests.get(wiki_url)

# parse HTML and get the table
page = BeautifulSoup(r.text, 'html.parser')
row = page.table.tbody.find_all('tr')
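
The parsing step above can be tried offline on a small HTML fragment. The sketch below is only an illustration: `sample_html`, `sample_rows`, and `parsed` are hypothetical names, and the fragment merely imitates the three-column layout the notebook expects from the Wikipedia table.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the layout of the postal code table
# (an assumption for illustration, not the real page content)
sample_html = """
<table>
  <tbody>
    <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
  </tbody>
</table>
"""

sample_page = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_page.table.tbody.find_all('tr')

parsed = []
for tr in sample_rows:
    tds = tr.find_all('td')
    if tds:  # the header row contains only <th> cells, so it is skipped
        parsed.append([td.text.strip() for td in tds])

print(parsed)
```

The header row yields an empty `find_all('td')` result, which is why the loop needs a guard before unpacking the cells.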

In [3]:
# collect one record per table row, then build the dataframe in one step
# (DataFrame.append was removed in pandas 2.0)
records = []
for tr in row:
    tds = tr.find_all('td')
    if len(tds) > 0:  # skip the header row, which has no <td> cells
        postcode, borough, neighbourhood = tds
        records.append({'PostalCode': postcode.text, 'Borough': borough.text, 'Neighborhood': re.sub(r'\n', '', neighbourhood.text)})

df = pd.DataFrame(records, columns=['PostalCode', 'Borough', 'Neighborhood'])

df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
# keep the rows whose neighborhood is 'Not assigned' for later use
not_assigned_df = df[df['Neighborhood'] == 'Not assigned'].copy()
not_assigned_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
8,M7A,Queen's Park,Not assigned
9,M8A,Not assigned,Not assigned
13,M2B,Not assigned,Not assigned


In [5]:
# remove 'Not assigned' items
df = df[df['Neighborhood'] != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


In [7]:
# group items by ['PostalCode', 'Borough'] and join the neighborhood names with commas
df = df.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood'].agg(', '.join)
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
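
To see what the grouping step does in isolation, here is a minimal sketch on three rows taken from the earlier output. The names `toy` and `grouped` are hypothetical and used only for this illustration.

```python
import pandas as pd

# Three rows copied from the scraped data shown earlier
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M6A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Lawrence Heights'],
})

# Rows sharing a postal code and borough collapse into a single row,
# with their neighborhood names joined by ', '
grouped = toy.groupby(['PostalCode', 'Borough'], as_index=False)['Neighborhood'].agg(', '.join)
print(grouped)
```

The two M5A rows become one row with "Harbourfront, Regent Park", while M6A is left with its single neighborhood.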


In [8]:
# test one item
df[df['PostalCode'] == 'M5A']

Unnamed: 0,PostalCode,Borough,Neighborhood
53,M5A,Downtown Toronto,"Harbourfront, Regent Park"


In [9]:
# set 'Neighborhood' to the value of 'Borough' (assign returns a new dataframe, avoiding SettingWithCopyWarning)
not_assigned_df = not_assigned_df.assign(Neighborhood=not_assigned_df['Borough'])
not_assigned_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
8,M7A,Queen's Park,Queen's Park
9,M8A,Not assigned,Not assigned
13,M2B,Not assigned,Not assigned
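
The same fill can also be done in place with a boolean mask, without keeping a separate dataframe. This is an alternative sketch, not the approach this notebook takes; `toy` and `needs_fill` are hypothetical names, and the rows are a small sample of the values shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M7A', 'M3A'],
    'Borough': ['Not assigned', "Queen's Park", 'North York'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Parkwoods'],
})

# Drop rows that have no borough at all
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# Where the neighborhood is missing, reuse the borough name
needs_fill = toy['Neighborhood'] == 'Not assigned'
toy.loc[needs_fill, 'Neighborhood'] = toy.loc[needs_fill, 'Borough']
print(toy)
```

Applied before the grouping step, this variant would make the later append and de-duplication unnecessary, at the cost of changing the order of the cells.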


In [10]:
# append the rows that have a borough but no neighborhood
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = pd.concat([df, not_assigned_df[not_assigned_df['Neighborhood'] != 'Not assigned']])

# remove duplicated items
df.drop_duplicates(inplace=True)

# test one item whose 'Neighborhood' was 'Not assigned' but whose 'Borough' is defined
df[df['PostalCode'] == 'M7A']

Unnamed: 0,PostalCode,Borough,Neighborhood
8,M7A,Queen's Park,Queen's Park


In [11]:
df.shape

(103, 3)