<h1 align="center">Segmenting and Clustering Neighborhoods in Toronto</h1>

## Scraping the Toronto postal code data from Wikipedia

### 1. Get the Wiki data

In [86]:
import requests
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plot
from bs4 import BeautifulSoup

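# Download the Wikipedia page listing the postal codes that start with M (Toronto).
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the request failed
html_data = response.text
# Parse the HTML into a tree; the 'html5lib' parser is a separate package that must be installed.
html_tree = BeautifulSoup(html_data, 'html5lib')
# html_tree.prettify()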
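
Before running against the live page, the row-extraction idea can be tried on a small inline table. The following is only a sketch: `sample_html`, `sample_tree`, and `sample_rows` are hypothetical names, the table content is made up to mimic the page layout, and Python's built-in `'html.parser'` is used here instead of `'html5lib'` so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the layout of a postal-code table.
sample_html = """
<table>
  <tbody>
    <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
    <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
    <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  </tbody>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'html5lib'.
sample_tree = BeautifulSoup(sample_html, 'html.parser')

sample_rows = []
for tr in sample_tree.find('tbody').find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        sample_rows.append(cells)

print(sample_rows)
```

The header row drops out naturally because it contains no `<td>` cells, which is the same reason the notebook checks that the list of cells is not empty.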

### 2. Parse the table data into a DataFrame

In [87]:
trs = html_tree.find('tbody').find_all('tr')
# The dataframe will consist of three columns: PostalCode, Borough, and Neighborhood
column = ['PostalCode', 'Borough', 'Neighborhood']
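# Extract the raw data from the HTML table, one dict per table row.
# DataFrame.append was removed in pandas 2.0, so collect the rows first and build the dataframe once.
rows = []
for tr in trs:
    tds = tr.find_all('td')
    if len(tds) >= 3:  # skip header rows (no <td> cells) and rows that are too short
        rows.append({'PostalCode': tds[0].text.strip(),
                     'Borough': tds[1].text.strip(),
                     'Neighborhood': tds[2].text.strip()})
toronto_df = pd.DataFrame(rows, columns=column)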
# Only process the cells that have an assigned borough: drop rows whose Borough is 'Not assigned'
# and renumber the remaining rows from 0.
toronto_df = toronto_df[toronto_df['Borough'] != 'Not assigned'].reset_index(drop=True)
# If a cell has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the borough name.
na_mask = toronto_df['Neighborhood'] == 'Not assigned'
if na_mask.any():
    print('Found Neighborhood is not assigned; use Borough value')
    toronto_df.loc[na_mask, 'Neighborhood'] = toronto_df.loc[na_mask, 'Borough']
# Use the .shape attribute to print the number of rows and columns of the dataframe.
print(toronto_df.shape)
print('The dataframe has {} rows with {} columns'.format(toronto_df.shape[0], toronto_df.shape[1]))
    

(103, 3)
The dataframe has 103 rows with 3 columns
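
The scraped data did not trigger the borough fallback (no message was printed above), so the two cleaning rules are easier to see on a toy table. The sketch below uses hypothetical rows and hypothetical names (`sample_df`, `cleaned_df`, `missing`); the code `M9Z` is invented purely to exercise the second rule.

```python
import pandas as pd

# Hypothetical rows chosen to exercise both cleaning rules.
sample_df = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M9Z'],
    'Borough': ['Not assigned', 'North York', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Not assigned'],
})

# Rule 1: drop rows without an assigned borough and renumber the rest.
cleaned_df = sample_df[sample_df['Borough'] != 'Not assigned'].reset_index(drop=True)

# Rule 2: where the neighborhood is missing, fall back to the borough name.
missing = cleaned_df['Neighborhood'] == 'Not assigned'
cleaned_df.loc[missing, 'Neighborhood'] = cleaned_df.loc[missing, 'Borough']

print(cleaned_df)
```

Applying rule 1 first matters: a row with neither borough nor neighborhood is removed instead of being filled with the placeholder text.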


In [88]:
# More than one neighborhood can exist in one postal code area.
# The Wikipedia table may list a postal code such as M5A in two rows
# (Harbourfront and Regent Park). Rows that share a postal code are combined
# into one row, with the neighborhoods separated by a comma.
postal_df = (toronto_df.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
             .agg(', '.join)
             .reset_index())
# Report whether any postal code ends up with more than one neighborhood.
if postal_df['Neighborhood'].str.contains(',').any():
    print('Some postal code has multiple neighborhoods')
# Print the combined neighborhood string for each postal code.
for neighborhoods in postal_df['Neighborhood']:
    print(neighborhoods)
postal_df

Some postal code has multiple neighborhoods
Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal,

| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North |
| 99 | M4Y | Downtown Toronto | Church and Wellesley |
| 100 | M7Y | East Toronto | Business reply mail Processing Centre, South C... |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... |
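
In the output above, each postal code already occupies a single row, so the combining step has nothing to merge. To show what it does when a code is split across rows, such as the M5A case mentioned in the cell comments, here is a minimal sketch on hypothetical data. The names `split_df` and `combined_df` are illustrative, and `groupby` with a string join is one possible way to do the merge, not necessarily the only one.

```python
import pandas as pd

# Hypothetical rows: M5A appears twice, once per neighborhood.
split_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Harbourfront', 'Regent Park'],
})

# Group rows that share a postal code and borough, then join their
# neighborhoods with ', '. sort=False keeps the order of first appearance.
combined_df = (
    split_df.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

print(combined_df)
```

Grouping on both `PostalCode` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.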
