## Segmenting and Clustering Neighborhoods in Toronto

#### Import libraries

In [37]:
import numpy as np        # library to handle data in a vectorized manner
import pandas as pd       # library for data analysis
from bs4 import BeautifulSoup   # library for parsing HTML documents
import requests           # library for making HTTP requests

#### Download and preprocess dataset

We will scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to extract its table of postal codes.

The BeautifulSoup package parses the page's HTML so that the table cells can be read out and loaded into a pandas dataframe.
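
As a minimal sketch of this idea, the snippet below parses a small inline HTML fragment instead of the live page. The fragment and the name `sample_df` are hypothetical and only stand in for the Wikipedia table, assuming it has three `td` cells per data row. Reading the table row by row (rather than stepping through a flat list of cells) is one way to make it easier to skip header or malformed rows.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML fragment standing in for the Wikipedia table (not real scraped data)
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.find('table').find_all('tr'):
    # Collect the stripped text of every data cell in this row
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    # The header row has only <th> cells, so it yields an empty list and is skipped
    if len(cells) == 3:
        rows.append(cells)

sample_df = pd.DataFrame(rows, columns=['Postal Code', 'Borough', 'Neighborhood'])
print(sample_df)
```
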

In [38]:
# Download the page and parse its HTML
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(url).text
soup = BeautifulSoup(source, 'html.parser')

In [39]:
# Extract the cells of the postal code table; every three consecutive cells form one row
table = soup.find('table')
td = table.find_all('td')
postcode = []
borough = []
neighborhood = []

for i in range(0, len(td), 3):
    postcode.append(td[i].text.strip())
    borough.append(td[i+1].text.strip())
    neighborhood.append(td[i+2].text.strip())

# The dataframe consists of three columns: Postal Code, Borough, and Neighborhood
df = pd.DataFrame({'Postal Code': postcode, 'Borough': borough, 'Neighborhood': neighborhood})

# Drop rows whose Borough is 'Not assigned'
# (assign the result back instead of using inplace=True on a column, which may not modify df)
df['Borough'] = df['Borough'].replace('Not assigned', np.nan)
df = df.dropna(subset=['Borough'])

# If a row has a Borough but a 'Not assigned' Neighborhood, use the Borough as the Neighborhood
unassigned = df['Neighborhood'] == 'Not assigned'
df.loc[unassigned, 'Neighborhood'] = df.loc[unassigned, 'Borough']

# If a postal code area contains more than one neighborhood, combine its rows into one row
# with the neighborhoods separated by commas
df = df.groupby(['Postal Code','Borough'])['Neighborhood'].apply(', '.join).reset_index()
df.head(10)

| | Postal Code | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park, Ionview, Kennedy Park |
| 7 | M1L | Scarborough | Clairlea, Golden Mile, Oakridge |
| 8 | M1M | Scarborough | Cliffcrest, Cliffside, Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff, Cliffside West |
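
To make the three cleaning rules easier to follow, here is a small sketch that applies them to a hand-made dataframe. The rows in `toy` are invented for illustration and are not taken from the scraped table; the steps mirror the cell above (drop unassigned boroughs, fall back to the borough name, then merge neighborhoods that share a postal code).

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    'Postal Code': ['M1A', 'M7A', 'M5A', 'M5A'],
    'Borough': ['Not assigned', "Queen's Park", 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Not assigned', 'Not assigned', 'Harbourfront', 'Regent Park'],
})

# 1. Drop rows whose borough is 'Not assigned'
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. Where only the neighborhood is 'Not assigned', reuse the borough name
mask = toy['Neighborhood'] == 'Not assigned'
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']

# 3. Merge rows that share a postal code and borough into one comma-separated row
toy = (toy.groupby(['Postal Code', 'Borough'])['Neighborhood']
          .apply(', '.join)
          .reset_index())

print(toy)
```

After these steps `toy` has one row per postal code: the two `M5A` rows are merged, and the `M7A` row takes its borough name as its neighborhood.
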


In [40]:
# The number of rows and columns of the dataframe
df.shape

(103, 3)
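
Beyond checking the shape, a few integrity checks can confirm that the cleaning behaved as intended. The helper `check_postal_table` below is a hypothetical name introduced for this sketch, and the `sample` frame is invented; the same function could be called on `df`, assuming it has the three columns used above.

```python
import pandas as pd

def check_postal_table(frame):
    """Return simple integrity checks for a cleaned postal code table."""
    return {
        # Each postal code should appear exactly once after grouping
        'unique_postal_codes': bool(frame['Postal Code'].is_unique),
        # No placeholder values should survive the cleaning
        'no_unassigned_borough': not (frame['Borough'] == 'Not assigned').any(),
        'no_unassigned_neighborhood': not (frame['Neighborhood'] == 'Not assigned').any(),
    }

# Hypothetical cleaned rows, invented for illustration only
sample = pd.DataFrame({
    'Postal Code': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront, Regent Park', "Queen's Park"],
})

checks = check_postal_table(sample)
print(checks)
```
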