## Segmenting and Clustering Neighborhoods in the city of Toronto, Canada

First, install the Beautiful Soup library, which we will use to scrape the web page.

In [1]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


Next, install lxml, the parser Beautiful Soup will use to read the HTML.

In [3]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.


Import all the relevant libraries

In [1]:
from bs4 import BeautifulSoup
import requests

Get the source code from the website and pass it to BeautifulSoup

In [4]:
# Download the page and stop early if the request failed
source = requests.get('http://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
source.raise_for_status()
soup = BeautifulSoup(source.text, 'lxml')

Find the table in the HTML and append each column's values to its own list

In [78]:
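# Locate the postal-code table by its CSS classes
table = soup.find('table', class_="wikitable sortable")
postcode = []
borough = []
neigh = []

# The header row is collected too; it is dropped when the dataframe is built
for row in table.find_all('tr'):
    # row.text is plain text with one cell per line; index 0 is the empty
    # string before the leading newline
    values = row.text.split('\n')
    postcode.append(values[1])
    borough.append(values[2])
    neigh.append(values[3])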
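
The loop above depends on where newlines happen to fall inside `row.text`. As a minimal sketch of a cell-based alternative, the example below reads each `td` element directly. The HTML string is a made-up stand-in for the Wikipedia table, and `html.parser` is used only so the snippet runs without lxml; neither comes from the notebook itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the Wikipedia table.
sample_html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table", class_="wikitable sortable")

rows = []
for tr in sample_table.find_all("tr"):
    # Strip surrounding whitespace from every data cell in the row.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        rows.append(cells)
```

With this approach the header row never enters the lists, so no later drop would be needed.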

Convert the lists to a pandas dataframe

In [88]:
import pandas as pd
df = pd.DataFrame({'Postcode': postcode,
                   'Borough': borough,
                   'Neighborhood': neigh
                  })
# The first row holds the table's header text, so drop it
df.drop(df.index[:1], inplace=True)
df.head()


|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 1 | M1A | Not assigned | Not assigned |
| 2 | M2A | Not assigned | Not assigned |
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
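
Because the first element of each list is the table's header text, the cell above drops the first row after building the frame, which is why the index starts at 1. An equivalent sketch, using short made-up lists rather than the scraped data, slices the header off and resets the index so that numbering starts at 0:

```python
import pandas as pd

# Hypothetical lists whose first entry is the header text, as in the scrape above.
codes = ["Postcode", "M1A", "M3A"]
boroughs = ["Borough", "Not assigned", "North York"]
hoods = ["Neighbourhood", "Not assigned", "Parkwoods"]

frame = pd.DataFrame({"Postcode": codes, "Borough": boroughs, "Neighborhood": hoods})
# Skip the header row and renumber the remaining rows from 0.
frame = frame.iloc[1:].reset_index(drop=True)
```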


## Clean the data

Remove all rows where Borough is 'Not assigned'

In [89]:
df = df[df['Borough'] != 'Not assigned']
df.head()

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Harbourfront |
| 6 | M5A | Downtown Toronto | Regent Park |
| 7 | M6A | North York | Lawrence Heights |


Combine all rows that share the same Postcode and Borough into one row, joining their Neighborhood names with commas

In [105]:
df = df.groupby(['Postcode','Borough'], as_index=False).agg({'Neighborhood' : ', '.join}) 
df.head()

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood, Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
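
To make the effect of the `groupby` and `agg` call easier to see, here is a small sketch on a three-row toy frame. The variable names `toy` and `merged` are illustrative only:

```python
import pandas as pd

# Toy rows: two neighborhoods share postcode M5A, one belongs to M6A.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One output row per (Postcode, Borough) pair; ', '.join receives each
# group's Neighborhood values and concatenates them into a single string.
merged = (
    toy.groupby(["Postcode", "Borough"], as_index=False)
       .agg({"Neighborhood": ", ".join})
)
```

Passing `as_index=False` keeps Postcode and Borough as ordinary columns instead of moving them into the index.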


Where the Neighborhood is 'Not assigned', set it to the Borough name

In [112]:
df.loc[df['Neighborhood']=='Not assigned']

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Not assigned |


In [115]:
df.loc[df.Neighborhood == 'Not assigned', 'Neighborhood'] = df.Borough 

In [117]:
df.loc[df['Neighborhood']=="Queen's Park"]

|   | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 85 | M7A | Queen's Park | Queen's Park |
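
The assignment above works because pandas aligns the right-hand Series on the index, so each masked row receives its own Borough value. A minimal sketch on a toy frame, with `Series.mask` shown as a non-mutating equivalent (the names `toy`, `mask`, and `filled` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "Postcode": ["M7A", "M1B"],
    "Borough": ["Queen's Park", "Scarborough"],
    "Neighborhood": ["Not assigned", "Rouge, Malvern"],
})

mask = toy["Neighborhood"] == "Not assigned"

# Non-mutating form: where the mask is True, take the value from Borough.
filled = toy["Neighborhood"].mask(mask, toy["Borough"])

# In-place form, as in the cell above: the full Borough column is aligned
# on the index, and only the masked rows are overwritten.
toy.loc[mask, "Neighborhood"] = toy["Borough"]
```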


Print the number of rows in the dataframe

In [118]:
df.shape[0]

103
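
Beyond the row count, the cleaning rules can be checked directly. The helper below is a hypothetical sketch, not part of the original notebook. It assumes that every postcode maps to a single borough, so postcodes should be unique after grouping:

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> bool:
    """Return True if the frame satisfies the cleaning rules described above."""
    no_unassigned_borough = (frame["Borough"] != "Not assigned").all()
    no_unassigned_neigh = (frame["Neighborhood"] != "Not assigned").all()
    # Assumption: one borough per postcode, so each postcode appears once.
    unique_postcodes = frame["Postcode"].is_unique
    return bool(no_unassigned_borough and no_unassigned_neigh and unique_postcodes)

# Toy frames illustrating a passing and a failing case.
good = pd.DataFrame({
    "Postcode": ["M1B", "M7A"],
    "Borough": ["Scarborough", "Queen's Park"],
    "Neighborhood": ["Rouge, Malvern", "Queen's Park"],
})
bad = good.assign(Neighborhood=["Rouge, Malvern", "Not assigned"])

good_ok = check_clean(good)
bad_ok = check_clean(bad)
```

Calling `check_clean(df)` on the notebook's dataframe would apply the same checks to the scraped data.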