# Segmenting and clustering neighbourhoods in Toronto

## 1. Scrape a table from Wikipedia: _List of postal codes of Canada: M_


First, we import some libraries.

In [1]:
import requests  # fetch the raw HTML page
from bs4 import BeautifulSoup  # parse and search the HTML
import pandas as pd  # work with dataframes

Fetch the HTML page and parse it.

In [2]:
raw_page = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')
soup = BeautifulSoup(raw_page.text, 'lxml')

Extract the first table on the page, which holds the postal codes.

In [3]:
table = soup.find_all('table')[0]

Preprocess the table before merging it with the geospatial data.

In [4]:
df = pd.read_html(str(table), header=0)[0]  # parse the HTML table with pandas
df = df[df.Borough != 'Not assigned']  # drop rows whose borough is 'Not assigned'
df = df.reset_index(drop=True)  # reset the index to 0, 1, 2, ...
# Combine neighbourhoods that share a postcode and borough
# (set() removes duplicates but does not guarantee the order of the joined names)
df = df.groupby(['Postcode', 'Borough'], as_index=False).agg(lambda x: ', '.join(set(x.dropna())))
# Neighbourhoods marked 'Not assigned' take the name of their borough
df.loc[df.Neighbourhood == 'Not assigned', 'Neighbourhood'] = df.Borough
# Show the first rows of the table after preprocessing
df.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Morningside, Guildwood, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
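
The preprocessing steps can be hard to follow on the full scraped table, so here is a minimal sketch of the same logic on a small in-memory dataframe. The rows, postcodes, and borough names are hypothetical placeholders, not data from the Wikipedia page. The sketch also assumes `sorted()` around the `set`, which the notebook does not use, so that the joined names come out in a predictable order.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped table (placeholder values only).
toy = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X3A', 'X3A', 'X4A'],
    'Borough': ['Not assigned', 'Borough A', 'Borough B', 'Borough B', 'Borough C'],
    'Neighbourhood': ['Not assigned', 'Alpha', 'Gamma', 'Beta', 'Not assigned'],
})

# Step 1: drop rows whose borough is unknown.
toy = toy[toy.Borough != 'Not assigned']

# Step 2: join neighbourhoods that share a postcode and borough.
# sorted() is an addition here to make the joined order deterministic.
toy = (
    toy.groupby(['Postcode', 'Borough'], as_index=False)
       .agg({'Neighbourhood': lambda x: ', '.join(sorted(set(x.dropna())))})
)

# Step 3: a neighbourhood that is still 'Not assigned' takes its borough's name.
mask = toy.Neighbourhood == 'Not assigned'
toy.loc[mask, 'Neighbourhood'] = toy.loc[mask, 'Borough']

print(toy)
```

In this toy case the two `X3A` rows collapse into a single row, and the `X4A` row ends up with its borough name as its neighbourhood.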


Check the size of the table after preprocessing.

In [5]:
df.shape

(103, 3)

## 2. Add coordinates to neighbourhoods

Load the geospatial data from a CSV file.

In [7]:
df_coordinate=pd.read_csv('http://cocl.us/Geospatial_data')

The column _Postal Code_ in the CSV file holds the same postal codes as the column _Postcode_ in the dataframe, so we rename _Postal Code_ to _Postcode_ to give the two dataframes a common key for the merge.

In [8]:
df_coordinate.rename(columns={'Postal Code': 'Postcode'}, inplace=True)

Merge the two dataframes on the shared _Postcode_ column.

In [9]:
df_merge = pd.merge(df, df_coordinate, on='Postcode')  # inner join on the shared key
df_merge.head()

|   | Postcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | 43.784535 | -79.160497 |
| 2 | M1E | Scarborough | Morningside, Guildwood, West Hill | 43.763573 | -79.188711 |
| 3 | M1G | Scarborough | Woburn | 43.770992 | -79.216917 |
| 4 | M1H | Scarborough | Cedarbrae | 43.773136 | -79.239476 |
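
A default inner merge silently drops any postcode that has no coordinates, so it can be worth checking the join before relying on it. The following is a minimal sketch, assuming small hypothetical dataframes (`left`, `right`) with placeholder postcodes and coordinates. It uses the `validate` and `indicator` options of `pd.merge` to confirm that each postcode appears once on each side and to list postcodes without a match.

```python
import pandas as pd

# Hypothetical stand-ins for the neighbourhood table and the coordinate file.
left = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Borough A', 'Borough B', 'Borough C'],
})
right = pd.DataFrame({
    'Postal Code': ['X1A', 'X2A', 'X9Z'],
    'Latitude': [1.0, 2.0, 9.0],
    'Longitude': [-1.0, -2.0, -9.0],
})

# Give both frames the same key name, as in the notebook.
right = right.rename(columns={'Postal Code': 'Postcode'})

# how='left' keeps every neighbourhood row; validate raises MergeError if a
# postcode is duplicated on either side; indicator adds a '_merge' column.
checked = pd.merge(
    left, right,
    on='Postcode', how='left',
    validate='one_to_one', indicator=True,
)

# Postcodes present in the neighbourhood table but missing coordinates.
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

If `unmatched` is empty for the real data, the inner merge used in the notebook loses no rows.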
