# Clustering Neighbourhoods in Toronto 

This notebook uses k-means clustering to cluster neighbourhoods in Toronto, as an assignment for the Coursera Data Science Capstone project.

First, import the required libraries. pandas does the data manipulation; `geocoder` and `requests` are imported alongside it.

In [1]:
import pandas as pd
import geocoder
import requests

Now, use `pd.read_html()` to read the Wikipedia page. It returns a list of pandas dataframes, one for each table found on the page.

The table containing the postal codes is the first one (index 0) in the **tables** list, so I can save it as a dataframe named **df** by accessing that index.


In [2]:
wiki_link='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
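#match keeps only tables containing the text 'Postcode'; header=0 uses the first row as column names
tables=pd.read_html(wiki_link,match='Postcode',header=0)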
df=tables[0]

Now we can clean up the dataframe:

In [3]:
#rename the column Postcode to Postalcode
df.rename(columns={'Postcode':'Postalcode'},inplace=True)

#remove rows without an assigned borough
not_assigned_borough=df[df['Borough']=='Not assigned']
df.drop(not_assigned_borough.index,inplace=True)
df.reset_index(drop=True,inplace=True)

#If a row has a borough but a 'Not assigned' neighbourhood,
#use that row's borough as its neighbourhood.
not_assigned_neigh=df[df['Neighbourhood']=='Not assigned']
#.loc aligns on the index, so each row receives its own borough
df.loc[not_assigned_neigh.index,'Neighbourhood']=not_assigned_neigh['Borough']
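
The two cleanup rules can be checked on a small hand-made frame. The sketch below is only an illustration: the rows are invented (they are not taken from the Wikipedia table) and `clean_postal_table` is a hypothetical helper name. It expresses the same rules with boolean masks and `Series.where`.

```python
import pandas as pd

def clean_postal_table(raw):
    """Hypothetical helper: apply the two cleanup rules to a copy of `raw`."""
    out = raw.rename(columns={'Postcode': 'Postalcode'})
    # Rule 1: drop rows whose borough is 'Not assigned'.
    out = out[out['Borough'] != 'Not assigned'].reset_index(drop=True)
    # Rule 2: keep the neighbourhood where it is assigned,
    # otherwise fall back to the borough of the same row.
    out['Neighbourhood'] = out['Neighbourhood'].where(
        out['Neighbourhood'] != 'Not assigned', out['Borough'])
    return out

# Invented toy rows, for illustration only.
toy = pd.DataFrame({
    'Postcode': ['X1A', 'X2A', 'X3A', 'X4A'],
    'Borough': ['Not assigned', 'Alpha', 'Beta', 'Alpha'],
    'Neighbourhood': ['Not assigned', 'Not assigned', 'Not assigned', 'Elm Park'],
})
cleaned = clean_postal_table(toy)
print(cleaned)
```

The toy frame deliberately contains two unassigned neighbourhoods in different boroughs, which is the case where a row-by-row fallback matters.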

Then I can group the dataframe by postal code and join the names of the neighbourhoods that share the same postal code.

In [4]:
grouped_df=df.groupby(['Postalcode','Borough'],sort=False)['Neighbourhood'].apply(', '.join).reset_index()
grouped_df

| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
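
To see what the `groupby` / `join` step does in isolation, here is a minimal sketch on invented rows (the codes and names below are made up for illustration). Two rows share one postal code, so their neighbourhood names end up in a single comma-separated cell.

```python
import pandas as pd

# Invented rows: two neighbourhoods share the postal code 'X1A'.
toy = pd.DataFrame({
    'Postalcode': ['X1A', 'X2A', 'X1A'],
    'Borough': ['Alpha', 'Beta', 'Alpha'],
    'Neighbourhood': ['Elm Park', 'Oak Hill', 'Pine Grove'],
})

# sort=False keeps the groups in order of first appearance;
# ', '.join concatenates the names inside each group.
joined = (toy.groupby(['Postalcode', 'Borough'], sort=False)['Neighbourhood']
             .apply(', '.join)
             .reset_index())
print(joined)
```

Grouping on both `Postalcode` and `Borough` keeps the borough as a regular column after `reset_index()`, which is why it survives into the grouped frame.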


Now it's possible to filter the postal codes to get the same dataframe shown in the assignment.

In [5]:
pc=['M5G','M2H','M4B','M1J','M4G','M4M','M1R','M9V','M9L','M5V','M1B','M5A']
final_df=grouped_df[grouped_df['Postalcode'].isin(pc)].reset_index(drop=True)
final_df

| | Postalcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 1 | M1B | Scarborough | Rouge, Malvern |
| 2 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 3 | M4G | East York | Leaside |
| 4 | M5G | Downtown Toronto | Central Bay Street |
| 5 | M2H | North York | Hillcrest Village |
| 6 | M1J | Scarborough | Scarborough Village |
| 7 | M9L | North York | Humber Summit |
| 8 | M4M | East Toronto | Studio District |
| 9 | M1R | Scarborough | Maryvale, Wexford |
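
Note that `isin` keeps the rows in the dataframe's own order, not in the order of the list passed to it. If the order of the list matters (an assumption on my part; the filter above does not need it), one possible sketch is to sort by each code's position in the list. The frame and the `wanted` list below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Postalcode': ['X1A', 'X2A', 'X3A', 'X4A'],
    'Borough': ['Alpha', 'Beta', 'Gamma', 'Delta'],
})
wanted = ['X4A', 'X1A', 'X3A']

# isin keeps the original row order of the frame.
subset = toy[toy['Postalcode'].isin(wanted)].reset_index(drop=True)

# Sort by each code's position in `wanted` to follow the list's order instead.
position = {code: i for i, code in enumerate(wanted)}
ordered = (subset.sort_values('Postalcode', key=lambda s: s.map(position))
                 .reset_index(drop=True))
print(ordered)
```

The `key` argument of `sort_values` needs pandas 1.1 or later.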


Finally, we get the shape of the dataframes.

In [6]:
print('The shape of the grouped dataframe is:{}.\nThe shape of the final dataframe is:{}.'.format(grouped_df.shape,final_df.shape))

The shape of the grouped dataframe is:(103, 3).
The shape of the final dataframe is:(12, 3).


## Obtaining the Geodata

The geodata was obtained from the link provided on Coursera, which serves the coordinates as a CSV file.

In [7]:
geodata=pd.read_csv('http://cocl.us/Geospatial_data')

In [10]:
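#row positions of the geodata table, used to loop over it
r=range(len(geodata))
#empty placeholder columns, filled in by the next cell
grouped_df['Latitude']=''
grouped_df['Longitude']=''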

Add the latitude and longitude of each postal code using the values from the CSV file.

In [11]:
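for i in r:
    #rows of grouped_df whose postal code matches row i of the geodata
    match=grouped_df['Postalcode']==geodata['Postal Code'][i]
    #assign through .loc; chained indexing may write to a temporary copy
    grouped_df.loc[match,'Latitude']=geodata['Latitude'][i]
    grouped_df.loc[match,'Longitude']=geodata['Longitude'][i]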
grouped_df

| | Postalcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.7533 | -79.3297 |
| 1 | M4A | North York | Victoria Village | 43.7259 | -79.3156 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.6543 | -79.3606 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.7185 | -79.4648 |
| 4 | M7A | Queen's Park | Queen's Park | 43.6623 | -79.3895 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.6679 | -79.5322 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.8067 | -79.1944 |
| 7 | M3B | North York | Don Mills North | 43.7459 | -79.3522 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.7064 | -79.3099 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.6572 | -79.3789 |
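
The row-by-row loop works, but the same lookup can also be written as a single join. The sketch below is an alternative, not what this notebook runs: it uses invented codes and coordinates, and only the column name `Postal Code` mirrors the CSV. In the notebook itself, this approach would make the two empty placeholder columns unnecessary.

```python
import pandas as pd

codes = pd.DataFrame({
    'Postalcode': ['X1A', 'X2A', 'X3A'],
    'Borough': ['Alpha', 'Beta', 'Gamma'],
})
# Invented coordinates; 'Postal Code' mirrors the column name used in the CSV.
coords = pd.DataFrame({
    'Postal Code': ['X3A', 'X1A'],
    'Latitude': [43.1, 43.2],
    'Longitude': [-79.1, -79.2],
})

# A left join keeps every row of `codes` in its original order;
# codes without coordinates get NaN instead of being dropped.
merged = (codes.merge(coords, how='left',
                      left_on='Postalcode', right_on='Postal Code')
               .drop(columns='Postal Code'))
print(merged)
```

A left join also makes missing coordinates visible as `NaN`, which `merged['Latitude'].isna()` can pick out.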


Filter so that it contains only the postal codes shown in the assignment.

In [12]:
final_df=grouped_df[grouped_df['Postalcode'].isin(pc)].reset_index(drop=True)
final_df

| | Postalcode | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.6543 | -79.3606 |
| 1 | M1B | Scarborough | Rouge, Malvern | 43.8067 | -79.1944 |
| 2 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.7064 | -79.3099 |
| 3 | M4G | East York | Leaside | 43.7091 | -79.3635 |
| 4 | M5G | Downtown Toronto | Central Bay Street | 43.658 | -79.3874 |
| 5 | M2H | North York | Hillcrest Village | 43.8038 | -79.3635 |
| 6 | M1J | Scarborough | Scarborough Village | 43.7447 | -79.2395 |
| 7 | M9L | North York | Humber Summit | 43.7563 | -79.566 |
| 8 | M4M | East Toronto | Studio District | 43.6595 | -79.3409 |
| 9 | M1R | Scarborough | Maryvale, Wexford | 43.7501 | -79.2958 |
