# CLUSTERING NEIGHBOURHOODS IN TORONTO (1 of 3) - LUIS MdP

In this notebook, I cluster the neighbourhoods of Toronto, as stated in the assignment instructions.

First, I import the neighbourhood table from Wikipedia and then carry out the data wrangling phase.

In [57]:
import pandas as pd
import numpy as np

url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
# read_html returns a list with one DataFrame per HTML table found on the page
TOR_neigh_list = pd.read_html(url)

In [58]:
# The postal code table is the first table in the list
TOR_neigh = TOR_neigh_list[0]
TOR_neigh.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,Not assigned,Not assigned
9,M9A,Downtown Toronto,Queen's Park


In [59]:
# Retrieving the dimensions of the dataframe

TOR_neigh.shape

(287, 3)

In [60]:
# Checking that all data is in the correct format
TOR_neigh.dtypes

Postcode         object
Borough          object
Neighbourhood    object
dtype: object

The next step is to drop from the dataframe ("TOR_neigh") all rows whose borough is "Not assigned". To do this, it is first necessary to replace "Not assigned" with NaN (the default missing value marker in pandas).

In [61]:
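# Replace "Not assigned" with NaN in "Borough"
# (assigning the result back avoids an in-place call on a column selection, which may not update the dataframe in recent pandas versions)
TOR_neigh["Borough"] = TOR_neigh["Borough"].replace("Not assigned", np.nan)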
TOR_neigh.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,,Not assigned
1,M2A,,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
8,M8A,,Not assigned
9,M9A,Downtown Toronto,Queen's Park


In [62]:
# drop rows with "NaN" in "Borough"

TOR_neigh.dropna(subset=["Borough"], axis=0, inplace=True)
TOR_neigh.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Queen's Park,Not assigned
9,M9A,Downtown Toronto,Queen's Park
10,M1B,Scarborough,Rouge
11,M1B,Scarborough,Malvern
13,M3B,North York,Don Mills North
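
The two cells above replace the marker with NaN and then drop the NaN rows. As a minimal sketch of an alternative, the same rows could be removed in one step with a boolean filter. The `toy` dataframe below is hypothetical data that only mimics the layout of the Wikipedia table; it is not the real table.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as TOR_neigh
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only the rows whose Borough is assigned.
# .copy() gives an independent dataframe that can be modified safely later on.
assigned = toy[toy["Borough"] != "Not assigned"].copy()
print(assigned)
```

Note that rows with an assigned borough but an unassigned neighbourhood (such as M7A) are kept, which matches the behaviour of the cells above.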


In [None]:
# Retrieving the dimensions of the new dataframe

TOR_neigh.shape

In [38]:
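# Group the neighbourhoods that share the same Postcode (and Borough) into a list

TOR_neigh = TOR_neigh.groupby(["Postcode","Borough"])["Neighbourhood"].apply(list)
# sample(frac=1) shuffles the rows; reset_index turns Postcode and Borough back into columns
TOR_neigh = TOR_neigh.sample(frac=1).reset_index()
# Join each list into a single comma-separated string
TOR_neigh["Neighbourhood"] = TOR_neigh["Neighbourhood"].str.join(", ")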
TOR_neigh.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1J,Scarborough,Scarborough Village
1,M4C,East York,Woodbine Heights
2,M3B,North York,Don Mills North
3,M5T,Downtown Toronto,"Chinatown, Grange Park, Kensington Market"
4,M6M,York,"Del Ray, Keelesdale, Mount Dennis, Silverthorn"
5,M4L,East Toronto,"The Beaches West, India Bazaar"
6,M1S,Scarborough,Agincourt
7,M3A,North York,Parkwoods
8,M9W,Etobicoke,Northwest
9,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


In [39]:
# Retrieving the dimensions of the new dataframe

TOR_neigh.shape

(103, 3)
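
The grouping step can be hard to follow on the full table, so here is a minimal sketch of the same idea on hypothetical toy rows. It assumes that joining the strings directly with `agg(", ".join)` is acceptable in place of building a list first and joining it afterwards; the row shuffle is left out so the result order is deterministic.

```python
import pandas as pd

# Hypothetical toy rows: M6A appears twice, once per neighbourhood
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

# One row per (Postcode, Borough); neighbourhoods are joined with ", "
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```

By default, groupby sorts the result by the group keys, so M3A comes before M6A here.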

In [42]:
# Replace "Not assigned" in Neighbourhood with the name of the Borough on the same row

not_assigned = TOR_neigh["Neighbourhood"] == "Not assigned"
TOR_neigh["Neighbourhood"] = TOR_neigh["Neighbourhood"].mask(not_assigned, TOR_neigh["Borough"])
TOR_neigh.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1J,Scarborough,Scarborough Village
1,M4C,East York,Woodbine Heights
2,M3B,North York,Don Mills North
3,M5T,Downtown Toronto,"Chinatown, Grange Park, Kensington Market"
4,M6M,York,"Del Ray, Keelesdale, Mount Dennis, Silverthorn"
5,M4L,East Toronto,"The Beaches West, India Bazaar"
6,M1S,Scarborough,Agincourt
7,M3A,North York,Parkwoods
8,M9W,Etobicoke,Northwest
9,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


In [53]:
# Confirming that the replacement has been done

TOR_neigh[TOR_neigh.Postcode=="M7A"]

Unnamed: 0,Postcode,Borough,Neighbourhood
23,M7A,Queen's Park,Queen's Park
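
The check above looks at a single postcode. As a sketch of a broader check, the marker could be counted in every relevant column. The helper `count_unassigned` and the `toy` dataframe are hypothetical names introduced only for this illustration.

```python
import pandas as pd

def count_unassigned(df, columns=("Borough", "Neighbourhood"), marker="Not assigned"):
    """Return, for each given column, how many cells are equal to the marker."""
    return {col: int((df[col] == marker).sum()) for col in columns}

# Hypothetical toy rows in the state expected after the cleaning steps
toy = pd.DataFrame({
    "Postcode": ["M5A", "M7A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Harbourfront", "Queen's Park"],
})

remaining = count_unassigned(toy)
print(remaining)
```

A count of zero in both columns would indicate that no "Not assigned" value is left in the dataframe.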


In [54]:
# Retrieving the dimensions of the new dataframe

TOR_neigh.shape

(103, 3)