# Clustering Neighborhoods in Toronto
All the information needed to explore and cluster the neighborhoods in Toronto is available on a Wikipedia page that lists the city's postal codes. The page is scraped, and the data is wrangled, cleaned, and read into a pandas dataframe so that it is in a structured format.
Once the data is in a structured format, it can be analysed to explore and cluster the neighborhoods in the city of Toronto.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON data into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [18]:
toronto_data = pd.read_html("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
df_toronto = toronto_data[0]
df_toronto.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


In [19]:
# check how many boroughs are not assigned
print(df_toronto['Borough'].value_counts())

Not assigned        77
North York          24
Downtown Toronto    19
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
East Toronto         5
York                 5
East York            5
Mississauga          1
Name: Borough, dtype: int64


In [20]:
# remove rows where Borough is not assigned
df_toronto = df_toronto[df_toronto['Borough'] != 'Not assigned']

In [21]:
df_toronto.head()

Unnamed: 0,Postal code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront
5,M6A,North York,Lawrence Manor / Lawrence Heights
6,M7A,Downtown Toronto,Queen's Park / Ontario Provincial Government


More than one neighborhood can exist in one postal code area. For example, M5A covers two neighborhoods: Regent Park and Harbourfront. In the scraped table, such neighborhoods share a single cell and are separated by " / ". Each postal code is kept as one row, and the separator is replaced with a comma.

In [34]:
# group by postal code so that each postal code appears in exactly one row
# (.first() keeps the first non-null value per column within each group)
df_toronto_filtered = df_toronto.groupby('Postal code', as_index=False).first()

# replace the '/' separator with ',' and drop the space left in front of it
# (vectorized, which avoids chained assignment such as df[col][idx] = ...)
df_toronto_filtered['Neighborhood'] = (
    df_toronto_filtered['Neighborhood']
    .str.replace('/', ',', regex=False)
    .str.replace(' ,', ',', regex=False)
)
df_toronto_filtered.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
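
The `groupby(...).first()` call above keeps only the first row per postal code. That is enough here because the scraped table already lists neighborhoods that share a postal code in a single cell. If the source table instead had one row per neighborhood, the rows could be combined roughly as sketched below. The toy data and the names `df_example` and `df_combined` are illustrative assumptions, not part of the scraped table.

```python
import pandas as pd

# Hypothetical toy data: one row per neighborhood, so M5A appears twice.
df_example = pd.DataFrame({
    'Postal code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Parkwoods'],
})

# Join all neighborhoods that share a postal code and borough into one cell.
df_combined = (
    df_example
    .groupby(['Postal code', 'Borough'])['Neighborhood']
    .agg(', '.join)
    .reset_index()
)
print(df_combined)
```

Grouping on both `Postal code` and `Borough` keeps the borough column in the result without needing a separate aggregation for it.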


If a row has a borough but its neighborhood is "Not assigned", the neighborhood is set to the borough name.

In [42]:
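# select rows whose neighborhood is 'Not assigned' and copy the borough name into them
# (.loc with a boolean mask avoids chained assignment such as df[col][idx] = ...)
not_assigned = df_toronto_filtered['Neighborhood'] == 'Not assigned'
df_toronto_filtered.loc[not_assigned, 'Neighborhood'] = df_toronto_filtered.loc[not_assigned, 'Borough']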
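
The fallback rule can be tried on a small example. In the first table output of this notebook, unassigned neighborhoods appear as empty cells rather than the text "Not assigned", so the sketch below treats both cases as unassigned. The toy rows and the name `df_example` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data: one literal 'Not assigned' and one missing value.
df_example = pd.DataFrame({
    'Postal code': ['M7A', 'M3A', 'M9A'],
    'Borough': ["Queen's Park", 'North York', 'Etobicoke'],
    'Neighborhood': ['Not assigned', 'Parkwoods', None],
})

# Rows whose neighborhood is the text 'Not assigned' or is missing (NaN/None).
mask = (
    df_example['Neighborhood'].eq('Not assigned')
    | df_example['Neighborhood'].isna()
)

# Copy the borough name into the neighborhood column for those rows only.
df_example.loc[mask, 'Neighborhood'] = df_example.loc[mask, 'Borough']
print(df_example)
```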

In [43]:
# print number of rows in final dataframe
df_toronto_filtered.shape

(103, 3)
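
Beyond checking the shape, a few quick consistency checks can confirm that the cleaning steps did what was intended. The helper `check_cleaned` below is a hypothetical sketch, not part of the original notebook; it assumes the column names used above and is demonstrated on toy data.

```python
import pandas as pd

def check_cleaned(df):
    """Return a list of problems found in a cleaned postal-code table."""
    problems = []
    # Each postal code should appear in exactly one row.
    if df['Postal code'].duplicated().any():
        problems.append('duplicate postal codes')
    # Rows without a borough should have been removed.
    if df['Borough'].eq('Not assigned').any():
        problems.append('unassigned boroughs')
    # Every row should have a usable neighborhood name.
    if df['Neighborhood'].isna().any() or df['Neighborhood'].eq('Not assigned').any():
        problems.append('unassigned neighborhoods')
    return problems

# Hypothetical toy tables: one clean, one with two deliberate problems.
df_ok = pd.DataFrame({
    'Postal code': ['M1B', 'M1C'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern, Rouge', 'Rouge Hill, Port Union, Highland Creek'],
})
df_bad = pd.DataFrame({
    'Postal code': ['M1B', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
    'Neighborhood': ['Malvern', 'Not assigned'],
})

print(check_cleaned(df_ok))
print(check_cleaned(df_bad))
```

Calling `check_cleaned(df_toronto_filtered)` in the notebook should return an empty list if the steps above ran as intended.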