# Segmenting and Clustering Neighborhoods in Toronto

## 1. Introduction

In this project, we explore and cluster the neighborhoods of Toronto. First, we use pandas to scrape the table of Toronto postal codes and store it in a dataframe. Next, we add the latitude and longitude of each postal code to this dataframe using the geocoder.google function. Finally, we explore and cluster the neighborhoods using the k-means clustering algorithm.

Before getting started, let's install some useful packages. The install commands below are commented out; uncomment and run them only if the packages are not yet installed.

Install package folium

In [1]:
# !conda install -c conda-forge folium=0.5.0 --yes

Install package geopy

In [2]:
# !conda install -c conda-forge geopy --yes

Import some useful packages

In [3]:
import numpy as np
import pandas as pd

import json
import requests
import folium

from geopy.geocoders import Nominatim
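try:
    from pandas import json_normalize  # top-level location since pandas 1.0
except ImportError:
    from pandas.io.json import json_normalize  # fallback for older pandas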

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

print('Libraries imported.')

Libraries imported.


## 2. Download Dataset

Scrape the table on the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M using pandas.

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df = pd.read_html(url, header=0)
df = df[0]

df.head()

,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
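
`pd.read_html` returns a list with one dataframe per table found on the page, and the cell above takes the first one. That works only while the postal-code table stays first on the page. The sketch below shows one way to select the table by its column names instead. The helper `pick_table` is hypothetical (it is not part of pandas), and the toy tables are made-up stand-ins for the list that `pd.read_html` would return.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    required = set(required_columns)
    for table in tables:
        if required.issubset(table.columns):
            return table
    raise ValueError(f"No table with columns {sorted(required)}")


# Toy stand-ins for the list that pd.read_html returns (made-up rows).
tables = [
    pd.DataFrame({"Note": ["navigation box"]}),
    pd.DataFrame({
        "Postcode": ["M3A", "M4A"],
        "Borough": ["North York", "North York"],
        "Neighbourhood": ["Parkwoods", "Victoria Village"],
    }),
]

postcodes = pick_table(tables, ["Postcode", "Borough", "Neighbourhood"])
print(postcodes.shape)
```

With the real page, the same call could be applied to the list returned by `pd.read_html(url, header=0)`, assuming the column names match those shown in the output above.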


Remove the rows whose borough is Not assigned, then check which boroughs remain in the dataframe.

In [5]:
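# Keep only rows with an assigned borough; .copy() avoids chained-assignment warnings in later cells
df = df[df['Borough'] != 'Not assigned'].copy()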
df['Borough'].unique()

array(['North York', 'Downtown Toronto', "Queen's Park", 'Etobicoke',
       'Scarborough', 'East York', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

If a row has a borough but its neighborhood is Not assigned, the neighborhood is set to the name of the borough.

In [6]:
# Rows whose neighbourhood is missing take the borough name instead
mask = df['Neighbourhood'] == 'Not assigned'
df.loc[mask, 'Neighbourhood'] = df.loc[mask, 'Borough']

df.head()

,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M5A,Downtown Toronto,Regent Park
6,M6A,North York,Lawrence Heights


More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: Harbourfront and Regent Park. These two rows will be combined into one row with the neighborhoods separated with a comma.

In [7]:
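# Combine rows that share a postcode; set() drops repeated names but does not guarantee their order.
# Passing a one-element list of functions adds a '<lambda>' level to the column index.
df = df.groupby(['Postcode']).agg([lambda x: ', '.join(set(x))])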
df.head()

,Borough,Neighbourhood
,<lambda>,<lambda>
Postcode,,
M1B,Scarborough,"Malvern, Rouge"
M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
M1E,Scarborough,"West Hill, Morningside, Guildwood"
M1G,Scarborough,Woburn
M1H,Scarborough,Cedarbrae
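
A Python `set` removes duplicates but does not keep the original order, so the order of names inside a joined string can differ from the order of the source rows. If a stable order matters, one option is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each value in order. This is a minimal sketch; `join_unique` is a hypothetical helper name, and the input reuses the M5A example from the text with one repeated entry added for illustration.

```python
def join_unique(values, sep=", "):
    """Join values with sep, dropping repeats but keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return sep.join(dict.fromkeys(values))


print(join_unique(["Harbourfront", "Regent Park", "Harbourfront"]))
# prints: Harbourfront, Regent Park
```

In the aggregation, a helper like this could take the place of `', '.join(set(x))` inside the lambda.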


In [8]:
# Turn the Postcode index back into a regular column
df = df.reset_index()
df.head()

,Postcode,Borough,Neighbourhood
,,<lambda>,<lambda>
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [9]:
# Drop the '<lambda>' level so the column names are flat again
df.columns = df.columns.droplevel(1)
df.head()

,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Highland Creek, Port Union, Rouge Hill"
2,M1E,Scarborough,"West Hill, Morningside, Guildwood"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
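
The aggregate, reset-index, and drop-level steps can also be written as a single call. The sketch below is one possible alternative, not the notebook's own method: it passes `as_index=False` to `groupby` and gives `agg` one function per column, so no extra column level is created. It assumes that each postcode belongs to exactly one borough, so taking the first borough per group is safe. The toy rows are made up in the shape of the cleaned table.

```python
import pandas as pd

# Made-up rows in the same shape as the cleaned table.
toy = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# Assumption: one borough per postcode, so "first" loses no information.
flat = toy.groupby("Postcode", as_index=False).agg(
    {"Borough": "first", "Neighbourhood": ", ".join}
)
print(flat)
```

The result has the flat columns Postcode, Borough, and Neighbourhood, with one row per postcode.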


In [10]:
df.shape

(103, 3)
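
The shape confirms 103 rows and 3 columns, but it does not show whether every postcode is well formed and appears only once. A quick check along the following lines may help. It is a sketch under stated assumptions: `find_problems` is a hypothetical helper, and the pattern assumes each code is the letter M followed by a digit and an uppercase letter, as in the rows shown above.

```python
import re

# Assumed pattern: "M", a digit, then an uppercase letter (e.g. M5A).
FSA_PATTERN = re.compile(r"M\d[A-Z]")


def find_problems(postcodes):
    """Return (malformed codes, duplicated codes) for an iterable of codes."""
    codes = list(postcodes)
    malformed = [c for c in codes if not FSA_PATTERN.fullmatch(c)]
    duplicated = sorted({c for c in codes if codes.count(c) > 1})
    return malformed, duplicated


print(find_problems(["M1B", "M1C", "M5A", "M5A", "Not assigned"]))
# prints: (['Not assigned'], ['M5A'])
```

Applied to the notebook's dataframe as `find_problems(df['Postcode'])`, both lists would be expected to be empty if the cleaning steps worked as intended.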