# Clustering Toronto Neighbourhoods


## Introduction
In this notebook, we will explore and cluster neighbourhoods in Toronto. 

To do this, we need a list of all the neighborhoods in Toronto, with details such as their names, postal codes, boroughs, and latitude and longitude values.

Once we have this data we can use it to find neighborhoods that are similar. We will use the K-Means algorithm to cluster the neighborhoods. Finally, we will visualize the clusters on a map.

This notebook has three sections: Data Collection and Preprocessing, Fetch Location Data, and Clustering and Analysis. In the first section, we get the data for the neighborhoods and process it. In the second section, we add the location information for each neighborhood. In the third section, we run K-Means on the dataset and visualize the result on a map.

## Table of Contents
I. <a href="#section1">Data Collection and Preprocessing</a>
  1. <a href="#step1">Scrape neighbourhood data</a>
  2. <a href="#step2">Extract required data</a>
  3. <a href="#step3">Explore and Preprocess the dataset</a>
    
II. <a href="#section2">Fetch Location Data</a>
  1. <a href="#step4">Get location data</a>
  2. <a href="#step5">Add location data to the dataset</a>

III. <a href="#section3">Clustering and Analysis</a>

## <a id="section1" style="text-decoration:none; color: #000;">I. Data Collection and Preprocessing</a>

### <a id="step1" style="text-decoration:none; color: #000;">1. Scrape neighbourhood data</a>
Let's start by getting the data for the neighbourhoods in Toronto.

The data we need can be found here: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 

The Wikipedia page displays the neighborhood data in a table. We will scrape this table and then extract the text content.

There are many Python libraries and packages for web scraping. We will use one of the most common ones, BeautifulSoup. The installation details and documentation can be found here: https://beautiful-soup-4.readthedocs.io/en/latest/

In [1]:
# Import the libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests

Get the html content from the web page and pass it to the BeautifulSoup constructor.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

page = requests.get(url).text
soup = BeautifulSoup(page, "lxml")

# Print the title of the web page
print(soup.title.string)

List of postal codes of Canada: M - Wikipedia
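The request above assumes the page always loads. As a minimal sketch, the fetch could be wrapped so that a slow server or an HTTP error fails loudly instead of handing an error page to the parser. The helper name `fetch_html` and its `get` parameter are hypothetical and exist only for this illustration; `timeout` and `raise_for_status()` are standard parts of the requests library.

```python
import requests


def fetch_html(url, timeout=10.0, get=requests.get):
    """Return the body of `url` as text, raising on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so that the
    helper can be exercised without network access (an assumption made
    for this sketch, not something the notebook requires).
    """
    response = get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()           # raise on 4xx / 5xx responses
    return response.text
```

With this helper, `page = fetch_html(url)` could stand in for `requests.get(url).text` in the cell above.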


The BeautifulSoup constructor also takes a parser argument. Several parsers are available; we use lxml for its speed. 

The soup object represents the HTML document as a tree, which can then be searched for elements by type, id, class or any other attribute.
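To make the tree idea concrete, here is a small sketch that searches a made-up HTML snippet by element type, by id and by class. The snippet and the `demo_` and `by_` names are hypothetical. The built-in `html.parser` is used so the example runs without lxml installed; passing `"lxml"` should behave the same way here.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document, used only for illustration.
demo_html = """
<html><head><title>Demo page</title></head>
<body>
  <h1 id="main-title">Postal codes</h1>
  <p class="note">First note</p>
  <p class="note">Second note</p>
</body></html>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

by_type = demo_soup.find("h1")                      # first <h1> element
by_id = demo_soup.find(id="main-title")             # element with this id
by_class = demo_soup.find_all("p", class_="note")   # every <p class="note">

print(by_type.get_text())
print(by_id.name)
print([p.get_text() for p in by_class])
```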

### <a id="step2" style="text-decoration:none; color: #000;">2. Extract required data</a>

The HTML table element has the CSS classes wikitable and sortable. We can pass these to the soup object's find method to get the table.

We will then loop through each row and extract the text content of each cell.

In [3]:
# Extract the table
postal_table = soup.find("table", {"class": "wikitable sortable"})

postal_data = []

# Get the table headers, stripped of surrounding whitespace
headers = [h.get_text(strip=True) for h in postal_table.find_all("th")]

# Loop through the table rows and extract the text of each cell
for row in postal_table.find_all("tr"):
    columns = row.find_all("td")
    if len(columns) > 0:
        post = {}
        for index, column in enumerate(columns):
            # Prefer the link text when the cell contains a link
            link = column.find("a")
            source = link if link is not None else column
            post[headers[index]] = source.get_text(strip=True)
        postal_data.append(post)

In [4]:
postal_data[0:5]

[{'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned',
  'Postcode': 'M1A'},
 {'Borough': 'Not assigned',
  'Neighbourhood': 'Not assigned',
  'Postcode': 'M2A'},
 {'Borough': 'North York', 'Neighbourhood': 'Parkwoods', 'Postcode': 'M3A'},
 {'Borough': 'North York',
  'Neighbourhood': 'Victoria Village',
  'Postcode': 'M4A'},
 {'Borough': 'Downtown Toronto',
  'Neighbourhood': 'Harbourfront',
  'Postcode': 'M5A'}]
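One detail worth knowing when pulling text out of table cells: a tag's `.string` attribute is `None` whenever the tag has more than one child node, while `get_text()` joins all the text inside it. The sketch below shows the difference on a made-up row; the HTML and variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical table row: the second cell mixes a link with plain text.
row_html = (
    "<table><tr>"
    "<td>M5A</td>"
    "<td><a href='#'>Harbourfront</a> (east)\n</td>"
    "</tr></table>"
)
simple_cell, mixed_cell = BeautifulSoup(row_html, "html.parser").find_all("td")

print(simple_cell.string)   # a single text child, so .string returns it
print(mixed_cell.string)    # two children (link + text), so .string is None

# get_text() handles both cases; the separator keeps the pieces apart.
print(mixed_cell.get_text(" ", strip=True))
```

Calling `.replace()` on the second cell's `.string` would raise an `AttributeError`, which is why `get_text()` is usually the safer choice in a scraping loop.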

Now that we have the table content in a list, let's convert it into a pandas dataframe.

In [5]:
postal_df = pd.DataFrame(postal_data)

# Rename by column name so the result does not depend on the column order,
# which differs between pandas versions
postal_df = postal_df.rename(columns={"Postcode": "PostalCode", "Neighbourhood": "Neighborhood"})

# Sort the values first by PostalCode and then by Neighborhood
postal_df = postal_df.sort_values(by=["PostalCode", "Neighborhood"]).reset_index(drop=True)

# Make PostalCode the first column
postal_df = postal_df[["PostalCode", "Borough", "Neighborhood"]]

### <a id="step3" style="text-decoration:none; color: #000;">3. Explore and Preprocess the dataset</a>

Let's explore the dataset and fix any inconsistencies.

In [6]:
postal_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 289 entries, 0 to 288
Data columns (total 3 columns):
PostalCode      289 non-null object
Borough         289 non-null object
Neighborhood    289 non-null object
dtypes: object(3)
memory usage: 6.9+ KB


In [7]:
postal_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M1B,Scarborough,Malvern
2,M1B,Scarborough,Rouge
3,M1C,Scarborough,Highland Creek
4,M1C,Scarborough,Port Union


The first row has neither borough nor neighborhood. 

In [8]:
postal_df.describe(include="all")

Unnamed: 0,PostalCode,Borough,Neighborhood
count,289,289,289
unique,180,12,210
top,M9V,Not assigned,Not assigned
freq,8,77,78


The value "Not assigned" appears in 77 rows of the Borough column and in 78 rows of the Neighborhood column. 

Let's drop the rows without borough names. 

In [9]:
print("Original size of the dataset: {0}, {1}".format(postal_df.shape[0], postal_df.shape[1]))

postal_df = postal_df[postal_df["Borough"] != "Not assigned"]

print("New size of the dataset: {0}, {1}".format(postal_df.shape[0], postal_df.shape[1]))

Original size of the dataset: 289, 3
New size of the dataset: 212, 3


In [10]:
unique_neighborhoods = postal_df["Neighborhood"].unique().tolist()
print("There are {} unique neighborhoods".format(len(unique_neighborhoods)))

# Count the rows (not the unique values) whose neighborhood is unassigned
unassigned_count = (postal_df["Neighborhood"] == "Not assigned").sum()
print("\n\nNumber of unassigned neighborhoods: {0}\n".format(unassigned_count))

postal_df[postal_df["Neighborhood"] == "Not assigned"]

There are 210 unique neighborhoods


Number of unassigned neighborhoods: 1



Unnamed: 0,PostalCode,Borough,Neighborhood
195,M7A,Queen's Park,Not assigned


One row has an assigned Borough but no Neighborhood. For that row, we will copy the Borough value into the Neighborhood column.

In [11]:
# Assign with a single .loc call; chained indexing may write to a temporary copy
postal_df.loc[195, "Neighborhood"] = postal_df.loc[195, "Borough"]
postal_df.loc[195, :]

PostalCode               M7A
Borough         Queen's Park
Neighborhood    Queen's Park
Name: 195, dtype: object
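The fix above targets row 195 by its index label. As an alternative sketch, a boolean mask can fill every unassigned neighborhood at once without hard-coding an index. The small `demo_df` below is hypothetical data shaped like the postal code table; its third row is invented to show the mask catching more than one case.

```python
import pandas as pd

# Hypothetical miniature of the postal code table.
demo_df = pd.DataFrame({
    "PostalCode": ["M5A", "M7A", "M9V"],
    "Borough": ["Downtown Toronto", "Queen's Park", "Etobicoke"],
    "Neighborhood": ["Harbourfront", "Not assigned", "Not assigned"],
})

# True for each row whose neighborhood is missing
missing = demo_df["Neighborhood"] == "Not assigned"

# Copy the borough name into the neighborhood column for those rows only
demo_df.loc[missing, "Neighborhood"] = demo_df.loc[missing, "Borough"]

print(demo_df)
```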

In [12]:
postal_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
1,M1B,Scarborough,Malvern
2,M1B,Scarborough,Rouge
3,M1C,Scarborough,Highland Creek
4,M1C,Scarborough,Port Union
5,M1C,Scarborough,Rouge Hill


Some of the postal codes have multiple neighborhoods. For example, Highland Creek, Port Union and Rouge Hill have the postal code M1C. We will combine these into a single row with the neighborhood names separated by commas.

In [13]:
grouped_df = postal_df.groupby(["Borough", "PostalCode"])["Neighborhood"].apply(lambda x: ', '.join(x)).reset_index()
grouped_df.head(15)

Unnamed: 0,Borough,PostalCode,Neighborhood
0,Central Toronto,M4N,Lawrence Park
1,Central Toronto,M4P,Davisville North
2,Central Toronto,M4R,North Toronto West
3,Central Toronto,M4S,Davisville
4,Central Toronto,M4T,"Moore Park, Summerhill East"
5,Central Toronto,M4V,"Deer Park, Forest Hill SE, Rathnelly, South Hi..."
6,Central Toronto,M5N,Roselawn
7,Central Toronto,M5P,"Forest Hill North, Forest Hill West"
8,Central Toronto,M5R,"North Midtown, The Annex, Yorkville"
9,Downtown Toronto,M4W,Rosedale


In [14]:
grouped_df.shape

(103, 3)
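The grouping step can be checked on a small sample. The sketch below rebuilds the M1B and M1C rows shown earlier, combines them the same way, and confirms that each postal code is left with exactly one row. The `demo_` names are hypothetical, and `.agg(", ".join)` is used here as an equivalent of the lambda in the cell above.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "PostalCode": ["M1B", "M1B", "M1C", "M1C", "M1C"],
    "Borough": ["Scarborough"] * 5,
    "Neighborhood": ["Malvern", "Rouge", "Highland Creek", "Port Union", "Rouge Hill"],
})

# Join the neighborhood names of each (Borough, PostalCode) group with commas
demo_grouped = (
    demo_df.groupby(["Borough", "PostalCode"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# After grouping, every postal code should appear exactly once
print(demo_grouped["PostalCode"].is_unique)
print(demo_grouped)
```

The same `is_unique` check on `grouped_df["PostalCode"]` would be a cheap sanity test before merging in the location data.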

## <a id="section2" style="text-decoration:none; color: #000;">II. Fetch Location Data</a>

We will be using the Foursquare API to get information about the different neighborhoods. For this, we need to get the latitude and longitude of each neighborhood.

### <a id="step4" style="text-decoration:none; color: #000;">1. Get location data</a>
The geocoder Python package can be used to get location data for each neighborhood in the dataset. It takes an address and returns the latitude and longitude. Documentation for the package can be found here: https://geocoder.readthedocs.io/index.html. 
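As a rough sketch of how such a lookup might be wrapped, the helper below retries a geocoding call a few times before giving up. The function name `lookup_with_retry` and its parameters are hypothetical. The geocoding call is passed in as a plain callable so the sketch runs without network access, and the commented lines at the end show one possible wiring, assuming the geocoder package is installed and its ArcGIS provider is reachable.

```python
import time


def lookup_with_retry(address, geocode, max_tries=3, pause=0.0):
    """Call `geocode(address)` until it yields coordinates or tries run out.

    `geocode` is any callable that returns a (lat, lng) pair, or a falsy
    value such as None or an empty list when the lookup fails.
    Returns the coordinates, or None if every attempt failed.
    """
    for _ in range(max_tries):
        coords = geocode(address)
        if coords:
            return coords
        time.sleep(pause)  # wait before the next attempt
    return None


# One possible wiring (assumption: geocoder is installed and online):
# import geocoder
# coords = lookup_with_retry(
#     "M5A, Toronto, Ontario",
#     lambda addr: geocoder.arcgis(addr).latlng,
# )
```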

The geocoder API does not always return location data, so we will instead use the following CSV file, which contains the location data for the neighborhoods: https://cocl.us/Geospatial_data

In [15]:
loc_url = "https://cocl.us/Geospatial_data"

loc_df = pd.read_csv(loc_url)
loc_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


### <a id="step5" style="text-decoration:none; color: #000;">2. Add location data to the dataset</a>


Now we can merge the two datasets on postal code. The column has a different name in each dataframe, so we pass the left_on and right_on parameters to the merge function and then drop the duplicate column.

In [16]:
toronto_df = grouped_df.merge(loc_df, left_on="PostalCode", right_on="Postal Code")
toronto_df.drop("Postal Code", axis=1, inplace=True)
toronto_df.head()

Unnamed: 0,Borough,PostalCode,Neighborhood,Latitude,Longitude
0,Central Toronto,M4N,Lawrence Park,43.72802,-79.38879
1,Central Toronto,M4P,Davisville North,43.712751,-79.390197
2,Central Toronto,M4R,North Toronto West,43.715383,-79.405678
3,Central Toronto,M4S,Davisville,43.704324,-79.38879
4,Central Toronto,M4T,"Moore Park, Summerhill East",43.689574,-79.38316
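By default `merge` performs an inner join, so a postal code missing from the location file would disappear from the result without any warning. The sketch below shows one way to guard against that, using the `how`, `validate` and `indicator` parameters of `merge`. The two small frames are illustrative: the coordinates come from the rows above, while the `M9Z` row is invented to show an unmatched code.

```python
import pandas as pd

left = pd.DataFrame({
    "PostalCode": ["M4N", "M4P", "M9Z"],  # "M9Z" is a made-up code
    "Neighborhood": ["Lawrence Park", "Davisville North", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M4N", "M4P"],
    "Latitude": [43.728020, 43.712751],
    "Longitude": [-79.388790, -79.390197],
})

checked = left.merge(
    right,
    how="left",                # keep every row of the left frame
    left_on="PostalCode",
    right_on="Postal Code",
    validate="one_to_one",     # raise if a postal code repeats on either side
    indicator=True,            # add a "_merge" column describing each row
)

# Postal codes that found no match in the location data
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that every postal code received coordinates.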


Rearranging the columns:

In [17]:
fixed_columns = [toronto_df.columns[1], toronto_df.columns[0]] + list(toronto_df.columns[2:])
toronto_df = toronto_df[fixed_columns]
toronto_df.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
2,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
3,M4S,Central Toronto,Davisville,43.704324,-79.38879
4,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
5,M4V,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",43.686412,-79.400049
6,M5N,Central Toronto,Roselawn,43.711695,-79.416936
7,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307
8,M5R,Central Toronto,"North Midtown, The Annex, Yorkville",43.67271,-79.405678
9,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529
