# About the project
This project is an attempt to segment, explore, and cluster Toronto neighbourhoods.

First, we build a dataset of Toronto neighbourhoods by scraping postal codes, boroughs, and neighbourhoods from Wikipedia. We then clean the resulting dataset by handling missing values.

Second, we add latitude and longitude to each record by merging the Toronto dataset with a table of geographic coordinates.

Third, we explore, segment, and cluster the neighbourhoods using k-means, and visualise the result as markers superimposed on a Folium map.


# Part 1

Part 1 involves:
- Setting up the required libraries
- Scraping Wikipedia for Toronto neighbourhood information, including Borough, Postal Code, and Neighbourhood
- Creating a dataframe from the scraped source
- Cleaning up missing values, i.e., dropping rows with 'Not assigned' in the Borough column and filling 'Not assigned' neighbourhoods with the borough name

In [1]:
# Set-up libraries
from bs4 import BeautifulSoup
import requests
from IPython.display import display_html
import pandas as pd
import numpy as np
import folium
from geopy.geocoders import Nominatim
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors

In [2]:
# Scrape wiki source
path = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
source = requests.get(path).text
soup = BeautifulSoup(source, 'lxml')
print(soup.title)
toronto_table = str(soup.table)
display_html(toronto_table, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


Postcode,Borough,Neighbourhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,Harbourfront
M6A,North York,Lawrence Heights
M6A,North York,Lawrence Manor
M7A,Downtown Toronto,Queen's Park
M8A,Not assigned,Not assigned
M9A,Etobicoke,Islington Avenue


In [3]:
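# Read-in source to dataframe
# Wrap the HTML in StringIO: recent pandas versions deprecate passing a literal HTML string
from io import StringIO
toronto_dfs = pd.read_html(StringIO(toronto_table))
toronto_df = toronto_dfs[0]
toronto_df.head()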

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
# Discard the rows with 'Not assigned' value in Borough column   
toronto_df1 = toronto_df[toronto_df.Borough != 'Not assigned']

# Join neighbourhoods with identical Postcode values
toronto_df2 = toronto_df1.groupby(['Postcode', 'Borough'], sort=False).agg(', '.join)
toronto_df2.reset_index(inplace=True)

# Impute 'Not assigned' in Neighbourhood with Borough names
toronto_df2['Neighbourhood'] = np.where(toronto_df2['Neighbourhood'] == 'Not assigned', 
                                toronto_df2['Borough'], 
                                toronto_df2['Neighbourhood'])
toronto_df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


In [5]:
toronto_df2.shape

(103, 3)
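
The shape alone does not confirm that the cleaning worked. As a minimal sketch, the checks below test the properties the cleaning step is meant to guarantee: one row per postcode and no remaining 'Not assigned' values. The helper `check_cleaned` and the small `cleaned` frame are hypothetical stand-ins for illustration; in the notebook the function would be applied to `toronto_df2`.

```python
import pandas as pd

# Toy stand-in for the cleaned frame, copied from the first rows shown above
cleaned = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M5A', 'M6A', 'M7A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto',
                'North York', 'Downtown Toronto'],
    'Neighbourhood': ['Parkwoods', 'Victoria Village', 'Harbourfront',
                      'Lawrence Heights, Lawrence Manor', "Queen's Park"],
})

def check_cleaned(df):
    """Return simple integrity checks for the cleaned table (hypothetical helper)."""
    return {
        # Joining neighbourhoods per postcode should leave one row per postcode
        'unique_postcodes': bool(df['Postcode'].is_unique),
        # Rows with an unassigned borough should have been dropped
        'no_unassigned_borough': not (df['Borough'] == 'Not assigned').any(),
        # Unassigned neighbourhoods should have been replaced by the borough name
        'no_unassigned_neighbourhood': not (df['Neighbourhood'] == 'Not assigned').any(),
    }

checks = check_cleaned(cleaned)
print(checks)
```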

# Part 2
Part 2 involves:
- Reading in geographic coordinates containing latitudes and longitudes
- Merging the geographic coordinates to the Toronto dataframe

In [6]:
# Read-in coordinates to dataframe
geo_loc_df = pd.read_csv('Geospatial_Coordinates.csv')
geo_loc_df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [7]:
toronto_df3 = pd.merge(toronto_df2, geo_loc_df, left_on='Postcode', right_on='Postal Code')
toronto_df3.drop(columns=['Postal Code'], inplace=True)
toronto_df3.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
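
The merge above is an inner join, so any postcode missing from the coordinates file would be dropped silently. The sketch below shows one way to surface such rows using the `how`, `validate`, and `indicator` options of `pd.merge`. The two small frames are toy data, and the postcode 'M0X' with borough 'Hypothetical' is made up to force a mismatch.

```python
import pandas as pd

# Toy frames; 'M0X' is an invented postcode with no coordinates
left = pd.DataFrame({'Postcode': ['M3A', 'M4A', 'M0X'],
                     'Borough': ['North York', 'North York', 'Hypothetical']})
right = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join keeps every postcode; validate raises if either key column has duplicates;
# indicator adds a '_merge' column recording where each row came from
merged = pd.merge(left, right, how='left',
                  left_on='Postcode', right_on='Postal Code',
                  validate='one_to_one', indicator=True)

# Postcodes that found no matching coordinates
unmatched = merged.loc[merged['_merge'] == 'left_only', 'Postcode'].tolist()
print(unmatched)
```

If `unmatched` is empty for the real data, the inner join used in the notebook loses nothing.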


# Part 3

## A look at Toronto overall
- Get Toronto geographical coordinates
- Visualise Toronto and its neighbourhoods

In [8]:
# Get Toronto geo coordinates
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
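
The `geocode` call returns `None` when the service finds no match, in which case reading `.latitude` raises an `AttributeError`. The sketch below guards against that with a fallback centre. The helper `resolve_centre` and the stand-in `fake_geocode` are hypothetical names used only for illustration; in the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`.

```python
from collections import namedtuple

def resolve_centre(geocode, address, fallback):
    """Return (latitude, longitude) for address, or fallback if the lookup finds nothing."""
    location = geocode(address)
    if location is None:
        return fallback
    return (location.latitude, location.longitude)

# Stand-in geocoder so the example runs offline (not part of geopy)
Location = namedtuple('Location', ['latitude', 'longitude'])

def fake_geocode(address):
    known = {'Toronto, ON': Location(43.653963, -79.387207)}
    return known.get(address)  # None for unknown addresses

centre = resolve_centre(fake_geocode, 'Toronto, ON', fallback=(0.0, 0.0))
missing = resolve_centre(fake_geocode, 'Nowhere', fallback=(43.651070, -79.347015))
print(centre, missing)
```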


In [9]:
# Create a map of Toronto with neighbourhoods superimposed on top
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_df3['Latitude'], 
                                           toronto_df3['Longitude'], 
                                           toronto_df3['Borough'], 
                                           toronto_df3['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

## A look at just the boroughs whose names contain 'Toronto'
- Get the geographical coordinates of downtown Toronto
- Visualise these boroughs and their neighbourhoods

In [10]:
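# Slice just Toronto boroughs from Toronto dataframe
# .copy() gives an independent frame, which is safer to modify later (cluster labels are added to it)
toronto_borough_df = toronto_df3[toronto_df3['Borough'].str.contains('Toronto', regex=False)].copy()
toronto_borough_df.head()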

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031


In [11]:
# Get Toronto borough geo coordinates
address = 'Downtown Toronto, TO'
geolocator = Nominatim(user_agent="downtown_toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Downtown Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Downtown Toronto are 43.6541737, -79.38081164513409.


In [12]:
# Create a map of Toronto borough subset with neighbourhoods superimposed on top
map_toronto_borough = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_borough_df['Latitude'], 
                                           toronto_borough_df['Longitude'], 
                                           toronto_borough_df['Borough'], 
                                           toronto_borough_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto_borough)
    
map_toronto_borough

## Cluster neighbourhoods

In [13]:
# Create clustering
k = 5
# Cluster on Latitude and Longitude only (the columns left after the drop)
toronto_clustering = toronto_borough_df.drop(columns=['Postcode', 'Borough', 'Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# Add each row's cluster label as the first column
toronto_borough_df.insert(0, 'Cluster Labels', kmeans.labels_)

In [14]:
# See cluster labels
toronto_borough_df.head()

Unnamed: 0,Cluster Labels,Postcode,Borough,Neighbourhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
4,0,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
9,0,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
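
The notebook fixes k at 5 without giving a reason. One common heuristic, not used in the original, is the elbow method: fit k-means for several values of k and look for the point where the within-cluster sum of squares (`inertia_`) stops dropping sharply. The sketch below runs it on the five coordinate pairs shown above, purely as an illustration. In the notebook, the full `toronto_clustering` frame and a wider range of k would be used instead.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude pairs copied from the five rows displayed above
coords = np.array([
    [43.654260, -79.360636],
    [43.662301, -79.389494],
    [43.657162, -79.378937],
    [43.651494, -79.375418],
    [43.676357, -79.293031],
])

# Fit k-means for each candidate k and record the within-cluster sum of squares
inertias = {}
for candidate_k in range(1, 5):
    model = KMeans(n_clusters=candidate_k, random_state=0, n_init=10).fit(coords)
    inertias[candidate_k] = model.inertia_

for candidate_k, inertia in inertias.items():
    print(candidate_k, inertia)
```

Note that this clusters on raw latitude and longitude, as the notebook does, so the clusters reflect geographic proximity only.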


In [15]:
# Create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# Set color scheme for the clusters
# One evenly spaced rainbow colour per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
for lat, lon, neighbourhood, cluster in zip(toronto_borough_df['Latitude'], 
                                            toronto_borough_df['Longitude'], 
                                            toronto_borough_df['Neighbourhood'], 
                                            toronto_borough_df['Cluster Labels']):
    label = folium.Popup('{} (Cluster {})'.format(neighbourhood, cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
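
The colour scheme used for the markers can be examined on its own. The sketch below samples k evenly spaced colours from matplotlib's rainbow colormap, converts them to hex strings that Folium accepts, and maps each cluster label to one of them. The names `palette` and `colour_for` are introduced here for illustration only.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5

# Sample k evenly spaced RGBA colours from the rainbow colormap and convert each to hex
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# k-means labels run from 0 to k-1, so each label can index the palette directly
colour_for = {label: palette[label] for label in range(k)}
print(colour_for)
```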