# Segmenting and Clustering Neighborhoods in Toronto

*In this project, a Wikipedia page is scraped to retrieve the Canadian postal codes that begin with 'M', which cover Toronto. The scraped table is preprocessed and cleaned, then combined with a CSV file that gives the geographical coordinates of each postal code. The combined data is used to explore and cluster the neighborhoods of Toronto: K-means clustering groups the neighborhoods, and the Folium library visualizes them on a map. Only boroughs whose names contain the word 'Toronto' are used in the clustering and visualization.*

### Installing and Importing Libraries

In [15]:
import requests
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import random
import json # library to handle JSON files

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt # plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

#!conda install -c conda-forge geopy --yes #
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)

#!conda install -c conda-forge folium=0.5.0 --yes #
import folium # map rendering library

print('Libraries Imported')

Libraries Imported


### Scraping the Wikipedia Page for Postal Code Table

*The pandas function 'read_html' is used here to scrape the tables on the Wikipedia page. It returns a list of dataframes, one per table found on the page. The first one holds the postal codes and is stored in the dataframe 'df_toronto', which is then displayed.*

In [16]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_webpage = pd.read_html(url)

In [17]:
df_toronto = df_webpage[0]
df_toronto

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,Not assigned
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"


### Data Preprocessing / Cleaning

In [18]:
#renaming columns to match the required names
df_toronto.rename(columns={"Postal Code": "PostalCode", "Neighbourhood": "Neighborhood"}, inplace = True)

In [19]:
#dropping rows in which the Borough is 'Not assigned'
df_toronto_1 = df_toronto[df_toronto.Borough != 'Not assigned']

#combining the neighborhoods which share the same Postal Code
df_toronto_2 = df_toronto_1.groupby(['PostalCode','Borough'], sort = False).agg(', '.join)

#reset the index
df_toronto_2.reset_index(inplace = True)

#replacing the name of the neighborhoods that are 'Not assigned' with the name of the Borough
df_toronto_2['Neighborhood'] = np.where(df_toronto_2['Neighborhood'] == 'Not assigned', df_toronto_2['Borough'], df_toronto_2['Neighborhood'])

#print the dataframe
df_toronto_2

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
6,M1B,Scarborough,"Malvern, Rouge"
7,M3B,North York,Don Mills
8,M4B,East York,"Parkview Hill, Woodbine Gardens"
9,M5B,Downtown Toronto,"Garden District, Ryerson"


In [20]:
#shape of the dataframe
df_toronto_2.shape

(103, 3)
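
The cleaning steps above can be checked on a few hand-made rows. The following is a minimal sketch, not part of the original notebook: the `toy` frame and its values are hypothetical, and it assumes a layout in which a postal code may appear on several rows, one per neighborhood.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (not taken from the scraped table).
toy = pd.DataFrame({
    "PostalCode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# 1. Drop rows whose borough is 'Not assigned'.
assigned = toy[toy["Borough"] != "Not assigned"]

# 2. Join the neighborhoods that share a postal code into one string.
combined = (
    assigned.groupby(["PostalCode", "Borough"], sort=False)["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)

# 3. Where the neighborhood is 'Not assigned', fall back to the borough name.
combined["Neighborhood"] = np.where(
    combined["Neighborhood"] == "Not assigned",
    combined["Borough"],
    combined["Neighborhood"],
)

print(combined)
```

Selecting the `Neighborhood` column before `agg` keeps the join from being applied to any other column that might be present.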

### Importing the CSV File Containing the Geographical Coordinates of Each Postal Code

In [21]:
geo_coordinates = pd.read_csv('https://cocl.us/Geospatial_data')
geo_coordinates

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
5,M1J,43.744734,-79.239476
6,M1K,43.727929,-79.262029
7,M1L,43.711112,-79.284577
8,M1M,43.716316,-79.239476
9,M1N,43.692657,-79.264848


### Merging the Two Tables to Get the Geographical Coordinates of Each Neighborhood

In [22]:
#renaming the columns to match the columns of df_toronto_2
geo_coordinates.rename(columns = {'Postal Code': 'PostalCode'}, inplace = True)

#merging the two dataframes
df_toronto_3 = pd.merge(df_toronto_2, geo_coordinates, on = 'PostalCode')

#printing the final dataframe
df_toronto_3

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
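
`pd.merge` performs an inner join by default, so any postal code without a match in the coordinates table would be dropped silently. The sketch below shows one way to check for that, using a left join with `indicator=True`. The frames are small hypothetical stand-ins, and the postal code 'M9Z' is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the postal code table and the coordinates table.
codes = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M9Z"],  # 'M9Z' is made up and has no coordinates
    "Borough": ["North York", "North York", "Hypothetical Borough"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every postal code; the '_merge' column records
# whether a matching row was found in the coordinates table.
checked = pd.merge(codes, coords, on="PostalCode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

print("Postal codes without coordinates:", unmatched)
```

An empty `unmatched` list would indicate that the inner join loses no rows.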


## Clustering and Visualization of Toronto Neighborhoods

### Create Dataframe of all rows with Boroughs containing 'Toronto'

In [23]:
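# .copy() gives an independent dataframe, so columns can be added later without affecting df_toronto_3
df_toronto_4 = df_toronto_3[df_toronto_3['Borough'].str.contains('Toronto', regex = False)].copy()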
df_toronto_4

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259


### Visualizing the Neighborhoods of 'df_toronto_4'

In [24]:
# create a map centered on Toronto (latitude and longitude looked up on Google)
map_toronto = folium.Map(location=[43.6532, -79.3832], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_toronto_4['Latitude'], df_toronto_4['Longitude'], df_toronto_4['Borough'], df_toronto_4['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

# print the map
map_toronto
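
As an alternative to a hard-coded map center, the center can be derived from the data itself. The sketch below is an illustration under assumptions, not the notebook's method: it uses only the ten rows of 'df_toronto_4' displayed earlier, and the names `sample` and `center` are introduced here.

```python
import pandas as pd

# The ten rows of df_toronto_4 shown above (coordinates only).
sample = pd.DataFrame({
    "Latitude": [43.65426, 43.662301, 43.657162, 43.651494, 43.676357,
                 43.644771, 43.657952, 43.669542, 43.650571, 43.669005],
    "Longitude": [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031,
                  -79.373306, -79.387383, -79.422564, -79.384568, -79.442259],
})

# Use the mean coordinate of the plotted points as the map center.
center = [sample["Latitude"].mean(), sample["Longitude"].mean()]
print(center)
```

The resulting list could be passed as `location=center` to `folium.Map` in place of the fixed coordinates.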

### Clustering the Neighborhoods Using K Means Clustering

In [26]:
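# keep only the coordinate columns as clustering features
toronto_cluster = df_toronto_4.drop(columns=['PostalCode', 'Borough', 'Neighborhood'])
# fit K-means with 5 clusters; random_state fixes the initialization
k_means = KMeans(n_clusters = 5, random_state = 0)
k_means.fit(toronto_cluster)
k_means_labels = k_means.labels_
# add the cluster labels as the first column
df_toronto_4.insert(0, 'ClusterLabel', k_means_labels)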

In [27]:
#print the new dataframe with cluster labels
df_toronto_4

Unnamed: 0,ClusterLabel,PostalCode,Borough,Neighborhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,3,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,0,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
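
The notebook fixes the number of clusters at five. One common heuristic for examining that choice, not used in the original analysis, is to compare the K-means inertia (within-cluster sum of squared distances) across several values of k and look for an "elbow". The sketch below assumes only the ten coordinates displayed above, so its numbers are illustrative. The names `points` and `inertias` are introduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Latitude/longitude of the ten rows displayed above.
points = np.array([
    [43.65426, -79.360636], [43.662301, -79.389494], [43.657162, -79.378937],
    [43.651494, -79.375418], [43.676357, -79.293031], [43.644771, -79.373306],
    [43.657952, -79.387383], [43.669542, -79.422564], [43.650571, -79.384568],
    [43.669005, -79.442259],
])

# Fit K-means for k = 1..6 and record the inertia of each fit.
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

for k, value in zip(range(1, 7), inertias):
    print(k, value)
```

A k beyond which the inertia stops dropping sharply is a reasonable candidate. Because the features are raw latitude and longitude, the clusters here are purely geographic.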


### Visualizing the Clustered Neighborhoods of 'df_toronto_4'

In [31]:
# create a map centered on Toronto (latitude and longitude looked up on Google)
map_toronto_clusters = folium.Map(location=[43.6532, -79.3832], zoom_start=10)

# cluster color scheme: one evenly spaced rainbow color per cluster
k_clusters = 5
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to map
for lat, lng, neighborhood, cluster in zip(df_toronto_4['Latitude'], df_toronto_4['Longitude'], df_toronto_4['Neighborhood'], df_toronto_4['ClusterLabel']):
    label = folium.Popup('Cluster ' + str(cluster), parse_html=True)
    # KMeans labels run from 0 to 4, so they index the color list directly
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7
        ).add_to(map_toronto_clusters)

In [32]:
# display the map
map_toronto_clusters
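
The mapping from cluster labels to marker colors can be tried on its own, without rendering a map. This is a minimal sketch under assumptions: the label list is a hypothetical sample, and the names `palette` and `marker_colors` are introduced here.

```python
import numpy as np
from matplotlib import cm, colors

k_clusters = 5

# One hex color per cluster, evenly spaced along the rainbow colormap.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k_clusters))]

# Hypothetical sample of cluster labels, which range from 0 to k_clusters - 1.
sample_labels = [0, 0, 4, 3, 1]

# Each label selects its color by position in the palette.
marker_colors = [palette[label] for label in sample_labels]
print(marker_colors)
```

Points that share a label receive the same color, and each of the five clusters gets a distinct one.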