<h2><b>Segmenting and Clustering Neighborhoods in Toronto</b></h2>
<h5>
This repository is part of the IBM Data Science Professional Certificate Capstone Project. The project segments and clusters Toronto's neighbourhoods using the K-means clustering method.
</h5>


<h4><b>Part 1: Scraping Toronto Postal Code Data from Wikipedia</b></h4>
data source: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M 

In [129]:
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(url)

# parse the page's tables into DataFrames; the first table holds the postal codes
content = pd.read_html(r.text)
alldata = content[0]

# drop rows whose Borough is "Not assigned"
dataexclna = alldata[alldata["Borough"]!="Not assigned"]

# combine neighbourhoods that share a postal code into one comma-separated string
cdata = dataexclna.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
cdata.reset_index(inplace=True)

cdata.shape

(103, 3)
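
The filter-and-combine step above can be illustrated on a small table. This is a minimal sketch, assuming the scraped table has the same three column names; the rows in `raw` are made up for illustration and are not the scraped data.

```python
import pandas as pd

# Toy stand-in for the scraped table (illustrative rows, not the real data)
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M3A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Not assigned", "Regent Park", "Harbourfront", "Parkwoods"],
})

# Keep only rows that have a borough assigned
assigned = raw[raw["Borough"] != "Not assigned"]

# One row per postal code; neighbourhood names joined with ", "
merged = (
    assigned.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)
print(merged)
```

With `sort=False` the groups keep the order in which each postal code first appears, so here M5A comes before M3A.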

<h4><b>Part 2: Merging latitude and longitude coordinates to each Toronto Postal Code</b></h4>
data source: https://cocl.us/Geospatial_data


In [133]:
# examine the latitude and longitude data file
latlongurl = 'https://cocl.us/Geospatial_data'
latlong = pd.read_csv(latlongurl)

latlong.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [139]:
# left-join the coordinates onto the neighbourhood data by Postal Code
torontodata = pd.merge(left=cdata, right=latlong, how='left', on='Postal Code')
torontodata.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
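
A left join keeps every postal code even when no coordinates match, so it can be worth checking for gaps before mapping. The sketch below is one possible check, not part of the original notebook; the frames `left` and `right` and the code `M9Z` are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: "M9Z" has no entry in the coordinates table
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Etobicoke"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from
checked = pd.merge(left, right, how="left", on="Postal Code",
                   validate="one_to_one", indicator=True)

# Postal codes that found no coordinates
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)  # ['M9Z']
```

Rows listed in `missing` would carry NaN coordinates and could not be placed on a map.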


<h4><b>Part 3: Visualizing and Analyzing the Toronto Neighborhoods Data</b></h4>
This section provides insight into the Toronto neighborhoods by examining, visualizing, and clustering the data.


<u>3.1 - Visualizing Neighborhoods on a Folium Map</u>

In [226]:
import folium

torontomap = folium.Map(location=[43.6532, -79.3832], zoom_start=11)
# add one marker per postal code, labelled with its neighbourhood name(s)
for lat, lng, name in zip(torontodata['Latitude'], torontodata['Longitude'], torontodata['Neighbourhood']):
    folium.Marker([lat, lng], popup=name).add_to(torontomap)

torontomap.save('originalmap.html')
torontomap

<u>3.2 - Clustering Toronto Neighborhoods</u>

Run K-means on the latitude and longitude of each postal code to group the neighborhoods into 5 clusters.


In [212]:
import numpy as np
from sklearn.cluster import KMeans

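k = 5
# keep only the coordinate columns as clustering features
tor_cluster = torontodata.drop(['Neighbourhood', 'Postal Code', 'Borough'], axis=1)
kmeans = KMeans(n_clusters=k, random_state=0).fit(tor_cluster)
kmeans.labels_[0:10]

# rebuild the merged table and attach each row's cluster label as the first column
clustered_data = pd.merge(left=cdata, right=latlong, how='left', left_on='Postal Code', right_on='Postal Code')
clustered_data.insert(0, 'Cluster Labels', kmeans.labels_)
clustered_data.head()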

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,4,M3A,North York,Parkwoods,43.753259,-79.329656
1,4,M4A,North York,Victoria Village,43.725882,-79.315572
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,0,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
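
To make the clustering step concrete, here is a minimal sketch of K-means on a handful of coordinate pairs. The points in `coords` are hypothetical and chosen to form two obvious groups; `n_init=10` is set explicitly only so the behaviour does not depend on the scikit-learn version.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (latitude, longitude) pairs forming two well-separated groups
coords = np.array([
    [43.65, -79.38], [43.66, -79.39], [43.64, -79.37],
    [43.78, -79.19], [43.79, -79.20], [43.77, -79.18],
])

# Fit K-means with two clusters; random_state fixes the initialization
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = model.labels_

print(labels)                  # one integer label per point
print(model.cluster_centers_)  # mean coordinate of each cluster
print(model.inertia_)          # sum of squared distances to the nearest center
```

K-means here measures plain Euclidean distance in degrees. At Toronto's latitude a degree of longitude spans less ground than a degree of latitude, so the clusters are only an approximation of true geographic proximity.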


In [225]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

# re-create cluster map
torontomap = folium.Map(location=[43.6532, -79.3832], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per postal code, colored by its cluster label
for lat, long, neighbourhood, cluster in zip(clustered_data['Latitude'], clustered_data['Longitude'], clustered_data['Neighbourhood'], clustered_data['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, long],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(torontomap)
       
torontomap.save('clustered.html')
torontomap
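
The color assignment used for the markers can be sketched on its own. This is an illustrative variant, assuming 5 clusters labelled 0 to 4; the names `palette` and `color_for` are hypothetical. It indexes the palette directly by label, whereas the cell above uses `cluster-1`, which gives label 0 the last color through negative indexing.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters (assumed to match the K-means run)

# Sample k evenly spaced colors from the rainbow colormap and convert to hex
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

# Map each cluster label (0..k-1) to its hex color
color_for = {label: palette[label] for label in range(k)}
print(color_for)
```

Either indexing scheme gives each cluster a distinct color; only the pairing of labels to colors differs.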

In [229]:
import IPython
print(IPython.sys_info())

{'commit_hash': 'b467d487e',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': '/usr/local/lib/python3.6/dist-packages/IPython',
 'ipython_version': '5.5.0',
 'os_name': 'posix',
 'platform': 'Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic',
 'sys_executable': '/usr/bin/python3',
 'sys_platform': 'linux',
 'sys_version': '3.6.9 (default, Apr 18 2020, 01:56:04) \n[GCC 8.4.0]'}


In [230]:
!pip freeze | grep -E 'folium|matplotlib|biopython'

folium==0.8.3
matplotlib==3.2.2
matplotlib-venn==0.11.5
