# Segmenting and Clustering Neighborhoods in Toronto

#### By Lucia Hasfura

Import necessary packages:

In [18]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium 

#!conda install -c conda-forge geocoder --yes
#from geopy.geocoders import Nominatim

print('Libraries imported.')

Scrape the postal code table from the Wikipedia page and display it.

In [19]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
requests_data = requests.get(url).text
beautiful_soup = BeautifulSoup(requests_data, 'lxml')
print(beautiful_soup.title)

# Render the first table on the page as HTML in the notebook.
from IPython.display import display_html
table = str(beautiful_soup.table)
display_html(table, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


| Postal Code | Borough | Neighbourhood |
|---|---|---|
| M1A | Not assigned | Not assigned |
| M2A | Not assigned | Not assigned |
| M3A | North York | Parkwoods |
| M4A | North York | Victoria Village |
| M5A | Downtown Toronto | Regent Park, Harbourfront |
| M6A | North York | Lawrence Manor, Lawrence Heights |
| M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| M8A | Not assigned | Not assigned |
| M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| M1B | Scarborough | Malvern, Rouge |


The data is originally in HTML form, so I convert it to a pandas DataFrame.

In [20]:
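from io import StringIO

# read_html returns a list of DataFrames; wrapping the HTML string in StringIO
# avoids the literal-string deprecation in recent pandas versions.
data = pd.read_html(StringIO(table))[0]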
data.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


I keep only the rows that have an assigned borough and drop the rows whose borough is 'Not assigned'.

In [21]:
new_data=data[data.Borough != 'Not assigned']
new_data.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
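
The filtering step can be tried in isolation on a few rows. This is a minimal sketch, assuming a small hand-built DataFrame (`sample` and `assigned` are illustrative names, not part of the notebook) with values copied from the table above.

```python
import pandas as pd

# Small stand-in for the scraped table; rows copied from the output above.
sample = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Regent Park, Harbourfront"],
})

# Boolean mask: True where the borough is assigned.
mask = sample["Borough"] != "Not assigned"

# .copy() gives an independent frame, so later edits do not touch `sample`.
assigned = sample[mask].copy()
print(assigned)
```

The original index labels are preserved after filtering, which is why the notebook output starts at index 2 rather than 0.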


I combine the neighborhoods that share a postal code into a single comma-separated row to avoid repetition. If a row has a borough but a 'Not assigned' neighborhood, the neighborhood is set to the borough name.

In [22]:
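# Join the neighbourhoods of each (postal code, borough) pair with ', '.
new_data = new_data.groupby(['Postal Code', 'Borough'], sort=False).agg(', '.join)
new_data.reset_index(inplace=True)
# Fall back to the borough name where no neighbourhood is assigned.
new_data['Neighbourhood'] = np.where(new_data['Neighbourhood'] == 'Not assigned', new_data['Borough'], new_data['Neighbourhood'])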
new_data.head()

|   | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
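
To see what the grouping and the fallback do, here is a small sketch on made-up rows. The frame `toy`, the postal code `M0Z`, and `Example Borough` are hypothetical and exist only for illustration; the two `M5A` rows imitate a table in which one postal code is split across several rows.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one postal code split over two rows, plus one row
# with a borough but no assigned neighbourhood.
toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M0Z"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "Example Borough"],
    "Neighbourhood": ["Regent Park", "Harbourfront", "Not assigned"],
})

# Collapse rows that share a postal code and borough into one row.
combined = (
    toy.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

# Replace an unassigned neighbourhood with the borough name.
combined["Neighbourhood"] = np.where(
    combined["Neighbourhood"] == "Not assigned",
    combined["Borough"],
    combined["Neighbourhood"],
)
print(combined)
```

Passing `sort=False` keeps the groups in their order of first appearance, which matches the row order of the source table.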


Determine the number of rows of the new dataframe.

In [23]:
new_data.shape

(103, 3)

Import the CSV file with the geographical coordinates of each postal code.

In [24]:
lat_lng = pd.read_csv('http://cocl.us/Geospatial_data')
lat_lng.head()

|   | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


The table above contains only the Postal Code, Latitude, and Longitude columns, so I merge it with the previous table on 'Postal Code' to bring in the Borough and Neighbourhood columns as well.

In [25]:
new_data2= pd.merge(new_data, lat_lng, on= 'Postal Code')
new_data2.head()

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
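
`pd.merge` defaults to an inner join, so a postal code that is missing from the coordinates file would silently disappear from the result. One possible way to check for that is sketched below, assuming two small hand-built frames (`areas`, `coords`) and a made-up postal code `M9Z` that has no coordinates; none of these names come from the notebook.

```python
import pandas as pd

# "M9Z" is a hypothetical code with no matching coordinates.
areas = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M9Z"],
    "Borough": ["North York", "North York", "Example Borough"],
})
coords = pd.DataFrame({
    "Postal Code": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# A left join keeps every row of `areas`; indicator=True adds a `_merge`
# column saying where each row came from; validate raises if a postal
# code is duplicated on either side.
checked = pd.merge(
    areas, coords, on="Postal Code",
    how="left", indicator=True, validate="one_to_one",
)

# Postal codes that found no coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
print(missing)
```

If `missing` is empty, the inner join used in the notebook loses no rows.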


Now I explore and cluster the neighborhoods in Toronto. I work only with the boroughs whose name contains the word 'Toronto'.

In [26]:
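# .copy() makes the filtered frame independent, so adding columns later does not raise a SettingWithCopyWarning.
new_data2 = new_data2[new_data2['Borough'].str.contains('Toronto')].copy()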
new_data2.head()

|   | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
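
The substring filter can be sketched on its own as follows. The frame `boroughs` is an illustrative stand-in, and the `None` entry is a hypothetical missing value added to show why `na=False` can be useful; the notebook's data may not contain any.

```python
import pandas as pd

# Illustrative borough names; the None entry is a made-up missing value.
boroughs = pd.DataFrame(
    {"Borough": ["Downtown Toronto", "North York", "East Toronto", None]}
)

# regex=False treats 'Toronto' as a plain substring;
# na=False makes missing values count as "no match" instead of NaN.
mask = boroughs["Borough"].str.contains("Toronto", regex=False, na=False)
toronto_only = boroughs[mask].copy()
print(toronto_only)
```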


In [27]:
# Use the mean of the selected neighborhoods' coordinates as an approximate center of Toronto.
latitude = new_data2['Latitude'].mean()
longitude = new_data2['Longitude'].mean()
print('The latitude and longitude of Toronto are:', latitude, longitude)

The latitude and longitude of Toronto are: 43.66713498717949 -79.38987324871795


Create a map of Toronto with neighborhoods superimposed on top. 

In [44]:
map_toronto = folium.Map(location=[43.651070, -79.347015], zoom_start=11)

# Add one circle marker per neighborhood, labelled "neighbourhood, borough".
for lat, lng, borough, neighbourhood in zip(new_data2['Latitude'], new_data2['Longitude'], new_data2['Borough'], new_data2['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto

Using k-means clustering with k = 4, I group the Toronto neighborhoods by their latitude and longitude.

In [39]:
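k = 4
# Cluster on the coordinates only, so drop the non-numeric columns.
toronto_clustering = new_data2.drop(columns=['Postal Code', 'Borough', 'Neighbourhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# Store each neighborhood's cluster label (0 to k-1) as the first column.
new_data2.insert(0, 'KMeans Clustering Labels', kmeans.labels_)
new_data2.head()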

|   | KMeans Clustering Labels | Cluster Labels | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 2 | 0 | 1 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | 0 | 1 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | 0 | 1 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | 0 | 1 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | 1 | 3 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
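
After fitting, it can help to look at how many points fall in each cluster and where the centers lie. The sketch below is not the notebook's run: it assumes only the five coordinate pairs shown in the table above and uses `n_clusters=2` so that the tiny sample has a clear split. The names `points`, `model`, `sizes`, and `centers` are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# The five coordinate pairs shown in the table above.
points = pd.DataFrame({
    "Latitude": [43.654260, 43.662301, 43.657162, 43.651494, 43.676357],
    "Longitude": [-79.360636, -79.389494, -79.378937, -79.375418, -79.293031],
})

# Two clusters are an assumption for this tiny sample (the notebook uses k=4).
model = KMeans(n_clusters=2, random_state=0, n_init=10).fit(points)

# Number of points assigned to each cluster label.
sizes = pd.Series(model.labels_).value_counts().sort_index()

# Cluster centers, in the same column order as the input features.
centers = pd.DataFrame(model.cluster_centers_, columns=["Latitude", "Longitude"])

print(sizes)
print(centers)
```

Note that k-means here measures plain Euclidean distance in degrees, which only approximates distance on the ground, since a degree of longitude is shorter than a degree of latitude at Toronto's latitude.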


The following map displays the neighborhoods again, this time colored by cluster. Each color represents one cluster.

In [45]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# One distinct hex color per cluster, sampled evenly from the rainbow colormap.
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(c) for c in colors_array]

# Labels run from 0 to k-1, so they index the color list directly.
for lat, lon, neighbourhood, cluster in zip(new_data2['Latitude'], new_data2['Longitude'], new_data2['Neighbourhood'], new_data2['KMeans Clustering Labels']):
    label = folium.Popup('Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
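
The color assignment can be checked without drawing a map. This is a minimal sketch, assuming `k = 4` as in the clustering cell; `palette` and `label_to_color` are illustrative names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 4  # number of clusters, as in the clustering cell

# Sample k evenly spaced RGBA colors from the rainbow colormap
# and convert each one to a '#rrggbb' hex string.
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

# Map each 0-based cluster label to its color.
label_to_color = {label: palette[label] for label in range(k)}
print(label_to_color)
```

Because the k-means labels start at 0, they can index the color list directly with no offset.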