# Project Segmenting and Clustering Neighborhoods in Toronto

João Martins | __Assignment 2__

Date: 13/MAR/2021

## TASK 1

__FIRST:__ Define the URL of the Wikipedia page that holds the data and request the page's HTML.<br>__SECOND:__ Use the BeautifulSoup package to parse the page content and select the table.
<br>
<br>

In [1]:
# Import all the libraries required for this project
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
html_data = requests.get(url)

In [3]:
soup = BeautifulSoup(html_data.text, "html.parser")

In [4]:
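from io import StringIO

# Collect every <table> element and let pandas parse the first one;
# wrapping the markup in StringIO avoids the literal-HTML deprecation warning in recent pandas
table = soup.find_all('table')
toronto_postal_data = pd.read_html(StringIO(str(table)))[0]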
toronto_postal_data.columns = ['Postal Code', 'Borough', 'Neighborhood']
toronto_postal_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [5]:
toronto_postal_data.shape

(180, 3)

<br>

__THIRD:__ Identify the rows whose Borough is _Not assigned_ and drop them.
<br>
<br>

In [6]:
toronto_postal_data = toronto_postal_data[toronto_postal_data["Borough"].str.contains("Not assigned")==False]

In [7]:
toronto_postal_data.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"



__Fourth:__ Verify whether any cells in Neighborhood contain _Not assigned_. Since the filtered dataframe below is empty, no cell in the Neighborhood column holds a Not assigned value.




In [8]:
Toronto_Neighborhood_notassigned = toronto_postal_data[toronto_postal_data["Neighborhood"].str.contains("Not assigned")==True]
Toronto_Neighborhood_notassigned.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
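
A count of exact matches gives the same answer as the empty frame above without relying on a visual check. The sketch below uses a small hypothetical frame (`sample` and its rows are illustrative, not the scraped table):

```python
import pandas as pd

# Hypothetical sample rows, not the scraped Wikipedia table
sample = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Not assigned", "Regent Park, Harbourfront"],
})

# Boolean mask of rows whose Neighborhood is exactly "Not assigned"
mask = sample["Neighborhood"].eq("Not assigned")

# Number of affected rows; 0 would mean the column is clean
n_unassigned = int(mask.sum())
print(n_unassigned)
```

On the real table the same count would be expected to be 0, matching the empty output of cell 8.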


In [9]:
toronto_postal_data.describe()

Unnamed: 0,Postal Code,Borough,Neighborhood
count,103,103,103
unique,103,11,99
top,M1E,North York,Downsview
freq,1,24,4



__Fifth:__ Rows that share the same _Postal Code_ are grouped together, with their neighborhoods separated by commas.

In [10]:
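# Note: the grouped result is not assigned, so toronto_postal_data is left unchanged.
# All 103 postal codes are already unique (see describe() above), so no rows would be combined.
toronto_postal_data.groupby('Postal Code', as_index=False).agg(lambda x: ', '.join(set(x.astype(str))))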
toronto_postal_data.head(20)

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
8,M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
9,M1B,Scarborough,"Malvern, Rouge"
11,M3B,North York,Don Mills
12,M4B,East York,"Parkview Hill, Woodbine Gardens"
13,M5B,Downtown Toronto,"Garden District, Ryerson"
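
To illustrate what the grouping step does when a postal code really is repeated, here is a minimal sketch on hypothetical rows (`rows` and `grouped` are illustrative names, and the duplicated `M5A` entry is made up for the example). It assigns the result, since `groupby` returns a new frame, and joins the values in their original order rather than through a `set`, which does not preserve order.

```python
import pandas as pd

# Hypothetical rows in which one postal code appears twice
rows = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# groupby returns a new frame, so the result has to be assigned
grouped = rows.groupby("Postal Code", as_index=False).agg({
    "Borough": "first",                          # same borough within a code
    "Neighborhood": lambda s: ", ".join(s),      # keep the original row order
})
print(grouped)
```

The grouped frame has one row per postal code, sorted by the code.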


In [11]:
toronto_postal_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 2 to 178
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Postal Code   103 non-null    object
 1   Borough       103 non-null    object
 2   Neighborhood  103 non-null    object
dtypes: object(3)
memory usage: 8.2+ KB


In [12]:
toronto_postal_data.shape

(103, 3)

## TASK 2

### Google Maps Geocoding API

__FOURTH__ Read the data from the link provided and merge both tables.
<br>

In [13]:
path = "https://cocl.us/Geospatial_data"

df = pd.read_csv(path)
df.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [14]:
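# Caution: passing the Series df['Postal Code'] as `on` does not join on the shared
# column; as the output below shows, Postal Code_x and Postal Code_y differ in every
# displayed row. Joining with on='Postal Code' would align rows by postal code instead.
toronto_neighborhoods = pd.merge(toronto_postal_data, df, on=df['Postal Code'])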
toronto_neighborhoods.head()

Unnamed: 0,key_0,Postal Code_x,Borough,Neighborhood,Postal Code_y,Latitude,Longitude
0,M1B,M3A,North York,Parkwoods,M1B,43.806686,-79.194353
1,M1C,M4A,North York,Victoria Village,M1C,43.784535,-79.160497
2,M1E,M5A,Downtown Toronto,"Regent Park, Harbourfront",M1E,43.763573,-79.188711
3,M1G,M6A,North York,"Lawrence Manor, Lawrence Heights",M1G,43.770992,-79.216917
4,M1H,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",M1H,43.773136,-79.239476
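
For comparison, a key-based merge can be sketched on a few rows. The coordinates are the first three rows of the geospatial table above; the `Label` column and the frame names `left`, `coords` and `merged` are hypothetical, and the rows are deliberately listed in a different order in each frame.

```python
import pandas as pd

# Hypothetical left table, ordered differently from the coordinate table
left = pd.DataFrame({
    "Postal Code": ["M1E", "M1B", "M1C"],
    "Label": ["row for M1E", "row for M1B", "row for M1C"],
})

# First three rows of the geospatial table shown above
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M1E"],
    "Latitude": [43.806686, 43.784535, 43.763573],
    "Longitude": [-79.194353, -79.160497, -79.188711],
})

# Join on the shared column name; validate raises if a code is duplicated
merged = pd.merge(left, coords, on="Postal Code", how="left",
                  validate="one_to_one")
print(merged)
```

Because the join key is the column name, each row receives the coordinates of its own postal code regardless of row order, and only one `Postal Code` column appears in the result.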


In [15]:
# Drop both postal-code columns; key_0 (the merge key) stays in the frame
toronto_neighborhoods.drop(columns=['Postal Code_x','Postal Code_y'], inplace=True)
toronto_neighborhoods

Unnamed: 0,key_0,Borough,Neighborhood,Latitude,Longitude
0,M1B,North York,Parkwoods,43.806686,-79.194353
1,M1C,North York,Victoria Village,43.784535,-79.160497
2,M1E,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711
3,M1G,North York,"Lawrence Manor, Lawrence Heights",43.770992,-79.216917
4,M1H,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476
...,...,...,...,...,...
98,M9N,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.706876,-79.518188
99,M9P,Downtown Toronto,Church and Wellesley,43.696319,-79.532242
100,M9R,East Toronto,"Business reply mail Processing Centre, South C...",43.688905,-79.554724
101,M9V,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.739416,-79.588437


## TASK 3

### Explore and cluster the neighborhoods in Toronto

__FIFTH__ Determine the geographic coordinates of Toronto and render a map of the city. Then add a marker for each neighborhood to the map.

__SIXTH__ Create a new table containing only the latitude and longitude data, determine the cluster positions, and then show those clusters on the map of Toronto.

In [16]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(toronto_neighborhoods['Borough'].unique()),
        toronto_neighborhoods.shape[0]
    )
)

The dataframe has 11 boroughs and 103 neighborhoods.


In [17]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [18]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [None]:
Neighborhood = toronto_neighborhoods[['Neighborhood']]

In [None]:
borough = toronto_neighborhoods[['Borough']]

In [19]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto

In [20]:
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(toronto_neighborhoods['Latitude'], toronto_neighborhoods['Longitude'], toronto_neighborhoods['Borough'], toronto_neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

In [22]:
toronto_cluster = toronto_neighborhoods.drop(['key_0','Neighborhood', 'Borough'], axis=1)
toronto_cluster.head()

Unnamed: 0,Latitude,Longitude
0,43.806686,-79.194353
1,43.784535,-79.160497
2,43.763573,-79.188711
3,43.770992,-79.216917
4,43.773136,-79.239476


In [23]:
from sklearn.preprocessing import StandardScaler
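# Note: [:, 1:] drops the Latitude column, so only Longitude is scaled (and later clustered);
# toronto_cluster.values would keep both coordinates.
X = toronto_cluster.values[:,1:]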
X = np.nan_to_num(X)
Clus_dataSet = StandardScaler().fit_transform(X)
Clus_dataSet

array([[ 2.09777597],
       [ 2.44798852],
       [ 2.15613628],
       [ 1.86437197],
       [ 1.6310228 ],
       [ 1.6310228 ],
       [ 1.39772948],
       [ 1.16449306],
       [ 1.6310228 ],
       [ 1.3685726 ],
       [ 1.28110403],
       [ 1.04789553],
       [ 1.39772948],
       [ 0.96045696],
       [ 1.16449306],
       [ 0.81474393],
       [ 1.98106673],
       [ 0.3486083 ],
       [ 0.523382  ],
       [ 0.1156253 ],
       [ 0.23210904],
       [-0.1173008 ],
       [-0.1173008 ],
       [-0.0299605 ],
       [-0.46658445],
       [ 0.69818881],
       [ 0.465121  ],
       [ 0.58164715],
       [-0.46658445],
       [-0.9320953 ],
       [-0.6993678 ],
       [-1.13568453],
       [-1.01935285],
       [-1.28107896],
       [ 0.84388426],
       [ 0.90216906],
       [ 0.81474393],
       [ 1.07704414],
       [ 0.3486083 ],
       [ 0.49425098],
       [ 0.61078127],
       [ 0.465121  ],
       [ 0.84388426],
       [ 0.58164715],
       [ 0.08650566],
       ...
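
The single-column output above follows from the `[:, 1:]` slice, which keeps only Longitude. As a minimal sketch of what standardizing both coordinates would look like, the NumPy snippet below applies the same column-wise z-score that `StandardScaler` computes by default (mean 0, population standard deviation 1). It uses the first three coordinate pairs from the geospatial table; the names `coords` and `scaled` are illustrative.

```python
import numpy as np

# First three Latitude/Longitude pairs from the geospatial table above
coords = np.array([
    [43.806686, -79.194353],
    [43.784535, -79.160497],
    [43.763573, -79.188711],
])

# Column-wise z-score: subtract the mean and divide by the population
# standard deviation (ddof=0), the default behaviour of StandardScaler
scaled = (coords - coords.mean(axis=0)) / coords.std(axis=0)

print(scaled.shape)  # (3, 2): both columns are kept
```

Each column of `scaled` then has mean 0 and unit variance, so neither coordinate dominates a distance computation purely because of its scale.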

In [24]:
clusterNum = 5
k_means = KMeans(init = "k-means++", n_clusters = clusterNum, n_init = 12)
# Note: the model is fit on the unscaled X; Clus_dataSet from the previous cell is not used
k_means.fit(X)
labels = k_means.labels_
print(labels)

[1 1 1 1 1 1 1 2 1 1 2 2 1 2 2 2 1 4 2 4 4 4 4 4 0 2 2 2 0 0 0 0 0 3 2 2 2
 2 4 2 2 2 2 2 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 0 0 4
 0 4 0 4 4 0 0 0 0 0 0 4 3 2 0 3 0 0 3 3 3 3 3 3 3 3 3 3 3]
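
To make the assignment step behind these labels concrete, here is a minimal NumPy sketch of Lloyd's iterations, the basic procedure that k-means is built on. It is not scikit-learn's `KMeans` (there is no k-means++ initialization and no repeated restarts); the function name `lloyd_kmeans` and the choice of the first `k` points as starting centroids are assumptions made for the example. The six points are Latitude/Longitude pairs taken from the first and last rows of the merged table above, and both coordinates are used.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20):
    """Plain Lloyd's iterations; the first k points are the initial centroids."""
    centroids = points[:k].astype(float).copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members (unchanged if empty)
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Three eastern and three western coordinate pairs from the table above
points = np.array([
    [43.806686, -79.194353],
    [43.784535, -79.160497],
    [43.763573, -79.188711],
    [43.706876, -79.518188],
    [43.696319, -79.532242],
    [43.688905, -79.554724],
])

labels, centroids = lloyd_kmeans(points, k=2)
print(labels)
```

The three eastern points end up sharing one label and the three western points the other; which group is numbered 0 depends on the starting centroids, which is also why label numbers from `KMeans` can change between runs unless `random_state` is fixed.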


In [26]:
toronto_neighborhoods["Clus labels"] = labels
toronto_neighborhoods.head(5)

Unnamed: 0,key_0,Borough,Neighborhood,Latitude,Longitude,Clus labels
0,M1B,North York,Parkwoods,43.806686,-79.194353,1
1,M1C,North York,Victoria Village,43.784535,-79.160497,1
2,M1E,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711,1
3,M1G,North York,"Lawrence Manor, Lawrence Heights",43.770992,-79.216917,1
4,M1H,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476,1


In [27]:
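# numeric_only=True skips the text columns, which recent pandas versions refuse to average
toronto_neighborhoods.groupby('Clus labels').mean(numeric_only=True)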

Unnamed: 0_level_0,Latitude,Longitude
Clus labels,Unnamed: 1_level_1,Unnamed: 2_level_1
0,43.696416,-79.474234
1,43.764626,-79.224859
2,43.724206,-79.319248
3,43.686536,-79.553666
4,43.688108,-79.391963


In [28]:
toronto_neighborhoods.head()

Unnamed: 0,key_0,Borough,Neighborhood,Latitude,Longitude,Clus labels
0,M1B,North York,Parkwoods,43.806686,-79.194353,1
1,M1C,North York,Victoria Village,43.784535,-79.160497,1
2,M1E,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711,1
3,M1G,North York,"Lawrence Manor, Lawrence Heights",43.770992,-79.216917,1
4,M1H,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476,1


In [None]:
# The cluster labels were already added above as the 'Clus labels' column,
# so no further insert or join is needed; work on a copy of the labelled frame
toronto_merged = toronto_neighborhoods.copy()

toronto_merged.head() # check the last column!

In [31]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(clusterNum)
ys = [i + x + (i*x)**2 for i in range(clusterNum)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_neighborhoods['Latitude'], toronto_neighborhoods['Longitude'], toronto_neighborhoods['Neighborhood'], toronto_neighborhoods['Clus labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters