# Clustering Toronto Neighborhoods

Assignment for the __Applied Data Capstone Project__

_by Marco N_

------

## Notebook contents

1. <a href="#Part-1">Part 1</a>: Building the Neighborhood DataFrame

2. <a href="#Part-2">Part 2</a>: Adding geo-coordinates

3. <a href="#Part-3">Part 3</a>: Clustering and Visualizing the Neighborhoods

-----

## Part 1

### Building the Neighborhood DataFrame

We'll use pandas' built-in __read_html__ function to download the Wikipedia table into a DataFrame. It returns a list with one DataFrame per table found on the page.

In [1]:
#!pip3 install lxml # Html parser
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np

In [2]:
dfs = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

In [3]:
df = dfs[0].copy() # The output of read_html is a list, so we'll be selecting the first item as our df
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Regent Park / Harbourfront


### Cleaning the DataFrame

Now we'll begin the __data wrangling__ phase and produce the DataFrame requested by the assignment.

In [4]:
df.drop(df[df['Borough'] == 'Not assigned'].index.values, inplace=True) # Dropping rows with a borough that is Not assigned

df['Neighborhood'] = df['Neighborhood'].str.replace(' / ', ', ') # Replacing the default '/' notation with commas as requested by the assignment

df.reset_index(drop=True, inplace=True) # Resetting the index for convenience

df

Unnamed: 0,Postal code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business reply mail Processing CentrE
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [5]:
df.shape

(103, 3)
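
The same cleaning steps can also be written as a single chain without `inplace=True`, using a boolean mask to filter the rows. The following is only a minimal sketch on a small hand-made DataFrame: the `raw` and `clean` names and the sample rows are illustrative assumptions, not the Wikipedia table itself.

```python
import pandas as pd

# Toy data mimicking the structure of the Wikipedia table (illustrative rows only)
raw = pd.DataFrame({
    'Postal code': ['M1A', 'M3A', 'M5A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto'],
    'Neighborhood': [None, 'Parkwoods', 'Regent Park / Harbourfront'],
})

clean = (
    raw[raw['Borough'] != 'Not assigned']  # keep only rows with an assigned borough
    .assign(Neighborhood=lambda d: d['Neighborhood'].str.replace(' / ', ', ', regex=False))  # '/' -> ','
    .reset_index(drop=True)  # renumber the remaining rows from 0
)

print(clean)
```

Chaining leaves `raw` untouched, which makes it easier to re-run the cell without downloading the table again.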

------

## Part 2

### Adding geo-coordinates

We'll be using the __Geocoder__ package to get the latitude and the longitude coordinates of each neighborhood.

In [8]:
#!pip3 install geocoder
import geocoder

After some tests, I found that the __ArcGIS__ provider was the most reliable one, so we'll use it instead of Google.

In [43]:
g = geocoder.arcgis('M5A, Canada')

print('Latitude:', g.latlng[0])
print('Longitude:', g.latlng[1])

Latitude: 43.65096410900003
Longitude: -79.35304116399999


We'll set up a __for loop__ to populate two new __Latitude__ and __Longitude__ lists.

In [None]:
# Initialize the two empty lists
lat = []
long = [] 

# Populate the lists with the geocoded latitudes and longitudes
for row in df.index.values:
    g = geocoder.arcgis('{}, Canada'.format(df.loc[row,'Postal code']))
    lat.append(g.latlng[0])
    long.append(g.latlng[1])
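
The loop above assumes that every lookup succeeds on the first try. As a hedged sketch, a small helper could retry a failed lookup a few times and fall back to `None` values instead of raising an error. The names `fetch_latlng`, `retries`, `pause` and `fake_geocode` are hypothetical, and the sketch assumes that a failed lookup leaves `latlng` empty or `None`. The geocoding function is passed in as a parameter, so the example runs offline with a stand-in provider; in this notebook it would presumably be called as `fetch_latlng('M5A, Canada', geocoder.arcgis)`.

```python
import time
from types import SimpleNamespace

def fetch_latlng(query, geocode, retries=3, pause=1.0):
    """Return (lat, lng) for `query`, or (None, None) if every attempt fails."""
    for attempt in range(retries):
        if attempt > 0:
            time.sleep(pause)  # wait briefly before retrying
        result = geocode(query)
        latlng = getattr(result, 'latlng', None)
        if latlng:  # a non-empty [lat, lng] pair
            return latlng[0], latlng[1]
    return None, None

# Stand-in for a real provider so the sketch runs without network access (assumption)
def fake_geocode(query):
    found = query.startswith('M5A')
    return SimpleNamespace(latlng=[43.65, -79.35] if found else None)

print(fetch_latlng('M5A, Canada', fake_geocode, pause=0))
print(fetch_latlng('XXX, Canada', fake_geocode, retries=2, pause=0))
```

Returning `None` for failed lookups keeps the lists the same length as the DataFrame, so missing coordinates show up as empty cells rather than stopping the loop.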

In [67]:
# Check if the geocoder has fetched the coordinates for each postal code

if df.shape[0] == len(lat) == len(long):
    print('Ok, we fetched the coordinates for each of the {} postal codes'.format(df.shape[0]))
else:
    print('Oops, something went wrong. df contains {} rows, lat contains {} items and long contains {} items'.format(df.shape[0], len(lat), len(long)))
    

Ok, we fetched the coordinates for each of the 103 postal codes


Now we can use the two lists to populate two new DataFrame columns.

In [68]:
df['Latitude'] = lat
df['Longitude'] = long
df.head()

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939


-----

## Part 3

### Clustering the neighborhoods

Each neighborhood belongs to only one borough, so the __boroughs__ will be our __clusters__.

Since the boroughs are represented by a categorical feature, we can use the __label encoding__ technique to transform it to a numerical feature.

In [82]:
df["Borough_num"] = df["Borough"].astype('category') # Create a new column Borough_num as a categorical feature

df["Borough_num"] = df["Borough_num"].cat.codes # Convert the new categorical feature to a numerical one

df

Unnamed: 0,Postal code,Borough,Neighborhood,Latitude,Longitude,Borough_num
0,M3A,North York,Parkwoods,43.752935,-79.335641,6
1,M4A,North York,Victoria Village,43.728102,-79.311890,6
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041,1
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211,6
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.661790,-79.389390,1
...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653340,-79.509766,4
99,M4Y,Downtown Toronto,Church and Wellesley,43.666659,-79.381472,1
100,M7Y,East Toronto,Business reply mail Processing CentrE,43.648700,-79.385450,2
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.632798,-79.493017,4
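
Because `cat.codes` only stores integers, it can be handy to keep the code-to-borough lookup around, for instance to label a legend later. The following is a minimal sketch on a toy Series; the sample values and the `code_to_borough` name are illustrative assumptions. For string values, pandas orders the categories alphabetically, so the codes follow the sorted borough names.

```python
import pandas as pd

# Toy borough column (illustrative values only)
boroughs = pd.Series(['North York', 'Downtown Toronto', 'North York', 'Etobicoke'])
as_category = boroughs.astype('category')

codes = as_category.cat.codes  # integer code for each row
code_to_borough = dict(enumerate(as_category.cat.categories))  # code -> borough name

print(codes.tolist())
print(code_to_borough)
```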


Let's use __matplotlib__ to build a color array so that each cluster is displayed in a different color.

In [83]:
import matplotlib.cm as cm
import matplotlib.colors as colors

borough_total = len(df['Borough'].unique()) # The total number of clusters

# Build the color array: one evenly spaced rainbow color per borough, converted to hex strings
colors_array = cm.rainbow(np.linspace(0, 1, borough_total))
rainbow = [colors.rgb2hex(i) for i in colors_array]

### Visualizing the neighborhoods

We can now visualize the neighborhood data on a map using the __Folium__ library.

In [71]:
import folium

Getting the coordinates for the map center (the city of Toronto):

In [72]:
g = geocoder.arcgis('Toronto, Canada')

latitude = g.latlng[0]
longitude = g.latlng[1]

Building the map:

In [94]:
toronto_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# adding markers
for lat, lng, borough, neighborhood, bor_num in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood'], df['Borough_num']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[bor_num], # Using the color array we built before to display each borough with a different color
        fill=True,
        fill_color=rainbow[bor_num],
        fill_opacity=0.7,
    ).add_to(toronto_map)
    
toronto_map