# Segmenting and Clustering Neighborhoods in Toronto

## Import and Clean the Data

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0; older versions use pandas.io.json)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Now let's load the data from the Wikipedia page using requests and pandas.

In [2]:
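from io import StringIO # lets read_html treat the downloaded HTML text as a file-like object

website_url = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M").text
df = pd.read_html(StringIO(website_url), header = 0)[0] # keep the first table on the page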

This is what the data looks like at the beginning:

In [3]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


We should first drop the rows whose borough is "Not assigned". We also want to check whether any remaining row has a borough but a "Not assigned" neighborhood.

In [4]:
df = df[df.Borough != "Not assigned"].copy() # copy to avoid a SettingWithCopyWarning on later assignment
df[df.Neighbourhood == "Not assigned"]

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Not assigned


Since only one row has a borough but a "Not assigned" neighborhood, we can simply use .at[] to replace that value with the borough name.

In [5]:
df.at[8, 'Neighbourhood'] = df.at[8, 'Borough'] # reuse the borough name so the capitalization matches
df.loc[[8]]

Unnamed: 0,Postcode,Borough,Neighbourhood
8,M7A,Queen's Park,Queen's Park
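To see the two cleaning steps in isolation, here is a minimal sketch on a made-up three-row frame. The names `toy`, `assigned`, and `missing` are illustrative and not part of the notebook. It also shows a vectorized alternative to .at[] that would still work if more than one row were missing a neighbourhood.

```python
import pandas as pd

# Toy data for illustration only; the rows mirror the three cases seen in the table.
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Keep only rows with an assigned borough; .copy() makes the result an independent frame.
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# Rows that have a borough but no neighbourhood name.
missing = assigned[assigned["Neighbourhood"] == "Not assigned"]

# Wherever the neighbourhood is missing, fall back to the borough name.
assigned["Neighbourhood"] = assigned["Neighbourhood"].mask(
    assigned["Neighbourhood"] == "Not assigned", assigned["Borough"]
)
print(assigned)
```

With a single affected row, .at[] is perfectly adequate; the mask-based form mainly avoids hard-coding the row label.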


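Now that the "Not assigned" values are cleaned up, we can group the rows by postal code and borough, joining the neighbourhood names with commas. Passing a list of (name, function) pairs to .agg() produces a two-level column header and moves the grouping keys into the index, so we drop the extra header level and reset the index.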

In [6]:
df = df.groupby(['Postcode', 'Borough']).agg([('Neighbourhood', ', '.join)])
df.columns = df.columns.droplevel(0)
df = df.reset_index()

After renaming the columns, we get the data in the form we wanted.

In [7]:
df.columns = ['PostalCode', 'Borough', 'Neighborhood']
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae


In [8]:
print("There are", df.shape[0], "rows and", df.shape[1], "columns in the data frame.")

There are 103 rows and 3 columns in the data frame.
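The grouping step can also be written more compactly. The sketch below uses a made-up frame (`toy` and `grouped` are illustrative names) and selects the single column before aggregating, which keeps the column header flat so no droplevel() call is needed.

```python
import pandas as pd

# Toy rows for illustration; one row per (postcode, neighbourhood) pair.
toy = pd.DataFrame({
    "Postcode": ["M1B", "M1B", "M1G"],
    "Borough": ["Scarborough", "Scarborough", "Scarborough"],
    "Neighbourhood": ["Rouge", "Malvern", "Woburn"],
})

# Select the column first so .agg() returns a Series with a plain name,
# then reset_index() turns the group keys back into ordinary columns.
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
       .agg(", ".join)
       .reset_index()
)
print(grouped)
```

Both forms should give the same table; this one simply skips the header cleanup.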


## Combine the Coordinate Data

The Geocoder package can look up the coordinates of a given postal code, but it is sometimes unreliable and slow. Instead, we read the coordinate data directly from a CSV file.

In [8]:
df_co = pd.read_csv("http://cocl.us/Geospatial_data")
df_co.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
df_co.columns = ['PostalCode', 'Latitude', 'Longitude']

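Now we join df and df_co on PostalCode using .merge(). By default this is an inner join, so only postal codes present in both frames are kept.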

In [10]:
df_tor = pd.merge(left = df, right = df_co, on = 'PostalCode')
df_tor.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
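Because an inner join drops unmatched rows silently, it can be worth checking that every postal code actually found its coordinates. The sketch below uses made-up frames (`neigh`, `coords`, and the postal code "M9Z" are hypothetical) together with the `how`, `validate`, and `indicator` options of pd.merge.

```python
import pandas as pd

# Toy frames for illustration; "M9Z" is a made-up code with no coordinates.
neigh = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# how="left" keeps every neighbourhood row,
# validate raises an error if a postal code is duplicated on either side,
# indicator adds a "_merge" column recording where each row came from.
merged = pd.merge(neigh, coords, on="PostalCode", how="left",
                  validate="one_to_one", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", unmatched)
```

If the list of unmatched codes is empty, the left join and the default inner join return the same rows.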


## Generate the Map

First, we need the coordinates of Toronto so that we can center the map on the city.

In [11]:
address = 'Toronto'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207.
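The geocoding call depends on an external service, and geopy's geocode() returns None when nothing is found. As a hedged fallback, the map could be centered on the mean of the postal-code coordinates instead. The helper `map_center` below is a hypothetical name, not part of the notebook or of geopy.

```python
import pandas as pd

def map_center(frame, geocoded=None):
    """Return (lat, lon): the geocoded point if given, else the mean of the frame's coordinates."""
    if geocoded is not None:
        return geocoded
    return (frame["Latitude"].mean(), frame["Longitude"].mean())

# Two rows copied from the coordinate table, for illustration.
sample = pd.DataFrame({
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# No geocoder result is supplied here, so the mean of the sample points is used.
lat, lon = map_center(sample)
print(lat, lon)
```

In the notebook this would be called with df_tor, passing (location.latitude, location.longitude) whenever the geocoder did return a result.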


We can now plot the neighborhoods as circle markers on the map.

In [12]:
Toronto_data = folium.Map(location=[latitude, longitude], zoom_start = 11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df_tor['Latitude'], df_tor['Longitude'], df_tor['Borough'], df_tor['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius = 5,
        popup = label,
        color = 'blue',
        fill = True,
        fill_color = '#3186cc',
        fill_opacity = 0.7).add_to(Toronto_data)

Toronto_data
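All markers above share one color. A small variation, sketched here under the assumption that distinguishing boroughs visually is useful, assigns one color per borough. The `PALETTE` list and the `borough_colors` helper are made up for illustration.

```python
# A made-up palette; any color names or hex codes the map library accepts would do.
PALETTE = ["blue", "green", "red", "purple", "orange", "darkred", "cadetblue", "gray"]

def borough_colors(boroughs):
    """Assign one palette color per distinct borough, cycling if the palette runs out."""
    distinct = sorted(set(boroughs))
    return {b: PALETTE[i % len(PALETTE)] for i, b in enumerate(distinct)}

color_of = borough_colors(["Scarborough", "North York", "Downtown Toronto", "Scarborough"])
print(color_of)
```

Inside the marker loop, `color = color_of[borough]` would then take the place of the fixed `'blue'`, with `color_of` built from `df_tor['Borough']`.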