# Segmenting and Clustering Neighborhoods in Toronto

In [1]:
import os
import requests
import json

from dotenv import load_dotenv
import pandas as pd
from pandas import json_normalize  # top-level import; pandas.io.json.json_normalize is deprecated since pandas 1.0


## Section 1 - Scrape Toronto neighborhoods and postal codes

Because the Wikipedia page contains a single table, loading it with the Pandas read_html() function is straightforward. The function returns a list of dataframes (one item here), so we take the first element.

In [2]:
URL = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df_neighborhoods = pd.read_html(URL)[0]
df_neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


Filter out the rows whose Borough is 'Not assigned'.

In [3]:
df_neighborhoods = df_neighborhoods[df_neighborhoods['Borough'] != 'Not assigned']
df_neighborhoods.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [4]:
df_neighborhoods.shape

(103, 3)
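
As a quick sanity check on the filtering step, the sketch below applies the same boolean mask to a small hand-built dataframe (a stand-in for the scraped table, using a few rows from the output above) and confirms that no 'Not assigned' boroughs or duplicate postal codes remain. The reset_index call is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-in for the scraped table; rows copied from the output shown above.
df = pd.DataFrame({
    "Postal Code": ["M1A", "M2A", "M3A", "M4A"],
    "Borough": ["Not assigned", "Not assigned", "North York", "North York"],
    "Neighbourhood": ["Not assigned", "Not assigned", "Parkwoods", "Victoria Village"],
})

# Keep only rows with an assigned borough (same mask as in the notebook).
mask = df["Borough"] != "Not assigned"
filtered = df[mask].reset_index(drop=True)  # optional: renumber rows from 0

# Checks: no unassigned boroughs left, and each postal code appears once.
n_unassigned = int((filtered["Borough"] == "Not assigned").sum())
has_duplicates = bool(filtered["Postal Code"].duplicated().any())
print(filtered.shape, n_unassigned, has_duplicates)
```

Running the same two checks on df_neighborhoods would confirm the filter before moving on to the merge.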

## Section 2 - Merge geospatial coordinates

The simplest option is to load the Geospatial_Coordinates CSV file and use the Pandas merge function to join the two tables on the Postal Code column.

First, having downloaded the file, we load it into a dataframe.

In [5]:
path = os.path.join(os.path.abspath('../data'), 'Geospatial_Coordinates.csv')
df_geodata = pd.read_csv(path)
df_geodata.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


Then perform the merge.

In [6]:
df_complete = df_neighborhoods.merge(df_geodata, how='inner', on='Postal Code')
df_complete.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494


Check the number of rows; it should still be 103.

In [7]:
df_complete.shape

(103, 5)
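
Comparing row counts catches dropped rows, but an inner join would silently discard any postal code missing from the coordinates file. A minimal sketch of a stricter check, assuming small toy dataframes in place of df_neighborhoods and df_geodata: a left join with the merge function's validate and indicator parameters reports duplicate keys and unmatched postal codes explicitly.

```python
import pandas as pd

# Toy stand-ins for df_neighborhoods and df_geodata (values from the outputs above).
left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A", "M1B"],
    "Latitude": [43.753259, 43.725882, 43.65426, 43.806686],
    "Longitude": [-79.329656, -79.315572, -79.360636, -79.194353],
})

# validate raises MergeError if a postal code is duplicated on either side;
# indicator adds a "_merge" column showing where each row came from.
checked = left.merge(
    right, how="left", on="Postal Code", validate="one_to_one", indicator=True
)

# Postal codes present in the neighborhood table but absent from the coordinates.
missing = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()
merged = checked.drop(columns="_merge")
print(missing, merged.shape)
```

An empty missing list means the left join and the inner join used above give the same rows.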

## Section 3 - Explore and cluster neighborhoods

The next cell loads the Foursquare API credentials from environment variables saved on my local machine. This keeps the credentials confidential, so they are not published to the GitHub repository. Anyone who clones the repository can load their own credentials in their local environment.

In [8]:
load_dotenv()
CLIENT_ID = os.getenv("CLIENT_ID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
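
Note that os.getenv returns None when a variable is missing, so a misconfigured environment would only surface later as a failed API request. A minimal sketch of a stricter loader follows; require_env is a hypothetical helper name, not part of the notebook or of python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Hypothetical helper: return the variable's value, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value
```

After load_dotenv(), the credentials could then be read with CLIENT_ID = require_env("CLIENT_ID") and CLIENT_SECRET = require_env("CLIENT_SECRET"), which fails immediately with a clear message if the .env file lacks either entry.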