# Segmenting and Clustering Neighborhoods in Toronto

In this notebook, I will explore and cluster neighborhoods in Toronto.

### Table of Contents
* [Part 1: Creating the dataframe from the Wikipedia page](#scrape)
* [Part 2: Adding latitude and longitude to our existing dataframe](#latlong)
* Part 3

#### Part 1: Creating the dataframe from the Wikipedia page <a id='scrape'></a>

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np

In [2]:
# I use pandas' read_html instead of Beautiful Soup because it is simpler
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', skiprows=1)
df = df[0] # read_html brings in a list of tables, we only want the first one

In [3]:
# assign column names to the dataframe
col = ['PostalCode', 'Borough', 'Neighborhood']
df.columns = col
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
# drop rows where the Borough is 'Not assigned'
df1 = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
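
As a small check on the filtering step, the sketch below applies the same boolean mask to three rows copied from the `df.head()` output above and counts how many rows were dropped. The names `toy`, `mask`, and `kept` are illustrative and are not part of the notebook.

```python
import pandas as pd

# Three rows taken from the df.head() output above
toy = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M4A'],
    'Borough': ['Not assigned', 'North York', 'North York'],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Victoria Village'],
})

# True for the rows we want to keep
mask = toy['Borough'] != 'Not assigned'

# reset_index(drop=True) renumbers the surviving rows from 0
kept = toy[mask].reset_index(drop=True)

print(f"dropped {(~mask).sum()} of {len(toy)} rows")
print(kept)
```

Without `reset_index(drop=True)` the kept rows would retain their original labels (1 and 2 here), which is easy to trip over later when indexing by position.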


In [5]:
# where the Neighborhood is 'Not assigned', use the Borough name instead
df1['Neighborhood'] = np.where(df1['Neighborhood'] == 'Not assigned', 
                               df1['Borough'], df1['Neighborhood'])
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights
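
The first five rows happen not to contain an unassigned neighborhood, so the output above looks unchanged. The sketch below shows the effect of `np.where` on toy data; the second row is an assumed example of an entry with a borough but no neighborhood name, not a row copied from the scraped table.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative): the second one has no neighborhood name
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M7A'],
    'Borough': ['Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Harbourfront', 'Not assigned'],
})

# np.where(condition, value_if_true, value_if_false) is applied element-wise
toy['Neighborhood'] = np.where(toy['Neighborhood'] == 'Not assigned',
                               toy['Borough'], toy['Neighborhood'])

print(toy)
```

After this step the second row carries the borough name in both columns, while the first row is untouched.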


In [6]:
# where a postal code has more than one neighborhood, join the names into a single row
df_final = df1.groupby(['PostalCode', 'Borough'], sort=False).agg(lambda x: ', '.join(x)).reset_index()
df_final.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
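
To make the aggregation easier to follow, here is a minimal sketch on three rows taken from the earlier outputs. It selects the `Neighborhood` column explicitly and passes `', '.join` directly instead of a lambda. This is an equivalent alternative for this data, not the code the notebook ran, and the names `toy` and `joined` are illustrative.

```python
import pandas as pd

# M5A appears twice, once per neighborhood
toy = pd.DataFrame({
    'PostalCode': ['M4A', 'M5A', 'M5A'],
    'Borough': ['North York', 'Downtown Toronto', 'Downtown Toronto'],
    'Neighborhood': ['Victoria Village', 'Harbourfront', 'Regent Park'],
})

# sort=False keeps the groups in order of first appearance
joined = (toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
             .agg(', '.join)
             .reset_index())

print(joined)
```

Selecting the column before aggregating makes it explicit that only the neighborhood names are being joined.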


To end Part 1, we check the shape of the cleaned dataframe.

In [7]:
df_final.shape

(103, 3)
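
The shape alone does not confirm that the cleaning worked, so a couple of extra checks can help. The sketch below is one possible way to do this. `check_cleaned` is a hypothetical helper that is not part of the notebook, and the toy frame reuses three rows from the output above.

```python
import pandas as pd

def check_cleaned(frame):
    """Hypothetical helper: return a list of problems found in a cleaned frame."""
    problems = []
    # After aggregation, each postal code should appear exactly once
    if frame['PostalCode'].duplicated().any():
        problems.append('duplicate postal codes')
    # No placeholder values should survive the cleaning steps
    if (frame[['Borough', 'Neighborhood']] == 'Not assigned').any().any():
        problems.append("'Not assigned' values remain")
    return problems

toy = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Harbourfront, Regent Park'],
})

print(check_cleaned(toy))  # an empty list means no problems were found
```

In the notebook, the same call could be applied to `df_final`.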

#### Part 2: Adding latitude and longitude to our existing dataframe <a id='latlong'></a>

<p>This section is shorter than expected because neither the geocoder nor the geopy package worked with our data. Fortunately, a CSV file with the latitude and longitude of each postal code was provided for this reason.</p>

In [8]:
# The first column must be named 'PostalCode' so it matches df_final for the merge
names = ['PostalCode', 'Latitude', 'Longitude']
coord = pd.read_csv('Data_Files/GeoSpatial_Coordinates.csv', 
                    names=names, header=0)
coord.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# To merge with a single 'on' key, the column must have the same name in both dataframes
df_complete = pd.merge(df_final, coord, on='PostalCode')
df_complete.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937


This is the dataframe we will be using to cluster neighborhoods.
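
One caveat about the merge step: `pd.merge` performs an inner join by default, so a postal code missing from the coordinates file would be dropped silently. The sketch below shows one way to surface such rows using the `how`, `indicator`, and `validate` parameters of `pd.merge`. The frames are toy data, and `'M0X'` is a made-up postal code used only to trigger a mismatch.

```python
import pandas as pd

left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M0X'],  # 'M0X' is a made-up code
    'Borough': ['North York', 'North York', 'Nowhere'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# how='left' keeps every row of `left`; indicator=True adds a '_merge' column;
# validate='one_to_one' raises if a postal code is duplicated on either side
checked = pd.merge(left, right, on='PostalCode', how='left',
                   indicator=True, validate='one_to_one')

unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

If the list of unmatched codes is empty, the default inner join and the left join give the same rows, and the `_merge` column can be dropped.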