# Segmenting and Clustering Neighborhoods in Toronto

## Part 1: Scraping Toronto neighborhood data from Wikipedia

Start off by importing the necessary libraries.

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Scraping the Wikipedia table


Use Python's BeautifulSoup library to parse the Wikipedia page of Toronto's postal codes.

In [4]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url, timeout=30)  # fetch the wiki page
response.raise_for_status()  # stop early if the request failed
soup = BeautifulSoup(response.text, 'lxml')  # parse the HTML text (needs the lxml package)
# print(soup.prettify()) # prettify shows how the HTML tags are nested in the document

The table containing the postal codes has the CSS class `wikitable sortable`. Isolate that table with `find`.

In [22]:
my_table = soup.find('table', {'class': 'wikitable sortable'})
# my_table
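
The dictionary form of `find` compares against the whole `class` string, so it depends on the class names appearing in exactly that order. The sketch below uses a made-up HTML snippet (not the real Wikipedia markup) and the built-in `html.parser` to compare it with a CSS selector, which matches the classes in any order. The names `demo_html` and `demo_soup` are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up markup (an assumption): same classes as the wiki table, reversed order
demo_html = '<table class="sortable wikitable"><tr><td>M1A</td></tr></table>'
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# Exact-string match on the class attribute: fails when the order differs
exact_match = demo_soup.find('table', {'class': 'wikitable sortable'})

# CSS selector: requires both classes, in any order
selector_match = demo_soup.select_one('table.wikitable.sortable')

print(exact_match)                 # None
print(selector_match is not None)  # True
```

If the page markup ever changes the class order or adds another class, the selector form is the more forgiving of the two.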

Break the table into rows and extract the _Postcode_, _Borough_, and _Neighbourhood_ values into separate lists.

In [6]:
postcodes = []
boroughs = []
neighborhoods = []

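# save the scraped values into lists; rows without exactly three <td> cells
# (such as the header row) are skipped
for row in my_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 3:
        # string=True is the current name for the older text=True argument
        postcodes.append(cells[0].find(string=True))
        boroughs.append(cells[1].find(string=True))
        neighborhoods.append(cells[2].find(string=True))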
        
# show some sample values
print(postcodes[:5])
print(boroughs[:5])
print(neighborhoods[:5])

['M1A', 'M2A', 'M3A', 'M4A', 'M5A']
['Not assigned', 'Not assigned', 'North York', 'North York', 'Downtown Toronto']
['Not assigned\n', 'Not assigned\n', 'Parkwoods', 'Victoria Village', 'Harbourfront']
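
As a small illustration of the row loop above, the following sketch runs the same logic on a hand-written table. The sample markup and the helper `extract_rows` are hypothetical, and the built-in `html.parser` is used so the snippet does not need `lxml`. It also shows `get_text(strip=True)` as one possible way to drop the trailing newline at extraction time.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real Wikipedia page)
sample_html = (
    '<table class="wikitable sortable">'
    '<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>'
    '<tr><td>M1A</td><td>Not assigned</td><td>Not assigned\n</td></tr>'
    '<tr><td>M3A</td><td><a href="#">North York</a></td><td>Parkwoods</td></tr>'
    '</table>'
)


def extract_rows(table, strip=False):
    """Return one (postcode, borough, neighborhood) tuple per 3-cell data row."""
    rows = []
    for row in table.find_all('tr'):
        cells = row.find_all('td')  # the header row holds <th> cells, so this is empty
        if len(cells) == 3:
            if strip:
                # get_text(strip=True) removes surrounding whitespace, including "\n"
                rows.append(tuple(c.get_text(strip=True) for c in cells))
            else:
                # first text node of each cell, as in the loop above
                rows.append(tuple(str(c.find(string=True)) for c in cells))
    return rows


sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')
raw_rows = extract_rows(sample_table)
clean_rows = extract_rows(sample_table, strip=True)
print(raw_rows[0])
print(clean_rows[0])
```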


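Create a new dataframe to store the values, then clean the entries: strip the trailing newlines, drop rows with no assigned borough, fill unassigned neighborhoods with the borough name, and merge rows that share a postal code.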

In [7]:
# fill dataframe with data
col_names = ['PostCode', 'Borough', 'Neighborhood']
df = pd.DataFrame({'PostCode':postcodes, 'Borough':boroughs, 'Neighborhood':neighborhoods}, columns=col_names)

# some neighborhood names contain an extra "\n" at the end; keep only the text before it
df['Neighborhood'] = df['Neighborhood'].map(lambda x: x.split('\n')[0])

# drop every row that has no assigned value in the Borough column
df.drop(df[df['Borough'] == 'Not assigned'].index, axis=0, inplace=True)

# replace "Not assigned" entries in the Neighborhood column with the row's corresponding borough value
# (a single .loc call avoids chained assignment, which may fail to update df)
no_neighb_idx = (df['Neighborhood'] == 'Not assigned')
df.loc[no_neighb_idx, 'Neighborhood'] = df.loc[no_neighb_idx, 'Borough']

# combine rows with the same postal code into one row, with the neighborhoods separated by commas
df = df.groupby(['PostCode', 'Borough'], sort=False)['Neighborhood'].apply(', '.join).reset_index()

df.head(20)

| | PostCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor |
| 4 | M7A | Queen's Park | Queen's Park |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Rouge, Malvern |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District |
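
To make the effect of the cleaning steps easier to see, here is a sketch on a four-row toy frame built from values that appear in the outputs above. The frame `toy` is illustrative only and is not part of the scraped data.

```python
import pandas as pd

# Toy input (an assumption): one unassigned borough, one postal code split
# across two rows, and one borough with no assigned neighborhood
toy = pd.DataFrame({
    'PostCode': ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. drop rows whose borough is unassigned
toy = toy[toy['Borough'] != 'Not assigned'].copy()

# 2. fall back to the borough name where the neighborhood is unassigned
mask = toy['Neighborhood'] == 'Not assigned'
toy.loc[mask, 'Neighborhood'] = toy.loc[mask, 'Borough']

# 3. merge rows that share a postal code into one comma-separated entry
toy = (toy.groupby(['PostCode', 'Borough'], sort=False)['Neighborhood']
          .agg(', '.join)
          .reset_index())

print(toy)
```

After these steps each postal code appears once, which is what the later join on _PostCode_ relies on.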


In [8]:
print('The number of samples in this dataset is {}'.format(df.shape[0]))

The number of samples in this dataset is 103


## Part 2: Add geospatial coordinates to the Toronto neighborhood data

To use Foursquare location data with the neighborhood data above, we first need the geographic coordinates (latitude and longitude) of each postal code.

Download the coordinates as a CSV file.

In [9]:
!wget -q -O 'geospatial_data.csv' http://cocl.us/Geospatial_data # -q = quiet, -O = output name

In [18]:
geo_data = pd.read_csv('geospatial_data.csv')
geo_data.rename(columns={'Postal Code':'PostCode'}, inplace=True)
geo_data.head()

| | PostCode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
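
The load-and-rename step can also be wrapped in a small helper. This is a sketch only: `load_geo_data` is a hypothetical name, and the in-memory sample reuses two rows from the output above so that the snippet runs without a download.

```python
import io

import pandas as pd


def load_geo_data(source):
    """Read the coordinates CSV and rename 'Postal Code' to 'PostCode'.

    `source` can be anything pandas.read_csv accepts: a path, a URL,
    or a file-like object.
    """
    return pd.read_csv(source).rename(columns={'Postal Code': 'PostCode'})


# In-memory stand-in for geospatial_data.csv (two rows copied from the output above)
sample_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.806686,-79.194353\n"
    "M1C,43.784535,-79.160497\n"
)
geo_sample = load_geo_data(sample_csv)
print(geo_sample.columns.tolist())
```

Because `read_csv` also accepts URLs, calling `load_geo_data('http://cocl.us/Geospatial_data')` might replace the `wget` step, assuming the link still resolves.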


Join the geospatial data (`geo_data`) with the Toronto neighborhood data (`df`) on the _PostCode_ column to form a single dataframe.

In [21]:
df2 = df.join(geo_data.set_index('PostCode'), on='PostCode')
df2.head()

| | PostCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Queen's Park | 43.662301 | -79.389494 |
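
An alternative to `join` is `merge`, which can also check that every postal code matches at most one coordinate row. The sketch below is one possible way to do that, on two toy frames whose values are copied from the outputs above. The names `hoods`, `coords`, and `missing` are illustrative.

```python
import pandas as pd

# Toy stand-ins for df and geo_data (values copied from the outputs above)
hoods = pd.DataFrame({
    'PostCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Victoria Village'],
})
coords = pd.DataFrame({
    'PostCode': ['M4A', 'M3A'],  # deliberately in a different order
    'Latitude': [43.725882, 43.753259],
    'Longitude': [-79.315572, -79.329656],
})

# A left merge keeps every neighborhood row; validate raises MergeError
# if a postal code is duplicated on either side
merged = hoods.merge(coords, on='PostCode', how='left', validate='one_to_one')

# Rows that found no coordinates would show NaN in Latitude
missing = merged[merged['Latitude'].isna()]
print(merged)
print('rows without coordinates:', len(missing))
```

Checking `missing` after the merge is a cheap way to catch postal codes that are absent from the coordinates file before they reach any later analysis.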
