# Toronto Neighborhood Clustering

Import the necessary libraries and fetch the Wikipedia page that lists the postal codes.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

r = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(r.content,"lxml")

Save the first table on the page as a variable.

In [2]:
table = soup.find_all('table')[0]

Construct a dictionary from the table containing the postal codes, boroughs, and neighborhoods.

In [3]:
postal_code = []
borough = []
neighborhood = []
for row in table.find_all('tr'):
    i = 0
    for col in row.find_all('td'):
        if i == 0:
            postal_code += [col.get_text()]
        elif i == 1:
            borough += [col.get_text()]
        else:
            neighborhood += [col.get_text()[:-1]]
        i += 1

d = {'PostalCode': postal_code, 'Borough': borough, 'Neighborhood': neighborhood}
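As a cross-check on the row loop above, here is a minimal sketch that parses a small, made-up HTML table with the same three-column layout. The `SAMPLE_HTML` string and the `parse_rows` helper are assumptions for illustration and are not part of the original notebook. The sketch uses `strip()` rather than slicing off the last character, so it does not depend on which cells end in a newline.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page content).
SAMPLE_HTML = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

def parse_rows(html):
    """Return one (postal_code, borough, neighborhood) tuple per data row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find_all("table")[0]
    rows = []
    for tr in table.find_all("tr"):
        # strip() removes surrounding whitespace, including trailing newlines
        cells = [td.get_text().strip() for td in tr.find_all("td")]
        if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
            rows.append(tuple(cells))
    return rows

rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Collecting whole rows as tuples keeps the three values of a row together, which avoids the three lists drifting out of step if a row is malformed.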

Create the data frame.

In [4]:
df = pd.DataFrame.from_dict(d)

Remove rows with a missing borough.

In [5]:
df = df[df['Borough'] != 'Not assigned']

Reindex the data frame.

In [6]:
df.index = range(df.shape[0])

Create a new dictionary that combines neighborhoods sharing a postal code. The loop also replaces any neighborhood marked 'Not assigned' with its borough name.

In [7]:
d2 = {'PostalCode': [], 'Borough': [], 'Neighborhood': []}
for i in range(df.shape[0]):
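    if df.loc[i,'PostalCode'] in d2['PostalCode']:
        # append to the entry for this postal code, which may not be the last one added
        idx = d2['PostalCode'].index(df.loc[i,'PostalCode'])
        d2['Neighborhood'][idx] += ', ' + df.loc[i,'Neighborhood']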
    else:
        d2['PostalCode'] += [df.loc[i,'PostalCode']]
        d2['Borough'] += [df.loc[i,'Borough']]
        if df.loc[i,'Neighborhood'] == 'Not assigned':
            d2['Neighborhood'] += [df.loc[i,'Borough']]
        else:
            d2['Neighborhood'] += [df.loc[i,'Neighborhood']]
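The same combining step can also be sketched with a pandas `groupby`, which may be easier to read than the manual loop. This is an alternative under stated assumptions, not the notebook's method: the `sample` frame and the `combine_neighborhoods` helper are hypothetical, and unlike the loop it replaces every 'Not assigned' neighborhood with the borough, including those in duplicate rows.

```python
import pandas as pd

# Hypothetical sample rows in the same three-column layout as df.
sample = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M7A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park", 'North York'],
    'Neighborhood': ['Harbourfront', 'Regent Park', 'Not assigned', 'Parkwoods'],
})

def combine_neighborhoods(frame):
    """Join neighborhoods that share a postal code into one comma-separated string."""
    out = frame.copy()
    # Fall back to the borough name where the neighborhood is not assigned.
    unassigned = out['Neighborhood'] == 'Not assigned'
    out.loc[unassigned, 'Neighborhood'] = out.loc[unassigned, 'Borough']
    # sort=False keeps the groups in order of first appearance.
    return (out.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
               .agg(', '.join)
               .reset_index())

combined = combine_neighborhoods(sample)
print(combined)
```

Because `reset_index()` returns the grouping keys as ordinary columns, the result already has the `PostalCode`, `Borough`, `Neighborhood` column order.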

Build a new data frame from the new dictionary.

In [8]:
df2 = pd.DataFrame.from_dict(d2)
df2.index = range(df2.shape[0])

Reorder columns.

In [9]:
df2 = df2[['PostalCode','Borough','Neighborhood']]

Import geocoder.

In [10]:
import geocoder

Loop to get the latitude/longitude coordinates for each postal code.

In [11]:
latitude = []
longitude = []

for m in df2['PostalCode']:
    # start with no coordinates for this postal code
    lat_lng_coords = None

    # retry until the geocoder returns coordinates
    # (the key is left blank here; without a valid key this loop never ends)
    while lat_lng_coords is None:
        g = geocoder.bing('{}, Toronto, Ontario'.format(m), key='')
        lat_lng_coords = g.latlng

    latitude.append(lat_lng_coords[0])
    longitude.append(lat_lng_coords[1])
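The loop above retries without limit, so it hangs if the geocoder never returns a result. A minimal sketch of a bounded retry follows. The `lookup_with_retry` helper, its parameters, and the `flaky_lookup` stand-in are all hypothetical; the stand-in simulates a lookup that fails twice and makes no real geocoder call.

```python
import time

def lookup_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # caller decides how to handle a postal code with no result

# Hypothetical stand-in for a geocoder call: fails twice, then succeeds.
calls = {'count': 0}

def flaky_lookup(query):
    calls['count'] += 1
    if calls['count'] < 3:
        return None
    return [43.751255, -79.329895]

coords = lookup_with_retry(flaky_lookup, 'M3A, Toronto, Ontario')
print(coords, calls['count'])
```

In the notebook's loop, the `lookup` argument could be a small function that wraps the geocoder call and returns its `latlng` value; a `None` result then marks a postal code that needs attention instead of blocking the run.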

Add the latitude/longitude columns to the data frame.

In [12]:
df2['Latitude'] = latitude
df2['Longitude'] = longitude

## Final Data Frame

Note that the rows are not in the same order as in the assignment's data frame, but the values match. My data frame follows the postal code order of the Wikipedia table.

In [13]:
df2.head(n=10)

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.751255 | -79.329895 |
| 1 | M4A | North York | Victoria Village | 43.729958 | -79.314201 |
| 2 | M5A | Downtown Toronto | Harbourfront, Regent Park | 43.65522 | -79.361969 |
| 3 | M6A | North York | Lawrence Heights, Lawrence Manor | 43.722801 | -79.450691 |
| 4 | M7A | Queen's Park | Queen's Park | 43.664486 | -79.393021 |
| 5 | M9A | Etobicoke | Islington Avenue | 43.662743 | -79.528427 |
| 6 | M1B | Scarborough | Rouge, Malvern | 43.810154 | -79.194603 |
| 7 | M3B | North York | Don Mills North | 43.749134 | -79.362007 |
| 8 | M4B | East York | Woodbine Gardens, Parkview Hill | 43.707577 | -79.310913 |
| 9 | M5B | Downtown Toronto | Ryerson, Garden District | 43.657467 | -79.377708 |


In what follows we duplicate the neighborhood analysis done on the New York dataset.

Data frame for boroughs with the word 'Toronto' in them.

In [14]:
df2.shape

(103, 5)
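The text above describes a data frame restricted to boroughs containing the word 'Toronto', while the cell only reports the shape of the full frame. A minimal sketch of such a filter follows; the `sample` frame and the `toronto_df` name are assumptions for illustration, with rows taken from the table shown earlier.

```python
import pandas as pd

# Hypothetical subset of df2 with only the columns needed for the filter.
sample = pd.DataFrame({
    'PostalCode': ['M3A', 'M5A', 'M7A', 'M5B'],
    'Borough': ['North York', 'Downtown Toronto', "Queen's Park", 'Downtown Toronto'],
})

# Keep rows whose borough name contains 'Toronto', then renumber the index.
toronto_df = sample[sample['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(toronto_df)
```

Applied to `df2`, the same expression would keep boroughs such as Downtown Toronto and drop the others.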