# Segmenting and Clustering Neighborhoods in Toronto Part 2

Obtain the postal codes and load them into a pandas DataFrame.

In [1]:
!pip3 install pandas lxml



In [2]:
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

toronto_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]

toronto_df.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


The dataframe should consist of three columns: PostalCode, Borough, and Neighborhood. 
Therefore we'll rename "Postal Code" to "PostalCode" and "Neighbourhood" to "Neighborhood"

In [3]:
# Rename both columns in a single call
toronto_df.rename(columns={'Postal Code': 'PostalCode', 'Neighbourhood': 'Neighborhood'}, inplace=True)

toronto_df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [4]:
list(toronto_df.columns.values)

['PostalCode', 'Borough', 'Neighborhood']

Only process the rows that have an assigned borough; rows whose borough is "Not assigned" are dropped.

In [5]:
toronto_df.shape

(180, 3)

In [6]:
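# Keep rows with an assigned borough. Chaining reset_index returns a new frame,
# which avoids modifying a slice of toronto_df in place.
toronto_df_filtered = toronto_df[toronto_df['Borough'] != 'Not assigned'].reset_index(drop=True)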
toronto_df_filtered.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [7]:
toronto_df_filtered.shape

(103, 3)

Filtering reduced the table from 180 rows to 103. Next, we check whether any postal codes are duplicated:

In [8]:
toronto_df_filtered['PostalCode'].value_counts()

M2M    1
M5K    1
M1L    1
M5G    1
M4X    1
M2R    1
M1X    1
M6J    1
M5M    1
M2N    1
M9A    1
M8Z    1
M5S    1
M4J    1
M9M    1
M5R    1
M6A    1
M8Y    1
M5B    1
M5A    1
M3A    1
M4H    1
M4M    1
M9R    1
M1R    1
M5P    1
M1W    1
M3B    1
M2H    1
M9V    1
M1G    1
M5H    1
M3N    1
M4Y    1
M4R    1
M6B    1
M5J    1
M5W    1
M3M    1
M6E    1
M4L    1
M7A    1
M3J    1
M8W    1
M6N    1
M3H    1
M4E    1
M4G    1
M3C    1
M1H    1
M4T    1
M7Y    1
M9B    1
M4K    1
M4B    1
M9W    1
M1E    1
M4C    1
M5C    1
M5V    1
M5T    1
M4P    1
M1T    1
M7R    1
M9C    1
M4N    1
M6P    1
M1N    1
M8V    1
M1C    1
M6G    1
M1K    1
M9P    1
M1S    1
M1V    1
M6H    1
M1P    1
M1J    1
M1B    1
M6R    1
M5L    1
M9L    1
M6S    1
M3K    1
M1M    1
M2J    1
M2L    1
M4S    1
M4A    1
M2K    1
M8X    1
M6C    1
M2P    1
M4V    1
M9N    1
M4W    1
M5E    1
M5N    1
M6M    1
M6L    1
M5X    1
M3L    1
M6K    1
Name: PostalCode, dtype: int64

Every postal code has a count of 1, and the 103 distinct codes match the 103 rows, so all postal codes are unique.
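The same uniqueness check can be expressed without scanning the counts by eye. The following is a minimal sketch; `sample_df` is a hypothetical stand-in for `toronto_df_filtered`, built from three rows of the table shown earlier.

```python
import pandas as pd

# Hypothetical stand-in for toronto_df_filtered (three rows from the table above)
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M5A'],
    'Borough': ['North York', 'North York', 'Downtown Toronto'],
    'Neighborhood': ['Parkwoods', 'Victoria Village', 'Regent Park, Harbourfront'],
})

# True when no postal code appears more than once
codes_are_unique = sample_df['PostalCode'].is_unique

# Number of rows whose postal code repeats an earlier row
duplicate_count = int(sample_df['PostalCode'].duplicated().sum())

print('unique:', codes_are_unique, 'duplicates:', duplicate_count)
```

Applied to the real dataframe, `is_unique` gives a single boolean answer, which is easier to verify than a 103-line listing.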

Any neighborhood marked "Not assigned" should take the name of its borough. Let's check whether such rows exist:

In [9]:
toronto_df_filtered[toronto_df_filtered['Neighborhood'] == 'Not assigned'].value_counts()

Series([], dtype: int64)

The result is empty, which means no Neighborhood value is "Not assigned". The Wikipedia page is already formatted accordingly.
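If such rows were present, the replacement could be done along the following lines. This is only a sketch: `sample_df` is a hypothetical frame, and the "Not assigned" neighborhood in its second row is invented for illustration (the real table has none).

```python
import pandas as pd

# Hypothetical data: the 'Not assigned' neighborhood is invented for this example
sample_df = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Borough': ['North York', 'North York'],
    'Neighborhood': ['Parkwoods', 'Not assigned'],
})

# Boolean mask of rows whose neighborhood is missing
missing = sample_df['Neighborhood'] == 'Not assigned'

# Copy the borough name into the neighborhood column for those rows only
sample_df.loc[missing, 'Neighborhood'] = sample_df.loc[missing, 'Borough']

print(sample_df)
```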

In [10]:
row, col = toronto_df_filtered.shape

print('The row size is', row, 'and the column size is', col)

The row size is 103 and the column size is 3


Now, we need to get the latitude and longitude coordinates of each postal code. First, we'll query the geogratis.gc.ca geolocation API. If a request fails, both the latitude and longitude values will be read from the provided CSV URL: http://cocl.us/Geospatial_data

In [11]:
geo_df = pd.read_csv('https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv')

print('shape', geo_df.shape)

geo_df.head()

shape (103, 3)


Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [12]:
import requests

def getLatLang(post_code):
    """Return (latitude, longitude) for a postal code, using the CSV as a backup."""
    lat = 0
    lon = 0
    try:
        geo_json = requests.get('http://geogratis.gc.ca/services/geolocation/en/locate?q=' + post_code).json()
        # The coordinates are ordered [longitude, latitude]
        lat = geo_json[0]['geometry']['coordinates'][1]
        lon = geo_json[0]['geometry']['coordinates'][0]
    except Exception:
        # The request or the parsing failed, so read the values from the CSV instead.
        # Catching Exception (not a bare except) still lets Ctrl+C interrupt the loop.
        print('Error for', post_code, 'retrying with backup service')
        geo_df = pd.read_csv('https://ibm.box.com/shared/static/9afzr83pps4pwf2smjjcf1y5mvgb18rr.csv')
        match = geo_df[geo_df['Postal Code'] == post_code]
        lat = match['Latitude'].values[0]
        lon = match['Longitude'].values[0]

    return lat, lon

lat, lon = getLatLang('M1B')

print('M1B postal code geo','Latitude:',lat,'Longitude:',lon)

M1B postal code geo Latitude: 43.809444 Longitude: -79.193321
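The fallback pattern in the lookup above can be tried without network access. The sketch below makes several assumptions: `lookup_with_fallback`, `failing_primary`, and `backup_df` are hypothetical names, and the backup table is an in-memory copy of two rows from the CSV preview rather than the downloaded file. Loading the backup once and indexing it by postal code also avoids re-reading the CSV on every failure.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the backup CSV (rows from the preview above)
backup_df = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Index by postal code once so each fallback is a direct lookup
backup_lookup = backup_df.set_index('Postal Code')


def lookup_with_fallback(post_code, primary, backup=backup_lookup):
    """Try the primary lookup; on any error, read the coordinates from the backup table."""
    try:
        return primary(post_code)
    except Exception:
        row = backup.loc[post_code]
        return float(row['Latitude']), float(row['Longitude'])


def failing_primary(post_code):
    """Simulate an unavailable API so the fallback path is exercised."""
    raise ConnectionError('simulated outage')


lat, lon = lookup_with_fallback('M1B', failing_primary)
print('M1B from backup:', lat, lon)
```

Passing the primary lookup as a function keeps the fallback logic separate from the HTTP call, so either part can be swapped or tested on its own.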


Now, we will append the latitude and longitude coordinates to the filtered dataframe.

In [13]:
lat_list = []
lon_list = []

# Look up the coordinates of every postal code, in row order
for pc in toronto_df_filtered['PostalCode']:
    lat, lon = getLatLang(pc)
    lat_list.append(lat)
    lon_list.append(lon)

# Add both columns in one call; the lists line up with the rows by position
toronto_df_filtered = toronto_df_filtered.assign(Latitude=lat_list, Longitude=lon_list)

toronto_df_filtered.head()

Error for M7R retrying with backup service


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752804,-79.32959
1,M4A,North York,Victoria Village,43.723358,-79.312927
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.647398,-79.35292
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.724842,-79.451431
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.663681,-79.392097
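As an alternative to one request per postal code, the coordinates from the backup CSV could be attached with a single merge. This is a sketch under stated assumptions, not the approach used in this notebook: `neighborhoods` and `coords` are hypothetical stand-ins built from rows shown earlier, and the CSV values differ slightly from the API values (for M1B, 43.806686 versus 43.809444).

```python
import pandas as pd

# Hypothetical stand-in for toronto_df_filtered
neighborhoods = pd.DataFrame({
    'PostalCode': ['M3A', 'M1B'],
    'Borough': ['North York', 'Scarborough'],
    'Neighborhood': ['Parkwoods', 'Malvern, Rouge'],
})

# Hypothetical stand-in for geo_df (rows from the CSV preview)
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Align the key column name, then left-join so every neighborhood row is kept.
# validate='one_to_one' raises an error if a postal code repeats on either side.
merged = neighborhoods.merge(
    coords.rename(columns={'Postal Code': 'PostalCode'}),
    on='PostalCode',
    how='left',
    validate='one_to_one',
)

print(merged)
```

With `how='left'`, a postal code missing from the coordinate table (M3A in this sample) keeps its row and receives NaN coordinates, which makes gaps easy to spot.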


In [14]:
toronto_df_filtered

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752804,-79.32959
1,M4A,North York,Victoria Village,43.723358,-79.312927
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.647398,-79.35292
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.724842,-79.451431
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.663681,-79.392097
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.664909,-79.529373
6,M1B,Scarborough,"Malvern, Rouge",43.809444,-79.193321
7,M3B,North York,Don Mills,43.750324,-79.359451
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.70858,-79.310875
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.658165,-79.378456
