# Segmenting and Clustering Neighborhoods in Toronto
## Get locations of all Neighborhoods in Toronto
---
**Xu Qianyi**

Data Scientist

> Now that you have built a dataframe of the postal code of each neighborhood along with the borough name and neighborhood name, in order to utilize the Foursquare location data, we need to get the latitude and the longitude coordinates of each neighborhood. 

>Given that this package can be very unreliable, in case you are not able to get the geographical coordinates of the neighborhoods using the Geocoder package, here is a link to a csv file that has the geographical coordinates of each postal code: http://cocl.us/Geospatial_data

>Use the Geocoder package or the csv file to create the following dataframe:

![img](https://d3c33hcgiwev3.cloudfront.net/imageAssetProxy.v1/HZ3jNHNOEeiMwApe4i-fLg_f44f0f10ccfaf42fcbdba9813364e173_Screen-Shot-2018-06-18-at-7.18.16-PM.png?expiry=1546819200000&hmac=q7gXbf083OyXa-aunovWvdEpf5MPX6QfsXD7GZTaaew)
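Because the geocoding service may return nothing on a given call, one common pattern is to retry a few times and then fall back to the CSV file. The sketch below is a minimal illustration of that idea, not the assignment's required method: `geocode_with_retry` and `fake_lookup` are hypothetical names, and `lookup` stands in for whatever geocoding call is actually used.

```python
import time
from typing import Callable, Optional, Tuple

LatLng = Tuple[float, float]


def geocode_with_retry(
    postal_code: str,
    lookup: Callable[[str], Optional[LatLng]],
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
) -> Optional[LatLng]:
    """Call `lookup` until it returns coordinates or the attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(postal_code)
        if coords is not None:
            return coords
        time.sleep(delay_seconds)
    return None  # the caller can then fall back to the CSV file


# Hypothetical flaky lookup, used only to exercise the helper: it fails
# twice, then returns the coordinates that the CSV file lists for M1B.
calls = {"n": 0}


def fake_lookup(postal_code: str) -> Optional[LatLng]:
    calls["n"] += 1
    return (43.806686, -79.194353) if calls["n"] >= 3 else None


result = geocode_with_retry("M1B", fake_lookup)
print(result, "after", calls["n"], "attempts")
```

Passing the lookup as a parameter keeps the retry logic independent of any particular geocoding provider.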

## Load the locations of neighborhoods in Toronto

In [14]:
import pandas as pd

In [19]:
df = pd.read_csv('http://cocl.us/Geospatial_data')
df_loc = df.rename(columns={'Postal Code' : 'PostalCode'})
print(df_loc.shape)
df_loc.head()

(103, 3)


Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
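Before joining on the postal code, it can help to check that the codes are unique and that no coordinate is missing. The sketch below runs such checks on the five rows shown above, loaded from an in-memory string instead of the URL; the bounding box around Toronto is a rough assumption chosen for illustration.

```python
import io

import pandas as pd

# First five rows as shown in the output above; the real file has 103 rows.
sample_csv = io.StringIO(
    "Postal Code,Latitude,Longitude\n"
    "M1B,43.806686,-79.194353\n"
    "M1C,43.784535,-79.160497\n"
    "M1E,43.763573,-79.188711\n"
    "M1G,43.770992,-79.216917\n"
    "M1H,43.773136,-79.239476\n"
)
df_sample = pd.read_csv(sample_csv).rename(columns={"Postal Code": "PostalCode"})

# Each postal code should appear once, otherwise a join would duplicate rows.
codes_unique = bool(df_sample["PostalCode"].is_unique)

# No latitude or longitude should be missing.
no_missing = bool(df_sample[["Latitude", "Longitude"]].notna().all().all())

# Loose bounding box around Toronto (an assumption made for this check).
in_bounds = bool(
    df_sample["Latitude"].between(43.0, 44.5).all()
    and df_sample["Longitude"].between(-80.0, -78.5).all()
)

print(codes_unique, no_missing, in_bounds)
```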


## Fetch all neighborhood information from Wikipedia

Import the **requests** library to scrape data from Wikipedia, and the **lxml** library to parse the HTML.

In [6]:
import requests
from lxml import etree
import pandas as pd

Get the HTML content of the web page with the requests.get() function.

In [7]:
# Use requests.get(url) to fetch the HTML content
wiki_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
resp = requests.get(wiki_url)

# Delete every '\n' in the content to simplify later string processing
resp_str = resp.text.replace('\n', '')
 

Get the table's column names and data with the xpath function in the lxml library.

In [8]:
# Parse the HTML string; etree.HTML uses lxml's lenient HTML parser,
# which tolerates markup that is not well-formed XML
root = etree.HTML(resp_str)
trs = root.xpath('//table[contains(@class, "wikitable")]//tr')

# Get the table headers and use them as the column names of a new dataframe
ths = trs[0].xpath('th/text()')
df_original = pd.DataFrame(columns=ths)

# Collect the Postcode, Borough and Neighbourhood text of every data row
for loc_idx, tr in enumerate(trs[1:]):
    df_original.loc[loc_idx] = tr.xpath('td/text() | td/a/text()')
    

In [9]:
print('We get {} rows of neighborhoods in Toronto.'.format(df_original.shape[0]))
df_original.head()

We get 289 rows of neighborhoods in Toronto.


Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
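To see what the two XPath expressions select, the sketch below applies them to a small inline table instead of the live page. The HTML snippet is invented for illustration and only mimics the shape of the Wikipedia table; it is parsed with `etree.HTML`, lxml's HTML parser.

```python
from lxml import etree

# Hand-written stand-in for the Wikipedia table (illustrative only).
html = """
<html><body>
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td><a href="#">North York</a></td><td><a href="#">Parkwoods</a></td></tr>
</table>
</body></html>
"""

root = etree.HTML(html)

# contains() matches the table even though its class attribute holds two names.
rows = root.xpath('//table[contains(@class, "wikitable")]//tr')

# The header cells are <th> elements in the first row.
headers = rows[0].xpath('th/text()')

# The union (|) picks up text placed directly in a <td> as well as text
# wrapped in a link inside the <td>, in document order.
records = [tr.xpath('td/text() | td/a/text()') for tr in rows[1:]]

print(headers)
print(records)
```

The union matters because some cells hold plain text while others wrap it in a link; `td/text()` alone would return nothing for the linked cells.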


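Process the neighborhood dataframe: drop the rows whose borough is 'Not assigned', and where only the neighbourhood is 'Not assigned', use the borough name instead. Then combine the neighbourhoods that share a postcode into one row.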

In [20]:
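# Keep only rows with an assigned borough; copy so that later assignments
# do not touch df_original
df_neighborhood = df_original[df_original["Borough"] != 'Not assigned'].copy()

# Where the neighbourhood is 'Not assigned', use the borough name instead.
# (Assigning to the rows yielded by iterrows() may only change copies,
# so the replacement is done with a boolean mask.)
not_assigned = df_neighborhood['Neighbourhood'] == 'Not assigned'
df_neighborhood.loc[not_assigned, 'Neighbourhood'] = df_neighborhood.loc[not_assigned, 'Borough']

# Combine the neighbourhoods that share a postcode into one comma-separated string
df_postcodes = df_neighborhood.groupby(['Postcode', 'Borough']).agg({'Neighbourhood': ', '.join}).reset_index()
df_postcodes = df_postcodes.rename(columns={'Postcode' : 'PostalCode'})

print('We finally get {} rows of different postcodes.'.format(df_postcodes.shape[0]))
df_postcodes.head()

We finally get 103 rows of different postcodes.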


Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
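The cleaning steps can be checked on a toy dataframe. The rows below are invented for illustration and are not taken from the scraped table; the sketch replaces 'Not assigned' neighbourhoods with a boolean mask and `.loc`, a common way to make sure the change is written to the dataframe itself.

```python
import pandas as pd

# Toy rows (illustrative only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

# 1. Drop rows without a borough; copy so the toy frame is left untouched.
assigned = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Where only the neighbourhood is missing, fall back to the borough name.
mask = assigned["Neighbourhood"] == "Not assigned"
assigned.loc[mask, "Neighbourhood"] = assigned.loc[mask, "Borough"]

# 3. One row per postcode, neighbourhoods joined with ", ".
grouped = (
    assigned.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```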


## Inner join the two tables into one

In [22]:
# Inner join on PostalCode: keep only the postal codes present in both tables
df_infos = df_postcodes.merge(df_loc, on='PostalCode', how='inner')
print(df_infos.shape)
df_infos.head()

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
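Both tables have 103 rows here, so no postal code is lost in the join. When that is not guaranteed, a left merge with `indicator=True` is one way to list the postal codes that have no coordinates. The sketch below uses a toy pair of tables; `M9Z` and its borough are made-up values added to show an unmatched row.

```python
import pandas as pd

postcodes = pd.DataFrame({
    "PostalCode": ["M1B", "M1C", "M9Z"],  # M9Z is a made-up code with no coordinates
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
coords = pd.DataFrame({
    "PostalCode": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# A left merge keeps every postcode; the _merge column records where each row matched.
checked = postcodes.merge(coords, on="PostalCode", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()

# An inner merge keeps only the postcodes found in both tables.
inner = postcodes.merge(coords, on="PostalCode", how="inner")

print(unmatched)
print(inner.shape)
```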
