<a id='menu'></a>

# Segmenting and Clustering Neighborhoods in Toronto
1. <a href="#menu1">__Part 1__</a> Scraping neighborhoods
2. <a href="#menu2">__Part 2__</a> Locating neighborhoods

Importing libraries

In [4]:
import pandas as pd # library for data analysis
import requests
import folium
import os
import numpy as np
from pandas import json_normalize # transform a JSON file into a pandas dataframe
from bs4 import BeautifulSoup # for scraping
from sklearn.cluster import KMeans

%matplotlib inline
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported.')

Libraries imported.


<a id='menu1'></a>
## Part 1 Scraping neighborhoods

We are going to scrape the neighborhood data from Wikipedia.

In [5]:
wiki = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

r = requests.get(wiki)

In [7]:
soup = BeautifulSoup(r.content, 'html.parser')  # name the parser explicitly
trs = soup.find('table', 'wikitable sortable').find_all('tr')
columns = ['postcode', 'borough', 'neighborhood']
rows = []  # collect rows in a list; DataFrame.append was removed in pandas 2.0
for tr in trs:
    tds = tr.find_all('td')
    if len(tds) == 0:  # skip rows without <td> cells, such as the header row
        continue
    postcode = tds[0].get_text().strip()
    borough = tds[1].get_text().strip()
    neighborhood = tds[2].get_text().strip()
    # keep only postcodes that have an assigned borough
    if borough != 'Not assigned':
        rows.append({'postcode': postcode, 'borough': borough, 'neighborhood': neighborhood})
df_postcode = pd.DataFrame(rows, columns=columns)

print("Postcode count: {}".format(df_postcode.shape[0]))
df_postcode.head()

Postcode count: 211


Unnamed: 0,postcode,borough,neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


## Data Preparation

Find neighborhoods with the value ___"Not assigned"___ and replace them with the borough value

In [8]:
df_postcode[df_postcode.neighborhood=='Not assigned']

Unnamed: 0,postcode,borough,neighborhood
6,M7A,Queen's Park,Not assigned


In [9]:
df_postcode['neighborhood'] = df_postcode.apply(lambda x: x.borough if x.neighborhood == 'Not assigned' else x.neighborhood, axis=1)

In [10]:
df_postcode[df_postcode.neighborhood=='Not assigned']

Unnamed: 0,postcode,borough,neighborhood


Now no neighborhood has the ___"Not assigned"___ value

The next step is __grouping neighborhoods__ that share a postcode into one row
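
The same replacement can also be written without a row-wise `apply`. The sketch below is one possible alternative, run on a small hand-made dataframe (the toy rows are for illustration only, not the scraped table). It uses `Series.mask` to take the borough wherever the neighborhood is "Not assigned".

```python
import pandas as pd

# Toy data for illustration only; these rows are not the scraped table.
toy = pd.DataFrame({
    'postcode': ['M5A', 'M7A'],
    'borough': ['Downtown Toronto', "Queen's Park"],
    'neighborhood': ['Harbourfront', 'Not assigned'],
})

# Where the condition is True, take the value from the borough column instead.
toy['neighborhood'] = toy['neighborhood'].mask(
    toy['neighborhood'] == 'Not assigned', toy['borough']
)
print(toy)
```

On a table of a few hundred rows the speed difference is negligible; the vectorized form mainly reads closer to the intent.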

In [12]:
df_postcode_grouped = df_postcode.groupby(['postcode','borough'], as_index=False).count()

# build the rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for index, row in df_postcode_grouped.iterrows():
    # join all neighborhoods that share this postcode into one comma-separated string
    neighborhood = df_postcode[df_postcode.postcode == row['postcode']].neighborhood.str.cat(sep=',')
    rows.append({'postcode': row['postcode'], 'borough': row['borough'], 'neighborhood': neighborhood})
df_postcode_clean = pd.DataFrame(rows, columns=columns)
    
df_postcode_clean.head()

Unnamed: 0,postcode,borough,neighborhood
0,M1B,Scarborough,"Rouge,Malvern"
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union"
2,M1E,Scarborough,"Guildwood,Morningside,West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
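
The grouping step can also be expressed as a single `groupby` aggregation instead of a loop. The following is a minimal sketch on hand-made rows (the toy dataframe is an assumption for illustration); it joins the neighborhood names of each (postcode, borough) group with a comma.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'postcode': ['M1B', 'M1B', 'M1G'],
    'borough': ['Scarborough', 'Scarborough', 'Scarborough'],
    'neighborhood': ['Rouge', 'Malvern', 'Woburn'],
})

# One row per (postcode, borough); neighborhood names joined in their original order.
grouped = (
    toy.groupby(['postcode', 'borough'], as_index=False)['neighborhood']
       .agg(','.join)
)
print(grouped)
```

Note that this groups by postcode and borough together, so a postcode listed under two boroughs would yield two rows, each with only its own neighborhoods.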


<a href="#menu">_Back to top_</a>
<a id='menu2'></a>
## Part 2 Locating neighborhoods

Find the __coordinates__ of each postcode by reading from https://cocl.us/Geospatial_data

In [28]:
coordinate_csv = 'data/coordinate.csv'
# read from file downloaded before if available
if os.path.exists(coordinate_csv):
    df_coordinate = pd.read_csv(coordinate_csv)
else:
    url = 'https://cocl.us/Geospatial_data'
    df_coordinate = pd.read_csv(url)    
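
The cell above reads the local copy when it exists but does not write one after downloading. If a local cache is wanted, a helper along these lines could do both; `read_csv_cached` is a hypothetical name introduced here, not part of the notebook or of pandas.

```python
import os
import pandas as pd

def read_csv_cached(local_path, url):
    """Read local_path if it exists; otherwise read url and save a local copy.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if os.path.exists(local_path):
        return pd.read_csv(local_path)
    df = pd.read_csv(url)
    # create the target directory (e.g. 'data/') if it is missing
    os.makedirs(os.path.dirname(local_path) or '.', exist_ok=True)
    df.to_csv(local_path, index=False)
    return df
```

Under these assumptions, `df_coordinate = read_csv_cached('data/coordinate.csv', 'https://cocl.us/Geospatial_data')` would download the file on the first run and read the saved copy afterwards.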

Postcode coordinates dataframe:

In [14]:
df_coordinate.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


__Join__ with the df_postcode_clean dataframe

In [15]:
df_postcode_coordinate = df_postcode_clean.join(df_coordinate.set_index('Postal Code'), on='postcode')
df_postcode_coordinate.head()

Unnamed: 0,postcode,borough,neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge,Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek,Rouge Hill,Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood,Morningside,West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


In [17]:
df_postcode_coordinate.shape

(103, 5)

__All__ postcodes now have coordinates
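
The row count alone does not show whether every postcode found a match, because a left join keeps unmatched rows and fills them with NaN. One way to check coverage explicitly is a left `merge` with `indicator=True`, sketched here on toy data (the postcode 'M9Z' is made up to force a miss).

```python
import pandas as pd

# Toy data for illustration only; 'M9Z' is a made-up code with no coordinates.
postcodes = pd.DataFrame({'postcode': ['M1B', 'M1C', 'M9Z']})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# indicator=True adds a '_merge' column: 'both' for matches, 'left_only' for misses.
merged = postcodes.merge(
    coords, how='left', left_on='postcode', right_on='Postal Code', indicator=True
)
missing = merged.loc[merged['_merge'] == 'left_only', 'postcode'].tolist()
print(missing)
```

On the notebook's own dataframe, a quicker check with the same intent would be `df_postcode_coordinate[['Latitude', 'Longitude']].isna().any()`.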

<a id='menu3'></a>

<a href="#menu">_Back to top_</a>