<a href="https://cognitiveclass.ai"><img src = "https://ibm.box.com/shared/static/9gegpsmnsoo25ikkbl4qzlvlyjbgxs5x.png" width = 400> </a>

<h1 align=center><font size = 5>Segmenting and Clustering Neighborhoods in Toronto</font></h1>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (import location for pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#!conda install -c conda-forge beautifulsoup4 --yes # uncomment this line if you haven't completed the Foursquare API lab
from bs4 import BeautifulSoup  # HTML parsing tool

#!conda install -c conda-forge geocoder --yes
import geocoder # import geocoder

print('Libraries imported.')

Libraries imported.


## 1. Download Webpage and Extract Dataset 

Toronto's postal codes all start with the letter M, and each one maps to a borough and one or more neighborhoods. To segment the neighborhoods and explore them, we need a dataset that lists the boroughs and the neighborhoods in each borough, as well as the latitude and longitude coordinates of each neighborhood.

Luckily, the postal code, borough, and neighborhood data is freely available on the web. Feel free to look for it on your own, but here is the link: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

The cells below request the page with `requests` and parse it with BeautifulSoup, so no manual download is needed. The coordinates are not part of that table; they are fetched separately in section 2.

In [2]:
# specify the url
neighborhoods_page = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

In [3]:
# query the website and store the returned HTML in the variable 'html_string'
html_string = requests.get(neighborhoods_page).content

In [4]:
# Parse webpage
soup = BeautifulSoup(html_string, 'html.parser')

In [5]:
#extract tables
tables = soup.find_all("table")

In [6]:
# Move the first table into a DataFrame
table = tables[0]
tab_data = [[cell.text.rstrip('\n') for cell in row.find_all(["th","td"])]
                        for row in table.find_all("tr")]
df = pd.DataFrame(tab_data)
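
The nested comprehension above can be tried offline on a small table. The following is a minimal sketch, assuming a hand-written HTML snippet (hypothetical, with illustrative values) in place of the live Wikipedia page. It uses `get_text(strip=True)` instead of `.text.rstrip('\n')`, which also trims leading whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the values are illustrative only.
html = """
<table>
  <tr><th>Postal Code
</th><th>Borough
</th><th>Neighbourhood
</th></tr>
  <tr><td>M1A
</td><td>Not assigned
</td><td>Not assigned
</td></tr>
  <tr><td>M3A
</td><td>North York
</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find_all("table")[0]

# One inner list per <tr>, one string per header or data cell
rows = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.find_all("tr")
]
print(rows)
```

The first inner list holds the header cells, which is why the notebook promotes row 0 to the column names in the next step.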

In [7]:
#Move first row into header
df.columns = df.iloc[0,:]
df.drop(index=0,inplace=True)

In [8]:
#Only process rows with Borough assigned
df = df.query('Borough != "Not assigned"').copy()  # copy so that adding columns later does not trigger SettingWithCopyWarning

In [9]:
#Print Dataframe shape
df.shape

(103, 3)
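
The header promotion and the "Not assigned" filter can be checked on a toy table before running them on the scraped data. This is a minimal sketch, assuming a hypothetical four-row `tab_data` list. It uses a boolean mask, which should select the same rows as the `query` call above.

```python
import pandas as pd

# Hypothetical rows in the same layout as the scraped table (header row first)
tab_data = [
    ["Postal Code", "Borough", "Neighbourhood"],
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M4A", "North York", "Victoria Village"],
]

toy = pd.DataFrame(tab_data)

# Promote the first row to column names, then drop it from the data
toy.columns = toy.iloc[0, :]
toy = toy.drop(index=0)

# Keep only rows whose borough is assigned
toy = toy[toy["Borough"] != "Not assigned"].copy()

print(toy.shape)
```

Note that the original row labels are kept after filtering, which matches the gaps in the index column of the notebook output.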

In [10]:
#Print Dataframe
df

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 3 | M3A | North York | Parkwoods |
| 4 | M4A | North York | Victoria Village |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 6 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 7 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 9 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 10 | M1B | Scarborough | Malvern, Rouge |
| 12 | M3B | North York | Don Mills |
| 13 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 14 | M5B | Downtown Toronto | Garden District, Ryerson |


## 2. Fetch Coordinates

### Define a function that fetches coordinates using geocoder

In [11]:
def fetch_coordinates(row):
    # flag that records whether the geocoder returned a usable result
    g_ok = False

    # keep querying until the geocoder returns coordinates
    # (note: this never stops if the service keeps failing)
    while not g_ok:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(row['Postal Code']))
        g_ok = g.ok
    print([g.lat, g.lng])
    return [g.lat, g.lng]
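
The loop above retries without limit, so it can hang if the geocoding service stays unavailable. Below is a minimal sketch of a bounded variant. The names `fetch_with_retries`, `max_attempts`, and `delay_seconds` are hypothetical, and the geocoder is passed in as a callable so the example runs offline with a fake one. It assumes only that the result object exposes `ok`, `lat`, and `lng`, as used in the function above.

```python
import time
from types import SimpleNamespace


def fetch_with_retries(query, geocode, max_attempts=5, delay_seconds=0.0):
    """Call geocode(query) up to max_attempts times; return [lat, lng] or [None, None]."""
    for attempt in range(1, max_attempts + 1):
        result = geocode(query)
        if result.ok:
            return [result.lat, result.lng]
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return [None, None]  # give up instead of looping forever


# Fake geocoder for illustration: fails twice, then succeeds
calls = {"count": 0}


def fake_geocode(query):
    calls["count"] += 1
    if calls["count"] < 3:
        return SimpleNamespace(ok=False, lat=None, lng=None)
    return SimpleNamespace(ok=True, lat=43.75188, lng=-79.33036)


coords = fetch_with_retries("M3A, Toronto, Ontario", fake_geocode)
print(coords)
```

In the notebook, `geocoder.arcgis` could be passed as the `geocode` argument. Rows that still fail would then carry missing coordinates, which can be inspected or retried later.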

### Apply fetch_coordinates to each row to add Latitude and Longitude columns

In [18]:
res = df.apply(fetch_coordinates, axis=1, result_type='expand')
res.columns = ['Latitude','Longitude']
df[res.columns]= res

[43.75188000000003, -79.33035999999998]
[43.73042000000004, -79.31281999999999]
[43.655140000000074, -79.36264999999997]
[43.72321000000005, -79.45140999999995]
[43.66449000000006, -79.39301999999998]
[43.66277000000008, -79.52830999999998]
[43.81153000000006, -79.19551999999999]
[43.74929000000003, -79.36168999999995]
[43.707940000000065, -79.31159999999994]
[43.65736000000004, -79.37817999999999]
[43.70799000000005, -79.44837999999999]
[43.65279000000004, -79.55405999999994]
[43.78564000000006, -79.15870999999999]
[43.72184000000004, -79.34339999999997]
[43.68970000000007, -79.30679999999995]
[43.65143000000006, -79.37556999999998]
[43.69211000000007, -79.43035999999995]
[43.648900000000026, -79.57824999999997]
[43.765750000000025, -79.17519999999996]
[43.67703000000006, -79.29541999999998]
[43.64531000000005, -79.37367999999998]
[43.68784000000005, -79.45045999999996]
[43.768200000000036, -79.21760999999998]
[43.70909000000006, -79.36409999999995]
[43.65609000000006, -79.38492999999
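
To see what `result_type='expand'` does without calling the geocoding service, the same pattern can be run on a toy frame. This is a minimal sketch, assuming a hypothetical `fake_fetch` function backed by a small lookup dictionary. It attaches the new columns with `join`, which aligns on the index, rather than the column assignment used above.

```python
import pandas as pd

# Toy frame that keeps non-contiguous row labels, like the filtered table
toy = pd.DataFrame({"Postal Code": ["M3A", "M4A"]}, index=[3, 4])

# Hypothetical offline lookup standing in for the geocoder
lookup = {"M3A": [43.75188, -79.33036], "M4A": [43.73042, -79.31282]}


def fake_fetch(row):
    # Returns a two-element list, just like fetch_coordinates
    return lookup[row["Postal Code"]]


# 'expand' turns each returned list into separate columns (named 0 and 1)
res = toy.apply(fake_fetch, axis=1, result_type="expand")
res.columns = ["Latitude", "Longitude"]

# Align on the shared index and attach the new columns
toy = toy.join(res)
print(toy)
```

Because `apply` preserves the row labels, the coordinates line up with the right postal codes even though the index has gaps.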

In [19]:
df

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 3 | M3A | North York | Parkwoods | 43.75188 | -79.33036 |
| 4 | M4A | North York | Victoria Village | 43.73042 | -79.31282 |
| 5 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65514 | -79.36265 |
| 6 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.72321 | -79.45141 |
| 7 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.66449 | -79.39302 |
| 9 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.66277 | -79.52831 |
| 10 | M1B | Scarborough | Malvern, Rouge | 43.81153 | -79.19552 |
| 12 | M3B | North York | Don Mills | 43.74929 | -79.36169 |
| 13 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.70794 | -79.3116 |
| 14 | M5B | Downtown Toronto | Garden District, Ryerson | 43.65736 | -79.37818 |
