# Segmenting and Clustering Neighbourhoods in Toronto

This project scrapes the Wikipedia page that lists Canadian postal codes beginning with M, then cleans and processes the data for clustering. Clustering is carried out with K-Means, and the results are mapped with the Folium library. The boroughs of Toronto are mapped first without clustering and then mapped again with the clusters shown.

###### The notebook includes code for web scraping, data cleaning and clustering.

###### Installing and Importing the required Libraries

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans
import folium
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values (OpenStreetMaps)
import requests
import json
from pandas import json_normalize # transform a JSON file into a pandas dataframe (the pandas.io.json import path is deprecated)
from bs4 import BeautifulSoup as bs

### Web Scraping for Postal Codes

Use the BeautifulSoup library to scrape the page and build a table of postal codes. Printing the page title confirms that the page was retrieved successfully.

In [4]:
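# Download the page and parse it with the lxml parser
html_data = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = bs(html_data, 'lxml')
print(soup.title)  # confirms that the page was retrieved

# Collect one dict per assigned postal code from the first table on the page
table_contents = []
tab = soup.find('table')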
for row in tab.findAll('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['Postcode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

# print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df.head()

<title>List of postal codes of Canada: M - Wikipedia</title>


| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
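
The chained string calls in the scraping loop are hard to read, so here is a minimal sketch of the same idea as a standalone helper. The name `split_cell` is hypothetical, and the input format (`Borough(Neighbourhood / Neighbourhood)`) is an assumption based on what the loop above expects from each cell's `span` text.

```python
def split_cell(span_text):
    """Split 'Borough(Name / Name)' into (borough, 'Name, Name').

    Hypothetical helper: assumes the borough comes first and the
    neighbourhoods follow in parentheses, separated by slashes.
    """
    borough, sep, rest = span_text.partition('(')
    borough = borough.strip()
    if not sep:
        # No parentheses: fall back to using the borough as the neighbourhood
        return borough, borough
    names = [part.strip() for part in rest.rstrip(')').split('/')]
    return borough, ', '.join(name for name in names if name)


print(split_cell('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell('North York(Parkwoods)'))
```

Using `partition` rather than `split('(')[1]` avoids an `IndexError` when a cell has no parenthesised part.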


###### Convert the scraped HTML table to a pandas DataFrame and display the first five rows of the table.

###### Clean the data: Ignore cells with a borough that is Not assigned.

In [5]:
# Drop rows where Borough is 'Not assigned'
df_na = df[df.Borough != 'Not assigned']

# Group neighbourhoods that share the same postcode into one comma-separated row
df_grouped = df_na.groupby(['Postcode','Borough'], sort=False).agg(', '.join)
df_grouped.reset_index(inplace=True)

# Where the neighbourhood is 'Not assigned', use the borough name instead
df_grouped['Neighborhood'] = np.where(df_grouped['Neighborhood'] == 'Not assigned',df_grouped['Borough'], df_grouped['Neighborhood'])

df_grouped

| | Postcode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M3A | North York | Parkwoods |
| 1 | M4A | North York | Victoria Village |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 4 | M7A | Queen's Park | Ontario Provincial Government |
| 5 | M9A | Etobicoke | Islington Avenue |
| 6 | M1B | Scarborough | Malvern, Rouge |
| 7 | M3B | North York | Don Mills North |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson |
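
To make the grouping and fallback steps concrete, the following sketch applies them to a three-row toy dataframe. The rows are made up for illustration (the real scraped table has one row per postcode already), so this only shows how the operations behave, not what the notebook's data contains.

```python
import numpy as np
import pandas as pd

# Toy input: one postcode split over two rows, one with no neighbourhood
toy = pd.DataFrame({
    'Postcode': ['M5A', 'M5A', 'M7A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Regent Park', 'Harbourfront', 'Not assigned'],
})

# Join the neighbourhoods of each (Postcode, Borough) pair into one string
grouped = (
    toy.groupby(['Postcode', 'Borough'], sort=False)['Neighborhood']
       .agg(', '.join)
       .reset_index()
)

# Fall back to the borough name where no neighbourhood was assigned
grouped['Neighborhood'] = np.where(
    grouped['Neighborhood'] == 'Not assigned',
    grouped['Borough'],
    grouped['Neighborhood'],
)

print(grouped)
```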


In [6]:
df_grouped.shape

(103, 3)

###### Import the CSV file of geographical coordinates and create a new dataframe with latitude and longitude values for the neighborhoods.

In [7]:
#Import CSV file of geographical coordinates of neighborhoods in Canada
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |


In [8]:
lat_lon.rename(columns={'Postal Code':'Postcode'},inplace=True)
lat_lon

| | Postcode | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| 5 | M1J | 43.744734 | -79.239476 |
| 6 | M1K | 43.727929 | -79.262029 |
| 7 | M1L | 43.711112 | -79.284577 |
| 8 | M1M | 43.716316 | -79.239476 |
| 9 | M1N | 43.692657 | -79.264848 |


In [9]:
# Merge the coordinates into the grouped neighborhoods on Postcode to create a new dataframe
df_neigh = pd.merge(df_grouped,lat_lon,on='Postcode')
df_neigh.head()

| | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
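
`pd.merge` defaults to an inner join, so any postcode missing from the coordinates file would be dropped without warning. One way to check for that, sketched here on toy data, is a left join with `indicator=True`. The postcode `M9Z` is a made-up value used only to trigger the unmatched case.

```python
import pandas as pd

# Toy stand-ins for the grouped neighborhoods and the coordinates file
neigh = pd.DataFrame({
    'Postcode': ['M3A', 'M4A', 'M9Z'],  # 'M9Z' is hypothetical
    'Borough': ['North York', 'North York', 'Nowhere'],
})
coords = pd.DataFrame({
    'Postcode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# Keep every neighborhood row and record where each row found a match;
# validate raises if a postcode appears more than once on either side
checked = neigh.merge(coords, on='Postcode', how='left',
                      indicator=True, validate='one_to_one')

unmatched = checked.loc[checked['_merge'] == 'left_only', 'Postcode'].tolist()
print('Postcodes without coordinates:', unmatched)
```

If `unmatched` is empty, the inner join used in the notebook loses no rows.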


###### Obtain Geographical Coordinates of Toronto using Geolocator.

In [10]:
city = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="Coordinates")
location = geolocator.geocode(city)
latitude = location.latitude
longitude = location.longitude
print('Geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

Geographical coordinates of Toronto are 43.6534817, -79.3839347.


##### Get geographical coordinates of the neighborhoods in Toronto and create map of Toronto with markers using Folium Library.

In [11]:
neighborhoods = df_neigh
#Create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10.4)

# Add markers to map
for lat, lng, borough, neighbourhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
    
map_toronto

### The Map does not show in Notebook. Please see html.

##### Use K-Means to cluster neighborhoods in Toronto with K=5 and a fixed random seed.

Define a new dataframe containing only the neighborhoods whose borough name includes "Toronto", and show the first five rows of the dataframe.

In [12]:
df_toronto = df_neigh[df_neigh['Borough'].str.contains('Toronto',regex=False)]
df_toronto.head()

| | Postcode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |
| 20 | M5E | Downtown Toronto | Berczy Park | 43.644771 | -79.373306 |


Initialize clustering with K=5

In [13]:
k = 5
# Cluster on latitude and longitude only
toronto_clustering = df_toronto.drop(columns=['Postcode', 'Borough', 'Neighborhood'])
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_clustering)
# Prepend each neighborhood's cluster label as the first column
df_toronto.insert(0, 'Cluster Labels', kmeans.labels_)
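
The notebook fixes K by hand. One common way to sanity-check such a choice is the elbow method: fit K-Means for a range of K values and look for the point where inertia stops dropping sharply. The sketch below is not part of the original notebook and uses synthetic coordinates (three made-up groups of points near Toronto) rather than `df_toronto`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic (latitude, longitude) points: three tight groups near Toronto.
# These values are invented for illustration, not taken from the notebook.
rng = np.random.default_rng(0)
centres = np.array([[43.65, -79.38], [43.70, -79.30], [43.67, -79.45]])
points = np.vstack([c + rng.normal(scale=0.005, size=(20, 2)) for c in centres])

# Fit K-Means for several K and record the inertia: the sum of squared
# distances from each point to its assigned centroid
inertias = {}
for n_clusters in range(1, 7):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    inertias[n_clusters] = model.inertia_

for n_clusters, inertia in inertias.items():
    print(f'K={n_clusters}: inertia={inertia:.6f}')
```

On data like this the inertia should fall steeply up to K=3 and flatten afterwards. The same loop could be run on `toronto_clustering` to see whether K=5 sits near an elbow.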

Create map showing Clusters with initial K=5

In [14]:
# create map
map_clusters = folium.Map(location=[latitude, longitude],zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map, colored by cluster label
for lat, lon, neighbourhood, cluster in zip(df_toronto['Latitude'], df_toronto['Longitude'], df_toronto['Neighborhood'], df_toronto['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
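
The color scheme used for the cluster markers can be checked separately from the map. This minimal sketch builds one hex color per cluster and looks up the color for a few example labels. The `labels` list is made up for illustration, and the names `palette` and `marker_colours` are not from the notebook.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5

# Sample k evenly spaced colors from the rainbow colormap as hex strings
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# Example cluster labels (made up); each label indexes directly into the palette
labels = [0, 3, 4, 0, 1]
marker_colours = [palette[label] for label in labels]

print(palette)
print(marker_colours)
```

Indexing with the label itself keeps cluster 0 on the first color. Indexing with `cluster - 1` also works, but it wraps cluster 0 around to the last color.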

### The Map does not show in Notebook. Please see html.