# Segmenting and Clustering Neighborhoods in Toronto 


### About

This is a Jupyter Notebook on segmenting and clustering neighborhoods in Toronto, created by Ahmed B. Darwish for the Week 3 assignment of the Applied Data Science Capstone project (Coursera/IBM course).

### Importing and downloading Required Libraries and Packages

In [27]:
pip install beautifulsoup4



In [2]:
# Before we get the data and start exploring it, let's download all the dependencies that we will need
from bs4 import BeautifulSoup
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (the pandas.io.json import path is deprecated since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# write to csv
import csv

# Time
import time

#Visuals
import matplotlib.pyplot as plt


In [3]:
# Checking if BeautifulSoup installed correctly
BeautifulSoup

bs4.BeautifulSoup

## 1- Scraping Toronto Data from Wikipedia

In [4]:
# URL of the Wikipedia page listing Toronto postal codes
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
s = requests.Session()
response = s.get(url, timeout=10)
response
# If the request is successful, the response status code is 200.

<Response [200]>

In [5]:
# parse the response content as HTML
soup = BeautifulSoup(response.content, 'html.parser')

# to view the content in html format
pretty_soup = soup.prettify()

In [6]:
# getting Wikipedia page title
soup.title.string

'List of postal codes of Canada: M - Wikipedia'

In [9]:
# find all the tables on the Wikipedia page
all_tables=soup.find_all('table')

# get the right table to scrape
right_table=soup.find('table', {"class":'wikitable sortable'})

In [11]:
# Number of columns in the table (cells in the last row)
cells = right_table.find_all("tr")[-1].find_all("td")

len(cells)

3

In [12]:
# number of rows in the table including header
rows = right_table.findAll("tr")
len(rows)

181

In [13]:
# header attributes of the table
header = [th.text.rstrip() for th in rows[0].find_all('th')]
print(header)
print(len(header))

['Postal Code', 'Borough', 'Neighbourhood']
3


In [14]:
# Getting the table data (skip the header row)
lst_data = []
for row in rows[1:]:
    data = [d.text.rstrip() for d in row.find_all('td')]
    lst_data.append(data)
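
The row-by-row extraction above can be tried offline on a small snippet. The following is a minimal sketch, assuming a hand-written HTML fragment (hypothetical, not the live Wikipedia page) that uses the same `wikitable sortable` class; it also checks that every data row has as many cells as the header.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the Wikipedia table
html = """
<table class="wikitable sortable">
  <tr><th>Postal Code </th><th>Borough </th><th>Neighbourhood </th></tr>
  <tr><td>M1A </td><td>Not assigned </td><td>Not assigned </td></tr>
  <tr><td>M3A </td><td>North York </td><td>Parkwoods </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"class": "wikitable sortable"})
rows = table.find_all("tr")

# The first row holds the header cells (<th>), the rest hold data cells (<td>)
header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
records = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]

# Guard against malformed rows before building a dataframe
assert all(len(record) == len(header) for record in records)

print(header)
print(records)
```

Here `get_text(strip=True)` removes the trailing whitespace that `rstrip()` handles in the cell above.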

In [15]:
lst_data

[['M1A', 'Not assigned', 'Not assigned'],
 ['M2A', 'Not assigned', 'Not assigned'],
 ['M3A', 'North York', 'Parkwoods'],
 ['M4A', 'North York', 'Victoria Village'],
 ['M5A', 'Downtown Toronto', 'Regent Park, Harbourfront'],
 ['M6A', 'North York', 'Lawrence Manor, Lawrence Heights'],
 ['M7A', 'Downtown Toronto', "Queen's Park, Ontario Provincial Government"],
 ['M8A', 'Not assigned', 'Not assigned'],
 ['M9A', 'Etobicoke', 'Islington Avenue, Humber Valley Village'],
 ['M1B', 'Scarborough', 'Malvern, Rouge'],
 ['M2B', 'Not assigned', 'Not assigned'],
 ['M3B', 'North York', 'Don Mills'],
 ['M4B', 'East York', 'Parkview Hill, Woodbine Gardens'],
 ['M5B', 'Downtown Toronto', 'Garden District, Ryerson'],
 ['M6B', 'North York', 'Glencairn'],
 ['M7B', 'Not assigned', 'Not assigned'],
 ['M8B', 'Not assigned', 'Not assigned'],
 ['M9B',
  'Etobicoke',
  'West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale'],
 ['M1C', 'Scarborough', 'Rouge Hill, Port Union, Highland Creek'],
 ...

In [16]:
# Convert the data into pandas dataframe
df = pd.DataFrame(lst_data)
df

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 7 | M8A | Not assigned | Not assigned |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |


In [18]:
# Add the header information to the df
df.columns = header
df.head()

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |


## 2- Preprocess and clean the dataframe


#### First, delete rows with "Not assigned" values

In [19]:
# Delete rows whose Borough is "Not assigned"
Toronto_df = df[df.Borough != 'Not assigned']
Toronto_df

| | Postal Code | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Regent Park, Harbourfront |
| 5 | M6A | North York | Lawrence Manor, Lawrence Heights |
| 6 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government |
| 8 | M9A | Etobicoke | Islington Avenue, Humber Valley Village |
| 9 | M1B | Scarborough | Malvern, Rouge |
| 11 | M3B | North York | Don Mills |
| 12 | M4B | East York | Parkview Hill, Woodbine Gardens |
| 13 | M5B | Downtown Toronto | Garden District, Ryerson |


In [20]:
# Let's examine our data
Toronto_df.shape

(103, 3)
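
The filtering step can be checked on a tiny dataframe. This is a minimal sketch with made-up rows in the same three-column layout; the `reset_index(drop=True)` call is an optional addition that the notebook itself does not use, which is why the notebook's filtered dataframe keeps its original row labels (2, 3, 4, ...).

```python
import pandas as pd

# Small stand-in for the scraped table
df = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M5A", "Downtown Toronto", "Regent Park, Harbourfront"],
    ],
    columns=["Postal Code", "Borough", "Neighbourhood"],
)

# Boolean mask: True for rows that have a real borough
mask = df["Borough"] != "Not assigned"

# Keep the matching rows and renumber them from 0 (optional step)
assigned = df[mask].reset_index(drop=True)

print(assigned.shape)
```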

In [21]:
# Save Toronto_df into Toronto_df.CSV file
Toronto_df.to_csv(r'Toronto_df.CSV')

## 3- Get the latitude and the longitude coordinates of each Postal Code from given CSV 


In [23]:
# get coordinates from given downloaded CSV file 'Geospatial_Coordinates.csv' 
df_loc = pd.read_csv('Geospatial_Coordinates.csv')
df_loc

| | Postal Code | Latitude | Longitude |
|---|---|---|---|
| 0 | M1B | 43.806686 | -79.194353 |
| 1 | M1C | 43.784535 | -79.160497 |
| 2 | M1E | 43.763573 | -79.188711 |
| 3 | M1G | 43.770992 | -79.216917 |
| 4 | M1H | 43.773136 | -79.239476 |
| 5 | M1J | 43.744734 | -79.239476 |
| 6 | M1K | 43.727929 | -79.262029 |
| 7 | M1L | 43.711112 | -79.284577 |
| 8 | M1M | 43.716316 | -79.239476 |
| 9 | M1N | 43.692657 | -79.264848 |


Now we add the latitude and longitude to Toronto_df using the pandas merge method, joining on the shared 'Postal Code' column.

In [24]:
# Add the coordinates to Toronto_df
Toronto_df2 = pd.merge(Toronto_df, df_loc, on='Postal Code', how='inner')
Toronto_df2

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 5 | M9A | Etobicoke | Islington Avenue, Humber Valley Village | 43.667856 | -79.532242 |
| 6 | M1B | Scarborough | Malvern, Rouge | 43.806686 | -79.194353 |
| 7 | M3B | North York | Don Mills | 43.745906 | -79.352188 |
| 8 | M4B | East York | Parkview Hill, Woodbine Gardens | 43.706397 | -79.309937 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
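
An inner merge silently drops any postal code that has no match in the coordinates file. The sketch below, on made-up frames, shows one way to check for that with the `indicator` and `validate` parameters of `pd.merge`; the postal code `M9Z` and the borough `Nowhere` are hypothetical values added only to produce an unmatched row.

```python
import pandas as pd

neighbourhoods = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M9Z"],  # "M9Z" is a made-up code
        "Borough": ["North York", "North York", "Nowhere"],
    }
)
coords = pd.DataFrame(
    {
        "Postal Code": ["M3A", "M4A", "M1B"],
        "Latitude": [43.753259, 43.725882, 43.806686],
        "Longitude": [-79.329656, -79.315572, -79.194353],
    }
)

# Inner join: only postal codes present in both frames survive
inner = pd.merge(neighbourhoods, coords, on="Postal Code", how="inner")

# Left join with an indicator column to list codes that found no coordinates;
# validate raises if a postal code appears more than once on either side
checked = pd.merge(
    neighbourhoods,
    coords,
    on="Postal Code",
    how="left",
    indicator=True,
    validate="one_to_one",
)
unmatched = checked.loc[checked["_merge"] == "left_only", "Postal Code"].tolist()

print(len(inner), unmatched)
```

If `unmatched` is empty, the inner merge lost no rows from the left frame.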


In [26]:
# Save Toronto_df2 into Toronto_df2.CSV
Toronto_df2.to_csv(r'Toronto_df2.CSV')

## 4- Explore and cluster the neighborhoods in Toronto

Following the assignment's suggestion, we work only with the boroughs whose names contain the word "Toronto" and then replicate the same analysis that was applied to the New York City data.

In [28]:
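# select only boroughs whose name contains the word Toronto
# (.copy() gives an independent dataframe, so adding columns later does not trigger SettingWithCopyWarning)
Toronto_df3 = Toronto_df2[Toronto_df2['Borough'].str.contains("Toronto")].copy()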
Toronto_df3.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 |


In [29]:
# Let's see how many boroughs we have
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(Toronto_df3['Borough'].unique()),
        Toronto_df3.shape[0]
    )
)

The dataframe has 4 boroughs and 39 neighborhoods.
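
The borough selection and count above can be reproduced on a small series. In this sketch the series values are illustrative; `regex=False` and `na=False` are optional arguments, not used in the notebook, that make the match a literal substring test and treat missing values as non-matches.

```python
import pandas as pd

# Illustrative borough names, including a missing value
boroughs = pd.Series(
    ["Downtown Toronto", "North York", "East Toronto", "Etobicoke", None],
    name="Borough",
)

# Literal substring match; missing values count as False
mask = boroughs.str.contains("Toronto", regex=False, na=False)
selected = boroughs[mask]

print(selected.tolist())
print(selected.nunique())  # number of distinct boroughs kept
```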


In [30]:
# To find Toronto's location (latitude, longitude), we geocode it with geopy's Nominatim
# "foursquare_agent" is used as the user_agent
from geopy.geocoders import Nominatim
address = 'Toronto, Canada'  # geocode expects a query string, not a list
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


In [31]:
# create map of Toronto using latitude and longitude values
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, Neighbourhood in zip(Toronto_df3['Latitude'], Toronto_df3['Longitude'], Toronto_df3['Borough'], Toronto_df3['Neighbourhood']):
    label = '{}, {}'.format(Neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_Toronto)
    
map_Toronto

### Clustering Toronto_df3

In [32]:
# set number of clusters
kclusters = 4

# keep only the numeric Latitude/Longitude columns for clustering
Toronto_clustering = Toronto_df3.drop(columns=['Postal Code','Borough','Neighbourhood'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(Toronto_clustering)

# inspect the cluster labels generated for the first 10 rows
label = kmeans.labels_[0:10]

In [33]:
label

array([1, 1, 1, 1, 3, 1, 1, 2, 1, 2])
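
To make the clustering step less of a black box, here is a minimal pure-Python sketch of the basic k-means loop (Lloyd's algorithm) on a few made-up latitude/longitude pairs. It is not scikit-learn's implementation, which by default uses k-means++ initialization; the function name `kmeans` and the sample points are illustrative assumptions. As in the notebook, distances are plain Euclidean distances on degrees, which is only an approximation of distance on the ground.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Basic Lloyd's algorithm on 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Made-up (latitude, longitude) pairs: three close together, two further east
points = [
    (43.65, -79.38),
    (43.66, -79.39),
    (43.65, -79.37),
    (43.68, -79.29),
    (43.67, -79.30),
]

labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The label numbers themselves are arbitrary; only which points share a label is meaningful.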

In [34]:
# Add the cluster labels to Toronto_df3
Toronto_df3.insert(5, 'Cluster Labels', kmeans.labels_)
Toronto_df3.head()

| | Postal Code | Borough | Neighbourhood | Latitude | Longitude | Cluster Labels |
|---|---|---|---|---|---|---|
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.65426 | -79.360636 | 1 |
| 4 | M7A | Downtown Toronto | Queen's Park, Ontario Provincial Government | 43.662301 | -79.389494 | 1 |
| 9 | M5B | Downtown Toronto | Garden District, Ryerson | 43.657162 | -79.378937 | 1 |
| 15 | M5C | Downtown Toronto | St. James Town | 43.651494 | -79.375418 | 1 |
| 19 | M4E | East Toronto | The Beaches | 43.676357 | -79.293031 | 3 |


In [35]:
# Now, we will plot the clustered neighbourhoods
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(Toronto_df3['Latitude'], Toronto_df3['Longitude'], Toronto_df3['Neighbourhood'], Toronto_df3['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run 0..kclusters-1, so index directly
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

The map above shows the neighbourhoods grouped into 4 clusters, each drawn in a different color. Because the clustering uses only latitude and longitude, the clusters are purely geographic groupings.


Thanks a lot

Coded by: Ahmed B. Darwish