<h2>Segmenting and Clustering Neighbourhoods in Toronto</h2>





<h3>All three steps, <i>web scraping</i>, <i>cleaning</i> and <i>clustering</i>, are done in this single notebook so they are easy to follow in one place.</h3>

<h3>Installing and Importing dependencies</h3>

In [66]:
!pip install beautifulsoup4
!pip install lxml
import requests  # library to handle HTTP requests
import pandas as pd  # library for data analysis
import numpy as np  # library to handle data in a vectorized manner
import random  # library for random number generation

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim  # converts an address into latitude and longitude values

# libraries for displaying images and HTML
from IPython.display import Image, HTML, display_html

# transforms JSON into a pandas DataFrame
# (the older path pandas.io.json.json_normalize is deprecated)
from pandas import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium  # map plotting library
from bs4 import BeautifulSoup  # HTML parsing
from sklearn.cluster import KMeans  # k-means clustering
import matplotlib.cm as cm  # colormaps for the cluster markers
import matplotlib.colors as colors  # RGB to hex conversion

print('Everything has been installed and imported')



Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Everything has been installed and imported


<h3>Scraping the Wikipedia page for the table of Canadian postal codes with the BeautifulSoup library</h3>


In [70]:
# Download the page and parse it with the lxml parser
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')
print(soup.title)

# soup.table is the first <table> on the page; keep it as an HTML string
tab = str(soup.table)
display_html(tab, raw=True)

<title>List of postal codes of Canada: M - Wikipedia</title>


Postal Code,Borough,Neighborhood
M1A,Not assigned,Not assigned
M2A,Not assigned,Not assigned
M3A,North York,Parkwoods
M4A,North York,Victoria Village
M5A,Downtown Toronto,"Regent Park, Harbourfront"
M6A,North York,"Lawrence Manor, Lawrence Heights"
M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
M8A,Not assigned,Not assigned
M9A,Etobicoke,"Islington Avenue, Humber Valley Village"
M1B,Scarborough,"Malvern, Rouge"
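
The same table can also be read row by row instead of being rendered as HTML. The snippet below is only a minimal sketch: the inline `html` string is a hypothetical stand-in for the downloaded page, and it uses Python's built-in `html.parser` rather than `lxml` so that it runs without extra packages.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row excerpt standing in for the Wikipedia table
html = """
<table>
  <tr><th>Postal Code</th><th>Borough</th><th>Neighborhood</th></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Column names come from the <th> cells of the first row
header = [th.get_text(strip=True) for th in soup.table.find_all('th')]

# Every following row contributes one list of cell texts
rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in soup.table.find_all('tr')[1:]
]

print(header)
print(rows)
```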


<h3>The HTML table is converted to a pandas DataFrame for cleaning and easier inspection.</h3>

In [4]:
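from io import StringIO

# read_html returns a list of DataFrames, one per <table>; wrapping the string
# in StringIO avoids the deprecation warning for literal HTML in recent pandas
dfs = pd.read_html(StringIO(tab))
df = dfs[0]
df.head()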

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<h3>Data Cleaning</h3>

In [7]:
# Dropping the rows where Borough is 'Not assigned'
df1 = df[df.Borough != 'Not assigned']

# Combining the neighbourhoods that share the same postal code into one comma-separated string
df2 = df1.groupby(['Postal Code','Borough'], sort=False).agg(', '.join)
df2.reset_index(inplace=True)

# Replacing any neighbourhood that is 'Not assigned' with the name of its borough
df2['Neighborhood'] = np.where(df2['Neighborhood'] == 'Not assigned',df2['Borough'], df2['Neighborhood'])

df2

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
# Shape of data frame
df2.shape

(103, 3)
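
To make the three cleaning steps easier to check, here is a minimal sketch that applies them to a handful of made-up rows. The `raw` frame and its values are hypothetical and only mimic the layout of the scraped table.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the same three-column layout as the scraped table
raw = pd.DataFrame({
    'Postal Code': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Regent Park',
                     'Harbourfront', 'Not assigned'],
})

# 1. Drop rows whose borough is 'Not assigned'
assigned = raw[raw['Borough'] != 'Not assigned']

# 2. Join the neighbourhoods that share a postal code and borough
grouped = (
    assigned.groupby(['Postal Code', 'Borough'], sort=False)['Neighborhood']
    .agg(', '.join)
    .reset_index()
)

# 3. Fall back to the borough name where the neighbourhood is 'Not assigned'
grouped['Neighborhood'] = np.where(
    grouped['Neighborhood'] == 'Not assigned',
    grouped['Borough'],
    grouped['Neighborhood'],
)

print(grouped)
```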

<h3>Load the .csv file with the latitude and longitude of each postal code</h3>

In [9]:
lat_lon = pd.read_csv('https://cocl.us/Geospatial_data')
lat_lon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


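<h3>Merge the neighbourhood table with the latitudes and longitudes by concatenating the two frames side by side. Concatenation leaves two postal code columns, so we drop one.</h3>

In [49]:
# Concatenate column-wise. Rows are paired by index position, not by postal code,
# so this only lines up if both frames list the postal codes in the same order.
df3 = pd.concat([df2, lat_lon.reindex(df2.index)], axis=1)

# Keep the first 'Postal Code' column and drop the duplicate one
df4 = df3.loc[:, ~df3.columns.duplicated()]
df4.head()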



Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.806686,-79.194353
1,M4A,North York,Victoria Village,43.784535,-79.160497
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.770992,-79.216917
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476
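
Note that in the output above M3A carries the coordinates that the `lat_lon` preview lists for M1B, because concatenation pairs rows by position. A key-based join is one way to avoid this. The snippet below is a sketch on hypothetical frames (`hoods`, `coords`) with placeholder coordinates, not the notebook's data.

```python
import pandas as pd

# Hypothetical frames: same postal codes, listed in a different order
hoods = pd.DataFrame({
    'Postal Code': ['M3A', 'M1B'],
    'Borough': ['North York', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M3A'],
    'Latitude': [1.0, 2.0],     # placeholder values, not real coordinates
    'Longitude': [-1.0, -2.0],  # placeholder values, not real coordinates
})

# Positional concatenation: row 0 of hoods is paired with row 0 of coords
positional = pd.concat([hoods, coords.drop(columns='Postal Code')], axis=1)

# Key-based merge: each row is matched on its 'Postal Code' value
keyed = hoods.merge(coords, on='Postal Code', how='left')

print(positional)
print(keyed)
```

With `how='left'` the row order of the left frame is kept, so the result can replace the concatenated frame without further sorting.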


<h2>This part covers clustering and visualizing the neighbourhoods whose borough name contains "Toronto"</h2>

<h3> - Getting all the rows of the data frame whose Borough contains "Toronto".</h3>

In [50]:
# .copy() makes df5 an independent frame, so columns can be added to it later
df5 = df4[df4['Borough'].str.contains('Toronto',regex=False)].copy()
df5

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.692657,-79.264848
15,M5C,Downtown Toronto,St. James Town,43.799525,-79.318389
19,M4E,East Toronto,The Beaches,43.786947,-79.385975
20,M5E,Downtown Toronto,Berczy Park,43.75749,-79.374714
24,M5G,Downtown Toronto,Central Bay Street,43.782736,-79.442259
25,M6G,Downtown Toronto,Christie,43.753259,-79.329656
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.737473,-79.464763
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.739015,-79.506944


<h3>Visualizing all the Neighbourhoods of the above data frame using Folium</h3>

In [51]:
map_toronto = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# One marker per neighbourhood in the Toronto-only frame (df5)
for lat, lng, borough, neighbourhood in zip(df5['Latitude'], df5['Longitude'], df5['Borough'], df5['Neighborhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)
map_toronto

<h3>Finally, KMeans clustering for the neighborhoods</h3>

In [61]:
k=5
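# Cluster on the coordinates only
toronto_clustering = df5.drop(['Postal Code','Borough','Neighborhood'], axis=1)
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)
# Attach the labels to df5 so the map cell can colour each marker by cluster
df5.insert(0, 'Cluster Labels', kmeans.labels_)
kmeans.labels_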


array([4, 4, 4, 3, 3, 3, 3, 3, 1, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2,
       2, 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 1, 0, 0, 0, 0], dtype=int32)
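
As a small illustration of what the fit does, the sketch below runs KMeans on six hypothetical points that form two obvious groups and attaches the labels as a new column. The coordinates, `k=2` and `n_init=10` are assumptions chosen for the example, not values from this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical coordinates forming two well-separated groups
points = pd.DataFrame({
    'Latitude':  [43.60, 43.61, 43.62, 43.80, 43.81, 43.82],
    'Longitude': [-79.50, -79.51, -79.52, -79.20, -79.21, -79.22],
})

# Fit k-means on the two coordinate columns only
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(points[['Latitude', 'Longitude']])

# Attach the labels to a copy so the input frame stays unchanged
labelled = points.copy()
labelled.insert(0, 'Cluster Labels', model.labels_)

print(labelled)
```

The label numbers themselves are arbitrary; only the grouping they induce is meaningful.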

In [58]:
df5

Unnamed: 0,Cluster Labels,Postal Code,Borough,Neighborhood,Latitude,Longitude
2,2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.763573,-79.188711
4,2,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.773136,-79.239476
9,2,M5B,Downtown Toronto,"Garden District, Ryerson",43.692657,-79.264848
15,4,M5C,Downtown Toronto,St. James Town,43.799525,-79.318389
19,4,M4E,East Toronto,The Beaches,43.786947,-79.385975
20,4,M5E,Downtown Toronto,Berczy Park,43.75749,-79.374714
24,4,M5G,Downtown Toronto,Central Bay Street,43.782736,-79.442259
25,4,M6G,Downtown Toronto,Christie,43.753259,-79.329656
30,3,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.737473,-79.464763
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.739015,-79.506944


In [60]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=10)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighbourhood, cluster in zip(df5['Latitude'], df5['Longitude'], df5['Neighborhood'], df5['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters
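
The colour scheme can also be built directly from the number of clusters. This is a sketch of one possible simplification, not the code used above: `palette` and `color_for` are hypothetical names, and each label indexes the palette directly instead of through `cluster-1`.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters, as in the KMeans cell

# Sample k evenly spaced colours from the rainbow colormap and convert to hex
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

def color_for(cluster):
    """Return the hex colour for a cluster label in the range 0..k-1."""
    return palette[cluster]

print(palette)
```

Inside the marker loop, `color_for(cluster)` could then be passed as both `color` and `fill_color`.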