<H1><b>Segmenting and Clustering Neighborhoods in Toronto</b></H1>

### Table of Contents

<a href="#item1">1. Scraping the Wikipedia Table as DataFrame</a><br>
<a href="#item2">2. Use the csv file to create the following dataframe</a><br>
<a href="#item3">3. Explore and cluster the neighborhoods in Toronto</a><br>

Maps may not render on GitHub, so here is a link to the original notebook.

https://jupyterlab-8.labs.cognitiveclass.ai/hub/user-redirect/lab/tree/Capstone%20Project/Segmenting%20and%20Clustering%20Neighborhoods%20in%20Toronto-3.ipynb


In [18]:
#import libraries
%pip install beautifulsoup4
from bs4 import BeautifulSoup
%pip install lxml
import requests
import pandas as pd
import numpy as np


Note: you may need to restart the kernel to use updated packages.


## 1. Scraping the Wikipedia Table as DataFrame<a class="anchor" id="item1"></a>


In [19]:
# The dataframe will consist of three columns: PostalCode, Borough, and Neighbourhood
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(source, 'lxml')  # parsed page; not used further, since pd.read_html extracts the table directly
df = pd.read_html(source)[0]  # first table on the page
df = df.rename(columns={'Postal Code': 'PostalCode'})  # rename the column to PostalCode
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


<H3>...Only process the cells that have an assigned borough. Ignore cells with a borough that is Not assigned.</H3>

In [20]:
df=df[df.Borough != 'Not assigned']
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


<H3>...These two rows will be combined into one row, with the neighborhoods separated by a comma, as shown in row 11 in the above table.</H3>

In [21]:
df=df.groupby(['PostalCode','Borough'], sort=False).agg(','.join)
df.reset_index(inplace=True)
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
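
The first five rows above happen not to contain a postal code that was split over several rows, so here is a small sketch of the combining step on made-up data. The frame `toy` and its rows are assumptions for illustration only, and the sketch joins just the `Neighbourhood` column rather than aggregating the whole grouped frame.

```python
import pandas as pd

# Hypothetical toy data: postal code M5A appears on two separate rows.
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5A', 'M4A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Victoria Village'],
})

# Group rows that share a postal code and borough, then join their
# neighbourhood names into one comma-separated string.
# sort=False keeps the groups in order of first appearance.
combined = (
    toy.groupby(['PostalCode', 'Borough'], sort=False)['Neighbourhood']
       .agg(', '.join)
       .reset_index()
)
print(combined)
```

The two `M5A` rows should end up as a single row whose `Neighbourhood` value lists both names.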


<H3>...If a cell has a borough but a Not assigned neighborhood, then the neighborhood will be the same as the borough.</H3>

In [22]:
df['Neighbourhood']=np.where(df['Neighbourhood']=='Not assigned',df['Borough'],df['Neighbourhood'])
df.head()

Unnamed: 0,PostalCode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
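
None of the rows shown above has an unassigned neighbourhood, so the effect of the `np.where` call is not visible in the output. A minimal sketch on made-up rows (the frame `toy` is an assumption, not part of the scraped table) shows the fallback:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the first row has a borough but no neighbourhood.
toy = pd.DataFrame({
    'Borough': ["Queen's Park", 'North York'],
    'Neighbourhood': ['Not assigned', 'Parkwoods'],
})

# Where the neighbourhood is 'Not assigned', take the borough name;
# otherwise keep the existing neighbourhood.
toy['Neighbourhood'] = np.where(
    toy['Neighbourhood'] == 'Not assigned',
    toy['Borough'],
    toy['Neighbourhood'],
)
print(toy)
```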


<H3>...Use the .shape attribute to print the number of rows of your dataframe</H3>

In [23]:
df.shape

(103, 3)

## 2. Use the csv file to create the following dataframe <a class="anchor" id="item2"></a>

In [24]:
codf=pd.read_csv('https://cocl.us/Geospatial_data') #reading csv file from url
codf=codf.rename(columns={'Postal Code':'PostalCode'}) #Changing Column Names
codf #showing csv file as dataframe

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [25]:
df = pd.merge(df, codf, on='PostalCode')  # adding Latitude and Longitude data to main dataframe
df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509
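
Because `pd.merge` performs an inner join by default, a postal code that is missing from the coordinates file would be dropped without any warning. One possible way to check for this is sketched below on made-up frames; `left`, `right`, and the postal code `M9Z` are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy frames: M9Z has no entry in the coordinates table.
left = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A', 'M9Z'],
    'Borough': ['North York', 'North York', 'Etobicoke'],
})
right = pd.DataFrame({
    'PostalCode': ['M3A', 'M4A'],
    'Latitude': [43.753259, 43.725882],
    'Longitude': [-79.329656, -79.315572],
})

# A left join keeps every row of `left`; indicator=True adds a `_merge`
# column saying where each row came from, and validate raises an error
# if a postal code is duplicated on either side.
merged = pd.merge(left, right, on='PostalCode', how='left',
                  validate='one_to_one', indicator=True)

# Postal codes that found no coordinates.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list on the real data would suggest that the inner join did not lose any postal codes.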


## 3. Explore and cluster the neighborhoods in Toronto <a class="anchor" id="item3"></a>

<h3>Retrieve rows that contain "Toronto" in the Borough column</h3>

In [26]:
df=df[df['Borough'].str.contains('Toronto',regex=False)]
df

Unnamed: 0,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
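
The filter keeps any borough whose name contains the literal text "Toronto", so boroughs such as North York or Etobicoke are excluded. A small sketch on made-up values (the frame `toy` is an assumption) also passes `na=False`, which treats a missing borough as a non-match instead of producing a missing value in the mask:

```python
import pandas as pd

# Hypothetical toy data, including one missing borough.
toy = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'North York', 'East Toronto', None, 'Etobicoke'],
})

# regex=False matches the plain substring; na=False maps missing values to False.
mask = toy['Borough'].str.contains('Toronto', regex=False, na=False)
toronto_only = toy[mask]
print(toronto_only)
```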


<h4>Getting Toronto's Coordinates using Geopy & Geolocator</h4>

In [27]:
%pip install geopy
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address='Toronto,CA'
geolocator=Nominatim(user_agent="Toronto_explorer")
location=geolocator.geocode(address)
to_latitude = location.latitude
to_longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(to_latitude, to_longitude))


Note: you may need to restart the kernel to use updated packages.
The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<h4>Create a map of Toronto with neighborhoods superimposed on top.</h4>

In [28]:
import folium
print('folium imported!')
# Center the map on the to_latitude and to_longitude values found above.
toronto_map = folium.Map(location=[to_latitude, to_longitude], zoom_start=12)
# Add one circle marker per postal code, labelled with its neighbourhood and borough.
for lat, lng, borough, neighbourhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7.5,
        popup=label,
        color='gray',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.75).add_to(toronto_map)
toronto_map

folium imported!


<H2>Using K Means Clustering for Neighbourhoods of Toronto</H2>

In [29]:
from sklearn.cluster import KMeans  # k-means implementation from scikit-learn

kclusters = 5  # set number of clusters

# keep only Latitude and Longitude as clustering features
toronto_clustered = df.drop(columns=['PostalCode', 'Borough', 'Neighbourhood']).reset_index(drop=True)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustered)

# add the cluster label of each row as the first column
df.insert(0, 'Cluster Labels', kmeans.labels_)
df.head(10)

Unnamed: 0,Cluster Labels,PostalCode,Borough,Neighbourhood,Latitude,Longitude
2,0,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
4,0,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
9,0,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
15,0,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
19,4,M4E,East Toronto,The Beaches,43.676357,-79.293031
20,0,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306
24,0,M5G,Downtown Toronto,Central Bay Street,43.657952,-79.387383
25,3,M6G,Downtown Toronto,Christie,43.669542,-79.422564
30,0,M5H,Downtown Toronto,"Richmond, Adelaide, King",43.650571,-79.384568
31,1,M6H,West Toronto,"Dufferin, Dovercourt Village",43.669005,-79.442259
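
The clustering here uses only latitude and longitude, so the labels group postal codes by geographic proximity. A minimal sketch with five of the coordinates listed above and two clusters illustrates the idea; the frame `toy`, the choice of two clusters, and the explicit `n_init=10` are assumptions made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Five rows taken from the table above: three downtown codes (M5x)
# and two codes further west (M6x).
toy = pd.DataFrame({
    'PostalCode': ['M5A', 'M5B', 'M5C', 'M6G', 'M6H'],
    'Latitude': [43.654260, 43.657162, 43.651494, 43.669542, 43.669005],
    'Longitude': [-79.360636, -79.378937, -79.375418, -79.422564, -79.442259],
})

# Select the feature columns by name instead of dropping the others.
features = toy[['Latitude', 'Longitude']]

# n_init is set explicitly because its default differs between scikit-learn versions.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# labels_ is ordered like the input rows, so it can be assigned directly.
toy['Cluster Labels'] = model.labels_
print(toy)
```

The three `M5x` codes should share one label and the two `M6x` codes the other; which group receives label 0 is arbitrary.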


<h3>Creating Clustered Map</h3>

In [36]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors


# create map
map_clusters = folium.Map(location=[to_latitude,to_longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df['Latitude'], df['Longitude'], df['Neighbourhood'], df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # cluster labels run from 0 to kclusters - 1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters
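
The color setup can also be looked at on its own, without the map. The sketch below samples the rainbow colormap once per cluster and converts each sample to a hex string; the names `palette` and `color_for` are assumptions introduced for this illustration.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 5  # same number of clusters as in the k-means step

# Sample the rainbow colormap at kclusters evenly spaced points and
# convert each RGBA tuple to a hex string such as '#rrggbb'.
palette = [colors.to_hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label (0 .. kclusters - 1) to its own color.
color_for = {label: palette[label] for label in range(kclusters)}
print(color_for)
```

Each label then has a distinct color that can be passed to `color` and `fill_color` of a marker.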