# Applied Data Science Capstone project

This notebook will be used for the Applied Data Science Capstone project.

In [1]:
import pandas as pd
import numpy as np

print("Hello Capstone Project Course!")

Hello Capstone Project Course!


# Week 3 of the Capstone project - Segmenting and Clustering Neighborhoods in Toronto

# Description of the assignment

In this assignment, you will be required to explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information. However, unlike New York, the neighborhood data is not readily available on the internet. What is interesting about the field of data science is that each project can be challenging in its own unique way, so you need to learn to be agile and refine the skill of learning new libraries and tools quickly depending on the project.

For the Toronto neighborhood data, a Wikipedia page exists that has all the information we need to explore and cluster the neighborhoods in Toronto. You will be required to scrape the Wikipedia page and wrangle the data, clean it, and then read it into a pandas dataframe so that it is in a structured format like the New York dataset.

Once the data is in a structured format, you can replicate the analysis that we did to the New York City dataset to explore and cluster the neighborhoods in the city of Toronto.

# 1. Web scraping Wikipedia content

In [42]:
!pip install folium
import pandas as pd
import requests
from sklearn.cluster import KMeans
import numpy as np
import folium 
from IPython.display import HTML, display
import json
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
# download the Wikipedia page that lists the postal codes starting with M
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table_contents=[]
table=soup.find('table')
# each <td> cell holds one postal code together with its borough and neighborhoods
for row in table.find_all('td'):
    cell = {}
    if row.span.text=='Not assigned':
        # skip postal codes that have no borough assigned
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]
        cell['Borough'] = (row.span.text).split('(')[0]
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

#print(table_contents)
df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
df.head(110)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [43]:
df.shape

(103, 3)
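
The one-line string handling in the scraping loop is easier to follow on a plain string. As a minimal sketch, the helper below applies the same chain of `split`, `strip`, and `replace` calls outside the loop. The name `split_cell` is hypothetical and not part of the notebook, and the sample strings are assumptions about the shape of the cell text, reconstructed from the output rows above rather than copied from the Wikipedia page.

```python
def split_cell(span_text):
    """Split text shaped like 'Borough(Name / Name)' into borough and neighborhood."""
    parts = span_text.split('(')
    borough = parts[0]
    # drop the closing parenthesis and turn ' /' separators into commas
    neighborhood = parts[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    return borough, neighborhood


# assumed sample inputs, modeled on the rows shown in the output above
samples = [
    'North York(Parkwoods)',
    'Downtown Toronto(Regent Park / Harbourfront)',
]
for text in samples:
    print(split_cell(text))
```

Text without an opening parenthesis would raise an `IndexError` in this sketch, so real input may need an extra check.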

# 2. Adding geographical data to the dataframe

For the second section of the task, the Geospatial Coordinates CSV dataset is loaded into a separate dataframe.

In [None]:
df1= pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv")
df1.head(110)

Both dataframes (the initial web-scraped dataframe and the dataframe with the CSV data) contain a postal code column, but under slightly different names. The column in the CSV dataframe is renamed so that both use the same name, "PostalCode".

In [None]:
df1.rename(columns={'Postal Code':'PostalCode'},inplace=True)
df1.head(10)

Now that the postal code column has the same name in both dataframes, they are merged into a new dataframe.

In [None]:
df2 = pd.merge(df1,df,on='PostalCode')
df2.head(10)
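
The rename and merge steps can be tried on a toy example first. This is a minimal sketch: the two small dataframes below are built by hand from rows shown elsewhere in this notebook, and the names `coords`, `scraped`, and `merged` are illustrative. It shows that `pd.merge` with its default inner join keeps only the postal codes present in both dataframes.

```python
import pandas as pd

# stand-in for the CSV data (column is named 'Postal Code')
coords = pd.DataFrame({
    'Postal Code': ['M4E', 'M4N'],
    'Latitude': [43.676357, 43.728020],
    'Longitude': [-79.293031, -79.388790],
})

# stand-in for the scraped data (column is named 'PostalCode')
scraped = pd.DataFrame({
    'PostalCode': ['M3A', 'M4E'],
    'Borough': ['North York', 'East Toronto'],
    'Neighborhood': ['Parkwoods', 'The Beaches'],
})

# give both dataframes the same key column name, then join on it
coords = coords.rename(columns={'Postal Code': 'PostalCode'})
merged = pd.merge(coords, scraped, on='PostalCode')
print(merged)
```

Only M4E survives here because it is the only code found on both sides. Passing `how='outer'` together with `indicator=True` is one way to see which codes failed to match.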

In [None]:
#keep only the boroughs whose name contains "Toronto"; copy so that df3 can be modified later
df3 = df2[df2['Borough'].str.contains('Toronto', regex=False)].copy()
df3.shape

# 3. Exploring and clustering the neighborhoods 

The third part of the assignment is based on the code from the lab on segmenting and clustering neighborhoods in NYC in the same course.


In [None]:
#create a map of Toronto using the latitude and longitude values
map_toronto = folium.Map(location=[43.651070,-79.347015], zoom_start=15)
#add a marker to the map for each neighborhood
for lat, lng, borough, neighborhood in zip(df3['Latitude'], df3['Longitude'], df3['Borough'], df3['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)
map_toronto

K-means clustering is run on the latitude and longitude values to cluster the neighborhoods.

In [45]:
#set number of clusters
k=5
#cluster on the coordinates only
toronto_clustering = df3[['Latitude','Longitude']]
#run k-means clustering
kmeans = KMeans(n_clusters = k,random_state=0).fit(toronto_clustering)
#drop labels left over from an earlier run so that the cell can be re-executed
df3 = df3.drop(columns='Cluster Labels', errors='ignore')
df3.insert(0, 'Cluster Labels', kmeans.labels_)

In [46]:
df3

Unnamed: 0,Cluster Labels,PostalCode,Latitude,Longitude,Borough,Neighborhood
37,4,M4E,43.676357,-79.293031,East Toronto,The Beaches
40,4,M4J,43.685347,-79.338106,East York/East Toronto,The Danforth East
41,4,M4K,43.679557,-79.352188,East Toronto,"The Danforth West, Riverdale"
42,4,M4L,43.668999,-79.315572,East Toronto,"India Bazaar, The Beaches West"
43,4,M4M,43.659526,-79.340923,East Toronto,Studio District
44,2,M4N,43.72802,-79.38879,Central Toronto,Lawrence Park
45,2,M4P,43.712751,-79.390197,Central Toronto,Davisville North
46,2,M4R,43.715383,-79.405678,Central Toronto,North Toronto West
47,2,M4S,43.704324,-79.38879,Central Toronto,Davisville
48,2,M4T,43.689574,-79.38316,Central Toronto,"Moore Park, Summerhill East"
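
To see what the clustering step does on a small scale, the sketch below runs `KMeans` on six of the coordinate pairs listed in this notebook. It is illustrative only: `n_clusters=2` and `n_init=10` are assumptions chosen for this toy example, not the settings used above.

```python
import numpy as np
from sklearn.cluster import KMeans

# latitude/longitude pairs taken from rows of this notebook
coords = np.array([
    [43.676357, -79.293031],  # M4E, East Toronto
    [43.685347, -79.338106],  # M4J, East York/East Toronto
    [43.668999, -79.315572],  # M4L, East Toronto
    [43.728020, -79.388790],  # M4N, Central Toronto
    [43.712751, -79.390197],  # M4P, Central Toronto
    [43.715383, -79.405678],  # M4R, Central Toronto
])

# two clusters are assumed here because the sample has two obvious groups
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = model.labels_

print(labels)                  # one cluster label per row
print(model.cluster_centers_)  # mean latitude/longitude of each cluster
```

The label numbers themselves are arbitrary. What matters is which rows share a label.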


The clusters are visualized on the map.

In [47]:
# create map
map_clusters = folium.Map(location=[43.651070,-79.347015],zoom_start=15)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, neighborhood, cluster in zip(df3['Latitude'], df3['Longitude'], df3['Neighborhood'], df3['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
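
The color assignment for the markers can be checked without drawing a map. As a minimal sketch, the code below builds one hex color per cluster from matplotlib's rainbow colormap and looks up a color for each label. Since k-means labels run from 0 to k-1, one option is to index the palette with the label directly. The names `palette` and `marker_colors` and the sample labels are hypothetical.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # number of clusters, as in the clustering cell

# one evenly spaced rainbow color per cluster, converted to hex strings
palette = [colors.rgb2hex(c) for c in cm.rainbow(np.linspace(0, 1, k))]

# hypothetical sample of cluster labels in the range 0..k-1
cluster_labels = [4, 4, 2, 2, 0]
marker_colors = [palette[label] for label in cluster_labels]

print(palette)
print(marker_colors)
```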