# Applied Data Science Capstone

# Segmenting and Clustering Neighborhoods in Toronto

I will be exploring and clustering the neighborhoods in Toronto.

1. I will first scrape the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to obtain its table of postal codes, then load that data into a pandas dataframe.

In [43]:
#Step 1: Importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [44]:
#Step 2: Scraping Wikipedia page to obtain postal codes data
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(source,'lxml')
table=soup.find('table')

In [45]:
#Step 3: Creating an empty dataframe with the three columns we want: Postalcode, Borough, and Neighborhood
column_names = ['Postalcode','Borough','Neighborhood']
df = pd.DataFrame(columns = column_names)

In [46]:
#Step 4: Searching for all the postcode, borough, neighborhood data
for tr_cell in table.find_all('tr'):
    row_data=[]
    for td_cell in tr_cell.find_all('td'):
        row_data.append(td_cell.text.strip())
    if len(row_data)==3:
        df.loc[len(df)] = row_data

In [47]:
#displaying data
df.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
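
The row-extraction loop above can also be written so that rows are collected in a plain list and the dataframe is built once at the end, which avoids growing it one `df.loc` assignment at a time. The sketch below is a minimal illustration under stated assumptions: `sample_html` is a made-up three-row stand-in for the Wikipedia table, `table_to_frame` is a hypothetical helper name, and the built-in `html.parser` is used instead of `lxml` so the example has no extra dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical stand-in for the Wikipedia table; not the real page content.
sample_html = """
<table>
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""

def table_to_frame(html, column_names):
    """Build a DataFrame from every <tr> that has one <td> per column."""
    soup = BeautifulSoup(html, "html.parser")  # built-in parser, no lxml needed
    table = soup.find("table")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row only has <th> cells, so it yields no <td> and is skipped.
        if len(cells) == len(column_names):
            rows.append(cells)
    return pd.DataFrame(rows, columns=column_names)

sample_df = table_to_frame(sample_html, ["Postalcode", "Borough", "Neighborhood"])
print(sample_df)
```

The length check plays the same role as `len(row_data)==3` in the notebook: it keeps only rows that carry all three values.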


Data Cleaning:

In [48]:
#Step 5: Removing data where Borough is "Not assigned"
df = df[df.Borough != 'Not assigned']

In [49]:
#Step 6: Combining neighbourhoods with the same postalcode

df1=df.groupby('Postalcode')['Neighborhood'].apply(', '.join)
df1=df1.reset_index(drop=False)
df1.rename(columns={'Neighborhood':'Neighborhoods'},inplace=True)

df2 = pd.merge(df, df1, on='Postalcode')
df2.drop(['Neighborhood'],axis=1,inplace=True)
df2.drop_duplicates(inplace=True)
df2.rename(columns={'Neighborhoods':'Neighborhood'},inplace=True)

In [50]:
df2.head()

Unnamed: 0,Postalcode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [51]:
#shape of dataframe
df2.shape

(103, 3)
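
Combining neighbourhoods that share a postal code can also be done in a single grouping step, without the merge and de-duplication pass. This is a sketch of that alternative on a hypothetical three-row frame (`toy` and `combined` are illustrative names); it assumes each postal code belongs to exactly one borough, which is why `Borough` can be used as a second grouping key.

```python
import pandas as pd

# Hypothetical rows: two neighbourhoods share postal code M5A.
toy = pd.DataFrame({
    "Postalcode": ["M3A", "M5A", "M5A"],
    "Borough": ["North York", "Downtown Toronto", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Regent Park", "Harbourfront"],
})

# Group on both key columns so Borough survives, then join the names per group.
# sort=False keeps postal codes in their order of first appearance.
combined = (
    toy.groupby(["Postalcode", "Borough"], sort=False)["Neighborhood"]
       .agg(", ".join)
       .reset_index()
)
print(combined)
```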

2. To use the Foursquare location data, I now need the latitude and longitude coordinates of each neighborhood.

In [52]:
#Step 1: Importing CSV with the latitudes and longitudes
latlon = pd.read_csv('https://cocl.us/Geospatial_data')
latlon.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [53]:
#Step 2: Merging the coordinates with the Toronto neighbourhood data on Postalcode
latlon.rename(columns={'Postal Code':'Postalcode'},inplace=True)
dfm = pd.merge(latlon, df2, on='Postalcode')
dfm.head()

Unnamed: 0,Postalcode,Latitude,Longitude,Borough,Neighborhood
0,M1B,43.806686,-79.194353,Scarborough,"Malvern, Rouge"
1,M1C,43.784535,-79.160497,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,43.763573,-79.188711,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,43.770992,-79.216917,Scarborough,Woburn
4,M1H,43.773136,-79.239476,Scarborough,Cedarbrae
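
Because the default `pd.merge` is an inner join, any postal code present in only one of the two tables is dropped silently. A quick way to check for that is sketched below on hypothetical miniature tables: `coords` and `areas` are illustrative names, and the `M9Z` row with its coordinates is invented purely to show an unmatched key.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; M9Z is a made-up code.
coords = pd.DataFrame({
    "Postalcode": ["M1B", "M1C", "M9Z"],
    "Latitude": [43.806686, 43.784535, 43.0],
    "Longitude": [-79.194353, -79.160497, -79.0],
})
areas = pd.DataFrame({
    "Postalcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
    "Neighborhood": ["Malvern, Rouge", "Rouge Hill, Port Union, Highland Creek"],
})

# how="outer" keeps unmatched rows, indicator=True adds a "_merge" column saying
# where each row came from, and validate raises if a postal code repeats on either side.
checked = pd.merge(coords, areas, on="Postalcode", how="outer",
                   validate="one_to_one", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Postalcode", "_merge"]])
```

An empty `unmatched` frame would indicate that the inner join loses nothing.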


In [54]:
#reordering dataframe
dfm1=dfm[['Postalcode','Borough','Neighborhood','Latitude','Longitude']]
dfm1.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


3. Exploring and clustering the neighborhoods in Toronto.

In [55]:
#Step 1: Keeping only the rows whose Borough contains "Toronto"
#(.copy() gives an independent frame, so adding a column later does not raise SettingWithCopyWarning)
df3 = dfm1[dfm1['Borough'].str.contains('Toronto',regex=False)].copy()
df3.head()

Unnamed: 0,Postalcode,Borough,Neighborhood,Latitude,Longitude
37,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,M4M,East Toronto,Studio District,43.659526,-79.340923
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


In [60]:
#Step 2: Importing folium & visualizing the data
import folium
map1 = folium.Map(location=[43.651070,-79.347015],zoom_start=11)
for lat,lng,borough,neighborhood in zip(df3['Latitude'],df3['Longitude'],df3['Borough'],df3['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map1)
map1

Step 3: Using KMeans clustering to cluster Toronto neighborhoods

In [57]:
#importing KMeans for clustering, plus matplotlib and numpy for the cluster color scheme
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np

In [58]:
#number of clusters
k=5
torontoclustering = df3.drop(['Postalcode','Borough','Neighborhood'],axis=1)

#running kmeans clustering 
kmeans = KMeans(n_clusters = k,random_state=0).fit(torontoclustering)

#adding the cluster labels as the first column of the dataframe
df3.insert(0, 'Cluster Labels', kmeans.labels_)
df3

Unnamed: 0,Cluster Labels,Postalcode,Borough,Neighborhood,Latitude,Longitude
37,0,M4E,East Toronto,The Beaches,43.676357,-79.293031
41,0,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
42,0,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
43,0,M4M,East Toronto,Studio District,43.659526,-79.340923
44,1,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
45,1,M4P,Central Toronto,Davisville North,43.712751,-79.390197
46,1,M4R,Central Toronto,"North Toronto West, Lawrence Park",43.715383,-79.405678
47,1,M4S,Central Toronto,Davisville,43.704324,-79.38879
48,1,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316
49,1,M4V,Central Toronto,"Summerhill West, Rathnelly, South Hill, Forest...",43.686412,-79.400049
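
The value k=5 is fixed by hand above. One common way to sanity-check such a choice is to fit KMeans for several values of k and compare the inertia (the within-cluster sum of squared distances), looking for the point where adding clusters stops helping much. The sketch below only illustrates that idea and is not necessarily how k was chosen here: `points` is synthetic data made of three tight groups, not the real Toronto coordinates, and `n_init=10` is set explicitly as an assumption so that the result does not depend on the library's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical coordinates: three tight groups of points, not real Toronto data.
rng = np.random.default_rng(0)
centers = np.array([[43.65, -79.38], [43.70, -79.40], [43.67, -79.30]])
points = np.vstack([c + rng.normal(scale=0.003, size=(10, 2)) for c in centers])

# Fit one model per candidate k and record its inertia.
inertias = {}
for n_clusters in range(1, 6):
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    inertias[n_clusters] = model.inertia_

for n_clusters, inertia in inertias.items():
    print(n_clusters, inertia)
```

On data like this the inertia falls sharply up to three clusters and only slowly afterwards. Note that clustering directly on latitude and longitude treats degrees as plain Euclidean coordinates, which is a reasonable approximation over an area as small as one city.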


Creating cluster map

In [59]:
mapclusters = folium.Map(location=[43.651070,-79.347015],zoom_start=11)

#Adjusting color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#Adding markers to the map
for lat, lon, neighbourhood, cluster in zip(df3['Latitude'], df3['Longitude'], df3['Neighborhood'], df3['Cluster Labels']):
    label = folium.Popup(' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(mapclusters)
       
mapclusters
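
The marker loop indexes the color list with `cluster-1`. Since KMeans labels start at 0, label 0 becomes index -1, which Python resolves to the last color; every cluster still gets a distinct color, but the mapping is shifted by one. A more direct variant, sketched here with the hypothetical name `palette`, samples one color per cluster and indexes it with the label itself.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 5  # same number of clusters as in the notebook

# Sample k evenly spaced colors from the rainbow colormap and convert them to hex.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# KMeans labels are 0-based, so a label can index the palette directly.
for label in range(k):
    print(label, palette[label])
```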