<h1>Capstone Project for Coursera's Applied Data Science</h1>

This is a Jupyter notebook for clustering neighborhoods in Toronto.

In [1]:
import pandas as pd
import numpy as np
import os, requests
from bs4 import BeautifulSoup as BS
fs_id, fs_secret = os.environ['FOURSQUARE_ID'], os.environ['FOURSQUARE_SECRET']
version = "20201231"

<h3>Scraping Toronto's Neighborhoods</h3>
As described in the instructions, we will scrape neighborhood data from Wikipedia. First, fetch the page and parse its HTML with Beautiful Soup:

In [2]:
toronto_neighbs_url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
wiki_raw = requests.get(toronto_neighbs_url)
wiki = BS(wiki_raw.content, 'lxml')

Next, extract the needed data:

In [5]:
table_body = wiki.find('tbody')
table_rows = table_body.find_all('tr')
results = [[cell.text.strip() for cell in row.find_all('td')] for row in table_rows[1:] ]
df = pd.DataFrame(results,columns=["PostalCode", "Borough", "Neighborhood"])
df.head()


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
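To make the row extraction easier to follow, here is a minimal sketch on a tiny, made-up HTML table shaped like the Wikipedia one. The `html` string and its contents are illustrative assumptions, and the built-in `html.parser` is used so the sketch does not need `lxml`. It shows why the header row needs special handling: it contains only `<th>` cells, so `find_all('td')` returns an empty list for it.

```python
from bs4 import BeautifulSoup

# A tiny made-up table in the same shape as the Wikipedia one (illustrative only).
html = """
<table><tbody>
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td> Not assigned </td></tr>
<tr><td>M3A</td><td>North York</td><td> Parkwoods </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("tbody").find_all("tr"):
    # Strip surrounding whitespace from every data cell in the row.
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)
```

Skipping empty rows is an alternative to slicing off the first row by position, and it also tolerates a table without a header.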


Next, following the directions, drop the rows whose borough is not assigned, replace unassigned neighborhoods with the borough name, and join the neighborhoods that share a postal code:

In [12]:
df = df[df.Borough != 'Not assigned'].copy()
# Where the neighborhood is unassigned, fall back to the borough name.
# .loc assigns in place; chained indexing (df[mask][col] = ...) would write to a temporary copy.
mask = df.Neighborhood == 'Not assigned'
df.loc[mask, 'Neighborhood'] = df.loc[mask, 'Borough']
df = df.groupby(['PostalCode','Borough']).agg({'Neighborhood':lambda x: ', '.join(x)}).reset_index()
df.head()


| | PostalCode | Borough | Neighborhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union |
| 2 | M1E | Scarborough | Guildwood]], Morningside, West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
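The three cleaning steps are easier to check on a small example. The sketch below applies them to a hand-made frame; the `toy` rows are illustrative assumptions, not scraped data.

```python
import pandas as pd

# Hand-made rows for illustration only.
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# 1. Drop rows with no borough.
toy = toy[toy["Borough"] != "Not assigned"].copy()
# 2. Fall back to the borough name where the neighborhood is missing.
missing = toy["Neighborhood"] == "Not assigned"
toy.loc[missing, "Neighborhood"] = toy.loc[missing, "Borough"]
# 3. Keep one row per postal code, with its neighborhoods joined by commas.
toy = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
```

After these steps `toy` has two rows: the two M5A neighborhoods are merged into one string, and M7A takes its borough name as its neighborhood.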


Finally, get the shape of the scraped and munged dataset:

In [13]:
df.shape

(103, 3)

In [14]:
#!pip install geocoder
import geocoder

def getCoors(postal_code):
    """Return (latitude, longitude) for a Toronto postal code."""
    lat_lng_coords = None
    # The geocoder sometimes returns no result, so retry until it does.
    # Note: this loops forever if the service never answers.
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords[0], lat_lng_coords[1]
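Because an unbounded retry loop can hang when the geocoding service never responds, one possible variant caps the number of attempts. This is only a sketch: `get_coords_with_retry`, its `lookup` argument, `max_tries`, and `delay` are hypothetical names introduced here, and `flaky_lookup` is a stand-in that simulates two failed calls rather than contacting a real service.

```python
import time

def get_coords_with_retry(postal_code, lookup, max_tries=5, delay=0.0):
    """Call lookup(postal_code) until it returns coordinates or tries run out.

    lookup is a hypothetical callable returning [lat, lng] or None.
    """
    for attempt in range(1, max_tries + 1):
        coords = lookup(postal_code)
        if coords is not None:
            return coords[0], coords[1]
        if attempt < max_tries:
            time.sleep(delay)  # pause before the next attempt
    raise RuntimeError(f"no coordinates for {postal_code} after {max_tries} tries")

# Stand-in for the real geocoder: fails twice, then succeeds (made-up values).
calls = {"n": 0}

def flaky_lookup(code):
    calls["n"] += 1
    return None if calls["n"] < 3 else [43.8, -79.2]

result = get_coords_with_retry("M1B", flaky_lookup)
```

In the notebook, `lookup` could wrap the same call used in `getCoors`, for example `lambda code: geocoder.google('{}, Toronto, Ontario'.format(code)).latlng`.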

In [15]:
df['Latitude'] = np.nan
df['Longtitude'] = np.nan
df.head()

| | PostalCode | Borough | Neighborhood | Latitude | Longtitude |
|---|---|---|---|---|---|
| 0 | M1B | Scarborough | Rouge, Malvern | NaN | NaN |
| 1 | M1C | Scarborough | Highland Creek, Rouge Hill, Port Union | NaN | NaN |
| 2 | M1E | Scarborough | Guildwood]], Morningside, West Hill | NaN | NaN |
| 3 | M1G | Scarborough | Woburn | NaN | NaN |
| 4 | M1H | Scarborough | Cedarbrae | NaN | NaN |


After adding empty latitude and longitude columns, loop over the postal codes and fill in the coordinates for each one:

In [16]:
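import tqdm
# look up the column positions instead of hard-coding them
lat_col = df.columns.get_loc('Latitude')
lon_col = df.columns.get_loc('Longtitude')
for i, code in enumerate(tqdm.tqdm(df.PostalCode)):
    lat, lon = getCoors(code)
    df.iloc[i, lat_col] = lat
    df.iloc[i, lon_col] = lon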

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103/103 [02:28<00:00,  1.44s/it]
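As an alternative to writing cell by cell, the coordinates can be collected first and assigned as whole columns. The sketch below assumes a hypothetical `lookup` function backed by made-up values in `fake_coords`, standing in for the geocoding call.

```python
import pandas as pd

# Hypothetical stand-in for the geocoding call; the values are made up.
fake_coords = {"M1B": (43.81, -79.19), "M1C": (43.78, -79.16)}

def lookup(code):
    return fake_coords[code]

toy = pd.DataFrame({"PostalCode": ["M1B", "M1C"]})

# Build one (lat, lon) pair per row, then split the pairs into two columns.
pairs = [lookup(code) for code in toy["PostalCode"]]
lats, lons = zip(*pairs)
toy["Latitude"] = list(lats)
toy["Longtitude"] = list(lons)
```

This avoids pre-filling the columns with NaN and keeps the slow lookups separate from the DataFrame updates.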


In [9]:
df.tail()

| | PostalCode | Borough | Neighborhood | Latitude | Longtitude |
|---|---|---|---|---|---|
| 33 | M6K | West Toronto | Brockton, Exhibition Place, Parkdale Village | 43.636847 | -79.428191 |
| 34 | M6P | West Toronto | High Park, The Junction South | 43.661608 | -79.464763 |
| 35 | M6R | West Toronto | Parkdale, Roncesvalles | 43.64896 | -79.456325 |
| 36 | M6S | West Toronto | Runnymede, Swansea | 43.651571 | -79.48445 |
| 37 | M7Y | East Toronto | Business reply mail Processing Centre969 Eastern | 43.662744 | -79.321558 |


Now, compute the mean latitude and longitude of these neighborhoods, which will serve as the center of the map:

In [19]:
c_lat = df.Latitude.mean()
c_lon = df.Longtitude.mean()
c_lat,c_lon

(43.704607733980588, -79.397152911650466)

In [25]:
#!pip install folium
from sklearn.cluster import KMeans
import folium # map rendering library

kclusters = 5
# cluster on the coordinate columns only
toronto_grouped_clustering = df.drop(columns=['Neighborhood','Borough','PostalCode'])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)
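To see what k-means does with coordinate pairs, here is a small sketch on four made-up points forming two obvious groups. The `coords` values are illustrative assumptions, and `n_init=10` is set explicitly here. Note that k-means uses Euclidean distance on the raw latitude and longitude values, which is only a rough approximation of distance on the ground.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up coordinates: two points near one spot, two near another.
coords = np.array([
    [43.65, -79.38],
    [43.66, -79.39],
    [43.80, -79.20],
    [43.81, -79.21],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = model.labels_            # one cluster index per input row
centers = model.cluster_centers_  # one (lat, lon) centroid per cluster
```

The first two points share one label and the last two share the other. Which group gets label 0 is arbitrary, so downstream code should not attach meaning to the label values themselves.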


In [41]:
import matplotlib.cm as cm
import matplotlib.colors as colors

df['Cluster Labels'] = kmeans.labels_
map_clusters = folium.Map(location=[c_lat, c_lon], zoom_start=10)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(df['Latitude'], df['Longtitude'], df['Neighborhood'], df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' (Cluster: ' + str(cluster) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

In [42]:
map_clusters

In case the interactive map does not render for you (for example, in Chrome or an outdated browser), here is a static image of it:

![title](toronto.jpg)