## Segmenting and Clustering for Toronto
By Manuel Aaron Cruz

Following along with the <a href="https://labs.cognitiveclass.ai/tools/jupyterlab/lab/tree/labs/DP0701EN/DP0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb" > Segmenting and Clustering Neighborhoods in New York City </a> lab, I will pull data from multiple sources and analyze it by segmenting and clustering Toronto neighborhoods.

In [1]:
#Imports needed for the project

import pandas as pd
import numpy as np

# urllib for pulling the Wikipedia page, requests for calling the FourSquare API
import urllib.request
import requests

# Beautiful soup for reading and parsing HTML pages

from bs4 import BeautifulSoup as bs

# For clustering the neighborhoods
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# For mapping
import folium 

## Step 1: Gather Data From Wikipedia and CSV file

Following the lab, we will use urllib and Beautiful Soup to fetch and parse the Wikipedia page.

In [2]:
# Begin by creating a request for the URL and pass it through to Beautiful Soup
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
page = urllib.request.urlopen(url)

# Name the parser explicitly so Beautiful Soup does not have to guess one
soup = bs(page, 'html.parser')

In [3]:
pcTable = soup.find('table', 'wikitable sortable')

postCodes = []
boroughs = []
neighbors = []

# For each row in the Wikipedia table, keep rows with exactly three cells
# and merge neighborhoods that share a postal code:
for row in pcTable.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 3:
        post = str(cells[0].find(text=True)).replace('\n', '')
        borough = str(cells[1].find(text=True)).replace('\n', '')
        neigh = str(cells[2].find(text=True)).replace('\n', '')
        
        if borough != 'Not assigned':
            if post in postCodes:
                neighbors[postCodes.index(post)] += str(", " + neigh)
            else:
                postCodes.append(post)
                boroughs.append(borough)
                neighbors.append(neigh)
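
The three parallel lists and the `postCodes.index(post)` lookup above can also be expressed with a dictionary keyed by postal code. The following is a minimal sketch, not the code that produced the outputs in this notebook. The function name `group_rows` and the sample rows are hypothetical and stand in for the `(postal code, borough, neighborhood)` triples read from the table.

```python
def group_rows(rows):
    """Merge neighborhoods that share a postal code, skipping unassigned boroughs.

    `rows` is an iterable of (postal_code, borough, neighborhood) tuples.
    Returns a list of (postal_code, borough, "n1, n2, ...") tuples in
    first-seen order.
    """
    grouped = {}  # postal_code -> (borough, [neighborhoods]); dicts keep insertion order
    for post, borough, neigh in rows:
        if borough == 'Not assigned':
            continue
        if post not in grouped:
            grouped[post] = (borough, [])
        grouped[post][1].append(neigh)
    return [(post, borough, ', '.join(names))
            for post, (borough, names) in grouped.items()]


# Hypothetical sample rows for illustration only
sample_rows = [
    ('M1A', 'Not assigned', 'Not assigned'),
    ('M3A', 'North York', 'Parkwoods'),
    ('M6A', 'North York', 'Lawrence Heights'),
    ('M6A', 'North York', 'Lawrence Manor'),
]
grouped_rows = group_rows(sample_rows)
print(grouped_rows)
```

A dictionary lookup avoids rescanning the list for every row, although for a table of this size the difference is negligible.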


In [4]:

cols = ['PostalCode', 'Borough', 'Neighborhood']
df = pd.DataFrame(columns = cols )
df['PostalCode'] = postCodes
df['Borough'] = boroughs
df['Neighborhood'] = neighbors
print(df.shape)
df.head()

(103, 3)


Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Not assigned


Now that we have the postal code and neighborhood data, we need to import the CSV file from the submission page.
Note - I couldn't get the geocoder library working properly, so per the course documentation, I downloaded the geospatial data from <a href=http://cocl.us/Geospatial_data> http://cocl.us/Geospatial_data </a>

In [5]:
geoCoordDF = pd.read_csv('Geospatial_Coordinates.csv')
geoCoordDF.rename(columns = {'Postal Code':'PostalCode'}, inplace =True)
print( geoCoordDF.shape)
geoCoordDF.head()

(103, 3)


Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


With both tables created, we join them to create the final dataset

In [6]:
mergeDF = pd.merge(df, geoCoordDF, on = 'PostalCode', how = 'inner')
print(mergeDF.shape)
mergeDF.head()

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Not assigned,43.662301,-79.389494
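
Both input tables have 103 rows and the inner join also returns 103, so every postal code found a match. One way to check this kind of join explicitly is the `indicator=True` option of `pd.merge` on an outer join. The sketch below uses small toy frames (`left`, `right`) rather than the notebook data; the postal code `M9Z` and the borough `Example` are made up.

```python
import pandas as pd

# Toy frames for illustration; 'M9Z' is a made-up code with no coordinates
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Example']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# An outer merge with indicator=True adds a '_merge' column that records
# whether each key was found in 'both', 'left_only', or 'right_only'
check = pd.merge(left, right, on='PostalCode', how='outer', indicator=True)
unmatched = check.loc[check['_merge'] != 'both', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would confirm that the inner join dropped nothing.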


## Step 2: Setting Up Folium
I wanted to see the initial layout of just Toronto proper.
I will use folium to mark the latitude and longitude of each point in the coordinates data, as a check that we are heading in the right direction.

In [7]:
# Limit data to only Boroughs containing the word "Toronto"
torontoData = mergeDF[mergeDF['Borough'].str.contains('Toronto')].reset_index(drop=True)
print(torontoData.shape)
torontoData.head()

# This leaves us with 38 Post Codes

(38, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937
2,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
3,M4E,East Toronto,The Beaches,43.676357,-79.293031
4,M5E,Downtown Toronto,Berczy Park,43.644771,-79.373306


In [8]:
# Create a basic map with a marker for each of the 38 post codes
homeMap = folium.Map(location = [43.6487, -79.38544], zoom_start = 11)

for index, row in torontoData.iterrows():
    coords = [float(row['Latitude']),float(row['Longitude'])]
    folium.CircleMarker(coords, popup=row['PostalCode'], radius = 5, color ='black' , fill = True, fill_color='red').add_to(homeMap)

homeMap


It seems that many of the post codes are clustered geographically right in Toronto proper.
The densest area of post codes all share the "M5X" naming convention (e.g. M5B, M5G, etc.)
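
The visual impression can be checked by tallying codes by prefix. This is a minimal sketch that uses only the five codes visible in the `torontoData.head()` output above. On the full data one could pass `torontoData['PostalCode']` instead, which is assumed usage and was not run here.

```python
from collections import Counter

# The five postal codes shown in the torontoData.head() output
codes = ['M5A', 'M5B', 'M5C', 'M4E', 'M5E']

# Count codes by their first two characters (e.g. 'M5')
prefix_counts = Counter(code[:2] for code in codes)
print(prefix_counts.most_common())
```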


## Step 3: Exploring nearby Neighborhoods
Using our own client ID and secret, we will run API requests to FourSquare for all recommended venues near each neighborhood.

#### Note: Parts of code were adapted from following the "Segmenting and Clustering Neighborhoods in New York City" lab from the following link:
https://labs.cognitiveclass.ai/tools/jupyterlab/lab/tree/labs/DP0701EN/DP0701EN-3-3-2-Neighborhoods-New-York-py-v1.0.ipynb

In [21]:
# Note - Client ID and Secret were removed for security reasons
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

In [11]:
# Function getNearbyVenues
# This function creates and executes FourSquare API explore requests
#  to get recommended nearby places. The response is parsed and returned, providing
#  each venue's name, location, and category
#

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit=500'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
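
The list comprehension in `getNearbyVenues` assumes that every response contains `groups` and that every venue has at least one category. Below is a hedged sketch of a more defensive parser. `parse_items` and the `fake_response` dictionary are hypothetical. The dictionary only mirrors the keys that the function above reads and is not a full FourSquare response.

```python
def parse_items(payload, name, lat, lng):
    """Extract (neighborhood, lat, lng, venue, venue lat, venue lng, category)
    rows from a decoded explore response, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # Fall back to None rather than raising IndexError on an empty list
        category = categories[0].get('name') if categories else None
        rows.append((name, lat, lng, venue.get('name'),
                     location.get('lat'), location.get('lng'), category))
    return rows


# Made-up payload with the same nesting that getNearbyVenues reads
fake_response = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Tandem Coffee',
               'location': {'lat': 43.653559, 'lng': -79.361809},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.65, 'lng': -79.36},
               'categories': []}},
]}]}}

rows = parse_items(fake_response, 'Harbourfront', 43.65426, -79.360636)
empty = parse_items({}, 'Harbourfront', 43.65426, -79.360636)
print(len(rows), len(empty))
```

Returning an empty list for a malformed response lets the loop over neighborhoods continue instead of stopping at the first failed request.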

In [12]:
# Run the getNearbyVenues on each neighborhood of toronto

torontoVenues = getNearbyVenues(names=torontoData['Neighborhood'],
                                   latitudes=torontoData['Latitude'],
                                   longitudes=torontoData['Longitude']
                                  )


Harbourfront
Ryerson, Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide, King, Richmond
Dovercourt Village, Dufferin
Harbourfront East, Toronto Islands, Union Station
Little Portugal, Trinity
The Danforth West, Riverdale
Design Exchange, Toronto Dominion Centre
Brockton, Exhibition Place, Parkdale Village
The Beaches West, India Bazaar
Commerce Court, Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North, Forest Hill West
High Park, The Junction South
North Toronto West
The Annex, North Midtown, Yorkville
Parkdale, Roncesvalles
Davisville
Harbord, University of Toronto
Runnymede, Swansea
Moore Park, Summerhill East
Chinatown, Grange Park, Kensington Market
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown, St. James Town
First Canadian P

Let's take a look at how many venues we have

In [13]:
print(torontoVenues.shape)
torontoVenues.head()

(1685, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653191,-79.357947,Gym / Fitness Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Morning Glory Cafe,43.653947,-79.361149,Breakfast Spot


Now we will use one-hot encoding to flag each venue's category, then calculate the frequency of occurrence of each category per neighborhood

In [14]:
torontoOneHot = pd.get_dummies(torontoVenues[['Venue Category']], prefix="", prefix_sep="")

torontoOneHot['Neighborhood'] = torontoVenues['Neighborhood']

torontoGrouped = torontoOneHot.groupby('Neighborhood').mean().reset_index()
print(torontoGrouped.shape)
torontoGrouped.head()

(38, 233)


Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wings Joint,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0625,0.0625,0.0625,0.0625,0.125,0.125,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
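
The reason `groupby(...).mean()` yields frequencies is that each one-hot column holds only 0s and 1s, so its mean within a neighborhood is the share of that neighborhood's venues in the category. A small worked example on toy data follows; the neighborhood names `A` and `B` and the venue list are illustrative only.

```python
import pandas as pd

# Toy venue list; neighborhood and category values are illustrative
venues = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Coffee Shop', 'Coffee Shop', 'Bakery', 'Spa',
                       'Bakery', 'Bakery'],
})

# One indicator column per category, one row per venue
one_hot = pd.get_dummies(venues['Venue Category'], dtype=float)
one_hot['Neighborhood'] = venues['Neighborhood']

# The mean of 0/1 indicators is the share of venues in each category
freq = one_hot.groupby('Neighborhood').mean()
print(freq)
```

Each row of `freq` sums to 1, which is a quick sanity check that can also be applied to `torontoGrouped`.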


Using the frequency table, we can now find the most frequent venue categories for each post code

In [15]:
num_top_venues = 5

for hood in torontoGrouped['Neighborhood']:
    print("----"+hood+"----")
    temp = torontoGrouped[torontoGrouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
             venue  freq
0      Coffee Shop  0.07
1             Café  0.05
2              Bar  0.04
3       Steakhouse  0.04
4  Thai Restaurant  0.04


----Berczy Park----
                venue  freq
0         Coffee Shop  0.07
1              Bakery  0.05
2  Seafood Restaurant  0.04
3      Farmers Market  0.04
4        Cocktail Bar  0.04


----Brockton, Exhibition Place, Parkdale Village----
                    venue  freq
0          Breakfast Spot  0.10
1             Coffee Shop  0.10
2                    Café  0.10
3      Italian Restaurant  0.05
4  Furniture / Home Store  0.05


----Business Reply Mail Processing Centre 969 Eastern----
           venue  freq
0     Comic Shop  0.06
1  Auto Workshop  0.06
2     Skate Park  0.06
3            Spa  0.06
4  Burrito Place  0.06


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
              venue  freq
0   Airport Service  0.12
1  Airport Te

We will now turn this data into a dataframe, allowing us to run clustering analysis:

In [16]:
# Function: return_most_common_venues
#  This helper function takes a single row of the grouped Toronto data
#  and returns the num_top_venues most frequent venue categories (i.e. if num_top_venues = 5,
#  it returns the 5 most frequently occurring categories)

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [17]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = torontoGrouped['Neighborhood']

for ind in np.arange(torontoGrouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(torontoGrouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant,Asian Restaurant,Bakery,Restaurant,Cosmetics Shop,American Restaurant
1,Berczy Park,Coffee Shop,Bakery,Seafood Restaurant,Cheese Shop,Beer Bar,Cocktail Bar,Farmers Market,Café,Steakhouse,Italian Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Breakfast Spot,Gym,Intersection,Performing Arts Venue,Grocery Store,Pet Store,Climbing Gym,Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Light Rail Station,Garden Center,Smoke Shop,Brewery,Farmers Market,Fast Food Restaurant,Spa,Burrito Place,Restaurant,Auto Workshop
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Terminal,Harbor / Marina,Bar,Coffee Shop,Plane,Rental Car Location,Sculpture Garden,Boutique,Boat or Ferry
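
The `indicators` list covers 1st to 3rd and falls back to `th` for everything else. That is sufficient for `num_top_venues = 10`, but it would produce labels such as `21th` for longer lists. A small helper that covers the general English rule might look like the sketch below; the name `ordinal` is hypothetical.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    # 11, 12 and 13 (and 111, 112, ...) take 'th' despite ending in 1, 2, 3
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, 11)]
print(columns[1], '|', columns[-1])
```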


In [18]:

# set number of clusters
kclusters = 7

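# Drop the label column by keyword; the positional axis argument is not accepted by recent pandas versions
torontoGroupedCluster = torontoGrouped.drop(columns='Neighborhood')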

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(torontoGroupedCluster)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

torontoMerged = torontoData

# join the cluster labels and top venues onto torontoData, which holds latitude/longitude for each neighborhood
torontoMerged = torontoMerged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

# create map
map_clusters = folium.Map(location=[43.6487, -79.38544], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(torontoMerged['Latitude'], torontoMerged['Longitude'], torontoMerged['Neighborhood'], torontoMerged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### General analysis:
After trying multiple values of K, K = 7 seemed to generate some form of segmentation.

With every K < 7, the majority of the neighborhoods fell within a single cluster. The image below shows an attempted clustering with a lower K value.

With K = 7, we see more diverse clusters, with two major clusters forming closer to the Lake Ontario waterfront. I wanted to take a look at these two clusters to examine what properties they share.


See the following for K = 5:
![K Equals 5](Cluster_With_5.png)
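
The observation that lower K values put most neighborhoods in one cluster can be quantified by the share of points in the largest cluster. The sketch below defines a hypothetical helper, `largest_cluster_share`, and applies it to two made-up label arrays. In the notebook one would pass `kmeans.labels_` from a model fitted for each K, which is assumed usage and was not run here.

```python
import numpy as np


def largest_cluster_share(labels):
    """Fraction of points assigned to the most populated cluster."""
    counts = np.bincount(np.asarray(labels))
    return counts.max() / counts.sum()


# Made-up label arrays for illustration, not actual k-means output
dominated = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]   # one cluster holds 7 of 10 points
balanced = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]    # largest cluster holds 3 of 10

print(largest_cluster_share(dominated), largest_cluster_share(balanced))
```

A share close to 1 indicates that the clustering is dominated by a single group, which matches what the lower K values showed on the map.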

In [19]:
# Cluster 0:
clusterZero = torontoMerged.loc[torontoMerged['Cluster Labels'] == 0, torontoMerged.columns[[1] + list(range(5, torontoMerged.shape[1]))]]
clusterZero

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Downtown Toronto,0,Grocery Store,Café,Park,Restaurant,Convenience Store,Baby Store,Candy Store,Coffee Shop,Nightclub,Italian Restaurant
8,West Toronto,0,Supermarket,Pharmacy,Bakery,Gym / Fitness Center,Music Venue,Middle Eastern Restaurant,Café,Brewery,Bar,Bank
16,East Toronto,0,Café,Coffee Shop,Italian Restaurant,American Restaurant,Bakery,Park,Seafood Restaurant,Stationery Store,Bar,Coworking Space
19,Central Toronto,0,Gym,Hotel,Pizza Place,Convenience Store,Clothing Store,Sandwich Place,Breakfast Spot,Food & Drink Shop,Park,General Entertainment
21,West Toronto,0,Grocery Store,Café,Bar,Mexican Restaurant,Thai Restaurant,Discount Store,Fast Food Restaurant,Park,Bakery,Diner
26,Downtown Toronto,0,Café,Italian Restaurant,Bar,Bakery,Sandwich Place,Bookstore,Restaurant,Japanese Restaurant,Pub,Poutine Place
29,Downtown Toronto,0,Café,Bar,Dumpling Restaurant,Chinese Restaurant,Coffee Shop,Mexican Restaurant,Vietnamese Restaurant,Bakery,Gaming Cafe,Grocery Store
37,East Toronto,0,Light Rail Station,Garden Center,Smoke Shop,Brewery,Farmers Market,Fast Food Restaurant,Spa,Burrito Place,Restaurant,Auto Workshop


In [20]:
# Cluster 5:
clusterFive = torontoMerged.loc[torontoMerged['Cluster Labels'] == 5, torontoMerged.columns[[1] + list(range(5, torontoMerged.shape[1]))]]
clusterFive

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,5,Coffee Shop,Bakery,Pub,Park,Mexican Restaurant,Breakfast Spot,Restaurant,Café,Theater,Yoga Studio
1,Downtown Toronto,5,Coffee Shop,Clothing Store,Middle Eastern Restaurant,Fast Food Restaurant,Café,Cosmetics Shop,Bakery,Restaurant,Bubble Tea Shop,Pizza Place
2,Downtown Toronto,5,Café,Coffee Shop,Restaurant,Hotel,Italian Restaurant,Clothing Store,Cosmetics Shop,Beer Bar,Bakery,Diner
4,Downtown Toronto,5,Coffee Shop,Bakery,Seafood Restaurant,Cheese Shop,Beer Bar,Cocktail Bar,Farmers Market,Café,Steakhouse,Italian Restaurant
5,Downtown Toronto,5,Coffee Shop,Italian Restaurant,Sandwich Place,Chinese Restaurant,Burger Joint,Café,Ice Cream Shop,Bubble Tea Shop,Bar,Spa
7,Downtown Toronto,5,Coffee Shop,Café,Bar,Steakhouse,Thai Restaurant,Asian Restaurant,Bakery,Restaurant,Cosmetics Shop,American Restaurant
9,Downtown Toronto,5,Coffee Shop,Hotel,Aquarium,Café,Fried Chicken Joint,Italian Restaurant,Restaurant,Brewery,Scenic Lookout,History Museum
10,West Toronto,5,Bar,Coffee Shop,Restaurant,Men's Store,Asian Restaurant,Café,Bakery,Vietnamese Restaurant,Pizza Place,New American Restaurant
11,East Toronto,5,Greek Restaurant,Coffee Shop,Ice Cream Shop,Italian Restaurant,Restaurant,Bookstore,Furniture / Home Store,Indian Restaurant,Fruit & Vegetable Store,Juice Bar
12,Downtown Toronto,5,Coffee Shop,Café,Hotel,American Restaurant,Bar,Restaurant,Italian Restaurant,Gastropub,Seafood Restaurant,Steakhouse


##  Distinguishing Cluster Types
Cluster 5 seems to be centered on the heavily populated downtown area, and leans heavily towards Coffee Shops and Cafés appearing among the top 2 venues. In the table above, 'Coffee Shop' is in the top 2 for all ten Cluster 5 rows.

Cluster 0, while having similar 'Café' appearances near the top, does not have as high a concentration of 'Coffee Shops': only one of its eight rows has 'Coffee Shop' in the top 2.
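
The comparison between the two clusters can be made concrete by counting how often 'Coffee Shop' appears in the top 2 columns. The sketch below rebuilds just the first two venue columns from the Cluster 0 and Cluster 5 tables shown above; the frame name `top2` is hypothetical. With the notebook's data the same steps could be applied to `torontoMerged` directly, which is assumed usage and was not run here.

```python
import pandas as pd

# First two venue columns copied from the Cluster 0 and Cluster 5 tables
top2 = pd.DataFrame({
    'Cluster Labels': [0] * 8 + [5] * 10,
    '1st Most Common Venue': [
        'Grocery Store', 'Supermarket', 'Café', 'Gym', 'Grocery Store',
        'Café', 'Café', 'Light Rail Station',
        'Coffee Shop', 'Coffee Shop', 'Café', 'Coffee Shop', 'Coffee Shop',
        'Coffee Shop', 'Coffee Shop', 'Bar', 'Greek Restaurant', 'Coffee Shop'],
    '2nd Most Common Venue': [
        'Café', 'Pharmacy', 'Coffee Shop', 'Hotel', 'Café',
        'Italian Restaurant', 'Bar', 'Garden Center',
        'Bakery', 'Clothing Store', 'Coffee Shop', 'Bakery', 'Italian Restaurant',
        'Café', 'Hotel', 'Coffee Shop', 'Coffee Shop', 'Café'],
})

venue_cols = ['1st Most Common Venue', '2nd Most Common Venue']
# True where 'Coffee Shop' appears in either of the top two columns
has_coffee = top2[venue_cols].eq('Coffee Shop').any(axis=1)
# Share of rows per cluster with 'Coffee Shop' in the top 2
share = has_coffee.groupby(top2['Cluster Labels']).mean()
print(share)
```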

