# Segmenting and Clustering Neighborhoods in Toronto

In this notebook, I will explore and cluster neighborhoods in Toronto.

### Table of Contents
* [Part 1: Creating the dataframe from the Wikipedia page](#scrape)
* [Part 2: Adding latitude and longitude to our existing dataframe](#latlong)
* [Part 3: Explore and cluster neighborhoods in Toronto](#cluster)

### Part 1: Creating the dataframe from the Wikipedia page <a id='scrape'></a>

Let us first import the first couple of dependencies we will need.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np

Using pandas's read_html, I will read the postal code table into a DataFrame and arrange it how I would like.

In [2]:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M', skiprows=1)
df = df[0] # read_html brings in a list of tables, we only want the first one

In [3]:
# assign column names to the dataframe
col = ['PostalCode', 'Borough', 'Neighborhood']
df.columns = col
df.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [4]:
# drop rows where the Borough is 'Not assigned'
df1 = df[df['Borough'] != 'Not assigned'].reset_index(drop=True)
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [5]:
# where the Neighborhood is 'Not assigned', assign the Borough name
df1['Neighborhood'] = np.where(df1['Neighborhood'] == 'Not assigned', 
                               df1['Borough'], df1['Neighborhood'])
df1.head()

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M5A,Downtown Toronto,Regent Park
4,M6A,North York,Lawrence Heights


In [6]:
# where a postal code has more than 1 neighborhood name, aggregate to 1 row
df_final = df1.groupby(['PostalCode', 'Borough'], sort=False).agg(lambda x: ', '.join(x)).reset_index()
df_final.head(10)

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Harbourfront, Regent Park"
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Queen's Park,Queen's Park
5,M9A,Etobicoke,Islington Avenue
6,M1B,Scarborough,"Rouge, Malvern"
7,M3B,North York,Don Mills North
8,M4B,East York,"Woodbine Gardens, Parkview Hill"
9,M5B,Downtown Toronto,"Ryerson, Garden District"
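
To see the three cleanup steps in isolation, here is a minimal sketch on a hypothetical four-row table. The rows and the names `raw`, `clean`, and `toy_final` are made up for illustration; only the logic mirrors the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the raw postal code table
raw = pd.DataFrame({
    'PostalCode':   ['M1A', 'M5A', 'M5A', 'M7A'],
    'Borough':      ['Not assigned', 'Downtown Toronto', 'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Harbourfront', 'Regent Park', 'Not assigned'],
})

# 1. drop rows whose Borough is 'Not assigned'
clean = raw[raw['Borough'] != 'Not assigned'].reset_index(drop=True)

# 2. where the Neighborhood is 'Not assigned', fall back to the Borough name
clean['Neighborhood'] = np.where(clean['Neighborhood'] == 'Not assigned',
                                 clean['Borough'], clean['Neighborhood'])

# 3. collapse to one row per postal code, joining neighborhood names
toy_final = (clean.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
                  .agg(', '.join)
                  .reset_index())
print(toy_final)
```

The unassigned row disappears, M7A picks up its borough name, and the two M5A rows become a single "Harbourfront, Regent Park" row.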


Now that we have the completed first iteration of our dataframe, let's see how many postal codes we will be looking at.

In [73]:
print("There are {} Postal Codes we will be working with.".format(df_final.shape[0]))

There are 103 Postal Codes we will be working with.


### Part 2: Adding latitude and longitude to our existing dataframe<a id='latlong'></a>

For this exercise, we were supposed to use the geocoder package to find the latitude and longitude of each postal code. However, many students had trouble running it (as we were warned). I also gave the geopy package a shot, but the codes on the Wikipedia page are only the first three characters of a full six-character Canadian postal code, so geopy returned locations in Europe instead.

Fortunately, the course instructor supplied a CSV file with the latitudes and longitudes for the postal codes, so we will read that in and merge it with our existing dataframe.

In [74]:
# It is important to name our first column 'PostalCode', as the next code cell makes apparent
names = ['PostalCode', 'Latitude', 'Longitude']
coord = pd.read_csv('Data_Files/GeoSpatial_Coordinates.csv', 
                    names=names, header=0)
coord.head()

Unnamed: 0,PostalCode,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [9]:
# In order to merge dataframes, our 'on' column must have same name
df_complete = pd.merge(df_final, coord, on='PostalCode')
df_complete.head(12)

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Harbourfront, Regent Park",43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Queen's Park,Queen's Park,43.662301,-79.389494
5,M9A,Etobicoke,Islington Avenue,43.667856,-79.532242
6,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
9,M5B,Downtown Toronto,"Ryerson, Garden District",43.657162,-79.378937


This is the dataframe we will be using to cluster neighborhoods.
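
One thing worth knowing about the merge above: `pd.merge` defaults to an inner join, so a postal code missing from the coordinates file would be dropped without any warning. Below is a small sketch of how one might check for that. The frames `left` and `right` are hypothetical stand-ins for `df_final` and `coord`, and 'M9Z' is a made-up code that is deliberately missing from the coordinates.

```python
import pandas as pd

# Hypothetical stand-ins for df_final and coord
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M9Z'],
                     'Borough': ['North York', 'North York', 'Made-up Borough']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.753259, 43.725882],
                      'Longitude': [-79.329656, -79.315572]})

# A left join keeps every row of `left`; indicator=True adds a '_merge' column
# recording where each row came from, and validate checks the keys are unique
checked = pd.merge(left, right, on='PostalCode', how='left',
                   validate='one_to_one', indicator=True)

# Postal codes that found no coordinates
unmatched = checked.loc[checked['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

An empty `unmatched` list would mean every postal code received coordinates.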

### Part 3: Explore and cluster neighborhoods in Toronto <a id='cluster'></a>

Let us add some additional dependencies that will help us to cluster and visualize neighborhoods.

In [10]:
import requests
import json
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0

from geopy.geocoders import Nominatim

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

import folium

print('Libraries imported.')

Libraries imported.


We weren't able to use geopy earlier for the neighborhood latitudes and longitudes, but we can still use it to get the coordinates of Toronto itself.

In [11]:
# locate Toronto's lat/long
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent='toronto_explorer') # Custom name user agent
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.653963, -79.387207


Create a map of Toronto with neighborhoods superimposed.

In [12]:
# initialize map object using lat/long values
map_toronto = folium.Map(location=[latitude,longitude], zoom_start=10)

# add a marker for each neighborhood
for lat, lng, borough, neighborhood in zip(df_complete['Latitude'], df_complete['Longitude'],
                                           df_complete['Borough'], df_complete['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#32cd3d',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)

map_toronto

To choose a subset of our dataframe, let us see which boroughs contain the most rows.

In [13]:
df_complete.groupby('Borough')['Neighborhood'].count().sort_values(ascending=False)

Borough
North York          24
Downtown Toronto    18
Scarborough         17
Etobicoke           12
Central Toronto      9
West Toronto         6
York                 5
East York            5
East Toronto         5
Queen's Park         1
Mississauga          1
Name: Neighborhood, dtype: int64

It looks like our two best groupings are the boroughs with either 'Toronto' or 'York' in the name.
I will arbitrarily choose the borough names that include 'York'.

In [14]:
york_data = df_complete[df_complete['Borough'].str.contains('York')]
york_data.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
7,M3B,North York,Don Mills North,43.745906,-79.352188
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937
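
As a quick check on what `str.contains('York')` actually selects, the sketch below applies the same substring filter to the borough counts printed earlier. The counts are copied from that output; the names `borough_counts` and `matched` are only for illustration.

```python
import pandas as pd

# Borough counts copied from the groupby output above
borough_counts = pd.Series({
    'North York': 24, 'Downtown Toronto': 18, 'Scarborough': 17,
    'Etobicoke': 12, 'Central Toronto': 9, 'West Toronto': 6,
    'York': 5, 'East York': 5, 'East Toronto': 5,
    "Queen's Park": 1, 'Mississauga': 1,
})

# Plain substring match on the borough name
york_mask = borough_counts.index.str.contains('York', regex=False)
matched = borough_counts[york_mask]

print(matched)
print('Total rows selected:', matched.sum())
```

The filter picks up North York, York, and East York, for 34 rows in total.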


We will now update our map.

In [71]:
# create Toronto map
map_york = folium.Map(location=[latitude,longitude], zoom_start=10)

# add a marker for each neighborhood
for lat, lng, borough, neighborhood in zip(york_data['Latitude'], york_data['Longitude'],
                                           york_data['Borough'], york_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#32cd3d',
        fill_opacity=0.7,
        parse_html=False).add_to(map_york)

map_york

Now that we have our subset data, it's time to use the Foursquare API to segment the neighborhoods.

We will explore the neighborhoods in the York boroughs with this function, borrowed from the Foursquare lab:

In [17]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000, LIMIT=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        # create the API request URL (CLIENT_ID, CLIENT_SECRET, and VERSION must already be defined)
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
        
    return(nearby_venues)
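
The function above indexes straight into the response with `["response"]['groups'][0]['items']` and `['categories'][0]['name']`, so it raises a `KeyError` or `IndexError` whenever one of those pieces is missing. Here is a sketch of a more forgiving parsing step. `extract_venue_rows` is a hypothetical helper, and `fake_payload` is hand-written in the shape the function above expects; it is not real API output.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into row tuples (hypothetical helper)."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        # fall back to a placeholder when a venue has no category
        categories = venue.get('categories') or [{}]
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories[0].get('name', 'Uncategorized')))
    return rows

# Hand-written payload for illustration only
fake_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.75, 'lng': -79.33},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Unlabelled Spot',
               'location': {'lat': 43.76, 'lng': -79.32},
               'categories': []}},
]}]}}

rows = extract_venue_rows(fake_payload, 'Parkwoods', 43.753259, -79.329656)
empty = extract_venue_rows({'response': {}}, 'Nowhere', 0.0, 0.0)
print(len(rows), len(empty))
```

A response without any groups simply yields an empty list, so one odd response does not stop the whole loop.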

Running the function on each neighborhood, we create a new dataframe called *york_venues*.

In [18]:
york_venues = getNearbyVenues(names=york_data['Neighborhood'],
                              latitudes=york_data['Latitude'],
                              longitudes=york_data['Longitude']
                             )

Parkwoods
Victoria Village
Lawrence Heights, Lawrence Manor
Don Mills North
Woodbine Gardens, Parkview Hill
Glencairn
Flemingdon Park, Don Mills South
Woodbine Heights
Humewood-Cedarvale
Caledonia-Fairbanks
Leaside
Hillcrest Village
Bathurst Manor, Downsview North, Wilson Heights
Thorncliffe Park
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Bayview Village
CFB Toronto, Downsview East
Silver Hills, York Mills
Downsview West
Downsview, North Park, Upwood Park
Humber Summit
Newtonbrook, Willowdale
Downsview Central
Bedford Park, Lawrence Manor East
Del Ray, Keelesdale, Mount Dennis, Silverthorn
Emery, Humberlea
Willowdale South
Downsview Northwest
The Junction North, Runnymede
Weston
York Mills West
Willowdale West


In [77]:
print('The size of the resulting dataframe is {}.'.format(york_venues.shape))
york_venues.head()

The size of the resulting dataframe is (988, 7).


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Allwyn's Bakery,43.75984,-79.324719,Caribbean Restaurant
1,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
2,Parkwoods,43.753259,-79.329656,Tim Hortons,43.760668,-79.326368,Café
3,Parkwoods,43.753259,-79.329656,A&W Canada,43.760643,-79.326865,Fast Food Restaurant
4,Parkwoods,43.753259,-79.329656,Food Basics,43.760865,-79.326015,Supermarket


Let's check how many venues were returned for each neighborhood.

In [20]:
york_venues.groupby('Neighborhood')['Venue'].count()

Neighborhood
Bathurst Manor, Downsview North, Wilson Heights     26
Bayview Village                                     12
Bedford Park, Lawrence Manor East                   41
CFB Toronto, Downsview East                         18
Caledonia-Fairbanks                                 23
Del Ray, Keelesdale, Mount Dennis, Silverthorn      17
Don Mills North                                     31
Downsview Central                                    4
Downsview Northwest                                 31
Downsview West                                       9
Downsview, North Park, Upwood Park                  12
East Toronto                                        96
Emery, Humberlea                                     7
Fairview, Henry Farm, Oriole                        43
Flemingdon Park, Don Mills South                    40
Glencairn                                           31
Hillcrest Village                                   22
Humber Summit                                       

This may not produce very meaningful results, since many neighborhoods have 6 or fewer venues, but let's see what we can determine.
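
One option I did not pursue here would be to set the sparse neighborhoods aside before clustering. The sketch below shows the idea on a few of the counts printed above; `MIN_VENUES` is a hypothetical cut-off, not a value taken from the lab.

```python
import pandas as pd

# A few venue counts copied from the output above
venue_counts = pd.Series({
    'Bayview Village': 12,
    'Downsview Central': 4,
    'Downsview West': 9,
    'East Toronto': 96,
    'Emery, Humberlea': 7,
})

MIN_VENUES = 10  # hypothetical threshold

# split the neighborhoods by whether they meet the threshold
keep = venue_counts[venue_counts >= MIN_VENUES].index.tolist()
sparse = venue_counts[venue_counts < MIN_VENUES].index.tolist()

print('keep:  ', keep)
print('sparse:', sparse)
```

With the full data, something like `york_venues[york_venues['Neighborhood'].isin(keep)]` would then restrict the venue dataframe to the neighborhoods that pass.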

In [21]:
print('There are {} unique categories.'.format(len(york_venues['Venue Category'].unique())))

There are 188 unique categories.


In [22]:
# one hot encoding
york_onehot = pd.get_dummies(york_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column to dataframe
york_onehot['Neighborhood'] = york_venues['Neighborhood']

# designate correct order of columns using list
reorder_columns = [york_onehot.columns[-1]] + list(york_onehot.columns[:-1])

# assigning new column order to our dataframe
york_onehot = york_onehot[reorder_columns]

york_onehot.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,...,Train Station,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Again, let's print the shape of our dataframe.

In [23]:
york_onehot.shape

(988, 189)

Now we will group the rows by neighborhood and take the mean occurrence of each venue category.

In [24]:
york_grouped = york_onehot.groupby('Neighborhood').mean().reset_index()
york_grouped.head()

Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,...,Train Station,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.02439,0.0,0.0
3,"CFB Toronto, Downsview East",0.0,0.0,0.055556,0.0,0.0,0.0,0.055556,0.0,0.0,...,0.0,0.111111,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0
4,Caledonia-Fairbanks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0
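
To make the one-hot and mean steps concrete, here is a toy version with two made-up neighborhoods ('A' and 'B') and three categories. The mean of a 0/1 column within a group is just the share of that neighborhood's venues in that category, so each row sums to 1.

```python
import pandas as pd

# Hypothetical venues: neighborhood A has 4 venues, B has 2
toy_venues = pd.DataFrame({
    'Neighborhood':   ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Park', 'Park', 'Café', 'Bank', 'Café', 'Café'],
})

# one 0/1 column per category
toy_onehot = pd.get_dummies(toy_venues['Venue Category'], dtype=float)
toy_onehot.insert(0, 'Neighborhood', toy_venues['Neighborhood'])

# mean of the 0/1 columns = share of venues per category
toy_grouped = toy_onehot.groupby('Neighborhood').mean().reset_index()
print(toy_grouped)
```

Neighborhood A ends up with 0.5 for Park and 0.25 each for Café and Bank, while B is 1.0 for Café. Note how coarse the shares are when a neighborhood has only a handful of venues.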


Printing each neighborhood with the top 5 most common venues:

In [25]:
num_top_venues = 5

for group in york_grouped['Neighborhood']:
    print("----"+group+"----")
    temp = york_grouped[york_grouped['Neighborhood'] == group].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bathurst Manor, Downsview North, Wilson Heights----
                venue  freq
0         Coffee Shop  0.08
1  Frozen Yogurt Shop  0.04
2   Convenience Store  0.04
3            Ski Area  0.04
4          Ski Chalet  0.04


----Bayview Village----
                  venue  freq
0   Japanese Restaurant  0.17
1                  Bank  0.17
2         Grocery Store  0.08
3    Chinese Restaurant  0.08
4  Fast Food Restaurant  0.08


----Bedford Park, Lawrence Manor East----
                     venue  freq
0       Italian Restaurant  0.07
1     Fast Food Restaurant  0.07
2              Coffee Shop  0.07
3                Juice Bar  0.02
4  Comfort Food Restaurant  0.02


----CFB Toronto, Downsview East----
                venue  freq
0  Turkish Restaurant  0.11
1         Coffee Shop  0.11
2                 Gym  0.06
3                Café  0.06
4   Other Repair Shop  0.06


----Caledonia-Fairbanks----
                venue  freq
0  Mexican Restaurant  0.09
1                Park  0.09
2       

This function, also from the Foursquare lab, sorts the venues in descending order.

In [26]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
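
Tied frequencies are common in this data (several categories share 0.04 in the printout above), and `sort_values` uses quicksort by default, which does not guarantee the order of tied values. The sketch below is one way to make the ranking reproducible, assuming an alphabetical tie-break is acceptable. `top_categories` is a hypothetical variant of the lab function, not part of the lab itself.

```python
import pandas as pd

def top_categories(row, num_top_venues):
    """Rank categories by frequency, breaking ties alphabetically (hypothetical variant)."""
    freqs = row.iloc[1:].astype(float)  # skip the Neighborhood entry
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ordered[:num_top_venues]]

# Made-up row in the same layout as a row of york_grouped
toy_row = pd.Series({'Neighborhood': 'Toy', 'Park': 0.25, 'Café': 0.5,
                     'Bank': 0.25, 'Zoo': 0.0})

print(top_categories(toy_row, 3))
```

Bank and Park are tied at 0.25, and the alphabetical rule always lists Bank first.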

Let's create the new dataframe and display the top 5 venues for each neighborhood.

In [57]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
        
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = york_grouped['Neighborhood']

for ind in np.arange(york_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(york_grouped.iloc[ind, :], num_top_venues)
    
neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,"Bathurst Manor, Downsview North, Wilson Heights",Coffee Shop,Community Center,Bank,Shopping Mall,Bridal Shop
1,Bayview Village,Bank,Japanese Restaurant,Chinese Restaurant,Skating Rink,Skate Park
2,"Bedford Park, Lawrence Manor East",Italian Restaurant,Coffee Shop,Fast Food Restaurant,Pizza Place,Restaurant
3,"CFB Toronto, Downsview East",Turkish Restaurant,Coffee Shop,Park,Sandwich Place,Gym
4,Caledonia-Fairbanks,Pharmacy,Pizza Place,Park,Mexican Restaurant,Cosmetics Shop


### Time to Cluster our Neighborhoods

We will run *k*-means to create 5 neighborhood clusters.

In [58]:
# set number of clusters
kclusters = 5

york_grouped_clustering = york_grouped.drop(columns='Neighborhood')

# create KMeans object
kmeans = KMeans(n_clusters=kclusters, random_state=1)

# fit data to k-means
kmeans.fit(york_grouped_clustering)

# check cluster labels
kmeans.labels_

array([4, 1, 4, 4, 1, 4, 4, 2, 4, 4, 4, 4, 0, 4, 4, 4, 4, 1, 4, 4, 4, 4,
       4, 1, 3, 4, 4, 4, 4, 4, 1, 4, 4, 4], dtype=int32)
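
Before looking at the clusters one by one, it helps to count how many neighborhoods landed in each. The sketch below does that with the label array copied from the output above; nothing is refit here.

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above
labels = np.array([4, 1, 4, 4, 1, 4, 4, 2, 4, 4, 4, 4, 0, 4, 4, 4, 4, 1, 4, 4, 4, 4,
                   4, 1, 3, 4, 4, 4, 4, 4, 1, 4, 4, 4])

# number of neighborhoods per cluster label 0..4
cluster_sizes = np.bincount(labels, minlength=5)

for label, size in enumerate(cluster_sizes.tolist()):
    print('label {}: {} neighborhoods'.format(label, size))
```

Of the 34 neighborhoods, 26 share label 4 and 5 share label 1, while labels 0, 2, and 3 each hold a single neighborhood.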

We'll insert our new cluster labels into our previous dataframe with the top 5 venues for each neighborhood.

In [29]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

york_merged = york_data

york_merged = york_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
                               
york_merged.reset_index(drop=True).head()                           

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Park,Convenience Store,Bus Stop,Pharmacy,Shopping Mall
1,M4A,North York,Victoria Village,43.725882,-79.315572,4,Park,Coffee Shop,Portuguese Restaurant,Sporting Goods Shop,Gym / Fitness Center
2,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,4,Furniture / Home Store,Coffee Shop,Fast Food Restaurant,Dessert Shop,Restaurant
3,M3B,North York,Don Mills North,43.745906,-79.352188,4,Japanese Restaurant,Coffee Shop,Pizza Place,Burger Joint,Supermarket
4,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,4,Construction & Landscaping,Fast Food Restaurant,Coffee Shop,Brewery,Pizza Place
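
The `join` above is a left join, so a neighborhood for which no venues came back would receive a `NaN` cluster label, and the label column would turn into floats. All 34 neighborhoods got a label in this run, so the check below is purely precautionary. The frames `areas` and `labelled` are hypothetical miniatures of `york_data` and `neighborhoods_venues_sorted`.

```python
import pandas as pd

# Hypothetical miniatures: neighborhood 'C' has no cluster label
areas = pd.DataFrame({'Neighborhood': ['A', 'B', 'C'],
                      'Latitude': [43.70, 43.71, 43.72]})
labelled = pd.DataFrame({'Neighborhood': ['A', 'B'],
                         'Cluster Labels': [4, 1]})

joined = areas.join(labelled.set_index('Neighborhood'), on='Neighborhood')

# neighborhoods that did not receive a label
missing = joined.loc[joined['Cluster Labels'].isna(), 'Neighborhood'].tolist()

# keep only labelled rows and restore integer labels for use as list indices
plottable = joined.dropna(subset=['Cluster Labels']).copy()
plottable['Cluster Labels'] = plottable['Cluster Labels'].astype(int)

print(missing)
print(plottable)
```

Integer labels matter for the map cell, which uses the label to index into a list of colors.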


Time to visualize the resulting clusters.

In [30]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(york_merged['Latitude'], york_merged['Longitude'],
                                  york_merged['Neighborhood'], york_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

### Examine Clusters

#### Cluster 1

In [38]:
cluster_1 = york_merged.loc[york_merged['Cluster Labels'] == 0]
cluster_1

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
57,M9M,North York,"Emery, Humberlea",43.724766,-79.532242,0,Convenience Store,Intersection,Storage Facility,Discount Store,Bakery


#### Cluster 2

In [39]:
cluster_2 = york_merged.loc[york_merged['Cluster Labels'] == 1]
cluster_2

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Park,Convenience Store,Bus Stop,Pharmacy,Shopping Mall
21,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512,1,Pharmacy,Pizza Place,Park,Mexican Restaurant,Cosmetics Shop
39,M2K,North York,Bayview Village,43.786947,-79.385975,1,Bank,Japanese Restaurant,Chinese Restaurant,Skating Rink,Skate Park
50,M9L,North York,Humber Summit,43.756303,-79.565963,1,Bank,Pizza Place,Pharmacy,Empanada Restaurant,Park
72,M2R,North York,Willowdale West,43.782736,-79.442259,1,Pharmacy,Convenience Store,Coffee Shop,Bus Line,Eastern European Restaurant


#### Cluster 3

In [40]:
cluster_3 = york_merged.loc[york_merged['Cluster Labels'] == 2]
cluster_3

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
53,M3M,North York,Downsview Central,43.728496,-79.495697,2,Vietnamese Restaurant,Restaurant,Baseball Field,Yoga Studio,Dog Run


#### Cluster 4

In [41]:
cluster_4 = york_merged.loc[york_merged['Cluster Labels'] == 3]
cluster_4

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
45,M2L,North York,"Silver Hills, York Mills",43.75749,-79.374714,3,Park,Pool,Dive Bar,Farmers Market,Falafel Restaurant


#### Cluster 5

In [43]:
cluster_5 = york_merged.loc[york_merged['Cluster Labels'] == 4]
cluster_5

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,M4A,North York,Victoria Village,43.725882,-79.315572,4,Park,Coffee Shop,Portuguese Restaurant,Sporting Goods Shop,Gym / Fitness Center
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,4,Furniture / Home Store,Coffee Shop,Fast Food Restaurant,Dessert Shop,Restaurant
7,M3B,North York,Don Mills North,43.745906,-79.352188,4,Japanese Restaurant,Coffee Shop,Pizza Place,Burger Joint,Supermarket
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,4,Construction & Landscaping,Fast Food Restaurant,Coffee Shop,Brewery,Pizza Place
10,M6B,North York,Glencairn,43.709577,-79.445073,4,Grocery Store,Fast Food Restaurant,Pizza Place,Coffee Shop,Park
13,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923,4,Gym,Restaurant,Asian Restaurant,Japanese Restaurant,Coffee Shop
14,M4C,East York,Woodbine Heights,43.695344,-79.318389,4,Coffee Shop,Park,Skating Rink,Pizza Place,Sandwich Place
16,M6C,York,Humewood-Cedarvale,43.693781,-79.428191,4,Grocery Store,Coffee Shop,Pizza Place,Italian Restaurant,Sushi Restaurant
23,M4G,East York,Leaside,43.70906,-79.363452,4,Coffee Shop,Electronics Store,Furniture / Home Store,Sporting Goods Shop,Brewery
27,M2H,North York,Hillcrest Village,43.803762,-79.363452,4,Pharmacy,Park,Coffee Shop,Korean Restaurant,Convenience Store


### Observations and Conclusion

As we can see, there are only 2 meaningful clusters, 2 and 5, though most of the neighborhoods do end up being in cluster 5. What happened?

After checking to see how many venues were in each neighborhood, we saw this Series:

In [61]:
york_venues.groupby('Neighborhood')['Venue'].count().sort_values(ascending=False)

Neighborhood
Willowdale South                                   100
East Toronto                                        96
Leaside                                             61
Lawrence Heights, Lawrence Manor                    47
Fairview, Henry Farm, Oriole                        43
Thorncliffe Park                                    42
The Junction North, Runnymede                       42
Bedford Park, Lawrence Manor East                   41
Flemingdon Park, Don Mills South                    40
Humewood-Cedarvale                                  35
Newtonbrook, Willowdale                             33
Don Mills North                                     31
Downsview Northwest                                 31
Glencairn                                           31
Woodbine Heights                                    30
Parkwoods                                           28
Bathurst Manor, Downsview North, Wilson Heights     26
Caledonia-Fairbanks                                 

The 3 neighborhoods that each ended up in a cluster of their own (Emery/Humberlea, Silver Hills/York Mills, and Downsview Central) had 7 or fewer venues, while most of our neighborhoods had over 20. This made it harder to place them in a bigger cluster: with 188 venue categories (the 189 columns seen below include the Neighborhood column), a handful of venues gives little opportunity to share categories with other neighborhoods. In the New York City clustering lab, most of the boroughs had 100 venues, so it was easier for more of them to have comparable traits.

In [63]:
print('york_grouped has a shape of {}'.format(york_grouped.shape))
york_grouped.head()

york_grouped has a shape of (34, 189)


Unnamed: 0,Neighborhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Dealership,BBQ Joint,...,Train Station,Turkish Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Bathurst Manor, Downsview North, Wilson Heights",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,0.0
1,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.02439,0.0,0.0
3,"CFB Toronto, Downsview East",0.0,0.0,0.055556,0.0,0.0,0.0,0.055556,0.0,0.0,...,0.0,0.111111,0.0,0.0,0.055556,0.0,0.0,0.0,0.0,0.0
4,Caledonia-Fairbanks,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0
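
A little arithmetic shows why a sparse neighborhood tends to sit far from everything else. k-means compares rows by Euclidean distance, and a row built from 4 venues can only contain shares that are multiples of 0.25. The vectors below are hypothetical; only the 188 category columns match the data.

```python
import numpy as np

n_categories = 188  # category columns in york_grouped (189 minus Neighborhood)

# Hypothetical sparse neighborhood: 4 venues in 4 different categories
sparse = np.zeros(n_categories)
sparse[:4] = 1 / 4

# Hypothetical dense neighborhoods: 40 venues, 2 in each of 20 categories
dense_a = np.zeros(n_categories)
dense_a[10:30] = 2 / 40
dense_b = np.zeros(n_categories)
dense_b[20:40] = 2 / 40  # shares 10 categories with dense_a

dist_dense = np.linalg.norm(dense_a - dense_b)   # sqrt(20 * 0.05**2)
dist_sparse = np.linalg.norm(sparse - dense_a)   # sqrt(4 * 0.25**2 + 20 * 0.05**2)

print('dense vs dense:  {:.3f}'.format(dist_dense))
print('sparse vs dense: {:.3f}'.format(dist_sparse))
```

The two dense neighborhoods are about 0.224 apart, while the sparse one is about 0.548 away from a dense one, so under these assumptions it is more likely to be left in a cluster by itself.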


Let's now observe our (hopefully) more meaningful clusters.

In [67]:
cluster_2.head()

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1,Park,Convenience Store,Bus Stop,Pharmacy,Shopping Mall
21,M6E,York,Caledonia-Fairbanks,43.689026,-79.453512,1,Pharmacy,Pizza Place,Park,Mexican Restaurant,Cosmetics Shop
39,M2K,North York,Bayview Village,43.786947,-79.385975,1,Bank,Japanese Restaurant,Chinese Restaurant,Skating Rink,Skate Park
50,M9L,North York,Humber Summit,43.756303,-79.565963,1,Bank,Pizza Place,Pharmacy,Empanada Restaurant,Park
72,M2R,North York,Willowdale West,43.782736,-79.442259,1,Pharmacy,Convenience Store,Coffee Shop,Bus Line,Eastern European Restaurant


Bank and Pharmacy seem to be the most common venues for this cluster, with Park not far behind.

In [70]:
cluster_5

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,M4A,North York,Victoria Village,43.725882,-79.315572,4,Park,Coffee Shop,Portuguese Restaurant,Sporting Goods Shop,Gym / Fitness Center
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763,4,Furniture / Home Store,Coffee Shop,Fast Food Restaurant,Dessert Shop,Restaurant
7,M3B,North York,Don Mills North,43.745906,-79.352188,4,Japanese Restaurant,Coffee Shop,Pizza Place,Burger Joint,Supermarket
8,M4B,East York,"Woodbine Gardens, Parkview Hill",43.706397,-79.309937,4,Construction & Landscaping,Fast Food Restaurant,Coffee Shop,Brewery,Pizza Place
10,M6B,North York,Glencairn,43.709577,-79.445073,4,Grocery Store,Fast Food Restaurant,Pizza Place,Coffee Shop,Park
13,M3C,North York,"Flemingdon Park, Don Mills South",43.7259,-79.340923,4,Gym,Restaurant,Asian Restaurant,Japanese Restaurant,Coffee Shop
14,M4C,East York,Woodbine Heights,43.695344,-79.318389,4,Coffee Shop,Park,Skating Rink,Pizza Place,Sandwich Place
16,M6C,York,Humewood-Cedarvale,43.693781,-79.428191,4,Grocery Store,Coffee Shop,Pizza Place,Italian Restaurant,Sushi Restaurant
23,M4G,East York,Leaside,43.70906,-79.363452,4,Coffee Shop,Electronics Store,Furniture / Home Store,Sporting Goods Shop,Brewery
27,M2H,North York,Hillcrest Village,43.803762,-79.363452,4,Pharmacy,Park,Coffee Shop,Korean Restaurant,Convenience Store


There is much more variability in the 1st most common venue column, but Coffee Shop appears in many of the 1st and 2nd most common venue columns.

## Closing thoughts

My clustering didn't go quite as well as I would have expected, but it's apparent that the lack of venues partially contributed to our poor clustering results. At the very least we were able to get some information from the two larger clusters.

I'm not sure that it would have been much better using the 'Toronto' neighborhoods, but perhaps if we had not subset our neighborhoods and used all of them, we could have had more even results. However, it was good to at least go through the exercise, and perhaps I will give it another shot using a different subset of the original data.