# Segmenting and Clustering Neighborhoods in Toronto
### Peer-graded Assignment for the course:<br/>*Applied Data Science Capstone (IBM Data Science Professional Certificate)*, Coursera/IBM.
**Author: Paw Hermansen, 2018, Oct. 20**


## Part 3: Explore and cluster the neighborhoods in Toronto

### Import Python Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import csv
import folium
import requests
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

## A: Select the neighborhoods and find their venues

### Load the Toronto neighborhoods with locations created in part1 and part2

In [2]:
toronto_data = pd.read_csv('data/toronto_neigborhoods.csv')

print(toronto_data.shape)
toronto_data.head()

(103, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476


### Select the neighborhoods to work on

For simplicity, and to limit the size of the dataset, I work only with boroughs whose name contains the word *Toronto*.

In [3]:
toronto_data = toronto_data[toronto_data['Borough'].str.contains('Toronto')]
toronto_data = toronto_data.sort_values(by=['Borough']).reset_index(drop=True)

print(toronto_data.shape)
toronto_data.head()

(38, 5)


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M5R,Central Toronto,"The Annex, North Midtown, Yorkville",43.67271,-79.405678
1,M5P,Central Toronto,"Forest Hill North, Forest Hill West",43.696948,-79.411307
2,M5N,Central Toronto,Roselawn,43.711695,-79.416936
3,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
4,M4P,Central Toronto,Davisville North,43.712751,-79.390197


In [4]:
print('Number of neighborhoods in each borough is:')
print(toronto_data['Borough'].value_counts())

print('Neighborhoods in all = ', toronto_data.shape[0])

Number of neighborhoods in each borough is:
Downtown Toronto    18
Central Toronto      9
West Toronto         6
East Toronto         5
Name: Borough, dtype: int64
Neighborhoods in all =  38


### Find the latitude and longitude values of Toronto

Because my attempts to find locations from addresses by calling one API or another were very unstable, I use the mean location of all my neighborhoods as the centre of the drawn maps.

In [5]:
latToronto = toronto_data['Latitude'].mean()
longToronto = toronto_data['Longitude'].mean()

print('The mean location of my Toronto neighborhoods is {}, {}.'.format(latToronto, longToronto))

The mean location of my Toronto neighborhoods is 43.66726218421052, -79.38988323421052.


### Show the centers of the neighborhoods on top of a Toronto map

Click each point for the borough and neighborhood names.

In [6]:
# Create map of Toronto using latitude and longitude values
mapToronto = folium.Map(location=[latToronto, longToronto], zoom_start=12)

# Add markers to map.
# Note that apparently the parameter 'parse_html' should not be used in CircleMarker in the current Folium v.0.6.0.
for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'],
                                           toronto_data['Longitude'],
                                           toronto_data['Borough'],
                                           toronto_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(mapToronto)  
    
mapToronto

### Define Foursquare Credentials and Version

I do not include my personal Foursquare credentials in this public notebook. Instead, I define them as environment variables on the machine where I execute the notebook and read them from the environment below.

In [7]:
# .get() returns None for a missing variable instead of raising KeyError,
# so the check below can report the problem with a clear message
CLIENT_ID = os.environ.get('FOURSQUARE_ID') # your Foursquare ID
CLIENT_SECRET = os.environ.get('FOURSQUARE_SECRET') # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

if CLIENT_ID and CLIENT_SECRET:
    print('Got your credentials!')
else:
    raise ValueError('Foursquare credentials are missing - should be set as environment variables')

Got your credentials!


### Define function to call the Foursquare API and find venues nearby a given location

The function is copied from the *Applied Data Science Capstone* course material with one addition of my own: if a venue has more than one category, the category names are joined instead of taking only the first one.

The parameters are:
* *names*: a list with the name of each location or neighborhood.
* *latitudes*: a list with the latitude of each location or neighborhood.
* *longitudes*: a list with the longitude of each location or neighborhood.
* *radius*: the maximal distance in meters from the location to search for venues. Defaults to 500 m.

Returns a dataframe with one row for each venue found within *radius* meters of one of the given locations.

In [8]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        LIMIT = 50  # max according to foursquare documentation
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            ', '.join([c['name'] for c in v['venue']['categories']])) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
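
To make the row-building step inside *getNearbyVenues* easier to follow, here is a minimal offline sketch. The helper name `extract_venue_rows` and the hand-written `sample_items` list are assumptions made for illustration only; the dictionary layout simply mirrors the keys that the function above reads from the Foursquare response, and no request is sent.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples.

    Hypothetical helper: mirrors the list comprehension in getNearbyVenues.
    """
    rows = []
    for item in items:
        venue = item['venue']
        # Join all category names so that no category is silently dropped
        categories = ', '.join(c['name'] for c in venue['categories'])
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     categories))
    return rows


# Hand-made stand-in for
# requests.get(url).json()['response']['groups'][0]['items']
sample_items = [
    {'venue': {'name': 'Example Cafe',
               'location': {'lat': 43.6751, 'lng': -79.4058},
               'categories': [{'name': 'Café'}]}},
    {'venue': {'name': 'Example Diner',
               'location': {'lat': 43.6746, 'lng': -79.4082},
               'categories': [{'name': 'Diner'}, {'name': 'Bakery'}]}},
]

sample_rows = extract_venue_rows('Example Hood', 43.6727, -79.4057, sample_items)
for row in sample_rows:
    print(row)
```

The second made-up venue has two categories, so its last field becomes a comma-separated string. This is the case that the later check for commas in *Venue Category* looks for.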

In [9]:
toronto_venues = getNearbyVenues(toronto_data['Neighborhood'], toronto_data['Latitude'], toronto_data['Longitude'])

print()
print('Total number of found venues for all neighborhoods: ', toronto_venues.shape[0])
toronto_venues.head()

The Annex, North Midtown, Yorkville
Forest Hill North, Forest Hill West
Roselawn
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Adelaide, King, Richmond
Commerce Court, Victoria Hotel
Design Exchange, Toronto Dominion Centre
Harbourfront East, Toronto Islands, Union Station
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Central Bay Street
Berczy Park
Harbourfront, Regent Park
Ryerson, Garden District
Harbord, University of Toronto
Church and Wellesley
Cabbagetown, St. James Town
Rosedale
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
St. James Town
Chinatown, Grange Park, Kensington Market
The Beaches
Studio District
The Beaches West, India Bazaar
The Danforth West, Riverdale
Business reply mail Processing Centre969 Eastern
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, E

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Ezra's Pound,43.675153,-79.405858,Café
1,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Roti Cuisine of India,43.674618,-79.408249,Indian Restaurant
2,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Bar Begonia,43.675093,-79.406406,French Restaurant
3,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Rose & Sons,43.675668,-79.403617,American Restaurant
4,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Madame Boeuf And Flea,43.67524,-79.40662,Burger Joint


### Save the dataframe with the neighborhoods venues

In [10]:
toronto_venues.to_csv('data/toronto_venues.csv', quoting=csv.QUOTE_ALL, index=False)

## B: Explore the venues

### There are no multiple Venue Categories

In [11]:
toronto_venues['Venue Category'].str.contains(',').sum()

0

In the earlier defined function *getNearbyVenues* I separate multiple categories in the *Venue Category* by a comma. However, as observed from the count of zero in the above cell, it turns out that none of the results from Foursquare actually has multiple categories.

### Venue names are not unique

In [12]:
toronto_venues['Venue'].nunique()

886

Earlier the total number of found venues for all neighborhoods was seen to be 1175, and now we observe that only 886 of them have different venue names.

There are two possible explanations: some venues are counted as belonging to two or more neighborhoods, or different venues share the same name, for example the branches of a chain of sandwich bars.

An example is *Subway*. First, here are all the *Subway* results returned by Foursquare. There are 9:

In [13]:
dfSubway = toronto_venues[toronto_venues['Venue'] == 'Subway']

print('Number of Subway rows: ', dfSubway.shape[0])
dfSubway

Number of Subway rows:  9


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
18,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Subway,43.674965,-79.406868,Sandwich Place
22,"The Annex, North Midtown, Yorkville",43.67271,-79.405678,Subway,43.675626,-79.410101,Sandwich Place
39,Davisville North,43.712751,-79.390197,Subway,43.708378,-79.390473,Sandwich Place
55,North Toronto West,43.715383,-79.405678,Subway,43.716818,-79.400136,Sandwich Place
88,Davisville,43.704324,-79.38879,Subway,43.701742,-79.3876,Sandwich Place
94,Davisville,43.704324,-79.38879,Subway,43.708378,-79.390473,Sandwich Place
650,"Cabbagetown, St. James Town",43.667967,-79.367675,Subway,43.665598,-79.36847,Sandwich Place
947,"The Beaches West, India Bazaar",43.668999,-79.315572,Subway,43.666238,-79.317019,Sandwich Place
1157,"Runnymede, Swansea",43.651571,-79.48445,Subway,43.649517,-79.483947,Sandwich Place


Second, we look only at the *Subway* results with distinct venue locations. There are eight distinct *Subway* locations; the *Subway* at (43.708378, -79.390473) is counted as a venue for both *Davisville* and *Davisville North*:

In [14]:
dfSubway.groupby(['Venue', 'Venue Latitude' , 'Venue Longitude']).count()

Venue,Venue Latitude,Venue Longitude,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue Category
Subway,43.649517,-79.483947,1,1,1,1
Subway,43.665598,-79.36847,1,1,1,1
Subway,43.666238,-79.317019,1,1,1,1
Subway,43.674965,-79.406868,1,1,1,1
Subway,43.675626,-79.410101,1,1,1,1
Subway,43.701742,-79.3876,1,1,1,1
Subway,43.708378,-79.390473,2,2,2,2
Subway,43.716818,-79.400136,1,1,1,1


Back to the full dataset returned for the neighborhoods in all boroughs with Toronto in their name. Earlier we found 886 different venue names, but now we know that some names occur at several locations and should therefore be considered different venues with respect to the neighborhoods where they are located.

When we consider not only the venue name but also its location, we find below that the number of different venue locations is 980.

This is still lower than the 1175 rows returned by Foursquare, which means that some venues are counted as belonging to two or more neighborhoods.

In [15]:
toronto_venue_locations = toronto_venues.groupby(['Venue', 'Venue Latitude' , 'Venue Longitude']).count()

print('Number of different venue locations: ', toronto_venue_locations.reset_index().shape[0])

Number of different venue locations:  980
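
The two effects discussed above (chains that share a name, and single venues returned for several neighborhoods) can be told apart with a few counts. The following is a minimal sketch on made-up data; the frame `toy_venues` and its values are assumptions for illustration and are not taken from the Foursquare results.

```python
import pandas as pd

# Toy data (made up): one chain with two branches, one of them within
# reach of two neighborhoods, plus a single-location venue.
toy_venues = pd.DataFrame({
    'Neighborhood': ['A', 'B', 'B', 'C'],
    'Venue': ['Subway', 'Subway', 'Subway', 'Corner Cafe'],
    'Venue Latitude': [43.70, 43.70, 43.71, 43.65],
    'Venue Longitude': [-79.39, -79.39, -79.40, -79.38],
})

location_key = ['Venue', 'Venue Latitude', 'Venue Longitude']

n_rows = len(toy_venues)
n_names = toy_venues['Venue'].nunique()
n_locations = len(toy_venues.drop_duplicates(subset=location_key))

# Locations that were returned for more than one neighborhood
hits_per_location = toy_venues.groupby(location_key)['Neighborhood'].nunique()
shared_locations = hits_per_location[hits_per_location > 1]

print('rows:', n_rows, 'names:', n_names, 'locations:', n_locations)
print(shared_locations)
```

In this toy frame the gap between names and locations comes from the chain, and the gap between locations and rows comes from the one branch that is shared by neighborhoods A and B.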


### 'Neighborhood' is a venue category

In [16]:
toronto_venues[toronto_venues['Venue Category'] == 'Neighborhood']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
133,"Adelaide, King, Richmond",43.650571,-79.384568,Downtown Toronto,43.653232,-79.385296,Neighborhood
270,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752,Harbourfront,43.639526,-79.380688,Neighborhood
890,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
900,Studio District,43.659526,-79.340923,Leslieville,43.66207,-79.337856,Neighborhood


Later in this notebook I use one-hot encoding of the venue categories. One-hot encoding creates one column per venue category, including a column named 'Neighborhood'. But I already use the column name 'Neighborhood' for the name of the neighborhood, so the two columns would collide. To solve this I rename the venue category 'Neighborhood' to 'Locality'.

In [17]:
numberOfLocalityRows = toronto_venues[toronto_venues['Venue Category'] == 'Locality'].shape[0]
print('Number of venues of category \'Locality\':', numberOfLocalityRows)

numberOfNeighborhoodRows = toronto_venues[toronto_venues['Venue Category'] == 'Neighborhood'].shape[0]
print('Number of venues of category \'Neighborhood\':', numberOfNeighborhoodRows)

toronto_venues['Venue Category'] = toronto_venues['Venue Category'].str.replace('Neighborhood', 'Locality')

numberOfLocalityRows = toronto_venues[toronto_venues['Venue Category'] == 'Locality'].shape[0]
print('Number of venues of category \'Locality\':', numberOfLocalityRows)

numberOfNeighborhoodRows = toronto_venues[toronto_venues['Venue Category'] == 'Neighborhood'].shape[0]
print('Number of venues of category \'Neighborhood\':', numberOfNeighborhoodRows)

toronto_venues[toronto_venues['Venue Category'] == 'Locality']

Number of venues of category 'Locality': 0
Number of venues of category 'Neighborhood': 4
Number of venues of category 'Locality': 4
Number of venues of category 'Neighborhood': 0


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
133,"Adelaide, King, Richmond",43.650571,-79.384568,Downtown Toronto,43.653232,-79.385296,Locality
270,"Harbourfront East, Toronto Islands, Union Station",43.640816,-79.381752,Harbourfront,43.639526,-79.380688,Locality
890,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Locality
900,Studio District,43.659526,-79.340923,Leslieville,43.66207,-79.337856,Locality
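
The name collision that motivates the rename can be reproduced on a small made-up frame. This is only a sketch: the frame `toy` and its three rows are assumptions for illustration, and the exact-match `Series.replace` used here is an alternative to the `str.replace` call in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['The Beaches', 'The Beaches', 'Rosedale'],
    'Venue Category': ['Neighborhood', 'Pub', 'Park'],
})

# Without renaming: the dummy column 'Neighborhood' is overwritten by the
# assignment below, so that venue category disappears from the table.
clash = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='')
clash['Neighborhood'] = toy['Neighborhood']

# With renaming: every category keeps its own column.
renamed = toy.copy()
renamed['Venue Category'] = renamed['Venue Category'].replace('Neighborhood', 'Locality')
safe = pd.get_dummies(renamed[['Venue Category']], prefix='', prefix_sep='')
safe['Neighborhood'] = renamed['Neighborhood']

print(sorted(clash.columns))
print(sorted(safe.columns))
```

Without the rename the table ends up with three columns and the category is lost; with the rename it has four, one of them 'Locality'.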


### Number of different venue categories

In [18]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 213 unique categories.


## C: Analyze Each Neighborhood

This part closely follows the course's analysis of Manhattan neighborhoods.

In [19]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], sparse=False, prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

print(toronto_onehot.shape)
toronto_onehot.head()

(1175, 214)


Unnamed: 0,Neighborhood,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"The Annex, North Midtown, Yorkville",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"The Annex, North Midtown, Yorkville",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"The Annex, North Midtown, Yorkville",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"The Annex, North Midtown, Yorkville",0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Annex, North Midtown, Yorkville",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The rows are grouped by neighborhood and the mean frequency of each category is shown:

In [20]:
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

print(toronto_grouped.shape)
toronto_grouped.head()

(38, 214)


Unnamed: 0,Neighborhood,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Thrift / Vintage Store,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,0.0,...,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business reply mail Processing Centre969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.071429,0.071429,0.071429,0.142857,0.142857,0.142857,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
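
Why the mean of one-hot columns is a frequency: each venue contributes a 1 in exactly one category column, so the column mean within a neighborhood is the share of that neighborhood's venues in the category. A minimal sketch on made-up data (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'Neighborhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Café', 'Café', 'Park', 'Pub', 'Park', 'Park'],
})

# One column per category, 1.0 where the venue has that category
onehot = pd.get_dummies(toy['Venue Category']).astype(float)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])

# Mean per neighborhood = relative frequency of each category
grouped = onehot.groupby('Neighborhood').mean()
print(grouped)
```

Each row of `grouped` sums to 1, which also means that neighborhoods with very few venues get large frequencies for whatever happens to be there.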


From the above table we make a list of the top five venue categories in each neighborhood:

In [21]:
num_top_venues = 5

for hood in toronto_grouped['Neighborhood']:
    print('Neighborhood:', hood)
    temp = toronto_grouped[toronto_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

Neighborhood: Adelaide, King, Richmond
                 venue  freq
0  American Restaurant  0.08
1           Steakhouse  0.06
2                 Café  0.06
3          Coffee Shop  0.06
4            Gastropub  0.04


Neighborhood: Berczy Park
                venue  freq
0         Coffee Shop  0.08
1        Cocktail Bar  0.06
2  Seafood Restaurant  0.04
3         Cheese Shop  0.04
4            Beer Bar  0.04


Neighborhood: Brockton, Exhibition Place, Parkdale Village
                venue  freq
0         Coffee Shop  0.14
1      Breakfast Spot  0.10
2                Café  0.10
3  Falafel Restaurant  0.05
4       Burrito Place  0.05


Neighborhood: Business reply mail Processing Centre969 Eastern
              venue  freq
0       Yoga Studio  0.06
1     Auto Workshop  0.06
2        Comic Shop  0.06
3              Park  0.06
4  Recording Studio  0.06


Neighborhood: CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
                ve

To create the top venue categories list as a Pandas dataframe we first create a helper function:

In [22]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Then we can create the Pandas dataframe with the top venue categories for each neighborhood:

In [23]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # no special suffix left in 'indicators': fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

print(neighborhoods_venues_sorted.shape)
neighborhoods_venues_sorted.head()

(38, 11)


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",American Restaurant,Café,Coffee Shop,Steakhouse,Breakfast Spot,Gastropub,Asian Restaurant,Restaurant,Bar,Hotel
1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Beer Bar,Farmers Market,Steakhouse,Cheese Shop,Café,Seafood Restaurant,Bistro
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Breakfast Spot,Bakery,Gym,Furniture / Home Store,Italian Restaurant,Falafel Restaurant,Convenience Store,Performing Arts Venue
3,Business reply mail Processing Centre969 Eastern,Yoga Studio,Auto Workshop,Park,Comic Shop,Pizza Place,Butcher,Burrito Place,Recording Studio,Restaurant,Brewery
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Service,Airport Terminal,Boutique,Airport,Airport Food Court,Airport Gate,Sculpture Garden,Boat or Ferry,Harbor / Marina
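
The column labels above rely on a three-element suffix list, which is enough for ten columns. If more columns were ever needed, a small helper could produce the suffixes instead. The function `ordinal` below is a hypothetical helper and not part of the course material; it is only a sketch of one way to do it.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # Numbers ending in 11, 12 or 13 are irregular and always take 'th'
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 10
columns = ['Neighborhood'] + ['{} Most Common Venue'.format(ordinal(i))
                              for i in range(1, num_top_venues + 1)]
print(columns)
```

For ten columns this gives the same labels as the cell above; the difference only shows for values such as 21 or 22.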


## D: Cluster Neighborhoods

Cluster the neighborhoods into 5 clusters using the k-means algorithm on the table of mean venue category frequencies for each neighborhood (*toronto_grouped*). The top 10 table above is only used afterwards to describe the clusters.

In [24]:
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2,
       4, 2, 1, 2, 2, 1, 3, 2, 2, 2, 2, 2, 2, 0, 2, 2], dtype=int32)

The immediate observation is that most Toronto neighborhoods have the same kind of venues: 31 of the 38 neighborhoods end up in cluster 2.
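
The cluster sizes behind this observation can be tallied directly from the label array. The sketch below copies the labels from the output above into a plain list, so it runs without the fitted model; in the notebook itself the same count could be made on `kmeans.labels_`.

```python
from collections import Counter

# Labels copied from the kmeans.labels_ output above
labels = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 1, 2, 2, 2, 2,
          4, 2, 1, 2, 2, 1, 3, 2, 2, 2, 2, 2, 2, 0, 2, 2]

cluster_sizes = Counter(labels)
for cluster in sorted(cluster_sizes):
    print('Cluster {}: {} neighborhoods'.format(cluster, cluster_sizes[cluster]))
```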

To visualize the result I create a Pandas dataframe that holds the cluster label and the location data for each neighborhood.

In [25]:
# Append the cluster labels to a copy of the top 10 venues table,
# so that re-running this cell does not fail on an already existing column
toronto_merged = neighborhoods_venues_sorted.copy()
toronto_merged.insert(1, 'Cluster Labels', kmeans.labels_)

# Add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(toronto_data.set_index('Neighborhood'), on='Neighborhood')

print(toronto_merged.shape)
toronto_merged.head()

(38, 16)


Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,PostalCode,Borough,Latitude,Longitude
0,"Adelaide, King, Richmond",2,American Restaurant,Café,Coffee Shop,Steakhouse,Breakfast Spot,Gastropub,Asian Restaurant,Restaurant,Bar,Hotel,M5H,Downtown Toronto,43.650571,-79.384568
1,Berczy Park,2,Coffee Shop,Cocktail Bar,Bakery,Beer Bar,Farmers Market,Steakhouse,Cheese Shop,Café,Seafood Restaurant,Bistro,M5E,Downtown Toronto,43.644771,-79.373306
2,"Brockton, Exhibition Place, Parkdale Village",2,Coffee Shop,Café,Breakfast Spot,Bakery,Gym,Furniture / Home Store,Italian Restaurant,Falafel Restaurant,Convenience Store,Performing Arts Venue,M6K,West Toronto,43.636847,-79.428191
3,Business reply mail Processing Centre969 Eastern,2,Yoga Studio,Auto Workshop,Park,Comic Shop,Pizza Place,Butcher,Burrito Place,Recording Studio,Restaurant,Brewery,M7Y,East Toronto,43.662744,-79.321558
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",2,Airport Lounge,Airport Service,Airport Terminal,Boutique,Airport,Airport Food Court,Airport Gate,Sculpture Garden,Boat or Ferry,Harbor / Marina,M5V,Downtown Toronto,43.628947,-79.39442


We draw the same map as earlier in this notebook except that the cluster labels are shown as different colors.

In [26]:
# Create map of Toronto using latitude and longitude values
mapToronto = folium.Map(location=[latToronto, longToronto], zoom_start=12)

# set color scheme for the clusters: one rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to map.
for lat, lng, borough, neighborhood, cluster in zip(toronto_merged['Latitude'],
                                                    toronto_merged['Longitude'],
                                                    toronto_merged['Borough'],
                                                    toronto_merged['Neighborhood'],
                                                    toronto_merged['Cluster Labels']):
    label = 'Cluster {}: {}'.format(cluster, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(mapToronto)  
    
mapToronto

## E: Examine the clusters

First I create a function that returns a dataframe with the neighborhoods in a given cluster.

In [27]:
def neighborhoodsInCluster(cluster):
    clms = toronto_merged.columns[[0] + list(range(2, 12))]
    return toronto_merged.loc[toronto_merged['Cluster Labels'] == cluster, clms]

### Cluster 0

In [28]:
print('Neighborhoods in cluster ' + str(0))
neighborhoodsInCluster(0)

Neighborhoods in cluster 0


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
13,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Pub,Coffee Shop,Convenience Store,Supermarket,Fried Chicken Joint,Vietnamese Restaurant,Sushi Restaurant,Pizza Place,American Restaurant,Light Rail Station
35,The Beaches,Locality,Pub,Coffee Shop,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


The two neighborhoods in cluster 0 have only two venue categories in common among their top ten (Pub and Coffee Shop), which does not sound like much.

### Cluster 1

In [29]:
print('Neighborhoods in cluster ' + str(1))
neighborhoodsInCluster(1)

Neighborhoods in cluster 1


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
17,"Forest Hill North, Forest Hill West",Park,Trail,Jewelry Store,Sushi Restaurant,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dog Run
24,"Moore Park, Summerhill East",Park,Playground,Tennis Court,Restaurant,Yoga Studio,Cupcake Shop,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store
27,Rosedale,Park,Playground,Trail,Yoga Studio,Dance Studio,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


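Cluster 1 seems to be park areas: *Park* is the most common venue category in all three neighborhoods, and *Dog Run* appears in each of the top ten lists.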

### Cluster 2

In [30]:
print('Neighborhoods in cluster ' + str(2))
neighborhoodsInCluster(2)

Neighborhoods in cluster 2


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",American Restaurant,Café,Coffee Shop,Steakhouse,Breakfast Spot,Gastropub,Asian Restaurant,Restaurant,Bar,Hotel
1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Beer Bar,Farmers Market,Steakhouse,Cheese Shop,Café,Seafood Restaurant,Bistro
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Café,Breakfast Spot,Bakery,Gym,Furniture / Home Store,Italian Restaurant,Falafel Restaurant,Convenience Store,Performing Arts Venue
3,Business reply mail Processing Centre969 Eastern,Yoga Studio,Auto Workshop,Park,Comic Shop,Pizza Place,Butcher,Burrito Place,Recording Studio,Restaurant,Brewery
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Lounge,Airport Service,Airport Terminal,Boutique,Airport,Airport Food Court,Airport Gate,Sculpture Garden,Boat or Ferry,Harbor / Marina
5,"Cabbagetown, St. James Town",Coffee Shop,Restaurant,Chinese Restaurant,Pub,Café,Bakery,Pizza Place,Park,Italian Restaurant,Indian Restaurant
6,Central Bay Street,Coffee Shop,Café,Bubble Tea Shop,Spa,Sandwich Place,Italian Restaurant,Burger Joint,Bar,Seafood Restaurant,Ramen Restaurant
7,"Chinatown, Grange Park, Kensington Market",Café,Vietnamese Restaurant,Vegetarian / Vegan Restaurant,Caribbean Restaurant,Comfort Food Restaurant,Mexican Restaurant,Bakery,Chinese Restaurant,Ramen Restaurant,Smoke Shop
8,Christie,Grocery Store,Café,Park,Athletics & Sports,Italian Restaurant,Diner,Convenience Store,Nightclub,Restaurant,Baby Store
9,Church and Wellesley,Burger Joint,Gay Bar,Gastropub,Restaurant,Sushi Restaurant,Bubble Tea Shop,Japanese Restaurant,Coffee Shop,Men's Store,Salon / Barbershop


Cluster 2 seems to be city areas with coffee shops, cafés, pubs and a variety of restaurants.

### Cluster 3

In [31]:
print('Neighborhoods in cluster ' + str(3))
neighborhoodsInCluster(3)

Neighborhoods in cluster 3


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
28,Roselawn,Garden,Ice Cream Shop,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dog Run,Discount Store


The two most common venue categories in cluster 3 are Garden and Ice Cream Shop, which sounds rather boring (though I certainly like ice cream shops).

### Cluster 4

In [32]:
print('Neighborhoods in cluster ' + str(4))
neighborhoodsInCluster(4)

Neighborhoods in cluster 4


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Lawrence Park,Park,Dim Sum Restaurant,Swim School,Bus Line,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Dog Run


This park area might not be included in cluster 1 because I use too many clusters. If I cluster into four clusters instead of five, *Lawrence Park* might be included in cluster 1. I leave this for further experiments and improvements at a later time.
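
One possible starting point for such an experiment is to compare several values of k with the silhouette score. The sketch below is not run on the Toronto data: `X` is a synthetic stand-in for *toronto_grouped_clustering* with three obvious groups, and using the silhouette score as the selection criterion is an assumption of this sketch, not something taken from the course material.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for toronto_grouped_clustering: three tight groups
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.randn(20, 2) for c in centers])

# Fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print('Best k by silhouette score:', best_k)
```

With only 38 neighborhoods and some clusters that contain a single neighborhood, the score on the real data is likely to be noisy, so it would be a hint rather than a decision rule.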