# IBM Data Science Capstone Project


as part of the **IBM Data Science Professional Certificate**

by Kevin Götz

## Introduction

This project uses data from the location data provider "Foursquare". Using RESTful API calls to the Foursquare API I'm retrieving data about venues in different neighborhoods around the world. Also for data that are not readily available I'm scraping web data and I'm parsing HTML code.

## Table of Contents

1. Import Libraries

2. Webscraping & Cleaning Toronto Neighborhoods

3. Adding Coordinates
4. Clustering and Visualizing the Neighborhoods

    - 4.1 Getting the venues with Foursquare API
    
    - 4.2 Analysis of the Neighbourhoods

    - 4.3 Clustering of the Neighborhoods by Venues
    
    - 4.4 Visualizing the clustered Neighborhoods

## 1. Import Libraries

In [1]:
# Analysing and cleaning the data
import pandas as pd 
import numpy as np
import re

# Working with geographical data
import geocoder

# for the API calls
import requests

# Clustering
from sklearn.cluster import KMeans

# Visualizing
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

## 2. Webscraping & Cleaning Toronto Neighborhoods

In [2]:
# initial load of the uncleaned DataFrame
toronto_raw = pd.read_html(io='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
toronto_raw.head(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,M1ANot assigned,M2ANot assigned,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),M8ANot assigned,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),M2BNot assigned,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),M7BNot assigned,M8BNot assigned,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,M2CNot assigned,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),M7CNot assigned,M8CNot assigned,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,M2ENot assigned,M3ENot assigned,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),M7ENot assigned,M8ENot assigned,M9ENot assigned
4,M1GScarborough(Woburn),M2GNot assigned,M3GNot assigned,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),M7GNot assigned,M8GNot assigned,M9GNot assigned
5,M1HScarborough(Cedarbrae),M2HNorth York(Hillcrest Village),M3HNorth York(Bathurst Manor / Wilson Heights ...,M4HEast York(Thorncliffe Park),M5HDowntown Toronto(Richmond / Adelaide / King),M6HWest Toronto(Dufferin / Dovercourt Village),M7HNot assigned,M8HNot assigned,M9HNot assigned
6,M1JScarborough(Scarborough Village),M2JNorth York(Fairview / Henry Farm / Oriole),M3JNorth York(Northwood Park / York University),M4JEast YorkEast Toronto(The Danforth East),M5JDowntown Toronto(Harbourfront East / Union ...,M6JWest Toronto(Little Portugal / Trinity),M7JNot assigned,M8JNot assigned,M9JNot assigned
7,M1KScarborough(Kennedy Park / Ionview / East B...,M2KNorth York(Bayview Village),M3KNorth York(Downsview)East (CFB Toronto),M4KEast Toronto(The Danforth West / Riverdale),M5KDowntown Toronto(Toronto Dominion Centre / ...,M6KWest Toronto(Brockton / Parkdale Village / ...,M7KNot assigned,M8KNot assigned,M9KNot assigned
8,M1LScarborough(Golden Mile / Clairlea / Oakridge),M2LNorth York(York Mills / Silver Hills),M3LNorth York(Downsview)West,M4LEast Toronto(India Bazaar / The Beaches West),M5LDowntown Toronto(Commerce Court / Victoria ...,M6LNorth York(North Park / Maple Leaf Park / U...,M7LNot assigned,M8LNot assigned,M9LNorth York(Humber Summit)
9,M1MScarborough(Cliffside / Cliffcrest / Scarbo...,M2MNorth York(Willowdale / Newtonbrook),M3MNorth York(Downsview)Central,M4MEast Toronto(Studio District),M5MNorth York(Bedford Park / Lawrence Manor East),M6MYork(Del Ray / Mount Dennis / Keelsdale and...,M7MNot assigned,M8MNot assigned,M9MNorth York(Humberlea / Emery)


In [3]:
# set all "Not assigned" cells to a proper NaN
toronto_raw = toronto_raw.applymap(lambda x: np.nan if re.search('not assigned', x, re.IGNORECASE) else x)
toronto_raw.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,,,M3ANorth York(Parkwoods),M4ANorth York(Victoria Village),M5ADowntown Toronto(Regent Park / Harbourfront),M6ANorth York(Lawrence Manor / Lawrence Heights),M7AQueen's Park(Ontario Provincial Government),,M9AEtobicoke(Islington Avenue)
1,M1BScarborough(Malvern / Rouge),,M3BNorth York(Don Mills)North,M4BEast York(Parkview Hill / Woodbine Gardens),"M5BDowntown Toronto(Garden District, Ryerson)",M6BNorth York(Glencairn),,,M9BEtobicoke(West Deane Park / Princess Garden...
2,M1CScarborough(Rouge Hill / Port Union / Highl...,,M3CNorth York(Don Mills)South(Flemingdon Park),M4CEast York(Woodbine Heights),M5CDowntown Toronto(St. James Town),M6CYork(Humewood-Cedarvale),,,M9CEtobicoke(Eringate / Bloordale Gardens / Ol...
3,M1EScarborough(Guildwood / Morningside / West ...,,,M4EEast Toronto(The Beaches),M5EDowntown Toronto(Berczy Park),M6EYork(Caledonia-Fairbanks),,,
4,M1GScarborough(Woburn),,,M4GEast York(Leaside),M5GDowntown Toronto(Central Bay Street),M6GDowntown Toronto(Christie),,,


In [4]:
# extracting the Information from the Cells

# setting up the new DataFrame
toronto_clean = pd.DataFrame(columns=['postal_code', 'borough', 'neighborhood'])

# apply the cleanup to elementwise to the DataFrame and convert it to a Series
toronto_clean['postal_code'] = pd.Series(toronto_raw.applymap(lambda x: x[:3], na_action='ignore').to_numpy().reshape(-1,))
toronto_clean['borough'] = pd.Series(toronto_raw.applymap(lambda x: x[3:x.index('(')], na_action='ignore').to_numpy().reshape(-1,))
toronto_clean['neighborhood'] = pd.Series(toronto_raw.applymap(lambda x: x[x.index('(') + 1:x.index(')')], na_action='ignore').to_numpy().reshape(-1,))

In [5]:
# drop empty cells & reset index
toronto_clean.dropna(axis=0, how='all', inplace=True)
toronto_clean.reset_index(drop=True, inplace=True)

#check the result
print('---No NaN left---', toronto_clean.isna().sum(), sep='\n\n')

---No NaN left---

postal_code     0
borough         0
neighborhood    0
dtype: int64


In [6]:
# cleaning the neighborhood column
toronto_clean['neighborhood'] = toronto_clean['neighborhood'].map(lambda x: ', '.join([word.strip() for word in x.split('/')]))
toronto_clean.neighborhood.value_counts()

Downsview                                                                                                                                 4
Don Mills                                                                                                                                 2
Willowdale                                                                                                                                2
The Beaches                                                                                                                               1
Clarks Corners, Tam O'Shanter, Sullivan                                                                                                   1
                                                                                                                                         ..
Old Mill South, King's Mill Park, Sunnylea, Humber Bay, Mimico NE, The Queensway East, Royal York South East, Kingsway Park South East    1
Woburn              

In [7]:
# cleaning the borough column
toronto_clean['borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto',
                                  'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto',
                                  'EtobicokeNorthwest':'Etobicoke Northwest',
                                  'East YorkEast Toronto':'East York / East Toronto',
                                  'MississaugaCanada Post Gateway Processing Centre':'Mississauga'},
                                  inplace=True)
toronto_clean.borough.value_counts()

North York                  24
Downtown Toronto            18
Scarborough                 17
Etobicoke                   11
Central Toronto              9
West Toronto                 6
York                         5
East Toronto                 5
East York                    4
Mississauga                  1
East York / East Toronto     1
Queen's Park                 1
Etobicoke Northwest          1
Name: borough, dtype: int64

In [8]:
# drop duplicate rows
toronto_clean.drop_duplicates(inplace=True, ignore_index=True)

In [9]:
# get the shape of the cleaned data
toronto_clean.shape

(103, 3)

## 3. Adding Coordinates

In [10]:
# initialize the variables
code_list = toronto_clean['postal_code'].tolist()
code_coords = pd.DataFrame(columns=['postal_code', 'latitude', 'longitude'])

# loop until you get all the coordinates
for postal_code in code_list:
    g = geocoder.arcgis(f'{postal_code}, Toronto, Ontario')
    lat_lng_coords = g.latlng
    latitude = lat_lng_coords[0]
    longitude = lat_lng_coords[1]
    code_coords = code_coords.append({'postal_code': postal_code,
                                      'latitude': latitude,
                                      'longitude': longitude}, ignore_index=True)

code_coords

Unnamed: 0,postal_code,latitude,longitude
0,M3A,43.75245,-79.32991
1,M4A,43.73057,-79.31306
2,M5A,43.65512,-79.36264
3,M6A,43.72327,-79.45042
4,M7A,43.66253,-79.39188
...,...,...,...
98,M8X,43.65319,-79.51113
99,M4Y,43.66659,-79.38133
100,M7Y,43.64869,-79.38544
101,M8Y,43.63278,-79.48945


In [11]:
# join the new table with the old one
toronto_coords = pd.merge(toronto_clean, code_coords, how='left', on='postal_code')
toronto_coords

Unnamed: 0,postal_code,borough,neighborhood,latitude,longitude
0,M3A,North York,Parkwoods,43.75245,-79.32991
1,M4A,North York,Victoria Village,43.73057,-79.31306
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042
4,M7A,Queen's Park,Ontario Provincial Government,43.66253,-79.39188
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.65319,-79.51113
99,M4Y,Downtown Toronto,Church and Wellesley,43.66659,-79.38133
100,M7Y,East Toronto,Enclave of M4L,43.64869,-79.38544
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.63278,-79.48945


## 4. Clustering and Visualizing the Neighborhoods

In [12]:
# REVERSED: For illustration purposes the clustering is applied to boroughs with "Toronto" in their names only
# toronto_small = toronto_coords[toronto_coords.borough.str.contains(pat='toronto', case=False)].reset_index(drop=True)
toronto_small = toronto_coords

### 4.1 Getting the venues with the Foursquare API

In [23]:
# loading the passwords from a safe environment

import json
import sys

try:  # error handling
    with open('.\API_Keys.json', "r") as handle:
        pw = json.load(handle)

    CLIENT_ID = pw['Foursquare']['CLIENT_ID']
    CLIENT_SECRET = pw['Foursquare']['CLIENT_SECRET']

except:  # print error message
    print(f'Oops...there was an "{sys.exc_info()[0].__name__}" ! Please check the pw-file manually.')

else:
    print('The Loading of the Passwords was succesful!')
    VERSION = '20180605' # Foursquare API version
    LIMIT = 100 # A default Foursquare API limit value


The Loading of the Passwords was succesful!


In [14]:
# defining the function to get the nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [15]:
# building the new DataFrame with the venues
toronto_venues = getNearbyVenues(names=toronto_small['neighborhood'],
                                 latitudes=toronto_small['latitude'],
                                 longitudes=toronto_small['longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
The Danforth East
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview
The Danforth West, Riverdale
T

### 4.2 Analysis of the Neighbourhoods

In [16]:
# Size and look of the Resulting DataFrame
print(toronto_venues.shape)
toronto_venues.head()

(2246, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.75245,-79.32991,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.75245,-79.32991,KFC,43.754387,-79.333021,Fast Food Restaurant
2,Parkwoods,43.75245,-79.32991,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Parkwoods,43.75245,-79.32991,Towns On The Ravine,43.754754,-79.332552,Hotel
4,Victoria Village,43.73057,-79.31306,Wigmore Park,43.731023,-79.310771,Park


In [17]:
# Description of the DataFrame
toronto_venues.describe(include='object')

Unnamed: 0,Neighborhood,Venue,Venue Category
count,2246,2246,2246
unique,96,1288,256
top,"First Canadian Place, Underground city",Tim Hortons,Coffee Shop
freq,100,68,213


In [18]:
# which are the top brands per category? (only one brand per category)
category_stars = pd.DataFrame(toronto_venues.groupby(by=['Venue Category', 'Venue'])['Venue'].count().sort_values(ascending=False))
category_stars.rename({'Venue': 'Count'}, axis=1, inplace=True)
category_stars.reset_index(inplace=True)
category_stars.drop_duplicates(subset='Venue Category', keep='first', inplace=True, ignore_index=True)
category_stars.head(5)

Unnamed: 0,Venue Category,Venue,Count
0,Coffee Shop,Tim Hortons,68
1,Sandwich Place,Subway,66
2,Pharmacy,Shoppers Drug Mart,28
3,Bank,RBC Royal Bank,21
4,Pizza Place,Pizza Pizza,18


In [19]:
# which are the top 3 categories by neighborhood?
neigh_cat = pd.DataFrame(toronto_venues.groupby(by=['Neighborhood', 'Venue Category'])['Venue'].count())
neigh_cat.rename({'Venue': 'Count'}, axis=1, inplace=True)
neigh_cat.sort_values(by=['Neighborhood', 'Count'], ascending=False, inplace=True)
neigh_cat.reset_index(inplace=True)

# calculate the rank
neigh_cat['Rank'] = neigh_cat.groupby(by='Neighborhood').rank(method='first', ascending=False).astype('int')

# select the neighborhood you want to inspect and give out the top 3 categories
neigh = 'Toronto Dominion Centre, Design Exchange'  # <-- PUT IN YOUR NEIGHBORHOOD HERE
neigh_cat.loc[neigh_cat['Neighborhood'] == neigh, neigh_cat.columns[:-1]].reset_index(drop=True).head(3)

Unnamed: 0,Neighborhood,Venue Category,Count
0,"Toronto Dominion Centre, Design Exchange",Coffee Shop,14
1,"Toronto Dominion Centre, Design Exchange",Café,8
2,"Toronto Dominion Centre, Design Exchange",Hotel,5


In [20]:
# which are the top 3 venues by neighborhood?
neigh_ven = pd.DataFrame(toronto_venues.groupby(by=['Neighborhood', 'Venue'])['Venue'].count())
neigh_ven.rename({'Venue': 'Count'}, axis=1, inplace=True)
neigh_ven.sort_values(by=['Neighborhood', 'Count'], ascending=False, inplace=True)
neigh_ven.reset_index(inplace=True)

# calculate the rank
neigh_ven['Rank'] = neigh_ven.groupby(by='Neighborhood').rank(method='first', ascending=False).astype('int')

# select the neighborhood you want to inspect and give out the top 3 categories
neigh = 'Toronto Dominion Centre, Design Exchange'  # <-- PUT IN YOUR NEIGHBORHOOD HERE
neigh_ven.loc[neigh_ven['Neighborhood'] == neigh, neigh_ven.columns[:-1]].reset_index(drop=True).head(3)

Unnamed: 0,Neighborhood,Venue,Count
0,"Toronto Dominion Centre, Design Exchange",Tim Hortons,4
1,"Toronto Dominion Centre, Design Exchange",Shoppers Drug Mart,3
2,"Toronto Dominion Centre, Design Exchange",Starbucks,3


In [21]:
# Transposing the venue-ranking

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in range(num_top_venues):
    try:
        columns.append(f'{ind+1}{indicators[ind]} Most Common Venue')
    except:
        columns.append(f'{ind+1}th Most Common Venue')

# create a new DataFrame
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = neigh_cat['Neighborhood'].unique()

# fill the new DataFrame
neigh_list = list(neigh_cat.Neighborhood.unique())
for neigh in neigh_list:
    for ind in range(num_top_venues):
        try:
            neighborhoods_venues_sorted.iloc[neigh_list.index(neigh), ind+1] = \
            neigh_cat.loc[neigh_cat['Neighborhood'] == neigh, neigh_cat.columns[1]].T.tolist()[ind]
        except:
            neighborhoods_venues_sorted.iloc[neigh_list.index(neigh), ind+1] = np.nan

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"York Mills, Silver Hills",Park,,,,,,,,,
1,York Mills West,Convenience Store,Park,Speakeasy,,,,,,,
2,Woodbine Heights,Grocery Store,Bus Line,Pharmacy,Bakery,Breakfast Spot,Café,Coffee Shop,Dance Studio,Fast Food Restaurant,Gas Station
3,Woburn,Business Service,Coffee Shop,Korean BBQ Restaurant,Park,Soccer Field,,,,,
4,"Willowdale, Newtonbrook",Café,Korean Restaurant,Middle Eastern Restaurant,Pizza Place,Coffee Shop,Fried Chicken Joint,Grocery Store,Halal Restaurant,Hookah Bar,Japanese Restaurant


### 4.3 Clustering of the Neighborhoods by Venues

In [22]:
# preparing the DataFrame for Clustering

# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues['Venue Category'], prefix='', prefix_sep='')

# add neighborhood column back to dataframe (drop because it is already in but not the correct one)
toronto_onehot.drop('Neighborhood', axis=1, inplace=True)
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = ['Neighborhood'] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Arts & Crafts Store,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
#calculating the percentage of the venue categories per Neighborhood
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

toronto_grouped.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,American Restaurant,Antique Shop,Aquarium,Art Gallery,Arts & Crafts Store,...,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
# Clustering

# set number of clusters
kclusters = 3

toronto_grouped_clustering =toronto_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1])

In [25]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_small

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.merge(neighborhoods_venues_sorted, how='inner', left_on='neighborhood', right_on='Neighborhood')
toronto_merged.drop('Neighborhood', axis=1, inplace=True)

toronto_merged.head() # check the last columns!

Unnamed: 0,postal_code,borough,neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.75245,-79.32991,1,Fast Food Restaurant,Food & Drink Shop,Hotel,Park,,,,,,
1,M4A,North York,Victoria Village,43.73057,-79.31306,1,Grocery Store,Nail Salon,Park,,,,,,,
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65512,-79.36264,2,Coffee Shop,Bakery,Breakfast Spot,Discount Store,Distribution Center,Electronics Store,Event Space,Greek Restaurant,Italian Restaurant,Pub
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.72327,-79.45042,1,Clothing Store,Bookstore,Coffee Shop,Fast Food Restaurant,Furniture / Home Store,Restaurant,Toy / Game Store,American Restaurant,Café,Chocolate Shop
4,M7A,Queen's Park,Ontario Provincial Government,43.66253,-79.39188,0,Coffee Shop,Burrito Place,Falafel Restaurant,Bank,Bar,Burger Joint,Café,Gastropub,Gym,Mediterranean Restaurant


In [26]:
# dispersion of the cluster labels
toronto_merged['Cluster Labels'].value_counts()

1    82
0    17
2     2
Name: Cluster Labels, dtype: int64

### 4.4 Visualizing the clustered Neighborhoods

In [27]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['latitude'], toronto_merged['longitude'], toronto_merged['neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters