
# Toronto Neighbourhoods Segmentation & Clustering

## Applied Data Science Capstone Project | Week 3 | Peer-Graded Assignment
## Prabhav Pratyaksh 26th July 2021

### Importing libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
!pip install geocoder
import geocoder
!pip install geopy
from geopy.geocoders import Nominatim 
import folium
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
print("All libraries installed")

Collecting geocoder
  Downloading geocoder-1.38.1-py2.py3-none-any.whl (98 kB)
Collecting ratelim
  Downloading ratelim-0.1.6-py2.py3-none-any.whl (4.0 kB)
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
All libraries installed


### Part 1 - Webscraping


In [2]:
# send request to Wikipedia link
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
response = requests.get(url).text

# using BeautifulSoup, get relevant tags for parsing
soup = BeautifulSoup(response, 'html5lib')
table = soup.find('table')
fields = table.find_all('td')
#print(soup.prettify())

rows = []  # one dict per assigned postal code

# iterating through every table cell to separate the postal codes, boroughs, and neighbourhoods
for cell in fields:
    if cell.span.text == 'Not assigned':
        continue
    PC = cell.p.text[:3]
    BH = cell.span.text.split('(')[0]
    NB = cell.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    rows.append({"Postal Code": PC, "Borough": BH, "Neighbourhood(s)": NB})

# DataFrame.append was removed in pandas 2.0, so the dataframe is built in one step
df_1 = pd.DataFrame(rows, columns=["Postal Code", "Borough", "Neighbourhood(s)"])

df_1['Borough']=df_1['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})

df_1.head()
print ("Dataframe has {} rows and {} columns".format(df_1.shape[0],df_1.shape[1]))


Dataframe has 103 rows and 3 columns
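
The string handling inside the scraping loop can be tried in isolation. The sketch below assumes each cell's text has the form `Borough(Neighbourhood A / Neighbourhood B)`, which is what the split and replace calls above imply; `split_cell` is a hypothetical helper, not part of the original notebook.

```python
def split_cell(text):
    """Split 'Borough(Neighbourhood A / Neighbourhood B)' into its two parts."""
    # everything before the first '(' is the borough
    borough, _, rest = text.partition('(')
    # drop the closing ')' and turn ' /' separators into commas
    neighbourhoods = rest.rstrip(')').replace(' /', ',').strip()
    return borough.strip(), neighbourhoods


print(split_cell('Downtown Toronto(Regent Park / Harbourfront)'))
print(split_cell('North York(Parkwoods)'))
```

Keeping the parsing in a small function like this makes it easy to test against the odd cells (for example the Canada Post processing centres) before running the full scrape.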


In [3]:
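# Checking whether any Borough or Neighbourhood still has the value "Not assigned".
# filter(like=..., axis=0) matches index labels, not cell values, so the columns are compared directly.
df_1[(df_1['Borough'] == 'Not assigned') | (df_1['Neighbourhood(s)'] == 'Not assigned')]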

Unnamed: 0,Postal Code,Borough,Neighbourhood(s)


The check above returns no rows, so no Borough or Neighbourhood has the value "Not assigned". This is expected, because those cells were already skipped while scraping.

Had any Borough been "Not assigned", the code below would drop those rows from the dataframe.

In [4]:
df_1.drop(df_1[df_1['Borough'] == 'Not assigned'].index, inplace=True)
print("Dataframe now has {} rows and {} columns".format(df_1.shape[0],df_1.shape[1]))
df_1.head()

Dataframe now has 103 rows and 3 columns


Unnamed: 0,Postal Code,Borough,Neighbourhood(s)
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government


Note that the dataframe has the same number of rows and columns after the drop, which confirms that nothing was removed.

Similarly, the code below would replace any Neighbourhood that is "Not assigned" with its Borough. As seen above, no such rows exist, so it leaves the dataframe unchanged.

In [5]:
df_1.loc[df_1['Neighbourhood(s)'] == "Not assigned", "Neighbourhood(s)"] = df_1.loc[df_1['Neighbourhood(s)'] == "Not assigned", "Borough"]
df_1

Unnamed: 0,Postal Code,Borough,Neighbourhood(s)
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Queen's Park,Ontario Provincial Government
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto Business,Enclave of M4L
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
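
Because the scraped data has no "Not assigned" values, neither of the two cleaning steps above changes anything. A toy dataframe makes their effect visible. The rows below are invented for illustration (only the column names come from the notebook).

```python
import pandas as pd

# hypothetical rows: one unassigned borough, one unassigned neighbourhood
toy = pd.DataFrame({
    "Postal Code": ["M1A", "M3A", "M5A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood(s)": ["Not assigned", "Parkwoods", "Not assigned"],
})

# step 1: drop rows whose borough is not assigned
toy = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# step 2: fall back to the borough name where the neighbourhood is not assigned
mask = toy["Neighbourhood(s)"] == "Not assigned"
toy.loc[mask, "Neighbourhood(s)"] = toy.loc[mask, "Borough"]

print(toy)
```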


Now, we group the Neighbourhoods by postal code and borough, joining the names of any that share a postal code with commas.

In [6]:

df_1 = df_1.groupby(['Postal Code', 'Borough'])['Neighbourhood(s)'].apply(', '.join).reset_index()
df_1.columns = ['Postal Code', 'Borough', 'Neighbourhood(s)']
df_1

Unnamed: 0,Postal Code,Borough,Neighbourhood(s)
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [7]:
print("Dataframe now has {} rows and {} columns".format(df_1.shape[0],df_1.shape[1]))

Dataframe now has 103 rows and 3 columns
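
The row count stays at 103 because each postal code already occupies a single row. To see what the grouping would do otherwise, here is a minimal sketch on a toy dataframe in which one postal code is spread over two rows (the split of M5A into two rows is an assumption made for the example).

```python
import pandas as pd

toy = pd.DataFrame({
    "Postal Code": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood(s)": ["Regent Park", "Harbourfront", "Parkwoods"],
})

# one row per (postal code, borough), neighbourhood names joined with ', '
grouped = (
    toy.groupby(["Postal Code", "Borough"])["Neighbourhood(s)"]
       .agg(", ".join)
       .reset_index()
)

print(grouped)
```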


### Part 2 - Geographical Coordinates

For this part, calls to the geocoder package caused the kernel to hang, so I used the provided CSV file of geospatial coordinates instead.

In [9]:
# Reading the geospatial coordinates file as a dataframe
filepath = r'/content/Geospatial_Coordinates.csv'
coordinates = pd.read_csv(filepath)

# Merging the 2 dataframes based on the postal codes
df_2 = df_1.merge(coordinates, how='inner', on='Postal Code')
df_2

Unnamed: 0,Postal Code,Borough,Neighbourhood(s),Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
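
An inner merge silently drops any postal code that has no match in the coordinates file. One way to check for that, sketched here on toy data, is a left merge with `indicator=True`. The postal code `M9Z` and its borough are made up so that the example has an unmatched row.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M1B", "M1C", "M9Z"],  # M9Z is hypothetical
    "Borough": ["Scarborough", "Scarborough", "Nowhere"],
})
coords = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
})

# keep every row of `left`; the _merge column records where each row came from
check = left.merge(coords, how="left", on="Postal Code",
                   indicator=True, validate="one_to_one")

missing = check.loc[check["_merge"] == "left_only", "Postal Code"].tolist()
print("Postal codes without coordinates:", missing)
```

`validate="one_to_one"` additionally raises an error if a postal code appears more than once on either side, which would otherwise duplicate rows.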


### Part 3 - Segmentation & Clustering of Neighbourhoods

For this part, only boroughs whose names contain "Toronto" are analysed.

In [10]:
# filter and reset the index in one step, so df is a new dataframe rather than a modified slice of df_2
df = df_2[df_2["Borough"].str.contains("Toronto")].reset_index(drop=True)
df

Unnamed: 0,Postal Code,Borough,Neighbourhood(s),Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106
2,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
3,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572
4,M4M,East Toronto,Studio District,43.659526,-79.340923
5,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879
6,M4P,Central Toronto,Davisville North,43.712751,-79.390197
7,M4R,Central Toronto,North Toronto West,43.715383,-79.405678
8,M4S,Central Toronto,Davisville,43.704324,-79.38879
9,M4T,Central Toronto,"Moore Park, Summerhill East",43.689574,-79.38316


Now, we visualize these neighbourhoods on a map of Toronto.

Before we do that, we need the coordinates of Toronto itself to centre the map.

In [11]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(f"The latitude and longitude of Toronto are {latitude},{longitude}")

The latitude and longitude of Toronto are 43.6534817,-79.3839347


In [12]:
# create map of Toronto using latitude and longitude values
map_tor = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood(s)']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_tor)  
    
map_tor

First, we define our Foursquare credentials.

In [121]:
CLIENT_ID = 'Not revealing' # your Foursquare ID
CLIENT_SECRET = 'Not revealing this too' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: Not revealing
CLIENT_SECRET: Not revealing this too


Now, let's explore a neighbourhood. For this task, we will request up to 100 venues within a radius of 500 m.

In [14]:
nbhr = 'Trinity'
nbhr_df = df[df['Neighbourhood(s)'].str.contains(nbhr)]
# read the coordinates of the first matching row by column name
lat = nbhr_df['Latitude'].iloc[0]
lng = nbhr_df['Longitude'].iloc[0]
print(f"Latitude and longitude values of {nbhr} are {lat}, {lng}.")

Latitude and longitude values of Trinity are 43.647926700000006, -79.4197497.


In [15]:
#Foursquare credentials
client_id = CLIENT_ID
client_secret = CLIENT_SECRET
version = '20180605'
limit = 100
radius = 500

In [16]:
# URL request construction
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    client_id, 
    client_secret, 
    version, 
    lat, 
    lng, 
    radius, 
    limit)

# send the request, fail loudly on an HTTP error, then parse the JSON body
response = requests.get(url)
response.raise_for_status()
results = response.json()
print("Request successful.")

Request successful.
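
Formatting the query string by hand is easy to get subtly wrong. As an alternative, the same query can be kept in a dictionary and passed to `requests.get` through its `params` argument. The sketch below only builds and prints the URL, so it runs without credentials or network access. `explore_params` and `fetch_venues` are hypothetical helpers, and the ID and secret are placeholders.

```python
from urllib.parse import urlencode

EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def explore_params(client_id, client_secret, lat, lng,
                   version="20180605", radius=500, limit=100):
    """Collect the query parameters used by the explore request."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def fetch_venues(params, timeout=10):
    """Send the request and return the list of venue items (needs network access)."""
    import requests  # imported here so the rest of the sketch runs without it
    response = requests.get(EXPLORE_URL, params=params, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of failing later
    return response.json()["response"]["groups"][0]["items"]


params = explore_params("YOUR_ID", "YOUR_SECRET", 43.6479267, -79.4197497)
print(f"{EXPLORE_URL}?{urlencode(params)}")
```

With real credentials, `fetch_venues(params)` would return the same `items` list that the notebook extracts from `results`.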


In [17]:
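# function that extracts the category of the venue
def get_category_type(row):
    # the column is called 'categories' or 'venue.categories' depending on the response shape
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    # a venue may have no category at all
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']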

Now we clean the JSON response by converting it to a pandas dataframe.

In [18]:
venues = results['response']['groups'][0]['items']
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

print(f"{nearby_venues.shape[0]} venues were returned by Foursquare.")
nearby_venues.head()

42 venues were returned by Foursquare.


Unnamed: 0,name,categories,lat,lng
0,Pizzeria Libretto,Pizza Place,43.648979,-79.420604
1,Bellwoods Brewery,Brewery,43.647097,-79.419955
2,Foxley Bistro,Asian Restaurant,43.648643,-79.420495
3,Bang Bang Ice Cream & Bakery,Ice Cream Shop,43.646246,-79.419553
4,Paris Paris Bar,Wine Bar,43.649237,-79.421436
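
The flattening step can be reproduced offline on a hand-written list of items. The structure below mirrors the keys the notebook reads (`venue.name`, `venue.categories`, `venue.location.lat`, `venue.location.lng`). The first item reuses values from the output above, and the second, with an empty category list, is invented to show the `None` case.

```python
import pandas as pd

items = [
    {"venue": {"name": "Pizzeria Libretto",
               "categories": [{"name": "Pizza Place"}],
               "location": {"lat": 43.648979, "lng": -79.420604}}},
    {"venue": {"name": "Unnamed Spot",  # hypothetical venue without a category
               "categories": [],
               "location": {"lat": 43.647, "lng": -79.42}}},
]

# nested dicts become dotted column names; lists are kept as-is
flat = pd.json_normalize(items)
flat = flat[["venue.name", "venue.categories",
             "venue.location.lat", "venue.location.lng"]].copy()

# keep only the name of the first category, or None if there is none
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)

# strip the 'venue.' / 'venue.location.' prefixes
flat.columns = [col.split(".")[-1] for col in flat.columns]
print(flat)
```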


In [19]:
# Some preliminary analysis on nearby_venues
top_venues = nearby_venues['categories'].value_counts()
top_venues.loc[top_venues >1]

Bar                      3
Asian Restaurant         2
Diner                    2
Vietnamese Restaurant    2
Café                     2
Men's Store              2
Restaurant               2
Name: categories, dtype: int64

Now we explore venues in all neighbourhoods

In [25]:
def get_nearby_venues(names, lats, lngs, radius=500, limit=100):
    venues_list = []
    for name, lat, lng in zip(names, lats, lngs):
        # specify the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            client_id,
            client_secret,
            version,
            lat,
            lng,
            radius,
            limit)
        # make the request, store the response
        results = requests.get(url).json()['response']['groups'][0]['items']
        # extract relevant information from each venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    # populate the dataframe with venues list
    venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    venues.columns = ['Neighborhood',
                      'Neighborhood Latitude',
                      'Neighborhood Longitude',
                      'Venue',
                      'Venue Latitude',
                      'Venue Longitude',
                      'Venue Category']
    return venues

Now, we will use the above function on all neighbourhoods

In [81]:
all_venues = get_nearby_venues(
    df['Neighbourhood(s)'],
    df['Latitude'],
    df['Longitude']
)
all_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,The Beaches,43.676357,-79.293031,Seaspray Restaurant,43.678888,-79.298167,Asian Restaurant
...,...,...,...,...,...,...,...
1602,Enclave of M4L,43.662744,-79.321558,The Ashbridge Estate,43.664691,-79.321805,Garden
1603,Enclave of M4L,43.662744,-79.321558,TTC Russell Division,43.664908,-79.322560,Light Rail Station
1604,Enclave of M4L,43.662744,-79.321558,Jonathan Ashbridge Park,43.664702,-79.319898,Park
1605,Enclave of M4L,43.662744,-79.321558,Olliffe On Queen,43.664503,-79.324768,Butcher


In [34]:
print(f"In all the neighbourhoods, there are {all_venues.shape[0]} venues across {all_venues['Venue Category'].nunique()} categories")
all_venues[['Neighborhood', 'Venue']].groupby('Neighborhood').count()

In all the neighbourhoods, there are 1607 venues across 237 categories


Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Berczy Park,57
"Brockton, Parkdale Village, Exhibition Place",27
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",17
Central Bay Street,68
Christie,16
Church and Wellesley,78
"Commerce Court, Victoria Hotel",100
Davisville,35
Davisville North,10
"Dufferin, Dovercourt Village",17


Next, we analyse each neighbourhood. One-hot encoding the venue categories and then averaging per neighbourhood gives the relative frequency of each category in that neighbourhood.

In [36]:
# one hot encoding
all_venue_onehot = pd.get_dummies(all_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
all_venue_onehot['Neighborhood'] = all_venues['Neighborhood'] 
all_venue_onehot = all_venue_onehot.groupby('Neighborhood').mean().reset_index()

all_venue_onehot.head()

Unnamed: 0,Neighborhood,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bistro,Boat or Ferry,Bookstore,Boutique,Brazilian Restaurant,Breakfast Spot,Brewery,Bubble Tea Shop,Building,Burger Joint,Burrito Place,...,Shoe Store,Shopping Mall,Skate Park,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soup Place,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Sri Lankan Restaurant,Stadium,Stationery Store,Steakhouse,Strip Club,Summer Camp,Supermarket,Supplement Shop,Sushi Restaurant,Swim School,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Yoga Studio
0,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.052632,0.0,0.0,0.0,0.017544,0.017544,0.0,0.035088,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,...,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0,0.0,0.017544,0.0,0.0,0.0,0.0,0.0
1,"Brockton, Parkdale Village, Exhibition Place",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074074,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.037037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"CN Tower, King and Spadina, Railway Lands, Har...",0.058824,0.058824,0.058824,0.117647,0.176471,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.058824,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014706,0.014706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.029412,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.029412,0.0,0.0,0.0,0.0,0.0,0.0,0.014706,0.0,0.0,0.0,0.0,0.014706,0.0,0.0,0.0,0.0,0.0,0.014706,0.014706,0.0,0.0,0.0,0.0,0.0,0.0,0.014706,0.0,0.0,0.0,0.0,0.014706
4,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
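
Each value in the table above is the share of a neighbourhood's venues that fall in that category. A small example with two made-up neighbourhoods shows how the one-hot columns turn into frequencies once they are averaged.

```python
import pandas as pd

# hypothetical venues: neighbourhood A has two cafés and a park, B has one park
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B"],
    "Venue Category": ["Café", "Café", "Park", "Park"],
})

# one indicator column per category (1.0 where the venue has that category)
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", toy["Neighborhood"])

# the mean of an indicator column is the fraction of venues in that category
freq = onehot.groupby("Neighborhood").mean().reset_index()
print(freq)
```

For neighbourhood A this gives 2/3 for Café and 1/3 for Park, so every row of frequencies sums to 1.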


In [103]:
# Displaying top venues in each neighbourhood
def top_venues(row, num_venues):
    row_cats = row.iloc[1:]
    row_cats_sorted = row_cats.sort_values(ascending=False)
    return row_cats_sorted.index.values[0:num_venues]



num_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
cols = ['Neighborhood']
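for i in np.arange(num_venues):
    try:
        # 1st, 2nd, 3rd come from the indicators list
        cols.append(f"{i+1}{indicators[i]} Most Common Venue")
    except IndexError:
        # positions beyond the list fall back to 'th'
        cols.append(f"{i+1}th Most Common Venue")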

# create a dataframe of 10 most common venues by neighborhood
all_common = pd.DataFrame(columns=cols)
all_common['Neighborhood'] = all_venue_onehot['Neighborhood']

for i in np.arange(all_venue_onehot.shape[0]):
    all_common.iloc[i, 1:] = top_venues(all_venue_onehot.iloc[i, :], num_venues)

all_common

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Berczy Park,Cocktail Bar,Bakery,Coffee Shop,Pharmacy,Farmers Market,Seafood Restaurant,Restaurant,Beer Bar,Cheese Shop,Pub
1,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Coffee Shop,Breakfast Spot,Performing Arts Venue,Restaurant,Stadium,Italian Restaurant,Intersection,Bar
2,"CN Tower, King and Spadina, Railway Lands, Har...",Airport Service,Airport Lounge,Airport,Bar,Coffee Shop,Rental Car Location,Sculpture Garden,Boutique,Boat or Ferry,Plane
3,Central Bay Street,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Bubble Tea Shop,Restaurant,Japanese Restaurant,Burger Joint,Salad Place,Spa
4,Christie,Grocery Store,Café,Park,Restaurant,Candy Store,Baby Store,Athletics & Sports,Nightclub,Italian Restaurant,Coffee Shop
5,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Pub,Men's Store,Hotel,Fast Food Restaurant,Mediterranean Restaurant
6,"Commerce Court, Victoria Hotel",Coffee Shop,Restaurant,Café,Hotel,Gym,Deli / Bodega,Seafood Restaurant,Japanese Restaurant,Bakery,American Restaurant
7,Davisville,Dessert Shop,Sandwich Place,Pizza Place,Thai Restaurant,Sushi Restaurant,Gym,Italian Restaurant,Café,Coffee Shop,Salon / Barbershop
8,Davisville North,Gym / Fitness Center,Park,Food & Drink Shop,Sandwich Place,Hotel,Department Store,Breakfast Spot,Gym,Playground,Pizza Place
9,"Dufferin, Dovercourt Village",Pharmacy,Bakery,Park,Art Gallery,Café,Middle Eastern Restaurant,Bar,Supermarket,Bank,Music Venue
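
The suffix logic used for the column names is fine for ten columns, but it would label a 21st column "21th". If more columns were ever needed, a small helper along the following lines could generate the labels instead (`ordinal` is a hypothetical function, not part of the notebook).

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_venues = 10
cols = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue"
                           for i in range(1, num_venues + 1)]
print(cols)
```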


Now, we move to the actual clustering analysis, using k-means with 5 clusters on the category frequencies.

In [104]:
kclusters = 5

# drop the label column so that only the numeric frequency features remain
all_common_cluster = all_venue_onehot.drop(columns='Neighborhood')

kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(all_common_cluster)

kmeans.labels_

array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 4, 4, 4, 4, 4, 1, 4, 3,
       4, 4, 4, 4, 0, 2, 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4], dtype=int32)
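
The labels above place 33 of the 39 neighbourhoods in cluster 4, so it may be worth checking how sensitive the result is to the choice of k. One common heuristic is to compare the k-means inertia (the within-cluster sum of squares) over several values of k and look for the point where it stops dropping sharply. The sketch below uses synthetic data with two well-separated groups, not the Toronto features, purely to show the mechanics.

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: two tight blobs around 0 and 1 in three dimensions
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.05, size=(20, 3)),
    rng.normal(1.0, 0.05, size=(20, 3)),
])

inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.3f}")

# cluster sizes for the k=2 solution
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```

On the real data, `np.bincount(kmeans.labels_)` would give the cluster sizes directly, and the loop could be run over `all_common_cluster` in place of `X`.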

In [105]:
# Creating a new dataframe containing 10 most common venues and their labels

all_common.insert(0, 'Cluster Label', kmeans.labels_)


all_merged = df.copy()

all_merged = all_merged.join(all_common.set_index('Neighborhood'), on='Neighbourhood(s)')

all_merged.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood(s),Latitude,Longitude,Cluster Label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,4,Pub,Health Food Store,Asian Restaurant,Trail,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
1,M4J,East York/East Toronto,The Danforth East,43.685347,-79.338106,1,Intersection,Convenience Store,Park,Yoga Studio,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store
2,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,4,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Café,Ice Cream Shop,Cosmetics Shop,Brewery,Bubble Tea Shop,Restaurant
3,M4L,East Toronto,"India Bazaar, The Beaches West",43.668999,-79.315572,4,Park,Sandwich Place,Pizza Place,Sushi Restaurant,Pub,Liquor Store,Burrito Place,Fast Food Restaurant,Fish & Chips Shop,Italian Restaurant
4,M4M,East Toronto,Studio District,43.659526,-79.340923,4,Coffee Shop,Bakery,Café,Brewery,American Restaurant,Gastropub,Middle Eastern Restaurant,Bar,Clothing Store,Stationery Store


Visualizing the clusters

In [106]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.gist_rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lng, nbhd, cluster in zip(all_merged['Latitude'],
                                   all_merged['Longitude'],
                                   all_merged['Neighbourhood(s)'],
                                   all_merged['Cluster Label']):
    label = folium.Popup(f"Cluster {cluster}: {nbhd}", parse_html=True)
    folium.CircleMarker([lat, lng],
                        radius=5,
                        popup=label,
                        color=rainbow[cluster-1],
                        fill=True,
                        fill_color=rainbow[cluster-1],
                        fill_opacity=0.5
                       ).add_to(map_clusters)
map_clusters

Examining each cluster and what makes it distinct

**Cluster 0**

In [120]:
cluster0 = all_merged.loc[all_merged['Cluster Label'] == 0,
                          all_merged.columns[[2] + list(range(6, all_merged.shape[1]))]
                         ].reset_index(drop=True)

print("Most common venue types: \n", cluster0['1st Most Common Venue'].value_counts())

cluster0

Most common venue types: 
 Jewelry Store    1
Park             1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Neighbourhood(s),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Rosedale,Park,Playground,Trail,Dessert Shop,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store,Eastern European Restaurant,Dumpling Restaurant
1,Forest Hill North & West,Jewelry Store,Sushi Restaurant,Park,Trail,Yoga Studio,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant,Electronics Store


**Cluster 1**

In [118]:
cluster1 = all_merged.loc[all_merged['Cluster Label'] == 1,
                          all_merged.columns[[2] + list(range(6, all_merged.shape[1]))]
                         ].reset_index(drop=True)

print("Most common venue types: \n", cluster1['1st Most Common Venue'].value_counts())

cluster1

Most common venue types: 
 Intersection    1
Park            1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Neighbourhood(s),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Danforth East,Intersection,Convenience Store,Park,Yoga Studio,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store
1,Lawrence Park,Park,Bus Line,Business Service,Swim School,Yoga Studio,Distribution Center,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room


**Cluster 2**

In [117]:
cluster2 = all_merged.loc[all_merged['Cluster Label'] == 2,
                          all_merged.columns[[2] + list(range(6, all_merged.shape[1]))]
                         ].reset_index(drop=True)

print("Most common venue types: \n", cluster2['1st Most Common Venue'].value_counts())
cluster2

Most common venue types: 
 Garden    1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Neighbourhood(s),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Roselawn,Garden,Home Service,Ice Cream Shop,Yoga Studio,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store


**Cluster 3**

In [116]:
cluster3 = all_merged.loc[all_merged['Cluster Label'] == 3,
                          all_merged.columns[[2] + list(range(6, all_merged.shape[1]))]
                         ].reset_index(drop=True)

print("Most common venue types: \n", cluster3['1st Most Common Venue'].value_counts())
cluster3

Most common venue types: 
 Trail    1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Neighbourhood(s),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Moore Park, Summerhill East",Trail,Summer Camp,Yoga Studio,Discount Store,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Escape Room,Electronics Store


**Cluster 4**

In [115]:
cluster4 = all_merged.loc[all_merged['Cluster Label'] == 4,
                          all_merged.columns[[2] + list(range(6, all_merged.shape[1]))]
                         ].reset_index(drop=True)

print("Most common venue types: \n", cluster4['1st Most Common Venue'].value_counts())
cluster4

Most common venue types: 
 Coffee Shop             15
Café                     5
Gym / Fitness Center     2
Bar                      1
Pharmacy                 1
Breakfast Spot           1
Pub                      1
Grocery Store            1
Park                     1
Greek Restaurant         1
Airport Service          1
Thai Restaurant          1
Dessert Shop             1
Cocktail Bar             1
Name: 1st Most Common Venue, dtype: int64


Unnamed: 0,Neighbourhood(s),1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,The Beaches,Pub,Health Food Store,Asian Restaurant,Trail,Eastern European Restaurant,Distribution Center,Dog Run,Doner Restaurant,Donut Shop,Dumpling Restaurant
1,"The Danforth West, Riverdale",Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Café,Ice Cream Shop,Cosmetics Shop,Brewery,Bubble Tea Shop,Restaurant
2,"India Bazaar, The Beaches West",Park,Sandwich Place,Pizza Place,Sushi Restaurant,Pub,Liquor Store,Burrito Place,Fast Food Restaurant,Fish & Chips Shop,Italian Restaurant
3,Studio District,Coffee Shop,Bakery,Café,Brewery,American Restaurant,Gastropub,Middle Eastern Restaurant,Bar,Clothing Store,Stationery Store
4,Davisville North,Gym / Fitness Center,Park,Food & Drink Shop,Sandwich Place,Hotel,Department Store,Breakfast Spot,Gym,Playground,Pizza Place
5,North Toronto West,Coffee Shop,Clothing Store,Sporting Goods Shop,Fast Food Restaurant,Mexican Restaurant,Diner,Cosmetics Shop,Park,Chinese Restaurant,Restaurant
6,Davisville,Dessert Shop,Sandwich Place,Pizza Place,Thai Restaurant,Sushi Restaurant,Gym,Italian Restaurant,Café,Coffee Shop,Salon / Barbershop
7,"Summerhill West, Rathnelly, South Hill, Forest...",Coffee Shop,Pizza Place,Pub,Liquor Store,Sandwich Place,Restaurant,Bank,Supermarket,Bagel Shop,Sushi Restaurant
8,"St. James Town, Cabbagetown",Coffee Shop,Pizza Place,Café,Park,Bakery,Italian Restaurant,Pub,Restaurant,General Entertainment,Butcher
9,Church and Wellesley,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Pub,Men's Store,Hotel,Fast Food Restaurant,Mediterranean Restaurant
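
The five cells above repeat the same selection for each label. A per-cluster summary can also be produced in one pass with a groupby, as in this sketch. The toy rows are invented, and only the two column names are taken from `all_merged`.

```python
import pandas as pd

# hypothetical cluster assignments and top venue categories
toy = pd.DataFrame({
    "Cluster Label": [0, 0, 4, 4, 4],
    "1st Most Common Venue": ["Park", "Park", "Coffee Shop", "Coffee Shop", "Café"],
})

# size of each cluster and its most frequent top venue category
summary = toy.groupby("Cluster Label")["1st Most Common Venue"].agg(
    size="size",
    top=lambda s: s.value_counts().idxmax(),
)
print(summary)
```

Applied to `all_merged`, the same call would show at a glance that cluster 4 is the large, coffee-shop-dominated group, while the remaining clusters contain only one or two neighbourhoods each.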


##### Thanks for reviewing!