# --------------------- PART 1 -------------------------

Gathering data from Wikipedia.

Importing libraries

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

Scraping the page with BeautifulSoup and displaying a first raw result

In [2]:
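from io import StringIO

res = requests.get("https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]  # first table on the page
# read_html returns a list of dataframes; wrapping the HTML in StringIO
# avoids the deprecation warning newer pandas raises for literal HTML strings
df = pd.read_html(StringIO(str(table)))[0]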
df.head(10)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
7,M7A,Downtown Toronto,Queen's Park
8,M8A,Not assigned,Not assigned
9,M9A,Queen's Park,Not assigned


Rows whose Neighbourhood is 'Not assigned' but whose Borough is known take the Borough value as their Neighbourhood.
Checks are run before and after to verify the result.

In [3]:
df[ (df['Neighbourhood'] == 'Not assigned') & (df['Borough'] != 'Not assigned')]

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Not assigned


In [4]:
df['Neighbourhood'] = np.where(df['Neighbourhood'] == 'Not assigned', df['Borough'], df['Neighbourhood'])

In [5]:
df[ (df['Neighbourhood'] == 'Not assigned') & (df['Borough'] != 'Not assigned')]

Unnamed: 0,Postcode,Borough,Neighbourhood


In [6]:
df.loc[df['Postcode'] == 'M9A']

Unnamed: 0,Postcode,Borough,Neighbourhood
9,M9A,Queen's Park,Queen's Park
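
The fill step can be checked on a small hand-made frame. This is a minimal sketch with made-up rows (the `toy` dataframe is an assumption for illustration, not the scraped table). It shows that `np.where` only changes rows where the neighbourhood is 'Not assigned'; rows whose borough is also unassigned keep 'Not assigned' and are removed in the next step.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the scraped table (illustration only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M9A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Where the neighbourhood is missing, fall back to the borough name.
toy["Neighbourhood"] = np.where(
    toy["Neighbourhood"] == "Not assigned",
    toy["Borough"],
    toy["Neighbourhood"],
)

print(toy)
```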


Replacing Borough 'Not assigned' with NaN, before dropping the corresponding rows

In [7]:
# assign the result back instead of calling replace(..., inplace=True) on a column selection
df['Borough'] = df['Borough'].replace("Not assigned", np.nan)

In [8]:
df.dropna(subset=["Borough"], axis=0, inplace = True)

Finally, grouping the rows by postcode and borough, concatenating the Neighbourhood values.

In [9]:
df = df.groupby(['Postcode','Borough'])['Neighbourhood'].apply(', '.join).reset_index()

In [10]:
df.head(15)

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


In [11]:
df.shape

(103, 3)
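
To see what the grouping does, here is a small sketch on three hand-written rows (the `toy` frame is an assumption for illustration). Postcode M6A appears twice, as in the raw table, and ends up as a single row with both neighbourhood names joined by a comma.

```python
import pandas as pd

# Hypothetical rows: one postcode listed twice, as in the raw table.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M7A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Queen's Park"],
})

# One row per (Postcode, Borough); neighbourhood names joined with ", ".
grouped = (
    toy.groupby(["Postcode", "Borough"])["Neighbourhood"]
    .agg(", ".join)
    .reset_index()
)

print(grouped)
```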

# --------------------- PART 2 -------------------------

Downloading the latitude/longitude file and loading it into a dataframe

In [12]:
import wget
filename = wget.download('http://cocl.us/Geospatial_data')

In [13]:
df_coord = pd.read_csv(filename)

In [14]:
df_coord.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


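Merging with the previous dataframe. `pd.concat` with `axis=1` aligns rows on the index, not on the postal code, so this step relies on both dataframes being sorted by postal code and sharing the same default index.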

In [15]:
df2 = pd.concat([df, df_coord], axis=1, join='inner')

In [16]:
df2.shape

(103, 6)

In [17]:
df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",M1B,43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",M1C,43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",M1E,43.763573,-79.188711
3,M1G,Scarborough,Woburn,M1G,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,M1H,43.773136,-79.239476
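
An alternative that does not depend on row order is to join on the postal code itself. The sketch below uses two small made-up frames (`left` and `right` are assumptions for illustration) whose rows are deliberately in different orders. `DataFrame.merge` with `left_on`/`right_on` keeps both key columns, so the result has the same column layout as `df2`.

```python
import pandas as pd

# Hypothetical frames whose rows are deliberately in different orders.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C"],
    "Borough": ["Scarborough", "Scarborough"],
})
right = pd.DataFrame({
    "Postal Code": ["M1C", "M1B"],
    "Latitude": [43.784535, 43.806686],
    "Longitude": [-79.160497, -79.194353],
})

# Match rows on the postal code itself, not on row position.
merged = left.merge(right, left_on="Postcode", right_on="Postal Code", how="inner")

print(merged)
```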


# --------------------- PART 3 -------------------------

Section A follows the analysis performed in the lab.

In Section B we first explore alternative clusterisations.
We then focus on a dedicated clusterisation based on restaurant types, which requires several adjustments to the dataframes and functions used.

This section uses the Foursquare API. Please enter your credentials here:

In [18]:
CLIENT_ID = 'Your Client ID' # your Foursquare ID
CLIENT_SECRET = 'Your Client Secret' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30

## A - Building dataframes and doing "default" clusterisation and exploration

This section reuses the content of the lab performed on Manhattan.

### A-1 Preparation

Additional imports for analysis and visualisation

In [19]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium

#!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim

# libraries for displaying images
from IPython.display import Image
from IPython.display import HTML

# transforming a JSON file into a pandas dataframe
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

Initialising the map centred on Toronto. The coordinates are hard-coded because the geolocator was timing out.

In [20]:
#address = 'Toronto, ON'
#geolocator = Nominatim(user_agent="toronto_explorer")
#location = geolocator.geocode(address)
#latitude = location.latitude
#longitude = location.longitude

latitude = 43.6532
longitude = -79.3832

map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)
map_toronto
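
Rather than commenting the geolocator out entirely, one option is to retry a few times and only then fall back to fixed coordinates. The helper below is a hypothetical sketch, not part of geopy: `geocode_with_fallback` accepts any callable that returns a `(latitude, longitude)` pair, and the failing stand-in function only simulates a timeout.

```python
import time

def geocode_with_fallback(geocode, address, fallback, retries=3, delay=0.0):
    """Try geocode(address) up to `retries` times, then return `fallback`.

    `geocode` is any callable returning a (latitude, longitude) pair;
    this helper is a hypothetical illustration, not part of geopy.
    """
    for _ in range(retries):
        try:
            return geocode(address)
        except Exception:  # e.g. a timeout raised by the geocoding service
            time.sleep(delay)
    return fallback


def always_times_out(address):
    # Stand-in for a service that never answers in time.
    raise TimeoutError("geocoder timed out")


latitude, longitude = geocode_with_fallback(
    always_times_out, "Toronto, ON", fallback=(43.6532, -79.3832)
)
print(latitude, longitude)
```

With the real geolocator, the callable passed in would wrap `geolocator.geocode` and return `location.latitude` and `location.longitude`.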

Selecting the subset of boroughs whose name contains the word Toronto

In [21]:
toronto_data = df2[df2['Borough'].str.find('Toronto') >= 0].reset_index(drop=True)
toronto_data.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude
0,M4E,East Toronto,The Beaches,M4E,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",M4K,43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",M4L,43.668999,-79.315572
3,M4M,East Toronto,Studio District,M4M,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,M4N,43.72802,-79.38879


Adding a marker for each neighbourhood to the map

In [22]:
map_toronto_focus = folium.Map(location=[latitude, longitude], zoom_start=11)

for lat, lng, borough, neighborhood in zip(toronto_data['Latitude'], toronto_data['Longitude'], toronto_data['Borough'], toronto_data['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto_focus) 
    
map_toronto_focus

#### Connection to Foursquare

Using Foursquare credentials

#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [23]:
toronto_data.loc[0, 'Neighbourhood']

'The Beaches'

Get the neighborhood's latitude and longitude values.

In [24]:
neighborhood_latitude = toronto_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = toronto_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = toronto_data.loc[0, 'Neighbourhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of The Beaches are 43.67635739999999, -79.2930312.


#### Now, let's get the top 100 venues that are in 'The Beaches' within a radius of 500 meters.

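First, let's create the GET request URL and name it **url**. Note that the cell below passes `latitude` and `longitude` (the Toronto centre coordinates defined earlier) rather than `neighborhood_latitude` and `neighborhood_longitude`, so the venues returned are those around downtown Toronto, not The Beaches.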

In [25]:
LIMIT = 100
radius = 500

url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>&ll=43.6532,-79.3832&v=20180604&radius=500&limit=100'
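
Building the query string from a dictionary makes it easier to see which coordinates go into the request. The sketch below is one possible way to do it with the standard library; `build_explore_url` is a hypothetical helper, the credentials are placeholders, and the coordinates are the values printed earlier for The Beaches. Note that `urlencode` percent-encodes the comma in `ll` and turns spaces into `+`.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng, version, radius, limit):
    """Build the venues/explore URL from keyword parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials; coordinates are the 'The Beaches' values printed earlier.
url_beaches = build_explore_url(
    "Your Client ID", "Your Client Secret",
    43.67635739999999, -79.2930312,
    version="20180604", radius=500, limit=100,
)
print(url_beaches)
```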

Send the GET request and examine the results

In [26]:
results = requests.get(url).json()
#results   # uncomment to see the actual results.

From the Foursquare lab, we know that all the information is in the *items* key. Before we proceed, let's borrow the **get_category_type** function from the Foursquare lab.

In [27]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # the flattened frame uses the 'venue.categories' column name instead
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a *pandas* dataframe.

In [28]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Downtown Toronto,Neighborhood,43.653232,-79.385296
1,Nathan Phillips Square,Plaza,43.65227,-79.383516
2,Indigo,Bookstore,43.653515,-79.380696
3,Eggspectation Bell Trinity Square,Breakfast Spot,43.653144,-79.38198
4,CF Toronto Eaton Centre,Shopping Mall,43.653594,-79.380611
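
The flattening step is easier to follow on a tiny payload. The sketch below uses two made-up venues shaped like the `items` list (the names and coordinates are assumptions for illustration). It shows how `json_normalize` turns nested keys into dotted column names and how an empty category list can be handled.

```python
import pandas as pd

# Hypothetical payload shaped like the `items` list used above (two made-up venues).
items = [
    {"venue": {"name": "Venue A",
               "categories": [{"name": "Plaza"}],
               "location": {"lat": 43.65, "lng": -79.38}}},
    {"venue": {"name": "Venue B",
               "categories": [],
               "location": {"lat": 43.66, "lng": -79.39}}},
]

flat = pd.json_normalize(items)  # nested keys become dotted column names
flat = flat[["venue.name", "venue.categories",
             "venue.location.lat", "venue.location.lng"]].copy()

# Keep the first category name, or None when the list is empty.
flat["venue.categories"] = flat["venue.categories"].apply(
    lambda cats: cats[0]["name"] if cats else None
)
flat.columns = [col.split(".")[-1] for col in flat.columns]

print(flat)
```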


In [29]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

100 venues were returned by Foursquare.


<a id='item2'></a>

### A-2. Explore Neighborhoods in Toronto

#### Let's create a function to repeat the same process to all the neighborhoods in Toronto

In [30]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
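
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which raises an `IndexError` for a venue that has no category. Below is a sketch of a more defensive extraction step; `extract_venue_rows` is a hypothetical helper and the sample items are made up for illustration.

```python
def extract_venue_rows(name, lat, lng, items):
    """Turn Foursquare-style `items` into flat tuples (hypothetical helper).

    Venues without any category get None instead of raising IndexError.
    """
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Made-up items for illustration: the second venue has no category.
sample_items = [
    {"venue": {"name": "Venue A", "location": {"lat": 43.67, "lng": -79.29},
               "categories": [{"name": "Trail"}]}},
    {"venue": {"name": "Venue B", "location": {"lat": 43.68, "lng": -79.30},
               "categories": []}},
]
rows = extract_venue_rows("The Beaches", 43.676357, -79.293031, sample_items)
print(rows)
```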

Creating a new dataframe called *toronto_venues*.

In [31]:
toronto_venues = getNearbyVenues(names=toronto_data['Neighbourhood'],
                                   latitudes=toronto_data['Latitude'],
                                   longitudes=toronto_data['Longitude']
                                  )

The Beaches
The Danforth West, Riverdale
The Beaches West, India Bazaar
Studio District
Lawrence Park
Davisville North
North Toronto West
Davisville
Moore Park, Summerhill East
Deer Park, Forest Hill SE, Rathnelly, South Hill, Summerhill West
Rosedale
Cabbagetown, St. James Town
Church and Wellesley
Harbourfront
Ryerson, Garden District
St. James Town
Berczy Park
Central Bay Street
Adelaide, King, Richmond
Harbourfront East, Toronto Islands, Union Station
Design Exchange, Toronto Dominion Centre
Commerce Court, Victoria Hotel
Roselawn
Forest Hill North, Forest Hill West
The Annex, North Midtown, Yorkville
Harbord, University of Toronto
Chinatown, Grange Park, Kensington Market
CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara
Stn A PO Boxes 25 The Esplanade
First Canadian Place, Underground city
Christie
Dovercourt Village, Dufferin
Little Portugal, Trinity
Brockton, Exhibition Place, Parkdale Village
High Park, The Junction Sout

#### Let's check the size of the resulting dataframe

In [32]:
print(toronto_venues.shape)
toronto_venues.head()

(1720, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,The Beaches,43.676357,-79.293031,Glen Manor Ravine,43.676821,-79.293942,Trail
1,The Beaches,43.676357,-79.293031,The Big Carrot Natural Food Market,43.678879,-79.297734,Health Food Store
2,The Beaches,43.676357,-79.293031,Grover Pub and Grub,43.679181,-79.297215,Pub
3,The Beaches,43.676357,-79.293031,Upper Beaches,43.680563,-79.292869,Neighborhood
4,"The Danforth West, Riverdale",43.679557,-79.352188,Pantheon,43.677621,-79.351434,Greek Restaurant


Let's check how many venues were returned for each neighborhood

In [33]:
toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
"Adelaide, King, Richmond",100,100,100,100,100,100
Berczy Park,58,58,58,58,58,58
"Brockton, Exhibition Place, Parkdale Village",21,21,21,21,21,21
Business Reply Mail Processing Centre 969 Eastern,14,14,14,14,14,14
"CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara",17,17,17,17,17,17
"Cabbagetown, St. James Town",47,47,47,47,47,47
Central Bay Street,88,88,88,88,88,88
"Chinatown, Grange Park, Kensington Market",86,86,86,86,86,86
Christie,18,18,18,18,18,18
Church and Wellesley,85,85,85,85,85,85


#### Let's find out how many unique categories can be curated from all the returned venues

In [34]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 232 unique categories.


<a id='item3'></a>

### A-3. Analyze Each Neighborhood

In [35]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot['Neighbourhood'] = toronto_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [36]:
toronto_onehot.shape

(1720, 233)

#### Next, let's group rows by neighborhood, taking the mean of the frequency of occurrence of each category

In [37]:
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()
toronto_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,...,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.021277,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.0,0.0,0.0,0.0,0.0,0.0,0.011364,0.0,0.0,...,0.0,0.0,0.0,0.011364,0.0,0.0,0.011364,0.0,0.0,0.011364
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.046512,0.0,0.05814,0.011628,0.0,0.0,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.011765,0.0,0.0,0.0,0.0,0.0,0.011765,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.011765,0.0,0.011765,0.0,0.011765


#### Let's confirm the new size

In [38]:
toronto_grouped.shape

(39, 233)
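
The one-hot encoding followed by a grouped mean can be checked by hand on a tiny example. In this sketch (the `toy_venues` frame is made up for illustration), neighbourhood A has two coffee shops and one park, so the means come out as 2/3 and 1/3: the mean of the indicator columns is simply the share of venues in each category.

```python
import pandas as pd

# Hypothetical venues: three in one neighbourhood, one in another.
toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Park"],
})

# One indicator column per category, then re-attach the neighbourhood.
onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# The mean of the indicators is the share of venues in each category.
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```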

#### Let's print each neighborhood along with the top 5 most common venues

In [39]:
num_top_venues = 5

for hood in toronto_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Adelaide, King, Richmond----
         venue  freq
0  Coffee Shop  0.07
1         Café  0.04
2   Steakhouse  0.04
3          Bar  0.04
4        Hotel  0.03


----Berczy Park----
            venue  freq
0     Coffee Shop  0.09
1    Cocktail Bar  0.05
2  Farmers Market  0.03
3      Steakhouse  0.03
4        Beer Bar  0.03


----Brockton, Exhibition Place, Parkdale Village----
            venue  freq
0     Coffee Shop  0.10
1  Breakfast Spot  0.10
2            Café  0.10
3      Restaurant  0.05
4         Stadium  0.05


----Business Reply Mail Processing Centre 969 Eastern----
                  venue  freq
0           Yoga Studio  0.07
1         Auto Workshop  0.07
2         Garden Center  0.07
3    Light Rail Station  0.07
4  Fast Food Restaurant  0.07


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                 venue  freq
0      Airport Service  0.18
1       Airport Lounge  0.12
2     Airport Terminal  0.12
3  

#### Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [40]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [41]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions beyond the third use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Coffee Shop,Bar,Café,Steakhouse,Restaurant,Breakfast Spot,Thai Restaurant,Hotel,Asian Restaurant,Sushi Restaurant
1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Café,Farmers Market,Beer Bar,Steakhouse,Seafood Restaurant,Greek Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Coffee Shop,Breakfast Spot,Café,Gym,Stadium,Burrito Place,Restaurant,Climbing Gym,Pet Store,Bakery
3,Business Reply Mail Processing Centre 969 Eastern,Yoga Studio,Auto Workshop,Pizza Place,Restaurant,Burrito Place,Brewery,Light Rail Station,Skate Park,Smoke Shop,Spa
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Airport Service,Airport Lounge,Airport Terminal,Boutique,Plane,Airport,Airport Food Court,Sculpture Garden,Bar,Harbor / Marina
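
The `indicators` list works for ten columns, but it would label position 21 as "21th" if `num_top_venues` were ever raised. A small helper can cover the general case. This is a sketch; `ordinal` is a hypothetical function name, not something from the lab.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"          # 11th, 12th, 13th, ...
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 10
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```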


<a id='item4'></a>

### A-4. Cluster Neighborhoods - Default set-up (5 clusters)

Run *k*-means to cluster the neighborhood into 5 clusters.

In [42]:
# set number of clusters
kclusters = 5

# drop the label column by keyword; the positional axis argument was removed in pandas 2.0
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
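
The first ten labels are all 0, which hints at one dominant cluster. A quick way to see the full picture is to count the rows per label. The sketch below uses a made-up `labels` array as a stand-in for `kmeans.labels_`, so the counts are illustrative only.

```python
import numpy as np

# Made-up labels standing in for `kmeans.labels_` (illustration only).
labels = np.array([0, 0, 0, 3, 0, 1, 0, 4, 0, 2])

# Count how many rows fall in each cluster.
cluster_ids, sizes = np.unique(labels, return_counts=True)
for cluster_id, size in zip(cluster_ids, sizes):
    print("cluster {}: {} rows".format(cluster_id, size))
```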

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [43]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Postcode,Borough,Neighbourhood,Postal Code,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M4E,East Toronto,The Beaches,M4E,43.676357,-79.293031,0,Trail,Neighborhood,Pub,Health Food Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio
1,M4K,East Toronto,"The Danforth West, Riverdale",M4K,43.679557,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Ice Cream Shop,Yoga Studio,Sports Bar,Spa,Bookstore,Juice Bar
2,M4L,East Toronto,"The Beaches West, India Bazaar",M4L,43.668999,-79.315572,0,Sandwich Place,Pizza Place,Gym,Fast Food Restaurant,Ice Cream Shop,Sushi Restaurant,Fish & Chips Shop,Brewery,Food & Drink Shop,Pub
3,M4M,East Toronto,Studio District,M4M,43.659526,-79.340923,0,Café,Coffee Shop,Brewery,Gastropub,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Neighborhood,Sandwich Place
4,M4N,Central Toronto,Lawrence Park,M4N,43.72802,-79.38879,3,Park,Lawyer,Swim School,Bus Line,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


Finally, let's visualize the resulting clusters

In [44]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

### A-5. Examine Clusters - default set-up

Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, you can then assign a name to each cluster. I will leave this exercise to you.

#### Cluster 0

In [45]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,-79.293031,0,Trail,Neighborhood,Pub,Health Food Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio
1,East Toronto,-79.352188,0,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Ice Cream Shop,Yoga Studio,Sports Bar,Spa,Bookstore,Juice Bar
2,East Toronto,-79.315572,0,Sandwich Place,Pizza Place,Gym,Fast Food Restaurant,Ice Cream Shop,Sushi Restaurant,Fish & Chips Shop,Brewery,Food & Drink Shop,Pub
3,East Toronto,-79.340923,0,Café,Coffee Shop,Brewery,Gastropub,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Neighborhood,Sandwich Place
6,Central Toronto,-79.405678,0,Sporting Goods Shop,Coffee Shop,Yoga Studio,Dessert Shop,Spa,Fast Food Restaurant,Salon / Barbershop,Diner,Restaurant,Ice Cream Shop
7,Central Toronto,-79.38879,0,Sandwich Place,Pizza Place,Dessert Shop,Gym,Coffee Shop,Café,Italian Restaurant,Sushi Restaurant,Brewery,Restaurant
9,Central Toronto,-79.400049,0,Coffee Shop,Pub,Sushi Restaurant,Fried Chicken Joint,Vietnamese Restaurant,Light Rail Station,Restaurant,American Restaurant,Pizza Place,Liquor Store
11,Downtown Toronto,-79.367675,0,Coffee Shop,Restaurant,Café,Bakery,Market,Italian Restaurant,Pub,Pizza Place,Playground,Beer Store
12,Downtown Toronto,-79.38316,0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Gym,Pub,Men's Store,Mediterranean Restaurant,Hotel
13,Downtown Toronto,-79.360636,0,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Restaurant,Mexican Restaurant,Performing Arts Venue,Shoe Store


Coffee shops and cafés seem to be the most common venues in this cluster.

#### Cluster 1

In [46]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Central Toronto,-79.416936,1,Garden,Yoga Studio,Dim Sum Restaurant,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


#### Cluster 2

In [47]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,Central Toronto,-79.390197,2,Hotel,Gym,Convenience Store,Department Store,Sandwich Place,Breakfast Spot,Food & Drink Shop,Park,Gastropub,Dessert Shop


#### Cluster 3

In [48]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Central Toronto,-79.38879,3,Park,Lawyer,Swim School,Bus Line,Farmers Market,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
23,Central Toronto,-79.411307,3,Park,Jewelry Store,Trail,Sushi Restaurant,Bus Line,Yoga Studio,Discount Store,Falafel Restaurant,Event Space,Ethiopian Restaurant


#### Cluster 4

In [49]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Central Toronto,-79.38316,4,Park,Playground,Trail,Summer Camp,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
10,Downtown Toronto,-79.377529,4,Park,Playground,Trail,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


Clusters 3 and 4 are small additional clusters, dominated by parks (cluster 3) and by parks and playgrounds (cluster 4).

## B. Extending analysis

This part extends the analysis beyond what was done in the previous lab.

### B-1 Exploring clusterisation alternatives

As the previous clusterisation does not reveal much structure, in this part we loop over k and several random states to see whether other clusterisations are more interesting.

In [50]:
type(kmeans.labels_)

numpy.ndarray

In [51]:
for k in range(3,8):
    for i in range(4):
        kclusters = k

        kmeans = KMeans(n_clusters=kclusters, random_state=i).fit(toronto_grouped_clustering)
        Cluster_labels = pd.DataFrame()
        Cluster_labels.insert(0, 'Cluster Labels', kmeans.labels_)

        neighborhoods_venues_sorted['Cluster Labels'] = Cluster_labels['Cluster Labels']

        toronto_merged = toronto_data

        # merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
        toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

        toronto_merged_grouped = toronto_merged[['Cluster Labels','Neighbourhood']].groupby(['Cluster Labels']).count()
        print('With',k,'labels and random_state = ',i,':')
        display(toronto_merged_grouped)

With 3 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,34
1,1
2,4


With 3 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,4
1,1
2,34


With 3 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,34
1,1
2,4


With 3 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,34
1,1
2,4


With 4 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,2
1,33
2,1
3,3


With 4 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,33
1,1
2,1
3,4


With 4 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,1
1,35
2,2
3,1


With 4 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,34
1,1
2,2
3,2


With 5 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,33
1,1
2,1
3,2
4,2


With 5 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,2
1,33
2,1
3,2
4,1


With 5 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,2
1,34
2,1
3,1
4,1


With 5 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,33
1,1
2,2
3,1
4,2


With 6 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,2
1,32
2,2
3,1
4,1
5,1


With 6 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,3
1,8
2,24
3,1
4,2
5,1


With 6 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,3
1,1
2,1
3,31
4,2
5,1


With 6 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,1
1,32
2,1
3,2
4,2
5,1


With 7 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,32
1,1
2,1
3,1
4,2
5,1
6,1


With 7 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,10
1,1
2,22
3,1
4,2
5,2
6,1


With 7 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,31
1,1
2,1
3,2
4,1
5,2
6,1


With 7 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,31
1,1
2,2
3,2
4,1
5,1
6,1


#### At this stage, to keep a reasonable number of clusters, I choose 6 labels with random_state = 1

In [52]:
kclusters = 6  # reused below to size the colour palette of the map
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(toronto_grouped_clustering)
Cluster_labels = pd.DataFrame()
Cluster_labels.insert(0, 'Cluster Labels', kmeans.labels_)

neighborhoods_venues_sorted['Cluster Labels'] = Cluster_labels['Cluster Labels']

toronto_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_merged_grouped = toronto_merged[['Cluster Labels','Neighbourhood']].groupby(['Cluster Labels']).count()
display(toronto_merged_grouped)

Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,3
1,8
2,24
3,1
4,2
5,1
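
Cluster label numbers are arbitrary: for k = 3 above, random_state 0 and 1 give sizes 34/1/4 and 4/1/34, which are the same sizes with the labels permuted. A minimal sketch, on synthetic data standing in for `toronto_grouped_clustering` (the array `X` and the helper `sorted_cluster_sizes` are hypothetical names, not part of this notebook), of comparing runs by their sorted cluster sizes so that a mere relabelling does not look like a different result:

```python
import numpy as np
from sklearn.cluster import KMeans

def sorted_cluster_sizes(X, k, seed):
    """Fit KMeans and return the cluster sizes, largest first (independent of label numbers)."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X).labels_
    return sorted(np.bincount(labels, minlength=k).tolist(), reverse=True)

# Synthetic stand-in for toronto_grouped_clustering: 39 rows, 5 features (assumption)
rng = np.random.default_rng(0)
X = rng.random((39, 5))

# One entry per (k, random_state) set-up, as in the loop above
sizes = {(k, seed): sorted_cluster_sizes(X, k, seed)
         for k in range(3, 8) for seed in range(4)}
for (k, seed), s in sizes.items():
    print(f"k={k}, random_state={seed}: {s}")
```

Two set-ups with the same sorted sizes are not guaranteed to be the same partition, but differing sizes do indicate a genuinely different clustering.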


In [53]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighbourhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### B-2 Examine Clusters with customised clusterisation

We can now examine each cluster and determine the venue categories that distinguish it. Based on these defining categories, a name can then be assigned to each cluster.

#### Cluster 2 (main cluster)

In [54]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,East Toronto,-79.352188,2,Greek Restaurant,Coffee Shop,Italian Restaurant,Furniture / Home Store,Ice Cream Shop,Yoga Studio,Sports Bar,Spa,Bookstore,Juice Bar
3,East Toronto,-79.340923,2,Café,Coffee Shop,Brewery,Gastropub,Bakery,Italian Restaurant,American Restaurant,Yoga Studio,Neighborhood,Sandwich Place
6,Central Toronto,-79.405678,2,Sporting Goods Shop,Coffee Shop,Yoga Studio,Dessert Shop,Spa,Fast Food Restaurant,Salon / Barbershop,Diner,Restaurant,Ice Cream Shop
7,Central Toronto,-79.38879,2,Sandwich Place,Pizza Place,Dessert Shop,Gym,Coffee Shop,Café,Italian Restaurant,Sushi Restaurant,Brewery,Restaurant
9,Central Toronto,-79.400049,2,Coffee Shop,Pub,Sushi Restaurant,Fried Chicken Joint,Vietnamese Restaurant,Light Rail Station,Restaurant,American Restaurant,Pizza Place,Liquor Store
11,Downtown Toronto,-79.367675,2,Coffee Shop,Restaurant,Café,Bakery,Market,Italian Restaurant,Pub,Pizza Place,Playground,Beer Store
12,Downtown Toronto,-79.38316,2,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Gym,Pub,Men's Store,Mediterranean Restaurant,Hotel
13,Downtown Toronto,-79.360636,2,Coffee Shop,Park,Bakery,Pub,Breakfast Spot,Café,Restaurant,Mexican Restaurant,Performing Arts Venue,Shoe Store
14,Downtown Toronto,-79.378937,2,Coffee Shop,Clothing Store,Café,Cosmetics Shop,Middle Eastern Restaurant,Bakery,Fast Food Restaurant,Japanese Restaurant,Restaurant,Pizza Place
15,Downtown Toronto,-79.375418,2,Coffee Shop,Café,Restaurant,Italian Restaurant,Cocktail Bar,Hotel,Breakfast Spot,Clothing Store,Beer Bar,Cosmetics Shop


Coffee shops and cafés seem to be the most common venues in this cluster, as the "default" clustering already highlighted.

#### Cluster 1

In [55]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,East Toronto,-79.293031,1,Trail,Neighborhood,Pub,Health Food Store,Dumpling Restaurant,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio
25,Downtown Toronto,-79.400049,1,Café,Restaurant,Bookstore,Japanese Restaurant,Bar,Bakery,Sandwich Place,Gym,Dessert Shop,Italian Restaurant
26,Downtown Toronto,-79.400049,1,Café,Chinese Restaurant,Vietnamese Restaurant,Coffee Shop,Vegetarian / Vegan Restaurant,Dumpling Restaurant,Mexican Restaurant,Bakery,Bar,Cocktail Bar
30,Downtown Toronto,-79.422564,1,Grocery Store,Café,Park,Coffee Shop,Nightclub,Italian Restaurant,Candy Store,Baby Store,Athletics & Sports,Diner
31,West Toronto,-79.442259,1,Pharmacy,Bakery,Middle Eastern Restaurant,Smoke Shop,Café,Bar,Supermarket,Bank,Music Venue,Brewery
32,West Toronto,-79.41975,1,Bar,Men's Store,Coffee Shop,Restaurant,Asian Restaurant,Vietnamese Restaurant,Pizza Place,Café,Yoga Studio,Bistro
34,West Toronto,-79.464763,1,Café,Mexican Restaurant,Bar,Thai Restaurant,Diner,Italian Restaurant,Flea Market,Fried Chicken Joint,Speakeasy,Cajun / Creole Restaurant
38,East Toronto,-79.321558,1,Yoga Studio,Auto Workshop,Pizza Place,Restaurant,Burrito Place,Brewery,Light Rail Station,Skate Park,Smoke Shop,Spa


Unlike the previous group, Coffee Shop is never the most common venue here. The rest is quite diverse, perhaps with a higher presence of bars than in the first group, but I cannot identify a clear pattern.

#### Cluster 0

In [56]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,East Toronto,-79.315572,0,Sandwich Place,Pizza Place,Gym,Fast Food Restaurant,Ice Cream Shop,Sushi Restaurant,Fish & Chips Shop,Brewery,Food & Drink Shop,Pub
5,Central Toronto,-79.390197,0,Hotel,Gym,Convenience Store,Department Store,Sandwich Place,Breakfast Spot,Food & Drink Shop,Park,Gastropub,Dessert Shop
23,Central Toronto,-79.411307,0,Park,Jewelry Store,Trail,Sushi Restaurant,Bus Line,Yoga Studio,Discount Store,Falafel Restaurant,Event Space,Ethiopian Restaurant


#### Cluster 4

In [57]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[1] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Borough,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Central Toronto,-79.38316,4,Park,Playground,Trail,Summer Camp,Dim Sum Restaurant,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant
10,Downtown Toronto,-79.377529,4,Park,Playground,Trail,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


The similarity is quite clear: these are the park, playground and trail areas.
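
To complement the visual inspection above, one could count how often each category appears as the top venue within each cluster. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration; with the real data the same call would be applied to `toronto_merged`):

```python
import pandas as pd

# Toy stand-in for toronto_merged (values are illustrative only)
toy = pd.DataFrame({
    'Cluster Labels': [2, 2, 2, 1, 1, 4, 4],
    '1st Most Common Venue': ['Coffee Shop', 'Coffee Shop', 'Café',
                              'Café', 'Bar', 'Park', 'Park'],
})

# For each cluster, count how often each category is the top venue
top_counts = (toy.groupby('Cluster Labels')['1st Most Common Venue']
                 .value_counts()
                 .rename('count')
                 .reset_index())
print(top_counts)
```

A cluster dominated by one or two categories in this table is easy to name; a flat distribution suggests the cluster has no obvious theme.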

### B-3 Focusing on Restaurants

The idea in this section is to cluster areas according to their restaurant types.

Filtering the initial dataframe to keep only the restaurant rows:

In [58]:
#toronto_venues_custom = toronto_venues[ (toronto_venues['Venue Category'] != 'Coffee Shop')  & (toronto_venues['Venue Category'] != 'Café')]
toronto_venues_custom = toronto_venues[ toronto_venues['Venue Category'].str.find('Restaurant') > 0 ] # > 0 rather than >= 0, to exclude the generic "Restaurant" category
toronto_venues_custom.shape # previous shape was (1720, 7)

(347, 7)
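
The `> 0` comparison works because `str.find` returns the position of the match: `-1` when the word is absent and `0` when the category starts with it (the plain "Restaurant" category). A small sketch on made-up categories, which also shows a more explicit filter that gives the same result as long as no other category starts with "Restaurant" (an alternative for illustration, not the code used in this notebook):

```python
import pandas as pd

cats = pd.Series(['Coffee Shop', 'Restaurant', 'Italian Restaurant',
                  'Sushi Restaurant', 'Park'])

# -1: word absent, 0: category starts with it, > 0: typed restaurant
pos = cats.str.find('Restaurant')
typed = cats[pos > 0]

# More explicit filter: contains the word but is not the bare category
explicit = cats[cats.str.contains('Restaurant') & (cats != 'Restaurant')]

print(pos.tolist())
print(typed.tolist())
```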

In [59]:
# one hot encoding
toronto_onehot_custom = pd.get_dummies(toronto_venues_custom[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_onehot_custom['Neighbourhood'] = toronto_venues_custom['Neighbourhood'] 

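# move neighborhood column to the first column
# NOTE: the columns are selected from toronto_onehot (all venues), not from toronto_onehot_custom,
# so every venue row is kept: the frequencies computed below are relative to all venues of a
# neighbourhood, and neighbourhoods without restaurants show up as all-zero rows
fixed_columns_custom = [toronto_onehot_custom.columns[-1]] + list(toronto_onehot_custom.columns[:-1])
toronto_onehot_custom = toronto_onehot[fixed_columns_custom]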

toronto_onehot_custom.head()

Unnamed: 0,Neighbourhood,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,...,Portuguese Restaurant,Ramen Restaurant,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,The Beaches,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"The Danforth West, Riverdale",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Grouping, as previously:

In [60]:
toronto_custom_grouped = toronto_onehot_custom.groupby('Neighbourhood').mean().reset_index()
toronto_custom_grouped

Unnamed: 0,Neighbourhood,Afghan Restaurant,American Restaurant,Asian Restaurant,Belgian Restaurant,Brazilian Restaurant,Cajun / Creole Restaurant,Caribbean Restaurant,Chinese Restaurant,Colombian Restaurant,...,Portuguese Restaurant,Ramen Restaurant,Seafood Restaurant,Southern / Soul Food Restaurant,Sushi Restaurant,Taiwanese Restaurant,Thai Restaurant,Theme Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant
0,"Adelaide, King, Richmond",0.0,0.02,0.03,0.0,0.01,0.0,0.0,0.0,0.01,...,0.0,0.01,0.02,0.0,0.02,0.0,0.03,0.0,0.02,0.0
1,Berczy Park,0.0,0.0,0.0,0.017241,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.034483,0.0,0.0,0.0,0.017241,0.0,0.017241,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown, St. James Town",0.0,0.021277,0.0,0.0,0.0,0.0,0.021277,0.021277,0.0,...,0.0,0.0,0.0,0.0,0.0,0.021277,0.021277,0.0,0.0,0.0
6,Central Bay Street,0.0,0.011364,0.0,0.0,0.0,0.0,0.0,0.034091,0.0,...,0.011364,0.011364,0.011364,0.0,0.011364,0.0,0.011364,0.0,0.011364,0.0
7,"Chinatown, Grange Park, Kensington Market",0.0,0.0,0.0,0.011628,0.0,0.0,0.011628,0.05814,0.0,...,0.0,0.011628,0.0,0.0,0.0,0.0,0.011628,0.0,0.046512,0.05814
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.011765,0.011765,0.0,0.0,0.0,0.0,0.011765,0.011765,0.0,...,0.0,0.011765,0.011765,0.0,0.047059,0.0,0.011765,0.011765,0.0,0.011765
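
Note that the frequencies above are small (for example 0.02 for American Restaurant in "Adelaide, King, Richmond") and that some rows are entirely zero: since the one-hot frame was re-selected from `toronto_onehot`, the mean is taken over all venues of a neighbourhood, not only over its restaurants. A toy sketch of the two normalisations (the helper `grouped_freq` and all values below are illustrative assumptions):

```python
import pandas as pd

# Toy venue list: hood A has 4 venues (2 restaurants), hood B has 2 venues (no restaurant)
venues = pd.DataFrame({
    'Neighbourhood':  ['A', 'A', 'A', 'A', 'B', 'B'],
    'Venue Category': ['Thai Restaurant', 'Sushi Restaurant', 'Park', 'Café',
                       'Park', 'Trail'],
})
is_restaurant = venues['Venue Category'].str.find('Restaurant') > 0

def grouped_freq(frame, columns=None):
    """One-hot encode the categories and average them per neighbourhood."""
    onehot = pd.get_dummies(frame['Venue Category'], dtype=float)
    if columns is not None:
        onehot = onehot[columns]
    return onehot.groupby(frame['Neighbourhood']).mean()

restaurant_cols = sorted(venues.loc[is_restaurant, 'Venue Category'].unique())

# Share of ALL venues: hood B is kept, as an all-zero row
share_of_all = grouped_freq(venues, restaurant_cols)
# Share of restaurants only: hood B disappears because it has no restaurant rows
share_of_restaurants = grouped_freq(venues[is_restaurant])

print(share_of_all)
print(share_of_restaurants)
```

Both choices are defensible. The first keeps every neighbourhood in the clustering, while the second describes the restaurant mix itself but drops neighbourhoods without restaurants.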


Performing the same transformation to display the main venue categories for each neighbourhood (for later manual analysis of the clusters)

In [61]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # beyond 'st', 'nd', 'rd': fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_custom_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_custom_venues_sorted['Neighbourhood'] = toronto_custom_grouped['Neighbourhood']

for ind in np.arange(toronto_custom_grouped.shape[0]):
    neighborhoods_custom_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_custom_grouped.iloc[ind, :], num_top_venues)

neighborhoods_custom_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Thai Restaurant,Asian Restaurant,American Restaurant,Sushi Restaurant,Seafood Restaurant,Vegetarian / Vegan Restaurant,Greek Restaurant,Gluten-free Restaurant,Latin American Restaurant,Mediterranean Restaurant
1,Berczy Park,Seafood Restaurant,Greek Restaurant,Vegetarian / Vegan Restaurant,Italian Restaurant,Japanese Restaurant,Eastern European Restaurant,French Restaurant,Comfort Food Restaurant,Belgian Restaurant,Thai Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Italian Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant
3,Business Reply Mail Processing Centre 969 Eastern,Fast Food Restaurant,Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant


As before, I loop over k and random_state to get an idea of the cluster sizes returned with different set-ups.

In [62]:
toronto_custom_grouped_clustering = toronto_custom_grouped.drop(columns='Neighbourhood')  # keyword form: the positional axis argument is no longer accepted by recent pandas

for k in range(2,8):
    for i in range(4):
        kclusters = k

        kmeans = KMeans(n_clusters=kclusters, random_state=i).fit(toronto_custom_grouped_clustering)
        Cluster_labels = pd.DataFrame()
        Cluster_labels.insert(0, 'Cluster Labels', kmeans.labels_)

        neighborhoods_custom_venues_sorted['Cluster Labels'] = Cluster_labels['Cluster Labels']

        toronto_custom_merged = toronto_data

        # merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
        toronto_custom_merged = toronto_custom_merged.join(neighborhoods_custom_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

        toronto_custom_merged_grouped = toronto_custom_merged[['Cluster Labels','Neighbourhood']].groupby(['Cluster Labels']).count()
        print('With',k,'labels and random_state = ',i,':')
        display(toronto_custom_merged_grouped)

With 2 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,38
1,1


With 2 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,38
1,1


With 2 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,1
1,38


With 2 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,38
1,1


With 3 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,7
1,31
2,1


With 3 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,2
1,36
2,1


With 3 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,1
1,37
2,1


With 3 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,22
1,1
2,16


With 4 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,29
1,1
2,1
3,8


With 4 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,23
1,13
2,1
3,2


With 4 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,2
1,24
2,12
3,1


With 4 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,23
1,1
2,1
3,14


With 5 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,22
1,1
2,14
3,1
4,1


With 5 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,13
1,23
2,1
3,1
4,1


With 5 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,35
1,1
2,1
3,1
4,1


With 5 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,21
1,1
2,1
3,15
4,1


With 6 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,11
1,1
2,1
3,16
4,1
5,9


With 6 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,1
1,14
2,1
3,21
4,1
5,1


With 6 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,4
1,2
2,1
3,1
4,13
5,18


With 6 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,14
1,1
2,1
3,1
4,21
5,1


With 7 labels and random_state =  0 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,14
1,1
2,1
3,1
4,1
5,20
6,1


With 7 labels and random_state =  1 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,7
1,1
2,13
3,1
4,1
5,1
6,15


With 7 labels and random_state =  2 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,1
1,11
2,6
3,1
4,1
5,1
6,18


With 7 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,19
1,1
2,1
3,1
4,1
5,12
6,4


#### With 3 clusters (random_state = 3) we already have 2 main categories identified; let's see what these are:

In [63]:
kclusters = 3
i=3

kmeans = KMeans(n_clusters=kclusters, random_state=i).fit(toronto_custom_grouped_clustering)
Cluster_labels = pd.DataFrame()
Cluster_labels.insert(0, 'Cluster Labels', kmeans.labels_)

neighborhoods_custom_venues_sorted['Cluster Labels'] = Cluster_labels['Cluster Labels']

toronto_custom_merged = toronto_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_custom_merged = toronto_custom_merged.join(neighborhoods_custom_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

toronto_custom_merged_grouped = toronto_custom_merged[['Cluster Labels','Neighbourhood']].groupby(['Cluster Labels']).count()
print('With',kclusters,'labels and random_state = ',i,':')
display(toronto_custom_merged_grouped)

With 3 labels and random_state =  3 :


Unnamed: 0_level_0,Neighbourhood
Cluster Labels,Unnamed: 1_level_1
0,22
1,1
2,16


#### Creating the corresponding map

In [64]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_custom_merged['Latitude'], toronto_custom_merged['Longitude'], toronto_custom_merged['Neighbourhood'], toronto_custom_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Cluster 0 seems fairly central compared with cluster 2.

### B-4 Examine Restaurant Clusters

#### Cluster 0

In [65]:
toronto_custom_merged.loc[toronto_custom_merged['Cluster Labels'] == 0, toronto_custom_merged.columns[[1,2] + list(range(6, toronto_custom_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
2,East Toronto,"The Beaches West, India Bazaar",Fast Food Restaurant,Italian Restaurant,Sushi Restaurant,Vietnamese Restaurant,Cuban Restaurant,Filipino Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,0
3,East Toronto,Studio District,American Restaurant,Italian Restaurant,Comfort Food Restaurant,Thai Restaurant,Seafood Restaurant,Latin American Restaurant,Middle Eastern Restaurant,Dim Sum Restaurant,Fast Food Restaurant,Falafel Restaurant,0
7,Central Toronto,Davisville,Sushi Restaurant,Italian Restaurant,Greek Restaurant,Thai Restaurant,Seafood Restaurant,Indian Restaurant,Cuban Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,0
11,Downtown Toronto,"Cabbagetown, St. James Town",Italian Restaurant,American Restaurant,Thai Restaurant,Taiwanese Restaurant,Indian Restaurant,Caribbean Restaurant,Chinese Restaurant,Japanese Restaurant,Vietnamese Restaurant,Doner Restaurant,0
12,Downtown Toronto,Church and Wellesley,Sushi Restaurant,Japanese Restaurant,Fast Food Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Mexican Restaurant,American Restaurant,Caribbean Restaurant,Chinese Restaurant,Ethiopian Restaurant,0
14,Downtown Toronto,"Ryerson, Garden District",Middle Eastern Restaurant,Japanese Restaurant,Fast Food Restaurant,Italian Restaurant,Ramen Restaurant,Modern European Restaurant,Ethiopian Restaurant,Chinese Restaurant,Mexican Restaurant,Vietnamese Restaurant,0
15,Downtown Toronto,St. James Town,Italian Restaurant,American Restaurant,Seafood Restaurant,Comfort Food Restaurant,New American Restaurant,Indian Restaurant,Middle Eastern Restaurant,Vegetarian / Vegan Restaurant,German Restaurant,Japanese Restaurant,0
16,Downtown Toronto,Berczy Park,Seafood Restaurant,Greek Restaurant,Vegetarian / Vegan Restaurant,Italian Restaurant,Japanese Restaurant,Eastern European Restaurant,French Restaurant,Comfort Food Restaurant,Belgian Restaurant,Thai Restaurant,0
17,Downtown Toronto,Central Bay Street,Italian Restaurant,Japanese Restaurant,Chinese Restaurant,Portuguese Restaurant,Korean Restaurant,Vegetarian / Vegan Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Falafel Restaurant,Ramen Restaurant,0
19,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",Italian Restaurant,Sushi Restaurant,Vegetarian / Vegan Restaurant,Seafood Restaurant,Ramen Restaurant,New American Restaurant,Indian Restaurant,Chinese Restaurant,Mexican Restaurant,Japanese Restaurant,0


Italian restaurants seem predominant, along with other mostly Western restaurants, plus sushi restaurants.

#### Cluster 1

In [66]:
toronto_custom_merged.loc[toronto_custom_merged['Cluster Labels'] == 1, toronto_custom_merged.columns[[1,2] + list(range(6, toronto_custom_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
1,East Toronto,"The Danforth West, Riverdale",Greek Restaurant,Italian Restaurant,American Restaurant,Indian Restaurant,Caribbean Restaurant,Doner Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,1


This seems fairly Western again. Perhaps the absence of sushi restaurants kept it out of the previous cluster.

#### Cluster 2

In [67]:
toronto_custom_merged.loc[toronto_custom_merged['Cluster Labels'] == 2, toronto_custom_merged.columns[[1,2] + list(range(6, toronto_custom_merged.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,East Toronto,The Beaches,Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant,2
4,Central Toronto,Lawrence Park,Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant,2
5,Central Toronto,Davisville North,Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant,2
6,Central Toronto,North Toronto West,Fast Food Restaurant,Chinese Restaurant,Mexican Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,French Restaurant,Filipino Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,2
8,Central Toronto,"Moore Park, Summerhill East",Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant,2
9,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Vietnamese Restaurant,American Restaurant,Sushi Restaurant,Dim Sum Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,2
10,Downtown Toronto,Rosedale,Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant,2
13,Downtown Toronto,Harbourfront,Mexican Restaurant,French Restaurant,Asian Restaurant,Vietnamese Restaurant,Dim Sum Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,2
18,Downtown Toronto,"Adelaide, King, Richmond",Thai Restaurant,Asian Restaurant,American Restaurant,Sushi Restaurant,Seafood Restaurant,Vegetarian / Vegan Restaurant,Greek Restaurant,Gluten-free Restaurant,Latin American Restaurant,Mediterranean Restaurant,2
22,Central Toronto,Roselawn,Vietnamese Restaurant,Gluten-free Restaurant,French Restaurant,Filipino Restaurant,Fast Food Restaurant,Falafel Restaurant,Ethiopian Restaurant,Eastern European Restaurant,Dumpling Restaurant,Doner Restaurant,2


Most of these have exactly the same top 10!
The others also share a predominance of Vietnamese restaurants, along with French or gluten-free ones.

#### Let's investigate this strange similarity further

Let's look in more detail at the frequency distribution for these neighbourhoods:

In [68]:
neighbourhoods_cluster2 = toronto_custom_merged.loc[toronto_custom_merged['Cluster Labels'] == 2]['Neighbourhood'].tolist()

In [69]:
num_top_venues = 5

for hood in toronto_custom_grouped['Neighbourhood']:
    
    # exact membership test: a substring test would also match e.g. 'Davisville' inside 'Davisville North'
    if hood in neighbourhoods_cluster2:
        print("----"+hood+"----")
        temp = toronto_custom_grouped[toronto_custom_grouped['Neighbourhood'] == hood].T.reset_index()
        temp.columns = ['venue','freq']
        temp = temp.iloc[1:]
        temp['freq'] = temp['freq'].astype(float)
        temp = temp.round({'freq': 2})
        print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
        print('\n')

----Adelaide, King, Richmond----
                           venue  freq
0               Asian Restaurant  0.03
1                Thai Restaurant  0.03
2             Seafood Restaurant  0.02
3  Vegetarian / Vegan Restaurant  0.02
4            American Restaurant  0.02


----Business Reply Mail Processing Centre 969 Eastern----
                             venue  freq
0             Fast Food Restaurant  0.07
1                Afghan Restaurant  0.00
2  Molecular Gastronomy Restaurant  0.00
3               Italian Restaurant  0.00
4              Japanese Restaurant  0.00


----CN Tower, Bathurst Quay, Island airport, Harbourfront West, King and Spadina, Railway Lands, South Niagara----
                 venue  freq
0    Afghan Restaurant   0.0
1    Hotpot Restaurant   0.0
2   Italian Restaurant   0.0
3  Japanese Restaurant   0.0
4    Korean Restaurant   0.0


----Chinatown, Grange Park, Kensington Market----
                           venue  freq
0          Vietnamese Restaurant  0.06
1     

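This shows that some of the similarity is actually due to a lack of restaurants rather than to the restaurant types themselves: when every frequency is 0.00, the top 10 only reflects the order in which the tied categories happen to be sorted.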
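
One way to make this explicit, sketched here on a toy frame shaped like `toronto_custom_grouped` (the frame and its values are made up), is to count how many categories have a non-zero frequency in each neighbourhood. Neighbourhoods with a count of 0 have no typed restaurant at all, so their ranking carries no information:

```python
import pandas as pd

toy_grouped = pd.DataFrame({
    'Neighbourhood': ['Hood A', 'Hood B', 'Hood C'],
    'Italian Restaurant': [0.02, 0.0, 0.0],
    'Thai Restaurant': [0.03, 0.0, 0.07],
    'Vietnamese Restaurant': [0.0, 0.0, 0.0],
})

# Number of categories actually present in each neighbourhood
n_present = (toy_grouped.drop(columns='Neighbourhood') > 0).sum(axis=1)
summary = toy_grouped[['Neighbourhood']].assign(categories_present=n_present)
print(summary)

# Neighbourhoods whose ranking is meaningless because every frequency is zero
empty_hoods = summary.loc[summary['categories_present'] == 0, 'Neighbourhood'].tolist()
print(empty_hoods)
```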

### B-5 Refining display

For display purposes, we will replace categories whose frequency is 0.00 with "No restaurants".
Another possibility would be to merge all categories into a single column, in which case the clustering would probably need to be rerun.
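
A minimal sketch of that second possibility, assuming the per-category frequencies are simply summed into one hypothetical `Restaurant share` column (toy values; as noted, the clustering would then have to be rerun on this single feature):

```python
import pandas as pd

toy_grouped = pd.DataFrame({
    'Neighbourhood': ['Hood A', 'Hood B'],
    'Italian Restaurant': [0.02, 0.0],
    'Thai Restaurant': [0.03, 0.0],
})

# Collapse all restaurant categories into a single feature per neighbourhood
merged = toy_grouped[['Neighbourhood']].copy()
merged['Restaurant share'] = toy_grouped.drop(columns='Neighbourhood').sum(axis=1)
print(merged)
```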

In [70]:
def return_most_common_venues_filter(row, num_top_venues):
    # drop the leading 'Neighbourhood' entry and sort the categories by frequency
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    # keep the category name when its frequency is non-zero, otherwise label it 'No restaurants'
    # (builds a new list instead of overwriting the index values in place)
    labels = [category if freq != 0.0 else 'No restaurants'
              for category, freq in row_categories_sorted.items()]

    return labels[0:num_top_venues]

In [71]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # beyond 'st', 'nd', 'rd': fall back to the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_custom_venues_sorted_2 = pd.DataFrame(columns=columns)
neighborhoods_custom_venues_sorted_2['Neighbourhood'] = toronto_custom_grouped['Neighbourhood']

for ind in np.arange(toronto_custom_grouped.shape[0]):
    neighborhoods_custom_venues_sorted_2.iloc[ind, 1:] = return_most_common_venues_filter(toronto_custom_grouped.iloc[ind, :], num_top_venues)

In [72]:
neighborhoods_custom_venues_sorted_2

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide, King, Richmond",Thai Restaurant,Asian Restaurant,American Restaurant,Sushi Restaurant,Seafood Restaurant,Vegetarian / Vegan Restaurant,Greek Restaurant,Gluten-free Restaurant,Latin American Restaurant,Mediterranean Restaurant
1,Berczy Park,Seafood Restaurant,Greek Restaurant,Vegetarian / Vegan Restaurant,Italian Restaurant,Japanese Restaurant,Eastern European Restaurant,French Restaurant,Comfort Food Restaurant,Belgian Restaurant,Thai Restaurant
2,"Brockton, Exhibition Place, Parkdale Village",Italian Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants
3,Business Reply Mail Processing Centre 969 Eastern,Fast Food Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants
5,"Cabbagetown, St. James Town",Italian Restaurant,American Restaurant,Thai Restaurant,Taiwanese Restaurant,Indian Restaurant,Caribbean Restaurant,Chinese Restaurant,Japanese Restaurant,No restaurants,No restaurants
6,Central Bay Street,Italian Restaurant,Japanese Restaurant,Chinese Restaurant,Portuguese Restaurant,Korean Restaurant,Vegetarian / Vegan Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Falafel Restaurant,Ramen Restaurant
7,"Chinatown, Grange Park, Kensington Market",Vietnamese Restaurant,Chinese Restaurant,Vegetarian / Vegan Restaurant,Dumpling Restaurant,Mexican Restaurant,Hotpot Restaurant,Japanese Restaurant,Doner Restaurant,Dim Sum Restaurant,Filipino Restaurant
8,Christie,Italian Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants
9,Church and Wellesley,Sushi Restaurant,Japanese Restaurant,Fast Food Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Mexican Restaurant,American Restaurant,Caribbean Restaurant,Chinese Restaurant,Ethiopian Restaurant


#### The clustering did not change, so let's re-assign the same cluster labels

In [73]:
neighborhoods_custom_venues_sorted_2['Cluster Labels'] = Cluster_labels['Cluster Labels']

toronto_custom_merged_2 = toronto_data

# join the top venues and cluster labels onto toronto_data, which holds latitude/longitude for each neighbourhood
toronto_custom_merged_2 = toronto_custom_merged_2.join(neighborhoods_custom_venues_sorted_2.set_index('Neighbourhood'), on='Neighbourhood')

#### Finally, let's analyse the clusters using the refined display

#### Cluster 0

In [74]:
toronto_custom_merged_2.loc[toronto_custom_merged_2['Cluster Labels'] == 0, toronto_custom_merged_2.columns[[1,2] + list(range(6, toronto_custom_merged_2.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
2,East Toronto,"The Beaches West, India Bazaar",Fast Food Restaurant,Italian Restaurant,Sushi Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,0
3,East Toronto,Studio District,American Restaurant,Italian Restaurant,Comfort Food Restaurant,Thai Restaurant,Seafood Restaurant,Latin American Restaurant,Middle Eastern Restaurant,No restaurants,No restaurants,No restaurants,0
7,Central Toronto,Davisville,Sushi Restaurant,Italian Restaurant,Greek Restaurant,Thai Restaurant,Seafood Restaurant,Indian Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,0
11,Downtown Toronto,"Cabbagetown, St. James Town",Italian Restaurant,American Restaurant,Thai Restaurant,Taiwanese Restaurant,Indian Restaurant,Caribbean Restaurant,Chinese Restaurant,Japanese Restaurant,No restaurants,No restaurants,0
12,Downtown Toronto,Church and Wellesley,Sushi Restaurant,Japanese Restaurant,Fast Food Restaurant,Mediterranean Restaurant,Vietnamese Restaurant,Mexican Restaurant,American Restaurant,Caribbean Restaurant,Chinese Restaurant,Ethiopian Restaurant,0
14,Downtown Toronto,"Ryerson, Garden District",Middle Eastern Restaurant,Japanese Restaurant,Fast Food Restaurant,Italian Restaurant,Ramen Restaurant,Modern European Restaurant,Ethiopian Restaurant,Chinese Restaurant,Mexican Restaurant,Vietnamese Restaurant,0
15,Downtown Toronto,St. James Town,Italian Restaurant,American Restaurant,Seafood Restaurant,Comfort Food Restaurant,New American Restaurant,Indian Restaurant,Middle Eastern Restaurant,Vegetarian / Vegan Restaurant,German Restaurant,Japanese Restaurant,0
16,Downtown Toronto,Berczy Park,Seafood Restaurant,Greek Restaurant,Vegetarian / Vegan Restaurant,Italian Restaurant,Japanese Restaurant,Eastern European Restaurant,French Restaurant,Comfort Food Restaurant,Belgian Restaurant,Thai Restaurant,0
17,Downtown Toronto,Central Bay Street,Italian Restaurant,Japanese Restaurant,Chinese Restaurant,Portuguese Restaurant,Korean Restaurant,Vegetarian / Vegan Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Falafel Restaurant,Ramen Restaurant,0
19,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",Italian Restaurant,Sushi Restaurant,Vegetarian / Vegan Restaurant,Seafood Restaurant,Ramen Restaurant,New American Restaurant,Indian Restaurant,Chinese Restaurant,Mexican Restaurant,Japanese Restaurant,0


At first sight, this cluster seems to be characterised mainly by the presence of sushi and Italian restaurants.

#### Cluster 1

In [75]:
toronto_custom_merged_2.loc[toronto_custom_merged_2['Cluster Labels'] == 1, toronto_custom_merged_2.columns[[1,2] + list(range(6, toronto_custom_merged_2.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
1,East Toronto,"The Danforth West, Riverdale",Greek Restaurant,Italian Restaurant,American Restaurant,Indian Restaurant,Caribbean Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,1


This cluster looks fairly Western again. The absence of sushi restaurants may be what kept it out of the previous cluster.

#### Cluster 2

In [76]:
toronto_custom_merged_2.loc[toronto_custom_merged_2['Cluster Labels'] == 2, toronto_custom_merged_2.columns[[1,2] + list(range(6, toronto_custom_merged_2.shape[1]))]]

Unnamed: 0,Borough,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Cluster Labels
0,East Toronto,The Beaches,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
4,Central Toronto,Lawrence Park,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
5,Central Toronto,Davisville North,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
6,Central Toronto,North Toronto West,Fast Food Restaurant,Chinese Restaurant,Mexican Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
8,Central Toronto,"Moore Park, Summerhill East",No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
9,Central Toronto,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",Vietnamese Restaurant,American Restaurant,Sushi Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
10,Downtown Toronto,Rosedale,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
13,Downtown Toronto,Harbourfront,Mexican Restaurant,French Restaurant,Asian Restaurant,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2
18,Downtown Toronto,"Adelaide, King, Richmond",Thai Restaurant,Asian Restaurant,American Restaurant,Sushi Restaurant,Seafood Restaurant,Vegetarian / Vegan Restaurant,Greek Restaurant,Gluten-free Restaurant,Latin American Restaurant,Mediterranean Restaurant,2
22,Central Toronto,Roselawn,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,No restaurants,2


This cluster seems to group mostly neighbourhoods with Vietnamese restaurants, or with few or no restaurants.

At this stage, I might have expected four clusters: Italian, Sushi, Vietnamese and No restaurants. However, additional clustering runs (using a loop as before) do not produce such a distribution.

This is probably because the clustering algorithm does not treat 0.0 as a special value, whereas a human reader does.
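A small numeric sketch may help here. scikit-learn's `KMeans` groups points by Euclidean distance, so a neighbourhood with no restaurants is simply a point at the origin. The three vectors below are made-up frequencies, not values from the Toronto data.

```python
import numpy as np

# Hypothetical frequency vectors over three restaurant categories (made-up values)
no_restaurants = np.array([0.00, 0.00, 0.00])
few_restaurants = np.array([0.01, 0.00, 0.01])
many_restaurants = np.array([0.05, 0.03, 0.02])

def euclidean(a, b):
    """Euclidean distance between two frequency vectors."""
    return float(np.linalg.norm(a - b))

dist_none_few = euclidean(no_restaurants, few_restaurants)
dist_few_many = euclidean(few_restaurants, many_restaurants)

print('none vs few :', dist_none_few)
print('few  vs many:', dist_few_many)
```

In this toy case the all-zero vector is closer to the "few restaurants" vector than that vector is to the "many restaurants" one, so a distance-based algorithm has no reason to isolate the empty neighbourhoods.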

Giving more weight to whether restaurants are present at all (independently of their kind) would probably require another approach: either clustering on the overall frequency of restaurants, or adding dedicated features to the dataframe currently used.
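As a rough sketch of the "dedicated feature" option, one could add a presence indicator next to the per-category frequencies. Everything below is an assumption for illustration: `toy_grouped` stands in for `toronto_custom_grouped`, and the column name `Has Restaurants` and the `presence_weight` value are hypothetical.

```python
import pandas as pd

# Toy stand-in for toronto_custom_grouped (made-up frequencies)
toy_grouped = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Italian Restaurant': [0.04, 0.00, 0.01],
    'Vietnamese Restaurant': [0.00, 0.00, 0.03],
})

features = toy_grouped.drop(columns='Neighbourhood')

# Hypothetical weight: larger values make presence/absence dominate the distance
presence_weight = 0.1

# Dedicated feature: weight if the neighbourhood has any restaurant, 0.0 otherwise
has_restaurants = features.sum(axis=1) > 0
augmented = features.copy()
augmented['Has Restaurants'] = has_restaurants.astype(float) * presence_weight

print(augmented)
```

The augmented frame would then be passed to the clustering step in place of the plain frequency table. Because the frequencies are small, the weight needs tuning so that the indicator does not swamp the restaurant types entirely.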

<!-- toronto_custom_grouped_clustering = toronto_custom_grouped.drop(columns='Neighbourhood')

for i in range(100):
        kclusters = 4

        kmeans = KMeans(n_clusters=kclusters, random_state=i).fit(toronto_custom_grouped_clustering)
        Cluster_labels = pd.DataFrame()
        Cluster_labels.insert(0, 'Cluster Labels', kmeans.labels_)

        neighborhoods_custom_venues_sorted['Cluster Labels'] = Cluster_labels['Cluster Labels']

        toronto_custom_merged = toronto_data

        # merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
        toronto_custom_merged = toronto_custom_merged.join(neighborhoods_custom_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')

        toronto_custom_merged_grouped = toronto_custom_merged[['Cluster Labels','Neighbourhood']].groupby(['Cluster Labels']).count()
        print('With',kclusters,'labels and random_state = ',i,':')
        display(toronto_custom_merged_grouped)
        -->

### B-6 Some additional statistics

Let's look in more detail at the distribution within each cluster to see whether the "human" observations are confirmed.
As this is not the main purpose of the exercise, I simply check the mean frequency of each restaurant type.
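The per-cluster means can also be obtained in one step with a `groupby`. The sketch below uses made-up data: `toy_grouped` and `toy_labels` are hypothetical stand-ins for the frequency table and the cluster assignment.

```python
import pandas as pd

# Toy frequency table (stand-in for toronto_custom_grouped, made-up values)
toy_grouped = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Italian Restaurant': [0.04, 0.02, 0.00, 0.00],
    'Sushi Restaurant': [0.02, 0.04, 0.00, 0.00],
    'Vietnamese Restaurant': [0.00, 0.00, 0.02, 0.00],
})

# Toy cluster assignment (stand-in for the 'Cluster Labels' column)
toy_labels = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C', 'D'],
    'Cluster Labels': [0, 0, 2, 2],
})

# Attach the labels, then average every restaurant category within each cluster
merged = toy_grouped.merge(toy_labels, on='Neighbourhood')
cluster_means = merged.drop(columns='Neighbourhood').groupby('Cluster Labels').mean()

# Transpose so that categories are rows and clusters are columns
print(cluster_means.T)
```

This gives all clusters side by side, which makes the comparison easier than inspecting each cluster in a separate cell.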

#### Cluster 0

In [77]:
#Getting the list of neighborhoods in cluster 0:
neighbourhoods_cluster0 = toronto_custom_merged_2.loc[toronto_custom_merged_2['Cluster Labels'] == 0]['Neighbourhood'].tolist()

cluster0_stats = toronto_custom_grouped[toronto_custom_grouped['Neighbourhood'].isin(neighbourhoods_cluster0)]
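# numeric_only=True skips the 'Neighbourhood' text column (required by recent pandas versions)
cluster0_stats.mean(numeric_only=True).sort_values(ascending = False)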

Italian Restaurant                 0.035629
Sushi Restaurant                   0.022825
Japanese Restaurant                0.014942
Seafood Restaurant                 0.013604
Thai Restaurant                    0.012425
American Restaurant                0.008274
Fast Food Restaurant               0.007739
Mexican Restaurant                 0.007016
Chinese Restaurant                 0.006425
Vegetarian / Vegan Restaurant      0.005677
French Restaurant                  0.005612
Eastern European Restaurant        0.004764
Comfort Food Restaurant            0.004103
Indian Restaurant                  0.003788
Greek Restaurant                   0.003525
Cuban Restaurant                   0.003497
Middle Eastern Restaurant          0.003417
Asian Restaurant                   0.003182
Latin American Restaurant          0.003157
Ramen Restaurant                   0.002415
New American Restaurant            0.002273
Cajun / Creole Restaurant          0.001976
Portuguese Restaurant           

This confirms the predominance of Italian and sushi restaurants. Note that the frequency of Vietnamese restaurants is very low; let's see whether it is higher in cluster 2.
Let's also check whether frequencies are higher overall in cluster 0 than in cluster 2, as seemed to be the case.

#### Cluster 2

In [78]:
#Getting the list of neighborhoods in cluster 2:
neighbourhoods_cluster2 = toronto_custom_merged_2.loc[toronto_custom_merged_2['Cluster Labels'] == 2]['Neighbourhood'].tolist()

cluster2_stats = toronto_custom_grouped[toronto_custom_grouped['Neighbourhood'].isin(neighbourhoods_cluster2)]
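# numeric_only=True skips the 'Neighbourhood' text column (required by recent pandas versions)
cluster2_stats.mean(numeric_only=True).sort_values(ascending = False)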

Vietnamese Restaurant              0.010502
Mexican Restaurant                 0.008963
American Restaurant                0.008432
Vegetarian / Vegan Restaurant      0.008076
Fast Food Restaurant               0.007440
Middle Eastern Restaurant          0.006624
Chinese Restaurant                 0.006610
Sushi Restaurant                   0.005714
Asian Restaurant                   0.005581
Dumpling Restaurant                0.002907
Indian Restaurant                  0.002717
Thai Restaurant                    0.002602
Japanese Restaurant                0.002554
French Restaurant                  0.002504
Greek Restaurant                   0.001827
New American Restaurant            0.001827
Ramen Restaurant                   0.001352
Seafood Restaurant                 0.001250
Korean Restaurant                  0.001202
Italian Restaurant                 0.001202
Cuban Restaurant                   0.001202
Southern / Soul Food Restaurant    0.001202
Filipino Restaurant             

This is consistent with the manual observations: this cluster corresponds to Vietnamese restaurants plus a lower frequency of restaurants than in the previous cluster.
The frequency of Mexican restaurants is, however, similar in both clusters.
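To make the comparison between the two clusters explicit, the means can be placed side by side. The sketch below copies a subset of the values printed in the two outputs above; the `comparison` frame and its `Difference` column are names introduced here for illustration.

```python
import pandas as pd

# Mean frequencies copied from the cluster 0 output above (subset of categories)
cluster0_mean = pd.Series({
    'Italian Restaurant': 0.035629,
    'Sushi Restaurant': 0.022825,
    'Japanese Restaurant': 0.014942,
    'Mexican Restaurant': 0.007016,
    'American Restaurant': 0.008274,
})

# Mean frequencies copied from the cluster 2 output above (same categories)
cluster2_mean = pd.Series({
    'Italian Restaurant': 0.001202,
    'Sushi Restaurant': 0.005714,
    'Japanese Restaurant': 0.002554,
    'Mexican Restaurant': 0.008963,
    'American Restaurant': 0.008432,
})

# Side-by-side table, sorted by how much more frequent a category is in cluster 0
comparison = pd.DataFrame({'Cluster 0': cluster0_mean, 'Cluster 2': cluster2_mean})
comparison['Difference'] = comparison['Cluster 0'] - comparison['Cluster 2']
comparison = comparison.sort_values('Difference', ascending=False)

print(comparison)
```

Categories with a large positive difference (Italian, sushi) are what set cluster 0 apart, while Mexican and American restaurants differ very little between the two clusters.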

### Conclusion

Through this lab, we first replicated the analysis from Manhattan. We then extended this initial clustering to see whether we could identify another possible clustering. This highlighted the importance of Coffee Shop and Café venues in many neighbourhoods. To get additional insights, we then focused on clustering based on restaurant types, which led us to refine the display and perform additional checks. At this stage, neighbourhoods fell into two main clusters: areas with a large number of restaurants, usually with a high frequency of Italian and sushi restaurants; and areas with a lower frequency of restaurants, or where Vietnamese restaurants are more frequent.