# <font color=green>__Part 1__</font>

In [1]:
import pandas as pd
import numpy as np
!pip3 install lxml



__We will use pandas' read_html function and then choose the first table, which contains the relevant information:__

In [2]:
dataframe = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
dataframe.head(3)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods


In [3]:
# verifying that the "Not assigned" entries are indeed of type string
type(dataframe.iloc[0,1])

str

__As per the assignment, we drop all the rows for which "Borough" is unassigned:__

In [4]:
dataframe = dataframe[dataframe.Borough != 'Not assigned'].reset_index(drop=True)

dataframe.head(3)

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"


__According to the assignment, there are duplicates in the 'Postal Code' column. It is possible, however, that the Wikipedia page was updated after the assignment was posted. It turns out that each postal code is listed only once:__

In [5]:
dataframe.shape[0] - len(dataframe['Postal Code'].unique())

0
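
__Had duplicates been present, the rows sharing a postal code would have to be combined into one. A minimal sketch of how that could be done is given here; the toy table is made up for illustration and is not taken from the Wikipedia page:__

```python
import pandas as pd

# Hypothetical toy data: 'M5A' appears twice with different neighbourhoods.
toy = pd.DataFrame({
    'Postal Code': ['M5A', 'M5A', 'M3A'],
    'Borough': ['Downtown Toronto', 'Downtown Toronto', 'North York'],
    'Neighbourhood': ['Regent Park', 'Harbourfront', 'Parkwoods'],
})

# Count rows whose postal code already appeared earlier in the table.
n_duplicates = toy['Postal Code'].duplicated().sum()

# One row per postal code, with the neighbourhood names joined by commas.
combined = (toy.groupby(['Postal Code', 'Borough'], sort=False)['Neighbourhood']
               .agg(', '.join)
               .reset_index())

print(n_duplicates)
print(combined)
```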

__Finally, we verify that there are no rows with unassigned neighbourhood:__

In [6]:
"Not assigned" in dataframe.Neighbourhood.values

False

__As per the assignment, we print the number of rows using the .shape attribute:__

In [7]:
print(dataframe.shape[0])

103


# <font color=green>__Part 2__</font>

__First we define a function that gets the coordinates associated with a given postal code using the Geocoder package. As suggested in the assignment, we retry the request until it succeeds, to deal with the unreliability of the package:__

In [8]:
!pip3 install geocoder
import geocoder

def coords(postal_code):
    # initialize your variable to None
    lat_lng_coords = None

    # loop until you get the coordinates
    while lat_lng_coords is None:
        g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng

    return lat_lng_coords
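
__The loop above never terminates if the lookup keeps returning None. A minimal sketch of a bounded variant is given here. The names `coords_with_retries`, `lookup`, `max_attempts` and `delay_seconds` are hypothetical, and the lookup is passed in as a function so that the idea can be tried without network access; the stand-in lookup and its coordinates are made up:__

```python
import time

def coords_with_retries(postal_code, lookup, max_attempts=5, delay_seconds=1.0):
    """Return the lookup result for a postal code, or None after max_attempts failures."""
    query = '{}, Toronto, Ontario'.format(postal_code)
    for _ in range(max_attempts):
        lat_lng = lookup(query)
        if lat_lng is not None:
            return lat_lng
        # wait a little before trying again
        time.sleep(delay_seconds)
    return None

# Stand-in lookup that fails twice before succeeding (made-up coordinates).
calls = {'n': 0}

def flaky_lookup(query):
    calls['n'] += 1
    return None if calls['n'] < 3 else [43.65, -79.36]

result = coords_with_retries('M5A', flaky_lookup, delay_seconds=0.0)
always_none = coords_with_retries('M5A', lambda q: None, max_attempts=3, delay_seconds=0.0)
print(result, always_none)
```

__With the real package, `lookup` could be `lambda q: geocoder.google(q).latlng`, which mirrors the call in the cell above.__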



__However, attempting to use this function, we get an error message. Upon closer inspection, it turns out that an API key is needed. Since I do not have an API key, I will use the CSV file provided in the assignment instead:__

In [9]:
df_coord = pd.read_csv('https://cocl.us/Geospatial_data')
df_coord.head(3)

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711


In [10]:
# We verify that there are no duplicates in the 'Postal Code' column
print(df_coord.shape[0]-len(df_coord['Postal Code'].unique()))

# We verify that there are precisely as many postal codes as in the other dataframe:
df_coord.shape[0] - dataframe.shape[0]

0


0
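
__Equal row counts do not by themselves guarantee that both dataframes contain the same postal codes. A stricter check compares the two sets of codes directly. The sketch below uses two short hypothetical lists in place of the actual 'Postal Code' columns:__

```python
import pandas as pd

# Hypothetical stand-ins for dataframe['Postal Code'] and df_coord['Postal Code'].
codes_a = pd.Series(['M1B', 'M1C', 'M1E'])
codes_b = pd.Series(['M1E', 'M1B', 'M1G'])

# Same length, but the contents differ.
same_count = len(codes_a) == len(codes_b)
only_in_a = set(codes_a) - set(codes_b)
only_in_b = set(codes_b) - set(codes_a)

print(same_count, only_in_a, only_in_b)
```

__Two empty differences would confirm that the codes match exactly, regardless of their order.__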

__We now merge these two dataframes, noting that their respective 'Postal Code' columns are ordered differently:__

In [11]:
# We use 'Postal Code' as index for both dataframes. This is justified, since there are no duplicates (see above)
df_coord.set_index('Postal Code', inplace = True)
dataframe.set_index('Postal Code', inplace = True)

# Now we can combine the dataframes based on index, where the different order does not matter
dataframe = pd.merge(dataframe, df_coord, left_index=True, right_index=True)

# Now we just reset the index and inspect the first twelve rows of the resulting dataframe:
dataframe.reset_index(inplace = True)
dataframe.head(12)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
5,M9A,Etobicoke,"Islington Avenue, Humber Valley Village",43.667856,-79.532242
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
7,M3B,North York,Don Mills,43.745906,-79.352188
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937
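
__The same result can presumably be obtained without touching the index, by merging on the 'Postal Code' column directly. The sketch below uses two rows from the tables above; the `validate` argument of `pd.merge` makes pandas raise an error if a key is duplicated on either side:__

```python
import pandas as pd

left = pd.DataFrame({'Postal Code': ['M3A', 'M4A'],
                     'Borough': ['North York', 'North York']})
# Same postal codes, listed in a different order.
right = pd.DataFrame({'Postal Code': ['M4A', 'M3A'],
                      'Latitude': [43.725882, 43.753259],
                      'Longitude': [-79.315572, -79.329656]})

# Rows are matched by key, so the differing order does not matter.
merged = pd.merge(left, right, on='Postal Code', how='inner', validate='one_to_one')
print(merged)
```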


# <font color=green>__Part 3__</font>

In [12]:
!conda install -c conda-forge folium=0.5.0 --yes
import folium
import json
import requests
from pandas import json_normalize  # public location since pandas 1.0; the pandas.io.json path is deprecated


__We start by displaying the neighbourhoods on a map:__

In [13]:
map_Toronto = folium.Map(location=[43.6532, -79.3832], zoom_start=10)

for lat, lng, borough, neighborhood, postalcode in zip(dataframe['Latitude'], dataframe['Longitude'],
                                                       dataframe['Borough'], dataframe['Neighbourhood'],
                                                       dataframe['Postal Code']):
    label = folium.Popup('{}, {}, {}'.format(postalcode, neighborhood, borough), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)  
    
map_Toronto

__We will use Foursquare in order to obtain information on the venues in these neighbourhoods:__

In [14]:
CLIENT_ID = 'removed for privacy reasons' # your Foursquare ID
CLIENT_SECRET = 'removed for privacy reasons' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

__We borrow the following function from one of the labs in the course:__

In [15]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

__We now go through the postal codes of Toronto and use the Foursquare API to find the venues within a radius of 500 metres of the coordinates associated with each postal code; we limit ourselves to the top 100 venues per postal code:__

In [16]:
postalcodes = dataframe['Postal Code']
neighbourhoods = dataframe['Neighbourhood']
latitudes = dataframe['Latitude']
longitudes = dataframe['Longitude']

venues_list=[]
for p_code, nhbd, lat, lng in zip(postalcodes, neighbourhoods, latitudes, longitudes):
            
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        lat, 
        lng, 
        500, 
        100)
            
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
        
    # return only relevant information for each nearby venue
    venues_list.append([(
        p_code, 
        nhbd, 
        lat, 
        lng, 
        v['venue']['name'], 
        v['venue']['location']['lat'], 
        v['venue']['location']['lng'],  
        v['venue']['categories'][0]['name']) for v in results])

nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['Postal Code', 
                  'Neighbourhood',
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
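
__The request and the extraction step can also be separated into two small helpers, which makes it easier to handle a venue without any category. This is only a sketch: `explore_params` and `venue_rows` are hypothetical names, and the payload is a hand-written stand-in shaped like the response indexed in the cell above, not real API output:__

```python
def explore_params(lat, lng, client_id, client_secret, version, radius=500, limit=100):
    """Query parameters for the explore endpoint, as a dict instead of a formatted URL."""
    return {
        'client_id': client_id, 'client_secret': client_secret, 'v': version,
        'll': '{},{}'.format(lat, lng), 'radius': radius, 'limit': limit,
    }

def venue_rows(p_code, nhbd, lat, lng, payload):
    """Turn one response payload into a list of row tuples."""
    rows = []
    for item in payload['response']['groups'][0]['items']:
        venue = item['venue']
        categories = venue.get('categories') or []
        # use None instead of failing when a venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((p_code, nhbd, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows

# Hand-written stand-in payload; the second venue is made up and has no category.
payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Brookbanks Park',
               'location': {'lat': 43.751976, 'lng': -79.33214},
               'categories': [{'name': 'Park'}]}},
    {'venue': {'name': 'Unnamed Spot',
               'location': {'lat': 43.7520, 'lng': -79.3330},
               'categories': []}},
]}]}}

params = explore_params(43.753259, -79.329656, 'ID', 'SECRET', '20180605')
rows = venue_rows('M3A', 'Parkwoods', 43.753259, -79.329656, payload)
print(params['ll'], rows)
```

__With real credentials, the payload would come from `requests.get('https://api.foursquare.com/v2/venues/explore', params=params).json()`.__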
    


__We briefly inspect the dataframe we just created:__

In [17]:
nearby_venues.head(20)

Unnamed: 0,Postal Code,Neighbourhood,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M3A,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,M3A,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
2,M4A,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena
3,M4A,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
4,M4A,Victoria Village,43.725882,-79.315572,Portugril,43.725819,-79.312785,Portuguese Restaurant
5,M4A,Victoria Village,43.725882,-79.315572,The Frig,43.727051,-79.317418,French Restaurant
6,M4A,Victoria Village,43.725882,-79.315572,Eglinton Ave E & Sloane Ave/Bermondsey Rd,43.726086,-79.31362,Intersection
7,M4A,Victoria Village,43.725882,-79.315572,Pizza Nova,43.725824,-79.31286,Pizza Place
8,M5A,"Regent Park, Harbourfront",43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
9,M5A,"Regent Park, Harbourfront",43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop


__Note that a venue in this dataframe does not necessarily have the postal code given in the 'Postal Code' column, nor does it necessarily lie in (any of) the neighbourhood(s) in the 'Neighbourhood' column: we only searched for venues within a fixed radius of the specified coordinates, with no further requirements. The same venue may also be listed for more than one postal code. This does not matter for our purposes, since we simply treat such a venue as accessible from every postal code it is listed for.__
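
__To see how often a venue is listed for more than one postal code, one could group by venue name and venue coordinates and count the distinct postal codes. A minimal sketch on a made-up toy table with the same column names:__

```python
import pandas as pd

# Toy data: the first venue is listed for two postal codes.
toy_venues = pd.DataFrame({
    'Postal Code': ['M5A', 'M5B', 'M5B'],
    'Venue': ['Tandem Coffee', 'Tandem Coffee', 'Some Bakery'],
    'Venue Latitude': [43.653559, 43.653559, 43.657],
    'Venue Longitude': [-79.361809, -79.361809, -79.378],
})

# A venue is identified by its name together with its coordinates.
venue_key = ['Venue', 'Venue Latitude', 'Venue Longitude']
codes_per_venue = toy_venues.groupby(venue_key)['Postal Code'].nunique()
shared = codes_per_venue[codes_per_venue > 1]
print(shared)
```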

__We now inspect the size of this dataframe and see how many venues can be found for each of the postal codes:__

In [18]:
print(nearby_venues.shape)
nearby_venues[['Postal Code', 'Venue']].groupby('Postal Code').count().rename(columns={"Venue": "Number of Venues"})

(2139, 8)


Unnamed: 0_level_0,Number of Venues
Postal Code,Unnamed: 1_level_1
M1B,1
M1C,1
M1E,8
M1G,3
M1H,9
M1J,2
M1K,6
M1L,10
M1M,2
M1N,4


__Since this table has fewer than 103 rows, no venues were found for some of the postal codes.__
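
__The postal codes without any venue can be listed explicitly with `isin`. The sketch below uses short hypothetical series in place of `dataframe['Postal Code']` and `nearby_venues['Postal Code']`:__

```python
import pandas as pd

# Hypothetical stand-ins: all postal codes, and the codes that occur in the venue table.
all_codes = pd.Series(['M1B', 'M1C', 'M1E'])
venue_codes = pd.Series(['M1B', 'M1B', 'M1E'])

# Keep the codes that never occur in the venue table.
missing = all_codes[~all_codes.isin(venue_codes)]
print(list(missing))
```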

__We view the number of different venue categories:__

In [19]:
len(nearby_venues['Venue Category'].unique())

269

__Now we inspect how many venues there are in each category for every postal code:__

In [20]:
venues_per_category = pd.get_dummies(nearby_venues[['Venue Category']], prefix="", prefix_sep="")

# add postal code column back to dataframe
venues_per_category['Postal Code'] = nearby_venues['Postal Code']  

# now we group the rows together by postal code and sum up
venues_per_category = venues_per_category.groupby('Postal Code').sum()
venues_per_category
 

Unnamed: 0_level_0,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
Postal Code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
M1B,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1C,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1E,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1G,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1H,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1J,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1K,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1L,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
M1M,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
M1N,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
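
__The one-hot encoding followed by a grouped sum amounts to a contingency table of postal codes against venue categories, which `pd.crosstab` builds in a single step. A small sketch on toy data, comparing both routes:__

```python
import pandas as pd

toy = pd.DataFrame({
    'Postal Code': ['M3A', 'M3A', 'M4A', 'M4A', 'M4A'],
    'Venue Category': ['Park', 'Food & Drink Shop', 'Coffee Shop',
                       'Pizza Place', 'Coffee Shop'],
})

# Route 1: one-hot encode, then sum per postal code (as in the cell above).
via_dummies = pd.get_dummies(toy['Venue Category']).astype(int)
via_dummies['Postal Code'] = toy['Postal Code']
via_dummies = via_dummies.groupby('Postal Code').sum()

# Route 2: build the count table directly.
via_crosstab = pd.crosstab(toy['Postal Code'], toy['Venue Category'])
print(via_crosstab)
```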


__In order to recover the most common venues for each postal code, we borrow the following function from one of the labs:__

In [21]:
def return_most_common_venues(row, num_top_venues):
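    # 'Postal Code' is the index of venues_per_category, so every entry of the row
    # is a category count; slicing with iloc[1:] would drop the first category
    row_categories = row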
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

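__We now create a dataframe which contains the 5 most common venue categories for each postal code. If fewer than 5 categories are present for a postal code, the remaining slots are filled with categories of count zero, in whatever order the sort leaves them:__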

In [22]:
neighborhoods_venues_sorted = pd.DataFrame(columns = ['Postal Code', '1st Most Common Venue',
                                                      '2nd Most Common Venue', '3rd Most Common Venue',
                                                      '4th Most Common Venue', '5th Most Common Venue'])
neighborhoods_venues_sorted['Postal Code'] = venues_per_category.reset_index()['Postal Code']

for ind in np.arange(venues_per_category.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(venues_per_category.iloc[ind, :], 5)

neighborhoods_venues_sorted

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Fast Food Restaurant,Donut Shop,Dim Sum Restaurant,Diner,Discount Store
1,M1C,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center
2,M1E,Rental Car Location,Medical Center,Bank,Intersection,Mexican Restaurant
3,M1G,Coffee Shop,Korean Restaurant,Dumpling Restaurant,Discount Store,Distribution Center
4,M1H,Bank,Bakery,Hakka Restaurant,Lounge,Caribbean Restaurant
5,M1J,Pizza Place,Playground,Yoga Studio,Donut Shop,Dim Sum Restaurant
6,M1K,Department Store,Discount Store,Bus Station,Chinese Restaurant,Hobby Shop
7,M1L,Bakery,Bus Line,Metro Station,Ice Cream Shop,Intersection
8,M1M,Motel,American Restaurant,Yoga Studio,Donut Shop,Diner
9,M1N,College Stadium,Skating Rink,General Entertainment,Café,Dog Run
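
__For M1C only one venue was found, so four of the five entries in its row are categories with a count of zero. A possible variant that lists only categories which actually occur, and pads the rest with None, is sketched here; `top_categories` is a hypothetical name and the counts are made up:__

```python
import pandas as pd

def top_categories(row, num_top_venues):
    """Most common categories with a positive count, padded with None."""
    present = row[row > 0]
    ranked = present.sort_values(ascending=False)
    top = list(ranked.index[:num_top_venues])
    return top + [None] * (num_top_venues - len(top))

# Made-up counts for a single postal code.
row = pd.Series({'Bar': 1, 'Yoga Studio': 0, 'Coffee Shop': 3, 'Park': 2})
top5 = top_categories(row, 5)
print(top5)
```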


__We will now cluster the neighbourhoods. In contrast to the labs, we did not compute the frequency of each venue category, but rather the total number of venues per category. It would therefore not be surprising if the clustering grouped neighbourhoods by total number of venues rather than by frequency of occurrence, especially if the number of clusters is small.__

__Because of this, I expect there to be several clusters which are central, and only one or two clusters, in which neighbourhoods with greater distance to the center are grouped together.__
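
__For comparison, counts can be turned into per-postal-code frequencies by dividing each row by its total. This is not done in this notebook; the sketch below only illustrates the difference on made-up counts:__

```python
import pandas as pd

# Made-up counts: a busy postal code and a quiet one.
counts = pd.DataFrame({'Coffee Shop': [8, 1], 'Park': [2, 1]},
                      index=['M5A', 'M3A'])
counts.index.name = 'Postal Code'

# Divide each row by its row total, so that every row sums to 1.
frequencies = counts.div(counts.sum(axis=1), axis=0)
print(frequencies)
```

__In the count table the two rows differ mostly in size, whereas the frequency table only reflects the mix of categories.__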

In [23]:
from sklearn.cluster import KMeans

In [24]:
venues_clustering = venues_per_category.reset_index().drop('Postal Code', axis = 1)

kmeans = KMeans(n_clusters = 5, random_state=0).fit(venues_clustering)

__We create a new dataframe, to be used for visualization purposes:__

In [25]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

df_merged = dataframe

df_merged = df_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')

# for the neighbourhoods for which no venues were found there are some NaN entries in the corresponding rows
# we drop these rows

df_merged.dropna(axis = 0, inplace = True)
df_merged.head(10)

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,1.0,Park,Food & Drink Shop,Event Space,Ethiopian Restaurant,Electronics Store
1,M4A,North York,Victoria Village,43.725882,-79.315572,1.0,Portuguese Restaurant,Intersection,French Restaurant,Coffee Shop,Pizza Place
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,4.0,Coffee Shop,Bakery,Café,Pub,Park
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,1.0,Clothing Store,Furniture / Home Store,Vietnamese Restaurant,Coffee Shop,Event Space
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,4.0,Coffee Shop,College Cafeteria,Gym,Hobby Shop,Beer Bar
6,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353,1.0,Fast Food Restaurant,Donut Shop,Dim Sum Restaurant,Diner,Discount Store
7,M3B,North York,Don Mills,43.745906,-79.352188,1.0,Gym,Caribbean Restaurant,Japanese Restaurant,Café,Yoga Studio
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.706397,-79.309937,1.0,Pizza Place,Bank,Gym / Fitness Center,Intersection,Café
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657162,-79.378937,3.0,Clothing Store,Coffee Shop,Café,Cosmetics Shop,Japanese Restaurant
10,M6B,North York,Glencairn,43.709577,-79.445073,1.0,Japanese Restaurant,Sushi Restaurant,Pub,Metro Station,Dessert Shop


__We now visualize the clusters using Folium:__

In [26]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [27]:
map_clusters = folium.Map(location = [43.6532, -79.3832], zoom_start = 10)

x = np.arange(5)
ys = [i + x + (i*x)**2 for i in range(5)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []
for lat, lon, poi, cluster in zip(df_merged['Latitude'], df_merged['Longitude'], df_merged['Postal Code'],
                                  df_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

__One observes that our intuition was correct: distance to center seems to be one of the distinguishing features between clusters.__

__Of course "distance to center" is just one out of many aspects to consider. Let's have a closer look at the clusters:__

In [28]:
df_merged.loc[df_merged['Cluster Labels'] == 0, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
25,Downtown Toronto,0.0,Grocery Store,Café,Park,Coffee Shop,Italian Restaurant
37,West Toronto,0.0,Bar,Coffee Shop,Asian Restaurant,Café,Vietnamese Restaurant
41,East Toronto,0.0,Greek Restaurant,Coffee Shop,Italian Restaurant,Restaurant,Furniture / Home Store
43,West Toronto,0.0,Café,Breakfast Spot,Coffee Shop,Burrito Place,Restaurant
54,East Toronto,0.0,Café,Coffee Shop,Bakery,Brewery,American Restaurant
55,North York,0.0,Sandwich Place,Restaurant,Italian Restaurant,Coffee Shop,Comfort Food Restaurant
59,North York,0.0,Ramen Restaurant,Café,Sandwich Place,Coffee Shop,Pizza Place
74,Central Toronto,0.0,Café,Sandwich Place,Coffee Shop,Liquor Store,Indian Restaurant
79,Central Toronto,0.0,Sandwich Place,Dessert Shop,Pizza Place,Sushi Restaurant,Gym
80,Downtown Toronto,0.0,Café,Bakery,Bar,Japanese Restaurant,Bookstore


In [29]:
df_merged.loc[df_merged['Cluster Labels'] == 1, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North York,1.0,Park,Food & Drink Shop,Event Space,Ethiopian Restaurant,Electronics Store
1,North York,1.0,Portuguese Restaurant,Intersection,French Restaurant,Coffee Shop,Pizza Place
3,North York,1.0,Clothing Store,Furniture / Home Store,Vietnamese Restaurant,Coffee Shop,Event Space
6,Scarborough,1.0,Fast Food Restaurant,Donut Shop,Dim Sum Restaurant,Diner,Discount Store
7,North York,1.0,Gym,Caribbean Restaurant,Japanese Restaurant,Café,Yoga Studio
8,East York,1.0,Pizza Place,Bank,Gym / Fitness Center,Intersection,Café
10,North York,1.0,Japanese Restaurant,Sushi Restaurant,Pub,Metro Station,Dessert Shop
12,Scarborough,1.0,Bar,Yoga Studio,Drugstore,Discount Store,Distribution Center
13,North York,1.0,Gym,Coffee Shop,Restaurant,Beer Store,Sporting Goods Shop
14,East York,1.0,Park,Skating Rink,Athletics & Sports,Beer Store,Curling Ice


In [30]:
df_merged.loc[df_merged['Cluster Labels'] == 2, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
30,Downtown Toronto,2.0,Coffee Shop,Café,Restaurant,Hotel,Gym
36,Downtown Toronto,2.0,Coffee Shop,Aquarium,Hotel,Café,Restaurant
42,Downtown Toronto,2.0,Coffee Shop,Hotel,Café,Restaurant,Salad Place
48,Downtown Toronto,2.0,Coffee Shop,Restaurant,Café,Hotel,Gym
97,Downtown Toronto,2.0,Coffee Shop,Café,Hotel,Restaurant,Gym


In [31]:
df_merged.loc[df_merged['Cluster Labels'] == 3, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
9,Downtown Toronto,3.0,Clothing Store,Coffee Shop,Café,Cosmetics Shop,Japanese Restaurant
33,North York,3.0,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Women's Store


In [32]:
df_merged.loc[df_merged['Cluster Labels'] == 4, df_merged.columns[[1] + list(range(5, df_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Downtown Toronto,4.0,Coffee Shop,Bakery,Café,Pub,Park
4,Downtown Toronto,4.0,Coffee Shop,College Cafeteria,Gym,Hobby Shop,Beer Bar
15,Downtown Toronto,4.0,Coffee Shop,Café,Cocktail Bar,American Restaurant,Restaurant
20,Downtown Toronto,4.0,Coffee Shop,Bakery,Cocktail Bar,Cheese Shop,Café
24,Downtown Toronto,4.0,Coffee Shop,Sandwich Place,Café,Italian Restaurant,Japanese Restaurant
84,Downtown Toronto,4.0,Café,Coffee Shop,Mexican Restaurant,Vietnamese Restaurant,Vegetarian / Vegan Restaurant
92,Downtown Toronto,4.0,Coffee Shop,Café,Restaurant,Beer Bar,Japanese Restaurant
99,Downtown Toronto,4.0,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant


__Cluster 0 contains neighbourhoods which are not central. The neighbourhoods in said cluster have a lot of coffee shops, cafés and restaurants.__

__Cluster 1 is the largest cluster. Its distinguishing property is that the neighbourhoods contained in it are relatively far from the center and have a low total number of venues. Since these neighbourhoods are non-central, it is not surprising that many of them have parks as their most common venue.__

__Cluster 2 is one of two clusters consisting exclusively of central neighbourhoods, the other being Cluster 4. Unsurprisingly, given the central location, the most common venue for the neighbourhoods in either cluster is "Coffee Shop". Cluster 2 seems to be more of a hotel area, whereas Cluster 4 seems to be more of a restaurant area. However, there is not enough evidence to draw that conclusion with any reasonable certainty.__

__Finally, Cluster 3 is the smallest cluster with only two neighbourhoods, one of them being central, the other one being quite far removed from the center. The neighbourhoods in this cluster seem to be connected to clothing and fashion.__