Segmenting and Clustering Neighborhoods in Toronto
==================================================
---

Task 1: Data Acquisition
------------------------
---

In this task we are going to import a table of postcodes and neighbourhoods in Toronto (Canada) from Wikipedia. For this, we will use pandas' ability to parse an HTML input and convert the tables in it into dataframes. Once we have the table we are interested in (as there are multiple tables in the Wikipedia page) converted into a dataframe, we will clean up the data, removing entries without valid data (no borough), filling gaps where appropriate (entries without an assigned neighbourhood), and merging duplicated entries (same postcode). We will finally sort the resulting dataframe by postcode and show the size of the final dataframe.

---

First of all, let's import the libraries that we are going to use.

We will use pandas' HTML parser backed by lxml, so please uncomment the pip line if needed. You may need to restart the kernel after running the pip command for the parsing to work.

In [1]:
import pandas as pd
import numpy as np

#!pip install lxml # Uncomment if needed; a kernel restart may be required after running this command.

---
We can now proceed with the HTML parsing. We will use pandas' `read_html` function, which returns a list of dataframes, one for each table found in the HTML input. Our input is [the Wikipedia page for the postcodes in Toronto (Canada)](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). For this task, we are only interested in the first table, so we discard the rest and store the resulting raw dataframe in `raw_data`.

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
raw_data = pd.read_html(url)[0]
raw_data

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor


---
Now we have the data stored in a local variable, but it is copied directly from the source web page and is not yet in a usable form. We need to clean up the dataframe as follows:
   * First, we need to remove the rows without an assigned borough, as those contain no useful data for us. We can easily do that by filtering out the rows with the string *Not assigned* in the 'Borough' column.
   * Then we can process the rows that have gaps in the data. Specifically, for the rows whose 'Neighbourhood' value is *Not assigned*, we will copy the value from the 'Borough' column into 'Neighbourhood'.

In [3]:
raw_data_without_unassigned = raw_data[raw_data.Borough != "Not assigned"].reset_index(drop=True)
raw_data_without_unassigned['Neighbourhood'] = np.where((raw_data_without_unassigned['Neighbourhood'] == "Not assigned"), 
                                                        raw_data_without_unassigned['Borough'], 
                                                        raw_data_without_unassigned['Neighbourhood'])
raw_data_without_unassigned

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,Lawrence Heights
4,M6A,North York,Lawrence Manor
...,...,...,...
205,M8Z,Etobicoke,Kingsway Park South West
206,M8Z,Etobicoke,Mimico NW
207,M8Z,Etobicoke,The Queensway West
208,M8Z,Etobicoke,Royal York South West
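
The two clean-up steps can be tried in isolation on a small table. The following is a minimal sketch, assuming a hand-made dataframe named `toy` whose values are illustrative only and are not taken from the scraped data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped table (illustrative values only).
toy = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M7A"],
    "Borough": ["Not assigned", "North York", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Step 1: drop the rows that have no borough.
cleaned = toy[toy["Borough"] != "Not assigned"].reset_index(drop=True)

# Step 2: where the neighbourhood is missing, fall back to the borough name.
cleaned["Neighbourhood"] = np.where(
    cleaned["Neighbourhood"] == "Not assigned",
    cleaned["Borough"],
    cleaned["Neighbourhood"],
)

print(cleaned)
```

Calling `reset_index(drop=True)` after the filter produces a fresh dataframe, so the column assignment in step 2 does not write into a slice of the original table.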


---
The next step in the clean-up process is to merge the entries that have the same Postcode and Borough. We will merge these entries by grouping them by 'Postcode', keeping the 'Borough' value of the first entry (all of them are the same, so we could have chosen any of them), and concatenating the values of 'Neighbourhood' with commas (using the `join` function).

In [4]:
raw_grouped = raw_data_without_unassigned.groupby('Postcode').agg({'Borough':'first', 'Neighbourhood': ', '.join,}).reset_index()
raw_grouped

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."
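
To see what the aggregation does to a single postcode, here is a minimal sketch on a three-row table. The dataframe `toy` is a hypothetical sample built for illustration; only the `groupby`/`agg` call mirrors the cell above.

```python
import pandas as pd

# Two rows share the postcode M6A; one row has its own postcode.
toy = pd.DataFrame({
    "Postcode": ["M6A", "M6A", "M3A"],
    "Borough": ["North York", "North York", "North York"],
    "Neighbourhood": ["Lawrence Heights", "Lawrence Manor", "Parkwoods"],
})

grouped = (
    toy.groupby("Postcode")
       .agg({"Borough": "first", "Neighbourhood": ", ".join})
       .reset_index()
)

print(grouped)
```

Note that `groupby` sorts the group keys by default, so the result already comes back ordered by postcode.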


---
Finally, and just to ensure increased readability, we will sort the dataframe using the 'Postcode' column. This will be the dataframe we will be using in the rest of the project.

In [5]:
boroughs_df = raw_grouped.sort_values(by='Postcode')
boroughs_df

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv..."
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ..."


--- 
As specified in the assignment requirements, we now use the `shape` attribute to show the size of the dataframe, which consists of 103 rows and 3 columns.

In [6]:
boroughs_df.shape

(103, 3)

---
---
Task 2: Adding Latitude and Longitude Coordinates to the Dataframe
------------------------------------------------------------------
---

In this task we are going to acquire the geographical coordinates of the Toronto postal codes that we got in Task 1. We can get this information using the Geocoder Python package. However, this package has proven hard to use because of its unreliability and recurring rate-limit issues. For this reason, we will obtain the geographical information from a CSV file provided by the instructors. Once we have the information in a dataframe, we will merge it with the dataframe from Task 1, so that all the information we need for the analysis in Task 3 is in a single dataframe.

---

First we create a dataframe using [the CSV file provided by the instructors](https://cocl.us/Geospatial_data). We can do this directly by using the `read_csv` function from pandas.

In [7]:
url_csv = "https://cocl.us/Geospatial_data"
latlong_df = pd.read_csv(url_csv)
latlong_df

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


---
In order to merge this dataframe with the one we generated in Task 1, we will rename the *Postal Code* column to *Postcode*, so both dataframes have the common column with the same title. Once this is done, we can invoke the `merge` method from pandas with both dataframes and the column to be used for merging, and that will generate the dataframe that we want.

In [8]:
latlong_df = latlong_df.rename(columns={"Postal Code": "Postcode"})
latlong_boroughs_df = pd.merge(boroughs_df, latlong_df, on="Postcode")

---
Finally, we display the resulting dataframe to confirm that it is what we expected.

In [9]:
latlong_boroughs_df

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, Martin Grove Gardens, Richv...",43.688905,-79.554724
101,M9V,Etobicoke,"Albion Gardens, Beaumond Heights, Humbergate, ...",43.739416,-79.588437
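
The default inner merge silently drops postcodes that have no coordinates. As a hedged sketch of how one might check for that, the example below uses two small hypothetical dataframes (`left`, `right`, and the postcode `M9Z` are made up for illustration) together with the `how`, `validate`, and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical inputs: "M9Z" has no matching coordinates on purpose.
left = pd.DataFrame({
    "Postcode": ["M1B", "M1C", "M9Z"],
    "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
})
right = pd.DataFrame({
    "Postal Code": ["M1B", "M1C"],
    "Latitude": [43.806686, 43.784535],
    "Longitude": [-79.194353, -79.160497],
}).rename(columns={"Postal Code": "Postcode"})

# A left merge keeps every postcode; validate raises if a key is duplicated,
# and indicator adds a "_merge" column describing where each row came from.
checked = pd.merge(left, right, on="Postcode", how="left",
                   validate="one_to_one", indicator=True)

unmatched = checked.loc[checked["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```

An empty `unmatched` list would indicate that every postcode found its coordinates.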


---
---
Task 3
------
---

For this task we will perform an analysis of the neighbourhoods in Toronto based on the most popular venues for each one as reported by the Foursquare API. We will reduce the data to those Boroughs that contain "Toronto" in the name, for simplification and validation purposes, and then we will obtain the most popular venues for each neighbourhood. Once we have this information, we will determine which types of venues are the most popular in each neighbourhood, and cluster the neighbourhoods based on this information. Finally, we will map the neighbourhoods again, colored by cluster, and analyze the clusters to draw conclusions about each cluster's characteristics.

---
First, we import the libraries and tools needed.

In [10]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform a JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

---
This is the Foursquare API information. Please fill in your own credentials.

In [11]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

---
Now we define several functions that will allow us to automatically acquire and process the data for all the neighbourhoods without cluttering the code that follows.

In [12]:
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:  # fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
    
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
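
`getNearbyVenues` indexes straight into the response and into `categories[0]`, so a response without groups or a venue without categories would raise an exception. The following is a minimal sketch of a more tolerant flattening step. The helper `extract_venue_rows` and the `sample_payload` dictionary are hypothetical; the payload shape is assumed from the keys that `getNearbyVenues` accesses, not from the API documentation.

```python
def extract_venue_rows(payload, name, lat, lng):
    """Flatten one explore-style response into tuples, tolerating missing parts."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((
            name, lat, lng,
            venue["name"],
            venue["location"]["lat"],
            venue["location"]["lng"],
            category,
        ))
    return rows


# Hypothetical payload with one categorised and one uncategorised venue.
sample_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Venue A",
                           "location": {"lat": 43.65, "lng": -79.38},
                           "categories": [{"name": "Coffee Shop"}]}},
                {"venue": {"name": "Venue B",
                           "location": {"lat": 43.66, "lng": -79.39},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venue_rows(sample_payload, "Sample Neighbourhood", 43.65, -79.38)
for row in rows:
    print(row)
```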

---
For simplification purposes, and so we can manually verify how the code is working, we will work only with Boroughs that contain 'Toronto' (4 Boroughs and 39 Neighbourhoods).

In [13]:
sample_df = latlong_boroughs_df[latlong_boroughs_df["Borough"].str.contains("Toronto")]
sample_df.shape

(39, 5)
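
As a small illustration of the filter, the sketch below applies the same `str.contains` test to a hypothetical four-row dataframe. Passing `regex=False` is an optional choice made here so that the pattern is treated as a plain substring.

```python
import pandas as pd

# Hypothetical sample of borough names.
toy = pd.DataFrame({
    "Borough": ["Downtown Toronto", "North York", "East Toronto", "Etobicoke"],
})

# Boolean mask: True where the borough name contains the substring "Toronto".
mask = toy["Borough"].str.contains("Toronto", regex=False)
subset = toy[mask]

print(subset)
```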

---
Let's plot the neighbourhoods on a map, so we can get an idea of their locations.

In [14]:
# Get Toronto lat and long
torontoaddr = 'Toronto, CA'

geolocator = Nominatim(user_agent="ca_explorer")
loctoronto = geolocator.geocode(torontoaddr)
lattoronto = loctoronto.latitude
longtoronto = loctoronto.longitude
# create map of Toronto using latitude and longitude values
map_toronto = folium.Map(location=[lattoronto, longtoronto], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(sample_df['Latitude'], sample_df['Longitude'], sample_df['Borough'], sample_df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

---
Now we retrieve up to 100 venues for each neighbourhood. If desired, we can also show how many venues were returned for each neighbourhood and how many unique categories were returned in total; those lines are commented out to keep the output clean.

In [15]:
toronto_venues = getNearbyVenues(names=sample_df['Neighbourhood'],
                                   latitudes=sample_df['Latitude'],
                                   longitudes=sample_df['Longitude'],
                                   limit=100
                                  )
# toronto_venues.groupby('Neighborhood').count() # Uncomment to show how many venues were returned for each neighbourhood
# print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique()))) # Uncomment to show how many unique categories were returned in total

---
In order to automate the processing of the retrieved data, we need to encode the venues' categories in numeric form. We will use one-hot encoding to convert the venue categories into indicator columns, and then average those columns per neighbourhood, which gives each category's share of the venues returned for that neighbourhood.

In [16]:
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe and move it to the first column
toronto_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 
neighborcolumn = toronto_onehot['Neighborhood']
toronto_onehot.drop(labels=['Neighborhood'], axis=1, inplace=True)
toronto_onehot.insert(0, 'Neighborhood', neighborcolumn)

# Average the indicator columns to get each category's share of the neighbourhood's venues
toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()
toronto_grouped.head()

Unnamed: 0,Neighborhood,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,"Adelaide, King, Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.0,...,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.01,0.0
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.017857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Brockton, Exhibition Place, Parkdale Village",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower, Bathurst Quay, Island airport, Harbo...",0.0,0.058824,0.058824,0.058824,0.117647,0.176471,0.117647,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
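
The encoding and averaging can be followed by hand on a tiny example. This is a minimal sketch, assuming a hypothetical `venues` table with two neighbourhoods named "A" and "B"; `dtype=float` is passed to `get_dummies` here so that the indicator columns are numeric regardless of the pandas version.

```python
import pandas as pd

# Hypothetical venues: four in neighbourhood A, two in neighbourhood B.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Park", "Pub", "Park", "Park"],
})

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])

# The mean of an indicator column is the share of venues in that category.
shares = onehot.groupby("Neighborhood").mean().reset_index()

print(shares)
```

Each row of `shares` sums to 1, which is why the values in the table above are fractions rather than counts.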


---
We can now cluster the neighbourhoods using K-means, which requires choosing the number of clusters in advance. We ran the algorithm with every cluster count from 1 to 39 and recorded the inertia of each run in this table:

| Number of Clusters | Inertia |
|--------------------|---------|
|1|2.9064918063369576|
|2|2.405424013587206|
|3|1.9650877199634893|
|4|1.748721010946518|
|5|1.6011134093367019|
|6|1.4275644639619056|
|7|1.2267236602317164|
|8|1.1417281931795038|
|9|1.0309227657625462|
|10|0.9349825299262835|
|11|0.8269675491450207|
|12|0.7800204958813879|
|13|0.7066244423690319|
|14|0.6376411561676196|
|15|0.5826582577870574|
|16|0.515300983211602|
|17|0.474301112630121|
|18|0.4179676687848779|
|19|0.3892142904593785|
|20|0.34893080297363993|
|21|0.30076658366325293|
|22|0.2681513744777553|
|23|0.2323479588775409|
|24|0.21492356332145585|
|25|0.17983206123825457|
|26|0.16268369832996318|
|27|0.1368184391323179|
|28|0.1153125884335763|
|29|0.10055354780018513|
|30|0.08489920241049459|
|31|0.06734153730018035|
|32|0.05033476939750917|
|33|0.03740449407936914|
|34|0.025335558390022675|
|35|0.01649389172335601|
|36|0.008993891723356007|
|37|0.0047333333333333324|
|38|0.0021000000000000003|
|39|0.0|

Based on this, 6 clusters seems to be a reasonable value to use: the inertia falls steadily with no sharp elbow, so we pick a value halfway between an artificially fine partitioning and too few clusters to reflect the differences between neighbourhoods.
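
The notebook does not include the code that produced the inertia table. A loop along the following lines could generate such a table; this is a sketch under stated assumptions, with a random `features` matrix standing in for the real per-neighbourhood data and `n_init=10` chosen here explicitly.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in feature matrix (12 samples, 5 features); not the notebook's data.
rng = np.random.default_rng(0)
features = rng.random((12, 5))

# Fit K-means for every possible cluster count and record the inertia,
# i.e. the sum of squared distances from each sample to its nearest centroid.
inertias = {}
for k in range(1, len(features) + 1):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 4))
```

When the number of clusters equals the number of samples, every sample is its own centroid and the inertia drops to zero, which matches the last row of the table above.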

In [18]:
kclusters = 6
toronto_clustered = toronto_grouped.drop(columns='Neighborhood')  # keep only the numeric feature columns

kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_clustered)

---
After the clustering has completed, we collect the 10 most popular venue types per neighbourhood, and then create a dataframe with all the information collected so far: coordinates, the 10 most popular venue types, and the cluster label.

In [21]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no special suffix beyond 'st', 'nd', 'rd'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
toronto_neigh_venues_sorted = pd.DataFrame(columns=columns)
toronto_neigh_venues_sorted['Neighborhood'] = toronto_grouped['Neighborhood']

for ind in np.arange(toronto_grouped.shape[0]):
    toronto_neigh_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)
    
    
toronto_neigh_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = sample_df
toronto_merged = toronto_merged.rename(columns={"Neighbourhood": "Neighborhood"})
toronto_merged = toronto_merged.join(toronto_neigh_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
toronto_merged.head() 

Unnamed: 0,Postcode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,M4E,East Toronto,The Beaches,43.676357,-79.293031,4,Pub,Trail,Health Food Store,Other Great Outdoors,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio
41,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,1,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Bookstore,Furniture / Home Store,Frozen Yogurt Shop,Fruit & Vegetable Store,Juice Bar,Liquor Store
42,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572,1,Pizza Place,Sandwich Place,Pub,Burger Joint,Burrito Place,Liquor Store,Park,Fish & Chips Shop,Steakhouse,Italian Restaurant
43,M4M,East Toronto,Studio District,43.659526,-79.340923,1,Café,Coffee Shop,Bakery,Brewery,Gastropub,American Restaurant,Italian Restaurant,Yoga Studio,Diner,Bar
44,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879,1,Photography Studio,Construction & Landscaping,Park,Swim School,Bus Line,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
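
The ranking step sorts every category column, including those with a frequency of zero, so a neighbourhood with only a few venues still gets ten entries. That is likely why the same low-ranked categories recur across several rows above. The sketch below shows the ranking on a hypothetical row and one possible way to keep only categories that actually occur; the names `row`, `top3`, and `top_present` are made up for illustration.

```python
import pandas as pd

# Hypothetical row of per-category shares for one neighbourhood.
row = pd.Series({
    "Neighborhood": "A",
    "Coffee Shop": 0.5,
    "Park": 0.3,
    "Pub": 0.2,
    "Gym": 0.0,
})

# Skip the name in position 0, then rank the categories by share.
frequencies = row.iloc[1:].astype(float)
top3 = frequencies.sort_values(ascending=False).index[:3].tolist()

# Optional variant: ignore categories with no venues before ranking.
nonzero = frequencies[frequencies > 0].sort_values(ascending=False)
top_present = nonzero.index[:10].tolist()

print(top3)
print(top_present)
```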


---
We can now generate a map with the neighbourhoods colored by the cluster they belong to. We add a black border to the markers to ease visibility when the marker color matches the map color.

In [22]:
toronto_map_clusters = folium.Map(location=[lattoronto, longtoronto], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='Black',
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(toronto_map_clusters)
       
toronto_map_clusters

We can see that Cluster #1 contains most of the neighbourhoods, while the other clusters contain a single neighbourhood each.
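
In the map cell, the `x` and `ys` arrays are only used for their length, so the palette can be built directly from the number of clusters. The following is a minimal sketch of that simplification; `palette` and `color_for` are hypothetical names, and indexing the palette by the label itself (rather than `cluster-1`) is a design choice made here.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

kclusters = 6

# Sample the rainbow colormap at kclusters evenly spaced points
# and convert each RGBA value to a hex string usable by folium.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, kclusters))]

# Map each cluster label (0 .. kclusters - 1) to its color.
color_for = {label: palette[label] for label in range(kclusters)}

print(color_for)
```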

---
We can now analyze each cluster individually and draw conclusions about what makes its neighbourhoods similar to each other and different from the rest.

#### Cluster #0

In [29]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 0, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
48,"Moore Park, Summerhill East",0,Playground,Park,Restaurant,Tennis Court,Concert Hall,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant


This cluster is special because of its mix of business areas and parks. We can see that there are plenty of playgrounds, parks, and event spaces, along with activity areas (tennis courts), as well as restaurants of several types, to support the business in the area.

#### Cluster #1

In [30]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 1, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
41,"The Danforth West, Riverdale",1,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Bookstore,Furniture / Home Store,Frozen Yogurt Shop,Fruit & Vegetable Store,Juice Bar,Liquor Store
42,"The Beaches West, India Bazaar",1,Pizza Place,Sandwich Place,Pub,Burger Joint,Burrito Place,Liquor Store,Park,Fish & Chips Shop,Steakhouse,Italian Restaurant
43,Studio District,1,Café,Coffee Shop,Bakery,Brewery,Gastropub,American Restaurant,Italian Restaurant,Yoga Studio,Diner,Bar
44,Lawrence Park,1,Photography Studio,Construction & Landscaping,Park,Swim School,Bus Line,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store
45,Davisville North,1,Gym,Park,Department Store,Food & Drink Shop,Sandwich Place,Hotel,Breakfast Spot,Electronics Store,Eastern European Restaurant,Ethiopian Restaurant
46,North Toronto West,1,Clothing Store,Coffee Shop,Yoga Studio,Gym / Fitness Center,Shoe Store,Salon / Barbershop,Restaurant,Pet Store,Park,Mexican Restaurant
47,Davisville,1,Dessert Shop,Pizza Place,Sandwich Place,Italian Restaurant,Gym,Sushi Restaurant,Coffee Shop,Café,Farmers Market,Discount Store
49,"Deer Park, Forest Hill SE, Rathnelly, South Hi...",1,Pub,Coffee Shop,Pizza Place,Light Rail Station,Sports Bar,Supermarket,Sushi Restaurant,Restaurant,Fried Chicken Joint,Liquor Store
51,"Cabbagetown, St. James Town",1,Park,Café,Restaurant,Coffee Shop,Italian Restaurant,Pub,Bakery,Pizza Place,Pet Store,Sandwich Place
52,Church and Wellesley,1,Coffee Shop,Japanese Restaurant,Sushi Restaurant,Gay Bar,Restaurant,Mediterranean Restaurant,Fast Food Restaurant,Gym,Hotel,Dance Studio


This cluster encompasses most of the neighbourhoods in our analysis. This makes sense as the Boroughs selected are in the downtown area of Toronto, which seems to be mainly a business area, with lots of restaurants, coffee shops, and establishments that are likely to be visited on the way to or from home, like garages, grocery stores, or gyms. This notion gets validated when we see that the airport neighbourhood, which is one of the most representative commute-heavy neighbourhoods, is included in this cluster.

#### Cluster #2

In [31]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 2, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
64,"Forest Hill North, Forest Hill West",2,Mexican Restaurant,Trail,Jewelry Store,Sushi Restaurant,Yoga Studio,Diner,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store


This cluster comprises a single neighbourhood. To understand why this neighbourhood is different from the rest we can look at the map. This neighbourhood is located between a park and a college. This creates a mix of recreational venue types (Trails, Event Spaces) and restaurants with greater variety than those in cluster #1.

#### Cluster #3

In [32]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 3, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
63,Roselawn,3,Pool,Garden,Yoga Studio,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


The neighbourhood in Cluster #3 is, out of all the neighbourhoods in the analysis, the only one that represents a residential area, which can be deduced from the presence of "Pools" as the most common venue in this area. Additionally, the presence of Gardens without Parks or Trails also seems to indicate a residential area.

#### Cluster #4

In [33]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 4, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
37,The Beaches,4,Pub,Trail,Health Food Store,Other Great Outdoors,Diner,Discount Store,Dog Run,Doner Restaurant,Donut Shop,Yoga Studio


This cluster is clearly special because of its recreational nature. We can see that there are plenty of trails, dog runs, and outdoors areas. We can also see plenty of pubs, which are likely to represent the nightlife near the beach area.

#### Cluster #5

In [34]:
toronto_merged.loc[toronto_merged['Cluster Labels'] == 5, toronto_merged.columns[[2] + list(range(5, toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
50,Rosedale,5,Park,Playground,Trail,Dessert Shop,Falafel Restaurant,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant


This cluster is special because of its mix of business areas and parks. The most common venues are parks, playgrounds, and trails, which suggests that this neighbourhood lies largely within one of the largest parks in the city, with restaurants and stores serving the people who attend activities in the park.