<h1><center> Segmenting and Clustering Neighborhoods in Toronto </center></h1>

**Name:** Heejoon Ahn

**Date:** February 6, 2021

This notebook contains the responses to all questions from the assignment. Each question has its own section and header.

## Question 1: 

Question 1: Use the Notebook to build the code to scrape the following Wikipedia page, 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' in order to obtain the data that is in the table of postal codes and to transform the data into a *pandas* dataframe like the one shown. 

First, import the libraries and then the data appropriately. 

In [None]:
# pip install geocoder # uncomment this line if geocoder is not installed yet

In [None]:
# pip install folium

In [32]:
# import libraries used for entire assignment
import pandas as pd
import numpy as np
import requests
import geocoder 
from geopy.geocoders import Nominatim

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

In [2]:
# import data appropriately
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wiki_url = requests.get(url)
wiki_data = pd.read_html(wiki_url.text)
wiki_data

[    Postal Code           Borough  \
 0           M1A      Not assigned   
 1           M2A      Not assigned   
 2           M3A        North York   
 3           M4A        North York   
 4           M5A  Downtown Toronto   
 ..          ...               ...   
 175         M5Z      Not assigned   
 176         M6Z      Not assigned   
 177         M7Z      Not assigned   
 178         M8Z         Etobicoke   
 179         M9Z      Not assigned   
 
                                          Neighbourhood  
 0                                         Not assigned  
 1                                         Not assigned  
 2                                            Parkwoods  
 3                                     Victoria Village  
 4                            Regent Park, Harbourfront  
 ..                                                 ...  
 175                                       Not assigned  
 176                                       Not assigned  
 177                

In [3]:
len(wiki_data), type(wiki_data)

(3, list)

## Question 2: 

* Only process the cells that have an assigned borough. Ignore cells with a borough that is **Not assigned**. 
* More than one neighborhood can exist in one postal code area. For example, in the table on the Wikipedia page, you will notice that M5A is listed twice and has two neighborhoods: **Harbourfront** and **Regent Park**. These two rows will be combined into one row with the neighborhoods separated with a comma as shown in **row 11** in the table. 
* If a cell has a borough but a **Not assigned** neighborhood, then the neighborhood will be the same as the borough.
* Clean your Notebook and add Markdown cells to explain your work and any assumptions you are making. 
* In the last cell of your notebook, use the **.shape** method to print the number of rows of your dataframe. 

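Next, the first table in the list is selected, and rows whose borough is "Not assigned" are dropped. As the output shows, the Wikipedia table already lists each postal code on a single row with its neighbourhoods separated by commas, so `groupby(['Postal Code']).head()` simply keeps every remaining row (it returns up to the first five rows of each group and does not aggregate anything).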

In [4]:
wiki_d1 = wiki_data[0]
wiki_d1 = wiki_d1[wiki_d1['Borough'] != "Not assigned"]
wiki_df1 = wiki_d1.groupby(['Postal Code']).head()
wiki_df1

Unnamed: 0,Postal Code,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
165,M4Y,Downtown Toronto,Church and Wellesley
168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."
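
If the source table did list a postal code on several rows, the rules from the question would need an explicit aggregation. The following is a minimal sketch on made-up rows, not the code used in this notebook; `clean_postal_codes` and the toy dataframe `raw` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy rows (assumed for illustration): M5A appears twice, M7A has no neighbourhood
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M7A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Queen's Park"],
    "Neighbourhood": ["Not assigned", "Harbourfront", "Regent Park", "Not assigned"],
})

def clean_postal_codes(df):
    # Keep only rows with an assigned borough
    out = df[df["Borough"] != "Not assigned"].copy()
    # A missing neighbourhood takes the name of its borough
    mask = out["Neighbourhood"] == "Not assigned"
    out.loc[mask, "Neighbourhood"] = out.loc[mask, "Borough"]
    # Collapse to one row per postal code, joining neighbourhoods with a comma
    return (out.groupby(["Postal Code", "Borough"], sort=False)["Neighbourhood"]
               .agg(", ".join)
               .reset_index())

cleaned = clean_postal_codes(raw)
print(cleaned)
```

With the table as scraped here, this step would leave the rows unchanged, since each postal code already appears once.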


In [5]:
# Check for number of records where Neighborhoods are "Not assigned"
wiki_df1.Neighbourhood.str.count("Not assigned").sum()

0

Since no neighbourhood has the value "Not assigned", we can move on to the next step: cleaning the dataframe. This means resetting the index and then removing the extra index column that the reset adds, since it is not needed for this assignment.

In [6]:
# resetting index of data - Cleaning of the dataframe one last time
wiki_df1 = wiki_df1.reset_index()
wiki_df1

Unnamed: 0,index,Postal Code,Borough,Neighbourhood
0,2,M3A,North York,Parkwoods
1,3,M4A,North York,Victoria Village
2,4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,5,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...,...
98,160,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,165,M4Y,Downtown Toronto,Church and Wellesley
100,168,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,169,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


Because we do not need the column 'index', we drop it in the next step of the cleaning process. 

In [7]:
# dropping the index column - Cleaning of the dataframe one last time
wiki_df1.drop(['index'], axis='columns', inplace=True)
wiki_df1

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C..."
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu..."


In [8]:
wiki_df1.shape

(103, 3)

Answer to Question: We have 103 rows and 3 columns in our dataframe. 

## Question 3: 

We will use the Geocoder Python package for this assignment: https://geocoder.readthedocs.io/index.html.

```python
# initialize your variable to None
lat_lng_coords = None

# loop until you get the coordinates
while(lat_lng_coords is None):
  g = geocoder.google('{}, Toronto, Ontario'.format(postal_code))
  lat_lng_coords = g.latlng

latitude = lat_lng_coords[0]
longitude = lat_lng_coords[1]
```

Because of the potential issues with the geocoder service mentioned in the assignment, the code shown above was not used. Instead, the coordinates were loaded from a CSV file of postal-code coordinates.
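
One of the issues with the loop above is that it never stops if the service keeps returning nothing. A bounded retry is one way around that. This is only a sketch: `geocode_with_retry` and `fake_lookup` are hypothetical names, and `lookup` stands in for any callable that returns `[lat, lng]` or `None`, so the example runs without a network call.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=5, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords[0], coords[1]
        time.sleep(delay_seconds)  # pause before the next attempt
    return None  # give up instead of looping forever

# Stand-in for a real geocoding call: fails twice, then succeeds
calls = {"n": 0}

def fake_lookup(query):
    calls["n"] += 1
    return [43.65, -79.38] if calls["n"] >= 3 else None

result = geocode_with_retry(fake_lookup, "M5A, Toronto, Ontario")
print(result)
```

In a real run, `lookup` could wrap the geocoder call from the assignment snippet and return its `latlng` attribute.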

In [9]:
data = pd.read_csv('https://cocl.us/Geospatial_data')
data

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476
...,...,...,...
98,M9N,43.706876,-79.518188
99,M9P,43.696319,-79.532242
100,M9R,43.688905,-79.554724
101,M9V,43.739416,-79.588437


In [10]:
# printing the shape of our dataframe
print("The shape of the wiki data was: ", wiki_df1.shape)
print("The shape of our csv data is: ", data.shape)

The shape of the wiki data was:  (103, 3)
The shape of our csv data is:  (103, 3)


The next step is to make sure we have a full understanding of the dataframes and the value types. 

In [11]:
wiki_df1.dtypes

Postal Code      object
Borough          object
Neighbourhood    object
dtype: object

In [12]:
data.dtypes

Postal Code     object
Latitude       float64
Longitude      float64
dtype: object

In [13]:
# time to combine and merge the dataframes
combined_df = wiki_df1.join(data.set_index('Postal Code'), on='Postal Code', how='inner')
combined_df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [14]:
combined_df.shape

(103, 5)

As expected, the combined dataframe has 103 rows.
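
The row count is one check; another option is to let pandas verify the join itself. The sketch below uses `DataFrame.merge` with `validate` and `indicator` on three rows copied from the tables above. It is an alternative to the `join` call used in this notebook, not a description of it, and the names `left`, `right`, `merged`, and `unmatched` are illustrative.

```python
import pandas as pd

left = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
})
right = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M4A"],
    "Latitude": [43.654260, 43.753259, 43.725882],
    "Longitude": [-79.360636, -79.329656, -79.315572],
})

# validate raises if a postal code is duplicated on either side;
# indicator adds a _merge column showing where each row came from
merged = left.merge(right, on="Postal Code", how="left",
                    validate="one_to_one", indicator=True)

# Rows without coordinates would show up here
unmatched = merged[merged["_merge"] != "both"]
print(len(unmatched))
```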

## Question 4: 

Explore and cluster the neighborhoods in Toronto. You can decide to work with only boroughs that contain the word Toronto and then replicate the same analysis we did to the New York City Data. It is up to you. 

Just make sure:
1. to add enough Markdown cells to explain what you decided to do and to report any observations you make. 
2. to generate maps to visualize your neighborhoods and how they cluster together. 

Once you are happy with your analysis, submit a link to the new Notebook on your Github repository. 

**My work**

I follow the approach from my earlier work with the New York City data and apply it to Toronto. The first step is to find the general coordinates of Toronto.

In [15]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("The appropriate coordinates of Toronto are {}, {}.".format(latitude, longitude))

The appropriate coordinates of Toronto are 43.6534817, -79.3839347.


The next step maps Toronto for an overall view using the folium library, as in the New York City tutorial.

In [17]:
# Creating the map of Toronto, Canada
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers
# use lat/lng as loop names so the Toronto centre stored in latitude/longitude is not overwritten
for lat, lng, borough, neighbourhood in zip(combined_df["Latitude"], combined_df['Longitude'], combined_df['Borough'],
                                            combined_df['Neighbourhood']):
    label='{}, {}'.format(neighbourhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng], 
        radius = 5, 
        popup = label, 
        color="red", 
        fill=True).add_to(map_Toronto)
    
# visualize map
map_Toronto

Once the map has been visualized, it is time to initialize the Foursquare API credentials, as done in the New York City example. The client ID and secret have been removed from the final version saved on GitHub because they are sensitive credentials. The outputs from the original run are kept to help answer the question.

In [18]:
CLIENT_ID = 'your Foursquare ID' # your Foursquare ID
CLIENT_SECRET = 'your Foursquare Secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: [redacted]
CLIENT_SECRET:[redacted]


With the credentials in place, a function was written to retrieve the venues near each neighbourhood in Toronto.

In [19]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
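
The function above indexes straight into `["response"]['groups'][0]['items']` and `['categories'][0]`, so a response with no groups or a venue with no category would raise an error. A more defensive parsing step might look like the sketch below. `parse_explore_items` is a hypothetical helper, and the payload is hand-written to mirror only the keys the function above reads; it is not a real API response.

```python
def parse_explore_items(payload, name, lat, lng):
    """Turn one explore-style payload into (neighbourhood, lat, lng, venue, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        # Fall back to None when a venue has no category listed
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"), category))
    return rows

# Hand-written payload: one complete venue and one without categories
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park", "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unnamed", "categories": []}},
]}]}}

rows = parse_explore_items(payload, "Parkwoods", 43.753259, -79.329656)
print(rows)
```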

Now we collect the venues in Toronto for each Neighborhood and output them: 

In [21]:
venues_Toronto = getNearbyVenues(combined_df['Neighbourhood'], combined_df['Latitude'], combined_df['Longitude'])

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue, Humber Valley Village
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto, Broadview North (Old East York)
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmo

In [22]:
venues_Toronto.shape

(1335, 5)

The list of all venues in Toronto has 1335 records and 5 columns. Next, we check a sample of the data.

In [23]:
venues_Toronto.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,Coffee Shop


In [24]:
# check the venues based on Neighborhood
venues_Toronto.groupby('Neighbourhood').head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,Park
1,Parkwoods,43.753259,-79.329656,Variety Store,Food & Drink Shop
2,Victoria Village,43.725882,-79.315572,Victoria Village Arena,Hockey Arena
3,Victoria Village,43.725882,-79.315572,Portugril,Portuguese Restaurant
4,Victoria Village,43.725882,-79.315572,Tim Hortons,Coffee Shop
...,...,...,...,...,...
1321,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium,Wings Joint
1322,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,South St. Burger,Burger Joint
1323,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Dollarama,Discount Store
1324,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Healthy Planet,Supplement Shop


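`groupby('Neighbourhood').head()` returns up to the first five venues for each neighbourhood, which gives 417 records in total (not 417 per neighbourhood). Next, we group by venue category; the result has one row per distinct category, so its row count tells us how many categories there are.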

In [25]:
venues_Toronto.groupby('Venue Category').max()

Unnamed: 0_level_0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue
Venue Category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Accessories Store,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,Ardene Shoes Outlet
Adult Boutique,Church and Wellesley,43.665860,-79.383160,Seduction
Airport,Downsview,43.737473,-79.394420,Toronto Downsview Airport (YZD)
Airport Food Court,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Billy Bishop Café
Airport Gate,"CN Tower, King and Spadina, Railway Lands, Har...",43.628947,-79.394420,Gate 8
...,...,...,...,...
Warehouse Store,Thorncliffe Park,43.705369,-79.349372,Costco
Wine Bar,"Toronto Dominion Centre, Design Exchange",43.653206,-79.379817,The National Club
Wings Joint,"Mimico NW, The Queensway West, South of Bloor,...",43.628841,-79.520999,Wingporium
Women's Store,"Lawrence Manor, Lawrence Heights",43.733283,-79.419750,Want Boutique


From the dimensions of the dataframe, we can see that there are 239 different venue categories. The next few steps prepare the data for clustering, following the New York City example.
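
The same counts can be read off more directly than from the dimensions of a grouped dataframe. The sketch below uses the five sample rows printed by `venues_Toronto.head()` above; `sample`, `n_categories`, and `venues_per_neighbourhood` are names chosen for illustration.

```python
import pandas as pd

# The five rows shown by venues_Toronto.head() above
sample = pd.DataFrame({
    "Neighbourhood": ["Parkwoods", "Parkwoods", "Victoria Village",
                      "Victoria Village", "Victoria Village"],
    "Venue Category": ["Park", "Food & Drink Shop", "Hockey Arena",
                       "Portuguese Restaurant", "Coffee Shop"],
})

# Number of distinct venue categories
n_categories = sample["Venue Category"].nunique()

# Number of venues returned for each neighbourhood
venues_per_neighbourhood = sample.groupby("Neighbourhood").size()

print(n_categories)
print(venues_per_neighbourhood)
```

Applied to the full `venues_Toronto` dataframe, `nunique()` should agree with the row count of the grouped output.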

In [27]:
# looking at one hot encoded venue categories
Toronto_VC = pd.get_dummies(venues_Toronto[['Venue Category']], prefix="", prefix_sep="")
Toronto_VC

Unnamed: 0,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1330,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1331,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1332,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1333,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
# Look at adding the neighborhood to the encoded dataframe
Toronto_VC['Neighbourhood'] = venues_Toronto['Neighbourhood']

# move the neighborhood column to first column
fixed_columns = [Toronto_VC.columns[-1]] + list(Toronto_VC.columns[:-1])
Toronto_VC = Toronto_VC[fixed_columns]
Toronto_VC.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Parkwoods,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Victoria Village,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


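To organize the data further, we take the mean of each one-hot column within each neighbourhood. Because the columns hold only 0s and 1s, each mean is the share of that neighbourhood's venues that fall in the category.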

In [29]:
t_grouped = Toronto_VC.groupby('Neighbourhood').mean().reset_index()
t_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0
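
To see why the mean of one-hot columns is a relative frequency, here is a small worked example on made-up venues. The neighbourhood names `A` and `B` and the variable names are illustrative, and `dtype=float` is passed to `get_dummies` only so the result does not depend on the pandas version's default dtype.

```python
import pandas as pd

# Toy venues: neighbourhood A has two parks and a cafe, B has one cafe
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B"],
    "Venue Category": ["Park", "Park", "Cafe", "Cafe"],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# Mean of 0/1 values = fraction of venues in each category
freq = onehot.groupby("Neighbourhood").mean()
print(freq)
```

For neighbourhood `A`, two of three venues are parks, so the `Park` column averages to about 0.667. The value 0.043478 in the table above is consistent with one venue out of 23.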


With the data prepared, we can now write the function that retrieves the most common venue categories for each neighbourhood.

In [31]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

In [37]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = t_grouped['Neighbourhood']

for ind in np.arange(t_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(t_grouped.iloc[ind,:], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Breakfast Spot,Latin American Restaurant,Skating Rink,Yoga Studio,Dim Sum Restaurant,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore
1,"Alderwood, Long Branch",Pizza Place,Gym,Coffee Shop,Pub,Skating Rink,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
2,"Bathurst Manor, Wilson Heights, Downsview North",Coffee Shop,Bank,Park,Ice Cream Shop,Diner,Shopping Mall,Sandwich Place,Deli / Bodega,Middle Eastern Restaurant,Mobile Phone Shop
3,Bayview Village,Café,Bank,Chinese Restaurant,Japanese Restaurant,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore
4,"Bedford Park, Lawrence Manor East",Italian Restaurant,Sandwich Place,Coffee Shop,Sushi Restaurant,Restaurant,Café,Indian Restaurant,Butcher,Juice Bar,Thai Restaurant
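
The `indicators` list with a fallback works for ten columns, but it would label an 11th or 21st column incorrectly if `num_top_venues` were raised. A small helper is one alternative. This is a sketch; `ordinal` is a hypothetical function, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # covers 11th, 12th, 13th
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

# Column names built the same way as in the cell above
num_top = 10
columns = ["Neighbourhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top + 1)
]
print(columns)
```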


With this work done, we now cluster the neighbourhoods.

In [38]:
# set the number of clusters
k_num_clusters = 5

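# drop the name column so only numeric features remain; axis is passed by keyword for newer pandas versions
t_clustering = t_grouped.drop('Neighbourhood', axis=1)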

# run k-means clustering
kmeans = KMeans(n_clusters=k_num_clusters, random_state=0).fit(t_clustering)
kmeans

KMeans(n_clusters=5, random_state=0)

In [39]:
# Check labelling of model
kmeans.labels_[0:100]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
       2, 0, 2, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 2,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
       2, 0, 2, 0, 0, 0, 0, 2], dtype=int32)
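
The label array is dominated by 0, and counting the labels makes that explicit. The sketch below copies the labels printed above into a NumPy array and counts them with `np.bincount`; the names `labels` and `counts` are illustrative.

```python
import numpy as np

# Labels copied from the kmeans.labels_ output above
labels = np.array([
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
    2, 0, 2, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 3, 0, 0, 2, 0, 0, 0, 2,
    0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
    2, 0, 2, 0, 0, 0, 0, 2,
])

# counts[i] is the number of neighbourhoods assigned to cluster i
counts = np.bincount(labels, minlength=5)
print(counts)
```

Most neighbourhoods fall in cluster 0, and two clusters hold a single neighbourhood each, which may be worth keeping in mind when reading the cluster tables further down.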

Now we add the cluster label column to the dataframe of the 10 most common venue categories. Then we join that dataframe (`neighborhoods_venues_sorted`) to `combined_df` on the neighbourhood name, so that each neighbourhood has its latitude and longitude for plotting.

In [40]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# combining data
toronto_merged = combined_df
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Park,Food & Drink Shop,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Hockey Arena,Pizza Place,Portuguese Restaurant,Coffee Shop,Intersection,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636,0.0,Coffee Shop,Bakery,Park,Theater,Breakfast Spot,French Restaurant,Restaurant,Pub,Performing Arts Venue,Historic Site
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Clothing Store,Accessories Store,Sporting Goods Shop,Boutique,Miscellaneous Shop,Furniture / Home Store,Carpet Store,Event Space,Women's Store,Coffee Shop
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0.0,Coffee Shop,Sushi Restaurant,Bar,Beer Bar,Smoothie Shop,Burrito Place,Sandwich Place,Café,Portuguese Restaurant,College Auditorium
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944,2.0,Pool,Park,River,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160,0.0,Sushi Restaurant,Mexican Restaurant,Café,Ice Cream Shop,Indian Restaurant,Dessert Shop,Men's Store,Restaurant,Japanese Restaurant,Beer Bar
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558,0.0,Gym / Fitness Center,Auto Workshop,Park,Comic Shop,Recording Studio,Restaurant,Burrito Place,Skate Park,Brewery,Spa
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509,3.0,Baseball Field,Yoga Studio,Farmers Market,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run


In [41]:
# Drop all NaN values
toronto_merged_noNA = toronto_merged.dropna(subset=['Cluster Labels'])
toronto_merged_noNA.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M3A,North York,Parkwoods,43.753259,-79.329656,2.0,Park,Food & Drink Shop,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
1,M4A,North York,Victoria Village,43.725882,-79.315572,0.0,Hockey Arena,Pizza Place,Portuguese Restaurant,Coffee Shop,Intersection,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.65426,-79.360636,0.0,Coffee Shop,Bakery,Park,Theater,Breakfast Spot,French Restaurant,Restaurant,Pub,Performing Arts Venue,Historic Site
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763,0.0,Clothing Store,Accessories Store,Sporting Goods Shop,Boutique,Miscellaneous Shop,Furniture / Home Store,Carpet Store,Event Space,Women's Store,Coffee Shop
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494,0.0,Coffee Shop,Sushi Restaurant,Bar,Beer Bar,Smoothie Shop,Burrito Place,Sandwich Place,Café,Portuguese Restaurant,College Auditorium
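
The cluster labels show as `2.0`, `0.0`, and so on because the join introduces missing values for neighbourhoods with no venues, which forces the column to float. After dropping those rows, the labels can be cast back to integers. This is a sketch on made-up rows; `merged` and `clean` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy result of a left join: neighbourhood B got no cluster label
merged = pd.DataFrame({
    "Neighbourhood": ["A", "B", "C"],
    "Cluster Labels": [2.0, np.nan, 0.0],
})

# Drop rows without a label, then restore the integer type
clean = merged.dropna(subset=["Cluster Labels"]).copy()
clean["Cluster Labels"] = clean["Cluster Labels"].astype(int)
print(clean)
```

With integer labels, the `int(cluster)` conversions in the plotting code would no longer be needed.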


With the rows that have no cluster label removed, we plot the clusters on the map of Toronto.

In [43]:
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for clusters
x = np.arange(k_num_clusters)
ys = [i + x + (i*x)**2 for i in range(k_num_clusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to map
markers_colors=[]
for lat,lon,poi,cluster in zip(toronto_merged_noNA['Latitude'], toronto_merged_noNA['Longitude'], 
                               toronto_merged_noNA['Neighbourhood'], toronto_merged_noNA['Cluster Labels']):
    label = folium.Popup('Cluster ' + str(int(cluster)+1)+'\n'+str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon], 
        radius=5, 
        popup=label, 
        color=rainbow[int(cluster)], 
        fill=True, 
        fill_color=rainbow[int(cluster)]).add_to(map_clusters)
    
# visualize map
map_clusters

Verifying our clusters below:

In [44]:
# cluster 1
toronto_merged_noNA.loc[toronto_merged_noNA['Cluster Labels'] == 0, toronto_merged_noNA.columns[[1] + list(range(5, toronto_merged_noNA.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,North York,0.0,Hockey Arena,Pizza Place,Portuguese Restaurant,Coffee Shop,Intersection,Distribution Center,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store
2,Downtown Toronto,0.0,Coffee Shop,Bakery,Park,Theater,Breakfast Spot,French Restaurant,Restaurant,Pub,Performing Arts Venue,Historic Site
3,North York,0.0,Clothing Store,Accessories Store,Sporting Goods Shop,Boutique,Miscellaneous Shop,Furniture / Home Store,Carpet Store,Event Space,Women's Store,Coffee Shop
4,Downtown Toronto,0.0,Coffee Shop,Sushi Restaurant,Bar,Beer Bar,Smoothie Shop,Burrito Place,Sandwich Place,Café,Portuguese Restaurant,College Auditorium
6,Scarborough,0.0,Fast Food Restaurant,Print Shop,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
...,...,...,...,...,...,...,...,...,...,...,...,...
96,Downtown Toronto,0.0,Restaurant,Bakery,Café,Italian Restaurant,Coffee Shop,General Entertainment,Japanese Restaurant,Caribbean Restaurant,Bank,Jewelry Store
97,Downtown Toronto,0.0,Café,Coffee Shop,Restaurant,Hotel,Seafood Restaurant,Gym / Fitness Center,Tea Room,Gastropub,Gym,Concert Hall
99,Downtown Toronto,0.0,Sushi Restaurant,Mexican Restaurant,Café,Ice Cream Shop,Indian Restaurant,Dessert Shop,Men's Store,Restaurant,Japanese Restaurant,Beer Bar
100,East Toronto,0.0,Gym / Fitness Center,Auto Workshop,Park,Comic Shop,Recording Studio,Restaurant,Burrito Place,Skate Park,Brewery,Spa


In [45]:
# cluster 2
toronto_merged_noNA.loc[toronto_merged_noNA['Cluster Labels'] == 1, toronto_merged_noNA.columns[[1] + list(range(5, toronto_merged_noNA.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Scarborough,1.0,Bar,Yoga Studio,Farmers Market,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run


In [46]:
# cluster 3 
toronto_merged_noNA.loc[toronto_merged_noNA['Cluster Labels'] == 2, toronto_merged_noNA.columns[[1] + list(range(5, toronto_merged_noNA.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,North York,2.0,Park,Food & Drink Shop,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
21,York,2.0,Park,Women's Store,Pool,Yoga Studio,Dog Run,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Distribution Center
35,East York,2.0,Park,Metro Station,Convenience Store,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop
52,North York,2.0,Park,Yoga Studio,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
61,Central Toronto,2.0,Park,Swim School,Bus Line,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop
64,York,2.0,Convenience Store,Park,Jewelry Store,Yoga Studio,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop
66,North York,2.0,Convenience Store,Park,Yoga Studio,Department Store,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop
77,Etobicoke,2.0,Sandwich Place,Park,Mobile Phone Shop,Deli / Bodega,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run,Distribution Center
91,Downtown Toronto,2.0,Park,Playground,Trail,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
98,Etobicoke,2.0,Pool,Park,River,Yoga Studio,Deli / Bodega,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run


In [47]:
# cluster 4
toronto_merged_noNA.loc[toronto_merged_noNA['Cluster Labels'] == 3, toronto_merged_noNA.columns[[1] + list(range(5, toronto_merged_noNA.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
57,North York,3.0,Baseball Field,Paper / Office Supplies Store,Yoga Studio,Dessert Shop,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop
101,Etobicoke,3.0,Baseball Field,Yoga Studio,Farmers Market,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run


In [48]:
# cluster 5
toronto_merged_noNA.loc[toronto_merged_noNA['Cluster Labels'] == 4, toronto_merged_noNA.columns[[1] + list(range(5, toronto_merged_noNA.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
83,Central Toronto,4.0,Summer Camp,Restaurant,Yoga Studio,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Drugstore,Donut Shop,Dog Run
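
To compare the clusters at a glance, one option is to look at the most frequent "1st Most Common Venue" within each cluster. The sketch below uses the values shown in the tables for the third and fourth clusters (labels 2 and 3); `sample` and `top_first_venue` are illustrative names.

```python
import pandas as pd

# First-ranked venue categories copied from the tables for labels 2 and 3
sample = pd.DataFrame({
    "Cluster Labels": [2] * 10 + [3] * 2,
    "1st Most Common Venue": [
        "Park", "Park", "Park", "Park", "Park",
        "Convenience Store", "Convenience Store", "Sandwich Place",
        "Park", "Pool",
        "Baseball Field", "Baseball Field",
    ],
})

# Most frequent first-ranked category within each cluster
top_first_venue = (
    sample.groupby("Cluster Labels")["1st Most Common Venue"]
          .agg(lambda s: s.value_counts().idxmax())
)
print(top_first_venue)
```

On these rows, the cluster with label 2 is led by parks and the cluster with label 3 by baseball fields, which matches what the tables show.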


We have finished the clustering of Toronto neighbourhoods based on our venue categories!