First, I will import the libraries and packages needed for this assignment. 

In [2]:
#to scrape data from Wiki
from bs4 import BeautifulSoup

import numpy as np  # library for numerical data
import pandas as pd  # library for data analysis

#to handle json files
import json

#to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

#to handle requests to webpages
import requests

# transform JSON file into a pandas dataframe
from pandas.io.json import json_normalize

#for plotting
import matplotlib.cm as cm
import matplotlib.colors as colors

#for clustering
from sklearn.cluster import KMeans

!pip -q install geocoder
# import time
import geocoder

# map rendering library
!conda install -c conda-forge folium=0.5.0 --yes
import folium 

print('All good.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.4.5.1         |   py36h9f0ad1d_0         151 KB  conda-forge
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    geopy-1.22.0               |     pyh9f0ad1d_0          63 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0           conda-forge
    geopy:          

Next, I will extract the table from the wikipedia page, and store it in my dataframe df.

In [3]:
#get wiki page
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
wikipedia_page = requests.get(wikipedia_link)

#extracts table from wiki page into df
df=pd.read_html(wikipedia_link, header=0)[0]
df

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,
1,M2A,Not assigned,
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"
7,M8A,Not assigned,
8,M9A,Etobicoke,Islington Avenue
9,M1B,Scarborough,"Malvern, Rouge"


Great! Now I have the df, but it needs some cleaning up:

1. Remove rows where the Borough is "Not assigned".
2. Where more than one neighbourhood exists in a postal code area, combine those neighbourhoods into one row, separated by commas.
3. If a row has a borough but a "Not assigned" neighbourhood, use the borough name as the neighbourhood.
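
As a rough sketch of the three steps above, here is how they could look on a tiny toy table. The rows below are made up for illustration (they are not the scraped data), and the variable names `raw` and `clean` are my own.

```python
import pandas as pd

# Toy rows in the same shape as the scraped table (values are illustrative).
raw = pd.DataFrame({
    "Postal Code": ["M1A", "M5A", "M5A", "M9A"],
    "Borough": ["Not assigned", "Downtown Toronto", "Downtown Toronto", "Etobicoke"],
    "Neighborhood": ["Not assigned", "Regent Park", "Harbourfront", "Not assigned"],
})

# Step 1: drop rows whose borough is not assigned.
clean = raw[raw["Borough"] != "Not assigned"].copy()

# Step 3: fall back to the borough name when the neighbourhood is not assigned.
missing = clean["Neighborhood"] == "Not assigned"
clean.loc[missing, "Neighborhood"] = clean.loc[missing, "Borough"]

# Step 2: combine neighbourhoods that share a postal code into one row.
clean = (
    clean.groupby(["Postal Code", "Borough"])["Neighborhood"]
    .agg(", ".join)
    .reset_index()
)
print(clean)
```

Step 3 is done before step 2 here so that the placeholder text never ends up inside a joined string.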

In [4]:
#remove rows where Borough is "Not assigned" and renumber the index
df = df[df.Borough != "Not assigned"].reset_index(drop=True)

In [7]:
df.shape
df.head(5)

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,"Regent Park, Harbourfront"
3,M6A,North York,"Lawrence Manor, Lawrence Heights"
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


The table on the wiki page already groups neighbourhoods that share a postal code into a single row, so step 2 needed no extra processing.

END OF PART 1

PART 2:

We use the geocoder to get the coordinates of each postal code. 

First, we define the function to do so, then we run it through every row. 

In [12]:
import geocoder

def get_latlng(postal_code):
    
    # Initialize the location (lat. and long.) to None
    lat_lng_coords = None
    
    # Keep querying until the geocoder returns coordinates for this postal code
    # (note: this loops forever if the geocoder never returns a result)
    while lat_lng_coords is None:
        g = geocoder.arcgis('{}, Toronto, Ontario'.format(postal_code))
        lat_lng_coords = g.latlng
    return lat_lng_coords
# Geocoder ends here
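
The while loop above keeps going until the geocoder answers, which could hang on a postal code that never resolves. A minimal sketch of a bounded retry is shown below. The names `geocode_with_retry` and `flaky_lookup` are hypothetical, and the lookup is a stand-in function so the example runs without any network access.

```python
import time

def geocode_with_retry(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        coords = lookup(query)
        if coords is not None:
            return coords
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # short pause before the next try
    return None  # give up instead of looping forever

# Stand-in for a real geocoder call: fails twice, then succeeds (hypothetical).
calls = {"count": 0}

def flaky_lookup(query):
    calls["count"] += 1
    return None if calls["count"] < 3 else [43.744, -79.581]

print(geocode_with_retry(flaky_lookup, "M9V, Toronto, Ontario"))
print(geocode_with_retry(lambda q: None, "unknown"))
```

In this notebook the lookup could be something like `lambda q: geocoder.arcgis(q).latlng`, which is the same call used in get_latlng.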

In [9]:
sample = get_latlng('M9V')
sample
#geocoder works fine

[43.74405485200003, -79.58120294599996]

In [10]:
postal_codes = df['Postal Code']    
coordinates = [get_latlng(postal_code) for postal_code in postal_codes.tolist()]

In [13]:
df_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
df['Latitude'] = df_coordinates['Latitude']
df['Longitude'] = df_coordinates['Longitude']
df.head(5)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.752935,-79.335641
1,M4A,North York,Victoria Village,43.728102,-79.31189
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939


END OF PART 2.

PART 3:

I will start off with getting the coordinates of Toronto, Canada.

In [14]:
address = 'Toronto, Ontario'

geolocator = Nominatim(user_agent="t_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


Here's a map of Toronto with the neighbourhoods superimposed on top. 

In [15]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  

map_toronto

Wow, that actually worked. 

Next, I will utilize the Foursquare API to explore the neighborhoods and segment them.

To do so, I am going to call in my Foursquare credentials. 

In [16]:
CLIENT_ID = 'X1SHAI0SFP4GHKRBKDKO4IN55TYNZHIF0EV3EEKZZ1PFHUTF' # your Foursquare ID
CLIENT_SECRET = 'AYTB1WQIT3TEO5JFHFHOPDUNQYPW4MXYFJ20FX4U230OM1S0' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: X1SHAI0SFP4GHKRBKDKO4IN55TYNZHIF0EV3EEKZZ1PFHUTF
CLIENT_SECRET:AYTB1WQIT3TEO5JFHFHOPDUNQYPW4MXYFJ20FX4U230OM1S0


Let's start with exploring the very first neighbourhood in our df.

In [17]:
neighborhood_name = df.loc[0, 'Neighborhood'] # neighborhood name
neighborhood_latitude = df.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[0, 'Longitude'] # neighborhood longitude value


print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Parkwoods are 43.75293455500008, -79.33564142299997.


Hmmm... I wonder what the top 50 venues within 500 m of this neighbourhood are.

In [18]:
radius = 500
LIMIT = 50
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?&client_id=X1SHAI0SFP4GHKRBKDKO4IN55TYNZHIF0EV3EEKZZ1PFHUTF&client_secret=AYTB1WQIT3TEO5JFHFHOPDUNQYPW4MXYFJ20FX4U230OM1S0&v=20180605&ll=43.75293455500008,-79.33564142299997&radius=500&limit=50'
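
As an alternative to formatting the query string by hand, the URL could be assembled from a dictionary of parameters. This is only a sketch: `build_explore_url` is a hypothetical helper, and the credentials are placeholders rather than real keys.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL with an encoded query string."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials (assumption: real ones would be loaded elsewhere).
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 43.7529, -79.3356, 500, 50)
print(url)
```

urlencode percent-encodes the comma in the ll value, which is still a valid query string. Keeping the parameters in a dictionary also makes it easier to avoid printing the secret along with the URL.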

In [19]:
#get the results
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5ec8b9455fb726001be8dd91'},
  'headerLocation': 'Sunnybrook - York Mills',
  'headerFullLocation': 'Sunnybrook - York Mills, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 2,
  'suggestedBounds': {'ne': {'lat': 43.75743455950008,
    'lng': -79.32942319651914},
   'sw': {'lat': 43.74843455050008, 'lng': -79.3418596494808}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4e8d9dcdd5fbbbb6b3003c7b',
       'name': 'Brookbanks Park',
       'location': {'address': 'Toronto',
        'lat': 43.751976046055574,
        'lng': -79.33214044722958,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.751976046055574,
          'lng': -79.33214044722958}],
        'distance': 301,
        'cc': 'CA',


Seems like there aren't that many venues in this neighbourhood, but that's ok. Let's see what categories these venues belong to. 

In [20]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now, I shall clean the JSON and structure it into a pandas dataframe.

In [21]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Brookbanks Park,Park,43.751976,-79.33214
1,Variety Store,Food & Drink Shop,43.751974,-79.333114
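
To make the flattening step less of a black box, here is a small pure-Python sketch of the idea behind it: nested dictionaries become a single row whose keys are joined with dots, which is where column names like venue.location.lat come from. The `flatten` helper is my own illustration, not the pandas implementation, and the sample item is a trimmed, illustrative version of the response above.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are kept as they are
    return flat

# One item shaped like the Foursquare output (trimmed, values illustrative).
item = {
    "venue": {
        "name": "Brookbanks Park",
        "location": {"lat": 43.751976, "lng": -79.33214},
        "categories": [{"name": "Park"}],
    }
}
row = flatten(item)
print(row)
```

The categories value stays a list after flattening, which is why get_category_type is still needed to pull out the first category name.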


In [58]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

2 venues were returned by Foursquare.


Seems like Foursquare only returns 2 venues within 500 m of the Parkwoods coordinates. Not the most exciting place to stay, I guess.

Let's have a look at the venues within 1000 m of every neighbourhood in Toronto.

In [22]:
#define function to get nearby venues (centred on each neighbourhood) with a default radius of 1000m

def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
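
The function above indexes straight into the response, so it would fail on a venue with an empty categories list or on a response without any groups. Below is a hedged sketch of more defensive parsing. `items_from_response` and `extract_rows` are hypothetical helpers, and the payload is a hand-written sample rather than a real API response.

```python
def items_from_response(payload):
    """Return the list of venue items, or an empty list if none are present."""
    groups = payload.get("response", {}).get("groups", [])
    return groups[0]["items"] if groups else []

def extract_rows(name, lat, lng, items):
    """Turn venue items into flat tuples, using None for a missing category."""
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

# Hand-written sample payload (illustrative values only).
payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Brookbanks Park",
               "location": {"lat": 43.7520, "lng": -79.3321},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabelled Spot",
               "location": {"lat": 43.7519, "lng": -79.3331},
               "categories": []}},
]}]}}

rows = extract_rows("Parkwoods", 43.7529, -79.3356, items_from_response(payload))
print(rows)
```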

In [23]:
toronto_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Parkwoods
Victoria Village
Regent Park, Harbourfront
Lawrence Manor, Lawrence Heights
Queen's Park, Ontario Provincial Government
Islington Avenue
Malvern, Rouge
Don Mills
Parkview Hill, Woodbine Gardens
Garden District, Ryerson
Glencairn
West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale
Rouge Hill, Port Union, Highland Creek
Don Mills
Woodbine Heights
St. James Town
Humewood-Cedarvale
Eringate, Bloordale Gardens, Old Burnhamthorpe, Markland Wood
Guildwood, Morningside, West Hill
The Beaches
Berczy Park
Caledonia-Fairbanks
Woburn
Leaside
Central Bay Street
Christie
Cedarbrae
Hillcrest Village
Bathurst Manor, Wilson Heights, Downsview North
Thorncliffe Park
Richmond, Adelaide, King
Dufferin, Dovercourt Village
Scarborough Village
Fairview, Henry Farm, Oriole
Northwood Park, York University
East Toronto
Harbourfront East, Union Station, Toronto Islands
Little Portugal, Trinity
Kennedy Park, Ionview, East Birchmount Park
Bayview Village
Downsview
The Danforth West, Ri

In [79]:
toronto_venues.shape

(3377, 7)

In [24]:
toronto_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agincourt,41,41,41,41,41,41
"Alderwood, Long Branch",32,32,32,32,32,32
"Bathurst Manor, Wilson Heights, Downsview North",30,30,30,30,30,30
Bayview Village,13,13,13,13,13,13
"Bedford Park, Lawrence Manor East",37,37,37,37,37,37
Berczy Park,50,50,50,50,50,50
"Birch Cliff, Cliffside West",16,16,16,16,16,16
"Brockton, Parkdale Village, Exhibition Place",50,50,50,50,50,50
Business reply mail Processing Centre,50,50,50,50,50,50
"CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport",50,50,50,50,50,50


My next question is, how many unique categories are there in Toronto? 

In [25]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 314 unique categories.


What categories are these?

In [26]:
toronto_venues_unique_count = toronto_venues['Venue Category'].value_counts().to_frame(name='Count')
toronto_venues_unique_count

Unnamed: 0,Count
Coffee Shop,253
Café,143
Park,118
Pizza Place,100
Restaurant,95
Bakery,77
Sandwich Place,72
Italian Restaurant,72
Grocery Store,71
Bank,58


So the above was for Toronto in general. Could we get more granular data on the neighbourhoods in Toronto? 

Let's explore the neighbourhoods further. 

I would like to perform one hot encoding on the data to get binary data on the categories of venues in each neighbourhood. 

In [27]:
#one hot encoding
t_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix = "", prefix_sep = "")

# add neighborhood column back to dataframe
t_onehot['Neighbourhood'] = toronto_venues['Neighbourhood']

# move neighborhood column to the first column
fixed_columns = [t_onehot.columns[-1]] + list(t_onehot.columns[:-1])
t_onehot = t_onehot[fixed_columns]

t_onehot.tail(15)

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
3362,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3363,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3364,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3365,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3366,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3367,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3368,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3369,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3370,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3371,"Mimico NW, The Queensway West, South of Bloor,...",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [153]:
t_onehot.loc[t_onehot['Airport'] != 0]

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
1310,Downsview,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Airport's in Downsview. Good to know. 

Let's group the data for the same neighbourhoods together so we can have a clearer picture of the venues in each. 

In [28]:
t_grouped = t_onehot.groupby('Neighbourhood').mean().reset_index()
t_grouped.head()

Unnamed: 0,Neighbourhood,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,...,0.0,0.027027,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0
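
To see what the one-hot encoding and the grouped mean are doing, here is a toy version with three made-up venues. The neighbourhood names "A" and "B" and the venue rows are invented for illustration. The dtype=int argument is only there so the dummy columns are plain 0/1 integers.

```python
import pandas as pd

# Three made-up venues across two made-up neighbourhoods.
venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B"],
    "Venue Category": ["Park", "Cafe", "Park"],
})

# One column per category, with 1 where the venue belongs to that category.
onehot = pd.get_dummies(venues[["Venue Category"]], prefix="", prefix_sep="", dtype=int)
onehot.insert(0, "Neighbourhood", venues["Neighbourhood"])

# The mean of each 0/1 column is the share of venues in that category.
grouped = onehot.groupby("Neighbourhood").mean().reset_index()
print(grouped)
```

Neighbourhood A has one park and one cafe, so each category gets a share of 0.5. Neighbourhood B only has a park, so Park is 1.0 and Cafe is 0.0. These values are proportions, not counts.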


In [29]:
# Top common venues per neighbourhood
num_top_venues = 5

for hood in t_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = t_grouped[t_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending = False).reset_index(drop = True).head(num_top_venues))
    print('\n')

----Agincourt----
                venue  freq
0  Chinese Restaurant  0.17
1       Shopping Mall  0.10
2              Bakery  0.05
3      Sandwich Place  0.05
4         Pizza Place  0.02


----Alderwood, Long Branch----
         venue  freq
0  Coffee Shop  0.09
1  Pizza Place  0.09
2     Pharmacy  0.09
3         Bank  0.06
4   Donut Shop  0.03


----Bathurst Manor, Wilson Heights, Downsview North----
         venue  freq
0         Bank  0.07
1         Park  0.07
2  Coffee Shop  0.07
3     Building  0.03
4     Ski Area  0.03


----Bayview Village----
                 venue  freq
0                 Park  0.38
1                 Café  0.08
2  Japanese Restaurant  0.08
3          Coffee Shop  0.08
4                 Bank  0.08


----Bedford Park, Lawrence Manor East----
                venue  freq
0         Coffee Shop  0.08
1  Italian Restaurant  0.08
2      Sandwich Place  0.05
3    Sushi Restaurant  0.03
4          Restaurant  0.03


----Berczy Park----
                 venue  freq
0       

In [30]:
#define a function to return the most common venues per neighbourhood, in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending = False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [31]:
#create a new dataframe showing the top 5 venues of each neighbourhood in desc order

num_top_venues = 5

indicators = ['st', 'nd', 'rd'] #indicators like 1st, 2nd, 3rd

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhoods_venues_sorted = pd.DataFrame(columns=columns)
neighbourhoods_venues_sorted['Neighbourhood'] = t_grouped['Neighbourhood']

for ind in np.arange(t_grouped.shape[0]):
    neighbourhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(t_grouped.iloc[ind, :], num_top_venues)
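
The try/except trick for the column names works for a handful of ranks, but it would label the 21st column "21th". A small helper could build the suffix directly. The `ordinal` function below is a hypothetical sketch, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1st, 2nd, 3rd, 4th."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

# Same column layout as above, built with the helper.
columns = ["Neighbourhood"] + [
    "{} Most Common Venue".format(ordinal(i)) for i in range(1, 6)
]
print(columns)
```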

In [32]:
neighbourhoods_venues_sorted
#neighbourhoods_venues_sorted.shape

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Agincourt,Chinese Restaurant,Shopping Mall,Bakery,Sandwich Place,Pizza Place
1,"Alderwood, Long Branch",Pizza Place,Coffee Shop,Pharmacy,Bank,Bar
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Park,Coffee Shop,Ski Area,Deli / Bodega
3,Bayview Village,Park,Chinese Restaurant,Trail,Convenience Store,Pharmacy
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Italian Restaurant,Sandwich Place,Japanese Restaurant,Thai Restaurant
5,Berczy Park,Coffee Shop,Café,Beer Bar,Creperie,Park
6,"Birch Cliff, Cliffside West",Park,Convenience Store,Gym Pool,Farm,Bus Stop
7,"Brockton, Parkdale Village, Exhibition Place",Café,Bakery,Restaurant,Gift Shop,Coffee Shop
8,Business reply mail Processing Centre,Coffee Shop,Café,American Restaurant,Restaurant,Concert Hall
9,"CN Tower, King and Spadina, Railway Lands, Har...",Gym,Coffee Shop,Yoga Studio,Italian Restaurant,Park


In [33]:
t_grouped_clustering = t_grouped.drop('Neighbourhood', axis=1)
t_grouped_clustering.head()

Unnamed: 0,Accessories Store,Afghan Restaurant,Airport,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo Exhibit
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.027027,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0


Yeah! So the "neighbourhoods_venues_sorted" dataframe tells us the top 5 categories of venues found in each Neighbourhood. Now, let's prepare our data for k-means clustering.

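For convenience, I will just assume k=5 is a reasonable choice for this scenario instead of tuning it. Note that the cell below shows kclusters = 10 and a label array with ten distinct values, whereas the cluster tables and discussion that follow are based on five clusters (labels 0 to 4).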
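
Rather than assuming k, one common check is the elbow method: fit k-means for several values of k and look for the point where the inertia (within-cluster sum of squares) stops dropping sharply. The sketch below uses synthetic 2-D points with three obvious groups, not the Toronto data, and names such as `X` and `inertias` are my own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three tight blobs of 30 points each (illustrative only).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.3 * rng.randn(30, 2) for c in centers])

# Fit k-means for a range of k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

On data like this the inertia falls steeply up to k=3 and only slowly afterwards. On the real venue frequencies the elbow may be much less clear, so this would be a sanity check rather than a definitive answer.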

In [45]:
# set number of clusters
kclusters = 10

# run k-means clustering
kmeans = KMeans(n_clusters = kclusters, random_state=0).fit(t_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([7, 7, 2, 8, 2, 0, 1, 0, 3, 3, 7, 3, 2, 3, 0, 2, 7, 7, 0, 2, 2, 9,
       2, 2, 2, 0, 3, 7, 2, 0, 2, 3, 2, 2, 7, 2, 0, 7, 5, 3, 7, 2, 7, 7,
       0, 7, 2, 0, 2, 0, 2, 7, 2, 2, 7, 3, 3, 3, 2, 2, 3, 7, 1, 2, 3, 3,
       8, 2, 4, 0, 2, 6, 7, 0, 0, 7, 3, 2, 2, 0, 2, 2, 2, 2, 0, 0, 7, 7,
       7, 7, 2, 2, 2, 8, 2, 3, 1], dtype=int32)

In [49]:
#add clustering labels
neighbourhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

t_merged = df

# join the top-venue table onto df so each postal code gets its cluster label and most common venues
t_merged = t_merged.join(neighbourhoods_venues_sorted.set_index('Neighbourhood'), on='Neighborhood')

t_merged.shape # check the last columns!

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.752935,-79.335641,2.0,Park,Bus Stop,Golf Course,Bank,Road
1,M4A,North York,Victoria Village,43.728102,-79.31189,0.0,Pizza Place,Coffee Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Bus Line
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041,1.0,Coffee Shop,Bakery,Café,Park,Pub
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211,1.0,Clothing Store,Men's Store,Furniture / Home Store,Toy / Game Store,Dessert Shop
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939,1.0,Coffee Shop,Gastropub,Park,Sushi Restaurant,Bubble Tea Shop
5,M9A,Etobicoke,Islington Avenue,43.667481,-79.528953,2.0,Pharmacy,Grocery Store,Park,Shopping Mall,Bakery
6,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913,0.0,Fast Food Restaurant,Zoo Exhibit,Coffee Shop,Garden,Gas Station
7,M3B,North York,Don Mills,43.7489,-79.35722,0.0,Japanese Restaurant,Coffee Shop,Asian Restaurant,Italian Restaurant,Bank
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.707193,-79.311529,0.0,Pizza Place,Fast Food Restaurant,Brewery,Gym / Fitness Center,Pharmacy
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529,1.0,Coffee Shop,Gastropub,Theater,Plaza,Diner


In [39]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(t_merged['Latitude'], t_merged['Longitude'], t_merged['Neighborhood'], t_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

ValueError: cannot convert float NaN to integer

Whaaaaattttt... I got an error message saying there's a NaN value in my "Cluster Labels" column. Upon inspection, I realized that the postal code M1X has 0 venues, so it never appears in neighbourhoods_venues_sorted and the join leaves its cluster label empty. Hence, I will exclude this row from my analysis.

In [64]:
#drop row with NaN
t_merged.dropna(subset=["Cluster Labels"], axis=0, inplace=True)
t_merged.shape
t_merged

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M3A,North York,Parkwoods,43.752935,-79.335641,2.0,Park,Bus Stop,Golf Course,Bank,Road
1,M4A,North York,Victoria Village,43.728102,-79.31189,0.0,Pizza Place,Coffee Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Bus Line
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.650964,-79.353041,1.0,Coffee Shop,Bakery,Café,Park,Pub
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.723265,-79.451211,1.0,Clothing Store,Men's Store,Furniture / Home Store,Toy / Game Store,Dessert Shop
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.66179,-79.38939,1.0,Coffee Shop,Gastropub,Park,Sushi Restaurant,Bubble Tea Shop
5,M9A,Etobicoke,Islington Avenue,43.667481,-79.528953,2.0,Pharmacy,Grocery Store,Park,Shopping Mall,Bakery
6,M1B,Scarborough,"Malvern, Rouge",43.808626,-79.189913,0.0,Fast Food Restaurant,Zoo Exhibit,Coffee Shop,Garden,Gas Station
7,M3B,North York,Don Mills,43.7489,-79.35722,0.0,Japanese Restaurant,Coffee Shop,Asian Restaurant,Italian Restaurant,Bank
8,M4B,East York,"Parkview Hill, Woodbine Gardens",43.707193,-79.311529,0.0,Pizza Place,Fast Food Restaurant,Brewery,Gym / Fitness Center,Pharmacy
9,M5B,Downtown Toronto,"Garden District, Ryerson",43.657491,-79.377529,1.0,Coffee Shop,Gastropub,Theater,Plaza,Diner
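
Here is a tiny sketch of why the NaN shows up and how dropping it fixes the map code. The two-row tables and the neighbourhood name "No Venues Here" are made up for illustration.

```python
import pandas as pd

# Left table: every neighbourhood. Right table: only those that had venues.
left = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "No Venues Here"],
    "Latitude": [43.75, 43.83],
})
right = pd.DataFrame({"Neighbourhood": ["Parkwoods"], "Cluster Labels": [2]})

# The join keeps all left rows, so the unmatched row gets NaN
# and the whole label column is converted to float.
merged = left.join(right.set_index("Neighbourhood"), on="Neighborhood")
print(merged["Cluster Labels"].tolist())

# Drop the unmatched row, then the labels can safely be integers again.
merged = merged.dropna(subset=["Cluster Labels"]).copy()
merged["Cluster Labels"] = merged["Cluster Labels"].astype(int)
print(merged)
```

This also explains why the labels in the tables above are displayed as 2.0, 0.0 and so on instead of whole numbers.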


Now, I will try again to do a cluster map. 

In [57]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(t_merged['Latitude'], t_merged['Longitude'], t_merged['Neighborhood'], t_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Time to examine the 5 different clusters to determine the discriminating venue categories that distinguish each cluster.

In [59]:
#Cluster 1
t_merged.loc[t_merged['Cluster Labels'] == 0, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
1,North York,0.0,Pizza Place,Coffee Shop,Middle Eastern Restaurant,Mediterranean Restaurant,Bus Line
6,Scarborough,0.0,Fast Food Restaurant,Zoo Exhibit,Coffee Shop,Garden,Gas Station
7,North York,0.0,Japanese Restaurant,Coffee Shop,Asian Restaurant,Italian Restaurant,Bank
8,East York,0.0,Pizza Place,Fast Food Restaurant,Brewery,Gym / Fitness Center,Pharmacy
10,North York,0.0,Grocery Store,Coffee Shop,Italian Restaurant,Park,Gas Station
11,Etobicoke,0.0,Park,Pizza Place,Convenience Store,Fish & Chips Shop,Clothing Store
13,North York,0.0,Japanese Restaurant,Coffee Shop,Asian Restaurant,Italian Restaurant,Bank
14,East York,0.0,Coffee Shop,Bar,Bank,Bus Line,Grocery Store
16,York,0.0,Convenience Store,Pizza Place,Grocery Store,Bus Stop,Gastropub
17,Etobicoke,0.0,Pizza Place,Flower Shop,Skating Rink,Gas Station,Beer Store


In [60]:
#Cluster 2
t_merged.loc[t_merged['Cluster Labels'] == 1, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
2,Downtown Toronto,1.0,Coffee Shop,Bakery,Café,Park,Pub
3,North York,1.0,Clothing Store,Men's Store,Furniture / Home Store,Toy / Game Store,Dessert Shop
4,Downtown Toronto,1.0,Coffee Shop,Gastropub,Park,Sushi Restaurant,Bubble Tea Shop
9,Downtown Toronto,1.0,Coffee Shop,Gastropub,Theater,Plaza,Diner
15,Downtown Toronto,1.0,Café,Gastropub,Coffee Shop,Seafood Restaurant,Cocktail Bar
19,East Toronto,1.0,Pub,Coffee Shop,Pizza Place,Bar,Caribbean Restaurant
20,Downtown Toronto,1.0,Coffee Shop,Café,Beer Bar,Creperie,Park
24,Downtown Toronto,1.0,Coffee Shop,Café,Theater,Plaza,Japanese Restaurant
25,Downtown Toronto,1.0,Café,Korean Restaurant,Coffee Shop,Grocery Store,Indian Restaurant
30,Downtown Toronto,1.0,Theater,Café,Coffee Shop,American Restaurant,Gym


In [61]:
#Cluster 3
t_merged.loc[t_merged['Cluster Labels'] == 2, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,North York,2.0,Park,Bus Stop,Golf Course,Bank,Road
5,Etobicoke,2.0,Pharmacy,Grocery Store,Park,Shopping Mall,Bakery
39,North York,2.0,Park,Chinese Restaurant,Trail,Convenience Store,Pharmacy
58,Scarborough,2.0,Park,Convenience Store,Gym Pool,Farm,Bus Stop
61,Central Toronto,2.0,Gym / Fitness Center,College Gym,Bus Line,Bookstore,Park
68,Central Toronto,2.0,Park,Gym / Fitness Center,Bank,Sushi Restaurant,Café
91,Downtown Toronto,2.0,Trail,Park,Farmers Market,Candy Store,Grocery Store


In [62]:
#Cluster 4
t_merged.loc[t_merged['Cluster Labels'] == 3, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
22,Scarborough,3.0,Park,Indian Restaurant,Coffee Shop,Pharmacy,Zoo Exhibit


In [63]:
#Cluster 5
t_merged.loc[t_merged['Cluster Labels'] == 4, t_merged.columns[[1] + list(range(5, t_merged.shape[1]))]]

Unnamed: 0,Borough,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
12,Scarborough,4.0,Breakfast Spot,Park,Burger Joint,Bar,Cupcake Shop


## Discussion

1. Clusters 1 & 2 are evidently the places to stay in if you enjoy a lively environment. Both clusters appear to offer a wide selection of restaurants, cafes, shops and pubs. Cluster 1 seems to be geographically further away from the coast, and Cluster 2 is the "downtown" area. Lots of neighbourhoods in Cluster 2 have coffee shops as the "1st most common venue", whereas neighbourhoods in Cluster 1 appear to have restaurants as the most common venue instead. 

    This does not necessarily mean, however, that Cluster 2 has fewer restaurants than Cluster 1. The clustering was run on the relative frequency of each venue category within a neighbourhood (the mean of the one-hot columns in t_grouped), and the tables in t_merged only rank categories against each other, so neither accounts for the absolute number of venues in each category.
    

2. Cluster 3 is definitely a much quieter place to live, with proximity to parks, trails and fitness places. However, neighbourhoods in Cluster 3 seem to be scattered around neighbourhoods of Clusters 1 & 2, so it might not be a bad choice to stay in given that it's not too far away from the 'trendy' areas as well, while offering some respite from the crowds. 


3. Clusters 4 & 5 consist of only one neighbourhood each. Both seem to be similar, though in Cluster 5 breakfast spots are the most common venue.


4. A broad overview of the results would suggest that if one were interested in moving to Toronto, neighbourhoods in Clusters 1 or 2 would seem to be the more trendy areas to stay in. However, if one would like to be near those spots without being too caught up in the crowds, then Cluster 3 would be a nice place to consider as well (housing prices and other factors, which I did not account for in this exercise, may also differ between clusters).
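
The caveat in point 1 about relative versus absolute numbers can be shown with a small worked example. The venue lists below are invented, and `category_frequencies` is a hypothetical helper that mirrors what the grouped mean of the one-hot columns computes.

```python
from collections import Counter

def category_frequencies(categories):
    """Share of venues in each category, like the grouped one-hot mean."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}

# Invented neighbourhoods: a sparse one and a busy one.
small = ["Restaurant", "Park"]                      # 1 restaurant out of 2 venues
large = ["Restaurant"] * 10 + ["Coffee Shop"] * 40  # 10 restaurants out of 50 venues

small_freq = category_frequencies(small)
large_freq = category_frequencies(large)
print(small_freq["Restaurant"], large_freq["Restaurant"])
```

The sparse neighbourhood has a restaurant share of 0.5 and the busy one only 0.2, even though the busy one has ten times as many restaurants. A ranking by share can therefore make restaurants look more prominent in the quieter area.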

## THE END