<a href="https://colab.research.google.com/github/mphill82/Coursera_Capstone/blob/main/Clustering_Toronto_Neighborhoods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering Toronto Neighborhoods by Venues
#### By Mitch Phillips

Our objective here is to find similarities between different postal code areas of Toronto based on the types of venues they have nearby.  We will use postal code data from Wikipedia and venue data from Foursquare.  Foursquare venue data in Toronto is relatively limited, so we end up dropping many areas with few venue listings. We will find 3 clusters using the k-means clustering algorithm.  Then we will look at the most common venue categories of each cluster to characterize them.

### Part 1 - Acquiring Neighborhood information from Wikipedia

First we scrape data from Wikipedia on neighborhoods and boroughs of each postal code in Toronto.

In [21]:
import numpy as np
import pandas as pd
import json
import requests
from bs4 import BeautifulSoup

In [22]:
page=requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup=BeautifulSoup(page,'html.parser')

In [23]:
table = soup.find('table')

We extract the postal code, borough, and list of associated neighborhoods from each cell of the table on Wikipedia and store them in the dataframe df.

In [24]:
table_contents=[]
table=soup.find('table')
for row in table.find_all('td'):
    cell = {}
    if row.span.text=='Not assigned':
        pass
    else:
        cell['PostalCode'] = row.p.text[:3]   #first three characters of the cell
        cell['Borough'] = (row.span.text).split('(')[0]   #everything in the span before the first parenthesis
        #after the first parenthesis, remove the remaining parentheses and replace forward slashes with commas
        cell['Neighborhood'] = (((((row.span.text).split('(')[1]).strip(')')).replace(' /',',')).replace(')',' ')).strip(' ')
        table_contents.append(cell)

df=pd.DataFrame(table_contents)
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
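
The string handling in the loop above can be tried on a single cell's text without scraping the page. The following is a minimal sketch: `parse_cell_text` is a hypothetical helper that is not part of the notebook, and the sample string is an assumed example of a cell's span text (the borough immediately followed by the neighborhoods in parentheses).

```python
def parse_cell_text(span_text):
    """Split 'Borough(Neighborhood A / Neighborhood B)' into its two parts.

    Hypothetical helper mirroring the notebook's split/strip/replace chain.
    """
    # Everything before the first '(' is the borough; the rest lists neighborhoods.
    borough, _, rest = span_text.partition('(')
    # Drop the parentheses and turn ' /' separators into commas.
    neighborhoods = rest.strip(')').replace(' /', ',').replace(')', ' ').strip()
    return borough.strip(), neighborhoods


# Assumed sample of a cell's span text.
print(parse_cell_text('Scarborough(Malvern / Rouge)'))
```

Unlike indexing `split('(')[1]`, `partition` returns an empty string when no parenthesis is present, so a cell without a neighborhood list would not raise an `IndexError`.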

We sort the dataframe by postal code and see that there are 103 postal codes.



In [25]:
df=df.sort_values('PostalCode')
df = df.reset_index(drop=True)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
...,...,...,...
98,M9N,York,Weston
99,M9P,Etobicoke,Westmount
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."


In [26]:
df.shape

(103, 3)

### Part 2 - Acquiring Latitude and Longitude data

Geocoder was not working, so I used the CSV file provided by the course instructor to get the latitude and longitude of each Toronto postal code.

In [27]:
url='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
#Read the file and convert it to a pandas dataframe
df_ll= pd.read_csv(url)
#Combine this dataframe with the neighborhood dataframe from before (rows are matched by index, so both must be in the same postal code order)
df=pd.concat([df,df_ll[['Latitude','Longitude']]],axis=1)
df

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,Weston,43.706876,-79.518188
99,M9P,Etobicoke,Westmount,43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [28]:
df.shape

(103, 5)
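
Because `pd.concat(..., axis=1)` lines rows up by index rather than by postal code, a key-based merge is a safer alternative if the two tables are ever ordered differently. The sketch below uses two rows taken from the output above, deliberately listed in different orders. The coordinate table's key column name, `'Postal Code'`, is an assumption about the CSV and should be checked against `df_ll.columns`.

```python
import pandas as pd

# Two rows from the notebook's output, listed in a different order in each table.
neighborhoods = pd.DataFrame({
    'PostalCode': ['M1C', 'M1B'],
    'Borough': ['Scarborough', 'Scarborough'],
})
coords = pd.DataFrame({
    'Postal Code': ['M1B', 'M1C'],   # assumed name of the key column in the CSV
    'Latitude': [43.806686, 43.784535],
    'Longitude': [-79.194353, -79.160497],
})

# Match rows on the postal code itself; validate raises if a code is duplicated.
merged = neighborhoods.merge(
    coords,
    left_on='PostalCode',
    right_on='Postal Code',
    how='left',
    validate='one_to_one',
).drop(columns='Postal Code')
print(merged)
```

With `how='left'`, any postal code missing from the coordinate table shows up as `NaN` coordinates instead of silently shifting the remaining rows.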

Now let's display all the Toronto postal code locations on a map.

In [29]:
import folium
from geopy.geocoders import Nominatim

In [30]:
#Get the center of Toronto
address = 'Toronto, ON'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of Toronto using latitude and longitude values
f = folium.Figure(width=650, height=450)
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10).add_to(f)

# add markers to map
for lat, lng, borough, pc in zip(df['Latitude'], df['Longitude'], df['Borough'], df['PostalCode']):
    label = '{}, {}'.format(pc, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

### Part 3 - Clustering with Foursquare Venue data

We need to define API credentials to access Foursquare data.

In [31]:
CLIENT_ID = '2M2FVW3D5I4QK0ZPX0XWRRECY3XUU5JQK3O5V0GV2LX0MR3Q' # your Foursquare ID
CLIENT_SECRET = '4WE2VPK1JZQN0R4MH3FRMWJA25KNB5VVXM1MVA4BCS2SZ1RI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 2M2FVW3D5I4QK0ZPX0XWRRECY3XUU5JQK3O5V0GV2LX0MR3Q
CLIENT_SECRET:4WE2VPK1JZQN0R4MH3FRMWJA25KNB5VVXM1MVA4BCS2SZ1RI
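
API keys are usually kept out of a shared notebook. One possible approach, sketched here, is to read them from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical choices for illustration, not something Foursquare or the notebook defines.

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) from a mapping of environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET')
    # Fail early with a clear message instead of sending an empty key to the API.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return env[names[0]], env[names[1]]


# Example with an explicit mapping so the sketch runs anywhere (placeholder values).
client_id, client_secret = load_foursquare_credentials({
    'FOURSQUARE_CLIENT_ID': 'example-id',
    'FOURSQUARE_CLIENT_SECRET': 'example-secret',
})
print('CLIENT_ID loaded:', bool(client_id))
```

Calling `load_foursquare_credentials()` with no argument reads from the real process environment.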


We define a function to find all nearby venues for any postal code.

In [32]:
def getNearbyVenues(postalcodes, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for pc, lat, lng in zip(postalcodes, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            pc, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postal Code', 
                  'PC Latitude', 
                  'PC Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
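
The parsing step inside `getNearbyVenues` can be exercised without calling the API. The sketch below assumes the response layout that the function above already relies on (`response` → `groups[0]` → `items` → `venue`). `extract_venue_rows` is a hypothetical helper. The first sample item copies a row from the output further down, and the second is made up to show a venue with no category.

```python
def extract_venue_rows(payload, pc, lat, lng):
    """Flatten one explore-style payload into tuples, tolerating missing pieces."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Use None instead of raising IndexError when a venue has no category.
        category = categories[0]['name'] if categories else None
        rows.append((pc, lat, lng, venue['name'],
                     venue['location']['lat'], venue['location']['lng'], category))
    return rows


# Hand-written payload in the assumed layout (second venue is invented).
sample = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Wendy’s',
               'location': {'lat': 43.807448, 'lng': -79.199056},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Uncategorized Place',
               'location': {'lat': 43.8070, 'lng': -79.1990},
               'categories': []}},
]}]}}

rows = extract_venue_rows(sample, 'M1B', 43.806686, -79.194353)
print(rows)
```

An empty payload (`{}`) yields an empty list, so a postal code with no results would not stop the loop over the remaining postal codes.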

Now we call this function on our dataframe to find all the nearby venues of each postal code in Toronto.

In [34]:
toronto_venues_=getNearbyVenues(df['PostalCode'],df['Latitude'],df['Longitude'])

In [35]:
toronto_venues=toronto_venues_
toronto_venues

Unnamed: 0,Postal Code,PC Latitude,PC Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,M1B,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant
1,M1C,43.784535,-79.160497,Great Shine Window Cleaning,43.783145,-79.157431,Home Service
2,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
3,M1E,43.763573,-79.188711,RBC Royal Bank,43.766790,-79.191151,Bank
4,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store
...,...,...,...,...,...,...,...
2130,M9V,43.739416,-79.588437,McDonald's,43.741757,-79.584230,Fast Food Restaurant
2131,M9V,43.739416,-79.588437,Pizza Nova,43.736761,-79.589817,Pizza Place
2132,M9W,43.706748,-79.594054,Economy Rent A Car,43.708471,-79.589943,Rental Car Location
2133,M9W,43.706748,-79.594054,Saand Rexdale,43.705072,-79.598725,Drugstore


We've collected 2135 venues and identified which postal code area each belongs to as well as what type of venue it is.  We can look at how many venues belong to each area...

In [36]:
toronto_venues.groupby('Postal Code')['Venue'].count().to_frame().T

Postal Code,M1B,M1C,M1E,M1G,M1H,M1J,M1K,M1L,M1M,M1N,M1P,M1R,M1S,M1T,M1V,M1W,M2H,M2J,M2K,M2N,M2P,M2R,M3A,M3B,M3C,M3H,M3J,M3K,M3L,M3M,M3N,M4A,M4B,M4C,M4E,M4G,M4H,M4J,M4K,M4L,...,M5M,M5N,M5P,M5R,M5S,M5T,M5V,M5W,M5X,M6A,M6B,M6C,M6E,M6G,M6H,M6J,M6K,M6L,M6M,M6N,M6P,M6R,M6S,M7A,M7R,M7Y,M8V,M8W,M8X,M8Y,M8Z,M9B,M9C,M9L,M9M,M9N,M9P,M9R,M9V,M9W
Venue,1,2,9,3,8,2,4,9,2,4,6,6,4,13,3,14,5,62,4,35,4,7,3,5,20,23,6,2,6,3,4,4,10,7,5,32,20,5,42,19,...,25,2,4,19,32,62,16,100,100,11,4,4,4,16,15,43,25,4,5,4,25,15,35,31,14,17,13,7,2,2,15,2,10,2,1,1,9,4,9,3


We have 99 postal code areas with venues listed, but many of them have very few venues.  Let's drop areas with fewer than 10 venues for the purpose of this clustering analysis.  First we'll add a column for venue counts.

In [37]:
toronto_venues = toronto_venues.join(toronto_venues.groupby('Postal Code')['Venue'].count(), on='Postal Code', rsuffix=' count')
toronto_venues

Unnamed: 0,Postal Code,PC Latitude,PC Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Venue count
0,M1B,43.806686,-79.194353,Wendy’s,43.807448,-79.199056,Fast Food Restaurant,1
1,M1C,43.784535,-79.160497,Great Shine Window Cleaning,43.783145,-79.157431,Home Service,2
2,M1C,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar,2
3,M1E,43.763573,-79.188711,RBC Royal Bank,43.766790,-79.191151,Bank,9
4,M1E,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store,9
...,...,...,...,...,...,...,...,...
2130,M9V,43.739416,-79.588437,McDonald's,43.741757,-79.584230,Fast Food Restaurant,9
2131,M9V,43.739416,-79.588437,Pizza Nova,43.736761,-79.589817,Pizza Place,9
2132,M9W,43.706748,-79.594054,Economy Rent A Car,43.708471,-79.589943,Rental Car Location,3
2133,M9W,43.706748,-79.594054,Saand Rexdale,43.705072,-79.598725,Drugstore,3


Then we can filter out the areas with fewer than 10 venues by this new venue count column.

In [38]:
toronto_venues = toronto_venues[toronto_venues['Venue count']>=10]
toronto_venues.groupby('Postal Code')['Venue'].count().to_frame().T

Postal Code,M1T,M1W,M2J,M2N,M3C,M3H,M4B,M4G,M4H,M4K,M4L,M4M,M4R,M4S,M4V,M4X,M4Y,M5A,M5B,M5C,M5E,M5G,M5H,M5J,M5K,M5L,M5M,M5R,M5S,M5T,M5V,M5W,M5X,M6A,M6G,M6H,M6J,M6K,M6P,M6R,M6S,M7A,M7R,M7Y,M8V,M8Z,M9C
Venue,13,14,62,35,20,23,10,32,20,42,19,36,20,37,14,52,78,45,100,83,59,59,96,100,100,100,25,19,32,62,16,100,100,11,16,15,43,25,25,15,35,31,14,17,13,15,10


In [39]:
toronto_venues.groupby('Postal Code')['Venue'].count().to_frame().shape

(47, 1)
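
The join-then-filter steps above can also be written with `groupby(...).transform('count')`, which produces the per-area count aligned to each row without adding a column. This is a sketch on made-up data. The threshold of 2 is chosen only to fit the toy table, whereas the notebook uses 10.

```python
import pandas as pd

# Toy venue table: area 'A' has three venues, area 'B' has one.
venues = pd.DataFrame({
    'Postal Code': ['A', 'A', 'A', 'B'],
    'Venue': ['v1', 'v2', 'v3', 'v4'],
})

min_venues = 2  # illustrative threshold; the notebook uses 10
# transform('count') repeats each group's size on every row of that group.
counts = venues.groupby('Postal Code')['Venue'].transform('count')
kept = venues[counts >= min_venues]
print(kept)
```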

We dropped 52 of the 99 postal code areas and now have 47 remaining, each with 10 or more venues.   Let's map the postal code areas we've kept.

In [40]:
# create map of Toronto using latitude and longitude values
f = folium.Figure(width=650, height=450)
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10).add_to(f)

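# add markers to map (one marker per postal code rather than one per venue)
kept_areas = toronto_venues[['PC Latitude', 'PC Longitude', 'Postal Code']].drop_duplicates()
for lat, lng, pc in zip(kept_areas['PC Latitude'], kept_areas['PC Longitude'], kept_areas['Postal Code']):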
    label = '{}'.format(pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

#### Pre-processing

We're going to get our data ready for clustering.  We'll do this by adding a feature column to our dataframe for each unique venue category.

In [41]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
toronto_df=pd.concat([toronto_venues[['Postal Code','PC Latitude','PC Longitude']],toronto_onehot],axis=1)
toronto_df

Unnamed: 0,Postal Code,PC Latitude,PC Longitude,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,...,Shopping Mall,Shopping Plaza,Skate Park,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soup Place,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
60,M1T,43.781638,-79.304302,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
61,M1T,43.781638,-79.304302,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
62,M1T,43.781638,-79.304302,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
63,M1T,43.781638,-79.304302,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
64,M1T,43.781638,-79.304302,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2101,M9C,43.643515,-79.577201,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2102,M9C,43.643515,-79.577201,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2103,M9C,43.643515,-79.577201,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2104,M9C,43.643515,-79.577201,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


Now we can group this dataframe by postal code area and take the mean of each feature column.  This gives us the relative frequency of each venue category within each area.

In [42]:
toronto_df_grouped=toronto_df.groupby(['Postal Code','PC Latitude','PC Longitude']).mean().reset_index()
toronto_df_grouped

Unnamed: 0,Postal Code,PC Latitude,PC Longitude,Accessories Store,Adult Boutique,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Workshop,BBQ Joint,Baby Store,Bagel Shop,Bakery,Bank,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Beach,Bed & Breakfast,Beer Bar,Beer Store,Belgian Restaurant,Bike Shop,Bistro,Board Shop,Boat or Ferry,Bookstore,Boutique,...,Shopping Mall,Shopping Plaza,Skate Park,Skating Rink,Smoke Shop,Smoothie Shop,Snack Place,Soup Place,Spa,Speakeasy,Sporting Goods Shop,Sports Bar,Stadium,Stationery Store,Steakhouse,Strip Club,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tailor Shop,Taiwanese Restaurant,Tanning Salon,Tea Room,Thai Restaurant,Theater,Theme Restaurant,Thrift / Vintage Store,Tibetan Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,M1T,43.781638,-79.304302,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,M1W,43.799525,-79.318389,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,M2J,43.778517,-79.346556,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.032258,0.032258,0.016129,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,...,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.016129,0.0,0.0,0.0,0.0,0.032258,0.0
3,M2N,43.77012,-79.408493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.057143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0
4,M3C,43.7259,-79.340923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.05,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,M3H,43.754328,-79.442259,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.086957,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.043478,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,M4B,43.706397,-79.309937,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,M4G,43.70906,-79.363452,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,...,0.03125,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,0.0625,0.03125,0.0,0.0,0.0,0.0,0.03125,0.0,0.03125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,M4H,43.705369,-79.349372,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05,0.0,0.0,0.0,0.05
9,M4K,43.679557,-79.352188,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381
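
To see why the mean of one-hot columns is a frequency, here is a small sketch on made-up data: each venue contributes a 1 to exactly one category column, so the per-area mean of a column is the share of that area's venues in that category.

```python
import pandas as pd

# Toy data: area 'A' has 2 venues, area 'B' has 4.
venues = pd.DataFrame({
    'Postal Code': ['A', 'A', 'B', 'B', 'B', 'B'],
    'Venue Category': ['Bank', 'Pizza Place', 'Bank', 'Bank', 'Bank', 'Café'],
})

# One column per category, with 1.0 marking the venue's category.
onehot = pd.get_dummies(venues['Venue Category'], dtype=float)
onehot.insert(0, 'Postal Code', venues['Postal Code'])

# The mean of 0/1 indicators is the fraction of venues in each category.
frequencies = onehot.groupby('Postal Code').mean()
print(frequencies)
```

Each row of `frequencies` sums to 1, which is why areas with very few venues produce coarse, noisy feature vectors.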


Since we will want to be able to characterize each cluster and see what types of venues they have, let's write a function to sort the venues of any postal code area from most to least frequent.

In [43]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
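
As a quick check of how this function behaves, the sketch below repeats the definition and applies it to one hand-made row. The postal code `X0X` and the frequencies are invented for illustration.

```python
import pandas as pd


def return_most_common_venues(row, num_top_venues):
    # Skip the first entry (the postal code) and rank the category frequencies.
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]


# Invented row: a postal code followed by category frequencies.
row = pd.Series({'Postal Code': 'X0X', 'Bank': 0.10, 'Café': 0.25,
                 'Park': 0.0, 'Pizza Place': 0.40})
top = return_most_common_venues(row, 2)
print(list(top))
```

Categories with equal frequency (for example, several at 0.0) may come back in any order, so the lower ranks of a sparse area should be read with caution.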

Now let's create a new dataframe to display the top 5 venues from each neighborhood.

In [44]:
toronto_grouped=toronto_df_grouped.drop(['PC Latitude','PC Longitude'],axis=1)

num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:   # no special suffix beyond 'rd', fall back to 'th'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = toronto_grouped['Postal Code']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1T,Pizza Place,Fast Food Restaurant,Bank,Fried Chicken Joint,Italian Restaurant
1,M1W,Fast Food Restaurant,Electronics Store,Breakfast Spot,Bank,Burger Joint
2,M2J,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Juice Bar
3,M2N,Ramen Restaurant,Sushi Restaurant,Shopping Mall,Pizza Place,Restaurant
4,M3C,Gym,Coffee Shop,Restaurant,Sporting Goods Shop,Asian Restaurant
5,M3H,Coffee Shop,Bank,Sushi Restaurant,Pharmacy,Pizza Place
6,M4B,Pizza Place,Intersection,Flea Market,Bank,Gym / Fitness Center
7,M4G,Coffee Shop,Sporting Goods Shop,Bank,Sandwich Place,Furniture / Home Store
8,M4H,Indian Restaurant,Yoga Studio,Restaurant,Pizza Place,Pharmacy
9,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store
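
The column names above are built by indexing a list of suffixes and falling back to "th". An alternative sketch is a small helper that computes the suffix directly. `ordinal` is a hypothetical function, not part of the notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    # 11th, 12th and 13th are exceptions to the last-digit rule.
    if 10 <= n % 100 <= 20:
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)


num_top_venues = 5
columns = ['Postal Code'] + [
    '{} Most Common Venue'.format(ordinal(i)) for i in range(1, num_top_venues + 1)
]
print(columns)
```

This avoids using an exception for control flow and stays correct if `num_top_venues` is raised past 20.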


Now we will run k-means to cluster these neighborhoods into 3 clusters.

In [45]:
from sklearn.cluster import KMeans

In [46]:
# set number of clusters
kclusters = 3

toronto_grouped_clustering = toronto_grouped.drop(['Postal Code'], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([0, 0, 1, 2, 1, 2, 0, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2,
       2, 2, 2], dtype=int32)
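
For intuition about what the labels mean, here is a minimal sketch of `KMeans` on a made-up frequency matrix with two obvious groups. The data and the choice of 2 clusters are illustrative only, and `n_init=10` is set explicitly because the default has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up category frequencies: rows 0-1 are dominated by the first category,
# rows 2-3 by the last one.
features = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
labels = model.labels_
print(labels)
```

The label numbers themselves are arbitrary. Only which rows share a label is meaningful, so clusters are best described by their contents rather than by their index.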

Let's create a new dataframe that includes the cluster label, latitude and longitude data, as well as the top 5 venues for each neighborhood.

In [47]:
# add clustering labels
neighborhoods_venues_sorted.insert(loc=0,column='Cluster Labels', value=kmeans.labels_)
#add latitude/longitude for each neighborhood
toronto_merged = neighborhoods_venues_sorted.join(toronto_df_grouped[['Postal Code','PC Latitude','PC Longitude']].set_index('Postal Code'), on='Postal Code')
toronto_merged

Unnamed: 0,Cluster Labels,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,PC Latitude,PC Longitude
0,0,M1T,Pizza Place,Fast Food Restaurant,Bank,Fried Chicken Joint,Italian Restaurant,43.781638,-79.304302
1,0,M1W,Fast Food Restaurant,Electronics Store,Breakfast Spot,Bank,Burger Joint,43.799525,-79.318389
2,1,M2J,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Juice Bar,43.778517,-79.346556
3,2,M2N,Ramen Restaurant,Sushi Restaurant,Shopping Mall,Pizza Place,Restaurant,43.77012,-79.408493
4,1,M3C,Gym,Coffee Shop,Restaurant,Sporting Goods Shop,Asian Restaurant,43.7259,-79.340923
5,2,M3H,Coffee Shop,Bank,Sushi Restaurant,Pharmacy,Pizza Place,43.754328,-79.442259
6,0,M4B,Pizza Place,Intersection,Flea Market,Bank,Gym / Fitness Center,43.706397,-79.309937
7,1,M4G,Coffee Shop,Sporting Goods Shop,Bank,Sandwich Place,Furniture / Home Store,43.70906,-79.363452
8,2,M4H,Indian Restaurant,Yoga Studio,Restaurant,Pizza Place,Pharmacy,43.705369,-79.349372
9,1,M4K,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,43.679557,-79.352188


Now we can view the clusters on a map.  In the next section we will attempt to understand why the k-means algorithm picked these clusters by seeing what venue types characterize them.

In [48]:
import matplotlib.cm as cm
import matplotlib.colors as colors

In [49]:
# create map
f = folium.Figure(width=650, height=450)
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10).add_to(f)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, pc, cluster in zip(toronto_merged['PC Latitude'], toronto_merged['PC Longitude'], toronto_merged['Postal Code'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(pc) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

### Part 4: Analysis of the Clusters

We will now analyze each of the clusters by viewing them on a map and then seeing their venue category rankings.

Cluster 0:  The banks cluster

In [50]:
#make cluster dataframe
cluster0=toronto_merged.loc[toronto_merged['Cluster Labels'] == 0].reset_index()

In [51]:
# create map of Toronto using latitude and longitude values
f = folium.Figure(width=400, height=300)
map_toronto = folium.Map(location=[latitude+.05, longitude], zoom_start=10).add_to(f)

# add markers to map
for lat, lng, pc in zip(cluster0['PC Latitude'], cluster0['PC Longitude'], cluster0['Postal Code']):
    label = '{}'.format(pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [52]:
cluster0_cats=cluster0.drop(['index','Cluster Labels','Postal Code','PC Latitude','PC Longitude'], axis=1)
cluster0_cats

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Pizza Place,Fast Food Restaurant,Bank,Fried Chicken Joint,Italian Restaurant
1,Fast Food Restaurant,Electronics Store,Breakfast Spot,Bank,Burger Joint
2,Pizza Place,Intersection,Flea Market,Bank,Gym / Fitness Center


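Below we tally the venue categories in cluster 0's 2nd through 5th most common venue columns (the `iloc[:,1:]` slice leaves out the first column).  Bank appears in all three areas, so the cluster looks primarily characterized by the presence of banks.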

In [53]:
categories = []
for col in cluster0_cats.iloc[:,1:]:
    col_venues = cluster0_cats[col].tolist()
    categories += col_venues
cluster0_cats=pd.DataFrame(categories)
cluster0_cats.value_counts().head()

Bank                    3
Italian Restaurant      1
Intersection            1
Gym / Fitness Center    1
Fried Chicken Joint     1
dtype: int64
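
To count across all five rank columns instead, one option is to stack the table into a single Series and call `value_counts`. The sketch below rebuilds the three cluster 0 rows shown above by hand. It is an alternative tally, not the code that produced the output above.

```python
import pandas as pd

# The three cluster 0 rows from the table above, typed in by hand.
cluster0_top = pd.DataFrame({
    '1st Most Common Venue': ['Pizza Place', 'Fast Food Restaurant', 'Pizza Place'],
    '2nd Most Common Venue': ['Fast Food Restaurant', 'Electronics Store', 'Intersection'],
    '3rd Most Common Venue': ['Bank', 'Breakfast Spot', 'Flea Market'],
    '4th Most Common Venue': ['Fried Chicken Joint', 'Bank', 'Bank'],
    '5th Most Common Venue': ['Italian Restaurant', 'Burger Joint', 'Gym / Fitness Center'],
})

# stack() turns the 3x5 table into one Series of 15 category names.
counts = cluster0_top.stack().value_counts()
print(counts.head())
```

With the first column included, Pizza Place and Fast Food Restaurant each appear twice alongside Bank's three appearances.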

Cluster 1:  The coffee and restaurants cluster

In [54]:
#make cluster dataframe
cluster1=toronto_merged.loc[toronto_merged['Cluster Labels'] == 1].reset_index()

In [55]:
# create map of Toronto using latitude and longitude values
f = folium.Figure(width=400, height=300)
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10).add_to(f)

# add markers to map
for lat, lng, pc in zip(cluster1['PC Latitude'], cluster1['PC Longitude'], cluster1['Postal Code']):
    label = '{}'.format(pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [56]:
cluster1_cats=cluster1.drop(['index','Cluster Labels','Postal Code','PC Latitude','PC Longitude'], axis=1)
cluster1_cats

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Clothing Store,Coffee Shop,Fast Food Restaurant,Restaurant,Juice Bar
1,Gym,Coffee Shop,Restaurant,Sporting Goods Shop,Asian Restaurant
2,Coffee Shop,Sporting Goods Shop,Bank,Sandwich Place,Furniture / Home Store
3,Greek Restaurant,Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store
4,Coffee Shop,Gastropub,Bakery,Brewery,Café
5,Clothing Store,Coffee Shop,Fast Food Restaurant,Park,Pet Store
6,Coffee Shop,Supermarket,Fried Chicken Joint,Liquor Store,Bagel Shop
7,Coffee Shop,Sushi Restaurant,Japanese Restaurant,Restaurant,Gay Bar
8,Coffee Shop,Bakery,Café,Pub,Park
9,Clothing Store,Coffee Shop,Café,Bubble Tea Shop,Middle Eastern Restaurant


Below we see the most common venue categories in cluster 1.  It looks like this cluster is characterized by cafés, restaurants, coffee shops, hotels, and Italian restaurants.

In [57]:
categories = []
for col in cluster1_cats.iloc[:,1:]:
    col_venues = cluster1_cats[col].tolist()
    categories += col_venues
cluster1_cats=pd.DataFrame(categories)
cluster1_cats.value_counts().head()

Café                  10
Restaurant             8
Coffee Shop            6
Hotel                  6
Italian Restaurant     4
dtype: int64

Cluster 2: The pizza cluster

In [58]:
#make cluster dataframe
cluster2=toronto_merged.loc[toronto_merged['Cluster Labels'] == 2].reset_index()

In [59]:
# create map of Toronto using latitude and longitude values
f = folium.Figure(width=400, height=300)
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10).add_to(f)

# add markers to map
for lat, lng, pc in zip(cluster2['PC Latitude'], cluster2['PC Longitude'], cluster2['Postal Code']):
    label = '{}'.format(pc)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [60]:
cluster2_cats=cluster2.drop(['index','Cluster Labels','Postal Code','PC Latitude','PC Longitude'], axis=1)
cluster2_cats

Unnamed: 0,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Ramen Restaurant,Sushi Restaurant,Shopping Mall,Pizza Place,Restaurant
1,Coffee Shop,Bank,Sushi Restaurant,Pharmacy,Pizza Place
2,Indian Restaurant,Yoga Studio,Restaurant,Pizza Place,Pharmacy
3,Fast Food Restaurant,Gym,Pet Store,Board Shop,Brewery
4,Pizza Place,Dessert Shop,Sandwich Place,Gym,Thai Restaurant
5,Coffee Shop,Pizza Place,Park,Café,Bakery
6,Pizza Place,Coffee Shop,Italian Restaurant,Sandwich Place,Restaurant
7,Sandwich Place,Café,Coffee Shop,BBQ Joint,History Museum
8,Café,Bakery,Bookstore,Bar,Japanese Restaurant
9,Café,Coffee Shop,Bar,Vietnamese Restaurant,Vegetarian / Vegan Restaurant


Finally, we see that cluster 2 is characterized by cafés, pizza places, coffee shops, restaurants, and pharmacies.

In [61]:
categories = []
for col in cluster2_cats.iloc[:,1:]:
    col_venues = cluster2_cats[col].tolist()
    categories += col_venues
cluster2_cats=pd.DataFrame(categories)
cluster2_cats.value_counts().head()

Café           7
Pizza Place    6
Coffee Shop    6
Restaurant     5
Pharmacy       4
dtype: int64