# Segmenting and Clustering Neighborhoods in TORONTO, CANADA

The Battle of Neighborhoods - Week 1

OBJECTIVE I: Explore, segment, and cluster the neighborhoods in the city of Toronto based on the postal code and borough information

1. IDENTIFY WIKIPEDIA PAGE CONTAINING TORONTO NEIGHBORHOOD DATA (Postal Code, Borough, Neighborhood)
2. IMPORT APPLICABLE LIBRARIES
3. RETRIEVE URL, CREATE BEAUTIFUL SOUP OBJECT
4. OBTAIN DATA AND TRANSFORM INTO STRUCTURED FORMAT
        A) SCRAPE WIKIPEDIA PAGE, WRANGLE DATA, CLEAN
            i) create empty list
            ii) find table data
            iii) create dictionary having 3 keys (Postal Code, Borough, Neighborhood) and append to list
                WEBSCRAPING TIPS:
                    -extract 3-character postal code using tablerow.p.text
                    -obtain Borough and Neighborhood data using split, strip, and replace functions
        B) READ LIST INTO STRUCTURED FORMAT --> pandas dataframe


***NOTE***
The BeautifulSoup package is an alternative option for more complicated webscraping <br>
-- Main documentation page:  https://beautiful-soup-4.readthedocs.io/en/latest/#kinds-of-objects

1. IDENTIFY WIKIPEDIA PAGE CONTAINING TORONTO NEIGHBORHOOD DATA (Postal Code, Borough, Neighborhood) https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

2. IMPORT APPLICABLE LIBRARIES
3. RETRIEVE URL, CREATE BEAUTIFUL SOUP OBJECT

In [1]:
from bs4 import BeautifulSoup
import pandas as pd

import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')   #name the parser explicitly so results do not depend on which parsers are installed

print('Libraries imported, successful url retrieval.')

Libraries imported, successful url retrieval.


4) OBTAIN DATA AND TRANSFORM INTO STRUCTURED FORMAT

<blockquote>
<p> A) SCRAPE WIKIPEDIA PAGE, WRANGLE AND CLEAN DATA </p>
<blockquote>
           <p> -only process cells having an assigned borough; ignore boroughs that are "Not assigned" </p>
           <p> -merge postal codes listed twice (combine into one row with neighborhoods separated by comma) </p>
           <p> -for cells having a Borough and a "Not assigned" neighborhood, the neighborhood = the borough </p>
</blockquote>
</blockquote>
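The three cleaning rules above can be sketched in pandas on a few toy rows. This is only an illustration under stated assumptions: the rows are made up to exercise each rule, and `clean_postal_table` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows standing in for the scraped table (illustrative values only)
raw = pd.DataFrame({
    'PostalCode': ['M1A', 'M3A', 'M5A', 'M5A', 'M7A'],
    'Borough': ['Not assigned', 'North York', 'Downtown Toronto',
                'Downtown Toronto', "Queen's Park"],
    'Neighborhood': ['Not assigned', 'Parkwoods', 'Harbourfront',
                     'Regent Park', 'Not assigned'],
})

def clean_postal_table(df):
    """Apply the three cleaning rules to a raw postal code table."""
    # Rule 1: keep only rows that have an assigned borough
    out = df[df['Borough'] != 'Not assigned'].copy()
    # Rule 3: an unassigned neighborhood takes the name of its borough
    mask = out['Neighborhood'] == 'Not assigned'
    out.loc[mask, 'Neighborhood'] = out.loc[mask, 'Borough']
    # Rule 2: merge repeated postal codes into one comma-separated row
    out = (out.groupby(['PostalCode', 'Borough'], sort=False)['Neighborhood']
              .agg(', '.join)
              .reset_index())
    return out

cleaned = clean_postal_table(raw)
print(cleaned)
```

Applying rule 3 before rule 2 keeps the borough name from being joined with the literal text "Not assigned".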

In [2]:
#CREATE EMPTY LIST
table_contents=[]

#FIND TABLE DATA IN SOUP OBJECT
table=soup.find('table')

#CREATE DICTIONARY HAVING 3 KEYS AND APPEND TO LIST (SKIP CELLS THAT ARE 'Not assigned')
for row in table.find_all('td'):
    if row.span.text == 'Not assigned':
        continue
    cell = {}
    cell['PostalCode'] = row.p.text[:3]
    cell['District'] = (row.span.text).split('(')[0]
    cell['Neighbourhood'] = (row.span.text).split('(')[1].strip(')').replace(' /',',').replace(')',' ').strip(' ')
    table_contents.append(cell)
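The split/strip/replace chain is easier to follow when it is applied to a single cell in isolation. The sketch below reuses the same string operations on plain strings; `parse_cell` is a hypothetical helper, the sample strings only imitate the shape of the Wikipedia cells, and the fallback for a cell without parentheses is an assumption.

```python
def parse_cell(postal_text, span_text):
    """Split the text of one table cell into PostalCode, District and Neighbourhood."""
    postal_code = postal_text.strip()[:3]      # first 3 characters of the <p> text
    parts = span_text.split('(')
    district = parts[0].strip()                # text before the opening parenthesis
    if len(parts) > 1:
        # text inside the parentheses: drop ')' and turn ' /' separators into commas
        neighbourhood = (parts[1].strip(')')
                                 .replace(' /', ',')
                                 .replace(')', ' ')
                                 .strip(' '))
    else:
        # assumption: no parentheses means no neighbourhood listed, so reuse the district
        neighbourhood = district
    return {'PostalCode': postal_code,
            'District': district,
            'Neighbourhood': neighbourhood}

# illustrative strings shaped like one cell of the table
cell = parse_cell('M5A\n', 'Downtown Toronto(Regent Park / Harbourfront)')
print(cell)
```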

<blockquote>
<p>B)  READ LIST INTO STRUCTURED FORMAT --> pandas dataframe </p>
<blockquote>
<p>**use the .shape attribute to print the number of rows and columns in the dataframe**</p>
</blockquote>
</blockquote>

In [3]:
#read list into dataframe
Torontodf=pd.DataFrame(table_contents)
Torontodf['District']=Torontodf['District'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
#drop null values
Torontodf=Torontodf.dropna()
#reset index, discarding the original index
draftTorontodf = Torontodf.reset_index(drop=True)

In [22]:
Torontodf.shape

(103, 5)

OBJECTIVE II: Obtain latitude and longitude coordinates of each neighborhood (required to utilize FourSquare)

1. IMPORT APPLICABLE LIBRARIES <br>
    -pgeocode (for documentation -> https://pypi.org/project/pgeocode/)
2. USING POSTAL CODE DATA FROM THE **draftTorontodf** DATAFRAME AND PGEOCODE, GENERATE A DATAFRAME OF LATITUDE AND LONGITUDE COORDINATES
3. MERGE Torontodf AND THE LATITUDE/LONGITUDE DATAFRAME TO UTILIZE FOURSQUARE LOCATION DATA

In [5]:
!pip install pgeocode



In [6]:
import pgeocode

In [23]:
#select the PostalCode column and convert it to a list
postal_code = draftTorontodf['PostalCode']
pcode = postal_code.tolist()

#use postal code list to generate latitude and longitude dataframe (LatLong) from geo-localisation query
nomi = pgeocode.Nominatim('ca')
pcquery = nomi.query_postal_code(pcode)
latlong = pcquery[['postal_code','latitude', 'longitude']]
latlong = latlong.dropna()
latlong

#latlong.to_clipboard(sep=',')    OPTIONAL (to copy to clipboard)


Unnamed: 0,postal_code,latitude,longitude
0,M3A,43.7545,-79.3300
1,M4A,43.7276,-79.3148
2,M5A,43.6555,-79.3626
3,M6A,43.7223,-79.4504
4,M7A,43.6641,-79.3889
...,...,...,...
98,M8X,43.6518,-79.5076
99,M4Y,43.6656,-79.3830
100,M7Y,43.7804,-79.2505
101,M8Y,43.6325,-79.4939


In [24]:
# verify that dataframes are of equal length; identify discrepancies (postal codes that appear only in draftTorontodf)
if len(draftTorontodf) == len(latlong):
    print("Dataframe length =" + str(len(draftTorontodf)))
    Torontodf = draftTorontodf
else:
    print("Postal Codes only in Torontodf:")

set(draftTorontodf.PostalCode).difference(latlong.postal_code)

Postal Codes only in Torontodf:


{'M7R'}
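A pandas alternative to the set comparison is a left merge with `indicator=True`, which flags every row that found no match. The sketch below uses three toy postal codes (coordinates copied from the output above); the frame names `left`, `right` and `audit` are illustrative only.

```python
import pandas as pd

# Toy stand-ins: 'left' plays the role of the postal code table,
# 'right' the coordinate table that is missing one code
left = pd.DataFrame({'PostalCode': ['M3A', 'M4A', 'M7R']})
right = pd.DataFrame({'PostalCode': ['M3A', 'M4A'],
                      'Latitude': [43.7545, 43.7276],
                      'Longitude': [-79.3300, -79.3148]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left merge
audit = left.merge(right, on='PostalCode', how='left', indicator=True)

# postal codes without coordinates
unmatched = audit.loc[audit['_merge'] == 'left_only', 'PostalCode'].tolist()
print(unmatched)
```

Because the result keeps the full rows, the unmatched records can be inspected or filtered in the same step.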

In [25]:
#identify row number(s) of postal codes found only in draftTorontodf
draftTorontodf.loc[draftTorontodf['PostalCode']=='M7R']


Unnamed: 0,PostalCode,District,Neighbourhood
76,M7R,Mississauga,Enclave of L4W


In [10]:
#OPTIONAL
#postal_code.to_clipboard(sep=',')    
#Torontodf.to_clipboard(sep=',')
#run Line 2 or 3 to copy to clipboard for pasting into spreadsheet

# RUN THE .drop CELL BELOW ONCE, THEN COMMENT IT OUT. THE LENGTH OF ALL 3 DATAFRAMES MUST EQUAL 102.

In [26]:
#drop the identified postal code (M7R) from the draft dataframe, then verify that dataframes are of equal length
#(filtering on PostalCode instead of row label 76 targets the right row and can be executed repeatedly)
Torontodf = draftTorontodf[draftTorontodf['PostalCode'] != 'M7R'].reset_index(drop=True)

#ROWS WITH NaN LATITUDE MUST BE ELIMINATED TO GENERATE THE MAP.

In [27]:
#verify that dataframes are of equal length
if len(Torontodf) == len(latlong):
    print("Dataframe length =" + str(len(Torontodf)))
    
#if dataframes are not equal, run the cell above that drops the unmatched postal code

Dataframe length =102


In [28]:
#compare column headings of common column and rename so that names are identical
latlong.rename(columns = {'postal_code':'PostalCode', 'latitude':'Latitude', 'longitude':'Longitude'}, inplace = True)
#merge 2 dataframes on common column (PostalCode)
Torontodf = Torontodf.merge(latlong, on = 'PostalCode', how = 'left')   

In [29]:
print('Toronto is comprised of {} districts and {} neighbourhoods.'.format(
len(Torontodf['District'].unique()),
Torontodf.shape[0]))

Toronto is comprised of 14 districts and 102 neighbourhoods.


In [15]:
#OPTIONAL
#Torontodf.to_clipboard(sep=',')    

#run Line 2 to copy to clipboard for pasting into spreadsheet

OBJECTIVE III: Explore and cluster neighborhoods in Toronto.

1. IMPORT/INSTALL APPLICABLE LIBRARIES
2. USING GEOPY LIBRARY, OBTAIN COORDINATES OF TORONTO.  USE FOLIUM TO GENERATE MAP OF TORONTO WITH THE OBTAINED COORDINATES.
3. UTILIZE FOURSQUARE API TO OBTAIN NEIGHBORHOOD/VENUE DATA.
4. PERFORM SUMMARY ANALYSIS OF DATA ACQUIRED VIA FOURSQUARE.
5. CONDUCT DETAILED REVIEW OF VENUE DATA BY NEIGHBORHOOD.
6. IDENTIFY MOST COMMON VENUES
7. CLUSTER NEIGHBOURHOODS (run k-means).

1. IMPORT/INSTALL APPLICABLE LIBRARIES
        -geopy library  
        -folium 
        -requests 
        -json_normalize (transform JSON file into a pandas dataframe)
        -numpy
        -matplotlib.cm
        -matplotlib.colors

In [30]:
#install geopy library
!pip install geopy    
from geopy.geocoders import Nominatim #convert address into latitude, longitude values

#!conda install -c conda-forge folium=0.5.0 (uncomment if no Foursquare API work completed)
import folium   #map rendering library

import requests   #library to handle requests
from pandas import json_normalize #transform JSON file into a pandas dataframe (pandas 1.0 or later)

import numpy as np #library to handle data in a vectorized manner

from sklearn.cluster import KMeans   #import kmeans for clustering stage

#Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

print('Libraries imported')

Libraries imported


2. USING GEOPY LIBRARY, OBTAIN COORDINATES OF TORONTO.  USE FOLIUM TO GENERATE MAP OF TORONTO WITH THE OBTAINED COORDINATES.

In [31]:
#use geopy library to obtain Toronto coordinates
address = 'Toronto, ON'

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Toronto geographical coordinates:  LATITUDE {}, LONGITUDE {}'.format(latitude, longitude))


Toronto geographical coordinates:  LATITUDE 43.6534817, LONGITUDE -79.3839347


# ONLY RUN CELL BELOW IF ['Latitude_x'] & ['Longitude_x'] columns appear in Torontodf

In [34]:
Torontodf['Latitude'] = Torontodf['Latitude_x']
Torontodf['Longitude'] = Torontodf['Longitude_x']
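The `Latitude_x` / `Longitude_x` columns typically appear when the merge on PostalCode is executed more than once, because pandas appends `_x` and `_y` suffixes to overlapping column names. The sketch below reproduces this on two toy rows and shows one possible guard; `merge_coords` is a hypothetical helper, not part of this notebook.

```python
import pandas as pd

places = pd.DataFrame({'PostalCode': ['M1B', 'M1C']})
coords = pd.DataFrame({'PostalCode': ['M1B', 'M1C'],
                       'Latitude': [43.8113, 43.7878],
                       'Longitude': [-79.193, -79.1564]})

once = places.merge(coords, on='PostalCode', how='left')
twice = once.merge(coords, on='PostalCode', how='left')   # same cell executed again
print(list(twice.columns))   # the overlapping names now carry _x / _y suffixes

def merge_coords(df, coords):
    """Drop any existing coordinate columns, then merge, so repeated runs give the same result."""
    stale = [c for c in df.columns if c.split('_')[0] in ('Latitude', 'Longitude')]
    return df.drop(columns=stale).merge(coords, on='PostalCode', how='left')

fixed = merge_coords(twice, coords)
print(list(fixed.columns))
```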

In [35]:
#use folium to create map of Toronto with coordinates
map_Toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

#add markers to map
for lat, lng, district, neighbourhood in zip(Torontodf['Latitude'], Torontodf['Longitude'], Torontodf['District'], Torontodf['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Toronto)
    
map_Toronto

3. UTILIZE FOURSQUARE API TO EXPLORE TORONTO's Scarborough DISTRICT AND ITS NEIGHBORHOODS.
    <ol>
    <li>use dataframe created in Part 1 (Torontodf) to generate dataframe of location data for Scarborough District</li>
    <li>generate map of Scarborough with coordinates to visualize neighborhoods in district</li>
    <li>define FourSquare credentials and Version</li>
    <li>create a dataframe of Scarborough venues</li>
    </ol>

In [36]:
#create Scarborough dataframe (Scarbdata) from Torontodf
ScarbData = Torontodf[Torontodf['District'] == "Scarborough"].reset_index(drop = True)
ScarbData.head()

Unnamed: 0,PostalCode,District,Neighbourhood,Latitude_x,Longitude_x,Latitude_y,Longitude_y,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193,43.8113,-79.193,43.8113,-79.193
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7878,-79.1564,43.7878,-79.1564,43.7878,-79.1564
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7678,-79.1866,43.7678,-79.1866,43.7678,-79.1866
3,M1G,Scarborough,Woburn,43.7712,-79.2144,43.7712,-79.2144,43.7712,-79.2144
4,M1H,Scarborough,Cedarbrae,43.7686,-79.2389,43.7686,-79.2389,43.7686,-79.2389


In [37]:
#OPTIONAL
#ScarbData.to_clipboard(sep=',')  
#run Line 2 to copy to clipboard for pasting into spreadsheet

In [38]:
#use geopy library to obtain Scarborough coordinates
address = "Scarborough, Toronto, ON"

geolocator = Nominatim(user_agent="ca_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('Scarborough geographical coordinates:  LATITUDE {}, LONGITUDE {}'.format(latitude, longitude))


Scarborough geographical coordinates:  LATITUDE 43.7729744, LONGITUDE -79.2576479


In [39]:
#create map of Scarborough using coordinates
map_Scarborough = folium.Map(location=[latitude, longitude], zoom_start=10)

#add markers to map
for lat, lng, district, neighbourhood in zip(ScarbData['Latitude'], ScarbData['Longitude'], ScarbData['District'], ScarbData['Neighbourhood']):
    label = '{}, {}'.format(neighbourhood, district)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Scarborough)
    
map_Scarborough

In [40]:
#define Foursquare credentials and version (replace the placeholders with your own values)

CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN' # your Foursquare access token
VERSION = '20180604' #Foursquare API version
LIMIT = 100     #default Foursquare API limit value
print('Your credentials:')
print('CLIENT_ID',":" + CLIENT_ID)
print('CLIENT SECRET',":" + CLIENT_SECRET)

Your credentials:
CLIENT_ID :YOUR_CLIENT_ID
CLIENT SECRET :YOUR_CLIENT_SECRET


In [41]:
#create a function to get desired venue information for all the neighborhoods in Scarborough (use radius of 2000 meters = 1.24 miles)
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        #create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        
        #make the GET request, review JSON and identify keys containing relevant information (results)
        results = requests.get(url).json()
        results = results['response']['groups'][0]['items']
        
        #use JSON results to create a list of venues (venues_list) -> return only relevant information for each nearby venue 
        venues_list.append([(
            name,
            lat, 
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
        
    #create dataframe (nearby_venues) from venues_list
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    
    #establish nearby_venues dataframe column names
    nearby_venues.columns = ['Neighbourhood',
            'Neighbourhood Latitude', 
            'Neighbourhood Longitude',
            'Venue',
            'Venue Latitude',
            'Venue Longitude',
            'Venue Category']
    
    return nearby_venues
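The JSON handling inside getNearbyVenues can be tried without calling the API by feeding it a mock payload. The sketch below assumes only the key path used above (`response` → `groups[0]` → `items`); `extract_venues` is a hypothetical helper, the first venue is copied from the output table in this notebook, and the second venue (with no category) is invented to show why a guard on `categories` may be useful.

```python
def extract_venues(name, lat, lng, payload):
    """Flatten one explore-style payload into (neighbourhood, venue) tuples."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # guard: a venue without categories would raise IndexError on [0]
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# mock payload mimicking the structure read by getNearbyVenues (not a real API response)
mock_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Toronto Zoo',
               'location': {'lat': 43.820582, 'lng': -79.181551},
               'categories': [{'name': 'Zoo'}]}},
    {'venue': {'name': 'Hypothetical Venue',
               'location': {'lat': 43.81, 'lng': -79.19},
               'categories': []}},
]}]}}

rows = extract_venues('Malvern, Rouge', 43.8113, -79.193, mock_payload)
for row in rows:
    print(row)
```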

4. PERFORM SUMMARY ANALYSIS OF DATA ACQUIRED VIA FOURSQUARE.

In [42]:
#use getNearbyVenues function and Scarborough dataframe to create a dataframe of Scarborough venues (scarborough_venues)
scarborough_venues = getNearbyVenues(names=ScarbData['Neighbourhood'],latitudes=ScarbData['Latitude'],longitudes=ScarbData['Longitude'])
                                                  

Malvern, Rouge
Rouge Hill, Port Union, Highland Creek
Guildwood, Morningside, West Hill
Woburn
Cedarbrae
Scarborough Village
Kennedy Park, Ionview, East Birchmount Park
Golden Mile, Clairlea, Oakridge
Cliffside, Cliffcrest, Scarborough Village West
Birch Cliff, Cliffside West
Dorset Park, Wexford Heights, Scarborough Town Centre
Wexford, Maryvale
Agincourt
Clarks Corners, Tam O'Shanter, Sullivan
Milliken, Agincourt North, Steeles East, L'Amoreaux East
Steeles West, L'Amoreaux West
Upper Rouge


In [43]:
#print scarborough_venues dataframe size and view head of dataframe
print(scarborough_venues.shape)
scarborough_venues.head()

#OPTIONAL - uncomment and run Line 6 to copy to clipboard for pasting into spreadsheet
#scarborough_venues.to_clipboard(sep=',')  

(1078, 7)


Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Malvern, Rouge",43.8113,-79.193,African Rainforest Pavilion,43.817725,-79.183433,Zoo Exhibit
1,"Malvern, Rouge",43.8113,-79.193,Toronto Zoo,43.820582,-79.181551,Zoo
2,"Malvern, Rouge",43.8113,-79.193,Polar Bear Exhibit,43.823372,-79.185145,Zoo
3,"Malvern, Rouge",43.8113,-79.193,RBC Royal Bank,43.798782,-79.19709,Bank
4,"Malvern, Rouge",43.8113,-79.193,Petro-Canada,43.807831,-79.171431,Gas Station


In [45]:
#review summary details (total venues by category in Scarborough)
VenCatTotal = scarborough_venues.groupby('Venue Category').count()
VenCatTotal = VenCatTotal['Venue']
VenCatTotal

#OPTIONAL  - uncomment and run Line 7 to copy to clipboard for pasting into spreadsheet
#VenCatTotal.to_clipboard(sep=',')  

#OPTIONAL -  uncomment/run line 10 to identify District Total for a particular venue category
#VenCatTotal['BBQ Joint'] 

Venue Category
American Restaurant     6
Arts & Crafts Store     3
Asian Restaurant       10
Athletics & Sports      6
Auto Garage             4
                       ..
Women's Store           2
Xinjiang Restaurant     2
Yoga Studio             1
Zoo                     2
Zoo Exhibit            12
Name: Venue, Length: 150, dtype: int64
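The same per-category totals can also be obtained with `value_counts`. A small sketch on the five venues shown in the head of scarborough_venues above; the frame name `venues` is illustrative.

```python
import pandas as pd

# the five rows shown in the head of scarborough_venues
venues = pd.DataFrame({
    'Venue': ['African Rainforest Pavilion', 'Toronto Zoo', 'Polar Bear Exhibit',
              'RBC Royal Bank', 'Petro-Canada'],
    'Venue Category': ['Zoo Exhibit', 'Zoo', 'Zoo', 'Bank', 'Gas Station'],
})

# approach used above: group, then count one column
by_group = venues.groupby('Venue Category')['Venue'].count()

# equivalent totals; sort_index() orders them alphabetically like the groupby result
by_counts = venues['Venue Category'].value_counts().sort_index()

print(by_counts)
```

Without `sort_index()`, `value_counts` lists the most frequent categories first, which can be handy for a quick look at the dominant venue types.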

In [46]:
#review total venues returned for each neighbourhood
ScarbNHVenues = scarborough_venues.groupby('Neighbourhood').count()
ScarbNHVenues = ScarbNHVenues.rename(columns = {'Venue Category':'Total Venues'})
ScarbNHVenues['Total Venues']

Neighbourhood
Agincourt                                                    89
Birch Cliff, Cliffside West                                  40
Cedarbrae                                                   100
Clarks Corners, Tam O'Shanter, Sullivan                      84
Cliffside, Cliffcrest, Scarborough Village West              50
Dorset Park, Wexford Heights, Scarborough Town Centre        53
Golden Mile, Clairlea, Oakridge                              99
Guildwood, Morningside, West Hill                            38
Kennedy Park, Ionview, East Birchmount Park                  59
Malvern, Rouge                                               46
Milliken, Agincourt North, Steeles East, L'Amoreaux East     83
Rouge Hill, Port Union, Highland Creek                       37
Scarborough Village                                          74
Steeles West, L'Amoreaux West                                72
Upper Rouge                                                   7
Wexford, Maryvale         

In [49]:
#review # unique categories
print('There are {} unique venue categories in Scarborough.'.format(len(scarborough_venues['Venue Category'].unique())))

#OPTIONAL - to view unique categories in list, uncomment and run line 5
#scarborough_venues['Venue Category'].unique()

There are 150 unique venue categories in Scarborough.


5. CONDUCT DETAILED REVIEW OF VENUE DATA BY NEIGHBORHOOD.

In [50]:
#use one hot encoding to identify presence of each venue category by neighbourhood
scarborough_onehot = pd.get_dummies(scarborough_venues[['Venue Category']], prefix = "", prefix_sep = "")
                                                       
#add neighbourhood column back to dataframe
scarborough_onehot['Neighbourhood'] = scarborough_venues['Neighbourhood']
                                                    
#move neighbourhood column to first column  
fixed_columns  = [scarborough_onehot.columns[-1]] + list(scarborough_onehot.columns[:-1])                                                     
scarborough_onehot = scarborough_onehot[fixed_columns]    
         
#review results                                                      
scarborough_onehot.head()       

#OPTIONAL - uncomment and run line 15 to copy and paste into a spreadsheet
#scarborough_onehot.to_clipboard(sep=',')                                                     

Unnamed: 0,Neighbourhood,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Garage,Automotive Shop,BBQ Joint,Badminton Court,Bakery,...,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"Malvern, Rouge",0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [51]:
#to transpose onehot dataframe (venues by row, neighbourhoods by column)
scarb_onehotTransposed = scarborough_onehot.T
#scarb_onehotTransposed

In [52]:
#OPTIONAL - uncomment/run line 2 to copy/paste into spreadsheet
#scarborough_onehot.to_clipboard(sep=',') 


In [53]:
#review size of onehot dataframe
scarborough_onehot.shape

(1078, 151)

In [54]:
#group rows by neighbourhood and take the mean of each one-hot column (i.e. the frequency of occurrence of each category within the neighbourhood)
scarborough_grouped = scarborough_onehot.groupby("Neighbourhood").mean().reset_index()
scarborough_grouped

Unnamed: 0,Neighbourhood,American Restaurant,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,Auto Garage,Automotive Shop,BBQ Joint,Badminton Court,Bakery,...,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Wings Joint,Women's Store,Xinjiang Restaurant,Yoga Studio,Zoo,Zoo Exhibit
0,Agincourt,0.011236,0.0,0.022472,0.0,0.0,0.0,0.0,0.0,0.022472,...,0.0,0.0,0.022472,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Birch Cliff, Cliffside West",0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.025,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Cedarbrae,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.02,...,0.0,0.01,0.01,0.0,0.01,0.0,0.01,0.01,0.0,0.0
3,"Clarks Corners, Tam O'Shanter, Sullivan",0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,...,0.0,0.0,0.011905,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Cliffside, Cliffcrest, Scarborough Village West",0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Dorset Park, Wexford Heights, Scarborough Town...",0.018868,0.018868,0.018868,0.0,0.0,0.0,0.0,0.018868,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,"Golden Mile, Clairlea, Oakridge",0.010101,0.010101,0.0,0.0,0.0,0.010101,0.010101,0.0,0.030303,...,0.0,0.010101,0.010101,0.0,0.0,0.010101,0.0,0.0,0.0,0.0
7,"Guildwood, Morningside, West Hill",0.0,0.0,0.0,0.026316,0.0,0.026316,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,"Kennedy Park, Ionview, East Birchmount Park",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.016949,0.016949,0.033898,0.0,0.0,0.0,0.0,0.0
9,"Malvern, Rouge",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.021739,0.0,0.0,0.0,0.021739,0.0,0.0,0.043478,0.26087


In [55]:
#calculate size of resulting dataframe
scarborough_grouped.shape

(17, 151)
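The link between one-hot encoding and the grouped means can be checked by hand on a toy table: each row has a single 1, so the mean of a column within a neighbourhood is the share of its venues that fall in that category. In the sketch below the 'Malvern, Rouge' categories are the five shown earlier, while the 'Woburn' rows are invented for illustration.

```python
import pandas as pd

venues = pd.DataFrame({
    'Neighbourhood': ['Malvern, Rouge'] * 5 + ['Woburn'] * 2,
    'Venue Category': ['Zoo Exhibit', 'Zoo', 'Zoo', 'Bank', 'Gas Station',
                       'Coffee Shop', 'Park'],   # Woburn rows are hypothetical
})

# one column per category, a single 1 per row (dtype=int avoids boolean columns)
onehot = pd.get_dummies(venues['Venue Category'], dtype=int)
onehot.insert(0, 'Neighbourhood', venues['Neighbourhood'])

# mean of the 0/1 columns per neighbourhood = relative frequency of each category
freq = onehot.groupby('Neighbourhood').mean()
print(freq.round(2))
```

Here 2 of the 5 'Malvern, Rouge' venues are in the Zoo category, so its Zoo frequency is 2/5, and every row of `freq` sums to 1.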

In [56]:
#print Top 5 Most Common Venues for each Neighbourhood (i.e. highest frequency)

num_top_venues=5

for hood in scarborough_grouped['Neighbourhood']:
    print("----"+hood+"----")
    temp = scarborough_grouped[scarborough_grouped['Neighbourhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq']=temp['freq'].astype(float)
    temp = temp.round({'freq':2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt----
                venue  freq
0  Chinese Restaurant  0.12
1         Coffee Shop  0.08
2          Restaurant  0.06
3                Bank  0.03
4   Indian Restaurant  0.03


----Birch Cliff, Cliffside West----
           venue  freq
0    Coffee Shop  0.12
1  Grocery Store  0.10
2    Pizza Place  0.08
3           Park  0.05
4           Bank  0.05


----Cedarbrae----
                  venue  freq
0           Coffee Shop  0.08
1  Fast Food Restaurant  0.07
2            Restaurant  0.07
3                  Bank  0.06
4           Gas Station  0.05


----Clarks Corners, Tam O'Shanter, Sullivan----
                  venue  freq
0  Fast Food Restaurant  0.08
1                  Bank  0.06
2           Coffee Shop  0.06
3            Restaurant  0.06
4        Sandwich Place  0.05


----Cliffside, Cliffcrest, Scarborough Village West----
                  venue  freq
0              Pharmacy  0.10
1  Fast Food Restaurant  0.06
2        Ice Cream Shop  0.06
3                  Park  0.06


6. IDENTIFY MOST COMMON VENUES.

In [57]:
#create function to return most common venues 
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
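How the function treats a row can be seen on a single toy row: the first element is the neighbourhood name and is skipped, and the remaining frequencies are ranked. The sketch below is a variant under assumptions, using the Agincourt frequencies printed above; `top_categories` is a hypothetical name, and the stable sort is an added choice so that tied categories keep their column order.

```python
import pandas as pd

def top_categories(row, n):
    """Return the n category names with the highest frequency in one grouped row."""
    freqs = row.iloc[1:].astype(float)          # skip the 'Neighbourhood' entry
    ranked = freqs.sort_values(ascending=False, kind='mergesort')   # stable sort
    return ranked.index[:n].tolist()

# one toy row, values taken from the Agincourt output above
row = pd.Series({'Neighbourhood': 'Agincourt',
                 'Bank': 0.03,
                 'Chinese Restaurant': 0.12,
                 'Coffee Shop': 0.08,
                 'Restaurant': 0.06})

top3 = top_categories(row, 3)
print(top3)
```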

In [58]:
#create DataFrame of each neighbourhood's most common venues

#define top venues parameter 
num_top_venues= 10

indicators = ['st', 'nd', 'rd']

#create columns according to # of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
        
#create new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = scarborough_grouped['Neighbourhood']

for ind in np.arange(scarborough_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(scarborough_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()    

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Chinese Restaurant,Coffee Shop,Restaurant,Indian Restaurant,Supermarket,Sandwich Place,Pharmacy,Bank,Gas Station,Discount Store
1,"Birch Cliff, Cliffside West",Coffee Shop,Grocery Store,Pizza Place,Gas Station,Bank,Park,General Entertainment,Restaurant,Fast Food Restaurant,Café
2,Cedarbrae,Coffee Shop,Fast Food Restaurant,Restaurant,Bank,Sandwich Place,Gas Station,Clothing Store,Pharmacy,Department Store,Grocery Store
3,"Clarks Corners, Tam O'Shanter, Sullivan",Fast Food Restaurant,Bank,Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Gas Station,Burrito Place,Pharmacy,Chinese Restaurant
4,"Cliffside, Cliffcrest, Scarborough Village West",Pharmacy,Coffee Shop,Harbor / Marina,Park,Pizza Place,Fast Food Restaurant,Ice Cream Shop,Sandwich Place,Beach,Indian Restaurant
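The column labels above rely on a three-item suffix list plus an exception for everything else, which is fine for ten columns but would label an eleventh column "11th" only by accident of the fallback and a twenty-first as "21th". A more general way to build the labels is sketched below; `ordinal` is a hypothetical helper, not part of this notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = 'th'                      # 11th, 12th, 13th, 111th, ...
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return '{}{}'.format(n, suffix)

num_top_venues = 10
columns = ['Neighbourhood'] + ['{} Most Common Venue'.format(ordinal(i))
                               for i in range(1, num_top_venues + 1)]
print(columns)
```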


In [61]:
#create a copy of the scarb_onehotTransposed dataframe
unique = scarb_onehotTransposed.copy(deep=True)

#create a dataframe of unique venues

unique.columns = unique.iloc[0,:]
unique['Neighbourhood'] = unique.index.copy() # adds a copy of the index to the end (OK, exploration)
unique = unique[1:]

unique

Neighbourhood,"Malvern, Rouge","Malvern, Rouge.1","Malvern, Rouge.2","Malvern, Rouge.3","Malvern, Rouge.4","Malvern, Rouge.5","Malvern, Rouge.6","Malvern, Rouge.7","Malvern, Rouge.8","Malvern, Rouge.9",...,"Steeles West, L'Amoreaux West","Steeles West, L'Amoreaux West.1",Upper Rouge,Upper Rouge.1,Upper Rouge.2,Upper Rouge.3,Upper Rouge.4,Upper Rouge.5,Upper Rouge.6,Neighbourhood.1
American Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,American Restaurant
Arts & Crafts Store,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Arts & Crafts Store
Asian Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Asian Restaurant
Athletics & Sports,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Athletics & Sports
Auto Garage,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Auto Garage
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Women's Store,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Women's Store
Xinjiang Restaurant,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Xinjiang Restaurant
Yoga Studio,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Yoga Studio
Zoo,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Zoo


In [77]:
#create dataframe of venue totals
#define dataframe columns (one per neighbourhood)
NHOODcolnames = scarborough_grouped['Neighbourhood'].unique()
#instantiate dataframe
NHOODVENUES = pd.DataFrame(columns=NHOODcolnames)


#for each neighbourhood, unique[i] selects all of the venue columns that carry its name;
#summing across those columns (axis=1) gives each venue category total for that neighbourhood

for i in NHOODcolnames:
    NHOODVENUES[i] = unique[i].apply(np.sum, axis=1).to_numpy()

NHOODVENUES.index = unique.index
NHOODVENUES.head()

Unnamed: 0,Agincourt,"Birch Cliff, Cliffside West",Cedarbrae,"Clarks Corners, Tam O'Shanter, Sullivan","Cliffside, Cliffcrest, Scarborough Village West","Dorset Park, Wexford Heights, Scarborough Town Centre","Golden Mile, Clairlea, Oakridge","Guildwood, Morningside, West Hill","Kennedy Park, Ionview, East Birchmount Park","Malvern, Rouge","Milliken, Agincourt North, Steeles East, L'Amoreaux East","Rouge Hill, Port Union, Highland Creek",Scarborough Village,"Steeles West, L'Amoreaux West",Upper Rouge,"Wexford, Maryvale",Woburn
American Restaurant,1,0,1,0,0,1,1,0,0,0,0,0,0,0,0,2,0
Arts & Crafts Store,0,0,0,1,0,1,1,0,0,0,0,0,0,0,0,0,0
Asian Restaurant,2,1,0,0,1,1,0,0,0,0,3,0,0,1,0,1,0
Athletics & Sports,0,0,1,0,0,0,0,1,0,0,0,0,0,2,0,1,1
Auto Garage,0,0,1,0,1,0,0,0,0,0,0,0,1,0,0,0,1


7. CLUSTER NEIGHBOURHOODS (run k-means).

In [66]:
#set number of clusters
kclusters = 5

scarborough_grouped_clustering = scarborough_grouped.drop(columns='Neighbourhood')

#run kmeans clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(scarborough_grouped_clustering)

#check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]


array([0, 4, 2, 2, 2, 2, 2, 4, 2, 3])
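For intuition about what the cluster labels mean, the assign-and-update loop at the heart of k-means can be written in a few lines of NumPy. This is a bare-bones sketch of Lloyd's algorithm on toy 2-D points, not scikit-learn's implementation (which adds smarter initialisation and multiple restarts); `lloyd_kmeans` and the sample points are hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialise centers with k distinct data points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: each center moves to the mean of its points
        # (a center that lost all its points stays where it is)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated toy groups of points
X = [[0, 0], [0, 0.1], [0.1, 0], [5, 5], [5, 5.1], [5.1, 5]]
labels, centers = lloyd_kmeans(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; only the grouping matters, which is why the cluster ids in the output above carry no ordering.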

In [67]:
#add clustering labels (the guard avoids the "cannot insert Cluster Labels, already exists" error when the cell is executed twice)
if 'Cluster Labels' not in neighborhoods_venues_sorted.columns:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

In [134]:
df.drop (['your_column_name'], axis=1, inplace=True)

Unnamed: 0,PostalCode,District,Neighbourhood,Longitude_x,Latitude_y,Longitude_y,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",-79.193,43.8113,-79.193,43.8113,-79.193,3,Zoo Exhibit,Fast Food Restaurant,Restaurant,Gas Station,Trail,Bus Station,Zoo,Pizza Place,Coffee Shop,Caribbean Restaurant
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",-79.1564,43.7878,-79.1564,43.7878,-79.1564,4,Coffee Shop,Pizza Place,Grocery Store,Park,Hotel,Bank,Trail,Breakfast Spot,Neighborhood,Shopping Mall
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",-79.1866,43.7678,-79.1866,43.7678,-79.1866,4,Pizza Place,Coffee Shop,Pharmacy,Grocery Store,Hotel,Fast Food Restaurant,Bank,Sports Bar,Business Service,Sandwich Place
3,M1G,Scarborough,Woburn,-79.2144,43.7712,-79.2144,43.7712,-79.2144,2,Coffee Shop,Fast Food Restaurant,Park,Sandwich Place,Supermarket,Pizza Place,Chinese Restaurant,Pharmacy,Thrift / Vintage Store,Bank
4,M1H,Scarborough,Cedarbrae,-79.2389,43.7686,-79.2389,43.7686,-79.2389,2,Coffee Shop,Fast Food Restaurant,Restaurant,Bank,Sandwich Place,Gas Station,Clothing Store,Pharmacy,Department Store,Grocery Store
5,M1J,Scarborough,Scarborough Village,-79.2323,43.7464,-79.2323,43.7464,-79.2323,2,Fast Food Restaurant,Coffee Shop,Pharmacy,Sandwich Place,Pizza Place,Beer Store,Bank,Ice Cream Shop,Indian Restaurant,Discount Store
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",-79.2639,43.7298,-79.2639,43.7298,-79.2639,2,Fast Food Restaurant,Coffee Shop,Sandwich Place,Chinese Restaurant,Burger Joint,Wings Joint,Breakfast Spot,Burrito Place,Pizza Place,Beer Store
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",-79.2843,43.7122,-79.2843,43.7122,-79.2843,2,Fast Food Restaurant,Coffee Shop,Sandwich Place,Burger Joint,Clothing Store,Bank,Bakery,Pizza Place,Furniture / Home Store,Grocery Store
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",-79.2312,43.7247,-79.2312,43.7247,-79.2312,2,Pharmacy,Coffee Shop,Harbor / Marina,Park,Pizza Place,Fast Food Restaurant,Ice Cream Shop,Sandwich Place,Beach,Indian Restaurant
9,M1N,Scarborough,"Birch Cliff, Cliffside West",-79.2646,43.6952,-79.2646,43.6952,-79.2646,4,Coffee Shop,Grocery Store,Pizza Place,Gas Station,Bank,Park,General Entertainment,Restaurant,Fast Food Restaurant,Café


In [145]:
scarborough_merged = ScarbData
#join neighborhoods_venues_sorted onto ScarbData so each neighbourhood row carries its latitude/longitude alongside the sorted venues
scarborough_merged = scarborough_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on= 'Neighbourhood')

#drop the duplicate coordinate columns left over from earlier merges
scarborough_merged.drop(['Longitude_x', 'Longitude_y', 'Latitude_x', 'Latitude_y'], axis=1, inplace = True)
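
The join-then-drop pattern in the cell above can be sketched on toy data. The two frames below are made-up stand-ins for ScarbData and neighborhoods_venues_sorted (the names `scarb_data` and `venues_sorted` and the `Latitude_y` leftover are assumptions for illustration); the point is only how `join` matches a column on the left against the index on the right, and how one `drop` call removes the leftover columns.

```python
import pandas as pd

# Toy stand-ins for ScarbData and neighborhoods_venues_sorted (illustrative only).
scarb_data = pd.DataFrame({
    "Neighbourhood": ["Woburn", "Cedarbrae"],
    "Latitude": [43.7712, 43.7686],
    "Longitude": [-79.2144, -79.2389],
})
venues_sorted = pd.DataFrame({
    "Neighbourhood": ["Cedarbrae", "Woburn"],
    "Latitude_y": [43.7686, 43.7712],   # leftover duplicate from an earlier merge
    "Cluster Labels": [2, 2],
    "1st Most Common Venue": ["Coffee Shop", "Coffee Shop"],
})

# join() matches the left frame's 'Neighbourhood' column against the right frame's index.
merged = scarb_data.join(venues_sorted.set_index("Neighbourhood"), on="Neighbourhood")

# Drop the leftover coordinate column; drop() without inplace returns a new frame,
# so scarb_data itself is left unchanged.
merged = merged.drop(columns=["Latitude_y"])

print(merged.columns.tolist())
```

The row order of the left frame is preserved, so the result lines up with the original neighbourhood list.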

In [146]:
scarborough_merged

Unnamed: 0,PostalCode,District,Neighbourhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,M1B,Scarborough,"Malvern, Rouge",43.8113,-79.193,3,Zoo Exhibit,Fast Food Restaurant,Restaurant,Gas Station,Trail,Bus Station,Zoo,Pizza Place,Coffee Shop,Caribbean Restaurant
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek",43.7878,-79.1564,4,Coffee Shop,Pizza Place,Grocery Store,Park,Hotel,Bank,Trail,Breakfast Spot,Neighborhood,Shopping Mall
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.7678,-79.1866,4,Pizza Place,Coffee Shop,Pharmacy,Grocery Store,Hotel,Fast Food Restaurant,Bank,Sports Bar,Business Service,Sandwich Place
3,M1G,Scarborough,Woburn,43.7712,-79.2144,2,Coffee Shop,Fast Food Restaurant,Park,Sandwich Place,Supermarket,Pizza Place,Chinese Restaurant,Pharmacy,Thrift / Vintage Store,Bank
4,M1H,Scarborough,Cedarbrae,43.7686,-79.2389,2,Coffee Shop,Fast Food Restaurant,Restaurant,Bank,Sandwich Place,Gas Station,Clothing Store,Pharmacy,Department Store,Grocery Store
5,M1J,Scarborough,Scarborough Village,43.7464,-79.2323,2,Fast Food Restaurant,Coffee Shop,Pharmacy,Sandwich Place,Pizza Place,Beer Store,Bank,Ice Cream Shop,Indian Restaurant,Discount Store
6,M1K,Scarborough,"Kennedy Park, Ionview, East Birchmount Park",43.7298,-79.2639,2,Fast Food Restaurant,Coffee Shop,Sandwich Place,Chinese Restaurant,Burger Joint,Wings Joint,Breakfast Spot,Burrito Place,Pizza Place,Beer Store
7,M1L,Scarborough,"Golden Mile, Clairlea, Oakridge",43.7122,-79.2843,2,Fast Food Restaurant,Coffee Shop,Sandwich Place,Burger Joint,Clothing Store,Bank,Bakery,Pizza Place,Furniture / Home Store,Grocery Store
8,M1M,Scarborough,"Cliffside, Cliffcrest, Scarborough Village West",43.7247,-79.2312,2,Pharmacy,Coffee Shop,Harbor / Marina,Park,Pizza Place,Fast Food Restaurant,Ice Cream Shop,Sandwich Place,Beach,Indian Restaurant
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.6952,-79.2646,4,Coffee Shop,Grocery Store,Pizza Place,Gas Station,Bank,Park,General Entertainment,Restaurant,Fast Food Restaurant,Café


Visualize the resulting clusters
(required libraries: matplotlib.cm as cm, matplotlib.colors as colors)

In [147]:
#create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

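#set cluster color scheme: one hex colour per cluster, sampled evenly from matplotlib's rainbow colormap
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]  #only len(ys) (= kclusters) is used below
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

#add markers to the map
#NOTE: labels run 0..kclusters-1, so rainbow[cluster-1] wraps label 0 around to the last colour in the list
markers_colors = []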
for lat, lon, poi, cluster in zip(scarborough_merged['Latitude'], scarborough_merged['Longitude'], scarborough_merged['Neighbourhood'], scarborough_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

***Make this notebook trusted to load map: File --> Trust Notebook***
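
One detail of the marker loop is worth spelling out: the cluster labels run from 0 to 4, but the colour is looked up with `rainbow[cluster-1]`, so label 0 wraps around to the last entry of the list. The sketch below uses plain colour names as stand-ins for the five hex values (an assumption, chosen to match the colours named in the cluster headings later in this notebook) to show the resulting mapping.

```python
# Stand-in palette: approximate names for five evenly spaced samples of a
# rainbow colormap, ordered low to high. The names are assumptions for
# illustration; the notebook itself stores hex strings in `rainbow`.
palette = ["purple", "blue", "green", "yellow", "red"]

kclusters = len(palette)

# Same indexing as the marker loop: colour = palette[label - 1].
# For label 0 the index is -1, which Python resolves to the last element.
label_to_colour = {label: palette[label - 1] for label in range(kclusters)}

for label, colour in label_to_colour.items():
    print(f"Cluster {label + 1} of {kclusters} (label {label}): {colour}")
```

This is why the first cluster (label 0) is drawn in red rather than in the first colour of the list.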

<li>Examine each cluster and review statistics to determine discriminating venue categories that distinguish each cluster.</li>
<li>Assign a name to each cluster based on defining categories.</li>

In [148]:
#create a dictionary of the total venues by neighbourhood
ScarbNHVenueTotals=dict(ScarbNHVenues['Total Venues'])
ScarbNHVenueTotals

{'Agincourt': 89,
 'Birch Cliff, Cliffside West': 40,
 'Cedarbrae': 100,
 "Clarks Corners, Tam O'Shanter, Sullivan": 84,
 'Cliffside, Cliffcrest, Scarborough Village West': 50,
 'Dorset Park, Wexford Heights, Scarborough Town Centre': 53,
 'Golden Mile, Clairlea, Oakridge': 99,
 'Guildwood, Morningside, West Hill': 38,
 'Kennedy Park, Ionview, East Birchmount Park': 59,
 'Malvern, Rouge': 46,
 "Milliken, Agincourt North, Steeles East, L'Amoreaux East": 83,
 'Rouge Hill, Port Union, Highland Creek': 37,
 'Scarborough Village': 74,
 "Steeles West, L'Amoreaux West": 72,
 'Upper Rouge': 7,
 'Wexford, Maryvale': 89,
 'Woburn': 58}

In [149]:
#create a function to provide additional statistics (venue totals, % of neighborhood total, etc) for each Cluster

def cluster_stats(C):
    #create list  of dictionary keys for cdict (dictionary containing Top 5 Venue Total dictionaries)
    Ckeys = ['1st - Total Venues', '2nd - Total Venues', '3rd - Total Venues', '4th - Total Venues','5th - Total Venues']

    #create list  of dictionary keys for pctofnhdict (dictionary containing % of neighborhood dictionaries)
    pctofnhkeys = ['1st - % of Neighbourhood Total', '2nd - % of Neighbourhood Total', '3rd - % of Neighbourhood Total', '4th - % of Neighbourhood Total', '5th - % of Neighbourhood Total']

    #create list  of dictionary keys for pctofdistrictdict (dictionary containing % of district dictionaries)
    pctofdistrictkeys = ['1st - % of District Total', '2nd - % of District Total', '3rd - % of District Total', '4th - % of District Total', '5th - % of District Total']

    Dkeys = ['1st - District Total', '2nd - District Total', '3rd - District Total', '4th - District Total','5th - District Total']

    NHVenueTotals = []
    clistofdicts = []
    listofnhpctdicts = []
    pctofdistrictlistofdicts = []

    for i in C.index:
        top5totals = []
        DistrictCatTotals = []
        PctofNH = []     #NH = Neighborhood
        PctofDistrict = []
    
        clusterneighbourhood = C.loc[i, 'Neighbourhood']  #identifies Neighbourhood in row
        nhtotal = ScarbNHVenueTotals[clusterneighbourhood]  #total venues in this neighbourhood
        NHVenueTotals.append(nhtotal)  #To create list of venue totals by neighbourhood
        for col in range(2,7):  #positions 2-6 hold the 1st-5th Most Common Venue columns
            clusterneighbourhoodvenue = C.loc[i,:].values[col] #identifies Neighbourhood Venue Category in row
            top5totals.append(NHOODVENUES.T.loc[clusterneighbourhood,clusterneighbourhoodvenue].sum())
            DistrictCatTotals.append(VenCatTotal[clusterneighbourhoodvenue]) #identify District Total of venue category (for % of District Calc)

        #compute both percentages once, after all five totals have been collected
        PctofNH = pd.Series(top5totals) / nhtotal * 100
        PctofDistrict = pd.Series(top5totals)/pd.Series(DistrictCatTotals)*100
        
        cdict = dict(zip(Ckeys,top5totals))          
        #create a list of venue category totals dictionaries (NOTE - appending dataframe via .append is not recommended/significant use of resources)
        clistofdicts.append(cdict)
    
        pctofnhdict = dict(zip(pctofnhkeys, PctofNH))
        #create list of "percent of neighborhood" dictionaries
        listofnhpctdicts.append(pctofnhdict)
  
        pctofdistrictdict = dict(zip(pctofdistrictkeys,PctofDistrict))
        #create list of "percent of district" dictionaries
        pctofdistrictlistofdicts.append(pctofdistrictdict)
        
    #Create DataFrame of the % of neighborhood stats
    pctofnhdf = pd.DataFrame(listofnhpctdicts)

    #Create DataFrame of the % of district stats
    pctofdistrictdf = pd.DataFrame(pctofdistrictlistofdicts)

    #create DataFrame of cluster stats
    cdfstats = pd.DataFrame(clistofdicts)
    cdfstats.insert(0, 'Neighbourhood', C[['Neighbourhood']])
    cdfstats.insert(1, 'Cluster Labels', C[['Cluster Labels']])
    cdfstats.insert(2, 'Total Venues in Neighbourhood', NHVenueTotals)
    cdfstats.insert(3,  '1st Most Common Venue', C[['1st Most Common Venue']])
    cdfstats.insert(5,  '1st - % of Neighbourhood Total', pctofnhdf[['1st - % of Neighbourhood Total']])
    cdfstats.insert(6,  '1st - % of District Total', pctofdistrictdf[['1st - % of District Total']])
    cdfstats.insert(7,  '2nd Most Common Venue', C[['2nd Most Common Venue']])
    cdfstats.insert(9,  '2nd - % of Neighbourhood Total', pctofnhdf[['2nd - % of Neighbourhood Total']])
    cdfstats.insert(10,  '2nd - % of District Total', pctofdistrictdf[['2nd - % of District Total']])
    cdfstats.insert(11,  '3rd Most Common Venue', C[['3rd Most Common Venue']])
    cdfstats.insert(13,  '3rd - % of Neighbourhood Total', pctofnhdf[['3rd - % of Neighbourhood Total']])
    cdfstats.insert(14,  '3rd - % of District Total', pctofdistrictdf[['3rd - % of District Total']])
    cdfstats.insert(15,  '4th Most Common Venue', C[['4th Most Common Venue']])
    cdfstats.insert(17,  '4th - % of Neighbourhood Total', pctofnhdf[['4th - % of Neighbourhood Total']])
    cdfstats.insert(18,  '4th - % of District Total', pctofdistrictdf[['4th - % of District Total']])
    cdfstats.insert(19,  '5th Most Common Venue', C[['5th Most Common Venue']])
    cdfstats.insert(21,  '5th - % of Neighbourhood Total', pctofnhdf[['5th - % of Neighbourhood Total']])
    cdfstats.insert(22,  '5th - % of District Total', pctofdistrictdf[['5th - % of District Total']])

    #create dataframe of 6th-10th most common venues
    Etc = C.iloc[:, 7:]

    #concatenate 2 dataframes (add df of 6th-10th most common venues to end of cdfstats)
    cdfstats = pd.concat([cdfstats, Etc], axis = 1)

    return(cdfstats.T)
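
The two percentages that cluster_stats reports can be checked by hand. The sketch below redoes the calculation for Agincourt's most common venue, using figures from the outputs in this notebook (11 Chinese Restaurants out of 89 venues). The district total of 42 Chinese Restaurants is not printed anywhere in the notebook; it is inferred here from the reported 26.1905%, so treat it as an assumption. `pct` is a hypothetical helper, not part of the notebook.

```python
def pct(part, whole):
    """Return part as a percentage of whole."""
    return part / whole * 100

venue_count = 11          # Chinese Restaurants in Agincourt (1st - Total Venues)
neighbourhood_total = 89  # ScarbNHVenueTotals['Agincourt']
district_total = 42       # assumed: inferred from the reported district percentage

pct_of_neighbourhood = pct(venue_count, neighbourhood_total)
pct_of_district = pct(venue_count, district_total)

print(round(pct_of_neighbourhood, 4))  # 12.3596
print(round(pct_of_district, 4))       # 26.1905
```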

PRESENT CLUSTER INFORMATION

Cluster 1 of 5 (RED) - PACIFIC TRIANGLE

Of the 15 unique restaurant categories within the cluster, 11 are Asian/Indian.  Each neighbourhood's most common venue is the Chinese Restaurant (there are 26 to choose from, comprising about 11% of the cluster's 244 total venues).
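
The Chinese Restaurant figures can be reproduced from the statistics table shown for this cluster further down. The sketch below simply adds up the "1st - Total Venues" and "Total Venues in Neighbourhood" rows; the dictionary `cluster1` is a hand-copied stand-in, not a variable from the notebook. The share comes out a little under 11%.

```python
# (Chinese Restaurant count, total venues) per neighbourhood, copied from the
# "1st - Total Venues" and "Total Venues in Neighbourhood" rows for this cluster.
cluster1 = {
    "Agincourt": (11, 89),
    "Milliken, Agincourt North, Steeles East, L'Amoreaux East": (10, 83),
    "Steeles West, L'Amoreaux West": (5, 72),
}

chinese_total = sum(count for count, _ in cluster1.values())
venue_total = sum(total for _, total in cluster1.values())
share = chinese_total / venue_total * 100

print(chinese_total, venue_total, round(share, 1))
```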

In [150]:
#view top 10 venues in Cluster 1
Cluster1 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 0, scarborough_merged.columns[[2] + list(range(5, scarborough_merged.shape[1]))]].reset_index(drop=True)  #NOTE: cluster_stats needs the index reset (.reset_index(drop=True)) so the cluster's rows align with the frames it builds
Cluster1

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,0,Chinese Restaurant,Coffee Shop,Restaurant,Indian Restaurant,Supermarket,Sandwich Place,Pharmacy,Bank,Gas Station,Discount Store
1,"Milliken, Agincourt North, Steeles East, L'Amo...",0,Chinese Restaurant,Coffee Shop,Pizza Place,Bubble Tea Shop,Dessert Shop,Asian Restaurant,Noodle House,Vietnamese Restaurant,Korean Restaurant,Fast Food Restaurant
2,"Steeles West, L'Amoreaux West",0,Chinese Restaurant,Coffee Shop,Fast Food Restaurant,Pharmacy,Japanese Restaurant,Sandwich Place,Park,Bank,Pizza Place,Gas Station


In [151]:
#view cluster size
Cluster1.shape

(3, 12)
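
The selection used for each cluster table can be sketched on a small frame. The frame `df` below is a cut-down stand-in for scarborough_merged with only one venue column (an assumption to keep the example short); the rows reuse values from the tables above. It shows how the column positions `[2] + list(range(5, n))` pick out Neighbourhood plus everything from Cluster Labels onward, and what `.reset_index(drop=True)` does to the index.

```python
import pandas as pd

# Toy frame with the same leading column layout as scarborough_merged.
df = pd.DataFrame({
    "PostalCode": ["M1B", "M1G", "M1H"],
    "District": ["Scarborough"] * 3,
    "Neighbourhood": ["Malvern, Rouge", "Woburn", "Cedarbrae"],
    "Latitude": [43.8113, 43.7712, 43.7686],
    "Longitude": [-79.193, -79.2144, -79.2389],
    "Cluster Labels": [3, 2, 2],
    "1st Most Common Venue": ["Zoo Exhibit", "Coffee Shop", "Coffee Shop"],
})

# Positions to keep: column 2 (Neighbourhood) plus everything from column 5 onward.
keep = df.columns[[2] + list(range(5, df.shape[1]))]

# The boolean mask selects the rows of one cluster; reset_index(drop=True)
# renumbers them 0..n-1, as in the cluster tables.
cluster = df.loc[df["Cluster Labels"] == 2, keep].reset_index(drop=True)

print(cluster)
print(cluster.shape)
```

Without the index reset, the selected rows would keep their original labels (1 and 2 here) instead of starting at 0.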

In [152]:
print("Additional Cluster Statistics:")
C = Cluster1
cluster_stats(C)

#OPTIONAL - uncomment the next line to copy the statistics to the clipboard and paste into a spreadsheet
#cluster_stats(C).to_clipboard(sep=',')    

Additional Cluster Statistics:


Unnamed: 0,0,1,2
Neighbourhood,Agincourt,"Milliken, Agincourt North, Steeles East, L'Amo...","Steeles West, L'Amoreaux West"
Cluster Labels,0,0,0
Total Venues in Neighbourhood,89,83,72
1st Most Common Venue,Chinese Restaurant,Chinese Restaurant,Chinese Restaurant
1st - Total Venues,11,10,5
1st - % of Neighbourhood Total,12.3596,12.0482,6.94444
1st - % of District Total,26.1905,23.8095,11.9048
2nd Most Common Venue,Coffee Shop,Coffee Shop,Coffee Shop
2nd - Total Venues,7,6,5
2nd - % of Neighbourhood Total,7.86517,7.22892,6.94444


Cluster 2 of 5 (PURPLE) - THE GREEN SCENE - Head to Upper Rouge for an outdoor encounter and choose from 6 unique open air categories.

In [153]:
#view top 10 venues in Cluster 2
Cluster2 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 1, scarborough_merged.columns[[2] + list(range(5, scarborough_merged.shape[1]))]].reset_index(drop=True) 
Cluster2

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Upper Rouge,1,Playground,Indian Restaurant,Farm,Trail,Golf Course,Grocery Store,Sculpture Garden,Diner,Discount Store,Dog Run


In [154]:
#view cluster size
Cluster2.shape

(1, 12)

In [155]:
#view additional cluster statistics
print("Additional Cluster Statistics:")
C = Cluster2
cluster_stats(C)

Additional Cluster Statistics:


Unnamed: 0,0
Neighbourhood,Upper Rouge
Cluster Labels,1
Total Venues in Neighbourhood,7
1st Most Common Venue,Playground
1st - Total Venues,1
1st - % of Neighbourhood Total,14.2857
1st - % of District Total,25
2nd Most Common Venue,Indian Restaurant
2nd - Total Venues,1
2nd - % of Neighbourhood Total,14.2857


Cluster 3 of 5 (BLUE) - QUICK CUISINE - You won't go hungry in this cluster of convenience.  Of the 220 venues in the Top 5 Most Common group, there are 142 quick bite options to choose from.

In [156]:
#view top 10 venues in Cluster 3
Cluster3 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 2, scarborough_merged.columns[[2] + list(range(5, scarborough_merged.shape[1]))]].reset_index(drop=True)
Cluster3

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Woburn,2,Coffee Shop,Fast Food Restaurant,Park,Sandwich Place,Supermarket,Pizza Place,Chinese Restaurant,Pharmacy,Thrift / Vintage Store,Bank
1,Cedarbrae,2,Coffee Shop,Fast Food Restaurant,Restaurant,Bank,Sandwich Place,Gas Station,Clothing Store,Pharmacy,Department Store,Grocery Store
2,Scarborough Village,2,Fast Food Restaurant,Coffee Shop,Pharmacy,Sandwich Place,Pizza Place,Beer Store,Bank,Ice Cream Shop,Indian Restaurant,Discount Store
3,"Kennedy Park, Ionview, East Birchmount Park",2,Fast Food Restaurant,Coffee Shop,Sandwich Place,Chinese Restaurant,Burger Joint,Wings Joint,Breakfast Spot,Burrito Place,Pizza Place,Beer Store
4,"Golden Mile, Clairlea, Oakridge",2,Fast Food Restaurant,Coffee Shop,Sandwich Place,Burger Joint,Clothing Store,Bank,Bakery,Pizza Place,Furniture / Home Store,Grocery Store
5,"Cliffside, Cliffcrest, Scarborough Village West",2,Pharmacy,Coffee Shop,Harbor / Marina,Park,Pizza Place,Fast Food Restaurant,Ice Cream Shop,Sandwich Place,Beach,Indian Restaurant
6,"Dorset Park, Wexford Heights, Scarborough Town...",2,Coffee Shop,Sandwich Place,Bank,Clothing Store,Gym,Soccer Field,Restaurant,Fast Food Restaurant,Bookstore,Pizza Place
7,"Wexford, Maryvale",2,Coffee Shop,Restaurant,Fast Food Restaurant,Pizza Place,Middle Eastern Restaurant,Sandwich Place,Gas Station,Burger Joint,Supermarket,American Restaurant
8,"Clarks Corners, Tam O'Shanter, Sullivan",2,Fast Food Restaurant,Bank,Restaurant,Coffee Shop,Pizza Place,Sandwich Place,Gas Station,Burrito Place,Pharmacy,Chinese Restaurant


In [157]:
#view cluster size
Cluster3.shape

(9, 12)

In [158]:
#view cluster statistics
print("Additional Cluster Statistics:")
C = Cluster3
cluster_stats(C)
#OPTIONAL - uncomment the next line to copy the statistics to the clipboard and paste into a spreadsheet
#cluster_stats(C).to_clipboard(sep=',') 

Additional Cluster Statistics:


Unnamed: 0,0,1,2,3,4,5,6,7,8
Neighbourhood,Woburn,Cedarbrae,Scarborough Village,"Kennedy Park, Ionview, East Birchmount Park","Golden Mile, Clairlea, Oakridge","Cliffside, Cliffcrest, Scarborough Village West","Dorset Park, Wexford Heights, Scarborough Town...","Wexford, Maryvale","Clarks Corners, Tam O'Shanter, Sullivan"
Cluster Labels,2,2,2,2,2,2,2,2,2
Total Venues in Neighbourhood,58,100,74,59,99,50,53,89,84
1st Most Common Venue,Coffee Shop,Coffee Shop,Fast Food Restaurant,Fast Food Restaurant,Fast Food Restaurant,Pharmacy,Coffee Shop,Coffee Shop,Fast Food Restaurant
1st - Total Venues,7,8,8,4,10,5,5,8,7
1st - % of Neighbourhood Total,12.069,8,10.8108,6.77966,10.101,10,9.43396,8.98876,8.33333
1st - % of District Total,8.04598,9.1954,11.4286,5.71429,14.2857,13.1579,5.74713,9.1954,10
2nd Most Common Venue,Fast Food Restaurant,Fast Food Restaurant,Coffee Shop,Coffee Shop,Coffee Shop,Coffee Shop,Sandwich Place,Restaurant,Bank
2nd - Total Venues,7,7,6,3,9,3,4,5,5
2nd - % of Neighbourhood Total,12.069,7,8.10811,5.08475,9.09091,6,7.54717,5.61798,5.95238


Cluster 4 of 5 (GREEN) - WILD LIFE - The zoo is the outstanding attraction for visitors to this neighbourhood.

In [159]:
#view top 10 venues in Cluster 4
Cluster4 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 3, scarborough_merged.columns[[2] + list(range(5, scarborough_merged.shape[1]))]].reset_index(drop=True)
Cluster4

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Malvern, Rouge",3,Zoo Exhibit,Fast Food Restaurant,Restaurant,Gas Station,Trail,Bus Station,Zoo,Pizza Place,Coffee Shop,Caribbean Restaurant


In [160]:
#view cluster size
Cluster4.shape

(1, 12)

In [161]:
#view cluster statistics
print("Additional Cluster Statistics:")
C = Cluster4
cluster_stats(C)

Additional Cluster Statistics:


Unnamed: 0,0
Neighbourhood,"Malvern, Rouge"
Cluster Labels,3
Total Venues in Neighbourhood,46
1st Most Common Venue,Zoo Exhibit
1st - Total Venues,12
1st - % of Neighbourhood Total,26.087
1st - % of District Total,100
2nd Most Common Venue,Fast Food Restaurant
2nd - Total Venues,4
2nd - % of Neighbourhood Total,8.69565


Cluster 5 of 5 (YELLOW) - CONVENIENCE IS THE KEY - Stock up on caffeine, groceries, gas, and cash in this cluster offering a variety of useful venues.

In [162]:
#view top 10 venues in Cluster 5
Cluster5 = scarborough_merged.loc[scarborough_merged['Cluster Labels'] == 4, scarborough_merged.columns[[2] + list(range(5, scarborough_merged.shape[1]))]].reset_index(drop=True) 
Cluster5

Unnamed: 0,Neighbourhood,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Rouge Hill, Port Union, Highland Creek",4,Coffee Shop,Pizza Place,Grocery Store,Park,Hotel,Bank,Trail,Breakfast Spot,Neighborhood,Shopping Mall
1,"Guildwood, Morningside, West Hill",4,Pizza Place,Coffee Shop,Pharmacy,Grocery Store,Hotel,Fast Food Restaurant,Bank,Sports Bar,Business Service,Sandwich Place
2,"Birch Cliff, Cliffside West",4,Coffee Shop,Grocery Store,Pizza Place,Gas Station,Bank,Park,General Entertainment,Restaurant,Fast Food Restaurant,Café


In [163]:
#view cluster size
Cluster5.shape

(3, 12)

In [164]:
#view cluster statistics
print("Additional Cluster Statistics:")
C = Cluster5
cluster_stats(C)

Additional Cluster Statistics:


Unnamed: 0,0,1,2
Neighbourhood,"Rouge Hill, Port Union, Highland Creek","Guildwood, Morningside, West Hill","Birch Cliff, Cliffside West"
Cluster Labels,4,4,4
Total Venues in Neighbourhood,37,38,40
1st Most Common Venue,Coffee Shop,Pizza Place,Coffee Shop
1st - Total Venues,4,6,5
1st - % of Neighbourhood Total,10.8108,15.7895,12.5
1st - % of District Total,4.5977,12.2449,5.74713
2nd Most Common Venue,Pizza Place,Coffee Shop,Grocery Store
2nd - Total Venues,3,5,4
2nd - % of Neighbourhood Total,8.10811,13.1579,10
