# This notebook is built incrementally with each task in the Week 3 assignment of the Applied Data Science Capstone

# PART 1: Scrape data from the Wikipedia page and prepare it in a dataframe for analysis

## Extract data with the BeautifulSoup library by scraping the Wikipedia page below
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M

In [2]:
from bs4 import BeautifulSoup
import requests #to request the html document behind the url
import pandas as pd

In [3]:
#define a function that fetches the web link and returns the parsed html
def make_soup(url):
    page = requests.get(url)
    if str(page.status_code).startswith('2'):
        return BeautifulSoup(page.text, 'html.parser')
    else:
        #'&' cannot join a str and an int; raise so a failed request is not mistaken for a soup object
        raise RuntimeError('Error: {}'.format(page.status_code))

In [4]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
soup = make_soup(url)

In [5]:
print(soup.prettify()) #view the tree created by BeautifulSoup

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of postal codes of Canada: M - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjZwRApAADsAAImRYykAAABA","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_postal_codes_of_Canada:_M","wgTitle":"List of postal codes of Canada: M","wgCurRevisionId":935851093,"wgRevisionId":935851093,"wgArticleId":539066,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Communi

In [6]:
#Select only the table we are interested in
table = soup.find('table', attrs={'class':'wikitable sortable'})
table_rows = table.find_all('tr')

In [7]:
#Create a list of rows and populate it, then convert it to a pandas dataframe
PC = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text.strip() for cell in cells if cell.text.strip()] #skip empty cells
    if row: #rows without any <td> text (such as the header row) are skipped
        PC.append(row)

df = pd.DataFrame(PC, columns=["PostalCode","Borough","Neighborhood"])
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
...,...,...,...
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West
285,M8Z,Etobicoke,South of Bloor
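The row extraction above can be tried offline on a small inline table. This is a minimal sketch: the HTML string is a hypothetical stand-in for the Wikipedia markup, not a copy of it, and it assumes the header row uses `<th>` cells.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia markup: one header row, two data rows.
html = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "wikitable sortable"})

rows = []
for tr in table.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so it yields an empty list
        rows.append(cells)

print(rows)
```

Because only `<td>` cells are collected, the header row drops out without any special handling, which is why the column names are supplied by hand when the dataframe is built.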


## Process the data into the desired state by applying the conditions and transformations specified in the assignment

In [8]:
#How many rows have a 'Not assigned' borough?
df["Borough"].value_counts()

Not assigned        77
Etobicoke           44
North York          38
Scarborough         37
Downtown Toronto    37
Central Toronto     17
West Toronto        13
York                 9
East Toronto         7
East York            6
Mississauga          1
Queen's Park         1
Name: Borough, dtype: int64

In [9]:
#Remove the 77 Postal Codes with 'Not Assigned' Boroughs
df.drop(df.loc[df['Borough']=="Not assigned"].index, inplace=True, axis=0)
df

Unnamed: 0,PostalCode,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
...,...,...,...
281,M8Z,Etobicoke,Kingsway Park South West
282,M8Z,Etobicoke,Mimico NW
283,M8Z,Etobicoke,The Queensway West
284,M8Z,Etobicoke,Royal York South West


In [10]:
#count the neighborhoods per postal code to see which codes have more than one
df[["PostalCode","Neighborhood"]].groupby("PostalCode").count()

Unnamed: 0_level_0,Neighborhood
PostalCode,Unnamed: 1_level_1
M1B,2
M1C,3
M1E,3
M1G,1
M1H,1
...,...
M9N,1
M9P,1
M9R,4
M9V,8


In [11]:
#Concatenate multiple Neighborhoods for a PostalCode
def enlist(g):
    return ','.join(g.Neighborhood)
new_Neighbor = df.groupby("PostalCode").apply(enlist).to_frame(name="Total_Neighborhood")
#Merge the concatenated neighborhoods (new_Neighbor) calculated above with the main dataframe
new_df = pd.merge(df,new_Neighbor, on="PostalCode").drop_duplicates("PostalCode") #retains first row of each PostalCode
new_df = new_df.reset_index(drop=True)
new_df.drop(columns="Neighborhood", inplace=True)
new_df

Unnamed: 0,PostalCode,Borough,Total_Neighborhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights,Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park
...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North"
99,M4Y,Downtown Toronto,Church and Wellesley
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout..."


In [12]:
#Lastly, check which rows still have a "Not assigned" neighborhood
new_df.loc[new_df["Total_Neighborhood"]=="Not assigned"]

Unnamed: 0,PostalCode,Borough,Total_Neighborhood
5,M9A,Queen's Park,Not assigned


In [13]:
#Apply the transformation to have the Borough name as the Neighborhood if Neighborhood is "Not assigned"
new_df.loc[new_df.Total_Neighborhood == "Not assigned", 'Total_Neighborhood'] = new_df['Borough']

In [14]:
#Confirm substitution works with the previously identified PostalCode 'M9A'
new_df.loc[new_df["PostalCode"]=="M9A"]

Unnamed: 0,PostalCode,Borough,Total_Neighborhood
5,M9A,Queen's Park,Queen's Park
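The substitution relies on pandas aligning the right-hand side by index. A minimal sketch of the same idea with `Series.mask`, on a two-row toy frame built from values shown above:

```python
import pandas as pd

toy = pd.DataFrame({
    "PostalCode": ["M7A", "M9A"],
    "Borough": ["Downtown Toronto", "Queen's Park"],
    "Total_Neighborhood": ["Queen's Park", "Not assigned"],
})

# Series.mask replaces values where the condition is True
# with the index-aligned values from the second argument.
unassigned = toy["Total_Neighborhood"] == "Not assigned"
toy["Total_Neighborhood"] = toy["Total_Neighborhood"].mask(unassigned, toy["Borough"])
print(toy)
```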


In [15]:
new_df.shape

(103, 3)

## This completes the web scraping of the Wikipedia page and the preprocessing of its table. new_df is now ready for clustering, but it first needs to be enriched with latitude and longitude to make Foursquare API calls

# PART 2: Collect latitude and longitude for the Postal Codes in the data set

In [15]:
#import geocoder

In [16]:
#testing to see if the geocoder package will work
#lat_lng_coords = None
#while (lat_lng_coords is None):
#    g = geocoder.google('{}, Toronto, Ontario'.format(new_df.loc[2]['PostalCode']))
#    lat_lng_coords = g.latlng

## Tested the geocoder package to obtain latitude and longitude, but it returned no results and the kernel stayed busy with no response
### Falling back to the .csv file provided for this scenario in the assignment.
#### The file was downloaded to local disk and is now loaded into a pandas dataframe
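A kernel that stays busy is what an unbounded `while` loop produces when a lookup keeps returning nothing. The sketch below shows a bounded retry instead. `geocode_with_retries` and the `lookup` callable are hypothetical names, not part of the geocoder package, and a stand-in function replaces the real service so the example runs offline.

```python
import time

def geocode_with_retries(lookup, query, max_attempts=3, pause_seconds=0.0):
    """Call lookup(query) until it returns coordinates or attempts run out.

    `lookup` is a hypothetical callable returning a (lat, lng) pair or None.
    """
    for _ in range(max_attempts):
        coords = lookup(query)
        if coords is not None:
            return coords
        time.sleep(pause_seconds)
    return None  # give up instead of looping forever

# Stand-in lookup that never succeeds, mimicking a service that returns nothing.
calls = []
def always_empty(query):
    calls.append(query)
    return None

result = geocode_with_retries(always_empty, "M5A, Toronto, Ontario")
print(result, len(calls))
```

With a cap on attempts, a failing service results in a `None` that the caller can handle, for example by falling back to the CSV file.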

In [16]:
Lat_Long_df = pd.read_csv("../../../Downloads/Geospatial_Coordinates.csv")

In [17]:
Lat_Long_df.rename(columns={'Postal Code':'PostalCode'}, inplace=True)

In [18]:
#Merge lat long to the dataset
prop_df = pd.merge(new_df, Lat_Long_df, on='PostalCode')
prop_df

Unnamed: 0,PostalCode,Borough,Total_Neighborhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.654260,-79.360636
3,M6A,North York,"Lawrence Heights,Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway,Montgomery Road,Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,Business Reply Mail Processing Centre 969 Eastern,43.662744,-79.321558
101,M8Y,Etobicoke,"Humber Bay,King's Mill Park,Kingsway Park Sout...",43.636258,-79.498509
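`pd.merge` performs an inner join by default, so a postal code with no match in the coordinates file would be dropped without warning. A minimal sketch of how to check for that, using a left join with `indicator=True`; the code `M0X` is made up for the illustration.

```python
import pandas as pd

codes = pd.DataFrame({"PostalCode": ["M3A", "M4A", "M0X"]})  # "M0X" is a made-up code
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every row of `codes`; indicator=True adds a `_merge` column
# that records whether each row found a match.
checked = pd.merge(codes, coords, on="PostalCode", how="left", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print(missing)
```

An empty `missing` list confirms that every postal code received coordinates.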


# Part 3: Analyze and cluster the neighborhoods based on properties extracted from Foursquare

### This replicates the analysis done in an earlier module for New York City, using the explore endpoint from Foursquare

In [19]:
# prop_df now has the boroughs and neighborhoods we need to analyze using folium
print('The dataframe prop_df has {} boroughs and {} neighborhoods.'.format(
        len(prop_df['Borough'].unique()),
        prop_df.shape[0]
    )
)

The dataframe prop_df has 11 boroughs and 103 neighborhoods.


In [20]:
#setting up dependencies for geospatial analysis of Toronto City Neighborhoods
import folium

import json #library to handle json files
from pandas import json_normalize #flatten json; pandas.io.json.json_normalize is deprecated since pandas 1.0

from geopy.geocoders import Nominatim #to obtain latitude and longitude for an address string

import matplotlib.cm as cm
import matplotlib.colors as colors

In [21]:
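#handle timeouts by retrying - can ignore this if the call to the Nominatim geocoder works fine in the next cell
from geopy.exc import GeocoderTimedOut
def do_geocode(address, attempts=3):
    #geoloc is the Nominatim instance created in the next cell
    try:
        return geoloc.geocode(address)
    except GeocoderTimedOut:
        if attempts <= 1:
            raise #give up instead of recursing forever
        return do_geocode(address, attempts - 1)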

In [22]:
#setup seed address for Toronto, Ontario, Canada
address = 'Toronto, Ontario'
#obtain coordinates using geopy module Nominatim. First assign a user agent
geoloc = Nominatim(user_agent="toronto_exp")
location = geoloc.geocode(address)
latitude = location.latitude
longitude = location.longitude
#Print coordinates we got for Toronto City
print('The coordinates for Toronto City are {}, {}.'.format(latitude, longitude))

The coordinates for Toronto City are 43.653963, -79.387207.


In [23]:
# create the folium map to visualize the boroughs and neighborhoods as they are now in prop_df
#THE MAIN RECIPE
map_ontario = folium.Map(location=[latitude, longitude], zoom_start=12)

#garnish the map with labels and markers
for lati, longi, boro, neigh in zip(prop_df['Latitude'], prop_df['Longitude'], prop_df['Borough'], prop_df['Total_Neighborhood']):
    label = '{},{}'.format(neigh, boro)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lati, longi],
    radius=3,
    popup=label,
    color='green',
    fill=True,
    fill_color='#3186cc',
    fill_opacity = 0.6,
    parse_html=False).add_to(map_ontario)

map_ontario

### The neighborhoods are spread widely across the map. Let's see if we can narrow these down to those closer to the City of Toronto

In [24]:
prop_df["Borough"].unique() #see what the 11 boroughs are made of

array(['North York', 'Downtown Toronto', "Queen's Park", 'Scarborough',
       'East York', 'Etobicoke', 'York', 'East Toronto', 'West Toronto',
       'Central Toronto', 'Mississauga'], dtype=object)

In [31]:
#limit the data set to only boroughs that have Toronto in their name
toronto_df = prop_df[prop_df['Borough'].str.contains(r'\bToronto\b')].reset_index(drop=True)
toronto_df.head()

Unnamed: 0,PostalCode,Borough,Total_Neighborhood,Latitude,Longitude
0,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
1,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
2,M5B,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937
3,M5C,Downtown Toronto,St. James Town,43.651494,-79.375418
4,M4E,East Toronto,The Beaches,43.676357,-79.293031
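The filter keeps a borough only when "Toronto" appears as a whole word in its name. A minimal sketch on a handful of borough names from the list above:

```python
import pandas as pd

boroughs = pd.Series(["Downtown Toronto", "North York", "East York", "West Toronto", "Etobicoke"])

# \b marks a word boundary, so "Toronto" must appear as a whole word.
mask = boroughs.str.contains(r"\bToronto\b")
kept = boroughs[mask].tolist()
print(kept)
```

Note that this is a name-based filter: boroughs such as York or East York are excluded regardless of how close they are to the city centre.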


In [32]:
toronto_df.shape #so what shape is the toronto df now

(39, 5)

In [33]:
#put this on the map again
# create the folium map to visualize the boroughs and neighborhoods as they are now in toronto_df
#THE MAIN RECIPE
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=12)

#garnish the map with labels and markers
for lati, longi, boro, neigh in zip(toronto_df['Latitude'], toronto_df['Longitude'], toronto_df['Borough'], toronto_df['Total_Neighborhood']):
    label = '{},{}'.format(neigh, boro)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
    [lati, longi],
    radius=3,
    popup=label,
    color='green',
    fill=True,
    fill_color='#3186cc',
    fill_opacity = 0.6,
    parse_html=False).add_to(map_toronto)

map_toronto

### The spread is still wide, but the markers are now concentrated around the shoreline and central Toronto. Next, analyze the neighborhoods by exploring places of interest and then segment them
#### Begin by setting up the Foursquare API dependencies

In [34]:
CLIENT_ID = '45NLKMKCVY1LWUGBXWSAYFMLWOH1ZCVBMZDLIITRCJ3ISWPK' # your Foursquare ID
CLIENT_SECRET = 'I3KZJTSLDKM2HGN22LYW300RX3FOEFVMOOBDUL5W5N05054Q' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
#validate this works fine
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 45NLKMKCVY1LWUGBXWSAYFMLWOH1ZCVBMZDLIITRCJ3ISWPK
CLIENT_SECRET:I3KZJTSLDKM2HGN22LYW300RX3FOEFVMOOBDUL5W5N05054Q


In [35]:
#will begin with exploring the first Neighborhood which is...
toronto_df["Total_Neighborhood"][0]

'Harbourfront'

In [36]:
#create the GET request URL to send to Foursquare for getting venues at Harbourfront
#get the vars in
neigh_lat = toronto_df.loc[0,'Latitude']
neigh_lon = toronto_df.loc[0,'Longitude']
neigh_name = toronto_df.loc[0,'Total_Neighborhood']
print('Latitude and longitude values of {} are {}, {}.'.format(neigh_name, 
                                                               neigh_lat, 
                                                               neigh_lon))

Latitude and longitude values of Harbourfront are 43.6542599, -79.3606359.


In [37]:
#will begin with getting Top 100 venues within 300 metre radius
LIMIT = 100
radius = 300 #changed from 500m to 300m to reduce footprint and overlap
#set up URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neigh_lat, 
    neigh_lon, 
    radius, 
    LIMIT)

url

'https://api.foursquare.com/v2/venues/explore?&client_id=45NLKMKCVY1LWUGBXWSAYFMLWOH1ZCVBMZDLIITRCJ3ISWPK&client_secret=I3KZJTSLDKM2HGN22LYW300RX3FOEFVMOOBDUL5W5N05054Q&v=20180605&ll=43.6542599,-79.3606359&radius=300&limit=100'
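The same query string can be assembled from a dictionary, which keeps the parameter names next to their values. This is a sketch under stated assumptions: `build_explore_url` is a hypothetical helper, and `YOUR_ID` and `YOUR_SECRET` are placeholders rather than real credentials.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.foursquare.com/v2/venues/explore"

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL from named parameters (hypothetical helper)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes reserved characters, so the comma in `ll` becomes %2C.
    return BASE_URL + "?" + urlencode(params)

example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605",
                                43.6542599, -79.3606359, 300, 100)
print(example_url)
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which removes the need to build the string by hand.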

In [38]:
#request venues using explore function call
stream = requests.get(url).json()
stream

{'meta': {'code': 200, 'requestId': '5e381c7ebe61c9001b78be00'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': 'Open now', 'key': 'openNow'}]},
  'headerLocation': 'Corktown',
  'headerFullLocation': 'Corktown, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 12,
  'suggestedBounds': {'ne': {'lat': 43.6569599027, 'lng': -79.35691110008916},
   'sw': {'lat': 43.6515598973, 'lng': -79.36436069991085}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '54ea41ad498e9a11e9e13308',
       'name': 'Roselle Desserts',
       'location': {'address': '362 King St E',
        'crossStreet': 'Trinity St',
        'lat': 43.653446723052674,
        'lng': -79.3620167174383,
        'labeledLatLngs': [{'label': 'display',


### To understand the category of each of the venues returned, we define a function to extract that from the json above

In [39]:
def get_cat_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list)==0:
        return None
    else:
        return categories_list[0]['name']

#Begin extracting relevant data on venues from json
venues = stream['response']['groups'][0]['items']
venues_df = json_normalize(venues)
venues_closeby = venues_df.loc[:,['venue.name','venue.categories','venue.location.lat','venue.location.lng']]

venues_closeby['venue.categories'] = venues_closeby.apply(get_cat_type, axis=1)
venues_closeby.head()

Unnamed: 0,venue.name,venue.categories,venue.location.lat,venue.location.lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Gym / Fitness Center,43.653191,-79.357947
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149
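The category lookup can be illustrated on plain dictionaries, without a dataframe. In this sketch `first_category_name` is a hypothetical helper, and the items are shortened toy records shaped like the `items` list in the response; the second venue is invented to show the empty case.

```python
def first_category_name(venue):
    """Return the name of the venue's first category, or None if it has none."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else None

# Toy items shaped like the 'items' list in the response (fields shortened).
items = [
    {"venue": {"name": "Roselle Desserts", "categories": [{"name": "Bakery"}]}},
    {"venue": {"name": "Uncategorised Place", "categories": []}},  # hypothetical venue
]

names = [(item["venue"]["name"], first_category_name(item["venue"])) for item in items]
print(names)
```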


In [40]:
#strip the prefixes from column names, keeping only the last dot-separated part (index -1)
venues_closeby.columns = [col.split('.')[-1] for col in venues_closeby.columns]

venues_closeby.head()

Unnamed: 0,name,categories,lat,lng
0,Roselle Desserts,Bakery,43.653447,-79.362017
1,Tandem Coffee,Coffee Shop,43.653559,-79.361809
2,Cooper Koo Family YMCA,Gym / Fitness Center,43.653191,-79.357947
3,Body Blitz Spa East,Spa,43.654735,-79.359874
4,Morning Glory Cafe,Breakfast Spot,43.653947,-79.361149


In [41]:
#since this is for just one neighborhood ('Harbourfront'), check how many venues were returned
print("{} venues were returned by Foursquare for {}.".format(venues_closeby.shape[0],toronto_df["Total_Neighborhood"][0]))

12 venues were returned by Foursquare for Harbourfront.


### Now repeat this exercise for all neighborhoods. Define a function that requests and extracts the venues for each one

In [42]:
def get_venues_closeby(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        
        #Create URL for Foursquare
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)

        #GET details
        stream = requests.get(url).json()["response"]['groups'][0]['items']

        venues_list.append([(
        name,
        lat,
        lng,
        entity['venue']['name'],
        entity['venue']['location']['lat'],
        entity['venue']['location']['lng'],
        entity['venue']['categories'][0]['name'])for entity in stream])
        
    venues_closeby = pd.DataFrame([venue for nbh_venue_list in venues_list for venue in nbh_venue_list])
    venues_closeby.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude',
                  'Venue Category']
    
    return venues_closeby

In [44]:
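#radius is not passed here, so the function default of 500m applies rather than the 300m used for Harbourfront above
toronto_venues = get_venues_closeby(names=toronto_df['Total_Neighborhood'], latitudes=toronto_df['Latitude'], longitudes=toronto_df['Longitude'])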

Harbourfront
Queen's Park
Ryerson,Garden District
St. James Town
The Beaches
Berczy Park
Central Bay Street
Christie
Adelaide,King,Richmond
Dovercourt Village,Dufferin
Harbourfront East,Toronto Islands,Union Station
Little Portugal,Trinity
The Danforth West,Riverdale
Design Exchange,Toronto Dominion Centre
Brockton,Exhibition Place,Parkdale Village
The Beaches West,India Bazaar
Commerce Court,Victoria Hotel
Studio District
Lawrence Park
Roselawn
Davisville North
Forest Hill North,Forest Hill West
High Park,The Junction South
North Toronto West
The Annex,North Midtown,Yorkville
Parkdale,Roncesvalles
Davisville
Harbord,University of Toronto
Runnymede,Swansea
Moore Park,Summerhill East
Chinatown,Grange Park,Kensington Market
Deer Park,Forest Hill SE,Rathnelly,South Hill,Summerhill West
CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara
Rosedale
Stn A PO Boxes 25 The Esplanade
Cabbagetown,St. James Town
First Canadian Place,Underground city

In [45]:
toronto_venues.shape

(1714, 7)

In [46]:
toronto_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.65426,-79.360636,Roselle Desserts,43.653447,-79.362017,Bakery
1,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
2,Harbourfront,43.65426,-79.360636,Cooper Koo Family YMCA,43.653191,-79.357947,Gym / Fitness Center
3,Harbourfront,43.65426,-79.360636,Body Blitz Spa East,43.654735,-79.359874,Spa
4,Harbourfront,43.65426,-79.360636,Impact Kitchen,43.656369,-79.35698,Restaurant


In [47]:
#check the number of venues per Neighborhood
toronto_venues[['Neighborhood','Venue']].groupby('Neighborhood').count()

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
"Adelaide,King,Richmond",100
Berczy Park,56
"Brockton,Exhibition Place,Parkdale Village",23
Business Reply Mail Processing Centre 969 Eastern,16
"CN Tower,Bathurst Quay,Island airport,Harbourfront West,King and Spadina,Railway Lands,South Niagara",16
"Cabbagetown,St. James Town",47
Central Bay Street,83
"Chinatown,Grange Park,Kensington Market",87
Christie,19
Church and Wellesley,82


In [48]:
#And the number of unique categories for venues in Toronto
print("A total of {} unique categories exist across {} Neighborhoods.".format(len(toronto_venues['Venue Category'].unique()),len(toronto_venues['Neighborhood'].unique())))

A total of 230 unique categories exist across 39 Neighborhoods.


In [55]:
#Encode the venue categories using one-hot encoding
toronto_coded = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="") #creates columns for each venue Category

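# add neighborhood column back to dataframe
# note: if the venues include a category literally named 'Neighborhood' (the 230-column shape below suggests they do),
# this assignment overwrites that one-hot column in place instead of appending a new column
toronto_coded['Neighborhood'] = toronto_venues['Neighborhood']

# move the last column to the front (intended for the neighborhood column; in the output below the last column was 'Yoga Studio')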
fixed_columns = [toronto_coded.columns[-1]] + list(toronto_coded.columns[:-1])
toronto_coded = toronto_coded[fixed_columns]

toronto_coded.head()

Unnamed: 0,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [50]:
#Now the coded toronto data size is...
toronto_coded.shape

(1714, 230)

### Next, assess which venue categories appear in each neighborhood and how prominent each category is. These properties can then be used to characterize the clusters we create from the neighborhoods
#### This is done by averaging the one-hot columns per neighborhood, which gives the frequency of each category so the categories can be sorted

In [53]:
toronto_nbh = toronto_coded.groupby('Neighborhood').mean().reset_index()
toronto_nbh

Unnamed: 0,Neighborhood,Yoga Studio,Afghan Restaurant,Airport,Airport Food Court,Airport Gate,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Theater,Theme Restaurant,Toy / Game Store,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Wine Bar,Women's Store
0,"Adelaide,King,Richmond",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,...,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.01
1,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,0.0,0.0
2,"Brockton,Exhibition Place,Parkdale Village",0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Business Reply Mail Processing Centre 969 Eastern,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",0.0,0.0,0.0625,0.0625,0.0625,0.125,0.125,0.125,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,"Cabbagetown,St. James Town",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.021277,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Central Bay Street,0.012048,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012048,...,0.0,0.0,0.0,0.0,0.0,0.012048,0.0,0.0,0.012048,0.0
7,"Chinatown,Grange Park,Kensington Market",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.045977,0.0,0.068966,0.011494,0.0
8,Christie,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Church and Wellesley,0.012195,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.012195,...,0.012195,0.012195,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
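Averaging one-hot columns per group yields the share of venues in each category. A minimal sketch on invented data; the grouping column is named `Area` here (an assumption for the example) so that it cannot clash with a venue category name.

```python
import pandas as pd

# Invented venues: four in area A, two in area B.
toy_venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Bakery", "Park", "Park", "Park"],
})

onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=float)
onehot.insert(0, "Area", toy_venues["Neighborhood"])  # name chosen to avoid clashing with a category

# The mean of a 0/1 column within a group is the fraction of that group's venues in the category.
freq = onehot.groupby("Area").mean()
print(freq)
```

Each row of `freq` sums to 1, so the values are comparable across neighborhoods with very different venue counts.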


In [56]:
toronto_nbh.shape

(39, 230)

In [57]:
#Define a function to return the top X venue categories for a row (for example, the top 10 if X=10)
def return_most_common_venues(row, X):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:X]

In [58]:
import numpy as np
top_X_venues = 10

indicators = ['st', 'nd', 'rd'] #to suffix to column names

# create columns according to number of top venues
columns = ['Neighborhood']

#set up column names as a list
for ind in np.arange(top_X_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
nbh_venues_sorted = pd.DataFrame(columns=columns)
nbh_venues_sorted['Neighborhood'] = toronto_nbh['Neighborhood']

for ind in np.arange(toronto_nbh.shape[0]):
    nbh_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_nbh.iloc[ind, :], top_X_venues)

nbh_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"Adelaide,King,Richmond",Coffee Shop,Café,Steakhouse,Cosmetics Shop,Bakery,Breakfast Spot,Burger Joint,Asian Restaurant,Thai Restaurant,Bar
1,Berczy Park,Coffee Shop,Cocktail Bar,Bakery,Cheese Shop,Beer Bar,Seafood Restaurant,Farmers Market,Steakhouse,Café,Gourmet Shop
2,"Brockton,Exhibition Place,Parkdale Village",Coffee Shop,Breakfast Spot,Café,Nightclub,Pet Store,Bar,Burrito Place,Restaurant,Climbing Gym,Performing Arts Venue
3,Business Reply Mail Processing Centre 969 Eastern,Park,Auto Workshop,Comic Shop,Pizza Place,Burrito Place,Restaurant,Brewery,Light Rail Station,Smoke Shop,Farmers Market
4,"CN Tower,Bathurst Quay,Island airport,Harbourf...",Airport Terminal,Airport Lounge,Airport Service,Plane,Rental Car Location,Boat or Ferry,Harbor / Marina,Boutique,Bar,Airport Gate
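Because `sort_values` ranks every category column, a neighborhood with only a few venues fills its remaining top-10 slots with categories whose frequency is zero. The sketch below shows one possible variant that drops zero-frequency categories first; `top_categories` is a hypothetical helper and the frequencies are invented.

```python
import pandas as pd

def top_categories(row, n):
    """Return up to n category names with the highest non-zero frequency."""
    present = row[row > 0]  # drop categories that never occur in this neighborhood
    return present.sort_values(ascending=False).index[:n].tolist()

# Invented frequencies for a neighborhood with only two distinct categories.
row = pd.Series({"Garden": 0.75, "Donut Shop": 0.0, "Home Service": 0.25, "Dog Run": 0.0})
top = top_categories(row, 3)
print(top)
```

The result can be shorter than `n`, so filling a fixed-width table from it would need padding, for example with `None`.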


### Now cluster the neighborhoods

In [59]:
nbh_venues_sorted.shape

(39, 11)

In [60]:
from sklearn.cluster import KMeans
# set number of clusters
k = 6

toronto_nbh_groups = toronto_nbh.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=k, random_state=0).fit(toronto_nbh_groups)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([3, 3, 3, 0, 0, 3, 3, 3, 0, 3])
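`kmeans.labels_` is a plain array in the same row order as the data passed to `fit`, so the labels only make sense when attached back in that order. A minimal sketch on invented frequencies, keeping the neighborhood names as the index so the pairing is explicit:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Invented category frequencies for four areas.
freq = pd.DataFrame(
    {"Coffee Shop": [0.9, 0.8, 0.1, 0.0], "Park": [0.1, 0.2, 0.9, 1.0]},
    index=["A", "B", "C", "D"],
)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)

# Attach each label to the row it was computed from.
clusters = pd.Series(model.labels_, index=freq.index, name="Cluster Labels")
print(clusters)
```

The label numbers themselves are arbitrary; only the grouping they induce carries meaning.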

### Merging the Neighborhood cluster with Top Neighborhood venues data above

In [61]:
toronto_df.rename(columns={'Total_Neighborhood':'Neighborhood'}, inplace=True)
# add clustering labels
nbh_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = toronto_df.drop('PostalCode', axis=1)

# merge toronto_merged with nbh_venues_sorted to add the cluster label and top venues for each neighborhood
toronto_merged = toronto_merged.join(nbh_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

toronto_merged.head() # check the last columns!

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown Toronto,Harbourfront,43.65426,-79.360636,3,Coffee Shop,Bakery,Park,Pub,Café,Breakfast Spot,Mexican Restaurant,Restaurant,Dessert Shop,Bank
1,Downtown Toronto,Queen's Park,43.662301,-79.389494,3,Coffee Shop,Gym,Park,Diner,Beer Bar,Seafood Restaurant,Sandwich Place,Salad Place,Juice Bar,Restaurant
2,Downtown Toronto,"Ryerson,Garden District",43.657162,-79.378937,3,Coffee Shop,Clothing Store,Japanese Restaurant,Café,Cosmetics Shop,Diner,Fast Food Restaurant,Electronics Store,Bubble Tea Shop,Bakery
3,Downtown Toronto,St. James Town,43.651494,-79.375418,3,Coffee Shop,Café,Restaurant,American Restaurant,Breakfast Spot,Cocktail Bar,Beer Bar,Cosmetics Shop,Bakery,Italian Restaurant
4,East Toronto,The Beaches,43.676357,-79.293031,0,Asian Restaurant,Health Food Store,Pub,Trail,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant


## Visualize the clustered neighborhoods

In [63]:
# create map
toronto_groups_map = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(k)
ys = [i + x + (i*x)**2 for i in range(k)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Neighborhood'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(toronto_groups_map)
       
toronto_groups_map
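The colour setup only needs one evenly spaced sample of the colormap per cluster. A simplified sketch of that idea, which is an assumption about the intent and not the cell's exact computation; `palette` and `color_for` are illustrative names.

```python
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

k = 6  # number of clusters

# Sample k evenly spaced colours from the rainbow colormap and convert them to hex strings.
palette = [colors.rgb2hex(rgba) for rgba in cm.rainbow(np.linspace(0, 1, k))]

# Map each cluster label directly to its own colour.
color_for = {cluster: palette[cluster] for cluster in range(k)}
print(color_for)
```

Indexing by the label itself gives cluster 0 the first colour, whereas `rainbow[cluster-1]` in the cell above wraps cluster 0 around to the last colour; both assign a distinct colour per cluster.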

In [69]:
toronto_merged["Cluster Labels"].value_counts()

3    23
0    12
5     1
4     1
2     1
1     1
Name: Cluster Labels, dtype: int64
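The counts above show four clusters that contain a single neighborhood each. One common way to explore other values of `k` is to compare the inertia and the cluster sizes across a small range. The sketch below does this on synthetic data (three tight blobs), not on the Toronto frequencies, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in data: three tight blobs of ten points each.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((10, 2)) for c in centers])

inertias = {}
sizes = {}
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances
    sizes[k] = np.bincount(model.labels_, minlength=k).tolist()

for k in inertias:
    print(k, round(inertias[k], 3), sizes[k])
```

A value of `k` after which the inertia stops falling sharply, and which does not produce many one-member clusters, is a reasonable candidate to try.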

## Examining the clusters to see most prominent venue categories in each
### Cluster 1

In [71]:
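#note: range(6, ...) starts at the '2nd Most Common Venue' column, so the 1st most common venue (column 5) is not shown here or in the cluster cells below
toronto_merged.loc[toronto_merged['Cluster Labels']==0, toronto_merged.columns[[1]+list(range(6,toronto_merged.shape[1]))]]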

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,The Beaches,Health Food Store,Pub,Trail,Dessert Shop,Dim Sum Restaurant,Diner,Discount Store,Dog Run,Doner Restaurant
7,Christie,Café,Park,Restaurant,Candy Store,Nightclub,Baby Store,Gas Station,Coffee Shop,Bank
9,"Dovercourt Village,Dufferin",Bakery,Park,Supermarket,Bar,Middle Eastern Restaurant,Café,Recording Studio,Fast Food Restaurant,Gym / Fitness Center
15,"The Beaches West,India Bazaar",Italian Restaurant,Steakhouse,Fast Food Restaurant,Sushi Restaurant,Ice Cream Shop,Liquor Store,Burrito Place,Burger Joint,Fish & Chips Shop
20,Davisville North,Park,Gym,Breakfast Spot,Dance Studio,Sandwich Place,Department Store,Food & Drink Shop,Diner,Dessert Shop
21,"Forest Hill North,Forest Hill West",Trail,Mexican Restaurant,Sushi Restaurant,Women's Store,Department Store,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop
22,"High Park,The Junction South",Café,Thai Restaurant,Bar,Park,Fried Chicken Joint,Music Venue,Diner,Cajun / Creole Restaurant,Bookstore
24,"The Annex,North Midtown,Yorkville",Café,Coffee Shop,Park,History Museum,Liquor Store,Burger Joint,Indian Restaurant,Pub,Flower Shop
26,Davisville,Dessert Shop,Pizza Place,Coffee Shop,Italian Restaurant,Gym,Café,Sushi Restaurant,Pharmacy,Brewery
27,"Harbord,University of Toronto",Bakery,Japanese Restaurant,Sandwich Place,Restaurant,Bookstore,Bar,College Arts Building,Coffee Shop,Chinese Restaurant


### Cluster 2

In [72]:
toronto_merged.loc[toronto_merged['Cluster Labels']==1, toronto_merged.columns[[1]+list(range(6,toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Roselawn,Home Service,Ice Cream Shop,Women's Store,Department Store,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop


### Cluster 3

In [73]:
toronto_merged.loc[toronto_merged['Cluster Labels']==2, toronto_merged.columns[[1]+list(range(6,toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
29,"Moore Park,Summerhill East",Women's Store,Deli / Bodega,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


### Cluster 4

In [77]:
toronto_merged.loc[toronto_merged['Cluster Labels']==3, toronto_merged.columns[[1]+list(range(6,toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Harbourfront,Bakery,Park,Pub,Café,Breakfast Spot,Mexican Restaurant,Restaurant,Dessert Shop,Bank
1,Queen's Park,Gym,Park,Diner,Beer Bar,Seafood Restaurant,Sandwich Place,Salad Place,Juice Bar,Restaurant
2,"Ryerson,Garden District",Clothing Store,Japanese Restaurant,Café,Cosmetics Shop,Diner,Fast Food Restaurant,Electronics Store,Bubble Tea Shop,Bakery
3,St. James Town,Café,Restaurant,American Restaurant,Breakfast Spot,Cocktail Bar,Beer Bar,Cosmetics Shop,Bakery,Italian Restaurant
5,Berczy Park,Cocktail Bar,Bakery,Cheese Shop,Beer Bar,Seafood Restaurant,Farmers Market,Steakhouse,Café,Gourmet Shop
6,Central Bay Street,Italian Restaurant,Café,Ice Cream Shop,Juice Bar,Sandwich Place,Burger Joint,Japanese Restaurant,Salad Place,Department Store
8,"Adelaide,King,Richmond",Café,Steakhouse,Cosmetics Shop,Bakery,Breakfast Spot,Burger Joint,Asian Restaurant,Thai Restaurant,Bar
10,"Harbourfront East,Toronto Islands,Union Station",Aquarium,Café,Italian Restaurant,Hotel,Scenic Lookout,Brewery,Sporting Goods Shop,Restaurant,Fried Chicken Joint
11,"Little Portugal,Trinity",Coffee Shop,Asian Restaurant,Restaurant,Vietnamese Restaurant,Pizza Place,Men's Store,Café,Greek Restaurant,Juice Bar
12,"The Danforth West,Riverdale",Coffee Shop,Italian Restaurant,Ice Cream Shop,Furniture / Home Store,Restaurant,Bookstore,Grocery Store,Pub,Pizza Place


### Cluster 5

In [78]:
toronto_merged.loc[toronto_merged['Cluster Labels']==4, toronto_merged.columns[[1]+list(range(6,toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
18,Lawrence Park,Bus Line,Swim School,Event Space,Ethiopian Restaurant,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant


### Cluster 6

In [79]:
toronto_merged.loc[toronto_merged['Cluster Labels']==5, toronto_merged.columns[[1]+list(range(6,toronto_merged.shape[1]))]]

Unnamed: 0,Neighborhood,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
33,Rosedale,Trail,Playground,Dance Studio,Electronics Store,Eastern European Restaurant,Dumpling Restaurant,Donut Shop,Doner Restaurant,Dog Run


## Clusters 1 and 4 are better formed, with 12 and 23 neighborhoods respectively to evaluate.
### It appears Cluster 1 is a "Shopping & Fitness Hub" while Cluster 4 is a "Multicuisine Food Center"
### However, the other four clusters each contain just one neighborhood, suggesting that more iterations are needed to improve the clustering

<font color='red'>### Important Note: Since k-means ran on the frequencies of venue categories in each neighborhood, and not on coordinates, it cannot be expected to yield clusters that are geographically close on a map.</font>