# Final Project:
## New Facility Location Selection
### by: Jeffrey Dupree

This notebook scrapes neighborhood information from the Wikipedia page https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M to build a dataframe containing the postal code, borough name, and neighborhood name for each area.

#### Section One

First, we import the necessary libraries. Install commands are included, commented out, for any that are missing.

In [1]:
# If you don't have these packages available, uncomment the appropriate lines below to install them.

import sys
#!{sys.executable} -m pip install beautifulsoup4
#!{sys.executable} -m pip install lxml
#!{sys.executable} -m pip install requests

from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

Next, we need to get the information from the Wikipedia page using `requests.get`.

In [2]:
source = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text

Use the BeautifulSoup package to parse the page content. This notebook uses the `lxml` parser, but any parser that BeautifulSoup supports will work.

In [3]:
soup = BeautifulSoup(source, 'lxml')

Find the table using `soup.find` from BeautifulSoup. Uncomment the second line to see the structure and content of the table. The tags are needed for the next steps.

In [4]:
table = soup.find('table')
#print(table.prettify())

Now a pandas dataframe needs to be created. This requires looping through the rows of the table and appending each row's cells to a list. The list can then be made into a dataframe using `pd.DataFrame`. The columns need header names, which I assigned manually instead of pulling them from the BeautifulSoup object `table`.

In [5]:
table_rows = table.find_all('tr')

res = []
for tr in table_rows:
    td = tr.find_all('td')
    row = [cell.text.strip() for cell in td if cell.text.strip()] # Use a separate name so the loop variable tr is not shadowed.
    if row:
        res.append(row)

# Label the columns.
df = pd.DataFrame(res, columns=['PostalCode','Borough','Neighborhood'])

Next, remove the rows where the borough is "Not assigned", use the borough name as the neighborhood wherever no neighborhood is assigned, and combine rows that share a postal code but list different neighborhoods.

In [6]:
# Remove rows with Borough = "Not assigned"
df = df[df.Borough != 'Not assigned'].copy() # .copy() makes later column assignments write to an independent dataframe.

In [7]:
# If Neighborhood = "Not assigned" then assign with Borough value.
df['Neighborhood'] = np.where(df['Neighborhood'] == 'Not assigned', df['Borough'], df['Neighborhood'])

In [8]:
# Combine rows where the Postal Code is the same.
df = df.groupby(['PostalCode', 'Borough'])['Neighborhood'].apply(', '.join).reset_index()

The resulting dataframe looks like this.

In [9]:
df

Unnamed: 0,PostalCode,Borough,Neighborhood
0,M1B,Scarborough,"Rouge, Malvern"
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union"
2,M1E,Scarborough,"Guildwood, Morningside, West Hill"
3,M1G,Scarborough,Woburn
4,M1H,Scarborough,Cedarbrae
5,M1J,Scarborough,Scarborough Village
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park"
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge"
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West"
9,M1N,Scarborough,"Birch Cliff, Cliffside West"


Check the size of the dataframe.

In [10]:
df.shape

(103, 3)
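
The three cleaning steps in this section can be checked on a small hand-made table. The sketch below is illustrative only: the rows are invented to exercise each rule (an unassigned borough, an unassigned neighborhood, and a repeated postal code), and the name `toy` is not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like the scraped table (illustrative values only).
toy = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["PostalCode", "Borough", "Neighborhood"],
)

# 1. Drop rows without a borough; .copy() gives an independent dataframe.
toy = toy[toy["Borough"] != "Not assigned"].copy()

# 2. Fall back to the borough name where the neighborhood is missing.
toy["Neighborhood"] = np.where(
    toy["Neighborhood"] == "Not assigned", toy["Borough"], toy["Neighborhood"]
)

# 3. Collapse rows that share a postal code into one comma-separated row.
toy = (
    toy.groupby(["PostalCode", "Borough"])["Neighborhood"]
    .apply(", ".join)
    .reset_index()
)
print(toy)
```

Four input rows reduce to two: the unassigned borough is dropped, the two M5A rows are joined, and M7A takes its borough name as its neighborhood.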

#### Section Two

In [14]:
# The code was removed by Watson Studio for sharing.

In [16]:
import re
import geopy
from geopy.geocoders import Nominatim
geolocator = Nominatim(country_bias="ca", user_agent=user_agent)

# Create an empty list for your latitude and longitude variables.
latitude = []
longitude = []


for i in range(0,df.shape[0]): #Loop through each postal code.
    g = geolocator.geocode({"postalcode": df.PostalCode[i]}, exactly_one=False) #First try to geocode based on the postal code.
    if g is not None and len(g) == 1: #If the postal code returns a single response, extract the lat/lon and record them.
        latitude.append(g[0].latitude)
        longitude.append(g[0].longitude)
    else: #If the postal code does not geocode to a single location, or no location at all, then use the neighborhoods to geocode a location.
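        hoods = df.Neighborhood[i].split(', ')
        #Initialize the accumulators once per postal code so that matches from every neighborhood are averaged, not just the last one.
        sum_lat = 0
        sum_lon = 0
        sum_loc = 0
        for j in range(0,len(hoods)):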
            g = geolocator.geocode({"city": hoods[j], "state": "on", "county": "toronto"}, exactly_one=False)
            if g is not None:
                rtrns = len(g)-1
                for k in range(rtrns,-1,-1): #Loop through the location objects returned to collect the lat/long data. Average to get a geometric center if more than one.
                    pc = re.search(r'\D\d\D', g[k].address) #Raw string so the regex escapes are passed through unchanged.
                    hm = g[k].address.find(hoods[j])
                    if pc is not None:
                        if pc.group(0) == df.PostalCode[i]:
                            sum_lat = sum_lat + g[k].latitude
                            sum_lon = sum_lon + g[k].longitude
                            sum_loc = sum_loc + 1
                        elif hm >= 0:
                            sum_lat = sum_lat + g[k].latitude
                            sum_lon = sum_lon + g[k].longitude
                            sum_loc = sum_loc + 1
        if sum_loc < 1: #Prevent a 'divide by zero' error by ensuring sum_loc is at least 1.
            sum_loc = 1
        avg_lat = sum_lat / sum_loc
        avg_lon = sum_lon / sum_loc
        latitude.append(avg_lat)
        longitude.append(avg_lon)
#Add the latitude and longitude lists to the dataframe as two new columns.
df['Latitude'] = latitude
df['Longitude'] = longitude

Unfortunately, this method is unreliable and often has to be run several times before it completes. Even a successful run finds latitude and longitude data for only a little more than half of the postal codes, as can be seen below.

In [17]:
print("Only ",df[df.Latitude != 0].shape[0]," of the ",df.shape[0]," postal codes were geocoded.")
df.loc[df['Latitude'] == 0]

Only  61  of the  103  postal codes were geocoded.


Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
4,M1H,Scarborough,Cedarbrae,0.0,0.0
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",0.0,0.0
14,M1V,Scarborough,"Agincourt North, L'Amoreaux East, Milliken, St...",0.0,0.0
16,M1X,Scarborough,Upper Rouge,0.0,0.0
21,M2M,North York,"Newtonbrook, Willowdale",0.0,0.0
24,M2R,North York,Willowdale West,0.0,0.0
25,M3A,North York,Parkwoods,0.0,0.0
26,M3B,North York,Don Mills North,0.0,0.0
30,M3K,North York,"CFB Toronto, Downsview East",0.0,0.0
31,M3L,North York,Downsview West,0.0,0.0
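
The geocoding loop above records (0, 0) when nothing matches, which is why failures are detected by testing for a latitude of zero. As a minimal sketch of an alternative, the averaging step could return `None` for "no match" so that a failure cannot be confused with a real coordinate. The function name `average_coordinates` and the sample points are assumptions made for illustration, not part of the notebook.

```python
def average_coordinates(candidates):
    """Return the mean (lat, lon) of candidate points, or None if there are none."""
    if not candidates:
        return None  # explicit "no match" instead of a (0, 0) placeholder
    lat = sum(point[0] for point in candidates) / len(candidates)
    lon = sum(point[1] for point in candidates) / len(candidates)
    return (lat, lon)


# Illustrative candidate locations, as (latitude, longitude) pairs.
matches = [(43.70, -79.40), (43.80, -79.20)]
center = average_coordinates(matches)
missing = average_coordinates([])
print(center, missing)
```

With this shape, the missing rows could be selected by testing for `None` (or NaN once stored in a dataframe) rather than for zero.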


To complete the dataframe, I use the provided CSV file going forward. The coordinate columns are first reset to zero so that every postal code is filled from the file.

In [18]:
df['Latitude'] = float(0)
df['Longitude'] = float(0)

In [19]:
df_csv = pd.read_csv("https://cocl.us/Geospatial_data") #Import the csv as a dataframe.
for i in range(0,df.shape[0]):
    postalcode = df.PostalCode[i]
    csv_row = df_csv.loc[df_csv['Postal Code']==postalcode].index[0] #Select the row in the new dataframe with the postal code that matches the original dataframe.
    if df.Latitude[i] == 0: #If there is no coordinate for this postal code yet, copy the lat/long from the new dataframe.
        df.loc[i, 'Latitude'] = df_csv.Latitude[csv_row] #Assign with .loc so the value is written to df itself rather than to a copy.
        df.loc[i, 'Longitude'] = df_csv.Longitude[csv_row]


Now there are latitude and longitude values for each of the postal codes.

In [43]:
df.loc[df['Latitude'] != 0]

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711
3,M1G,Scarborough,Woburn,43.770992,-79.216917
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476
5,M1J,Scarborough,Scarborough Village,43.744734,-79.239476
6,M1K,Scarborough,"East Birchmount Park, Ionview, Kennedy Park",43.727929,-79.262029
7,M1L,Scarborough,"Clairlea, Golden Mile, Oakridge",43.711112,-79.284577
8,M1M,Scarborough,"Cliffcrest, Cliffside, Scarborough Village West",43.716316,-79.239476
9,M1N,Scarborough,"Birch Cliff, Cliffside West",43.692657,-79.264848
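
The row-by-row copy above can also be expressed as a single join on the postal code. The sketch below uses two small stand-in tables rather than the real data: the coordinates for M1B and M1C are copied from the output above, while the names `codes` and `coords` and the postal code `M9Z` are invented to show what happens when a code has no match.

```python
import pandas as pd

# Stand-in for the scraped dataframe (M9Z is a made-up code with no match).
codes = pd.DataFrame(
    {
        "PostalCode": ["M1B", "M1C", "M9Z"],
        "Borough": ["Scarborough", "Scarborough", "Hypothetical"],
    }
)

# Stand-in for the CSV, deliberately in a different row order.
coords = pd.DataFrame(
    {
        "Postal Code": ["M1C", "M1B"],
        "Latitude": [43.784535, 43.806686],
        "Longitude": [-79.160497, -79.194353],
    }
)

# A left join keeps every row of codes and matches on the key, not on position.
merged = codes.merge(
    coords, how="left", left_on="PostalCode", right_on="Postal Code"
).drop(columns="Postal Code")

# Rows with no coordinates show up as NaN and are easy to list.
unmatched = merged[merged["Latitude"].isna()]
print(merged)
```

Because the join matches on the key, it does not depend on the two tables being in the same order. If the left table already had `Latitude` and `Longitude` columns, they would need to be dropped first to avoid suffixed duplicates.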


#### Section Three

In [21]:
import json # library to handle JSON files

from pandas import json_normalize # transform JSON into a flat pandas dataframe (top-level import, available in pandas 1.0 and later)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

Solving environment: done

# All requested packages already installed.



In [22]:
print('The dataframe has {} boroughs.'.format(
        len(df['Borough'].unique())
    )
)

The dataframe has 11 boroughs.


In [23]:
# create map of Toronto using latitude and longitude values
toronto = geolocator.geocode({"state": "on", "county": "toronto"})
map_toronto = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=11)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

In [24]:
# The code was removed by Watson Studio for sharing.

Create the URL that queries the Foursquare API for up to 100 venues within 500 meters of a location, here the first postal code in the dataframe. The hidden cell above assigns the client ID, client secret, and API version to the variables used below.

In [25]:
search_lat = df.Latitude[0]
search_lon = df.Longitude[0]
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    search_lat, 
    search_lon, 
    radius, 
    LIMIT)
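
The same request URL can be assembled from a dictionary of parameters with the standard library's `urlencode`, which handles escaping and avoids a long positional format string. This is a sketch under stated assumptions: the helper name `build_explore_url`, the placeholder credentials, and the version string `20180605` are illustrative and are not the values from the hidden cell.

```python
from urllib.parse import urlencode


def build_explore_url(client_id, client_secret, version, lat, lng, radius=500, limit=100):
    """Assemble a venues/explore request URL from named parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",  # latitude and longitude as one comma-separated value
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials and version string, for illustration only.
example_url = build_explore_url("YOUR_ID", "YOUR_SECRET", "20180605", 43.806686, -79.194353)
print(example_url)
```

`urlencode` percent-encodes the comma in `ll`, which decodes back to the same value when the query string is parsed.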


In [26]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5d556ac6992951002515be5f'},
  'headerLocation': 'Malvern',
  'headerFullLocation': 'Malvern, Toronto',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 1,
  'suggestedBounds': {'ne': {'lat': 43.8111863045, 'lng': -79.18812958073042},
   'sw': {'lat': 43.80218629549999, 'lng': -79.2005772192696}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4bb6b9446edc76b0d771311c',
       'name': "Wendy's",
       'location': {'crossStreet': 'Morningside & Sheppard',
        'lat': 43.80744841934756,
        'lng': -79.19905558052072,
        'labeledLatLngs': [{'label': 'display',
          'lat': 43.80744841934756,
          'lng': -79.19905558052072}],
        'distance': 387,
        'cc': 'CA',
        'city': 'Toronto',
    

In [27]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError: # fall back to the flattened column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [28]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

Unnamed: 0,name,categories,lat,lng
0,Wendy's,Fast Food Restaurant,43.807448,-79.199056


The function below uses the Foursquare API to find the nearby venues for every postal code location in the dataframe, labeling each venue with the borough it belongs to.

In [29]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Borough', 
                  'Borough Latitude', 
                  'Borough Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

In [30]:
toronto_venues = getNearbyVenues(names=df['Borough'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
Scarborough
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
North York
East York
East York
East Toronto
East York
East York
East York
East Toronto
East Toronto
East Toronto
Central Toronto
Central Toronto
Central Toronto
Central Toronto
Central Toronto
Central Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
North York
Central Toronto
Central Toronto
Central Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
Downtown Toronto
North York
North York
York
York
Downtown Toronto
Wes

In [31]:
print(toronto_venues.shape)
toronto_venues.head()

(2235, 7)


Unnamed: 0,Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Scarborough,43.806686,-79.194353,Wendy's,43.807448,-79.199056,Fast Food Restaurant
1,Scarborough,43.784535,-79.160497,Royal Canadian Legion,43.782533,-79.163085,Bar
2,Scarborough,43.784535,-79.160497,Scarborough Historical Society,43.788755,-79.162438,History Museum
3,Scarborough,43.763573,-79.188711,Swiss Chalet Rotisserie & Grill,43.767697,-79.189914,Pizza Place
4,Scarborough,43.763573,-79.188711,G & G Electronics,43.765309,-79.191537,Electronics Store


In [32]:
toronto_venues.groupby('Borough').count()

Borough,Borough Latitude,Borough Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Central Toronto,109,109,109,109,109,109
Downtown Toronto,1283,1283,1283,1283,1283,1283
East Toronto,125,125,125,125,125,125
East York,78,78,78,78,78,78
Etobicoke,77,77,77,77,77,77
Mississauga,11,11,11,11,11,11
North York,238,238,238,238,238,238
Queen's Park,39,39,39,39,39,39
Scarborough,85,85,85,85,85,85
West Toronto,172,172,172,172,172,172


In [33]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 273 unique categories.


We use one-hot encoding to record the category of each venue. This creates a column for each of the unique categories. Each row represents one venue and holds a 1 in the column for its category and 0 in every other column.

In [34]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add borough column back to dataframe
toronto_onehot['Borough'] = toronto_venues['Borough'] 

# move borough column to the first column
fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Scarborough,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [35]:
toronto_onehot.shape

(2235, 274)

With the one-hot encoded data, we can determine the frequency with which each venue type occurs in each borough. This results in a dataframe with a column for each unique venue type and a row for each unique borough.

In [36]:
toronto_grouped = toronto_onehot.groupby('Borough').mean().reset_index()
toronto_grouped

Unnamed: 0,Borough,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store,Yoga Studio
0,Central Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018349,0.0,...,0.0,0.009174,0.0,0.0,0.009174,0.0,0.0,0.0,0.0,0.009174
1,Downtown Toronto,0.0,0.000779,0.000779,0.000779,0.001559,0.002338,0.001559,0.014809,0.001559,...,0.002338,0.01325,0.002338,0.0,0.004677,0.0,0.007015,0.0,0.000779,0.002338
2,East Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.024
3,East York,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.012821,0.0,0.012821,0.0,0.0,0.0,0.012821
4,Etobicoke,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,0.0
5,Mississauga,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,North York,0.004202,0.0,0.004202,0.0,0.0,0.0,0.0,0.008403,0.0,...,0.0,0.0,0.004202,0.004202,0.008403,0.0,0.0,0.004202,0.012605,0.0
7,Queen's Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025641
8,Scarborough,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011765,0.0,...,0.0,0.0,0.0,0.0,0.011765,0.0,0.0,0.0,0.0,0.0
9,West Toronto,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005814,...,0.0,0.005814,0.0,0.0,0.011628,0.0,0.011628,0.0,0.0,0.005814
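
To make the frequency calculation concrete, here is a minimal sketch on invented data. The boroughs `A` and `B` and their venues are assumptions chosen so the arithmetic is easy to follow. `dtype=int` is passed so the indicator columns hold 0/1 integers.

```python
import pandas as pd

# Invented venue list: borough A has 2 venues, borough B has 4.
venues = pd.DataFrame(
    {
        "Borough": ["A", "A", "B", "B", "B", "B"],
        "Venue Category": ["Cafe", "Park", "Cafe", "Cafe", "Cafe", "Park"],
    }
)

# One indicator column per category, one row per venue.
onehot = pd.get_dummies(venues["Venue Category"], dtype=int)
onehot.insert(0, "Borough", venues["Borough"])

# The mean of a 0/1 column within a borough is the share of venues in that category.
freq = onehot.groupby("Borough").mean()
print(freq)
```

Borough A comes out as half cafes and half parks, and borough B as three quarters cafes. Each row sums to 1, so the values are relative frequencies, not counts.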


Next, we determine the five most frequent venue types within each borough to describe a borough 'type', and then group the boroughs by similarity of type.

In [37]:
num_top_venues = 5

for hood in toronto_grouped['Borough']:
    print("----"+hood+"----")
    temp = toronto_grouped[toronto_grouped['Borough'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Central Toronto----
            venue  freq
0     Coffee Shop  0.08
1  Sandwich Place  0.06
2     Pizza Place  0.06
3            Café  0.05
4    Dessert Shop  0.05


----Downtown Toronto----
                venue  freq
0         Coffee Shop  0.09
1                Café  0.05
2  Italian Restaurant  0.03
3          Restaurant  0.03
4               Hotel  0.03


----East Toronto----
                venue  freq
0    Greek Restaurant  0.07
1         Coffee Shop  0.06
2  Italian Restaurant  0.05
3      Ice Cream Shop  0.03
4             Brewery  0.03


----East York----
                 venue  freq
0          Coffee Shop  0.06
1         Burger Joint  0.05
2          Pizza Place  0.05
3                 Park  0.05
4  Sporting Goods Shop  0.04


----Etobicoke----
            venue  freq
0     Pizza Place  0.10
1  Sandwich Place  0.06
2     Coffee Shop  0.05
3        Pharmacy  0.05
4            Pool  0.04


----Mississauga----
                       venue  freq
0                      Hotel  0

In [38]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [39]:
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Borough']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # positions beyond 'rd' take the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
boroughs_venues_sorted = pd.DataFrame(columns=columns)
boroughs_venues_sorted['Borough'] = toronto_grouped['Borough']

for ind in np.arange(toronto_grouped.shape[0]):
    boroughs_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

boroughs_venues_sorted.head()

Unnamed: 0,Borough,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,Central Toronto,Coffee Shop,Sandwich Place,Pizza Place,Café,Park
1,Downtown Toronto,Coffee Shop,Café,Restaurant,Hotel,Bakery
2,East Toronto,Greek Restaurant,Coffee Shop,Italian Restaurant,Brewery,Park
3,East York,Coffee Shop,Park,Burger Joint,Pizza Place,Sporting Goods Shop
4,Etobicoke,Pizza Place,Sandwich Place,Pharmacy,Coffee Shop,Grocery Store
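
The column labels above rely on an exception to fall back to the 'th' suffix. As a hedged alternative, a small helper could state the suffix rule explicitly, which also handles values such as 11 to 13 if `num_top_venues` were ever raised that high. The function name `ordinal` is an assumption for illustration.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, for example 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


# Build the same column labels used for the top-venue table.
columns = ["Borough"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 6)]
print(columns)
```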


Using k-means clustering, we group the boroughs by similarity of the venues available. For this example we chose 5 clusters, but this can be adjusted by setting the `kclusters` variable to the desired number of clusters in the code below.

In [40]:
# set number of clusters
kclusters = 5

toronto_grouped_clustering = toronto_grouped.drop(columns='Borough')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 0, 0, 4, 4, 1, 0, 3, 4, 0], dtype=int32)

Each borough is now assigned to one of five clusters, indexed as 0-4.
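
Since the number of clusters is a free choice, one rough way to compare candidates is to look at how the within-cluster sum of squares (`inertia_`) falls as the count grows. The sketch below does this on four invented frequency profiles. The array `profiles` is made up for illustration, and `n_init=10` is set explicitly as an assumption so the result does not depend on the library's default.

```python
import numpy as np
from sklearn.cluster import KMeans

# Four invented venue-frequency profiles forming two obvious groups.
profiles = np.array(
    [
        [0.9, 0.1, 0.0],
        [0.8, 0.2, 0.0],
        [0.1, 0.1, 0.8],
        [0.0, 0.2, 0.8],
    ]
)

# Fit k-means for several cluster counts and record the inertia of each fit.
inertias = {}
for k in (1, 2, 3):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)
    inertias[k] = model.inertia_

# Labels for the two-cluster fit.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles).labels_
print(inertias, labels)
```

A sharp drop in inertia followed by a much smaller one suggests that the extra cluster adds little. With only a handful of boroughs, any such comparison is indicative at best.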

In [41]:
# add clustering labels
boroughs_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

toronto_merged = df

# join the cluster labels and top venues onto df, which holds the latitude/longitude for each postal code
toronto_merged = toronto_merged.join(boroughs_venues_sorted.set_index('Borough'), on='Borough')

toronto_merged.head() # check the last columns!

Unnamed: 0,PostalCode,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,M1B,Scarborough,"Rouge, Malvern",43.806686,-79.194353,4,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Breakfast Spot,Pizza Place
1,M1C,Scarborough,"Highland Creek, Rouge Hill, Port Union",43.784535,-79.160497,4,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Breakfast Spot,Pizza Place
2,M1E,Scarborough,"Guildwood, Morningside, West Hill",43.763573,-79.188711,4,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Breakfast Spot,Pizza Place
3,M1G,Scarborough,Woburn,43.770992,-79.216917,4,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Breakfast Spot,Pizza Place
4,M1H,Scarborough,Cedarbrae,43.773136,-79.239476,4,Fast Food Restaurant,Coffee Shop,Chinese Restaurant,Breakfast Spot,Pizza Place


Visualized on a map, the borough clusters look like this.

In [42]:
# create map
map_clusters = folium.Map(location=[toronto.latitude, toronto.longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(toronto_merged['Latitude'], toronto_merged['Longitude'], toronto_merged['Borough'], toronto_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters