# IBM Data Science Coursera Capstone Project

# Table of Contents

1. Introduction
2. Data
3. Methodology
4. Results and Discussion
5. Conclusion

# 1. Introduction

Vancouver is a well-developed Canadian city with a highly competitive services industry. The restaurant and bar scene is particularly vibrant, if not saturated, and is populated by a mix of chains and independent locations. Though it is a tough market to get into, a restaurant or bar in Vancouver can be financially very successful if it gains a foothold in the scene. These businesses can thrive off the city's remarkably high income, ethnically diverse population, and passionate food and craft beer scene.

This project explores patterns across Vancouver's neighbourhoods by grouping them into clusters in order to identify existing trends. Recommendations can then be made on which category of neighbourhood is most suitable for a certain type of venue.

The results of this project may be most useful for entrepreneurs in the food and beverage sector, given that location can be the deciding factor for success.

# 2. Data

### Required Data

To analyze trends in Vancouver's neighbourhoods, the list of neighbourhoods is obtained from the "Local Area Boundary" dataset at opendata.vancouver.ca. As the file rendered poorly as JSON or CSV, I downloaded it as an Excel sheet and cleaned it, removing some unwanted columns.

Venue queries will then be made for each neighbourhood using the FourSquare API. The resulting venue category data will be used to observe commonality between neighbourhoods. The resulting clusters can then provide insight into which type of venue is likely to thrive in which cluster. The K-means clustering algorithm will be used to find patterns between the neighbourhoods.

In summary, the following data is required to conduct our analysis:

The neighbourhoods of Vancouver,
the coordinates of these neighbourhoods,
trending venues in the area, and
venue categories.

This project will use the dataset described above. It will be cleaned, preprocessed, and finally analyzed in conjunction with FourSquare location data to identify the optimum locations to open a bar or restaurant in the city of Vancouver.

### Data Gathering

First, import the required libraries and modules.

In [1]:
# Load needed libraries for data collection

# HTML request and scraper library
import requests
from bs4 import BeautifulSoup

# Geocoding library
#!conda install -c conda-forge geopy --yes # Uncomment to install geopy
from geopy.geocoders import ArcGIS # module to convert an address into latitude and longitude values

# Library for data analysis
import pandas as pd
from pandas import json_normalize # Function to transform json
import numpy as np

#!conda install -c conda-forge folium=0.5.0 --yes # Uncomment to install folium
import folium # map plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import collapsible JSON for exploration
from IPython.display import JSON

# k-means for categorization
from sklearn.cluster import KMeans

# Pretty print
from pprint import pprint

Load in the dataset from opendata.vancouver.ca.

In [2]:
df = pd.read_excel(r'C:\Users\lukef\Downloads\local-area-boundary.xlsx')
print(df)

   MAPID              Neighborhood   Latitude   Longitude
0     AR             Arbutus-Ridge  49.246805 -123.161669
1    CBD                  Downtown  49.280747 -123.116567
2   FAIR                  Fairview  49.264540 -123.131049
3     GW        Grandview-Woodland  49.276440 -123.066728
4     HS          Hastings-Sunrise  49.277934 -123.040270
5   MARP                   Marpole  49.210207 -123.128382
6     RP                Riley Park  49.244766 -123.103147
7   SHAU               Shaughnessy  49.245681 -123.139760
8    STR                Strathcona  49.278220 -123.088235
9     WE                  West End  49.285011 -123.135438
10    KC  Kensington-Cedar Cottage  49.246686 -123.072885
11    MP            Mount Pleasant  49.263065 -123.098513
12   OAK                  Oakridge  49.226403 -123.123025
13    RC       Renfrew-Collingwood  49.247343 -123.040166
14   SUN                    Sunset  49.218756 -123.092038
15   WPG           West Point Grey  49.268401 -123.203467
16    DS      

Now we have the neighbourhoods of Vancouver, along with their coordinates.

In [3]:
df.head()

Unnamed: 0,MAPID,Neighborhood,Latitude,Longitude
0,AR,Arbutus-Ridge,49.246805,-123.161669
1,CBD,Downtown,49.280747,-123.116567
2,FAIR,Fairview,49.26454,-123.131049
3,GW,Grandview-Woodland,49.27644,-123.066728
4,HS,Hastings-Sunrise,49.277934,-123.04027


Let's make a map of Vancouver with all the neighbourhoods on it.

In [4]:
vancouver_latlong = [49.246292, -123.116226]

In [5]:
# create map of Vancouver using latitude and longitude values
vancouver_map = folium.Map(location=vancouver_latlong, zoom_start=11)

# add markers to map
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='#2E7D32',
        fill=True,
        fill_color='#FFEB3B',
        fill_opacity=0.75,
        parse_html=False).add_to(vancouver_map)
vancouver_map

### Integrate FourSquare API

Add credentials

In [6]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
CLIENT_SECRET:YOUR_FOURSQUARE_CLIENT_SECRET
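
Credentials written directly into a notebook are easy to leak when the notebook is shared. One common alternative, sketched here under the assumption that the keys are exported as environment variables, is to read them at run time. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical, not part of the Foursquare API.

```python
import os


def load_foursquare_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are hypothetical; use whatever names you export.
    """
    env = os.environ if env is None else env
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Usage with an explicit mapping, so the example runs without real credentials
client_id, client_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "example-id", "FOURSQUARE_CLIENT_SECRET": "example-secret"}
)
print("CLIENT_ID loaded:", bool(client_id))  # avoid printing the secret itself
```

Calling `load_foursquare_credentials()` with no argument reads from the real environment instead of the example mapping.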


Now we can create a function that pulls recommended venues from each neighbourhood.
Doing this will allow us to examine the distribution of venues in each via a map to see which neighbourhoods would be a good place to open a new one.

In [7]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
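
The function above indexes `categories[0]` directly, so a venue returned without any category would raise an `IndexError`. The sketch below shows one possible way to flatten the same response items more defensively. The helper `venue_rows`, the `"Uncategorized"` fallback label, and the second sample venue are assumptions made for illustration; the payload only mimics the shape of `response['groups'][0]['items']` used above.

```python
def venue_rows(name, lat, lng, items):
    """Flatten Foursquare 'explore' items into tuples, tolerating missing categories."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        # Fall back to a placeholder label instead of raising IndexError
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((name, lat, lng, venue.get("name"),
                     location.get("lat"), location.get("lng"), category))
    return rows


# Hypothetical payload shaped like response['groups'][0]['items']
sample_items = [
    {"venue": {"name": "Triangle Park",
               "location": {"lat": 49.245061, "lng": -123.167914},
               "categories": [{"name": "Park"}]}},
    {"venue": {"name": "Unlabelled Venue",  # made-up venue with no category
               "location": {"lat": 49.2460, "lng": -123.1600},
               "categories": []}},
]
rows = venue_rows("Arbutus-Ridge", 49.246805, -123.161669, sample_items)
for row in rows:
    print(row)
```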

In [8]:
vancouver_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )
vancouver_venues

Arbutus-Ridge
Downtown
Fairview
Grandview-Woodland
Hastings-Sunrise
Marpole
Riley Park
Shaughnessy
Strathcona
West End
Kensington-Cedar Cottage
Mount Pleasant
Oakridge
Renfrew-Collingwood
Sunset
West Point Grey
Dunbar-Southlands
Kerrisdale
Killarney
Kitsilano
South Cambie
Victoria-Fraserview


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Arbutus-Ridge,49.246805,-123.161669,valley bike lane,49.249531,-123.159336,Bike Rental / Bike Share
1,Arbutus-Ridge,49.246805,-123.161669,Bus Stop 51498 (25),49.247906,-123.167012,Bus Stop
2,Arbutus-Ridge,49.246805,-123.161669,Total Corporate Learning Inc.,49.242441,-123.161609,Business Service
3,Arbutus-Ridge,49.246805,-123.161669,Triangle Park,49.245061,-123.167914,Park
4,Downtown,49.280747,-123.116567,L'Hermitage,49.280139,-123.117480,Hotel
...,...,...,...,...,...,...,...
470,Victoria-Fraserview,49.220012,-123.064135,7-Eleven,49.221115,-123.065350,Convenience Store
471,Victoria-Fraserview,49.220012,-123.064135,East Side Re-Rides,49.219511,-123.066045,Motorcycle Shop
472,Victoria-Fraserview,49.220012,-123.064135,Panago Pizza,49.219231,-123.066215,Pizza Place
473,Victoria-Fraserview,49.220012,-123.064135,Bosley's,49.220926,-123.065382,Pet Store


Let's double check the resulting dataframe and its shape.

In [9]:
vancouver_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Arbutus-Ridge,49.246805,-123.161669,valley bike lane,49.249531,-123.159336,Bike Rental / Bike Share
1,Arbutus-Ridge,49.246805,-123.161669,Bus Stop 51498 (25),49.247906,-123.167012,Bus Stop
2,Arbutus-Ridge,49.246805,-123.161669,Total Corporate Learning Inc.,49.242441,-123.161609,Business Service
3,Arbutus-Ridge,49.246805,-123.161669,Triangle Park,49.245061,-123.167914,Park
4,Downtown,49.280747,-123.116567,L'Hermitage,49.280139,-123.11748,Hotel


In [10]:
vancouver_venues.shape

(475, 7)

In [11]:
vancouver_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Arbutus-Ridge,4,4,4,4,4,4
Downtown,77,77,77,77,77,77
Dunbar-Southlands,6,6,6,6,6,6
Fairview,27,27,27,27,27,27
Grandview-Woodland,35,35,35,35,35,35
Hastings-Sunrise,14,14,14,14,14,14
Kensington-Cedar Cottage,13,13,13,13,13,13
Kerrisdale,4,4,4,4,4,4
Killarney,18,18,18,18,18,18
Kitsilano,55,55,55,55,55,55


The number in each row shows how many venues are registered in that neighbourhood. Certain neighbourhoods are primarily residential and less accessible than more central neighbourhoods, and so have few venues. We can exclude neighbourhoods with fewer than 10 venues from our analysis, as these seem unlikely and unpopular spots to establish venues. Based on my experience living in some of these areas, my assumption is that they have low footfall and few out-of-neighbourhood visitors, and so are not great places to do business.

In [12]:
# Drop neighbourhoods with fewer than 10 venues

# Count venues per neighbourhood and list those below the threshold
venue_counts = vancouver_venues.groupby('Neighborhood').Venue.count()
lowvenues = list(venue_counts[venue_counts < 10].index)

# Keep only rows whose neighbourhood is not in the low-venue list
# (boolean indexing returns a new dataframe, so the original is left untouched)
vancouver_venues_filtered = vancouver_venues[~vancouver_venues.Neighborhood.isin(lowvenues)]
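
As a quick sanity check on the threshold, the same rule can be applied by hand to the counts printed in the groupby output earlier. This is only a sketch: the dictionary holds the ten rows that are visible in that output, and `MIN_VENUES` is a name introduced here for the cut-off of 10.

```python
# Venue counts copied from the groupby output shown earlier (visible rows only)
venue_counts = {
    "Arbutus-Ridge": 4, "Downtown": 77, "Dunbar-Southlands": 6, "Fairview": 27,
    "Grandview-Woodland": 35, "Hastings-Sunrise": 14, "Kensington-Cedar Cottage": 13,
    "Kerrisdale": 4, "Killarney": 18, "Kitsilano": 55,
}
MIN_VENUES = 10  # same threshold as the filter above

dropped = sorted(n for n, c in venue_counts.items() if c < MIN_VENUES)
kept = sorted(n for n, c in venue_counts.items() if c >= MIN_VENUES)
print("Dropped:", dropped)
print("Kept:", kept)
```

Because the comparison is strict, a neighbourhood with exactly 10 venues stays in the analysis.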

Now let's see which neighbourhoods remain.

In [13]:
vancouver_venues_filtered.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Downtown,77,77,77,77,77,77
Fairview,27,27,27,27,27,27
Grandview-Woodland,35,35,35,35,35,35
Hastings-Sunrise,14,14,14,14,14,14
Kensington-Cedar Cottage,13,13,13,13,13,13
Killarney,18,18,18,18,18,18
Kitsilano,55,55,55,55,55,55
Mount Pleasant,76,76,76,76,76,76
Oakridge,10,10,10,10,10,10
Riley Park,58,58,58,58,58,58


Let's see how many unique categories of venues there are.

In [14]:
print('There are {} unique categories.'.format(len(vancouver_venues_filtered['Venue Category'].unique())))

There are 146 unique categories.


That's a lot of venue categories! You may note at this point that the goal of this analysis is to find the optimum areas to open a new restaurant or bar, which would fall into only some of the 146 listed categories. The assumption I have made in not filtering out the other categories is that other venues indicate an area with activity and thriving businesses. Moreover, other venues can bring people to a restaurant or bar; for example, someone may decide to go for a meal or a drink after shopping or running errands in the neighbourhood. Opening a venue in an area with other venues nearby allows for the spontaneous 'walk-in' visitor, rather than relying solely on those who visit the neighbourhood with the preplanned intention of going to our new venue.

Let's take a look at all the venues on a map and see where the hubs of activity are.

In [15]:
vanmap_venue = folium.Map(location=vancouver_latlong, zoom_start=11) # generate map centred around the city of Vancouver

# add a red circle marker to represent center of Vancouver
folium.CircleMarker(
    vancouver_latlong,
    radius=10,
    color='red',
    popup='Vancouver',
    fill = True,
    fill_color = 'red',
    fill_opacity = 1
    ).add_to(vanmap_venue)

# add blue neighbourhood markers to map 
for lat, lng, label in zip(df['Latitude'], df['Longitude'], df['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#FFEB3B',
        fill_opacity=0.75,
        parse_html=False).add_to(vanmap_venue)

# add venues to the map as green circle markers
for lat, lng, label, cat in zip(vancouver_venues_filtered["Venue Latitude"], vancouver_venues_filtered["Venue Longitude"], 
                                vancouver_venues_filtered["Venue"], vancouver_venues_filtered["Venue Category"]):
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label +", " + cat,
        fill=True,
        color='green',
        fill_color='green',
        fill_opacity=0.6
        ).add_to(vanmap_venue)


# display map
vanmap_venue

Intuitively, we can see three things:

1. The neighbourhoods with few (< 10) venues do not have their venues displayed. We can identify these blue markers as poor candidates in which to establish a new restaurant or bar.
2. Certain areas of the city have large clusters of venues. For example, at the top of the map it is apparent that, unsurprisingly, Downtown is densely populated with venues. Other areas, such as Mount Pleasant and Riley Park, also seem to be areas of activity.
3. The further we get from Downtown, the less dense the clusters of venues seem to be.
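
One way to put a rough number on "further from Downtown" is the great-circle distance between neighbourhood coordinates. The sketch below uses the haversine formula with coordinates taken from the neighbourhood table loaded earlier. The function name `haversine_km` is an assumption, and the distances are between the dataset's neighbourhood points, not between individual venues.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Coordinates from the neighbourhood dataframe printed earlier
downtown = (49.280747, -123.116567)
centres = {
    "Mount Pleasant": (49.263065, -123.098513),
    "Riley Park": (49.244766, -123.103147),
    "Marpole": (49.210207, -123.128382),
}
distances = {name: haversine_km(*downtown, *coords) for name, coords in centres.items()}
for name, km in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{name}: {km:.1f} km from Downtown")
```

Pairing these distances with the venue counts per neighbourhood would give a simple check of the third observation.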

# 3. Methodology

Since our objective is to categorize the neighbourhoods, we will use the K-means clustering algorithm to group the neighbourhoods of Vancouver.

One-hot encoding will be applied to the venue dataframe, turning each venue category into its own column. The encoded rows will then be grouped by neighbourhood and averaged, which gives the relative frequency of each venue type in each neighbourhood.

The top venues per neighbourhood will then be extracted from the encoded dataframe for inspection, and the K-means clustering algorithm will be run over the grouped frequencies. This will return a cluster label for each neighbourhood. The clusters will then be examined manually, one by one, to determine their contents.

Recommendations will be made based on the clustering.
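
To make the clustering step less of a black box, here is a minimal pure-Python sketch of the basic K-means loop (Lloyd's algorithm): assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until nothing changes. It is an illustration only. The analysis itself relies on scikit-learn's `KMeans`, which adds smarter initialisation and other refinements, and the two-dimensional points below are made-up frequency pairs, not values from the dataset.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on tuples of floats, starting from the given centroids."""
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical (coffee shop share, Vietnamese restaurant share) pairs
points = [(0.11, 0.00), (0.10, 0.02), (0.00, 0.15), (0.01, 0.14)]
labels, centroids = kmeans(points, centroids=[points[0], points[2]])
print(labels)
```

In the notebook the same idea runs over one column per venue category rather than two, so each neighbourhood is a point in a much higher-dimensional space.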

This established, let's apply one-hot encoding to our data.

In [16]:
# one hot encoding
van_onehot = pd.get_dummies(vancouver_venues_filtered[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
van_onehot['Neighborhood'] = vancouver_venues_filtered['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [van_onehot.columns[-1]] + list(van_onehot.columns[:-1])
van_onehot = van_onehot[fixed_columns]

van_onehot.head()

Unnamed: 0,Neighborhood,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,...,Theater,Theme Park Ride / Attraction,Thrift / Vintage Store,Tiki Bar,Toy / Game Store,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop,Women's Store,Yoga Studio
4,Downtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Downtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Downtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Downtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Downtown,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check the shape of the dataframe, and group the encoded dataframe.

In [17]:
van_onehot.shape

(428, 147)

In [18]:
van_grouped = van_onehot.groupby('Neighborhood').mean().reset_index()
van_grouped

Unnamed: 0,Neighborhood,American Restaurant,Art Gallery,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,BBQ Joint,Bagel Shop,Bakery,Bank,...,Theater,Theme Park Ride / Attraction,Thrift / Vintage Store,Tiki Bar,Toy / Game Store,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Wine Shop,Women's Store,Yoga Studio
0,Downtown,0.0,0.025974,0.0,0.0,0.0,0.0,0.0,0.012987,0.0,...,0.025974,0.0,0.0,0.0,0.012987,0.025974,0.012987,0.0,0.0,0.0
1,Fairview,0.0,0.0,0.037037,0.074074,0.0,0.037037,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.0,0.0,0.0
2,Grandview-Woodland,0.0,0.0,0.0,0.028571,0.028571,0.028571,0.0,0.028571,0.0,...,0.057143,0.0,0.0,0.0,0.028571,0.057143,0.0,0.0,0.0,0.0
3,Hastings-Sunrise,0.0,0.0,0.071429,0.0,0.0,0.071429,0.0,0.0,0.0,...,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Kensington-Cedar Cottage,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,...,0.0,0.0,0.0,0.0,0.0,0.0,0.153846,0.0,0.0,0.0
5,Killarney,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.055556,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Kitsilano,0.018182,0.0,0.0,0.0,0.0,0.0,0.0,0.036364,0.036364,...,0.0,0.0,0.018182,0.0,0.036364,0.018182,0.018182,0.036364,0.018182,0.018182
7,Mount Pleasant,0.0,0.0,0.026316,0.0,0.0,0.0,0.013158,0.0,0.0,...,0.0,0.0,0.026316,0.0,0.0,0.0,0.026316,0.0,0.0,0.013158
8,Oakridge,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.0
9,Riley Park,0.0,0.017241,0.017241,0.0,0.017241,0.0,0.0,0.0,0.017241,...,0.0,0.0,0.0,0.017241,0.0,0.017241,0.051724,0.0,0.0,0.0
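
The grouped values are simply shares: the mean of a one-hot column within a neighbourhood equals the number of venues in that category divided by the neighbourhood's total. The sketch below illustrates this with a hypothetical helper, `category_shares`, and then cross-checks a few figures from the table above against the venue counts reported earlier (77 for Downtown, 27 for Fairview). Reading 0.025974 as 2 of 77 venues is an inference from those numbers.

```python
from collections import Counter


def category_shares(categories):
    """Share of venues per category; equals the column means of a one-hot table."""
    counts = Counter(categories)
    total = len(categories)
    return {cat: n / total for cat, n in counts.items()}


# The four Arbutus-Ridge venue categories listed earlier (used only as a small example)
arbutus = ["Bike Rental / Bike Share", "Bus Stop", "Business Service", "Park"]
shares = category_shares(arbutus)
print(shares)

# Cross-check against the grouped table: Downtown has 77 venues, Fairview 27
downtown_check = (round(2 / 77, 6), round(1 / 77, 6))  # Art Gallery, Bakery
fairview_check = (round(1 / 27, 6), round(2 / 27, 6))  # Arts & Crafts Store, Asian Restaurant
print(downtown_check, fairview_check)
```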


In [19]:
van_grouped.shape

(13, 147)

Let's find the top 5 venues in each neighbourhood.

In [20]:
num_top_venues = 5

for hood in van_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = van_grouped[van_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Downtown----
                venue  freq
0               Hotel  0.08
1          Restaurant  0.06
2  Seafood Restaurant  0.04
3         Coffee Shop  0.04
4                Café  0.04


----Fairview----
                 venue  freq
0          Coffee Shop  0.11
1     Asian Restaurant  0.07
2  Japanese Restaurant  0.07
3                 Park  0.07
4       Breakfast Spot  0.07


----Grandview-Woodland----
                           venue  freq
0                    Coffee Shop  0.11
1                    Pizza Place  0.09
2  Vegetarian / Vegan Restaurant  0.06
3                  Deli / Bodega  0.06
4                        Theater  0.06


----Hastings-Sunrise----
                          venue  freq
0  Theme Park Ride / Attraction  0.14
1                Farmers Market  0.07
2                   Gas Station  0.07
3                        Office  0.07
4                      Gun Shop  0.07


----Kensington-Cedar Cottage----
                   venue  freq
0  Vietnamese Restaurant  0.15
1      

Now let's define a function to return the most common venues. Then, we'll create a new dataframe and display the top 10 venues for each neighborhood.

In [21]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [22]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # no special suffix defined beyond 'st', 'nd', 'rd'
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighbourhood_top10 = pd.DataFrame(columns=columns)
neighbourhood_top10['Neighborhood'] = van_grouped['Neighborhood']

for ind in np.arange(van_grouped.shape[0]):
    neighbourhood_top10.iloc[ind, 1:] = return_most_common_venues(van_grouped.iloc[ind, :], num_top_venues)

neighbourhood_top10.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Downtown,Hotel,Restaurant,Seafood Restaurant,Café,Coffee Shop,Gastropub,Breakfast Spot,Steakhouse,Sandwich Place,Concert Hall
1,Fairview,Coffee Shop,Park,Asian Restaurant,Breakfast Spot,Japanese Restaurant,Indian Restaurant,Korean Restaurant,Pet Store,Pharmacy,Nail Salon
2,Grandview-Woodland,Coffee Shop,Pizza Place,Indian Restaurant,Vegetarian / Vegan Restaurant,Theater,Deli / Bodega,Brewery,Park,Pub,Record Shop
3,Hastings-Sunrise,Theme Park Ride / Attraction,Pizza Place,Gun Shop,Farmers Market,Bridal Shop,Event Space,Gas Station,Office,Portuguese Restaurant,Café
4,Kensington-Cedar Cottage,Vietnamese Restaurant,Indian Restaurant,Bank,Burger Joint,Sandwich Place,Café,Breakfast Spot,Supermarket,Seafood Restaurant,Bus Stop
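
The column labels above rely on a three-item suffix list with a fallback to "th", which is fine for ten columns. If more columns were ever needed, a small helper could handle the irregular cases as well. The function `ordinal` below is a hypothetical alternative, not part of the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 11 -> '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


columns = ["Neighborhood"] + [f"{ordinal(i)} Most Common Venue" for i in range(1, 11)]
print(columns[1], "|", columns[4], "|", columns[10])
```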


### Clustering Neighbourhoods

It's time to use k-means to group the neighborhoods into 5 clusters.

In [23]:
# set number of clusters
kclusters = 5

van_clustering = van_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(van_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 4, 0, 1, 1, 1, 1, 1])
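
Before interpreting the clusters, it can help to tally how many neighbourhoods received each label. The sketch below counts the ten labels printed above; the remaining labels are not shown in that output, so this is a partial tally only.

```python
from collections import Counter

# First ten labels as printed by kmeans.labels_[0:10] above (zero-based)
labels = [1, 1, 1, 4, 0, 1, 1, 1, 1, 1]

sizes = Counter(labels)
for label, size in sorted(sizes.items()):
    print(f"label {label}: {size} neighbourhood(s)")
```

A tally this lopsided suggests that one cluster will dominate the results, which is worth keeping in mind when comparing clusters.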

Now that we have our array of labels, we can create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.

In [24]:
# add clustering labels
neighbourhood_top10.insert(0, 'Cluster Labels', kmeans.labels_)

van_new_df = df

# merge df to add latitude/longitude for each neighbourhood
van_new_df = van_new_df.merge(neighbourhood_top10.set_index('Neighborhood'), on='Neighborhood')

# Shift label to start from index 1
van_new_df['Cluster Labels'] = van_new_df['Cluster Labels'] + 1

van_new_df.head()

Unnamed: 0,MAPID,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,CBD,Downtown,49.280747,-123.116567,2,Hotel,Restaurant,Seafood Restaurant,Café,Coffee Shop,Gastropub,Breakfast Spot,Steakhouse,Sandwich Place,Concert Hall
1,FAIR,Fairview,49.26454,-123.131049,2,Coffee Shop,Park,Asian Restaurant,Breakfast Spot,Japanese Restaurant,Indian Restaurant,Korean Restaurant,Pet Store,Pharmacy,Nail Salon
2,GW,Grandview-Woodland,49.27644,-123.066728,2,Coffee Shop,Pizza Place,Indian Restaurant,Vegetarian / Vegan Restaurant,Theater,Deli / Bodega,Brewery,Park,Pub,Record Shop
3,HS,Hastings-Sunrise,49.277934,-123.04027,5,Theme Park Ride / Attraction,Pizza Place,Gun Shop,Farmers Market,Bridal Shop,Event Space,Gas Station,Office,Portuguese Restaurant,Café
4,RP,Riley Park,49.244766,-123.103147,2,Japanese Restaurant,Coffee Shop,Vietnamese Restaurant,Farmers Market,Skating Rink,Restaurant,Sporting Goods Shop,Café,Chinese Restaurant,Lounge


Now, let's visualize the clusters on a map.

In [26]:
# create map
map_clusters = folium.Map(location=vancouver_latlong, zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(van_new_df['Latitude'], van_new_df['Longitude'], van_new_df['Neighborhood'], van_new_df['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

There's the map! Purple markers = cluster 1, blue = cluster 2, green = cluster 3, orange = cluster 4, red = cluster 5.

### Examine the clusters

Cluster 1

We can see that Cluster 1 has Vietnamese and Indian restaurants, which could be good types of restaurant to open. These cuisines appear to be popular, so another such restaurant may cater to local taste. Moreover, restaurants and bars seem to do well here; 7 out of 10 of the most common venues are places to dine in. However, Cluster 1 consists of only one neighbourhood (MAPID: KC = Kensington-Cedar Cottage), so it has fewer venues and less activity. This fits with the cluster being quite far from Downtown and closer to residential suburbs. There may be other clusters that are busier and therefore better places to open a restaurant or bar.

In [27]:
van_new_df.loc[van_new_df['Cluster Labels'] == 1, van_new_df.columns[[0] + list(range(4, van_new_df.shape[1]))]]

Unnamed: 0,MAPID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
7,KC,1,Vietnamese Restaurant,Indian Restaurant,Bank,Burger Joint,Sandwich Place,Café,Breakfast Spot,Supermarket,Seafood Restaurant,Bus Stop


Cluster 2

Cluster 2 consists of 9 neighbourhoods that may be suitable candidates for opening our new restaurant/bar. All the neighbourhoods contained within are known as areas popular with restaurant-goers. Importantly, the six neighbourhoods in the north of the map are densely populated and known cultural areas, suggesting there is already an established culture of eating and drinking out, although this could also mean these areas are competitive and a new venture could struggle to become established. On the other hand, most are located close to Downtown or are situated on good transport lines. We can see that all have a good number of bars or restaurants within them, along with a selection of other businesses that might draw hungry or thirsty customers. Moreover, there is a good distribution of different types of restaurants and bars, suggesting any style, from a cafe to a vegan spot, could thrive.

In [28]:
van_new_df.loc[van_new_df['Cluster Labels'] == 2, van_new_df.columns[[0] + list(range(4, van_new_df.shape[1]))]]

Unnamed: 0,MAPID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,CBD,2,Hotel,Restaurant,Seafood Restaurant,Café,Coffee Shop,Gastropub,Breakfast Spot,Steakhouse,Sandwich Place,Concert Hall
1,FAIR,2,Coffee Shop,Park,Asian Restaurant,Breakfast Spot,Japanese Restaurant,Indian Restaurant,Korean Restaurant,Pet Store,Pharmacy,Nail Salon
2,GW,2,Coffee Shop,Pizza Place,Indian Restaurant,Vegetarian / Vegan Restaurant,Theater,Deli / Bodega,Brewery,Park,Pub,Record Shop
4,RP,2,Japanese Restaurant,Coffee Shop,Vietnamese Restaurant,Farmers Market,Skating Rink,Restaurant,Sporting Goods Shop,Café,Chinese Restaurant,Lounge
6,WE,2,Café,Gay Bar,Farmers Market,Coffee Shop,Diner,Lingerie Store,Grocery Store,Sandwich Place,Bakery,Noodle House
8,MP,2,Coffee Shop,Diner,Breakfast Spot,Sandwich Place,Sushi Restaurant,Lounge,Brewery,Indian Restaurant,Vietnamese Restaurant,Thrift / Vintage Store
9,OAK,2,Bubble Tea Shop,Light Rail Station,Pizza Place,Vietnamese Restaurant,Bus Station,Coffee Shop,Sporting Goods Shop,Sandwich Place,Fast Food Restaurant,Sushi Restaurant
10,KIL,2,Coffee Shop,Juice Bar,Shopping Mall,Sandwich Place,Salon / Barbershop,Recreation Center,Liquor Store,Sushi Restaurant,Farmers Market,Chinese Restaurant
11,KITS,2,Coffee Shop,Pizza Place,Japanese Restaurant,Wine Shop,Bakery,Bank,Toy / Game Store,Food Truck,Optical Shop,Grocery Store


Cluster 3

Cluster 3 consists of one neighbourhood (Strathcona), which is known as a hub for brewing and the craft beer scene in the city. Although there are bars/pubs and some venues to grab a quick bite to eat, it seems to lack a range of sit-down restaurants. On the other hand, this area is quite industrial, is not the best-linked area by public transport, and lacks a diversity of cultural venues that would attract visitors and customers. Venues here may also face competition from the nearby downtown area in Cluster 2, which boasts well-established bars, restaurants, and other venues.

In [29]:
van_new_df.loc[van_new_df['Cluster Labels'] == 3, van_new_df.columns[[0] + list(range(4, van_new_df.shape[1]))]]

Unnamed: 0,MAPID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
5,STR,3,Park,Deli / Bodega,Cheese Shop,Food Truck,Brewery,Sandwich Place,Restaurant,Pub,Coffee Shop,Dive Bar


Cluster 4

Cluster 4 contains one neighbourhood (South Cambie). It appears to have a mix of practical amenities such as banks and grocery stores, in keeping with its status as a fairly residential area. We can also see that 4 of the 10 most common venues serve Asian cuisines. Cambie and much of the south of the city have an ethnically diverse population, including many Asian immigrants and their descendants. These restaurants could be common because they cater to local taste. A restaurant of a different style could face challenges cutting into this market. On the other hand, bars do not seem to be common here, which could represent an opportunity for one to set up.

In [30]:
van_new_df.loc[van_new_df['Cluster Labels'] == 4, van_new_df.columns[[0] + list(range(4, van_new_df.shape[1]))]]

Unnamed: 0,MAPID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,SC,4,Coffee Shop,Café,Vietnamese Restaurant,Malay Restaurant,Cantonese Restaurant,Grocery Store,Bank,Liquor Store,Gift Shop,Sushi Restaurant


Cluster 5

Cluster 5 consists of the Hastings-Sunrise neighbourhood, an area that presents both opportunities and challenges for a new restaurant or bar. On one hand, the area has a theme park (namely the PNE), which draws out-of-town visitors, and restaurants and bars do not seem to be common. The area is also very close to Downtown. However, that same accessibility may draw guests away to more established areas. Moreover, Hastings-Sunrise has perhaps the highest levels of social deprivation in the city and a reputation it has struggled to shake off. A new restaurant or bar may be welcome here but may struggle to gain traction owing to the area's reputation, relative lack of venue diversity, and competition from venues in the downtown area.

In [31]:
van_new_df.loc[van_new_df['Cluster Labels'] == 5, van_new_df.columns[[0] + list(range(4, van_new_df.shape[1]))]]

Unnamed: 0,MAPID,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,HS,5,Theme Park Ride / Attraction,Pizza Place,Gun Shop,Farmers Market,Bridal Shop,Event Space,Gas Station,Office,Portuguese Restaurant,Café


# 4. Results and Discussion

As discussed above, there are several factors to consider when opening a restaurant or bar in the city of Vancouver. The groupings produced by the K-means clustering algorithm tally with demographic and social trends in the city, so recommendations can be based on these groupings together with inferences drawn from experience living and working in the city. Ranked from most to least suitable for a new establishment, the clusters are:

1. Cluster 2
2. Cluster 3
3. Cluster 1
4. Cluster 4
5. Cluster 5

Many of the reasons for the suitability of each cluster have been discussed above. However, a more detailed analysis follows, starting with the least suitable cluster:

Cluster 5, the Hastings-Sunrise neighbourhood, appears the least suited for a new bar/restaurant, as its lack of venues of cultural or social significance will fail to bring many visitors to the neighbourhood itself, though a busy summer's day at the PNE theme park may be an exception. Moreover, Cluster 5's proximity to downtown, and to some neighbourhoods in Cluster 2, will likely mean both locals and tourists choose to visit bars and restaurants elsewhere. In addition, deserved or not, the area's reputation as one of social and economic deprivation is likely to deter visitors and, therefore, prospective customers. For example, tourists arriving in the downtown area of the city are often encouraged by city residents to avoid what is colloquially known as 'Eastside'.

Cluster 4, consisting of the South Cambie neighbourhood, seemingly lacks some types of restaurant and some particular cuisines, and its ethnically diverse residential population may be receptive to a new and exciting establishment. A well-planned venture focusing on local residents could be a hit, especially among those who would like cuisines not frequently on offer. However, I believe it remains a relatively poor choice because the area is primarily residential, relatively far from the downtown core, and lacks other venues to attract visitors. A new establishment here should not expect visitors from outside the neighbourhood, let alone from outside the city, which reduces the chance of spontaneous walk-in customers. It would therefore have to build a formidable reputation to attract guests from other areas of the city. However, as noted, doing so could give it a large local clientele.

Cluster 1, consisting of the Kensington-Cedar Cottage neighbourhood, is a mid-tier candidate for opening a new restaurant/bar. Geographically, it faces challenges, as it is situated relatively far from the downtown core and its surrounding neighbourhoods, towards the suburbia of the south and east of the city. However, while restaurants are prevalent (7 of the 10 most common venues are restaurants and cafes), they do not totally dominate, suggesting that these establishments can be successful and that there may be space for more. The most common venues analysis also shows a range of other common venues, such as banks and grocery stores, which suggests that the area has activity of its own and functions as more than just a place for people to live before they commute elsewhere.

Cluster 3, the Strathcona neighbourhood, presents several opportunities for a new restaurant/bar, although a venue here will encounter some of the problems characteristic of the Eastside of the city that Cluster 5 also faces. Cluster 3 is directly adjacent to the struggling 'Downtown Eastside' area, and the neighbourhood itself is industrial. However, the area is still bustling, with a reputation for having many famous breweries that attract visitors from all over the city. A restaurant or gastropub could succeed here by tapping into this market and using its proximity to downtown as a draw for visitors.

Cluster 2 appears to contain the best areas to establish a new bar or restaurant. All but one of its neighbourhoods are near the downtown area and are known as places to visit, eat, and drink. The exception is one neighbourhood on the city boundary line, which is close to another city that begins where Vancouver ends. Neighbourhoods such as Downtown, Kitsilano, and Riley Park all encompass bustling areas of economic and social activity that residents and tourists from around the city come to visit based on the reputation of these areas and their venues. As a result, it is unsurprising that all neighbourhoods in the cluster have a good distribution of restaurants and bars and, crucially, other venues that bring customers and visitors. Choosing a neighbourhood such as the West End (MAP ID: WE), which has a high number of venues for a "casual bite" but fewer sit-down restaurants or bars, could prove successful for meeting demand in a busy area. Although the scene in every neighbourhood in this cluster will undoubtedly be very competitive, the distribution of restaurants and bars suggests they can be successful in areas that rely on their reputation as places to explore, eat, and drink to draw customers and visitors.


In terms of the data analysis, having reviewed the code used, there appear to be some issues with how the FourSquare API registers venues and how venue frequency in a given area is determined. For example, many known venues in the city do not appear on this map. More importantly, a minority of the venues returned as most common appear to be incorrect. For example, the results for Cluster 5 list theme park attractions and gun shops as the first and third most common venues in the area. This is demonstrably incorrect: the area has many more cafes and restaurants than gun shops, for instance, and there is only one (though significant) theme park in Vancouver, the PNE. It is not plausible that there are more theme parks here than gas stations, bus stops, offices, and other venues that are considered by the API. Although I do not believe this significantly affected my analysis, it could skew future analyses that rely less on local knowledge and more on the data alone.
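
One possible explanation can be sketched in code. The ranking code is not shown in this section, so the sketch assumes that the "most common venue" ranking is derived from each category's share of the venues returned for a neighbourhood. The records, the neighbourhood labels, and the helper names `category_frequencies` and `most_common` are all hypothetical and are not real FourSquare output. Under that assumption, a neighbourhood with only a handful of returned venues can rank a rare category highly: a single gun shop among five venues has a 20% share, and an attraction registered as several separate venues moves to the top.

```python
from collections import Counter, defaultdict

# Hypothetical (neighbourhood, category) records; NOT real FourSquare output
records = [
    ("HS", "Theme Park Ride / Attraction"),
    ("HS", "Theme Park Ride / Attraction"),
    ("HS", "Pizza Place"),
    ("HS", "Gun Shop"),
    ("HS", "Café"),
    ("DT", "Café"),
    ("DT", "Café"),
    ("DT", "Café"),
    ("DT", "Coffee Shop"),
    ("DT", "Coffee Shop"),
    ("DT", "Bar"),
    ("DT", "Hotel"),
]


def category_frequencies(records):
    """Share of each category among the venues returned per neighbourhood."""
    counts = defaultdict(Counter)
    for hood, category in records:
        counts[hood][category] += 1
    return {
        hood: {cat: n / sum(c.values()) for cat, n in c.items()}
        for hood, c in counts.items()
    }


def most_common(freqs, n=3):
    """Top-n categories by share; ties are broken alphabetically."""
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cat for cat, _ in ordered[:n]]


freq = category_frequencies(records)
top = {hood: most_common(f) for hood, f in freq.items()}

for hood, cats in top.items():
    print(hood, cats)
```

If the ranking works this way, the rankings describe the sample of venues the API returned, not the true venue mix of the neighbourhood, so neighbourhoods with few returned venues deserve the most scepticism.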

# 5. Conclusion

Based on the analysis above, we conclude that Cluster 2 contains the best locations for a prospective new restaurant or bar, owing to its reputation, its proximity to the city's downtown area, and the precedent set by other similar, successful venues.