# Capstone Notebook

This notebook contains all work for the Applied Data Science Capstone course.

In [1]:
import pandas as pd
import numpy as np
import requests
import random

!pip install geopy
from geopy.geocoders import Nominatim

from pandas import json_normalize

!pip install folium
import folium



In [2]:
print("Hello Capstone Project Course!")

Hello Capstone Project Course!


## Introduction/Business Problem
If you are an entrepreneur looking to open a chain of cafes in Canada, you will face tough competition in a potentially saturated market. A key problem to solve is therefore:

**Where is the best place to open up your multiple Cafes?**

Solving this problem tells us which areas are already saturated. Opening a cafe in a non-saturated area may give a new business the best opportunity to thrive without heavy competition. This is a difficult question to answer, and it raises a second one:

**How do we measure/determine which is the best place to start a Cafe?**

For a given number n of stores you wish to open, how do you distribute them to get the greatest coverage and exposure in a city with an established market? Traditionally, one might look for places with high foot traffic, but that data is extremely difficult to come by. The locations of existing venues, however, are available, and they tell us several things:

*   Which locations already have a steady population of customers who drink coffee. This is inferred from the density of venues in a given location.
*   The customer base and demographics of an area. Places with many cafes are likely to be business districts or areas of high traffic for sales.

Data science and exploration of location data will be key to finding the solution.

## Data
To solve the problem, we will need the following data:


*   List of neighbourhoods in Toronto, Canada. This limits the scope of the project to the city of Toronto. This data can be sourced by web scraping Wikipedia pages or from a location data service.
*   Latitude and longitude coordinates of these neighbourhoods, for visualisation, clustering and other purposes. This data can be sourced from location data services.
*   Venue data for finding neighbourhoods that are saturated with cafes. This data can be sourced from Foursquare's API.

Together, these data provide insight into the saturation of venues in specific neighbourhoods and into locations with potential vacancies.



# Methodology
### List of neighbourhoods in Toronto


In [3]:
df=pd.read_html("https://en.wikipedia.org/w/index.php?title=List_of_postal_codes_of_Canada:_M&oldid=945633050")[0]

In [4]:
df.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [5]:
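# Drop rows without a borough; .copy() avoids assigning to a view of the original frame
df = df[df.Borough != 'Not assigned'].copy()
# Where a neighbourhood is not assigned, fall back to the borough name
df['Neighbourhood'] = df['Neighbourhood'].where(df['Neighbourhood'] != 'Not assigned', df['Borough'])
df.head()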

Unnamed: 0,Postcode,Borough,Neighbourhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront
5,M6A,North York,Lawrence Heights
6,M6A,North York,Lawrence Manor
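
The cleaning step above can be tried on a toy frame. This is only a sketch: the `sample` and `cleaned` names are introduced here for illustration, and the third row is hypothetical, added to show the fallback from neighbourhood to borough.

```python
import pandas as pd

# Toy rows; the "M9Z" row is hypothetical and only illustrates the fallback.
sample = pd.DataFrame({
    "Postcode": ["M1A", "M3A", "M9Z"],
    "Borough": ["Not assigned", "North York", "Example Borough"],
    "Neighbourhood": ["Not assigned", "Parkwoods", "Not assigned"],
})

# Drop rows without a borough; .copy() gives an independent frame to modify.
cleaned = sample[sample["Borough"] != "Not assigned"].copy()

# Where the neighbourhood is missing, fall back to the borough name.
missing = cleaned["Neighbourhood"] == "Not assigned"
cleaned.loc[missing, "Neighbourhood"] = cleaned.loc[missing, "Borough"]

print(cleaned)
```

Using a boolean mask with `.loc` makes it explicit which rows are changed, which is easier to check than a blanket replacement.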


Geospatial data:

In [6]:
!wget -q -O 'geospatial_data.csv' http://cocl.us/Geospatial_data
print('Data downloaded!')

Data downloaded!


In [7]:
df_geo = pd.read_csv('geospatial_data.csv')
df_geo.columns

Index(['Postal Code', 'Latitude', 'Longitude'], dtype='object')

In [11]:
df_grouped = df.groupby(['Postcode','Borough'], as_index=False, sort=False).agg(', '.join)
df_grouped.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Harbourfront
3,M6A,North York,"Lawrence Heights, Lawrence Manor"
4,M7A,Downtown Toronto,Queen's Park


In [12]:
df_geo = df_geo.rename({'Postal Code':'Postcode'}, axis=1)

In [13]:
# Join on the shared postcode column (stated explicitly rather than inferred)
df2 = pd.merge(df_grouped, df_geo, on='Postcode')

In [14]:
df2.head()

Unnamed: 0,Postcode,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,Harbourfront,43.65426,-79.360636
3,M6A,North York,"Lawrence Heights, Lawrence Manor",43.718518,-79.464763
4,M7A,Downtown Toronto,Queen's Park,43.662301,-79.389494
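
A merge like the one above silently drops postcodes that have no coordinates. As a hedged sketch, the join can be checked with the `how`, `validate` and `indicator` options of `pd.merge`. The frames below are small stand-ins built from the rows shown above, with the coordinates for M5A deliberately left out.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "Postcode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighbourhood": ["Parkwoods", "Victoria Village", "Harbourfront"],
})
# Coordinates for M5A are omitted on purpose to show an unmatched row.
coordinates = pd.DataFrame({
    "Postcode": ["M3A", "M4A"],
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# how="left" keeps every neighbourhood; validate raises if a postcode
# is duplicated on either side; indicator adds a "_merge" column.
merged = pd.merge(neighbourhoods, coordinates, on="Postcode",
                  how="left", validate="one_to_one", indicator=True)

# Postcodes that found no coordinates
unmatched = merged.loc[merged["_merge"] == "left_only", "Postcode"].tolist()
print(unmatched)
```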


### Venue Data


In [15]:
# Foursquare credentials
# Read from environment variables so that secrets are never stored in or printed by the notebook.
# The variable names below are an assumed convention; use whichever names you set locally.
import os

CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604' # Foursquare API version date
LIMIT = 30 # not used by getNearbyVenues below, which requests up to 100 venues per call
print('Credentials loaded.')

Credentials loaded.


In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    """Return a DataFrame of Foursquare venues within `radius` metres of each neighbourhood."""
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            limit)

        # make the GET request and keep only the list of recommended items
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighbourhood lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                  'Neighborhood Latitude',
                  'Neighborhood Longitude',
                  'Venue',
                  'Venue Latitude',
                  'Venue Longitude',
                  'Venue Category']

    print('Found {} venues in {} neighborhoods.'.format(nearby_venues.shape[0], len(venues_list)))

    return nearby_venues

In [17]:
venues = getNearbyVenues(names=df2['Neighbourhood'],
                         latitudes=df2['Latitude'],
                         longitudes=df2['Longitude'])

Found 2163 venues in 103 neighborhoods.
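
The response parsing inside `getNearbyVenues` can be exercised without calling the API. The sketch below is an illustration, not part of the original notebook: `extract_venues` is a hypothetical helper, and `mock_payload` is a hand-written dictionary that only mimics the fields the notebook reads. It also tolerates a missing group or an empty category list, which the notebook's version assumes never happens.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows like those built by getNearbyVenues."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Use None when a venue has no category instead of raising IndexError.
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload; the second venue is hypothetical.
mock_payload = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Tim Hortons",
                           "location": {"lat": 43.725517, "lng": -79.313103},
                           "categories": [{"name": "Coffee Shop"}]}},
                {"venue": {"name": "Uncategorised Example",
                           "location": {"lat": 43.7250, "lng": -79.3140},
                           "categories": []}},
            ]
        }]
    }
}

rows = extract_venues(mock_payload, "Victoria Village", 43.725882, -79.315572)
for row in rows:
    print(row)
```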


In [20]:
print(venues.shape)
venues.head()

(2163, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parkwoods,43.753259,-79.329656,Brookbanks Park,43.751976,-79.33214,Park
1,Parkwoods,43.753259,-79.329656,TTC stop #8380,43.752672,-79.326351,Bus Stop
2,Parkwoods,43.753259,-79.329656,Variety Store,43.751974,-79.333114,Food & Drink Shop
3,Parkwoods,43.753259,-79.329656,TTC stop - 44 Valley Woods,43.755402,-79.333741,Bus Stop
4,Victoria Village,43.725882,-79.315572,Victoria Village Arena,43.723481,-79.315635,Hockey Arena


In [23]:
venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
"Adelaide, King, Richmond",97,97,97,97,97,97
Agincourt,4,4,4,4,4,4
"Agincourt North, L'Amoreaux East, Milliken, Steeles East",3,3,3,3,3,3
"Albion Gardens, Beaumond Heights, Humbergate, Jamestown, Mount Olive, Silverstone, South Steeles, Thistletown",9,9,9,9,9,9
"Alderwood, Long Branch",7,7,7,7,7,7
...,...,...,...,...,...,...
Willowdale West,7,7,7,7,7,7
Woburn,4,4,4,4,4,4
"Woodbine Gardens, Parkview Hill",12,12,12,12,12,12
Woodbine Heights,7,7,7,7,7,7


# Exploring the Foursquare Data
We can use the 'Venue Category' field in the data to find out how our venue of interest is labelled:

In [24]:
venues['Venue Category'].unique()

array(['Park', 'Bus Stop', 'Food & Drink Shop', 'Hockey Arena',
       'Portuguese Restaurant', 'Coffee Shop', 'Pizza Place', 'Bakery',
       'Distribution Center', 'Spa', 'Restaurant', 'Pub',
       'Breakfast Spot', 'Gym / Fitness Center', 'Historic Site',
       'Farmers Market', 'Dessert Shop', 'Chocolate Shop',
       'Performing Arts Venue', 'French Restaurant', 'Mexican Restaurant',
       'Café', 'Yoga Studio', 'Theater', 'Event Space',
       'Asian Restaurant', 'Shoe Store', 'Ice Cream Shop',
       'Electronics Store', 'Art Gallery', 'Cosmetics Shop', 'Bank',
       'Beer Store', 'Health Food Store', 'Antique Shop', 'Boutique',
       'Furniture / Home Store', 'Vietnamese Restaurant',
       'Clothing Store', 'Accessories Store', "Women's Store",
       'Miscellaneous Shop', 'Italian Restaurant', 'Creperie',
       'Sushi Restaurant', 'Arts & Crafts Store', 'Burrito Place',
       'Beer Bar', 'Hobby Shop', 'Diner', 'Fried Chicken Joint',
       'Smoothie Shop', 'Sandwich Pl

The category list contains both "Coffee Shop" and "Café". This analysis uses the "Coffee Shop" category, which covers 184 venues in the Foursquare data.

In [31]:
venues[venues['Venue Category']=="Coffee Shop"].shape

(184, 7)
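
The category list above also contains a separate 'Café' label. If both labels should count as competitors, one option is to filter with `isin`. This is a sketch on made-up rows (the "Example Café" venue is hypothetical); the notebook itself keeps only "Coffee Shop".

```python
import pandas as pd

# Toy venue rows; "Example Café" is hypothetical.
sample_venues = pd.DataFrame({
    "Venue": ["Tim Hortons", "Tandem Coffee", "Example Café", "Brookbanks Park"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Café", "Park"],
})

# Categories treated as cafe-like competitors (an assumption of this sketch).
cafe_like = ["Coffee Shop", "Café"]

mask = sample_venues["Venue Category"].isin(cafe_like)
counts = sample_venues.loc[mask, "Venue Category"].value_counts()
print(counts)
```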

Let's visualise the locations on the map:

In [34]:
cafe_data = venues[venues['Venue Category']=="Coffee Shop"]
cafe_data.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
6,Victoria Village,43.725882,-79.315572,Tim Hortons,43.725517,-79.313103,Coffee Shop
9,Harbourfront,43.65426,-79.360636,Tandem Coffee,43.653559,-79.361809,Coffee Shop
21,Harbourfront,43.65426,-79.360636,Arvo,43.649963,-79.361442,Coffee Shop
23,Harbourfront,43.65426,-79.360636,Rooster Coffee,43.6519,-79.365609,Coffee Shop
25,Harbourfront,43.65426,-79.360636,Starbucks,43.651613,-79.364917,Coffee Shop


In [35]:
# Approximate coordinates of central Toronto, used to centre the map
latitude = 43.6532
longitude = -79.3832

In [36]:
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=11)

# add one marker per coffee shop to the map
for lat, lng, label in zip(cafe_data['Venue Latitude'], cafe_data['Venue Longitude'], cafe_data['Venue']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

# Clustering

Clustering is a way of defining the areas of competition that each store in the chain needs to address. By specifying the number of stores n, we can find the region in which each store will be competing.

We will use k-means clustering.
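
k-means measures plain Euclidean distance, and at Toronto's latitude one degree of longitude covers less ground than one degree of latitude. The sketch below is an optional refinement that the notebook does not use: a simple equirectangular projection to approximate kilometres. The `to_local_km` helper and the Earth-radius constant are assumptions made for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def to_local_km(lat, lon, ref_lat, ref_lon):
    """Equirectangular projection of (lat, lon) to x/y kilometres around a reference point."""
    x = math.radians(lon - ref_lon) * math.cos(math.radians(ref_lat)) * EARTH_RADIUS_KM
    y = math.radians(lat - ref_lat) * EARTH_RADIUS_KM
    return x, y


ref_lat, ref_lon = 43.6532, -79.3832  # map centre used in this notebook

# Compare a step of 0.01 degrees east with a step of 0.01 degrees north.
east_x, east_y = to_local_km(ref_lat, ref_lon + 0.01, ref_lat, ref_lon)
north_x, north_y = to_local_km(ref_lat + 0.01, ref_lon, ref_lat, ref_lon)

print(f"0.01 deg east  is about {east_x:.2f} km")
print(f"0.01 deg north is about {north_y:.2f} km")
```

Clustering the projected x/y values instead of raw degrees would weight east-west and north-south distances equally, although over an area the size of one city the effect on the clusters may be small.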

In [47]:
# import k-means from scikit-learn
from sklearn.cluster import KMeans

# Define K
K = 5

In [52]:
X = np.array(cafe_data[['Venue Latitude','Venue Longitude']])
X

array([[ 43.72551663, -79.31310251],
       [ 43.65355871, -79.36180946],
       [ 43.6499628 , -79.36144178],
       [ 43.65189966, -79.36560912],
       [ 43.651613  , -79.364917  ],
       [ 43.65308058, -79.35707786],
       [ 43.65813541, -79.35951549],
       [ 43.65682097, -79.35896984],
       [ 43.7194274 , -79.4679949 ],
       [ 43.66014   , -79.38587   ],
       [ 43.66076299, -79.38618409],
       [ 43.65945605, -79.39041123],
       [ 43.662407  , -79.385943  ],
       [ 43.658204  , -79.388998  ],
       [ 43.6610382 , -79.3937966 ],
       [ 43.660887  , -79.39372   ],
       [ 43.6594149 , -79.3912214 ],
       [ 43.658175  , -79.3906813 ],
       [ 43.6589062 , -79.3886961 ],
       [ 43.65723   , -79.38087   ],
       [ 43.65883297, -79.38368352],
       [ 43.65785441, -79.37919981],
       [ 43.65602746, -79.38057492],
       [ 43.65596912, -79.38268427],
       [ 43.65446529, -79.37891894],
       [ 43.65464868, -79.38057377],
       [ 43.659509  , -79.382132  ],
 

In [53]:
kmeans = KMeans(n_clusters=K, random_state=42).fit(X)
kmeans.labels_

array([3, 2, 2, 2, 2, 2, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 3, 0, 0,
       0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 0, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 3, 3, 3, 3, 3, 4, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 3, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 4, 4, 4, 4, 1,
       4, 4, 4, 2, 2, 2, 1, 1, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)

In [60]:
kmeans.cluster_centers_

array([[ 43.69111624, -79.34990758],
       [ 43.64496766, -79.54348292],
       [ 43.65240874, -79.38365094],
       [ 43.76762737, -79.30697082],
       [ 43.74640736, -79.4318879 ]])
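
Given these centres, a candidate site can be assigned to the cluster whose centre is closest. The sketch below uses only the standard library and the centre coordinates printed above; `nearest_cluster` is a hypothetical helper name. On a fitted model, `kmeans.predict` performs the same nearest-centre assignment.

```python
import math

# Cluster centres (latitude, longitude) copied from kmeans.cluster_centers_ above.
centres = [
    (43.69111624, -79.34990758),
    (43.64496766, -79.54348292),
    (43.65240874, -79.38365094),
    (43.76762737, -79.30697082),
    (43.74640736, -79.4318879),
]


def nearest_cluster(lat, lon):
    """Return the index of the centre closest to (lat, lon) in plain Euclidean distance."""
    return min(range(len(centres)), key=lambda i: math.dist((lat, lon), centres[i]))


# First row of X: the Tim Hortons in Victoria Village.
first_shop = nearest_cluster(43.72551663, -79.31310251)
print(first_shop)
```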

In [63]:
center_lat = kmeans.cluster_centers_[:,0]
center_lon = kmeans.cluster_centers_[:,1]

We can plot our clusters on the folium map.


In [83]:
map_toronto = folium.Map(location=[latitude,longitude], zoom_start=11)
colours = ['red','black','green','blue','purple','orange','black','white']

# add markers to map, coloured by cluster label
for lat, lng, label, col in zip(cafe_data['Venue Latitude'], cafe_data['Venue Longitude'], cafe_data['Venue'], kmeans.labels_):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=colours[col],
        fill=True,
        fill_color='#FFFFFF',
        fill_opacity=0.7).add_to(map_toronto)

# add cluster centres, labelled with their cluster index
for i, (lat, lng) in enumerate(zip(center_lat, center_lon)):
    label = folium.Popup(str(i), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=10,
        popup=label,
        color='white',
        fill=True,
        fill_color='#FFFFFF',
        fill_opacity=0.7).add_to(map_toronto)

map_toronto

In [67]:
# Import Counter to count how many stores are in each cluster:
from collections import Counter
competition = Counter(kmeans.labels_)
print(competition)

Counter({2: 139, 0: 16, 4: 11, 3: 10, 1: 8})


In [86]:
# Report each cluster centre together with the number of existing coffee shops in that cluster
for i in range(K):
    print(f"Center of cluster {i} is ", kmeans.cluster_centers_[i], " and has ", competition[i], " stores to compete with")

Center of cluster 0 is  [ 43.69111624 -79.34990758]  and has  16  stores to compete with
Center of cluster 1 is  [ 43.64496766 -79.54348292]  and has  8  stores to compete with
Center of cluster 2 is  [ 43.65240874 -79.38365094]  and has  139  stores to compete with
Center of cluster 3 is  [ 43.76762737 -79.30697082]  and has  10  stores to compete with
Center of cluster 4 is  [ 43.74640736 -79.4318879 ]  and has  11  stores to compete with
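
To compare clusters at a glance, the counts above can be ranked from least to most crowded. This is a small sketch using the counts reported by the notebook; the `ranked` and `total` names are introduced here for illustration.

```python
from collections import Counter

# Coffee shop counts per cluster, as reported above.
competition = Counter({2: 139, 0: 16, 4: 11, 3: 10, 1: 8})

total = sum(competition.values())

# Sort clusters by number of existing coffee shops, least crowded first.
ranked = sorted(competition.items(), key=lambda pair: pair[1])

for cluster, count in ranked:
    print(f"cluster {cluster}: {count} coffee shops ({count / total:.1%} of all)")
```

A low count means fewer direct competitors, but it may also mean lower demand, so the ranking is a starting point rather than a recommendation.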
