# Table of contents
* [1. Introduction: Business Problem](#introduction)
* [2. Data](#Data)
    * [2.1 Propose a solution to the business problem using Foursquare data](#solution)
    * [2.2 Collect and process data](#collect)
* [3. Methodology](#Methodology)
    * [3.1 Data preprocessing](#preprocessing)
    * [3.2 Clustering / modeling data](#clustering)
    * [3.3 Convert result to human-readable format](#convert_result)
    * [3.4 Analyze result](#analyze)
    * [3.5 Visualize by Folium](#folium)
* [4. Results and Discussion](#Results)
* [5. Conclusion](#Conclusion)

# 1. Introduction <a name="introduction"></a>

This notebook walks through the steps to find a suitable location for opening a restaurant in Toronto, Canada, based on the Foursquare venue database. 


## Audience

People who are interested in opening a restaurant in Toronto.

## Problem to solve: how to find a suitable place to open my restaurant?

### Things to consider 

Besides the menu, food quality, special recipes, space, and decoration concept, location plays a very important part in your success as a restaurant owner. There is no silver bullet for the question "where is the best place for my restaurant?", of course, but we can give you some insights so that you can choose for yourself. At the end of the day, it all depends on your restaurant concept and your target customers. 

Basically, our location research should cover the following things:

#### a.  Visibility 

The place should be easy to notice and easy to find (both in real life and in location-based mobile applications). In your early days, a popular venue nearby can be very helpful.  

#### b.  Accessibility

It should be easy for people to get there by different means of transportation; your customer base will be bigger if you have this advantage. Even people who own a car sometimes prefer public transportation to avoid parking trouble (or because they are just too lazy to drive). 

#### c. Material Supply

Accessibility also matters on the supply side: a restaurant should be conveniently located for its suppliers (e.g. of raw materials). You don't want to cross half the city to buy basic ingredients in an emergency. 

#### d. Parking lot issue

Very important. No matter how good your food is, if I have to spend more than 15 minutes looking for parking or walk more than 500 m to reach your restaurant, I'll be annoyed. 
A place next to a parking lot is perfect, but a neighboring hotel may also agree to share its parking service with you. 

#### e. Potential customer base nearby

A restaurant that serves dinner (casual or fine dining) should be close to a residential area. A bistro should be close to offices. Crowded places like metro stations, train stations, and bus stops are good spots for fast food. 
In general, more people means more traffic and more chances that someone will visit your restaurant. You definitely don't want to open your restaurant in a cave. 


# 2. Data <a name="Data"></a>
## 2.1 Propose a solution using the Foursquare location database  <a name="solution"></a>

Following our previous research on [Toronto neighborhoods](https://github.com/kembox/Coursera_Capstone/blob/master/toronto_neighborhoods.ipynb), we'll reuse the location data for Toronto boroughs/neighborhoods and feed it to the Foursquare [search API](https://developer.foursquare.com/docs/api/venues/search) with specific [categoryId](https://developer.foursquare.com/docs/resources/categories) values to get the following information for each area (a group of neighborhoods with its own postal code): 

- Visible venues: number of intersections within a 500 m radius of each area. The more the better (more places with foot traffic means more options for picking a spot that attracts customers). 
- Highly accessible venues: number of public transportation stops within a 1 km radius (same reasoning as above: more traffic most likely means more customers). 
- Material supply: a good place should also have markets, supermarkets, grocery stores, gourmet shops and the like within 3 km. The more of those exist around the area we're examining, the more likely we are to choose that location. 
- Parking: number of parking lots and Hotel/Office venues nearby (as we may be able to arrange to share their parking service), within a 200 m radius. Again, the more the better. 
- Customer base in the area: places nearby that you can expect customers from (e.g. Residence, Hotel, Office, Movie Theater, Supermarket, Shopping Mall, College & University).  
 
Based on the data mentioned above, we can run a clustering model to divide Toronto areas into groups according to how suitable they are as a restaurant location. 
A location that satisfies all our factors will most likely be expensive or may have no space left to rent. Grouping the areas well gives us more options that fit our initial budget.

## 2.2 Collect and process data <a name="collect"></a>

#### Install and import the needed libraries

In [1]:
#Install and import the BeautifulSoup library (HTML/XML parsing)
!pip3 --quiet install bs4

#import needed lib
from bs4 import BeautifulSoup
import pandas as pd
pd.set_option('display.max_columns',None)
pd.set_option('display.max_rows',None)
import requests
import numpy as np
from sklearn.cluster import KMeans 

import matplotlib.cm as cm
import matplotlib.colors as colors

from geopy.geocoders import Nominatim

!pip3 --quiet install folium
import folium

import foursquare

#### Download Toronto location data 
The data was collected from Wikipedia and processed using the method explained [here](https://github.com/kembox/Coursera_Capstone/blob/master/toronto_neighborhoods.ipynb).

In [2]:
!wget https://raw.githubusercontent.com/kembox/Coursera_Capstone/master/toronto_data.csv -O toronto_data.csv --quiet

In [3]:
#Load data to a dataframe
toronto_data=pd.read_csv('toronto_data.csv',index_col=None)
toronto_data.head()

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude
0,M4E,East Toronto,The Beaches,43.676357,-79.293031
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188
2,M4L,East Toronto,"The Beaches West, India Bazaar",43.668999,-79.315572
3,M4M,East Toronto,Studio District,43.659526,-79.340923
4,M4N,Central Toronto,Lawrence Park,43.72802,-79.38879


#### Define hard-coded venue category IDs as input for the later venue search

Based on the [Foursquare venue categories](https://developer.foursquare.com/docs/resources/categories), we picked the kinds of venues that we think affect our five key factors (visibility, accessibility, material supply, parking, nearby customers) explained above. 

The hierarchical JSON of venue categories returned by Foursquare is quite complicated to parse for our scattered needs (the venue types we want belong to different groups). 
So I set fixed values here for simplicity. 
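
If you would rather not hard-code the IDs, a nested category tree can be flattened with a short recursive walk. The sketch below assumes each node is a dict with `id`, `name`, and an optional `categories` list of children; that shape is an assumption about the API response, and the sample tree is made up for illustration (only the Intersection ID comes from this notebook).

```python
def flatten_categories(nodes, wanted=None):
    """Walk a nested category tree and return {name: id}.

    Assumes each node is a dict with 'id', 'name' and an optional
    'categories' list of child nodes (an assumed response shape).
    If `wanted` is given, keep only names in that set.
    """
    found = {}
    for node in nodes:
        if wanted is None or node["name"] in wanted:
            found[node["name"]] = node["id"]
        # Recurse into the children, if there are any
        found.update(flatten_categories(node.get("categories", []), wanted))
    return found


# Hypothetical sample tree, for illustration only: every ID except the
# Intersection one is a placeholder.
sample_tree = [
    {"id": "top-1", "name": "Travel & Transport", "categories": [
        {"id": "52f2ab2ebcbc57f1066b8b4c", "name": "Intersection", "categories": []},
        {"id": "child-2", "name": "Bus Stop"},
    ]},
    {"id": "top-2", "name": "Residence"},
]

picked = flatten_categories(sample_tree, wanted={"Intersection", "Residence"})
print(picked)
```

The resulting dict has the same `{name: id}` shape as the hard-coded dictionaries in the next cell, so it could be passed to the same search function.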


In [4]:
#Venues categories

#Visibility
easy_view={'Intersection':'52f2ab2ebcbc57f1066b8b4c'}

#support parking lot
parking={'ParkingLot':'4c38df4de52ce0d596b336e1','Hotel':'4bf58dd8d48988d1fa931735'}

#accessibility
transport={
        'BusStation':'4bf58dd8d48988d1fe931735',
        'BusStop':'52f2ab2ebcbc57f1066b8b4f',
        'MetroStation':'4bf58dd8d48988d1fd931735',
        'LightRailStation':'4bf58dd8d48988d1fc931735',
        'TrainStation':'4bf58dd8d48988d129951735'
        }

#Customer source
customer_source={
        'Office':'4bf58dd8d48988d124941735',
        'Residence':'4e67e38e036454776db1fb3a',
        'ShoppingPlaza':'5744ccdfe4b0c0459246b4dc',
        'PedestrianPlaza':'52e81612bcbc57f1066b7a25',
        'CollegeUniversity':'4d4b7105d754a06372d81259'
        }

#Material supply
supply={
        'Market':'50be8ee891d4fa8dcc7199a7',
        'SuperMarket':'52f2ab2ebcbc57f1066b8b46',
        'Butcher':'4bf58dd8d48988d11d951735',
        'FarmersMarket':'4bf58dd8d48988d1fa941735',
        'FishMarket':'4bf58dd8d48988d10e951735'
        }

####  Function to count venues in the same category nearby each Toronto borough

In [5]:
def VenuesNearby(df,venue_categories,radius=500):
    """Count venues of each category within `radius` metres of every row in df.

    Returns a dataframe with one '<name>NearBy' column per category.
    Relies on the global Foursquare `client` defined in the next cell.
    """
    # Start with an empty dataframe; columns are joined in one by one
    nearby_df=pd.DataFrame([])
    
    for k in venue_categories.keys():
        L=[]
        venue_category=venue_categories[k]
        for row in df.itertuples():
            ll=str(row.Latitude) + ',' +  str(row.Longitude)
            venue=client.venues.search(params={'ll':ll,'categoryId':venue_category,'radius':radius})
            # len() is 0 when nothing is found, so no special case is needed
            L.append(len(venue['venues']))
        if nearby_df.empty:
            nearby_df=pd.DataFrame(L,columns=[k + 'NearBy'])
        else:
            nearby_df=nearby_df.join(pd.DataFrame(L,columns=[k + 'NearBy']))

    return nearby_df

#### Define Foursquare credentials 

In [6]:
######
CLIENT_ID = 'XXX' # your Foursquare ID
CLIENT_SECRET = 'XXX' # your Foursquare Secret
VERSION = '20190528' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET,version=VERSION)

Your credentials:
CLIENT_ID: XXX
CLIENT_SECRET:XXX


#### Grab info about nearby venues and add to our dataframe
Different kinds of venues call for different search radii. 
In our case, we look for:
- Material supplier within 3km 
- Parking lot within 200m
- Intersection within 500m
- Other venues within 1km
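
The radius choices above can be kept in one small lookup so they are not scattered across function calls. This is only a sketch: the names `SEARCH_RADIUS_M`, `DEFAULT_RADIUS_M`, and `radius_for` are hypothetical helpers, not part of the notebook.

```python
# Radius in metres for each factor, following the list above.
SEARCH_RADIUS_M = {
    "supply": 3000,     # material suppliers within 3 km
    "parking": 200,     # parking lots within 200 m
    "easy_view": 500,   # intersections within 500 m
}
DEFAULT_RADIUS_M = 1000  # all other venues within 1 km


def radius_for(factor):
    """Return the search radius for a factor, falling back to the default."""
    return SEARCH_RADIUS_M.get(factor, DEFAULT_RADIUS_M)


for factor in ["supply", "easy_view", "parking", "transport", "customer_source"]:
    print(factor, radius_for(factor))
```

The returned value could then be passed as the `radius` argument of `VenuesNearby`.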


### NOTE: Possible 429 error from Foursquare when calling the functions below
The following function calls can be used to grab the needed data from Foursquare. 

But as we're going to make many requests to the Foursquare API, there's a chance that you'll be blocked by Foursquare's rate limiting.

To avoid that situation, I commented the calls out so they are disabled by default. If you want to run them yourself, first make sure that you haven't made any Foursquare API calls today, then uncomment them. 

We'll proceed with the aggregated data [here](#preprocessing) that I crawled from Foursquare and saved to a JSON file. 
(See [this](https://developer.foursquare.com/docs/api/troubleshooting/rate-limits) for more info about rate limits.)
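
One common way to soften short-lived throttling is to retry with an exponentially growing delay. The sketch below is generic and makes no assumption about which exception the Foursquare client raises on HTTP 429: the caller supplies a predicate, and `FakeRateLimit` is a stand-in used only for the demo. It would not help once a daily quota is exhausted.

```python
import time


def call_with_backoff(func, is_rate_limited, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); if it raises a rate-limit error, wait and retry.

    The delay doubles after each failed attempt (base_delay, 2x, 4x, ...).
    `is_rate_limited` is a caller-supplied predicate on the exception.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception as exc:
            # Re-raise anything that is not a rate-limit error, or when
            # the retry budget is used up.
            if not is_rate_limited(exc) or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


class FakeRateLimit(Exception):
    """Stand-in for whatever exception the client raises on HTTP 429."""


attempts = {"n": 0}
waits = []


def flaky_search():
    """Fake API call that fails twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FakeRateLimit("429")
    return {"venues": []}


result = call_with_backoff(
    flaky_search,
    is_rate_limited=lambda e: isinstance(e, FakeRateLimit),
    sleep=waits.append,  # record the delays instead of actually sleeping
)
print(result, waits)
```

In the notebook, `func` could be a small lambda wrapping the `client.venues.search(...)` call.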

In [7]:
#supply_nearby=VenuesNearby(toronto_data,supply,radius=3000)
#easy_view_nearby=VenuesNearby(toronto_data,easy_view,radius=500)
#parking_nearby=VenuesNearby(toronto_data,parking,radius=200)
#transport_nearby=VenuesNearby(toronto_data,transport,radius=1000)
#customer_source_nearby=VenuesNearby(toronto_data,customer_source,radius=1000)

In [8]:
#places=[supply_nearby,easy_view_nearby,parking_nearby,transport_nearby,customer_source_nearby]
#for i in range(len(places)):
#    toronto_data=toronto_data.join(places[i])

# 3. Methodology <a name="Methodology"></a>

This is the main part of the report: we collect and process the data described above, fit it to a clustering model, and visualize the results so we can use them to answer our business problem.


## 3.1 Data preprocessing  <a name="preprocessing"></a>

#### Download the aggregated data in case you hit the Foursquare rate limit

In [9]:
!wget https://raw.githubusercontent.com/kembox/Coursera_Capstone/master/toronto_resto.json -O toronto_resto.json -q 

In [10]:
resto_data=pd.read_json('toronto_resto.json').reset_index(drop=True)
resto_data.head()

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude,MarketNearBy,SuperMarketNearBy,ButcherNearBy,FarmersMarketNearBy,FishMarketNearBy,IntersectionNearBy,ParkingLotNearBy,HotelNearBy,BusStationNearBy,BusStopNearBy,MetroStationNearBy,LightRailStationNearBy,TrainStationNearBy,OfficeNearBy,ResidenceNearBy,ShoppingPlazaNearBy,PedestrianPlazaNearBy,CollegeUniversityNearBy
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,6,3,8,10,2,0,0,0,10,5,0,3,0,23,20,0,0,5
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,16,20,18,22,9,10,3,0,13,6,3,3,1,27,27,0,0,19
2,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,25,29,28,29,16,0,0,0,7,1,5,0,0,29,30,0,0,21
3,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,27,29,26,27,14,3,1,7,8,10,3,6,0,28,30,0,0,28
4,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,28,30,26,27,12,8,16,22,16,15,11,14,2,30,30,0,1,30


#### Preprocess data

Convert the data in the *NearBy columns to simple binary values (1/0) by comparing each value with the mean of its column. 
We want to find workable options, not only the best one, so "at or above average" is considered good.
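
As a small worked example of this rule, take the `MarketNearBy` counts of the five rows shown in the preview above. Note the assumption: the mean here is computed over those five rows only, whereas the notebook uses the mean of the whole column, so the threshold differs.

```python
from statistics import mean

# MarketNearBy counts for the five rows shown in the preview above.
market_nearby = [6, 16, 25, 27, 28]

# Mean of these five values only (the notebook uses the full column).
threshold = mean(market_nearby)

# 1 if the count is at or above the mean, otherwise 0
flags = [1 if count >= threshold else 0 for count in market_nearby]

print(threshold, flags)
```

Here the threshold is 102 / 5 = 20.4, so the first two rows get 0 and the last three get 1.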


In [11]:
for col in resto_data:
    if 'NearBy' in col:
        # 1 if the count is at or above the column mean, otherwise 0
        resto_data[col]=(resto_data[col] >= resto_data[col].mean()).astype(int)
resto_data.head()

Unnamed: 0,PostCode,Borough,Neighborhood,Latitude,Longitude,MarketNearBy,SuperMarketNearBy,ButcherNearBy,FarmersMarketNearBy,FishMarketNearBy,IntersectionNearBy,ParkingLotNearBy,HotelNearBy,BusStationNearBy,BusStopNearBy,MetroStationNearBy,LightRailStationNearBy,TrainStationNearBy,OfficeNearBy,ResidenceNearBy,ShoppingPlazaNearBy,PedestrianPlazaNearBy,CollegeUniversityNearBy
0,M4E,East Toronto,The Beaches,43.676357,-79.293031,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,M4K,East Toronto,"The Danforth West, Riverdale",43.679557,-79.352188,0,0,0,1,1,1,0,0,0,0,0,0,0,1,0,0,0,0
2,M4W,Downtown Toronto,Rosedale,43.679563,-79.377529,1,1,1,1,1,0,0,0,0,0,1,0,0,1,1,0,0,0
3,M4X,Downtown Toronto,"Cabbagetown, St. James Town",43.667967,-79.367675,1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,1
4,M4Y,Downtown Toronto,Church and Wellesley,43.66586,-79.38316,1,1,1,1,1,1,1,1,1,1,1,1,0,1,1,0,1,1


## 3.2 Clustering / Modeling data <a name="clustering"></a>


Why clustering? 

From our business problem analysis, we already know that the more nearby-venue conditions an area satisfies, the better it is. In our case, though, we shouldn't look at the best place only; we should consider as many options as possible, because the best place may be too expensive or even unavailable. With the areas grouped by their venue data, we can be more flexible in decision making. 

Using the *NearBy features in our dataframe above, let's apply clustering to divide the areas into different groups and examine how appropriate each resulting group is. 

Which algorithm? 

K-means clustering, due to its simplicity and performance. It also lets us control how many groups we divide our data into. I chose to separate our areas into only 3 groups:

- Very good place: can be expensive or unavailable 
- Acceptable place: good for a new player in the industry with a limited budget
- Poor place: doesn't look very promising

In [12]:
#set k=3
kcluster=3
cluster_resto=resto_data.drop(['PostCode','Borough','Neighborhood','Latitude','Longitude'],axis=1)
cluster_resto.head()

#Run k-means clustering
kmeans = KMeans(n_clusters=kcluster,random_state=0).fit(cluster_resto)
kmeans.labels_[0:10]

array([0, 0, 2, 2, 1, 2, 1, 1, 1, 1], dtype=int32)

In [13]:
#Add group data back into last column of  our dataframe
try:
    resto_data.insert(resto_data.shape[1],'ClusterLabels', kmeans.labels_)
except ValueError:
    print("Warn: cannot insert Cluster Labels, already exists. It looks like you're running this twice. Let's proceed")
    
resto_data[['Borough','Neighborhood','ClusterLabels']].head()

Unnamed: 0,Borough,Neighborhood,ClusterLabels
0,East Toronto,The Beaches,0
1,East Toronto,"The Danforth West, Riverdale",0
2,Downtown Toronto,Rosedale,2
3,Downtown Toronto,"Cabbagetown, St. James Town",2
4,Downtown Toronto,Church and Wellesley,1


## 3.3 Convert grouped data into human-readable format <a name="convert_result"></a>

#### Sum up NearBy values to form  scores for 5 main factors: easy_view, parking, transport, customer_source, supply

In [14]:
scores={'easy_view':easy_view,'parking':parking,'transport':transport,'customer_source':customer_source,'supply':supply}
rows=resto_data.shape[0]
cols=resto_data.shape[1]

for row_index in range(rows):
    total=0
    for k,v in scores.items():
        cnt=0
        for NearBy_column_name in v.keys():
            cnt = cnt + resto_data.loc[row_index,NearBy_column_name+'NearBy']
        resto_data.loc[row_index, k + '_score'] = cnt
        total = total + resto_data.loc[row_index, k +'_score']
    resto_data.loc[row_index, 'overall_score']=total
#Copy the slice so later edits to scores_df don't touch resto_data
scores_df=resto_data.loc[:,'ClusterLabels':'overall_score'].copy()
scores_df.head()

Unnamed: 0,ClusterLabels,easy_view_score,parking_score,transport_score,customer_source_score,supply_score,overall_score
0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,1.0,0.0,0.0,1.0,2.0,4.0
2,2,0.0,0.0,1.0,2.0,5.0,8.0
3,2,0.0,0.0,0.0,3.0,5.0,8.0
4,1,1.0,2.0,4.0,4.0,5.0,16.0
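
To make the scoring concrete, the sketch below recomputes the scores of row 1 ("The Danforth West, Riverdale") by hand from its binary flags. The `FACTORS` mapping is a plain-Python restatement of the category dictionaries defined earlier, written out here only so the example is self-contained.

```python
# Factor -> venue types, mirroring the category dictionaries defined earlier.
FACTORS = {
    "easy_view": ["Intersection"],
    "parking": ["ParkingLot", "Hotel"],
    "transport": ["BusStation", "BusStop", "MetroStation",
                  "LightRailStation", "TrainStation"],
    "customer_source": ["Office", "Residence", "ShoppingPlaza",
                        "PedestrianPlaza", "CollegeUniversity"],
    "supply": ["Market", "SuperMarket", "Butcher",
               "FarmersMarket", "FishMarket"],
}

# Binary flags of row 1 after preprocessing; venue types not listed are 0.
row_flags = {"FarmersMarket": 1, "FishMarket": 1, "Intersection": 1, "Office": 1}

# Each factor score is the number of its venue types flagged 1.
scores = {
    factor: sum(row_flags.get(venue, 0) for venue in venues)
    for factor, venues in FACTORS.items()
}
# The overall score is the sum of the five factor scores.
scores["overall"] = sum(scores.values())
print(scores)
```

This reproduces the second row of the table above: 1, 0, 0, 1, 2 for the five factors and 4 overall.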


### Label encoding for score values

In [15]:
for col in scores_df:
    if '_score' in col and 'overall' not in col:
        scores_df[col]=scores_df[col].apply(lambda x: 'GOOD' if x >= resto_data[col].mean() else 'BAD')
        

In [16]:
#### Overview cluster label 0
scores_df.set_index('ClusterLabels').loc[0].head(3)

Unnamed: 0_level_0,easy_view_score,parking_score,transport_score,customer_source_score,supply_score,overall_score
ClusterLabels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,BAD,BAD,BAD,BAD,BAD,0.0
0,GOOD,BAD,BAD,BAD,BAD,4.0
0,BAD,BAD,BAD,BAD,BAD,1.0


In [17]:
#### Overview cluster label 1
scores_df.set_index('ClusterLabels').loc[1].head(3)

Unnamed: 0_level_0,easy_view_score,parking_score,transport_score,customer_source_score,supply_score,overall_score
ClusterLabels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,GOOD,GOOD,GOOD,GOOD,GOOD,16.0
1,GOOD,GOOD,GOOD,GOOD,GOOD,17.0
1,GOOD,GOOD,GOOD,GOOD,GOOD,16.0


In [18]:
#### Overview cluster label 2
scores_df.set_index('ClusterLabels').loc[2].head(3)

Unnamed: 0_level_0,easy_view_score,parking_score,transport_score,customer_source_score,supply_score,overall_score
ClusterLabels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2,BAD,BAD,BAD,BAD,GOOD,8.0
2,BAD,BAD,BAD,GOOD,GOOD,8.0
2,GOOD,BAD,GOOD,GOOD,GOOD,12.0


## 3.4 Analyze the result <a name="analyze"></a>

#### Which ClusterLabel / group exactly gives us the best result? 

As we can see, the K-means clustering algorithm divided our areas into groups that follow our predefined conditions well. Label encoding helps us notice at first glance which group is better. 
But let's calculate the mean overall score of each group to see which one is the best, and where exactly the best locations in Toronto are. 
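
The ranking idea can be checked by hand on a small sample. The sketch below uses only the nine `(cluster, overall_score)` pairs visible in the three-row previews above, so its means are not the full-dataset means, but the ordering logic is the same.

```python
from statistics import mean

# (cluster label, overall_score) pairs taken from the three-row previews above.
sample = [(0, 0.0), (0, 4.0), (0, 1.0),
          (1, 16.0), (1, 17.0), (1, 16.0),
          (2, 8.0), (2, 8.0), (2, 12.0)]

# Group the scores by cluster label.
by_cluster = {}
for label, score in sample:
    by_cluster.setdefault(label, []).append(score)

# Mean overall score per cluster, then sort the labels from worst to best.
cluster_means = {label: mean(values) for label, values in by_cluster.items()}
ranking = sorted(cluster_means, key=cluster_means.get)
print(ranking)
```

K-means label numbers carry no order of their own, which is why the clusters are ranked by their scores rather than by their labels.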

In [19]:
clusterD={}
for x in range(kcluster):
    key='cluster' + str(x)
    clusterD[key]=scores_df.set_index('ClusterLabels').loc[x][['overall_score']].mean().values[0]
    
sorted_cluster_list=sorted(clusterD.items(), key=lambda x: x[1])
print("Cluster labels ordered by suitability for our restaurant, from worst to best:\n")
for x in sorted_cluster_list:
    print(x[0]) 

Cluster labels ordered by suitability for our restaurant, from worst to best:

cluster0
cluster2
cluster1


#### Let's see where the best group is located

In [20]:
resto_data.set_index('ClusterLabels').loc[1][['Borough','Neighborhood','overall_score']]

Unnamed: 0_level_0,Borough,Neighborhood,overall_score
ClusterLabels,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,Downtown Toronto,Church and Wellesley,16.0
1,Downtown Toronto,"Ryerson, Garden District",17.0
1,Downtown Toronto,St. James Town,16.0
1,Downtown Toronto,Berczy Park,15.0
1,Downtown Toronto,Central Bay Street,16.0
1,Downtown Toronto,"Adelaide, King, Richmond",16.0
1,Downtown Toronto,"Harbourfront East, Toronto Islands, Union Station",15.0
1,Downtown Toronto,"Design Exchange, Toronto Dominion Centre",16.0
1,Downtown Toronto,"Commerce Court, Victoria Hotel",17.0
1,Downtown Toronto,Stn A PO Boxes 25 The Esplanade,16.0


### So, unsurprisingly, the best place to open a restaurant is in Downtown Toronto

## 3.5 Visualize with Folium <a name="folium"></a>

Let's visualize the clustered areas above with Folium.

In [21]:
#create map 
address = 'Toronto, Canada'

geolocator = Nominatim(user_agent="toronto_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))
map_clusters = folium.Map(location=[latitude,longitude],zoom_start=11)

#set color scheme for the cluster
x = np.arange(kcluster)

ys= [ i + x + (i*x)**2 for i in range(kcluster)]


colors_array=cm.rainbow(np.linspace(0,1,len(ys)))
rainbow= [colors.rgb2hex(i) for i in colors_array]

#add markers to the map 
#cluster-1 wraps around for cluster 0 (index -1), so each cluster still gets its own color
for lat,lon,poi,cluster in zip(resto_data['Latitude'],resto_data['Longitude'],resto_data['Neighborhood'],resto_data['ClusterLabels']):
    label = folium.Popup(str(poi) + ' Cluster ' +str(cluster),parse_html=True)
    folium.CircleMarker(
        [lat,lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters



The geographical coordinates of Toronto are 43.653963, -79.387207.


# 4. Results and Discussion <a name="Results"></a>

We can easily see from the Folium map above that Downtown Toronto is the most convenient place to open a restaurant according to our 5 key factors. It is the area with the most convenient transportation, many easy-to-spot places, the biggest customer source, and convenient parking facilities, and it is surrounded by enough material supply for your business. 

However, it all depends on what kind of restaurant you want to open and what your budget is. 
For this busy, dynamic part of the city, fast food, bars, bistros, fast casual, food trucks, concession stands, or delivery-only restaurants should be good and safe choices. 

Fine dining, casual dining, or family restaurants should be opened in our "cluster2" area (the green one on our map, for example the old Toronto area), as it still has convenient facilities and enough customer sources but is not too crowded and hectic, so people have time to enjoy a slow meal. 

In comparison, the far West and Central Toronto lack the advantages of our 5 factors, so they shouldn't be our first choice. You can still open a restaurant there, but you should start small and grow slowly. 

# 5. Conclusion <a name="Conclusion"></a>

By combining data science skills with the very useful Foursquare data, we can solve a very practical problem. 
In this notebook, our research was conducted without any local knowledge of Toronto, yet we could still draft a pretty nice plan for a business problem. 

Even though our dataset is very small and we chose the K-means clustering model mostly for its simplicity, the results are promising and answer our business question effectively. With more data and more time spent choosing and tuning the model, we should expect even more meaningful results. 

Back to our practical question: "How to choose a suitable place to open a restaurant in Toronto?" I believe the answer correctly addresses the problem. In short, based on the location data alone, Downtown Toronto is the best place to open a restaurant. But you can also match the data we provide against your specific restaurant concept and budget to choose the best solution. 

Thank you for using this. Enjoy your location data.
