# Where to open another branch of your restaurant in Toronto
## Introduction/Business Problem
Suppose you are a successful entrepreneur in a city, in this case Toronto, who owns a well-run restaurant. A natural next step is to open a second location. By clustering neighborhoods, the entrepreneur can find spots similar to the current one, where the success of the existing business might be replicated.
## Data Section
I will continue working with the data presented in previous weeks:  
1. Borough and neighborhood data scraped from https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M. This gives a basic overview of the different neighborhoods.
2. Latitude and longitude provided in the file “Geospatial_Coordinates.csv”. This gives the coordinates of each neighborhood.
3. Venue information acquired from Foursquare, so that neighborhoods with similar types of businesses can be clustered together.

## Data Processing
### Geographic data

In [1]:
# Scrape the postal code table from a saved copy of the wiki page
from bs4 import BeautifulSoup
import requests
import pandas as pd

with open('wikipage.htm') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# Collect the cell text of every table row; rows without <td> cells (the header) are skipped
rows = []
for tr in soup.find('div', class_='mw-parser-output').table.tbody.find_all('tr'):
    cells = [td.text.strip() for td in tr.find_all('td')]
    if cells:
        rows.append(cells)
df = pd.DataFrame(rows, columns=['Postcode', 'Borough', 'Neighbourhood'])

# Dealing with missing data: drop rows without a borough, and fall back to
# the borough name where the neighbourhood is not assigned
df_bo = df[df['Borough'] != 'Not assigned'].copy()  # copy so the assignment below is safe
index = (df_bo['Neighbourhood'] == 'Not assigned')
df_bo.loc[index, 'Neighbourhood'] = df_bo.loc[index, 'Borough']

# Combine neighbourhoods that share a postcode into one comma-separated string
df_nei = df_bo.groupby(['Postcode', 'Borough']).Neighbourhood.apply(lambda x: ",".join(x)).to_frame()

# Final data: attach latitude and longitude to each postcode
df_canada = df_nei.reset_index()
data = pd.read_csv("Geospatial_Coordinates.csv")
data.rename(columns={'Postal Code': 'Postcode'}, inplace=True)
df_merge = pd.merge(df_canada, data, how='left', on='Postcode')

### Venues data

In [2]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim 
!conda install -c conda-forge folium=0.5.0 --yes
import folium 

#foursquare api
import random

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 

# transforming a JSON file into a pandas dataframe
from pandas import json_normalize

#Define Foursquare Credentials and Version
CLIENT_ID = '2USAT4JF3HVHMVFO4IRPNZHPUHNVYEM2BC3LY1OUHMZYNZPU' # your Foursquare ID
CLIENT_SECRET = 'VTKBIKZJDYXLRZZEYGWT54KNDB1K3MUBN5EKTAOQ5LEV0OWJ' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 30

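# Get the geographical coordinates of Toronto
address = 'Toronto,Canada'
geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
LIMIT = 100  # overrides the earlier value of 30; this is the limit used in the API calls
radius = 500  # search radius in metres around each neighbourhood centre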




In [9]:
# define the function to get nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [10]:
# get nearby venues
df_toronto=df_merge
toronto_venues = getNearbyVenues(names=df_toronto['Neighbourhood'],
                                   latitudes=df_toronto['Latitude'],
                                   longitudes=df_toronto['Longitude'])

In [21]:
toronto_venues.groupby("Venue Category").count().sort_values('Neighbourhood',ascending=False).head(10)

| Venue Category | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude |
|---|---|---|---|---|---|---|
| Coffee Shop | 182 | 182 | 182 | 182 | 182 | 182 |
| Café | 100 | 100 | 100 | 100 | 100 | 100 |
| Restaurant | 60 | 60 | 60 | 60 | 60 | 60 |
| Park | 56 | 56 | 56 | 56 | 56 | 56 |
| Bakery | 55 | 55 | 55 | 55 | 55 | 55 |
| Pizza Place | 53 | 53 | 53 | 53 | 53 | 53 |
| Italian Restaurant | 53 | 53 | 53 | 53 | 53 | 53 |
| Bar | 44 | 44 | 44 | 44 | 44 | 44 |
| Sandwich Place | 42 | 42 | 42 | 42 | 42 | 42 |
| Hotel | 42 | 42 | 42 | 42 | 42 | 42 |


In [23]:
toronto_venues.groupby("Venue Category").count().sort_values('Neighbourhood',ascending=True).head(15)

| Venue Category | Neighbourhood | Neighbourhood Latitude | Neighbourhood Longitude | Venue | Venue Latitude | Venue Longitude |
|---|---|---|---|---|---|---|
| Accessories Store | 1 | 1 | 1 | 1 | 1 | 1 |
| Luggage Store | 1 | 1 | 1 | 1 | 1 | 1 |
| Indonesian Restaurant | 1 | 1 | 1 | 1 | 1 | 1 |
| Indie Movie Theater | 1 | 1 | 1 | 1 | 1 | 1 |
| Hotpot Restaurant | 1 | 1 | 1 | 1 | 1 | 1 |
| Hospital | 1 | 1 | 1 | 1 | 1 | 1 |
| Historic Site | 1 | 1 | 1 | 1 | 1 | 1 |
| Health & Beauty Service | 1 | 1 | 1 | 1 | 1 | 1 |
| Hardware Store | 1 | 1 | 1 | 1 | 1 | 1 |
| Harbor / Marina | 1 | 1 | 1 | 1 | 1 | 1 |


The output above gives a clearer view of Toronto's venues: among the different cuisines, pizza and Italian are the most popular (53 venues each for Pizza Place and Italian Restaurant), while German and Indonesian restaurants are less common.
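The cuisine comparison can be made explicit by counting only the categories that look like cuisines. The following is a minimal sketch, not part of the original notebook: the `sample_venues` rows are invented stand-ins for `toronto_venues`, and `cuisine_counts` with its keyword filter is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for toronto_venues; these rows are illustrative, not real data.
sample_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "B", "B", "C", "C"],
    "Venue Category": [
        "Pizza Place", "Coffee Shop", "Italian Restaurant",
        "Pizza Place", "Indonesian Restaurant", "Park",
    ],
})

def cuisine_counts(venues, keywords=("Restaurant", "Pizza")):
    """Count venues per category, keeping only categories matching a keyword."""
    pattern = "|".join(keywords)  # assumption: cuisines are identified by these keywords
    is_cuisine = venues["Venue Category"].str.contains(pattern, regex=True)
    return venues.loc[is_cuisine, "Venue Category"].value_counts()

counts = cuisine_counts(sample_venues)
print(counts)
```

A keyword filter is crude (it would miss categories such as "Steakhouse"), so the keyword list would need tuning against the real category names.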

## Model Building
### k-means clustering

In [34]:
# one hot encoding
toronto_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
toronto_onehot = pd.concat([toronto_venues['Neighbourhood'], toronto_onehot], axis=1)
toronto_grouped = toronto_onehot.groupby('Neighbourhood').mean().reset_index()

# return the most frequent venue categories for one neighbourhood row
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the 'Neighbourhood' column
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

# display the top 10 venues for each neighborhood
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
import numpy as np
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 3rd, every ordinal up to 10th takes the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = toronto_grouped['Neighbourhood']

for ind in np.arange(toronto_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Neighbourhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelaide,King,Richmond | Coffee Shop | Café | Steakhouse | Bar | Thai Restaurant | Hotel | Bakery | Cosmetics Shop | Sushi Restaurant | Burger Joint |
| 1 | Agincourt | Lounge | Clothing Store | Latin American Restaurant | Breakfast Spot | Yoga Studio | Dumpling Restaurant | Discount Store | Dog Run | Doner Restaurant | Donut Shop |
| 2 | Agincourt North,L'Amoreaux East,Milliken,Steel... | Playground | Park | Drugstore | Dim Sum Restaurant | Diner | Discount Store | Dog Run | Doner Restaurant | Donut Shop | Dumpling Restaurant |
| 3 | Albion Gardens,Beaumond Heights,Humbergate,Jam... | Pizza Place | Fried Chicken Joint | Grocery Store | Sandwich Place | Liquor Store | Fast Food Restaurant | Beer Store | Pharmacy | General Entertainment | Curling Ice |
| 4 | Alderwood,Long Branch | Pizza Place | Athletics & Sports | Pharmacy | Coffee Shop | Pub | Sandwich Place | Skating Rink | Gym | Airport Terminal | Farmers Market |


In [35]:
#cluster neighbourhood
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(toronto_grouped_clustering)

# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
toronto_merged = df_toronto

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
toronto_merged = toronto_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='Neighbourhood')
toronto_merged = toronto_merged.dropna()  # drop rows with missing values, e.g. neighbourhoods for which no venues were returned

In [36]:
toronto_merged.sort_values("Cluster Labels")

| | Postcode | Borough | Neighbourhood | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | M2L | North York | Silver Hills,York Mills | 43.757490 | -79.374714 | 0.0 | Cafeteria | Yoga Studio | Drugstore | Diner | Discount Store | Dog Run | Doner Restaurant | Donut Shop | Dumpling Restaurant | Dessert Shop |
| 0 | M1B | Scarborough | Rouge,Malvern | 43.806686 | -79.194353 | 1.0 | Fast Food Restaurant | Drugstore | Dim Sum Restaurant | Diner | Discount Store | Dog Run | Doner Restaurant | Donut Shop | Dumpling Restaurant | Gym / Fitness Center |
| 70 | M5X | Downtown Toronto | First Canadian Place,Underground city | 43.648429 | -79.382280 | 1.0 | Coffee Shop | Café | Hotel | Steakhouse | Bar | Asian Restaurant | Burger Joint | Restaurant | Seafood Restaurant | Deli / Bodega |
| 69 | M5W | Downtown Toronto | Stn A PO Boxes 25 The Esplanade | 43.646435 | -79.374846 | 1.0 | Coffee Shop | Café | Restaurant | Japanese Restaurant | Bakery | Hotel | Italian Restaurant | Seafood Restaurant | Beer Bar | Cocktail Bar |
| 68 | M5V | Downtown Toronto | CN Tower,Bathurst Quay,Island airport,Harbourf... | 43.628947 | -79.394420 | 1.0 | Airport Service | Airport Terminal | Airport Lounge | Bar | Boat or Ferry | Boutique | Sculpture Garden | Airport Gate | Airport Food Court | Airport |
| 67 | M5T | Downtown Toronto | Chinatown,Grange Park,Kensington Market | 43.653206 | -79.400049 | 1.0 | Café | Bar | Dumpling Restaurant | Vietnamese Restaurant | Chinese Restaurant | Coffee Shop | Mexican Restaurant | Bakery | Park | Dessert Shop |
| 66 | M5S | Downtown Toronto | Harbord,University of Toronto | 43.662696 | -79.400049 | 1.0 | Café | Restaurant | Japanese Restaurant | Bar | Bakery | Bookstore | Sandwich Place | Italian Restaurant | Theater | Chinese Restaurant |
| 65 | M5R | Central Toronto | The Annex,North Midtown,Yorkville | 43.672710 | -79.405678 | 1.0 | Sandwich Place | Café | Coffee Shop | Pharmacy | History Museum | Liquor Store | Shoe Repair | Burger Joint | Indian Restaurant | Pub |
| 62 | M5M | North York | Bedford Park,Lawrence Manor East | 43.733283 | -79.419750 | 1.0 | Coffee Shop | Italian Restaurant | Liquor Store | Thai Restaurant | Indian Restaurant | Restaurant | Butcher | Sushi Restaurant | Café | Pub |
| 61 | M5L | Downtown Toronto | Commerce Court,Victoria Hotel | 43.648198 | -79.379817 | 1.0 | Coffee Shop | Café | Hotel | Restaurant | Seafood Restaurant | Gastropub | Steakhouse | Bakery | Gym | Deli / Bodega |


In this case, the entrepreneur should look for neighbourhoods that carry the same cluster label as the neighbourhood of the existing restaurant.
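The lookup described above could be sketched as follows. This is an illustration under assumptions rather than the notebook's own code: `sample_merged` is an invented stand-in for `toronto_merged`, and `similar_neighbourhoods` is a hypothetical helper name.

```python
import pandas as pd

# Toy stand-in for toronto_merged; names and labels are illustrative only.
sample_merged = pd.DataFrame({
    "Neighbourhood": ["N1", "N2", "N3", "N4"],
    "Borough": ["X", "X", "Y", "Y"],
    "Cluster Labels": [1.0, 0.0, 1.0, 1.0],
})

def similar_neighbourhoods(merged, own_neighbourhood):
    """Return rows sharing the cluster label of own_neighbourhood, excluding itself."""
    own = merged.loc[merged["Neighbourhood"] == own_neighbourhood, "Cluster Labels"]
    if own.empty:
        raise ValueError(f"{own_neighbourhood!r} not found")
    label = own.iloc[0]
    mask = (merged["Cluster Labels"] == label) & (merged["Neighbourhood"] != own_neighbourhood)
    return merged.loc[mask]

candidates = similar_neighbourhoods(sample_merged, "N1")
print(candidates["Neighbourhood"].tolist())
```

On the real data, the returned rows would still carry the most common venue columns, which helps when comparing candidates by hand.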

## Conclusion

This clustering model gives only a rough picture of the kinds of venues around each neighborhood. Adding other information, such as nearby population, traffic, and similar factors, could produce a better model. We could also shrink the area covered by each neighbourhood. 
The final decision on the best location for the second restaurant should be made by the entrepreneur, based on the recommended similar spots.
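
One possible way to bring in an extra factor such as population is to join it to the venue frequencies and standardise all columns before clustering. The sketch below is only an assumption about how that could look: the frequency table stands in for `toronto_grouped`, and the population figures are made up for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Illustrative venue-frequency table, standing in for toronto_grouped.
venue_freq = pd.DataFrame(
    {"Coffee Shop": [0.6, 0.5, 0.0, 0.1], "Park": [0.4, 0.5, 1.0, 0.9]},
    index=["N1", "N2", "N3", "N4"],
)
# Hypothetical extra feature; these population figures are invented.
population = pd.Series([52000, 48000, 9000, 11000],
                       index=venue_freq.index, name="Population")

# Join on the neighbourhood index, then put every column on a comparable scale
# so that the large population values do not dominate the distance metric.
features = venue_freq.join(population)
scaled = StandardScaler().fit_transform(features)

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(scaled)
result = pd.Series(labels, index=features.index, name="Cluster Labels")
print(result)
```

Standardising the venue frequencies together with the new feature is a design choice; weighting the extra columns differently would change which neighbourhoods end up grouped together.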