# Capstone Project Report: Battle of Neighbourhoods

## 1. Introduction/Business Problem

The goal of this project is to determine the best location to open a Japanese restaurant in Toronto.

My client is an entrepreneur and global real estate investor who wants to expand one of his restaurant chains by opening a new location in the city of Toronto. Since Downtown Toronto is very competitive, my client needs data-driven insights to decide in which neighborhood to establish his new Japanese restaurant.

### Audience

This report provides valuable information for investors looking to open a restaurant in Toronto, as it presents data-driven insights that may be crucial in choosing the best neighborhood for a new restaurant.

## 2. Data

This project uses data from three sources: Wikipedia, Statistics Canada (StatsCan) and Foursquare.

- The Wikipedia page lists Toronto's neighborhoods along with their boroughs and postal codes. All the information is available here: https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M


- The Statistics Canada dataset provides 2016 census population counts by postal code for communities across the country. It will be used to identify the most populated areas, as a proxy for population density, on the assumption that demand for a Japanese restaurant is higher there. It will be analysed together with the Foursquare data (the data source described below) to see where existing restaurants are located. In this project, it is assumed that areas with more Japanese restaurants are likely to have the greatest demand. It is also assumed that the people who frequent these restaurants live in the same neighborhood or in a nearby neighborhood with similar population density. This dataset is available here: https://www12.statcan.gc.ca/census-recensement/2016/dp-pd/hlt-fst/pd-pl/Table.cfm?Lang=Eng&T=1201&S=22&O=A


- The Foursquare API will be used to obtain venue location data for Toronto neighbourhoods. This data will be used to explore the restaurant venues and count the restaurants in each neighborhood, which can then be crossed with the StatsCan population data above. From there, the best location can be determined based on existing restaurants and population density.
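
One possible way to cross the two sources described above is to express restaurant counts relative to population. The following is a minimal sketch under stated assumptions: the helper name `restaurants_per_10k` and the restaurant counts are hypothetical, and only the two population figures come from the 2016 StatsCan table used in this report.

```python
def restaurants_per_10k(population, restaurant_counts):
    """Return restaurants per 10,000 residents for each postal code."""
    rates = {}
    for code, residents in population.items():
        if residents <= 0:
            continue  # skip areas with no recorded population
        rates[code] = 10_000 * restaurant_counts.get(code, 0) / residents
    return rates


population = {"M2N": 75897, "M1B": 66108}   # 2016 counts from the StatsCan table
restaurant_counts = {"M2N": 7, "M1B": 0}    # hypothetical counts, for illustration only

rates = restaurants_per_10k(population, restaurant_counts)
for code in sorted(rates, key=rates.get):
    print(code, round(rates[code], 2))
```

A low rate in a populous area could be read either as unmet demand or as weak demand, so a ratio like this is best treated as one input alongside the raw counts.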

In [13]:
#Installing Folium
!pip install folium



In [14]:
conda install -c anaconda beautifulsoup4

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [15]:
conda install -c anaconda lxml

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import requests
import folium
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
%matplotlib inline
import json
from pandas.io.json import json_normalize

In [4]:
# Obtaining Wikipedia data using get request and BeautifulSoup4:
wiki_data= requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
wiki_soup= BeautifulSoup(wiki_data, 'lxml')

In [5]:
#Defining the dataframe columns
New_columns=['Postal Code', 'Borough', 'Neighborhood']

#Creating a new dataframe
df=pd.DataFrame(columns=New_columns)
df

Unnamed: 0,Postal Code,Borough,Neighborhood


In [6]:
#Filling the dataframe
wiki_table=wiki_soup.find('table')
wiki_table

<table class="wikitable sortable">
<tbody><tr>
<th>Postal Code
</th>
<th>Borough
</th>
<th>Neighbourhood
</th></tr>
<tr>
<td>M1A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M2A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3A
</td>
<td>North York
</td>
<td>Parkwoods
</td></tr>
<tr>
<td>M4A
</td>
<td>North York
</td>
<td>Victoria Village
</td></tr>
<tr>
<td>M5A
</td>
<td>Downtown Toronto
</td>
<td>Regent Park, Harbourfront
</td></tr>
<tr>
<td>M6A
</td>
<td>North York
</td>
<td>Lawrence Manor, Lawrence Heights
</td></tr>
<tr>
<td>M7A
</td>
<td>Downtown Toronto
</td>
<td>Queen's Park, Ontario Provincial Government
</td></tr>
<tr>
<td>M8A
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M9A
</td>
<td>Etobicoke
</td>
<td>Islington Avenue, Humber Valley Village
</td></tr>
<tr>
<td>M1B
</td>
<td>Scarborough
</td>
<td>Malvern, Rouge
</td></tr>
<tr>
<td>M2B
</td>
<td>Not assigned
</td>
<td>Not assigned
</td></tr>
<tr>
<td>M3B
</td>
<td

In [20]:
# Looping through the table and filling the dataframe one row at a time
for tr in wiki_table.find_all('tr'):
    new_row=[]
    for td in tr.find_all('td'):
        new_row.append(td.text.strip())
    if len(new_row)==3:
        df.loc[len(df)]=new_row
df.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [21]:
# Processing cells that have assigned borough
df_new= df[df.Borough!='Not assigned']
df_new.head()

Unnamed: 0,Postal Code,Borough,Neighborhood
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"
5,M6A,North York,"Lawrence Manor, Lawrence Heights"
6,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government"


In [22]:
# Creating the resultant dataframe
df_new = df_new.groupby('Postal Code').agg({'Borough':'first','Neighborhood': ', '.join}).reset_index()
df_new

Unnamed: 0,Postal Code,Borough,Neighborhood
0,M1B,Scarborough,"Malvern, Rouge, Malvern, Rouge"
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek, Rouge ..."
2,M1E,Scarborough,"Guildwood, Morningside, West Hill, Guildwood, ..."
3,M1G,Scarborough,"Woburn, Woburn"
4,M1H,Scarborough,"Cedarbrae, Cedarbrae"
...,...,...,...
98,M9N,York,"Weston, Weston"
99,M9P,Etobicoke,"Westmount, Westmount"
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ..."
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest..."
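
The repeated names in the output above (for example "Malvern, Rouge, Malvern, Rouge") are the pattern that would appear if every table row had been appended to `df` twice before grouping, for instance by running the filling cell more than once. As a sketch only, assuming that is the cause, dropping exact duplicate rows before the `groupby` avoids the repetition. The frame `df_demo` below is a small stand-in, not the scraped data.

```python
import pandas as pd

rows = [
    ("M1B", "Scarborough", "Malvern, Rouge"),
    ("M1G", "Scarborough", "Woburn"),
]
# Simulate the same rows being appended twice
df_demo = pd.DataFrame(rows + rows, columns=["Postal Code", "Borough", "Neighborhood"])

deduped = (
    df_demo.drop_duplicates()          # remove exact repeated rows first
    .groupby("Postal Code")
    .agg({"Borough": "first", "Neighborhood": ", ".join})
    .reset_index()
)
print(deduped)
```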


In [23]:
#Creating the dataframe coordinates from the data
df_coor=pd.read_csv('http://cocl.us/Geospatial_data')
df_coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [24]:
#Combining neighborhood dataset with coordinates dataset
df_new=pd.merge(df_new, df_coor, on='Postal Code')
df_new.head(103)

Unnamed: 0,Postal Code,Borough,Neighborhood,Latitude,Longitude
0,M1B,Scarborough,"Malvern, Rouge, Malvern, Rouge",43.806686,-79.194353
1,M1C,Scarborough,"Rouge Hill, Port Union, Highland Creek, Rouge ...",43.784535,-79.160497
2,M1E,Scarborough,"Guildwood, Morningside, West Hill, Guildwood, ...",43.763573,-79.188711
3,M1G,Scarborough,"Woburn, Woburn",43.770992,-79.216917
4,M1H,Scarborough,"Cedarbrae, Cedarbrae",43.773136,-79.239476
...,...,...,...,...,...
98,M9N,York,"Weston, Weston",43.706876,-79.518188
99,M9P,Etobicoke,"Westmount, Westmount",43.696319,-79.532242
100,M9R,Etobicoke,"Kingsview Village, St. Phillips, Martin Grove ...",43.688905,-79.554724
101,M9V,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437


In [25]:
#Reading the data into a pandas dataframe
df_pop = pd.read_csv('T120120210223044745.CSV',encoding = 'unicode_escape')

# Remove unnecessary data from the table
df_pop = df_pop.rename(columns={'Geographic code':'Postal Code', 'Geographic name':'Geoname', 'Province or territory':'Province', 'Incompletely enumerated Indian reserves and Indian settlements, 2016':'Incomplete', 'Population, 2016':'Population2016', 'Total private dwellings, 2016':'PrivateDwellings', 'Private dwellings occupied by usual residents, 2016':'OccupiedPrivateDwellings'})
df_pop = df_pop.drop(columns=['Geoname', 'Province', 'Incomplete', 'PrivateDwellings', 'OccupiedPrivateDwellings'])
df_pop = df_pop.iloc[1:]

df_pop.info

<bound method DataFrame.info of                                             Postal Code  Population2016
1                                                   A0A         46587.0
2                                                   A0B         19792.0
3                                                   A0C         12587.0
4                                                   A0E         22294.0
5                                                   A0G         35266.0
...                                                 ...             ...
1644  For further information, refer to: http://www1...             NaN
1645  Source: Statistics Canada, 2016 Census of Popu...             NaN
1646  How to cite: Statistics Canada. 2017. Populati...             NaN
1647  Statistics Canada Catalogue no. 98-402-X201600...             NaN
1648  http://www12.statcan.gc.ca/census-recensement/...             NaN

[1648 rows x 2 columns]>

In [26]:
# Remove the trailing footer rows (source notes and citations), which have no population value
df_pop.dropna(subset = ["Population2016"], inplace=True)
df_pop.tail()

Unnamed: 0,Postal Code,Population2016
1637,X0G,500.0
1638,X1A,20054.0
1639,Y0A,1641.0
1640,Y0B,6561.0
1641,Y1A,27672.0


In [27]:
# Combining data from both datasets and Creating a new dataframe with Toronto neighbourhoods sorted by population

df3 = pd.merge(df_pop, df_new, on="Postal Code", how='right')
df3 = df3.sort_values(by=['Population2016'], ascending=False)
df3.dropna(subset = ["Population2016"], inplace=True)
df3.head()

Unnamed: 0,Postal Code,Population2016,Borough,Neighborhood,Latitude,Longitude
22,M2N,75897.0,North York,"Willowdale, Willowdale East, Willowdale, Willo...",43.77012,-79.408493
0,M1B,66108.0,Scarborough,"Malvern, Rouge, Malvern, Rouge",43.806686,-79.194353
18,M2J,58293.0,North York,"Fairview, Henry Farm, Oriole, Fairview, Henry ...",43.778517,-79.346556
101,M9V,55959.0,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437
14,M1V,54680.0,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.815252,-79.284577


In [28]:
#Defining Foursquare credentials and version
# Credentials are read from environment variables (the variable names are chosen here) so that secrets are not stored in the notebook
import os
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID']
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET']
ACCESS_TOKEN = os.environ['FOURSQUARE_ACCESS_TOKEN']
VERSION = '20180604'

In [29]:
# Setting a limit to prevent Foursquare's free account overuse

limit = 200

# Setting a search radius of 5000m (assuming that people will travel up to 5km to visit a restaurant)

radius = 5000

# Defining a function to retrieve venues

def getNearbyVenues(names, latitudes, longitudes, radius=5000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
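
The function above indexes straight into the response and into `categories[0]`, so a payload without `response`/`groups`, or a venue without categories, would raise an error partway through the loop. The sketch below separates parsing from fetching so that it can be checked offline. The helper name `parse_explore_items` and the sample payload are assumptions for illustration; the payload only mirrors the keys the function above already reads.

```python
def parse_explore_items(payload, name, lat, lng):
    """Turn one venues/explore JSON payload into flat rows, tolerating gaps."""
    groups = payload.get("response", {}).get("groups") or []
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories") or []
        # Fall back to None when a venue carries no category
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows


# Hand-written payload for illustration; the second venue is made up
sample_payload = {"response": {"groups": [{"items": [
    {"venue": {"name": "Konjiki Ramen",
               "location": {"lat": 43.766998, "lng": -79.412222},
               "categories": [{"name": "Ramen Restaurant"}]}},
    {"venue": {"name": "Example Venue",
               "location": {"lat": 43.77, "lng": -79.41},
               "categories": []}},
]}]}}

rows = parse_explore_items(sample_payload, "Woburn, Woburn", 43.770992, -79.216917)
print(rows)
```

Keeping the parsing step pure also makes it straightforward to skip a neighbourhood whose request fails instead of losing the whole run.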

In [30]:
# Getting neighbourhood's list

Toronto_Venues = getNearbyVenues(names=df3['Neighborhood'],
                                   latitudes=df3['Latitude'],
                                   longitudes=df3['Longitude']
                                  )

Willowdale, Willowdale East, Willowdale, Willowdale East
Malvern, Rouge, Malvern, Rouge
Fairview, Henry Farm, Oriole, Fairview, Henry Farm, Oriole
South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion Gardens, South Steeles, Silverstone, Humbergate, Jamestown, Mount Olive, Beaumond Heights, Thistletown, Albion Gardens
Milliken, Agincourt North, Steeles East, L'Amoreaux East, Milliken, Agincourt North, Steeles East, L'Amoreaux East
CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport, CN Tower, King and Spadina, Railway Lands, Harbourfront West, Bathurst Quay, South Niagara, Island airport
Steeles West, L'Amoreaux West, Steeles West, L'Amoreaux West
Kennedy Park, Ionview, East Birchmount Park, Kennedy Park, Ionview, East Birchmount Park
Guildwood, Morningside, West Hill, Guildwood, Morningside, West Hill
Woodbine Heights, Woodbine Heights
Dorset Park, Wexford Heights, Scarborough Town C

In [31]:
# SHOW UNIQUE VENUE CATEGORIES

print('Unique Venue Categories:')
list(Toronto_Venues['Venue Category'].unique())

Unique Venue Categories:


['Hotel',
 'Ramen Restaurant',
 'Seafood Restaurant',
 'Steakhouse',
 'Japanese Restaurant',
 'Café',
 'Bubble Tea Shop',
 'Movie Theater',
 'Creperie',
 'Grocery Store',
 'Fried Chicken Joint',
 'Sushi Restaurant',
 'Liquor Store',
 'Korean Restaurant',
 'Coffee Shop',
 'Burger Joint',
 'Bakery',
 'Thai Restaurant',
 'Theater',
 'Park',
 'French Restaurant',
 'Supermarket',
 'Plaza',
 'Bridal Shop',
 'Gym',
 'Pizza Place',
 'Spa',
 'Fast Food Restaurant',
 'Shopping Mall',
 'Ski Chalet',
 'Middle Eastern Restaurant',
 'General Entertainment',
 'Health Food Store',
 'Fish Market',
 'Mediterranean Restaurant',
 'Bookstore',
 'Persian Restaurant',
 'Greek Restaurant',
 'Deli / Bodega',
 'Clothing Store',
 'Restaurant',
 'Outdoor Supply Store',
 'Vietnamese Restaurant',
 'Sandwich Place',
 'Sporting Goods Shop',
 'Dessert Shop',
 'Tea Room',
 'Italian Restaurant',
 'Gourmet Shop',
 'Escape Room',
 'Cosmetics Shop',
 'Auto Dealership',
 'Furniture / Home Store',
 'Trail',
 'Hobby Shop',
 '

In [32]:
# ISOLATE ONLY THOSE CATEGORIES WITH JAPANESE THEMES (SUSHI, RAMEN, ETC.)
Japanese_restaurants = ['Ramen Restaurant', 'Japanese Restaurant', 'Sushi Restaurant', 'Noodle House']

Japanese_rest_pd = pd.DataFrame(Japanese_restaurants)

Japanese_rest_pd

Unnamed: 0,0
0,Ramen Restaurant
1,Japanese Restaurant
2,Sushi Restaurant
3,Noodle House


In [33]:
# Renaming the column that holds the four Japanese restaurant categories to 'Venue Category'
Japanese_rest_pd = Japanese_rest_pd.rename(columns={0:'Venue Category'})

# Combining dataframes to see Japanese restaurant types in neighbourhoods
Toronto_Japanese_rest = pd.merge(Toronto_Venues, Japanese_rest_pd, on='Venue Category', how='right')

Toronto_Japanese_rest

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,"Willowdale, Willowdale East, Willowdale, Willo...",43.770120,-79.408493,Konjiki Ramen,43.766998,-79.412222,Ramen Restaurant
1,"Willowdale, Willowdale West, Willowdale, Willo...",43.782736,-79.442259,Konjiki Ramen,43.766998,-79.412222,Ramen Restaurant
2,"Don Mills, Don Mills",43.725900,-79.340923,Kinton Ramen,43.707302,-79.395854,Ramen Restaurant
3,"Bathurst Manor, Wilson Heights, Downsview Nort...",43.754328,-79.442259,Konjiki Ramen,43.766998,-79.412222,Ramen Restaurant
4,"Willowdale, Newtonbrook, Willowdale, Newtonbrook",43.789053,-79.408493,Konjiki Ramen,43.766998,-79.412222,Ramen Restaurant
...,...,...,...,...,...,...,...
265,"Clarks Corners, Tam O'Shanter, Sullivan, Clark...",43.781638,-79.304302,Wonton Chai Noodle 雲吞仔,43.787722,-79.269552,Noodle House
266,"Clarks Corners, Tam O'Shanter, Sullivan, Clark...",43.781638,-79.304302,Deer Garden Signatures 鹿園魚湯米線,43.821898,-79.298857,Noodle House
267,"Wexford, Maryvale, Wexford, Maryvale",43.750072,-79.295849,Eight Noodles,43.778234,-79.308299,Noodle House
268,"Woburn, Woburn",43.770992,-79.216917,Wonton Chai Noodle 雲吞仔,43.787722,-79.269552,Noodle House
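
The right merge above acts as a filter on `Venue Category`. A boolean mask with `isin` is a simpler alternative, and it differs in one respect: a right merge keeps a row of missing values for any listed category that has no venue, whereas `isin` returns only matching venues. The sketch below illustrates this on a three-row stand-in frame (`Example Cafe` is made up). Note also that 'Noodle House' is broader than Japanese cuisine, as some of the venue names in the output above suggest.

```python
import pandas as pd

venues = pd.DataFrame({
    "Venue": ["Konjiki Ramen", "Example Cafe", "Eight Noodles"],
    "Venue Category": ["Ramen Restaurant", "Café", "Noodle House"],
})
japanese_categories = ["Ramen Restaurant", "Japanese Restaurant",
                       "Sushi Restaurant", "Noodle House"]

# Option 1: boolean mask, keeps only venues whose category is listed
filtered = venues[venues["Venue Category"].isin(japanese_categories)].reset_index(drop=True)

# Option 2: right merge, also keeps listed categories that matched no venue
merged = pd.merge(venues, pd.DataFrame({"Venue Category": japanese_categories}),
                  on="Venue Category", how="right")

print(len(filtered), len(merged))
```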


In [34]:
# Using one hot encoding
newonehot = pd.get_dummies(Toronto_Japanese_rest[['Venue Category']], prefix="", prefix_sep="")

# Add neighbourhood back in and move to first column
newonehot['Neighborhood'] = Toronto_Japanese_rest['Neighborhood'] 
fixed_columns = [newonehot.columns[-1]] + list(newonehot.columns[:-1])
newonehot = newonehot[fixed_columns]

newonehot.head()

Unnamed: 0,Neighborhood,Japanese Restaurant,Noodle House,Ramen Restaurant,Sushi Restaurant
0,"Willowdale, Willowdale East, Willowdale, Willo...",0,0,1,0
1,"Willowdale, Willowdale West, Willowdale, Willo...",0,0,1,0
2,"Don Mills, Don Mills",0,0,1,0
3,"Bathurst Manor, Wilson Heights, Downsview Nort...",0,0,1,0
4,"Willowdale, Newtonbrook, Willowdale, Newtonbrook",0,0,1,0


In [35]:
# Analysis of restaurant types (share of each type, as a fraction of 1) in each neighborhood

grouped = newonehot.groupby('Neighborhood').mean().reset_index()
grouped

Unnamed: 0,Neighborhood,Japanese Restaurant,Noodle House,Ramen Restaurant,Sushi Restaurant
0,"Agincourt, Agincourt",0.222222,0.444444,0.000000,0.333333
1,"Alderwood, Long Branch, Alderwood, Long Branch",0.500000,0.000000,0.000000,0.500000
2,"Bathurst Manor, Wilson Heights, Downsview Nort...",0.333333,0.000000,0.166667,0.500000
3,"Bayview Village, Bayview Village",0.600000,0.000000,0.200000,0.200000
4,"Bedford Park, Lawrence Manor East, Bedford Par...",0.333333,0.000000,0.111111,0.555556
...,...,...,...,...,...
79,"Willowdale, Willowdale East, Willowdale, Willo...",0.571429,0.000000,0.142857,0.285714
80,"Willowdale, Willowdale West, Willowdale, Willo...",0.400000,0.000000,0.200000,0.400000
81,"Woburn, Woburn",0.333333,0.333333,0.000000,0.333333
82,"York Mills West, York Mills West",0.333333,0.000000,0.111111,0.555556
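
Taking the mean of the one-hot columns gives the share of each restaurant type within a neighbourhood, not the number of restaurants. Since this report assumes that areas with more Japanese restaurants have more demand, it may help to look at counts next to shares. The sketch below uses `pd.crosstab` on a made-up five-row frame; the neighbourhood names "A" and "B" are placeholders.

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B"],
    "Venue Category": ["Sushi Restaurant", "Sushi Restaurant", "Ramen Restaurant",
                       "Japanese Restaurant", "Sushi Restaurant"],
})

# Absolute number of venues per category
counts = pd.crosstab(venues["Neighborhood"], venues["Venue Category"])
# Row-normalised shares, equivalent to the mean of one-hot columns
shares = pd.crosstab(venues["Neighborhood"], venues["Venue Category"], normalize="index")

print(counts)
print(shares)
```

Here "B" has a sushi share of 1.0 with a single restaurant, while "A" has a share of 0.5 with two, which is the kind of difference the shares alone would hide.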


In [36]:
num_top_venues = 4

for hood in grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = grouped[grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Agincourt, Agincourt----
                 venue  freq
0         Noodle House  0.44
1     Sushi Restaurant  0.33
2  Japanese Restaurant  0.22
3     Ramen Restaurant  0.00


----Alderwood, Long Branch, Alderwood, Long Branch----
                 venue  freq
0  Japanese Restaurant   0.5
1     Sushi Restaurant   0.5
2         Noodle House   0.0
3     Ramen Restaurant   0.0


----Bathurst Manor, Wilson Heights, Downsview North, Bathurst Manor, Wilson Heights, Downsview North----
                 venue  freq
0     Sushi Restaurant  0.50
1  Japanese Restaurant  0.33
2     Ramen Restaurant  0.17
3         Noodle House  0.00


----Bayview Village, Bayview Village----
                 venue  freq
0  Japanese Restaurant   0.6
1     Ramen Restaurant   0.2
2     Sushi Restaurant   0.2
3         Noodle House   0.0


----Bedford Park, Lawrence Manor East, Bedford Park, Lawrence Manor East----
                 venue  freq
0     Sushi Restaurant  0.56
1  Japanese Restaurant  0.33
2     Ramen Restau

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [38]:
import numpy as np
num_top_venues = 4

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = grouped['Neighborhood']

for ind in np.arange(grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,"Agincourt, Agincourt",Noodle House,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant
1,"Alderwood, Long Branch, Alderwood, Long Branch",Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
2,"Bathurst Manor, Wilson Heights, Downsview Nort...",Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
3,"Bayview Village, Bayview Village",Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House
4,"Bedford Park, Lawrence Manor East, Bedford Par...",Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
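
The column names above are built by looking up a suffix in `indicators` and falling back to "th". A small helper can do the same without exception handling; the sketch below (the name `ordinal` is an assumption) also covers 11th to 13th, which only matters if more than ten columns are ever requested.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"          # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


num_top_venues = 4
columns = ["Neighborhood"] + [
    f"{ordinal(i)} Most Common Venue" for i in range(1, num_top_venues + 1)
]
print(columns)
```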


In [39]:
neighborhoods_venues_sorted.tail()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
79,"Willowdale, Willowdale East, Willowdale, Willo...",Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House
80,"Willowdale, Willowdale West, Willowdale, Willo...",Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
81,"Woburn, Woburn",Sushi Restaurant,Noodle House,Japanese Restaurant,Ramen Restaurant
82,"York Mills West, York Mills West",Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
83,"York Mills, Silver Hills, York Mills, Silver H...",Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House


In [40]:
# MERGE DATAFRAMES TO INCLUDE ALL DATA FROM NEIGHBORHOOD AND RESTAURANT TYPE DFs

Toronto_complete = pd.merge(df3, neighborhoods_venues_sorted, on='Neighborhood', how='left')
Toronto_complete.dropna(subset = ["1st Most Common Venue"], inplace=True)
Toronto_complete.head()

Unnamed: 0,Postal Code,Population2016,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue
0,M2N,75897.0,North York,"Willowdale, Willowdale East, Willowdale, Willo...",43.77012,-79.408493,Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House
2,M2J,58293.0,North York,"Fairview, Henry Farm, Oriole, Fairview, Henry ...",43.778517,-79.346556,Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House
3,M9V,55959.0,Etobicoke,"South Steeles, Silverstone, Humbergate, Jamest...",43.739416,-79.588437,Sushi Restaurant,Ramen Restaurant,Noodle House,Japanese Restaurant
4,M1V,54680.0,Scarborough,"Milliken, Agincourt North, Steeles East, L'Amo...",43.815252,-79.284577,Noodle House,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant
6,M1W,48471.0,Scarborough,"Steeles West, L'Amoreaux West, Steeles West, L...",43.799525,-79.318389,Noodle House,Japanese Restaurant,Sushi Restaurant,Ramen Restaurant


In [41]:
!conda install -c conda-forge scikit-learn --yes
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - scikit-learn


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    certifi-2020.12.5          |   py36h5fab9bb_1         143 KB  conda-forge
    joblib-1.0.1               |     pyhd8ed1ab_0         206 KB  conda-forge
    openssl-1.1.1j             |       h7f98852_0         2.1 MB  conda-forge
    scikit-learn-0.24.1        |   py36he4fde30_0         7.5 MB  conda-forge
    scipy-1.5.3                |   py36h9e8f40b_0        19.1 MB  conda-forge
    threadpoolctl-2.1.0        |     pyh5ca1d4c_0          15 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        29.1 MB

The following NEW packages will be INSTALLED:


In [42]:
# CLUSTER MODELLING
# USE SILHOUETTE TO FIND BEST CLUSTER GROUPS

groupedclusters = grouped.drop(columns='Neighborhood')

kclusters = np.arange(2,10)
results = {}
for size in kclusters:
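    # Fixed seed so that the selected size is reproducible and consistent with the final fit
    model = KMeans(n_clusters = size, random_state=0).fit(groupedclusters)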
    predictions = model.predict(groupedclusters)
    results[size] = silhouette_score(groupedclusters, predictions)

best_size = max(results, key=results.get)
best_size

9
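
For readers unfamiliar with the score used above: for each point, the silhouette compares `a`, the mean distance to the other members of its own cluster, with `b`, the smallest mean distance to the members of any other cluster, as `(b - a) / max(a, b)`; the reported score is the mean over all points. The plain-Python sketch below illustrates that definition on four made-up points. It is not the scikit-learn implementation, and the function name `mean_silhouette` is an assumption.

```python
from math import dist          # available from Python 3.8
from statistics import fmean


def mean_silhouette(points, labels):
    """Mean silhouette coefficient using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            fmean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return fmean(scores)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])   # nearby points grouped together
bad = mean_silhouette(points, [0, 1, 0, 1])    # distant points grouped together
print(good > bad)
```

Scores close to 1 indicate compact, well-separated clusters, and negative scores indicate points that sit closer to another cluster than to their own.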

In [43]:
# RUN K MEANS AND SEGMENT DATA

kclusters = best_size
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(groupedclusters)

# CHECK LABELS

kmeans.labels_[0:10]

array([6, 7, 2, 5, 2, 4, 5, 3, 3, 1], dtype=int32)

In [44]:
# MERGE TORONTO DATA WITH COORDINATE DATA AND GET CLUSTER LABELS
labels = pd.merge(Toronto_complete, grouped, on='Neighborhood', how='right')
labels.drop(labels[labels['Population2016']==0].index, inplace=True)
labels.drop(labels[labels['Population2016']==10].index, inplace=True)
labels.head()

Unnamed: 0,Postal Code,Population2016,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,Japanese Restaurant,Noodle House,Ramen Restaurant,Sushi Restaurant
0,M1S,37769.0,Scarborough,"Agincourt, Agincourt",43.7942,-79.262029,Noodle House,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,0.222222,0.444444,0.0,0.333333
1,M8W,20674.0,Etobicoke,"Alderwood, Long Branch, Alderwood, Long Branch",43.602414,-79.543484,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House,0.5,0.0,0.0,0.5
2,M3H,37011.0,North York,"Bathurst Manor, Wilson Heights, Downsview Nort...",43.754328,-79.442259,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House,0.333333,0.0,0.166667,0.5
3,M2K,23852.0,North York,"Bayview Village, Bayview Village",43.786947,-79.385975,Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House,0.6,0.0,0.2,0.2
4,M5M,25975.0,North York,"Bedford Park, Lawrence Manor East, Bedford Par...",43.733283,-79.41975,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House,0.333333,0.0,0.111111,0.555556


In [45]:
# ADD CLUSTERED LABELS

tablewithlabels = labels
tablewithlabels['Cluster Labels'] = kmeans.labels_

# MERGE WITH THE SORTED VENUE TABLE (labels ALREADY HOLDS THESE COLUMNS, HENCE THE _x/_y SUFFIXES IN THE OUTPUT)

tablewithlabels = pd.merge(labels, neighborhoods_venues_sorted, on='Neighborhood', how='left')

tablewithlabels.head()

Unnamed: 0,Postal Code,Population2016,Borough,Neighborhood,Latitude,Longitude,1st Most Common Venue_x,2nd Most Common Venue_x,3rd Most Common Venue_x,4th Most Common Venue_x,Japanese Restaurant,Noodle House,Ramen Restaurant,Sushi Restaurant,Cluster Labels,1st Most Common Venue_y,2nd Most Common Venue_y,3rd Most Common Venue_y,4th Most Common Venue_y
0,M1S,37769.0,Scarborough,"Agincourt, Agincourt",43.7942,-79.262029,Noodle House,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,0.222222,0.444444,0.0,0.333333,6,Noodle House,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant
1,M8W,20674.0,Etobicoke,"Alderwood, Long Branch, Alderwood, Long Branch",43.602414,-79.543484,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House,0.5,0.0,0.0,0.5,7,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
2,M3H,37011.0,North York,"Bathurst Manor, Wilson Heights, Downsview Nort...",43.754328,-79.442259,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House,0.333333,0.0,0.166667,0.5,2,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
3,M2K,23852.0,North York,"Bayview Village, Bayview Village",43.786947,-79.385975,Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House,0.6,0.0,0.2,0.2,5,Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Noodle House
4,M5M,25975.0,North York,"Bedford Park, Lawrence Manor East, Bedford Par...",43.733283,-79.41975,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House,0.333333,0.0,0.111111,0.555556,2,Sushi Restaurant,Japanese Restaurant,Ramen Restaurant,Noodle House
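
Assigning `kmeans.labels_` as a column relies on the table having exactly the same rows, in the same order, as the frame the model was fitted on. If rows are dropped or re-ordered in between, the assignment either fails or silently attaches labels to the wrong neighbourhoods. A sketch of a key-based alternative follows, using small stand-in frames; `grouped_demo`, `cluster_labels` and `table` are hypothetical names.

```python
import pandas as pd

grouped_demo = pd.DataFrame({"Neighborhood": ["Agincourt", "Alderwood", "Bayview Village"]})
cluster_labels = [6, 7, 5]  # stand-in for kmeans.labels_, in the row order of grouped_demo

# Build an explicit neighbourhood -> label lookup
label_by_neighborhood = dict(zip(grouped_demo["Neighborhood"], cluster_labels))

# A table that has since been re-ordered and has lost a row
table = pd.DataFrame({"Neighborhood": ["Bayview Village", "Agincourt"]})
table["Cluster Labels"] = table["Neighborhood"].map(label_by_neighborhood)
print(table)
```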


In [140]:
# FIND VALUES FOR EACH OF THE CLUSTERS

In [46]:
cluster1 = tablewithlabels.loc[tablewithlabels['Cluster Labels'] == 0, tablewithlabels.columns[[3, 4] + list(range(5, tablewithlabels.shape[1]))]]
cluster1.shape

(3, 16)

In [47]:
cluster2 = tablewithlabels.loc[tablewithlabels['Cluster Labels'] == 1, tablewithlabels.columns[[3, 4] + list(range(5, tablewithlabels.shape[1]))]]
cluster2.shape

(13, 16)

In [48]:
cluster3 = tablewithlabels.loc[tablewithlabels['Cluster Labels'] == 2, tablewithlabels.columns[[3, 4] + list(range(5, tablewithlabels.shape[1]))]]
cluster3.shape

(7, 16)

In [89]:
cluster4 = tablewithlabels.loc[tablewithlabels['Cluster Labels'] == 3, tablewithlabels.columns[[3, 4] + list(range(5, tablewithlabels.shape[1]))]]
cluster4.shape

(15, 16)

In [50]:
cluster5 = tablewithlabels.loc[tablewithlabels['Cluster Labels'] == 4, tablewithlabels.columns[[3, 4] + list(range(5, tablewithlabels.shape[1]))]]
cluster5.shape

(8, 16)

In [51]:
cluster6 = tablewithlabels.loc[tablewithlabels['Cluster Labels'] == 5, tablewithlabels.columns[[3, 4] + list(range(5, tablewithlabels.shape[1]))]]
cluster6.shape

(15, 16)
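
The cells above inspect labels 0 to 5 one at a time, while the model was fitted with nine clusters. Counting every label in one pass shows all cluster sizes at once, including ties. The sketch below uses `collections.Counter` on the ten labels printed earlier as a stand-in for the full `kmeans.labels_` array; label `k` corresponds to `cluster{k+1}` in the naming used above.

```python
from collections import Counter

labels = [6, 7, 2, 5, 2, 4, 5, 3, 3, 1]  # the first ten labels printed earlier

sizes = Counter(labels)                   # label -> number of neighbourhoods
largest = max(sizes.values())
biggest_clusters = sorted(label for label, n in sizes.items() if n == largest)

print(sizes.most_common())
print(biggest_clusters)
```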

In [52]:
# CLUSTERS 4 AND 6 CONTAIN THE SAME NUMBER OF NEIGHBOURHOODS (15 EACH), THE MOST AMONG THE SIX CLUSTERS INSPECTED ABOVE, SO BOTH ARE TREATED AS CANDIDATE LOCATIONS WITH THESE VARIABLES

In [57]:
# FIND GEOGRAPHIC CENTRE OF EACH CLUSTER

cluster4coords = cluster4[['Latitude', 'Longitude']].values
lat = [coord[0] for coord in cluster4coords]
long = [coord[1] for coord in cluster4coords]

# Arithmetic mean of the member coordinates
blatitude4 = sum(lat)/len(lat)
blongitude4 = sum(long)/len(long)
print(blatitude4)
print(blongitude4)

43.710583493333345
-79.39145293333334


In [58]:
cluster6coords = cluster6[['Latitude', 'Longitude']].values
lat = [coord[0] for coord in cluster6coords]
long = [coord[1] for coord in cluster6coords]

# Arithmetic mean of the member coordinates
blatitude6 = sum(lat)/len(lat)
blongitude6 = sum(long)/len(long)
print(blatitude6)
print(blongitude6)

43.70664769333334
-79.38809083333332
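
With two candidate centres computed, a quick way to check how far apart they are is the haversine great-circle distance. The sketch below is illustrative: the function name `haversine_km` and the mean Earth radius of 6371 km are assumptions, and the coordinates are the two centres printed above.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(h))


centre4 = (43.710583493333345, -79.39145293333334)   # cluster 4 centre from above
centre6 = (43.70664769333334, -79.38809083333332)    # cluster 6 centre from above

distance = haversine_km(*centre4, *centre6)
print(round(distance, 2), "km")
```

If the two centres turn out to be close together, the two clusters point to much the same part of the city.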


In [56]:
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

Collecting package metadata (current_repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/jupyterlab/conda/envs/python

  added / updated specs:
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    geopy-2.1.0                |     pyhd3deb0d_0          64 KB  conda-forge
    ------------------------------------------------------------
                                           Total:          98 KB

The following NEW packages will be INSTALLED:

  geographiclib      conda-forge/noarch::geographiclib-1.50-py_0
  geopy              conda-forge/noarch::geopy-2.1.0-pyhd3deb0d_0



Downloading and Extracting Packages
geopy-2.1.0          | 64 KB     | ##################################### | 100% 
geographiclib-1.50   | 34 KB     | ################################

In [64]:
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors

In [68]:
!pip install opencage
from opencage.geocoder import OpenCageGeocode



In [69]:
from pprint import pprint

In [74]:
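# Finding the best location on cluster 4 using opencage
import os

# Read the API key from an environment variable instead of hardcoding it in the notebook
# (the variable name OPENCAGE_API_KEY is an assumption)
key = os.environ['OPENCAGE_API_KEY']
geocoder = OpenCageGeocode(key)

# Reverse-geocode the coordinates of the center of cluster 4
results = geocoder.reverse_geocode(43.710583493333345, -79.39145293333334)
pprint(results)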

[{'annotations': {'DMS': {'lat': "43° 42' 38.34396'' N",
                          'lng': "79° 23' 29.12172'' W"},
                  'MGRS': '17TPJ2959240993',
                  'Maidenhead': 'FN03hr30an',
                  'Mercator': {'x': -8837812.747, 'y': 5391237.11},
                  'OSM': {'edit_url': 'https://www.openstreetmap.org/edit?way=56495093#map=17/43.71065/-79.39142',
                          'note_url': 'https://www.openstreetmap.org/note/new#map=17/43.71065/-79.39142&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=43.71065&mlon=-79.39142#map=17/43.71065/-79.39142'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'CA': '124',
                                         'NORTHERN_AMERICA': '021',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']},
                  'callingcode': 1,
                  'currency': {'

In [75]:
# Best location name on cluster 4
popstr = df3[df3['Postal Code'].str.contains('M4P')]

def str_join(*args):
    return ''.join(map(str, args))

popstr = str_join('Best Location: ', popstr['Neighborhood'].values, ' in ', popstr['Borough'].values)

print(popstr)

Best Location: ['Davisville North, Davisville North'] in ['Central Toronto']


In [76]:
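# Finding the best location on cluster 6 using opencage (input coordinates of cluster 6)
import os

# Read the API key from an environment variable instead of hardcoding it in the notebook
# (the variable name OPENCAGE_API_KEY is an assumption)
key = os.environ['OPENCAGE_API_KEY']
geocoder = OpenCageGeocode(key)
results6 = geocoder.reverse_geocode(43.70664769333334, -79.38809083333332)
pprint(results6)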

[{'annotations': {'DMS': {'lat': "43° 42' 23.97168'' N",
                          'lng': "79° 23' 17.36088'' W"},
                  'MGRS': '17TPJ2986340555',
                  'Maidenhead': 'FN03hq39ko',
                  'Mercator': {'x': -8837449.077, 'y': 5390624.455},
                  'OSM': {'edit_url': 'https://www.openstreetmap.org/edit?way=401259621#map=16/43.70666/-79.38816',
                          'note_url': 'https://www.openstreetmap.org/note/new#map=16/43.70666/-79.38816&layers=N',
                          'url': 'https://www.openstreetmap.org/?mlat=43.70666&mlon=-79.38816#map=16/43.70666/-79.38816'},
                  'UN_M49': {'regions': {'AMERICAS': '019',
                                         'CA': '124',
                                         'NORTHERN_AMERICA': '021',
                                         'WORLD': '001'},
                             'statistical_groupings': ['MEDC']},
                  'callingcode': 1,
                  'currency': 

In [77]:
# Best location name on cluster 6
popstr6 = df3[df3['Postal Code'].str.contains('M4S')]

def str_join(*args):
    return ''.join(map(str, args))

popstr6 = str_join('Best Location: ', popstr6['Neighborhood'].values, ' in ', popstr6['Borough'].values)

print(popstr6)

Best Location: ['Davisville, Davisville'] in ['Central Toronto']
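
The postal code prefixes used in the lookups above (`M4P`, `M4S`) are the first three characters of a Canadian postal code. The sketch below shows one way such a prefix could be pulled out of a reverse-geocoding result programmatically rather than read off the printed output. It assumes the result is a list of dicts whose first entry has a `components` dict with a `postcode` key. The `mock_results` value is a hand-written stand-in with a made-up postcode, not real geocoder output, and `postal_prefix` is a hypothetical helper.

```python
def postal_prefix(results):
    """Return the first three characters of the top result's postcode, or None."""
    if not results:
        return None
    postcode = results[0].get("components", {}).get("postcode")
    if not postcode:
        return None
    # Drop spaces and normalise case before taking the three-character prefix
    return postcode.replace(" ", "").upper()[:3]


# Hand-written stand-in for a geocoder response (postcode is illustrative only)
mock_results = [{"components": {"postcode": "M4S 1A1", "city": "Toronto"}}]
prefix = postal_prefix(mock_results)
print(prefix)
```

The returned prefix could then be compared against the `Postal Code` column of `df3`, as the cells above do with a hardcoded string.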


In [62]:
# USING GEOPY GEOCODERS

address = 'Toronto, ON'

geolocator = Nominatim(user_agent="http")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print("Toronto's Geographical Coordinates: {}, {}".format(latitude, longitude))

Toronto's Geographical Coordinates: 43.6534817, -79.3839347


In [87]:
# Using Folium to show the Location of Cluster 4

map_clusters4 = folium.Map(location=[latitude, longitude], zoom_start=12)

# Colours

x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Adding markers
markers_colors = []
for lat, lon, poi, cluster in zip(tablewithlabels['Latitude'], tablewithlabels['Longitude'], tablewithlabels['Neighborhood'], tablewithlabels['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters4)
    
# Large red circle marking the center of cluster 4
folium.CircleMarker([blatitude4, blongitude4],
                    radius=50,
                    popup='Cluster 4 center',
                    color='red',
                    ).add_to(map_clusters4)

map_clusters4.save('map_clusters4.html')
map_clusters4

In [86]:
# Using Folium to show the Location of Cluster 6

map_clusters6 = folium.Map(location=[latitude, longitude], zoom_start=12)

# Colours

x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Adding markers
markers_colors = []
for lat, lon, poi, cluster in zip(tablewithlabels['Latitude'], tablewithlabels['Longitude'], tablewithlabels['Neighborhood'], tablewithlabels['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters6)
    
# Large red circle marking the center of cluster 6
folium.CircleMarker([blatitude6, blongitude6],
                    radius=50,
                    popup='Cluster 6 center',
                    color='red',
                    ).add_to(map_clusters6)

map_clusters6.save('map_clusters6.html')
map_clusters6

## Observations and Conclusion

As expected, the ideal location to open a Japanese restaurant is in the city center of Toronto. According to the reported results, clusters 4 and 6 have the highest density, so in theory the ideal place to open a Japanese restaurant would be within one of these two clusters, both of which are located in Central Toronto.
By analysing the exact locations of these clusters, it is possible to see that the center of cluster 4 is somewhat far from other restaurants, whereas the center of cluster 6 is closer to them. Based on this model and the assumptions made, we conclude that the ideal location for opening a Japanese restaurant is Davisville in Central Toronto, which is where the center of cluster 6 lies.
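
The comparison above rests on how far each cluster center is from existing restaurants. As a rough illustration of how such distances could be quantified, the sketch below implements the haversine great-circle formula with the standard library and applies it to the two coordinate pairs that were passed to the reverse geocoder for clusters 4 and 6. The function name `haversine_km` and the Earth radius of 6371 km are assumptions of this sketch, and the calculation is not taken from the analysis itself.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates used for the reverse-geocoding calls in this notebook
cluster4_center = (43.710583493333345, -79.39145293333334)
cluster6_center = (43.70664769333334, -79.38809083333332)

distance = haversine_km(*cluster4_center, *cluster6_center)
print(f"Distance between cluster centers: {distance:.2f} km")
```

The two centers turn out to be about half a kilometre apart. The same function could be applied between a cluster center and each restaurant's coordinates to replace the visual judgement with a measured distance.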

Please take a look at the following maps, as they support our conclusions.
The Folium interactive maps are not displayed in this notebook because the notebook is not trusted by GitHub. However, a screenshot of each map is available in my GitHub repository: https://github.com/hairtractive/Coursera_Capstone.
The first map, "Map Cluster 4", displays the center location of cluster 4, and the second map, "Map Cluster 6", displays the center location of cluster 6. Finally, the third map displays the neighborhoods of cluster 6.

Please check the maps here:
- First map "Map Cluster 4": https://github.com/hairtractive/Coursera_Capstone/blob/main/Map%20Cluster%204.png
- Second map "Map Cluster 6": https://github.com/hairtractive/Coursera_Capstone/blob/main/Map%20Cluster%206.png
- Third map "Map Cluster 6 Neighborhoods": https://github.com/hairtractive/Coursera_Capstone/blob/main/Map%20Cluster%206%20Neighborhoods.png

# Thank you!

Miguel Silva