# Capstone Project - Battle of the Neighborhoods

## Introduction
In this project we analyze the businesses in the neighborhoods of Toronto, Canada. We want to open a new restaurant and would like to know which neighborhood is best suited to receive it, and which types of restaurants are already there.

Restaurants are usually successful in locations in the central areas, where there's a great concentration of people. On the other hand, it's not ideal to have **too many** restaurants in a given region, especially if they are similar (for example, opening an Italian restaurant in a neighborhood that already has 2 or 3 of them).

Given these basic assumptions, let's review the information that we need.

## Data
To assess the future business opportunities, we obtained a list of neighborhoods in the more central areas of Toronto, as well as information regarding existing businesses which was downloaded from Foursquare using their developer's API.

Combining both datasets lets us count how many businesses of each type exist in the neighborhoods of interest.

The following section will deal with obtaining and preparing the data for our report.


### Webscraping Toronto's Wikipedia page
In this section, we download the content of the Wikipedia page about Toronto neighborhoods and postal codes ([link](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M)) to populate a DataFrame.

In [1]:
#imports used in this stage
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

We use `requests.get()` to fetch the contents of the page and parse them with BeautifulSoup. We also print the page title to check that we have accessed the correct page:

In [2]:
tor_url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
htmldata = requests.get(url=tor_url).text
soup = BeautifulSoup(htmldata, 'lxml')
print(soup.title)

<title>List of postal codes of Canada: M - Wikipedia</title>


We count the number of tables, and print the beginning of each table to see which of them is the one that we want:

In [3]:
tables = soup.find_all('table')
print('There are {:} tables in the document.'.format(len(tables)))
# printing the beginning of each table to see which one is the right one
_ = [print(tables[i].prettify()[:250]) for i in range(len(tables))]

There are 3 tables in the document.
<table cellpadding="2" cellspacing="0" rules="all" style="width:100%; border-collapse:collapse; border:1px solid #ccc;">
 <tbody>
  <tr>
   <td style="width:11%; vertical-align:top; color:#ccc;">
    <p>
     <b>
      M1A
     </b>
     <br/>
     <
<table class="navbox">
 <tbody>
  <tr>
   <td style="width:36px; text-align:center">
    <a class="image" href="/wiki/File:Flag_of_Canada.svg" title="Flag of Canada">
     <img alt="Flag of Canada" data-file-height="600" data-file-width="1200" decodi
<table cellspacing="0" style="background-color: #F8F8F8;" width="100%">
 <tbody>
  <tr>
   <td style="text-align:center; border:1px solid #aaa;">
    <a href="/wiki/Newfoundland_and_Labrador" title="Newfoundland and Labrador">
     NL
    </a>
   </t


From the sample, we can see the 7th line from the first table has M1A, which is one of the postal codes of Toronto, hence, the correct table is found in `tables[0]`.

Once the table is identified, we parse its contents, while also getting rid of undesired data:
- Postal codes that were not assigned yet
- Extraneous characters such as **`( ) /`** and extra whitespace

Finally, we convert the table to a *pandas* DataFrame

In [4]:
neigh_tbl = tables[0]
neigh_lst = []

for val in (neigh_tbl.find_all('td'))[:-1]:
    strtmp = val.get_text(strip=True, separator=';')
    strlist = strtmp.replace('/', ',').split(';')
    # restrict the parsing only to the postal codes already assigned, otherwise skip
    if strlist[1] != 'Not assigned':
        post = strlist[0]
        bor = strlist[1]
        # join the neighborhood fields, remove extraneous symbols, and split it again
        hoods = (' '.join(strlist[2:])).replace('(', ' ').replace(')', ' ').replace('  ', ' ').split(',')
        # some neighborhoods cause a weird bug in plotting because of the apostrophe
        # hoods = hoods.replace("'", '`')  
        for h in hoods:
            h = h.lstrip().rstrip()
            strtmp = [post, bor, h]  
            neigh_lst.append(strtmp)

# convert the resulting list to a pandas DataFrame
df = pd.DataFrame(neigh_lst, columns=['Postal code', 'Borough', 'Neighborhoods'])
print('Shape of the DataFrame: ', df.shape)
df.head()

Shape of the DataFrame:  (216, 3)


Unnamed: 0,Postal code,Borough,Neighborhoods
0,M3A,North York,Parkwoods
1,M4A,North York,Victoria Village
2,M5A,Downtown Toronto,Regent Park
3,M5A,Downtown Toronto,Harbourfront
4,M6A,North York,Lawrence Manor
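
The per-cell parsing done inside the loop above can be isolated into a small function, which makes it easier to check against a single cell. This is only a sketch: `parse_cell` is a hypothetical helper name, and the sample string is an assumed example of what `get_text(strip=True, separator=';')` might return for the M5A cell, not text copied from the page. Unlike the loop above, it also collapses repeated whitespace and skips empty names.

```python
def parse_cell(cell_text):
    """Split one table cell into (postal code, borough, neighborhood) rows."""
    fields = cell_text.replace('/', ',').split(';')
    # Skip malformed cells and postal codes that are not assigned yet
    if len(fields) < 2 or fields[1] == 'Not assigned':
        return []
    postal, borough = fields[0], fields[1]
    # Join the remaining fields, drop the parentheses, and split on commas
    joined = ' '.join(fields[2:]).replace('(', ' ').replace(')', ' ')
    hoods = [' '.join(h.split()) for h in joined.split(',')]
    return [(postal, borough, h) for h in hoods if h]


# Assumed sample input (hypothetical, for illustration only)
rows = parse_cell('M5A;Downtown Toronto;(Regent Park / Harbourfront)')
print(rows)
# [('M5A', 'Downtown Toronto', 'Regent Park'), ('M5A', 'Downtown Toronto', 'Harbourfront')]
```

The returned tuples have the same shape as the entries appended to `neigh_lst`, so they could be passed to `pd.DataFrame` in the same way.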


In [5]:
#df = df.query('Borough.str.contains("Central") or Borough.str.contains("Downtown")')
# df = df.query('Borough.str.contains("Downtown")')
# df.reset_index(drop=True, inplace=True)
# df.shape

### Getting Latitude and Longitude of the Neighborhoods

We obtained the Latitude and Longitude values from ArcGIS using `geocoder`

In [6]:
#imports
#!pip install geocoder  # uncomment to install geocoder (if needed)
import geocoder
import os
#import matplotlib.pyplot as plt
from matplotlib.pyplot import cm  # Color palettes

import folium  # Plotting maps with overlays
import branca   # Fancy HTML text inside bubbles

To save bandwidth and time, and to avoid hitting the query limits of the Foursquare and ArcGIS APIs, we cache the data locally.

The first time each of them is run, we download the desired information, and save the results as a CSV file. The next time the script is run, it first checks if the file already exists locally before trying to query the APIs.
In case we want to refresh the database, all we need to do is delete the respective CSV file.
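
The caching routine described above can be sketched generically as follows. This is a minimal illustration and not the notebook's own code: `load_or_fetch` and `fake_fetch` are hypothetical names, and the sketch uses the standard `csv` module instead of pandas so that it runs on its own.

```python
import csv
import os
import tempfile


def load_or_fetch(path, fetch_rows, fieldnames):
    """Return the rows cached in `path`, or call `fetch_rows()` and cache them."""
    if os.path.exists(path):
        with open(path, newline='', encoding='utf-8') as fh:
            return list(csv.DictReader(fh))
    rows = fetch_rows()
    with open(path, 'w', newline='', encoding='utf-8') as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo with a stand-in for the API call (hypothetical data source)
calls = []

def fake_fetch():
    calls.append(1)
    return [{'Postal code': 'M5A', 'Latitude': 43.66069}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cache.csv')
    first = load_or_fetch(path, fake_fetch, ['Postal code', 'Latitude'])
    second = load_or_fetch(path, fake_fetch, ['Postal code', 'Latitude'])

print(len(calls))  # 1: the second call was served from the file
```

Note that rows read back from the CSV file hold strings, so numeric columns need converting after a cache hit; `pd.read_csv` does that conversion automatically, which is one reason the notebook relies on it.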

Here we execute this routine to obtain the geolocation data from the postal codes from Toronto. In this case there's some code to deal with eventual errors while querying the ArcGIS geolocation database.

**NOTE:** We ran into a situation where one of the postal codes (M7Y) returned the same coordinates as another postal code (M5W). This was manually fixed in the lines below with an IF clause.

In [7]:
# try to open previously saved CSV file, or create a new one and save a local copy
fname = 'Toronto.csv'
retries = 0

if os.path.exists(fname):
     print("File '{}' already exists. Loading from cache...".format(fname))
     df = pd.read_csv(fname, index_col=0)
else:
     df['Latitude'], df['Longitude'] = 0, 0
     # add latitude and longitude using geocoder
     reply = geocoder.arcgis('969 Eastern Ave ON, Canada') # dummy init
     for p in df.index:
          n, b = df.loc[p, 'Neighborhoods'], df.loc[p, 'Borough']
          # retry the query if we have an error
          while (True):
               reply = geocoder.arcgis('{}, {}, Toronto, ON,  Canada'.format(n, b))
               print('.', end='')
               if reply.error:
                    retries = retries + 1  # count number of retries
                    print('\n', p, reply)
               if reply.ok:
                    break

          # Store the results once we're past the error
          df.loc[p, ['Latitude', 'Longitude']] = reply.latlng

     print("\nThere are {} missing coordinates after the query".format(sum([df['Latitude'].isna().sum(),
          df['Longitude'].isna().sum()])))
     print("We needed {} retries to get all the data".format(retries))
     # save results to a file
     df.to_csv(fname)
     print("DataFrame saved as '{}'.".format(fname))


File 'Toronto.csv' already exists. Loading from cache...
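
The query loop above retries until a valid reply arrives, which could loop forever if the service stays unavailable. A bounded variant might look like the sketch below. `query_with_retries` and `FakeReply` are hypothetical names introduced for illustration; the only assumption about the reply object is that it exposes an `ok` attribute, as the `geocoder` replies used above do.

```python
import time


def query_with_retries(query, max_retries=5, delay=1.0):
    """Call `query()` until the reply is ok; return (reply, retries_used)."""
    for attempt in range(max_retries + 1):
        reply = query()
        if getattr(reply, 'ok', False):
            return reply, attempt
        if attempt < max_retries:
            time.sleep(delay)  # wait before trying again
    raise RuntimeError('no valid reply after {} retries'.format(max_retries))


class FakeReply:
    """Stand-in for a geocoder reply (hypothetical, for the demo only)."""
    def __init__(self, ok):
        self.ok = ok


# Two failures followed by a success
replies = iter([FakeReply(False), FakeReply(False), FakeReply(True)])
reply, retries = query_with_retries(lambda: next(replies), delay=0)
print(retries)  # 2
```

In the notebook this could be called as `query_with_retries(lambda: geocoder.arcgis(address))`, where `address` is the formatted query string.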


Checking for duplicates

In [8]:
#group by geolocation, count the repeated results and display the top result
df.groupby(['Latitude', 'Longitude']).count().sort_values('Neighborhoods', ascending=False).head()['Neighborhoods']

Latitude   Longitude 
43.648690  -79.385440    8
43.720197  -79.499895    5
43.612991  -79.493032    3
43.799768  -79.310048    2
43.671100  -79.373590    2
Name: Neighborhoods, dtype: int64

We can see that different postal codes point to the same coordinates. Right now we're not gonna bother with that and we'll simply drop those values.

In [9]:
df.drop_duplicates(['Latitude', 'Longitude'], inplace=True)
df.reset_index(drop=True, inplace=True)
print(df.shape)
df.head()

(186, 5)


Unnamed: 0,Postal code,Borough,Neighborhoods,Latitude,Longitude
0,M3A,North York,Parkwoods,43.758895,-79.320322
1,M4A,North York,Victoria Village,43.73154,-79.31428
2,M5A,Downtown Toronto,Regent Park,43.66069,-79.36031
3,M5A,Downtown Toronto,Harbourfront,43.63923,-79.38307
4,M6A,North York,Lawrence Manor,43.72294,-79.43116


Here we convert latitude and longitude to x, y offsets in kilometers, so that we can compute each neighborhood's distance from the city center directly from its coordinates.
We also rotate the coordinates to align the downtown street grid with the horizontal and vertical axes.

In [10]:
mark = geocoder.arcgis('Toronto, ON, Canada').latlng  # coordinates of the city center
dist = df[['Latitude', 'Longitude']].copy().values
conv = [111.320, 78.710]  # conversion rate from degrees of lat, lon to km
angle = -16  # angle of rotation to align the streets to the X, Y axis
c = np.cos(angle*np.pi/180)
s = np.sin(angle*np.pi/180)
# approximate distance in km
distkm = np.sqrt((((dist - mark)*conv)**2).sum(1))
distManh = (dist - mark)*conv  
# rotation of the axis to get horizontal and vertical distances relative to the city grid
df['dx'] = distManh[:, 1]*c - distManh[:, 0]*s
df['dy'] = distManh[:, 1]*s + distManh[:, 0]*c
df['distkm'] = distkm

df.sort_values('distkm')

Unnamed: 0,Postal code,Borough,Neighborhoods,Latitude,Longitude,dx,dy,distkm
13,M5B,Downtown Toronto,Garden District,43.648690,-79.385440,0.000000,0.000000,0.000000
174,M5X,Downtown Toronto,Underground city,43.649390,-79.382140,0.271160,0.003310,0.271180
47,M5H,Downtown Toronto,Richmond,43.650784,-79.383031,0.246550,0.171856,0.300535
173,M5X,Downtown Toronto,First Canadian Place,43.648465,-79.380980,0.330552,-0.120875,0.351960
68,M5K,Downtown Toronto,Toronto Dominion Centre,43.646956,-79.381460,0.247904,-0.271898,0.367947
...,...,...,...,...,...,...,...,...
22,M1C,Scarborough,Highland Creek,43.789480,-79.176140,20.155821,10.524757,22.738242
172,M1X,Scarborough,Upper Rouge,43.809279,-79.187694,19.889180,12.894105,23.703110
9,M1B,Scarborough,Rouge,43.807660,-79.174050,20.871787,12.424812,24.290069
21,M1C,Scarborough,Port Union,43.778970,-79.131090,23.241857,8.422728,24.720968
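
The conversion and rotation can be checked by hand on a single row. The sketch below reuses the constants from the cell above in plain Python; `to_grid_km` is a hypothetical helper name, and the reference point is assumed to be the Garden District coordinates, since that row has zero offsets in the table.

```python
import math

KM_PER_DEG_LAT = 111.320  # value used in this notebook
KM_PER_DEG_LON = 78.710   # value used in this notebook
GRID_ANGLE_DEG = -16      # rotation used in this notebook


def to_grid_km(lat, lon, ref_lat, ref_lon, angle_deg=GRID_ANGLE_DEG):
    """Return (dx, dy, dist) in km relative to the reference point."""
    north = (lat - ref_lat) * KM_PER_DEG_LAT
    east = (lon - ref_lon) * KM_PER_DEG_LON
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    dx = east * c - north * s   # offset along the rotated 'horizontal' axis
    dy = east * s + north * c   # offset along the rotated 'vertical' axis
    return dx, dy, math.hypot(east, north)


# 'Underground city' row, relative to the assumed reference point
dx, dy, dist = to_grid_km(43.649390, -79.382140, 43.648690, -79.385440)
print(round(dx, 3), round(dy, 3), round(dist, 3))  # 0.271 0.003 0.271
```

A degree of longitude shrinks with the cosine of the latitude. The factor 78.710 is roughly 111.320·cos(45°), whereas at Toronto's latitude of about 43.65° the same formula gives roughly 80.5 km, so the east-west distances here may be underestimated by about 2%.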


### Checking the result on the map

To check the results, we select a slice of the city defined by the following rules:
- Closer to the lake than the city center (at most 50 m beyond it on the 'vertical' axis);
- Maximum 'horizontal' distance of 1.5 km from the city center;
- Maximum 'vertical' distance of 1.5 km toward the lake, to exclude the points on the islands.

In [11]:
map_toronto = folium.Map(location=mark, zoom_start=11, control_scale=True)

# All markers in black
daux = df
for lat, lng, lbl, n in zip(daux['Latitude'], daux['Longitude'], daux['Neighborhoods'], daux['Borough']):
    color = "#111111"
    lbl = "{} ({})".format(lbl.replace("'", "`"), n.replace("'", "`"))
    folium.CircleMarker(location=[lat, lng], popup=lbl, color=color, radius=5, fill=True, alpha=.5).add_to(map_toronto)

# Selected markers in green
daux = df[((df.dy) <=  .050) & ((df.dy) >=  -1.5) & (abs(df.dx) <= 1.5)]
for lat, lng, lbl, n in zip(daux['Latitude'], daux['Longitude'], daux['Neighborhoods'], daux['Borough']):
    color = "#118811"
    lbl = "{} ({})".format(lbl.replace("'", "`"), n.replace("'", "`"))
    folium.CircleMarker(location=[lat, lng], popup=lbl, color=color, radius=5, fill=True, alpha=.5).add_to(map_toronto)

# Central latlong reference
folium.CircleMarker(location=mark, popup='Toronto Latlong Ref', color='blue').add_to(map_toronto)
map_toronto

Looks like it worked. Now we generate a copy of the slice as a new DataFrame

In [12]:
toronto_slice = df[((df.dy) <=  .050) & ((df.dy) >=  -1.5) & (abs(df.dx) <= 1.5)]
print(toronto_slice.shape)
toronto_slice.head()

(14, 8)


Unnamed: 0,Postal code,Borough,Neighborhoods,Latitude,Longitude,dx,dy,distkm
3,M5A,Downtown Toronto,Harbourfront,43.63923,-79.38307,-0.110954,-1.063711,1.069482
13,M5B,Downtown Toronto,Garden District,43.64869,-79.38544,0.0,0.0,0.0
35,M5E,Downtown Toronto,Berczy Park,43.64811,-79.37517,0.759241,-0.284876,0.810926
59,M5J,Downtown Toronto,Union Station,43.64517,-79.38063,0.255921,-0.481022,0.544865
68,M5K,Downtown Toronto,Toronto Dominion Centre,43.646956,-79.38146,0.247904,-0.271898,0.367947
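
The selection rule is written out twice, once for the map and once for the slice. As a sketch, it can be expressed as a single predicate and checked against rows from the tables above. `in_slice` and its parameter names are hypothetical; the thresholds are the ones used in the notebook.

```python
def in_slice(dx, dy, max_dx=1.5, max_south=1.5, max_north=0.05):
    """True when a point (offsets in km) falls inside the selected slice."""
    return (-max_south <= dy <= max_north) and (abs(dx) <= max_dx)


# (dx, dy) values copied from the tables above
print(in_slice(-0.110954, -1.063711))  # Harbourfront -> True
print(in_slice(0.246550, 0.171856))    # Richmond -> False (north of the center)
print(in_slice(20.155821, 10.524757))  # Highland Creek -> False (too far away)
```

With pandas, the same effect comes from building the boolean mask once, storing it in a variable, and reusing it for both the map markers and `toronto_slice`.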


Now that we selected a region of interest, we're gonna use the Foursquare API to collect data from venues in those neighborhoods.

### Clustering the neighborhoods using data from the Foursquare API
We kept the same parameters from the Manhattan sample exercise, and reused the function `getNearbyVenues` from it.

In [13]:
from sklearn.cluster import KMeans  # ML library

In [14]:
#foursquare credentials
CLIENT_ID = 'L2LRCN30Z5RRWFLVVZVWHPL1JTUF05IZ3IAMMRZX40MU0TIF' # your Foursquare ID
CLIENT_SECRET = '01SEKO1V4WEHUQSEX0QJJ0FVKCIWBOEWFXXWU4OBUWLW5WQU' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 200 # A default Foursquare API limit value
RADIUS = 500 # Radius in meters

In [15]:
# function to get nearby venues and export to a DataFrame
def getNearbyVenues(names, latitudes, longitudes, radius=RADIUS):
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        # return only relevant information for each nearby venue
        venues_list.append(
            [(name, lat, lng, v['venue']['name'], v['venue']['location']['lat'], 
            v['venue']['location']['lng'], v['venue']['categories'][0]['name'])
            for v in results]
            )
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhoods', 'Neighborhood Latitude', 'Neighborhood Longitude',
                             'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category']

    return(nearby_venues)
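
The request URL above is assembled with string formatting. A possible alternative, sketched below, builds the query string from a dictionary so that each parameter is named and properly escaped. `build_explore_url` is a hypothetical helper, and the credentials are placeholders; the endpoint and parameter names are the ones used in `getNearbyVenues`.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(lat, lng, client_id, client_secret,
                      version='20180605', radius=500, limit=200):
    """Build the 'explore' request URL for one pair of coordinates."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials (not real values)
url = build_explore_url(43.63923, -79.38307, 'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET')
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['radius'])
```

`requests.get` also accepts the same dictionary through its `params` argument, which avoids building the URL by hand. Reading the credentials from environment variables instead of hard-coding them in the notebook would keep them out of shared copies.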

Once again, we cache the results locally to save time and reduce the number of queries, this time for the Foursquare API.

In [16]:
fname = 'Toronto Venues.csv'
if os.path.exists(fname):
    print('File "{}" exists, loading from cache'.format(fname))
    venues = pd.read_csv(fname, index_col=0)
else:
    venues = getNearbyVenues(toronto_slice.Neighborhoods, toronto_slice.Latitude, toronto_slice.Longitude)
    venues.to_csv(fname)

print('Dataframe shape:', venues.shape)

File "Toronto Venues.csv" exists, loading from cache
Dataframe shape: (1246, 7)


In [17]:
venues.head()

Unnamed: 0,Neighborhoods,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.63923,-79.38307,Harbourfront Centre,43.638556,-79.38319,Performing Arts Venue
1,Harbourfront,43.63923,-79.38307,Harbourfront,43.639526,-79.380688,Neighborhood
2,Harbourfront,43.63923,-79.38307,Natrel Pond/Rink,43.638431,-79.382528,Skating Rink
3,Harbourfront,43.63923,-79.38307,Lick It Gelato,43.639256,-79.38465,Ice Cream Shop
4,Harbourfront,43.63923,-79.38307,Lake Ontario,43.638945,-79.379665,Lake


Selecting only the restaurants from the venues list:

In [18]:
restaurants = venues.query("`Venue Category`.str.contains('Restaurant') | `Venue Category`.str.contains('Food')").copy()
print(restaurants.shape)
print(restaurants.drop_duplicates(['Venue', 'Venue Latitude', 'Venue Longitude']).shape)

(327, 7)
(118, 7)


There are several duplicates due to the overlapping of the neighborhoods but, since we're interested in how many restaurants are within walking distance (500m) from the neighborhood, we're gonna keep the duplicates.

### Collecting info on the restaurants
Checking the number of restaurant categories:

In [19]:
l = 0
for c in sorted(restaurants['Venue Category'].unique()):
    if 'restaurant' in c.lower():
        print(c, end=', ')
        l += 1
print('{} results'.format(l))

American Restaurant, Asian Restaurant, Belgian Restaurant, Brazilian Restaurant, Caribbean Restaurant, Chinese Restaurant, Colombian Restaurant, Comfort Food Restaurant, Eastern European Restaurant, Fast Food Restaurant, French Restaurant, Gluten-free Restaurant, Greek Restaurant, Hawaiian Restaurant, Indian Restaurant, Italian Restaurant, Japanese Restaurant, Latin American Restaurant, Mediterranean Restaurant, Mexican Restaurant, Middle Eastern Restaurant, Molecular Gastronomy Restaurant, New American Restaurant, Peruvian Restaurant, Ramen Restaurant, Restaurant, Seafood Restaurant, Spanish Restaurant, Sushi Restaurant, Thai Restaurant, Vegetarian / Vegan Restaurant, 31 results


There are some restaurants that can be combined in the same category. For example, `Ramen`, `Sushi` and `Japanese`,  and so on:

In [20]:
rest = ['Sushi Restaurant', 'Japanese Restaurant'], ['Ramen Restaurant', 'Japanese Restaurant'], ['Restaurant', 'Uncategorized Restaurant']

for r in rest:
    restaurants.loc[restaurants['Venue Category'] == r[0], 'Venue Category'] = r[1]

l = 0
for c in sorted(restaurants['Venue Category'].unique()):
    if 'restaurant' in c.lower():
    #     print(c, end=', ')
        l += 1
print('{} unique results'.format(l))

29 unique results
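
The same merge can be written as a dictionary lookup, which keeps the mapping readable as more categories are added. This is a sketch with hypothetical names (`CATEGORY_MAP`, `merge_category`); the category list is the one printed earlier.

```python
CATEGORY_MAP = {
    'Sushi Restaurant': 'Japanese Restaurant',
    'Ramen Restaurant': 'Japanese Restaurant',
    'Restaurant': 'Uncategorized Restaurant',
}


def merge_category(name):
    """Return the merged category, or the name itself when no rule applies."""
    return CATEGORY_MAP.get(name, name)


categories = [
    'American Restaurant', 'Asian Restaurant', 'Belgian Restaurant',
    'Brazilian Restaurant', 'Caribbean Restaurant', 'Chinese Restaurant',
    'Colombian Restaurant', 'Comfort Food Restaurant',
    'Eastern European Restaurant', 'Fast Food Restaurant', 'French Restaurant',
    'Gluten-free Restaurant', 'Greek Restaurant', 'Hawaiian Restaurant',
    'Indian Restaurant', 'Italian Restaurant', 'Japanese Restaurant',
    'Latin American Restaurant', 'Mediterranean Restaurant',
    'Mexican Restaurant', 'Middle Eastern Restaurant',
    'Molecular Gastronomy Restaurant', 'New American Restaurant',
    'Peruvian Restaurant', 'Ramen Restaurant', 'Restaurant',
    'Seafood Restaurant', 'Spanish Restaurant', 'Sushi Restaurant',
    'Thai Restaurant', 'Vegetarian / Vegan Restaurant',
]

merged = sorted({merge_category(c) for c in categories})
print(len(categories), '->', len(merged))  # 31 -> 29
```

In pandas, passing the same dictionary to `Series.replace` applies the mapping to the whole `Venue Category` column in one call.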


The number of categories was slightly reduced, but some ambiguous categories remain, such as `Asian` and `New American`, as well as several uncategorized venues described simply as `Restaurant`. Hopefully they won't be a problem.

In [21]:
restaurants.reset_index(inplace=True, drop=True)
print("Shape of the restaurant's DataFrame:", restaurants.shape)
restaurants.head()

Shape of the restaurant's DataFrame: (327, 7)


Unnamed: 0,Neighborhoods,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Harbourfront,43.63923,-79.38307,Pearl Harbourfront,43.638157,-79.380688,Chinese Restaurant
1,Harbourfront,43.63923,-79.38307,Steam Whistle's Biergarten,43.640666,-79.385859,Uncategorized Restaurant
2,Harbourfront,43.63923,-79.38307,Gonoe Sushi Japanese Restaurant,43.639014,-79.385914,Japanese Restaurant
3,Harbourfront,43.63923,-79.38307,Taverna Mercatto,43.642625,-79.383257,Italian Restaurant
4,Harbourfront,43.63923,-79.38307,e11even,43.642426,-79.381441,New American Restaurant


Grouping the restaurants by categories:

In [22]:
# count the restaurants in each neighborhood
restaurant_counts = restaurants.groupby('Neighborhoods')['Venue Category'].count().rename('count')

# one-hot encode the categories and sum them by neighborhood
onehot = pd.get_dummies(restaurants[['Venue Category']], prefix='', prefix_sep='')
rest_onehot = pd.concat([restaurants[['Neighborhoods']], onehot], axis=1)
rest_onehot = rest_onehot.groupby('Neighborhoods').sum()

print(rest_onehot.shape)
# shorten the column names by dropping the last word of each category (usually 'Restaurant');
# note that categories with a different last word are shortened too
rest_onehot.columns = [(' '.join(c.split(' ')[:-1])).rstrip() for c in rest_onehot.columns]
rest_onehot.head()

(14, 32)


Neighborhoods,American,Asian,Belgian,Brazilian,Caribbean,Chinese,Colombian,Comfort Food,Eastern European,Fast Food,...,Mexican,Middle Eastern,Molecular Gastronomy,New American,Peruvian,Seafood,Spanish,Thai,Uncategorized,Vegetarian / Vegan
Bathurst Quay,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Berczy Park,0,0,1,0,0,0,0,1,1,0,...,0,0,1,1,0,4,0,1,4,1
CN Tower,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
Commerce Court,3,1,0,0,0,0,0,0,0,1,...,0,0,1,1,0,3,0,2,6,2
Design Exchange,3,2,0,0,0,0,0,0,0,1,...,0,0,0,1,0,3,0,2,6,2
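
The column names above are shortened by dropping the last word of every category. That also shortens categories that do not end in "Restaurant", which is presumably where labels such as `Food` in the later tables come from. A possible alternative, sketched below, removes the suffix only when it is present. `short_name` is a hypothetical helper, `'Food Court'` is an assumed example category, and the pairs are taken from the first rows of the restaurants table shown earlier.

```python
from collections import Counter


def short_name(category, suffix=' Restaurant'):
    """Drop the trailing ' Restaurant' from a category, if present."""
    if category.endswith(suffix):
        return category[:-len(suffix)]
    return category


# (neighborhood, category) pairs from the first rows of the restaurants table
pairs = [
    ('Harbourfront', 'Chinese Restaurant'),
    ('Harbourfront', 'Uncategorized Restaurant'),
    ('Harbourfront', 'Japanese Restaurant'),
    ('Harbourfront', 'Italian Restaurant'),
    ('Harbourfront', 'New American Restaurant'),
]

# Count restaurants per (neighborhood, short category)
counts = Counter((hood, short_name(cat)) for hood, cat in pairs)
print(counts[('Harbourfront', 'Italian')])  # 1
print(short_name('Food Court'))             # Food Court (assumed example, left unchanged)
```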


Total of restaurants by category

In [23]:
top_restaurant_counts = rest_onehot.sum()
print(top_restaurant_counts.sort_values(ascending=False).head())

Japanese         51
Uncategorized    50
Seafood          33
Italian          33
American         20
dtype: int64


Total of restaurants by neighborhood

In [24]:
top_counts_neigh = rest_onehot[rest_onehot.columns[:]].sum(1)
print(top_counts_neigh)

Neighborhoods
Bathurst Quay                                      4
Berczy Park                                       28
CN Tower                                           7
Commerce Court                                    35
Design Exchange                                   33
First Canadian Place                              29
Garden District                                   29
Harbourfront                                      15
King and Spadina                                  23
Stn A PO Boxes 25 The Esplanade Enclave of M5E    24
Toronto Dominion Centre                           27
Underground city                                  32
Union Station                                     18
Victoria Hotel                                    33
dtype: int64


Now we create a list with the top 10 most common restaurants in a neighborhood (if there are that many categories), as well as the total of each category in the neighborhood

In [25]:
top10list = []
for i in range(len(rest_onehot)):
    t10l = rest_onehot.iloc[i].sort_values()[::-1]
    t10l = (['{} ({})'.format(t10l[t10l > 0].index[i], t10l.values[i]) for i in range(len(t10l[t10l > 0]))] + 10*[''])[:10]
    top10list.append(t10l)

top10list = pd.DataFrame(top10list, index=rest_onehot.index, columns=range(1, 11))
top10list

Neighborhoods,1,2,3,4,5,6,7,8,9,10
Bathurst Quay,Japanese (3),Caribbean (1),,,,,,,,
Berczy Park,Seafood (4),Uncategorized (4),Japanese (4),Italian (4),Indian (2),Vegetarian / Vegan (1),Thai (1),Food (1),Belgian (1),New American (1)
CN Tower,Japanese (2),Greek (1),Thai (1),Indian (1),Italian (1),Middle Eastern (1),,,,
Commerce Court,Uncategorized (6),Japanese (5),Italian (5),American (3),Seafood (3),Vegetarian / Vegan (2),Thai (2),New American (1),Asian (1),Fast Food (1)
Design Exchange,Uncategorized (6),American (3),Japanese (3),Seafood (3),Asian (2),Italian (2),Vegetarian / Vegan (2),Thai (2),Food (1),Gluten-free (1)
First Canadian Place,Uncategorized (5),Japanese (4),American (3),Asian (3),Seafood (3),Latin American (1),Fast Food (1),Food (1),French (1),Gluten-free (1)
Garden District,Japanese (5),Asian (3),Vegetarian / Vegan (2),Mediterranean (2),Uncategorized (2),Italian (2),American (2),Thai (2),Seafood (2),Greek (1)
Harbourfront,Uncategorized (3),Italian (3),Chinese (2),Japanese (1),Fast Food (1),Indian (1),Vegetarian / Vegan (1),New American (1),Seafood (1),Thai (1)
King and Spadina,Uncategorized (5),Italian (4),Fast Food (2),French (2),Mexican (1),Hawaiian (1),Indian (1),Food & Drink (1),Vegetarian / Vegan (1),Middle Eastern (1)
Stn A PO Boxes 25 The Esplanade Enclave of M5E,Seafood (4),Uncategorized (4),Japanese (3),Italian (2),Vegetarian / Vegan (1),Comfort Food (1),French (1),Fast Food (1),Eastern European (1),Indian (1)
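
The ranking step can be illustrated on a single neighborhood without pandas. In this sketch `top_n_labels` is a hypothetical helper, and the counts are the CN Tower values from the table above, plus one zero entry to show that empty categories are skipped. Ties are broken alphabetically here, so the order of equal counts may differ from the notebook's output.

```python
def top_n_labels(counts, n=10):
    """Format the n most common categories as 'Name (count)', padded with ''."""
    present = [(name, c) for name, c in counts.items() if c > 0]
    # Highest count first; alphabetical order among ties
    present.sort(key=lambda item: (-item[1], item[0]))
    labels = ['{} ({})'.format(name, c) for name, c in present]
    return (labels + [''] * n)[:n]


cn_tower = {'Japanese': 2, 'Greek': 1, 'Thai': 1, 'Indian': 1,
            'Italian': 1, 'Middle Eastern': 1, 'Seafood': 0}
labels = top_n_labels(cn_tower)
print(labels[:6])
# ['Japanese (2)', 'Greek (1)', 'Indian (1)', 'Italian (1)', 'Middle Eastern (1)', 'Thai (1)']
```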


It is interesting to see that Japanese restaurants are quite common in the neighborhoods. It would probably be a bad idea to try to open another one.

## Clustering the neighborhoods by similarity

Now we use *K-means* to group the selected Toronto neighborhoods by the number of restaurants they have.

In [26]:
#imports
import numpy as np

We decided to divide the neighborhoods into 4 classes, according to the number of restaurants in the region.

In [27]:
n_clusters = 4
cl = KMeans(init='k-means++', n_clusters = n_clusters, random_state=188, n_init=100)
counts_neigh = np.array([top_counts_neigh.values, top_counts_neigh.values]).T
cl.fit(counts_neigh)
print(cl.labels_)

[0 1 0 3 3 1 1 2 1 1 1 3 2 3]


We also analyzed the data and named the classes 'Low', 'Medium', 'High' and 'Very High', according to the number of restaurants in each.

In [28]:
# Create a labeled index for the classes
class_names = ['Low', 'Medium', 'High', 'Very High']
class_counts = np.zeros((n_clusters, 3)).astype('object')

for i in range(n_clusters):
    class_counts[i, :] = [i, top_counts_neigh[cl.labels_ == i].max(), top_counts_neigh[cl.labels_ == i].min()]

class_labels = np.zeros_like(cl.labels_).astype('object')
reverse_index = class_counts[:, 1].argsort().argsort()

for pos, i in enumerate(class_counts[:, 1].argsort().argsort()):
    class_counts[pos, 0] = class_names[i]

for i in range(len(class_labels)):
    class_labels[i] = '{} ({})'.format(class_names[reverse_index[cl.labels_[i]]], top_counts_neigh[i])

class_labels[:5]

array(['Low (4)', 'High (28)', 'Low (7)', 'Very High (35)',
       'Very High (33)'], dtype=object)
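
K-means assigns cluster numbers arbitrarily, so the cell above ranks the clusters by their largest count before naming them. The same idea in plain Python, as a sketch: `name_clusters` is a hypothetical helper, and the labels and counts are copied from the outputs above.

```python
def name_clusters(labels, counts, names):
    """Name each cluster by the rank of its largest count."""
    cluster_max = {}
    for lab, cnt in zip(labels, counts):
        cluster_max[lab] = max(cnt, cluster_max.get(lab, cnt))
    # Cluster ids ordered from the smallest to the largest maximum
    ordered = sorted(cluster_max, key=cluster_max.get)
    name_of = {lab: names[rank] for rank, lab in enumerate(ordered)}
    return ['{} ({})'.format(name_of[lab], cnt) for lab, cnt in zip(labels, counts)]


labels = [0, 1, 0, 3, 3, 1, 1, 2, 1, 1, 1, 3, 2, 3]
counts = [4, 28, 7, 35, 33, 29, 29, 15, 23, 24, 27, 32, 18, 33]
named = name_clusters(labels, counts, ['Low', 'Medium', 'High', 'Very High'])
print(named[:5])
# ['Low (4)', 'High (28)', 'Low (7)', 'Very High (35)', 'Very High (33)']
```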

Now we restore the latitude and longitude values for each neighborhood:

In [29]:
# restore latitude and longitude data
top10list['Latitude'], top10list['Longitude'] = 0, 0
for n in top10list.index:
    top10list.loc[n, 'Latitude'] = df[df['Neighborhoods'] == n]['Latitude'].values
    top10list.loc[n, 'Longitude'] = df[df['Neighborhoods'] == n]['Longitude'].values

Here we plot the neighborhood markers, colored by cluster. The colors represent the number of restaurants in each neighborhood. We also plot the restaurants as white circles, to get an idea of how they are distributed around the region.

In [30]:
colors = cm.jet(np.linspace(0, 1, n_clusters))
offset = np.array([-.005, 0])
reply = geocoder.arcgis('Toronto, ON, Canada')
reply.latlng
# we added a small correction in latlng to allow for greater zoom
map_toronto = folium.Map(location=list(reply.latlng + offset), zoom_start=15, control_scale=True)

# add markers to map
for lat, lng, label, grplbl, cidx, top in zip(top10list['Latitude'], top10list['Longitude'], top10list.index.values, class_labels, cl.labels_, top10list[top10list.columns[:10]].values):
    cv = (255*colors[cidx]).astype('int')
    chex = '#{:02x}{:02x}{:02x}'.format(cv[0], cv[1], cv[2])
    l = ('<b>{:}:</b><br><i>{:}:</i><br>{:}, {:}, {:}, {:}, {:}, {:}, {:}, {:}, {:}, {:}').format(grplbl, label, top[0], top[1], top[2], top[3], top[4], top[5], top[6], top[7], top[8], top[9])
    lbl = branca.element.IFrame(l, width=250, height=130)
    label = folium.Popup(lbl)#, parse_html=True)
    folium.CircleMarker([lat, lng], radius=10, popup=label, color='black', weight=1, fill=True, fill_color=chex, fill_opacity=1,
    parse_html=True).add_to(map_toronto)

for lat, lng, cat, in zip(restaurants['Venue Latitude'], restaurants['Venue Longitude'], restaurants['Venue Category']):
     folium.CircleMarker([lat, lng], radius=6, popup=cat, color='black', weight=1, fill=True, fill_color='white', fill_opacity=.5).add_to(map_toronto)
map_toronto

Of all the analyzed neighborhoods, only "Bathurst Quay" and "CN Tower" have fewer than 10 restaurants nearby. Given that CN Tower is a more central location, opening a restaurant there would be preferable, since it may serve more neighborhoods. Let us check what kinds of restaurants there are in the neighborhood.

In [31]:
top10list.loc['CN Tower'][:10]

1           Japanese (2)
2              Greek (1)
3               Thai (1)
4             Indian (1)
5            Italian (1)
6     Middle Eastern (1)
7                       
8                       
9                       
10                      
Name: CN Tower, dtype: object

Once again, Japanese restaurants are the most common type. Let's see which kinds of restaurants are common in the other neighborhoods but absent from the CN Tower area:

In [32]:
cat_counts = restaurants.drop_duplicates(['Venue', 'Venue Latitude', 'Venue Longitude']).groupby('Venue Category').count().sort_values('Neighborhoods', ascending=False)
print(cat_counts.shape)
cat_counts.head(10)

(32, 6)


Venue Category,Neighborhoods,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude
Japanese Restaurant,19,19,19,19,19,19
Uncategorized Restaurant,16,16,16,16,16,16
Italian Restaurant,14,14,14,14,14,14
Seafood Restaurant,8,8,8,8,8,8
Thai Restaurant,6,6,6,6,6,6
Vegetarian / Vegan Restaurant,5,5,5,5,5,5
Fast Food Restaurant,5,5,5,5,5,5
Indian Restaurant,5,5,5,5,5,5
Chinese Restaurant,4,4,4,4,4,4
Asian Restaurant,4,4,4,4,4,4


We can see that, leaving aside the Uncategorized group, the only one of the top 5 most common categories that is absent from the CN Tower area is **Seafood**, so it would be our choice of restaurant type to open in that neighborhood. Other options that avoid direct competition could be **Vegetarian**, **Fast Food**, **Chinese** and **Asian**.
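
This comparison can be reproduced as a filter over the ranked list. The sketch below uses the ten categories from the table above and the CN Tower categories listed earlier; `absent_categories` is a hypothetical helper, and ignoring the `Uncategorized` group is an assumption made for the example.

```python
def absent_categories(ranked, present, ignore=('Uncategorized',)):
    """Categories from `ranked`, in order, that are missing from `present`."""
    return [c for c in ranked if c not in present and c not in ignore]


# Ten most common categories downtown, most common first
downtown_top = ['Japanese', 'Uncategorized', 'Italian', 'Seafood', 'Thai',
                'Vegetarian / Vegan', 'Fast Food', 'Indian', 'Chinese', 'Asian']
# Categories found around CN Tower
cn_tower = {'Japanese', 'Greek', 'Thai', 'Indian', 'Italian', 'Middle Eastern'}

missing = absent_categories(downtown_top, cn_tower)
print(missing)
# ['Seafood', 'Vegetarian / Vegan', 'Fast Food', 'Chinese', 'Asian']
```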

## Conclusion
In this notebook, we studied a section of Downtown Toronto to find a suitable region for opening a new restaurant. We used location and venues information to find a region where there aren't that many restaurants, and in that neighborhood, we tried to find the most popular category of restaurant in Downtown that **is not** present in the chosen neighborhood.