### Initial Business Problem and Background

Toronto, the capital of the Canadian province of Ontario, is the most populous urban area in the country. A diverse, multicultural city, it is the country's financial and commercial center. Boasting many cultural assets, it is also a popular tourism destination.

An experienced restaurateur is looking for an opening in the city, intending to open a contemporary, multi-ethnic restaurant. The question is which of the city's diverse communities and neighborhoods offers the most lucrative location. Answering it requires, among other considerations, identifying the appropriate business and economic ecosystem, so as to find the most promising niche in a complex tapestry of existing restaurants, other attractions, and demographics. Our role is to provide relevant data points on which their decision may be based.

Questions include whether the demographics of the relevant neighborhoods support the endeavor, since the restaurant is targeted especially at younger professionals in the 20–45 age range with sufficient disposable income; which types of cuisine may complement existing establishments so as to fill an unmet need; and whether an area is already saturated with similar businesses.

### Target Audience

The target audience is the primary party looking to establish the restaurant, together with their investors, who will base decisions on a business plan informed in part by the data analysis presented here.

In [42]:
import os
import json
import re
import requests

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

import folium
from geopy.geocoders import Nominatim

### Data Selection and cleaning

Two primary data sources will be employed in a preliminary analysis:

1. Neighborhood profiles, including demographic data, will be drawn from the City of Toronto's Open Data Portal. We will rely heavily on the data sets pertaining to neighborhood boundaries (https://open.toronto.ca/dataset/neighbourhoods/) as well as demographic data for each neighborhood from the 2016 census (https://open.toronto.ca/dataset/neighbourhood-profiles/).

The neighborhood boundary data is in GeoJSON format, as shape data for mapping.

https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/a083c865-6d60-4d1d-b6c6-b0c8a85f9c15?format=geojson&projection=4326

Unfortunately, it does not include a single representative coordinate pair for each neighborhood, but one can be derived from the vertices of the polygon describing the neighborhood boundary. We simply average the boundary vertices, which approximates the centroid of each neighborhood.


In [37]:
def getNeighborhoods(shape_data):
    '''
    Calculate centroids for neighborhoods and build dataframe
    
    Input:
    shape_data: JSON neighborhood geographic boundaries for neighborhoods from Toronto's open data portal
    
    return:
    Pandas dataframe of neighborhoods with center geographical coordinates
    '''
    neighborhoods = []
    for place in shape_data['features']:
        neighborhood = {'Neighborhood_number': place['properties']['AREA_SHORT_CODE'],
                        'Neighborhood': place['properties']['AREA_NAME']}
        coords = np.array(place['geometry']['coordinates'][0])
        centroid = np.mean(coords, axis=0)
        neighborhood['Latitude'] = centroid[1]
        neighborhood['Longitude'] = centroid[0]
        neighborhoods.append(neighborhood)
    return pd.DataFrame(neighborhoods)
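
Averaging the boundary vertices is a quick approximation: it is pulled toward stretches of the boundary that happen to be drawn with many vertices. As a hedged point of comparison, the sketch below computes an area-weighted centroid with the shoelace formula. The helper names `vertex_mean` and `polygon_centroid` and the sample ring are hypothetical, and the sketch assumes a simple (non-self-intersecting) ring with non-zero area, treating longitude and latitude as planar coordinates, which is a reasonable simplification only at neighborhood scale.

```python
import numpy as np

def vertex_mean(ring):
    """Plain average of the ring's vertices (the approximation used above)."""
    return np.mean(np.asarray(ring, dtype=float), axis=0)

def polygon_centroid(ring):
    """Area-weighted centroid of a simple ring via the shoelace formula."""
    pts = np.asarray(ring, dtype=float)
    # Close the ring if the last vertex does not repeat the first
    if not np.array_equal(pts[0], pts[-1]):
        pts = np.vstack([pts, pts[0]])
    x, y = pts[:-1, 0], pts[:-1, 1]
    x_next, y_next = pts[1:, 0], pts[1:, 1]
    cross = x * y_next - x_next * y
    area = cross.sum() / 2.0          # signed area; assumed to be non-zero
    cx = ((x + x_next) * cross).sum() / (6.0 * area)
    cy = ((y + y_next) * cross).sum() / (6.0 * area)
    return np.array([cx, cy])

# Hypothetical ring: a 4 x 2 rectangle with extra vertices crowded on its left edge
ring = [[0, 0], [4, 0], [4, 2], [0, 2], [0, 1.5], [0, 1], [0, 0.5], [0, 0]]

print(vertex_mean(ring))       # pulled toward the densely sampled left edge
print(polygon_centroid(ring))  # geometric center of the rectangle
```

For compact, evenly sampled boundaries the two results are usually close, so the simpler vertex average may well be adequate for placing a search radius.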

In [38]:
path = os.path.join(os.path.abspath('../data'), 'Neighbourhoods.geojson')
with open(path, 'r') as FILE:
    data = json.load(FILE)

neighborhoods_df = getNeighborhoods(data)
neighborhoods_df.head()

Unnamed: 0,Neighborhood_number,Neighborhood,Latitude,Longitude
0,96,Casa Loma (96),43.680555,-79.406011
1,95,Annex (95),43.673365,-79.402468
2,109,Caledonia-Fairbank (109),43.688206,-79.459095
3,64,Woodbine Corridor (64),43.675813,-79.314969
4,103,Lawrence Park South (103),43.714161,-79.407022


We'll clean up the neighborhood names by removing the neighborhood numbers from them.

In [39]:
# Drop the parenthesized number and any whitespace left in front of it
neighborhoods_df['Neighborhood'] = neighborhoods_df['Neighborhood'].apply(lambda x: re.sub(r'\s*\([^)]*\)', '', x).strip())
neighborhoods_df.head()

Unnamed: 0,Neighborhood_number,Neighborhood,Latitude,Longitude
0,96,Casa Loma,43.680555,-79.406011
1,95,Annex,43.673365,-79.402468
2,109,Caledonia-Fairbank,43.688206,-79.459095
3,64,Woodbine Corridor,43.675813,-79.314969
4,103,Lawrence Park South,43.714161,-79.407022


How many neighborhoods are there?

In [34]:
print(f'This gives us a total of {neighborhoods_df.shape[0]} neighborhoods')

This gives us a total of 140 neighborhoods


The demographic data is downloaded in CSV form from https://ckan0.cf.opendata.inter.prod-toronto.ca/download_resource/ef0239b1-832b-4d0b-a1f3-4153e53b189e?format=csv. This contains extensive demographic data by neighborhood, from which we will be extracting population data for 2016. 

The original dataset contains a lot of information we will not need. It is also in a wide format, with neighborhoods as columns, whereas we want them as rows. Of its many categories, we are interested in only two: Population and Neighbourhood Information. So we will filter for those categories, drop unnecessary columns, and reshape the result to give population by neighborhood.
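
To make the reshaping step concrete, here is a minimal sketch on a hypothetical miniature of the profile table. The area names and values are invented for illustration only; only the `Category` and `Characteristic` column names mirror the real file.

```python
import pandas as pd

# Hypothetical miniature of the wide profile table: one row per
# characteristic, one column per neighborhood (values are invented)
toy = pd.DataFrame({
    'Category': ['Neighbourhood Information', 'Population', 'Income'],
    'Characteristic': ['Neighbourhood Number', 'Population, 2016', 'Some income measure'],
    'Area A': ['1', '10,000', '5'],
    'Area B': ['2', '20,000', '6'],
})

# Keep the two categories of interest, then transpose so areas become rows
keep = toy['Category'].isin(['Population', 'Neighbourhood Information'])
long_df = (toy[keep]
           .drop(columns=['Category'])
           .set_index('Characteristic')
           .T)

print(long_df)
```

After the transpose, each neighborhood is a row and each retained characteristic is a column, which is the shape needed for joining against the neighborhood coordinates.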

In [27]:
path = os.path.join(os.path.abspath('../data'), '2016_neighbourhood_profiles.csv')
profile_df = pd.read_csv(path)

# Keep only the two categories we need; .copy() avoids SettingWithCopyWarning
population_df = profile_df[profile_df['Category'].isin(['Population', 'Neighbourhood Information'])].copy()
population_df = population_df.drop(columns=['Category', 'Topic', 'Data Source'])
population_df = population_df.rename(columns={'Characteristic': 'Neighborhood'})
population_df = population_df.set_index('Neighborhood')

# Transpose so that neighborhoods become rows
population_df = population_df.T

# Drop rows with missing values
population_df = population_df.dropna(axis=0)

population_df = population_df[['Neighbourhood Number', 'Population, 2016', 'Population Change 2011-2016', 'Working Age (25-54 years)']]

population_df.head()


Neighborhood,Neighbourhood Number,"Population, 2016",Population Change 2011-2016,Working Age (25-54 years)
Agincourt North,129,29113,-3.90%,11305
Agincourt South-Malvern West,128,23757,8.00%,9965
Alderwood,20,12054,1.30%,5220
Annex,95,30526,4.60%,15040
Banbury-Don Mills,42,27695,2.90%,10810


It appears we will need to clean up the numerical columns of the population data, converting them from string format.

In [28]:
population_df.dtypes

Neighborhood
Neighbourhood Number           object
Population, 2016               object
Population Change 2011-2016    object
Working Age (25-54 years)      object
dtype: object

In [29]:
# Remove commas and convert to integers (astype returns a new Series, so assign it back)
population_df['Population, 2016'] = population_df['Population, 2016'].apply(lambda x: re.sub(',', '', x))
population_df['Population, 2016'] = population_df['Population, 2016'].astype(int)

population_df['Working Age (25-54 years)'] = population_df['Working Age (25-54 years)'].apply(lambda x: re.sub(',', '', x))
population_df['Working Age (25-54 years)'] = population_df['Working Age (25-54 years)'].astype(int)

# Remove '%' character and convert to floats
population_df['Population Change 2011-2016'] = population_df['Population Change 2011-2016'].apply(lambda x: re.sub('%', '', x))
population_df['Population Change 2011-2016'] = population_df['Population Change 2011-2016'].astype(float)

population_df.head()
    

Neighborhood,Neighbourhood Number,"Population, 2016",Population Change 2011-2016,Working Age (25-54 years)
Agincourt North,129,29113,-3.9,11305
Agincourt South-Malvern West,128,23757,8.0,9965
Alderwood,20,12054,1.3,5220
Annex,95,30526,4.6,15040
Banbury-Don Mills,42,27695,2.9,10810
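
As a possible alternative for this kind of cleanup, the conversion can be sketched with pandas' vectorized string methods and `pd.to_numeric`. The `sample` frame below is hypothetical and only mimics the string formats the cleaning step handles (thousands separators and a trailing percent sign).

```python
import pandas as pd

# Hypothetical sample with comma-separated counts and percentage strings
sample = pd.DataFrame({
    'Population, 2016': ['29,113', '23,757'],
    'Population Change 2011-2016': ['-3.90%', '8.00%'],
})

# Strip the thousands separators, then let pandas infer an integer dtype
sample['Population, 2016'] = pd.to_numeric(
    sample['Population, 2016'].str.replace(',', '', regex=False))

# Strip the trailing '%' and parse as floats
sample['Population Change 2011-2016'] = pd.to_numeric(
    sample['Population Change 2011-2016'].str.rstrip('%'))

print(sample.dtypes)
```

Checking `dtypes` after the conversion is a cheap way to confirm that the new values were actually stored in the frame rather than computed and discarded.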


2. Data on existing restaurants and other attractions, used for analysis and segmentation, will be obtained using the Foursquare API. This will create a profile of the establishments already serving the diverse neighborhoods. With this data we can aggregate the various kinds of establishments in each community.

First we will need to set up our use of the Foursquare API.

In [35]:
from dotenv import load_dotenv

load_dotenv()
CLIENT_ID = os.getenv("CLIENT_ID")
CLIENT_SECRET = os.getenv("CLIENT_SECRET")
VERSION = '20180605'
LIMIT = 100

BaseURL = ('https://api.foursquare.com/v2/venues/explore?' +
           f'client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&' +
           f'v={VERSION}')

In [36]:
def getVenues(names, lats, longs, radius=500):
    neighbor_venues = []
    
    count = 0
    
    # Iterate through the neighborhoods and their center coordinates
    for name, lat, long in zip(names, lats, longs):
        url = BaseURL + f'&ll={lat},{long}&radius={radius}&limit={LIMIT}'
        
        # GET response, and make sure it is valid (status_code == 200)
        results = requests.get(url)
        if results.status_code != 200:
            raise Exception(f'HTTP response code was {results.status_code}')
            
        # Update what percentage of neighborhoods processed and print
        count += 1
        print(f'\r{round(count / len(names) * 100, 2)}% neighborhoods downloaded', end='')
            
        # Breakdown the JSON response to what we want
        venues = results.json()['response']['groups'][0]['items']
        for venue in venues:
            ven = venue['venue']
            row = {'Neighborhood': name,
                   'Neighborhood_lat': lat,
                   'Neighborhood_long': long,
                   'Venue': ven['name'],
                   'Category': ven['categories'][0]['name'],
                   'Venue_lat': ven['location']['lat'],
                   'Venue_long': ven['location']['lng']}
            neighbor_venues.append(row)
              
    # Make it into a Dataframe and return
    venues_df = pd.DataFrame(neighbor_venues)
        
    return venues_df
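
The parsing step inside `getVenues` can be tried out offline. The sketch below flattens a response shaped like the one read above (`response` → `groups[0]` → `items` → `venue`) and, as an added assumption, tolerates a venue whose `categories` list is empty. The helper name `parse_venues`, the `'Uncategorized'` placeholder, and the mock payload are all hypothetical and not part of the Foursquare API.

```python
def parse_venues(payload, name, lat, lng, missing='Uncategorized'):
    """Flatten one explore-style response into a list of row dicts."""
    groups = payload.get('response', {}).get('groups', [])
    items = groups[0].get('items', []) if groups else []
    rows = []
    for item in items:
        ven = item['venue']
        # Assumption: a venue may arrive with an empty category list
        categories = ven.get('categories') or []
        rows.append({
            'Neighborhood': name,
            'Neighborhood_lat': lat,
            'Neighborhood_long': lng,
            'Venue': ven['name'],
            'Category': categories[0]['name'] if categories else missing,
            'Venue_lat': ven['location']['lat'],
            'Venue_long': ven['location']['lng'],
        })
    return rows

# Hypothetical payload: the second venue has no category and is invented
mock_payload = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Casa Loma',
               'categories': [{'name': 'Castle'}],
               'location': {'lat': 43.677934, 'lng': -79.409521}}},
    {'venue': {'name': 'Unlabelled Spot',
               'categories': [],
               'location': {'lat': 43.68, 'lng': -79.41}}},
]}]}}

rows = parse_venues(mock_payload, 'Casa Loma', 43.680555, -79.406011)
print([row['Category'] for row in rows])
```

Separating parsing from the HTTP call in this way would also let the download loop be tested without spending API quota.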

In [43]:
venues_df = getVenues(neighborhoods_df['Neighborhood'],
                      neighborhoods_df['Latitude'],
                      neighborhoods_df['Longitude'])

venues_df.head()

100.0% neighborhoods downloaded

Unnamed: 0,Neighborhood,Neighborhood_lat,Neighborhood_long,Venue,Category,Venue_lat,Venue_long
0,Casa Loma,43.680555,-79.406011,Casa Loma,Castle,43.677934,-79.409521
1,Casa Loma,43.680555,-79.406011,Baldwin Steps,Historic Site,43.677707,-79.408209
2,Casa Loma,43.680555,-79.406011,Casa Loma Stables,Museum,43.679395,-79.410905
3,Casa Loma,43.680555,-79.406011,Flor de Sal,Modern European Restaurant,43.677757,-79.407176
4,Casa Loma,43.680555,-79.406011,Sir Winston Churchill Park,Park,43.683732,-79.409881


How many venues does this give us?

In [46]:
print(f'Number of Toronto venues found = {venues_df.shape[0]}')

Number of Toronto venues found = 1757


Many of these are not restaurants, so we will filter the results to just the venues with "Restaurant" in their Category.

In [49]:
restaurants_df = venues_df[venues_df['Category'].str.contains('Restaurant')]
restaurants_df.head()

Unnamed: 0,Neighborhood,Neighborhood_lat,Neighborhood_long,Venue,Category,Venue_lat,Venue_long
3,Casa Loma,43.680555,-79.406011,Flor de Sal,Modern European Restaurant,43.677757,-79.407176
17,Annex,43.673365,-79.402468,Playa Cabana,Mexican Restaurant,43.676112,-79.401279
19,Annex,43.673365,-79.402468,Mistura,Italian Restaurant,43.674285,-79.398426
20,Annex,43.673365,-79.402468,Le Paradis,French Restaurant,43.675007,-79.400036
21,Annex,43.673365,-79.402468,Roti Cuisine of India,Indian Restaurant,43.674618,-79.408249
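
Note that the filter above is case-sensitive and assumes every venue has a category string. A slightly more defensive variant is sketched below on a hypothetical frame; the venue names and categories are invented. Either version excludes food venues whose category does not contain the word at all (a hypothetical "Pizza Place", for instance), which may matter when judging saturation.

```python
import pandas as pd

# Hypothetical venues, including a missing and a lower-case category
venues = pd.DataFrame({
    'Venue': ['A', 'B', 'C', 'D'],
    'Category': ['Italian Restaurant', 'Park', None, 'restaurant'],
})

# case=False ignores capitalization; na=False treats missing categories as non-matches
mask = venues['Category'].str.contains('restaurant', case=False, na=False)
restaurants = venues[mask]

print(restaurants)
```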


How many restaurants?

In [50]:
print(f'Number of restaurants = {restaurants_df.shape[0]}')

Number of restaurants = 376


And how many are there of each type?

In [51]:
restaurants_df['Category'].value_counts()

Italian Restaurant               42
Restaurant                       41
Fast Food Restaurant             28
Sushi Restaurant                 23
Indian Restaurant                23
Japanese Restaurant              19
Vietnamese Restaurant            17
Chinese Restaurant               17
Thai Restaurant                  16
Middle Eastern Restaurant        15
Mexican Restaurant               10
Asian Restaurant                 10
Seafood Restaurant               10
Caribbean Restaurant              9
Greek Restaurant                  8
Falafel Restaurant                8
Korean Restaurant                 6
Portuguese Restaurant             5
Vegetarian / Vegan Restaurant     5
French Restaurant                 4
American Restaurant               4
Ethiopian Restaurant              4
Mediterranean Restaurant          4
Ramen Restaurant                  4
Filipino Restaurant               3
Cantonese Restaurant              3
Afghan Restaurant                 2
Tapas Restaurant            