# Capstone Project - The Battle of Neighborhoods (Week 1)
## Introduction / Business Problem

Finding the right small business location is one of the primary steps in preparing to set up a new business. It is not always an easy task. This project aims to help current and future business owners in the process of selecting business locations. By using data from a location based social network services like Foursquare as well as neighbourhood area statistics it should be possible to recommend possible business locations.

As the types of small businesses are manifold, this project will restrict the definition to those businesses that fall under the categories of shops, restaurants, cafes and bars. These types of businesses depend on foot traffic, easy access and good visibility.

### There are several of factors that can infuence chosing a location:

* Location of similar businesses
 * Businesses are usually located where they are for a good reason,
 * Customers already in the area are more likely to be looking for a similar business
* Consumer statistics for similar business
 * Average number of customer visits. 
 * Popularity of a business
* Distance between consumers and business
 * The further the consumer is located from the business the less likely he or she is to visit.
 * Consumer location doesn't necesarily mean domestic location but could also mean job location.
* Location close to transportion hubs, parking facilities,entertainment centers like theaters, cinemas or public parks
 * Locations where there is a large amount of foot traffic: concentration of possible customers
* Population density of the surrounding area
 * More people close by: more possible customers
 * There are statistics available on population by neighbourhood or postal code area. 
* Average Income 
 * Higher average income: possible customers with more money to spend
 * There are statistics available on average income by neighbourhood. 

  **This project will attempt to combine the above factors to build a clustering and/or recommendation model<br>
  for the best areas for locating certain businesses. The recomendation(s) given by the model should help<br>
  the (future) business owner to make a more informed decision**

**Note**: only further analysis in the next stage after gathering the data will prove which machine learning method is better suited to use

In [1]:
# import the necessary libraries
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis
import geopandas as gpd # libary for geo-spatial data processing and analysis
import json # library to handle JSON files
import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# no warnings
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Data section - Statistics Data on Neighbourhoods

I have chosen to look at the neighbourhoods in the former city of Toronto for this study. This is based on the fact that the city has a sustantially large population with readily available statistics.

1. ### Neighbourhoods with central and boundary geo-coordinates with the following columns:

 * **CDN_Number**: Area code for the neighbourhood, 3 digits
 * **Neighbourhood**: Name of the neighbourhood
 * **geometry**: collection of geo-coordinates designating the boundary of the neigbourhood
 * **Latitude**: the latitudinal coordinate of the center of the area (centroid)
 * **Longitude**: the longitudinal coordinate of the center of the area (centroid)

### Example data:

In [2]:
# convert the neighbourhood's boundaries shapefile to a geopandas dataframe
df_toronto_nbh_geo = gpd.read_file('./data/NEIGHBORHOODS_WGS84.shp')
# rename the columns
df_toronto_nbh_geo.rename(columns={'AREA_S_CD':'CDN_Number',
                                   'AREA_NAME':'Neighbourhood'},
                          inplace=True)
# remove the brackets in the neighbourhod name column
fix_neighbourhood = lambda x: x.split('(')[0]
df_toronto_nbh_geo['Neighbourhood'] = df_toronto_nbh_geo['Neighbourhood'].apply(fix_neighbourhood)
# write to a GEOjson format for later use
df_toronto_nbh_geo
# calculate the centers of each area 
df_toronto_nbh_geo['Latitude'] = df_toronto_nbh_geo['geometry'].centroid.y
df_toronto_nbh_geo['Longitude'] = df_toronto_nbh_geo['geometry'].centroid.x
# display the dimensions and first five rows
print('Dimensions: ', df_toronto_nbh_geo.shape)
df_toronto_nbh_geo.head()

Dimensions:  (140, 5)


Unnamed: 0,CDN_Number,Neighbourhood,geometry,Latitude,Longitude
0,97,Yonge-St.Clair,"POLYGON ((-79.39119482700001 43.681081124, -79...",43.687859,-79.397871
1,27,York University Heights,"POLYGON ((-79.505287916 43.759873494, -79.5048...",43.765738,-79.488883
2,38,Lansing-Westgate,"POLYGON ((-79.439984311 43.761557655, -79.4400...",43.754272,-79.424747
3,31,Yorkdale-Glen Park,"POLYGON ((-79.439687326 43.705609818, -79.4401...",43.714672,-79.457108
4,16,Stonegate-Queensway,"POLYGON ((-79.49262119700001 43.64743635, -79....",43.635518,-79.501128


2. ### Wikipedia table containing neighbourhoods by former city / borough

  This table is used to filter the neighbourhoods by the former city area of Toronto: 
  https://en.wikipedia.org/wiki/List_of_city-designated_neighbourhoods_in_Toronto
  * **CDN_Number**: Area code for the neighbourhood, 3 digits
  * **City-designated-area**: Name of the neighbourhood
  * **Borough**: Former city or borough

In [3]:
# read the table in the Wikipedia page
df_toronto_nbh_bor = pd.read_html('https://en.wikipedia.org/wiki/List_of_city-designated_neighbourhoods_in_Toronto')[0]
# remove columns not needed and rename the remaining
df_toronto_nbh_bor.drop(columns=['Map','Neighbourhoods covered'],inplace=True)
df_toronto_nbh_bor.rename(columns={'CDN number':'CDN_Number',
                                   'Former city/borough':'Borough'},
                          inplace=True)
# format the CDN number column so that it matches that of the previous dataframe
zero_fill = lambda x: "{:03d}".format(x)
df_toronto_nbh_bor['CDN_Number'] = df_toronto_nbh_bor['CDN_Number'].apply(zero_fill)
# display the dimensions and first five rows
print('Dimensions: ', df_toronto_nbh_bor.shape)
df_toronto_nbh_bor.head()

Dimensions:  (140, 3)


Unnamed: 0,CDN_Number,City-designated area,Borough
0,129,Agincourt North,Scarborough
1,128,Agincourt South-Malvern West,Scarborough
2,20,Alderwood,Etobicoke
3,95,Annex,Old City of Toronto
4,42,Banbury-Don Mills,North York


3. ### Toronto Population Statistics by Neighbourhood

  Neighbourhood population , area and household income from 2014. 
  
  This can be retrieved from the city of Toronto neighbourhood wellbeing app
  https://www.toronto.ca/city-government/data-research-maps/neighbourhoods-communities/wellbeing-toronto/
  
  The file contains the following columns:
  * **Neighbourhood**: Name of the neighbourhood
  * **CDN_Number**: Three digit neighbourhood code
  * **TotalPopulation**: Total population for the neighbourhood based on 2014 data
  * **TotalArea**: Area of the neighbourhood in square kilometers
  * **After_TaxHouseholdIncome**: Average household income after tax in Canadian dollars
  * **PopulationDensity**: Density of the population by square kilometers
  
  This excel file will be loaded into a pandas dataframe 

### Example data:

In [4]:
# read the 2014 statistics excel file
df_toronto_nbh_sta = pd.read_excel('./data/wellbeing_toronto_2014.xlsx')
# remove unwanted columns
df_toronto_nbh_sta.drop(columns=['Combined Indicators','Average Family Income'],inplace=True)
# rename the neighbourhood id column to CDN_Number to match other dataframe
rename_columns = {'Neighbourhood Id':'CDN_Number',
                  'Total Population':'TotalPopulation',
                  'Total Area':'TotalArea',
                  'After-Tax Household Income':'AfterTaxHouseholdIncome'}
df_toronto_nbh_sta.rename(columns=rename_columns,inplace=True)
# reformat the CDN_Number column to match the other similar dataframe columns
zero_fill = lambda x: "{:03d}".format(x)
df_toronto_nbh_sta['CDN_Number'] = df_toronto_nbh_sta['CDN_Number'].apply(zero_fill)
# add column with population density
df_toronto_nbh_sta['PopulationDensity'] = round(df_toronto_nbh_sta['TotalPopulation']/df_toronto_nbh_sta['TotalArea'],0)
df_toronto_nbh_sta.head()

Unnamed: 0,Neighbourhood,CDN_Number,TotalPopulation,TotalArea,AfterTaxHouseholdIncome,PopulationDensity
0,West Humber-Clairville,1,33312,30.09,59703,1107.0
1,Mount Olive-Silverstone-Jamestown,2,32954,4.6,46986,7164.0
2,Thistletown-Beaumond Heights,3,10360,3.4,57522,3047.0
3,Rexdale-Kipling,4,10529,2.5,51194,4212.0
4,Elms-Old Rexdale,5,9456,2.9,49425,3261.0


4. ### Combined table with neighbourhood as key

  The three above mentioned tables will be loaded and joined based on FSA code to form a dataframe containing 
  the following columns:
  
 * **CDN_Number**: Three digits designating a neighbourhood (data 1.)
 * **Neighbourhood**: Name of the neighbourhood (data 1.)
 * **Latitude**: the latitudinal coordinate of the center of the area (data 1.)
 * **Longitude**: the longitudinal coordinate of the center of the area (data 1.)
 * **geometry**: a list of latitude - longitude coordinates forming the boundries of the neighbourhood (data 1.)
 * **TotalPopulation**: the total population of the neighbourhood (data 3.)
 * **TotalArea**: the total area in square kilometers (data 3.)
 * **AfterTaxHouseholdIncome**: average household income after tax for the neighbourhod (data 3.)
 * **PopulationDensity**: the population density of the area in persons by square km (TotalPopulation/TotalArea)

  This dataframe will form the features for a neighbourhod and used for the machine learning algorithm

### Example data:

  Only the neighbourhoods in the former city of Toronto have been retained<br>
  After removing several (duplicate) columns and the following columns are available.

In [5]:
# Now join the three dataframes
df_toronto_nbh_tmp = pd.merge(left=df_toronto_nbh_geo,right=df_toronto_nbh_bor,on='CDN_Number')
df_toronto_nbh_tmp = df_toronto_nbh_tmp[df_toronto_nbh_tmp['Borough'] == 'Old City of Toronto']
df_toronto_nbh_tmp.drop(columns=['City-designated area','Borough'],inplace=True)
#df_toronto_nbh.rename(columns={'NeighbourhoodGeo':'Neighbourhood'},inplace=True)
df_toronto_nbh = pd.merge(left=df_toronto_nbh_tmp,right=df_toronto_nbh_sta,on='CDN_Number')
df_toronto_nbh.drop(columns=['Neighbourhood_y'],inplace=True)
df_toronto_nbh.rename(columns={'Neighbourhood_x':'Neighbourhood'},inplace=True)
df_toronto_nbh.sort_values('Neighbourhood',inplace=True)
df_toronto_nbh.reset_index(drop=True,inplace=True)
print('Dimensions: ',df_toronto_nbh.shape)
df_toronto_nbh.head()

Dimensions:  (44, 9)


Unnamed: 0,CDN_Number,Neighbourhood,geometry,Latitude,Longitude,TotalPopulation,TotalArea,AfterTaxHouseholdIncome,PopulationDensity
0,95,Annex,"POLYGON ((-79.39414141500001 43.668720261, -79...",43.671585,-79.404,30526,2.8,49912,10902.0
1,76,Bay Street Corridor,"POLYGON ((-79.38751633 43.650672917, -79.38662...",43.657512,-79.385722,25797,1.8,44614,14332.0
2,69,Blake-Jones,"POLYGON ((-79.34082169200001 43.669213123, -79...",43.676173,-79.337394,7727,0.9,51381,8586.0
3,71,Cabbagetown-South St.James Town,"POLYGON ((-79.376716938 43.662418858, -79.3772...",43.667648,-79.366107,11669,1.4,50873,8335.0
4,96,Casa Loma,"POLYGON ((-79.414693177 43.673910413, -79.4148...",43.681852,-79.408007,10968,1.9,65574,5773.0


### Display the neighbourhoods on a map of Toronto by population density

  The each neighbourhood is shown with a boundary and a color varying from yellow to red, depending on the population density by square kilometer

In [6]:
import folium
# create map of Toronto Neighbourhoods (FSAs) using retrived latitude and longitude values
map_toronto = folium.Map(location=[43.673963, -79.387207], zoom_start=12);
toronto_geojson = "./data/toronto_neighbourhoods.json"
map_toronto.choropleth(geo_data=toronto_geojson,
    data = df_toronto_nbh,
    popup=df_toronto_nbh['Neighbourhood'],
    columns=['Neighbourhood','PopulationDensity'],
    key_on='feature.properties.Neighbourhood',
    fill_color='YlOrRd',
    fill_opacity=0.5, 
    line_opacity=0.2,
    legend_name='Population Density by Neighbourhood')   
# add markers to map
for lat, lng, cdn_number, neighborhood in zip(df_toronto_nbh['Latitude'], df_toronto_nbh['Longitude'], df_toronto_nbh['CDN_Number'], df_toronto_nbh['Neighbourhood']):
    label = '{} - {}'.format(neighborhood, cdn_number)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=2,
        popup=label,
        color='red',
        fill=True,
        #fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_toronto)  
map_toronto.save('toronto_map.html')

In [7]:
%%HTML
<iframe src='toronto_map.html' width=1080 height=540/>

# Data Used - Foursquare API data - Venue Details

The Foursquare API will be used to collect venue data by FSA area. 
This data can then be combined with the FSA statistical data 
to be used by the chosen machine learning algorithm to provide insight in business location

In [8]:
# Set up Foursqaure API credentials
CLIENT_ID = '<client_id>' # your Foursquare ID
CLIENT_SECRET = '<client_credential>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

5. ### Foursquare Venue Catagories:
 
  Each venue on Foursqaure has been assigned to a category.
  This is is the lowest level catagory that is used by Foursqaure.
  
  Foursqaure usually has two levels of categories, the top level like Food, Arts & Entertainment etc.<br>
  Under each category there are several sub-categories.<br>
  For example Food has a long list of sub-categories including different restaurant types, cafes etc.
  
  There is a special entry point in the Foursqaure API to retrieve all categories and sub-categories.<br>
  This data will be stored in a table with the following fields:
  
  * **Category**: top level Foursquare venue catagory
  * **Subcategory**: lower level venue category
  
  The top level category will be used to categorize venues on a top level as well

In [9]:
# build the request to retrieve the Foursquare venue catagories
import os
url = 'https://api.foursquare.com/v2/venues/categories?&client_id={}&client_secret={}&v={}'.format(
   CLIENT_ID, 
   CLIENT_SECRET, 
   VERSION
) 
# initialize variables
dict_cats = {}
list_cats = []
list_subcats = []
# check if the categories csv file already exists, if so then use it
# instead of calling the API
if os.path.exists('data/foursquare_categories.csv'):
    df_cats=pd.read_csv('data/foursquare_categories.csv',index_col=0)
else:
    # request the data from the API
    results = requests.get(url).json()
    # normalize the Json to a dataframe
    df_cats = json_normalize(results['response']['categories'])
    # get each category and sub-category from the categories column
    for idx,row in df_cats.iterrows():
        cats = row['categories']
        for v in cats:
            list_cats.append(row['name'])
            list_subcats.append(v['name'])
    dict_cats['Category'] = list_cats
    dict_cats['Subcategory'] = list_subcats
    # rebuild the dataframe from a dictionary
    df_cats = pd.DataFrame.from_dict(dict_cats)
    # save to csv for later use
    df_cats.to_csv('data/foursquare_categories.csv')
    
df_cats.head()

Unnamed: 0,Category,Subcategory
0,Arts & Entertainment,Amphitheater
1,Arts & Entertainment,Aquarium
2,Arts & Entertainment,Arcade
3,Arts & Entertainment,Art Gallery
4,Arts & Entertainment,Bowling Alley


6. ### Foursquare Venues by Neighbourhood

  Use the Foursquare Venue Explore API endpoint to gather basic data on venues<br>
  with a certain radius based on the central coordinates for the area. 
  
  The data retrieed in JSON format will be stored in a dataframe with the following columns:
  
  * **CDN_Number**: Three digit neighbourhood code
  * **Neighbourhood**: Name of the neighbourhood the venue is located in
  * **Name**: Name of the venue
  * **Latitude**: Latitude coordinate of the venue
  * **Longitude**: Longitude coordinate of the venue
  * **Subcategory**: Lower level category name for the venue
  * **Category**: Highest level category , this will be added later

  **Note**: the venue category will be added to the dataframe using the Foursquares categories dataframe (5)<br>
  **Note**: the venue CDN number and neighbourhood will be checked against the neighbourhoods boundaries

### Get the venues by neighbourhood using the Foursquare API explore endpoint

In [13]:
def get_nearby_venues(cdns, neighbourhoods, latitudes, longitudes, radius=1000):
    
    venues_list=[]
    for cdn, neighbourhood, lat, lng in zip(cdns, neighbourhoods, latitudes, longitudes):
        print(neighbourhood)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        venues = requests.get(url).json()["response"]['groups'][0]['items']
        # add a row for each venue
        for v in venues:
            vnam = v['venue']['name']             # venue name
            vlat = v['venue']['location']['lat']  # venue latitude
            vlng = v['venue']['location']['lng']  # venue longitude
            vcat = v['venue']['categories'][0]['name'] # venue subcategory
            venues_list.append([cdn,neighbourhood,vnam,vlat,vlng,vcat])            

    return(venues_list)

### Process retrieving venues by neighbourhood

  Loop through the neighbourhood dataframe to get the venues within a certain radius<br>
  of the center coordinates of each neighbourhood. Due to the fact that using a radius might<br>
  cause the API to get venues just outside of the current neighbourhood. All the venues found<br>
  will be verified and if necessary set to the correct neighbourhood

In [14]:
LIMIT = 200 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius in meters
# call the API explore endpoint for each neighbourhood
venues_list = get_nearby_venues(cdns=df_toronto_nbh['CDN_Number'],
                                neighbourhoods=df_toronto_nbh['Neighbourhood'],
                                latitudes=df_toronto_nbh['Latitude'],
                                longitudes=df_toronto_nbh['Longitude']
                               )

Annex 
Bay Street Corridor 
Blake-Jones 
Cabbagetown-South St.James Town 
Casa Loma 
Church-Yonge Corridor 
Corso Italia-Davenport 
Danforth 
Dovercourt-Wallace Emerson-Junction 
Dufferin Grove 
East End-Danforth 
Forest Hill North 
Forest Hill South 
Greenwood-Coxwell 
High Park North 
High Park-Swansea 
Junction Area 
Kensington-Chinatown 
Lawrence Park North 
Lawrence Park South 
Little Portugal 
Moss Park 
Mount Pleasant East 
Mount Pleasant West 
Niagara 
North Riverdale 
North St.James Town 
Palmerston-Little Italy 
Playter Estates-Danforth 
Regent Park 
Roncesvalles 
Rosedale-Moore Park 
Runnymede-Bloor West Village 
South Parkdale 
South Riverdale 
The Beaches 
Trinity-Bellwoods 
University 
Waterfront Communities-The Island 
Weston-Pellam Park 
Woodbine Corridor 
Wychwood 
Yonge-Eglinton 
Yonge-St.Clair 


### Build the venues dataframe from the venues list and rename the columns

In [15]:
# build the dataframe from the venues list
df_toronto_ven = pd.DataFrame.from_records(venues_list)
# rename the columns
df_toronto_ven.columns = ['CDN_Number',
              'Neighbourhood', 
              'Venue', 
              'Latitude', 
              'Longitude', 
              'SubCategory']
# display the first 5 rows
print('Dimensions: ', df_toronto_ven.shape)
df_toronto_ven.head()

Dimensions:  (3409, 6)


Unnamed: 0,CDN_Number,Neighbourhood,Venue,Latitude,Longitude,SubCategory
0,95,Annex,Rose & Sons,43.675668,-79.403617,American Restaurant
1,95,Annex,Ezra's Pound,43.675153,-79.405858,Café
2,95,Annex,Roti Cuisine of India,43.674618,-79.408249,Indian Restaurant
3,95,Annex,Fresh on Bloor,43.666755,-79.403491,Vegetarian / Vegan Restaurant
4,95,Annex,Playa Cabana,43.676112,-79.401279,Mexican Restaurant


### Add the category column based on a dictionary lookup using the Foursquare categories dataframe

In [16]:
dict_cats = dict(zip(df_cats['Subcategory'],df_cats['Category']))
df_toronto_ven['Category'] = df_toronto_ven['SubCategory'].map(dict_cats)
df_toronto_ven.head()

Unnamed: 0,CDN_Number,Neighbourhood,Venue,Latitude,Longitude,SubCategory,Category
0,95,Annex,Rose & Sons,43.675668,-79.403617,American Restaurant,Food
1,95,Annex,Ezra's Pound,43.675153,-79.405858,Café,Food
2,95,Annex,Roti Cuisine of India,43.674618,-79.408249,Indian Restaurant,Food
3,95,Annex,Fresh on Bloor,43.666755,-79.403491,Vegetarian / Vegan Restaurant,Food
4,95,Annex,Playa Cabana,43.676112,-79.401279,Mexican Restaurant,Food


### Check for the correct neighbourhood to each venue and correct if necessary

  The geopandas dataframe has a method to check if a geo-coordinate is with in the boundaries<br>
  of an area, in this case neighbourhood boundares. The df_toronto_nbh dataframe has a column<br>
  with these boundaries and can be used to verify the venues geo-location.

In [17]:
# import the Point object
from shapely.geometry import Point
# loop at all the venues
corrected = 0
for i,ven in df_toronto_ven.iterrows():
    # create a Point based on the venues latitude and longitude coordinates
    pnt = Point(ven['Longitude'],ven['Latitude'])
    # get the venues neighbourhood number
    vcd = ven['CDN_Number']
    # print(pnt, vcd )
    # loop at the neighbourhood dataframe
    for j, nbh in df_toronto_nbh.iterrows():
        # check if the venues coordinates are within the neighbourhood's boundaries
        isin = pnt.within(nbh['geometry'])
        # the venue is in the current neighbourhood
        if isin:
            # print('Is in',isin,nbh['CDN_Number'])
            if vcd != nbh['CDN_Number']:
                # print('Changed')
                corrected = corrected + 1
                df_toronto_ven.at[1,'CDN_Number'] = nbh['CDN_Number']  
                df_toronto_ven.at[1,'Neighbourhood'] = nbh['Neighbourhood']  
            break

print(df_toronto_ven.shape[0], 'venues checked')
if corrected:
    print(corrected,' venues corrected')

3409 venues checked
1490  venues corrected


### Display corrected venues dataframe
  **Note**: the neighbourhood of several venues has been corrected

In [18]:
df_toronto_ven.head()

Unnamed: 0,CDN_Number,Neighbourhood,Venue,Latitude,Longitude,SubCategory,Category
0,95,Annex,Rose & Sons,43.675668,-79.403617,American Restaurant,Food
1,98,Rosedale-Moore Park,Ezra's Pound,43.675153,-79.405858,Café,Food
2,95,Annex,Roti Cuisine of India,43.674618,-79.408249,Indian Restaurant,Food
3,95,Annex,Fresh on Bloor,43.666755,-79.403491,Vegetarian / Vegan Restaurant,Food
4,95,Annex,Playa Cabana,43.676112,-79.401279,Mexican Restaurant,Food
