# Capstone Project - The Battle of the Neighborhoods
## Finding the best location to open a fast food restaurant in Toronto

## Table of Contents
### 1. [Introduction](#intro)
### 2. [Data](#data)


<a id='intro'></a>
## 1. Introduction
This project leverages Foursquare location data to identify the best neighborhood in which to open a new fast food restaurant in Toronto. The report describes the project and the processing of the data used, and recommends a neighborhood to the client.

### 1.1 Background Information
One of the many decisions to be made when opening a new business is where to locate it. Many factors may be considered, depending on the specific business. One good indicator of whether an area is promising is whether a similar business already operates in the same area or in a similar one. In this project, we explore how information about different locations can be used to decide where to open a new business. Specifically, we are interested in helping a client find the best location for a new fast food (FF) restaurant in Toronto.

### 1.2 Problem Statement
Our client, Maria, is planning to open a new fast food (FF) restaurant in Toronto but needs guidance on the best neighborhood in which to open it. Maria has not yet decided on the specific brand (e.g. KFC, McDonald's, Burger King), but she is certain that it will be a fast food restaurant. Based on initial conversations with Maria, she believes that a neighborhood with an existing FF restaurant is a viable option. Neighborhoods that look similar (in terms of surrounding venues) to a neighborhood with a FF restaurant are also likely to be good locations. However, opening a FF restaurant in a location that already has many FF restaurants, or in close proximity to another FF restaurant, will create competition that Maria would prefer to avoid for her new business. Since there are so many neighborhoods, Maria would like to know which neighborhood is the best location in which to open a new FF restaurant in Toronto.

### 1.3 Target Audience
This project is targeted at helping a client, Maria, decide where to open a new FF restaurant in Toronto, based on how many FF restaurants are currently in each neighborhood and on how similar each neighborhood is to those with existing FF restaurants. A similar approach can be employed to find the best location for other businesses as well.

<a id='data'></a>
## 2. Data
This project uses data from different sources to address the problem stated. Information about neighborhoods in Toronto is extracted from a web page. Detailed information about the venues in each neighborhood is then gathered from Foursquare for further processing.

### 2.1 Description of Data
The following data are used in the project.
1. Borough and neighborhood information: This information is gathered by scraping a Wikipedia page (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M) that contains a list of the postal codes, boroughs, and neighborhoods of Toronto. It is then stored in a data frame. 
2. Longitude and latitude information: In order to explore a neighborhood using Foursquare, we need the latitude and longitude of the locations. There are two ways in which this information could be gathered. The first approach is to make use of the geocoder package, which provides the latitude and longitude of a given location/postal code. The alternative is to use a CSV file in which this information has already been stored (https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv). The project uses this alternative method and reads the latitude and longitude of each neighborhood from the CSV file.
3. Venues in each neighborhood: In order to analyze the neighborhoods in Toronto, we need more information about the venues around each neighborhood. Foursquare location services provide a way to gather this information and will be used to get data on venues in each neighborhood of Toronto. 

Although it is possible to limit our analysis to only the neighborhoods with 'Toronto' in their names, we decided against this in order to have more neighborhoods and venues to explore when searching for the best location for a FF restaurant. A view of the data is provided below.


In [1]:
from bs4 import BeautifulSoup 
import requests  
import pandas as pd

#### Borough and Neighborhood information 

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M" #the url containing the data
page  = requests.get(url).text # get the page
soup = BeautifulSoup(page,"html5lib") # create a soup object using the variable 'page'
table = soup.find('table') # the data is in the first table so we can use find method  

# Collect one record per assigned postal code, then build the data frame once.
# (DataFrame.append was removed in pandas 2.0, so rows are gathered in a list.)
records = []
for cell in table.findAll('td'):
    if cell.span.text == 'Not assigned':
        continue  # skip postal codes without a borough
    postalCode = cell.p.text[:3]
    borough = cell.span.text.split('(')[0]
    neighbor = cell.span.text.split('(')[1].strip(')').replace(' /', ',').replace(')', ' ').strip(' ')
    records.append({"PostalCode": postalCode, "Borough": borough, "Neighborhood": neighbor})
df = pd.DataFrame(records, columns=["PostalCode", "Borough", "Neighborhood"])
# Rename some of the Boroughs
df['Borough']=df['Borough'].replace({'Downtown TorontoStn A PO Boxes25 The Esplanade':'Downtown Toronto Stn A',
                                             'East TorontoBusiness reply mail Processing Centre969 Eastern':'East Toronto Business',
                                             'EtobicokeNorthwest':'Etobicoke Northwest','East YorkEast Toronto':'East York/East Toronto',
                                             'MississaugaCanada Post Gateway Processing Centre':'Mississauga'})
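The string handling in the loop above can be checked offline on a small fragment. The sketch below is only an illustration: the HTML snippet is hypothetical markup that mimics one table cell (the live page may be structured differently), `parse_cell` is a helper name introduced here, and the built-in `html.parser` is used instead of `html5lib` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two cells of the postal code table.
html = (
    "<table><tr>"
    "<td><p><b>M5A</b><br/><span>Downtown Toronto<br/>(Regent Park / Harbourfront)</span></p></td>"
    "<td><p><b>M2A</b><br/><span>Not assigned</span></p></td>"
    "</tr></table>"
)

def parse_cell(cell):
    """Return (postal code, borough, neighborhoods), or None if unassigned."""
    text = cell.span.get_text()
    if text == 'Not assigned':
        return None
    # Everything before the first '(' is the borough; the rest lists neighborhoods.
    borough, _, rest = text.partition('(')
    neighborhoods = rest.rstrip(')').replace(' /', ',').strip()
    return cell.p.get_text()[:3], borough.strip(), neighborhoods

soup = BeautifulSoup(html, 'html.parser')
parsed = [parse_cell(td) for td in soup.find_all('td')]
records = [r for r in parsed if r is not None]
print(records)
```

Keeping the parsing in a small function makes it easier to spot cells whose text does not follow the expected `Borough(Neighborhoods)` pattern, such as the boroughs that are renamed by hand above.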


In [3]:
print("The neighborhood information: \n")
print(df.head())

The neighborhood information: 

  PostalCode           Borough                      Neighborhood
0        M3A        North York                         Parkwoods
1        M4A        North York                  Victoria Village
2        M5A  Downtown Toronto         Regent Park, Harbourfront
3        M6A        North York  Lawrence Manor, Lawrence Heights
4        M7A      Queen's Park     Ontario Provincial Government


In [4]:
df.shape

(103, 3)

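We see that there are 103 rows, one per assigned postal code. Some rows list several neighborhood names, but for this analysis each row is treated as one neighborhood, giving 103 neighborhoods in Toronto.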

#### Longitude and latitude information

In order to explore the neighborhoods using Foursquare location data, we need the latitude and longitude of each neighborhood. One option for getting this information is to call the geocoder package for each postal code. However, a CSV file containing the latitudes and longitudes of these locations is available, so we read the coordinates from this file and update the data frame to contain the latitude and longitude of each neighborhood.

In [5]:
# Alternatively get coordinates from here
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DS0701EN-SkillsNetwork/labs_v1/Geospatial_Coordinates.csv'
df_coord = pd.read_csv(path)

df_coord.columns = ["PostalCode", "Latitude", "Longitude"]
print("The coordinates information:\n")
print(df_coord.head())
print("\nThe shape of the coordinates information from the file: ",df_coord.shape)


The coordinates information:

  PostalCode   Latitude  Longitude
0        M1B  43.806686 -79.194353
1        M1C  43.784535 -79.160497
2        M1E  43.763573 -79.188711
3        M1G  43.770992 -79.216917
4        M1H  43.773136 -79.239476

The shape of the coordinates information from the file:  (103, 3)


The coordinates are merged into the neighborhood information as shown below.

In [6]:
df = pd.merge(df, df_coord, on='PostalCode')
df

| | PostalCode | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 |
| 1 | M4A | North York | Victoria Village | 43.725882 | -79.315572 |
| 2 | M5A | Downtown Toronto | Regent Park, Harbourfront | 43.654260 | -79.360636 |
| 3 | M6A | North York | Lawrence Manor, Lawrence Heights | 43.718518 | -79.464763 |
| 4 | M7A | Queen's Park | Ontario Provincial Government | 43.662301 | -79.389494 |
| ... | ... | ... | ... | ... | ... |
| 98 | M8X | Etobicoke | The Kingsway, Montgomery Road, Old Mill North | 43.653654 | -79.506944 |
| 99 | M4Y | Downtown Toronto | Church and Wellesley | 43.665860 | -79.383160 |
| 100 | M7Y | East Toronto Business | Enclave of M4L | 43.662744 | -79.321558 |
| 101 | M8Y | Etobicoke | Old Mill South, King's Mill Park, Sunnylea, Hu... | 43.636258 | -79.498509 |


The resulting data frame contains all 103 neighborhoods with their coordinates added.
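Because `pd.merge` defaults to an inner join, a postal code that is missing from the coordinates file would be dropped silently. One way to guard against this, sketched here on a three-row sample (the missing `M5A` coordinate is removed on purpose for illustration and is not a property of the real file), is to merge with `how="left"`, `validate` and `indicator`:

```python
import pandas as pd

neigh = pd.DataFrame({
    "PostalCode": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park, Harbourfront"],
})
coords = pd.DataFrame({
    "PostalCode": ["M3A", "M4A"],  # M5A left out deliberately for this illustration
    "Latitude": [43.753259, 43.725882],
    "Longitude": [-79.329656, -79.315572],
})

# Keep every neighborhood row, fail loudly on duplicate postal codes,
# and record which rows found no matching coordinates.
checked = pd.merge(neigh, coords, on="PostalCode", how="left",
                   validate="one_to_one", indicator=True)
missing = checked.loc[checked["_merge"] == "left_only", "PostalCode"].tolist()
print("Postal codes without coordinates:", missing)
```

If `missing` is empty on the real data, the inner join used in the notebook loses nothing.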

#### Venues in each Neighbourhood

We query Foursquare for the venues in each neighborhood.

In [7]:
# The code was removed by Watson Studio for sharing.

In [8]:
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
RADIUS = 500

In [9]:
# Function to get nearby venues for given latitude and longitude 
def getNearbyVenues(postalCode, boroughs, names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for code, borough, name, lat, lng in zip(postalCode, boroughs, names, latitudes, longitudes):
        print("Getting venues for {} {} ({})".format(code, borough, name))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,  # use the function argument rather than the global RADIUS
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            code,
            borough,
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode',
                'Borough',
                'Neighborhood', 
                'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                'Venue', 
                'Venue Latitude', 
                'Venue Longitude', 
                'Venue Category']
    
    return nearby_venues
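The list comprehension in `getNearbyVenues` assumes that every venue has at least one category. A more defensive variant of that step might look like the sketch below. The helper name `venue_rows` is introduced here for illustration, the nested dictionary layout is the one the function above reads, and the second sample item (a venue without categories) is hypothetical.

```python
def venue_rows(items, code, borough, name, lat, lng):
    """Flatten Foursquare 'items' into tuples, tolerating venues without a category."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        # Fall back to None instead of raising IndexError on an empty list.
        category = categories[0]['name'] if categories else None
        rows.append((code, borough, name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows

# Sample items in the same shape as the response parsed above.
sample_items = [
    {'venue': {'name': 'KFC',
               'location': {'lat': 43.754387, 'lng': -79.333021},
               'categories': [{'name': 'Fast Food Restaurant'}]}},
    {'venue': {'name': 'Uncategorized venue',  # hypothetical edge case
               'location': {'lat': 43.7520, 'lng': -79.3330},
               'categories': []}},
]

rows = venue_rows(sample_items, 'M3A', 'North York', 'Parkwoods', 43.753259, -79.329656)
for row in rows:
    print(row)
```

Rows with a missing category can then be dropped or labeled before the categories are encoded.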

In [10]:
toronto_venues = getNearbyVenues(postalCode=df['PostalCode'],
                                 boroughs=df['Borough'],
                                 names=df['Neighborhood'],
                                 latitudes=df['Latitude'],
                                 longitudes=df['Longitude']
                                )

Getting venues for M3A North York (Parkwoods)
Getting venues for M4A North York (Victoria Village)
Getting venues for M5A Downtown Toronto (Regent Park, Harbourfront)
Getting venues for M6A North York (Lawrence Manor, Lawrence Heights)
Getting venues for M7A Queen's Park (Ontario Provincial Government)
Getting venues for M9A Etobicoke (Islington Avenue)
Getting venues for M1B Scarborough (Malvern, Rouge)
Getting venues for M3B North York (Don Mills North)
Getting venues for M4B East York (Parkview Hill, Woodbine Gardens)
Getting venues for M5B Downtown Toronto (Garden District, Ryerson)
Getting venues for M6B North York (Glencairn)
Getting venues for M9B Etobicoke (West Deane Park, Princess Gardens, Martin Grove, Islington, Cloverdale)
Getting venues for M1C Scarborough (Rouge Hill, Port Union, Highland Creek)
Getting venues for M3C North York (Don Mills South)
Getting venues for M4C East York (Woodbine Heights)
Getting venues for M5C Downtown Toronto (St. James Town)
[output truncated]

In [11]:
print("Venue information sample: \n")
toronto_venues.head()

Venue information sample: 



| | PostalCode | Borough | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|---|
| 0 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | Brookbanks Park | 43.751976 | -79.33214 | Park |
| 1 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | KFC | 43.754387 | -79.333021 | Fast Food Restaurant |
| 2 | M3A | North York | Parkwoods | 43.753259 | -79.329656 | Variety Store | 43.751974 | -79.333114 | Food & Drink Shop |
| 3 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | Victoria Village Arena | 43.723481 | -79.315635 | Hockey Arena |
| 4 | M4A | North York | Victoria Village | 43.725882 | -79.315572 | Tim Hortons | 43.725517 | -79.313103 | Coffee Shop |


In [12]:
(rows, columns) =  toronto_venues.shape
print("We obtained a total of {} venues for all the neighborhoods".format(rows))

We obtained a total of 2150 venues for all the neighborhoods


In our data exploration, we will look at the venue categories to better understand them. We can already see that there is a Fast Food Restaurant category, which is exactly the category of business that our client wants to open. For example, we can check how many venues fall in each category, the total number of distinct venue categories, and the number of fast food restaurants in each neighborhood.

In [13]:
toronto_venues.groupby('Venue Category').count()

| Venue Category | PostalCode | Borough | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude |
|---|---|---|---|---|---|---|---|---|
| Accessories Store | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Adult Boutique | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Airport | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| Airport Food Court | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Airport Gate | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Wine Bar | 7 | 7 | 7 | 7 | 7 | 7 | 7 | 7 |
| Wine Shop | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Wings Joint | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Women's Store | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |


In [21]:
print('There are {} unique categories in the venues data.'.format(len(toronto_venues['Venue Category'].unique())))

There are 278 unique categories in the venues data.
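The `groupby(...).count()` call above repeats the same venue count in every column. As a more compact alternative, the sketch below computes the venues per category, the number of distinct categories, and the number of distinct neighborhoods per category. It runs on the five sample rows shown earlier rather than on the full `toronto_venues` frame, and the variable names are introduced here for illustration.

```python
import pandas as pd

# The five sample venues displayed earlier in this section.
venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Parkwoods",
                     "Victoria Village", "Victoria Village"],
    "Venue": ["Brookbanks Park", "KFC", "Variety Store",
              "Victoria Village Arena", "Tim Hortons"],
    "Venue Category": ["Park", "Fast Food Restaurant", "Food & Drink Shop",
                       "Hockey Arena", "Coffee Shop"],
})

# Venues per category, as a single Series.
category_counts = venues["Venue Category"].value_counts()

# Number of distinct categories.
n_categories = venues["Venue Category"].nunique()

# Number of distinct neighborhoods in which each category appears.
neighborhoods_per_category = venues.groupby("Venue Category")["Neighborhood"].nunique()

print(category_counts)
print(n_categories)
print(neighborhoods_per_category)
```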


In [35]:
print("The number of fast food restaurants in each neighborhood that has at least one.\n")
# Keep only fast food venues, count them per neighborhood, and tidy the column names
ff_restaurants = (toronto_venues[toronto_venues['Venue Category'] == 'Fast Food Restaurant']
                  .groupby('Neighborhood').count()
                  .reset_index()[['Neighborhood', 'Venue Category']])
ff_restaurants.columns = ['Neighborhood', 'No_of_restaurants']
ff_restaurants

The number of fast food restaurants in each neighborhood that has at least one.



| | Neighborhood | No_of_restaurants |
|---|---|---|
| 0 | Bedford Park, Lawrence Manor East | 1 |
| 1 | Church and Wellesley | 2 |
| 2 | Clarks Corners, Tam O'Shanter, Sullivan | 2 |
| 3 | Commerce Court, Victoria Hotel | 1 |
| 4 | Enclave of M4L | 1 |
| 5 | Enclave of M5E | 1 |
| 6 | Fairview, Henry Farm, Oriole | 4 |
| 7 | Garden District, Ryerson | 2 |
| 8 | High Park, The Junction South | 1 |
| 9 | Hillcrest Village | 1 |
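The table above lists only neighborhoods that already have a fast food restaurant. Since the later analysis also needs the neighborhoods with none, it may help to keep the zero counts. A minimal sketch on the five sample venues shown earlier (the names `is_ff` and `ff_counts` are introduced here for illustration):

```python
import pandas as pd

venues = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Parkwoods", "Parkwoods",
                     "Victoria Village", "Victoria Village"],
    "Venue Category": ["Park", "Fast Food Restaurant", "Food & Drink Shop",
                       "Hockey Arena", "Coffee Shop"],
})

# Flag fast food venues, then sum the flags per neighborhood.
# Neighborhoods without any fast food venue keep a count of 0.
is_ff = venues["Venue Category"].eq("Fast Food Restaurant")
ff_counts = (is_ff.groupby(venues["Neighborhood"]).sum()
                  .rename("No_of_restaurants")
                  .reset_index())
print(ff_counts)
```

Note that a neighborhood for which Foursquare returned no venues at all would still be absent from this result and would have to be added back from the neighborhood data frame.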


### 2.2 How the Data will be used
The data will be used to arrive at a recommendation of the neighborhood in which a new FF restaurant is best located. 

We will start out with a data frame containing all the neighborhoods in Toronto. The latitudes and longitudes of the neighborhoods will then be added to the data frame. This information will be used to query Foursquare location services to obtain the venues in each neighborhood. Each venue in the obtained data usually belongs to a venue category. For the purpose of the project, we are particularly interested in the Fast Food Restaurant category, since it describes exactly the business that our client wants to open. However, to better understand the neighborhoods and how similar they are to each other, the complete set of data for all neighborhoods and venue categories will be used to cluster the neighborhoods into groups of similar neighborhoods.

Using the data frame that contains the complete list of venues in each neighborhood, we do some data cleaning to prepare the data for one-hot encoding of the venue categories. We will then use the one-hot encoded data for exploratory analysis of the most common venues in each neighborhood. In order to identify neighborhoods that are similar to one another, the _k_-means model will be used to group the neighborhoods into clusters. The Elbow method is used to identify the best value of _k_. 
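The encoding and clustering steps described here can be sketched as follows. This is an illustration under stated assumptions, not the project's actual pipeline: the four neighborhoods `A` to `D` and their venues are made up, scikit-learn's `KMeans` is assumed as the k-means implementation, and the parameters (`n_init=10`, `random_state=0`, k from 1 to 3) are arbitrary choices.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up venues for four hypothetical neighborhoods.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "C", "C", "D", "D", "D"],
    "Venue Category": ["Park", "Fast Food Restaurant", "Fast Food Restaurant",
                       "Park", "Fast Food Restaurant",
                       "Coffee Shop", "Bar",
                       "Coffee Shop", "Bar", "Bar"],
})

# One-hot encode the categories, then average per neighborhood so that each
# row holds the share of venues in each category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
profiles = onehot.groupby("Neighborhood").mean()

# Elbow method: fit k-means for several k and record the inertia
# (within-cluster sum of squared distances).
inertias = {}
for k in range(1, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(profiles)
    inertias[k] = model.inertia_

# Cluster labels for k = 2.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(profiles)
clusters = dict(zip(profiles.index, labels))
print(inertias)
print(clusters)
```

Plotting `inertias` against k and looking for the point where the curve flattens gives the "elbow". In this toy data the drop from k = 1 to k = 2 is far larger than the drop from k = 2 to k = 3, which points to two clusters.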

Throughout the project report, we will use plots and Folium maps for visual exploratory analysis and understanding of the neighborhoods and other data.

Once we have the clusters defined, we will explore each cluster more closely and find the average number of FF restaurants per neighborhood in each cluster. Our goal is to find the 2 clusters with the highest average number of FF restaurants. We use the top 2 clusters, rather than only the top one, in order to have more neighborhoods to explore further. Although a high average number of FF restaurants in a cluster suggests that there is demand for fast food in that cluster, we also want to limit the competition that the business would face when opened in a neighborhood within the cluster. To find the best neighborhood for the business, we will calculate a distance matrix showing the distance between each neighborhood with no existing FF restaurant and each neighborhood with at least one FF restaurant, all within the best clusters identified earlier. The distance to a neighborhood with multiple FF restaurants will be multiplied by the number of restaurants in that neighborhood for a better analysis of the competition that may be faced. Finally, for each neighborhood in the best clusters that currently has no FF restaurant, we compute its average distance to the existing FF neighborhoods and select the neighborhood with the maximum average. The neighborhood with the maximum average distance faces the least competition from existing FF restaurants in the best clusters, while still lying in a cluster where FF restaurants are in high demand. 
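The distance step described above can be sketched as follows. Several things here are assumptions made for illustration: the great-circle (haversine) formula is used as the distance measure, the helper name `haversine_km` is introduced here, and the split of neighborhoods into candidates and existing FF neighborhoods is hypothetical (the coordinates come from the data shown earlier, but the cluster membership and the candidates' lack of FF restaurants are not established results).

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres; inputs in degrees, broadcastable arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Hypothetical candidates: neighborhoods assumed to have no FF restaurant.
candidates = pd.DataFrame({
    "Neighborhood": ["Victoria Village", "Lawrence Manor, Lawrence Heights"],
    "Latitude": [43.725882, 43.718518],
    "Longitude": [-79.315572, -79.464763],
})
# Hypothetical competitors: neighborhoods with at least one FF restaurant.
existing = pd.DataFrame({
    "Neighborhood": ["Parkwoods", "Church and Wellesley"],
    "Latitude": [43.753259, 43.665860],
    "Longitude": [-79.329656, -79.383160],
    "No_of_restaurants": [1, 2],
})

# Distance matrix: one row per candidate, one column per existing FF neighborhood.
dist = haversine_km(candidates["Latitude"].to_numpy()[:, None],
                    candidates["Longitude"].to_numpy()[:, None],
                    existing["Latitude"].to_numpy()[None, :],
                    existing["Longitude"].to_numpy()[None, :])

# Multiply each column by the number of restaurants, as described in the text,
# then average across the existing FF neighborhoods.
weighted = dist * existing["No_of_restaurants"].to_numpy()[None, :]
candidates["AvgWeightedDistance"] = weighted.mean(axis=1)

best = candidates.loc[candidates["AvgWeightedDistance"].idxmax(), "Neighborhood"]
print(candidates)
print("Candidate with the maximum average weighted distance:", best)
```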

The analysis will lead to the identification of the best 2 neighborhoods in which our client may open a new fast food restaurant in Toronto. These neighborhoods are shown on a map together with the other neighborhoods and their clusters. Neighborhoods with existing fast food restaurants will also be marked on the final map, for a visual identification of the best neighborhoods found for the new business.
