# Capstone Project - The Battle of the Neighborhoods Final Assignment
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

"Where do I open my business?" The intention of this project is to help answer that question by analyzing **DATA** that, for our benefit, is publicly accessible to all!

In my case, I want to open a **Coffee Shop**.

The goal is to support the best possible answer to that question.

In this sense, the best decision is to **avoid competing with other businesses of the same category** to be able to capture more **"New Customers"**. The location analyzed in this case will be Toronto, Canada.

Many people, like me, are thinking about starting their **own business**, and I expect that all of them face the same question.

## Data <a name="data"></a>

### Global Steps

First, we obtain the information about the neighborhoods of Toronto, Canada by scraping [this Wikipedia article](https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M).

At this point, our DataFrame holds the information about **all** the neighborhoods, but we still need to group its rows by neighborhood so that each name appears only once.

Later, we use the **Nominatim API** within **geopy** to obtain the latitude and longitude of every neighborhood.

In the next step, we use the **Foursquare API** to obtain the venues around every located neighborhood. We then keep only the coffee shops, since those are the venues that represent our competition.

Finally, with all the information, we build **clusters** to find the best possible locations for our business.

### Scraping Wikipedia article

In [2]:
# Import libraries
import pandas as pd
import requests as r

# GET to link
req = r.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')

# Loading the info into a DataFrame
df_postcode = pd.read_html(req.text)[0]

df_postcode.head()

Unnamed: 0,Postcode,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,Harbourfront


In [3]:
# We're going to ignore the "Not Assigned" neighborhoods
df_postcode = df_postcode[df_postcode["Neighbourhood"] != "Not assigned"]

# Group the neighborhoods
df_neighborhood_grouped = df_postcode.groupby(["Neighbourhood"]).count().reset_index()[["Neighbourhood"]]

df_neighborhood_grouped.head()

Unnamed: 0,Neighbourhood
0,Adelaide
1,Agincourt
2,Agincourt North
3,Albion Gardens
4,Alderwood


In [4]:
# How many neighbourhoods do we have?
df_neighborhood_grouped.count()

Neighbourhood    207
dtype: int64

### Neighborhoods Coordinates

We're going to use the Nominatim API included in the geopy library. We know that the Nominatim API may fail, so we retry up to 2 more times if a neighborhood cannot be geolocated on the first attempt.

In [5]:
# Import libraries
import geopy as gp
import time as t
from geopy.geocoders import Nominatim
from geopy.exc import GeopyError

# Initialize an instance of Nominatim geolocator
geolocator = Nominatim(user_agent = "coursera_capstone_final_assignment")

for index, row in df_neighborhood_grouped.iterrows():
    # Initialize control variables
    located = False
    number_attempt = 1
    latitude = 0.0
    longitude = 0.0
    
    while not located and number_attempt < 4:
        # The API doesn't allow more than 1 request per second
        t.sleep(1)

        # try/except to detect failures
        try:
            # Query = Neighbourhood + City + Country
            direction = str(row["Neighbourhood"]).strip().lower() + ", toronto, canada"

            # Try to geolocate
            location = geolocator.geocode(direction, addressdetails = True)

            # Obtain the json from geolocator
            location_dict = location.raw

            # Find in the dictionary the coordinates
            for key, value in location_dict.items():
                if key == "lat":
                    latitude = value

                if key == "lon":
                    longitude = value

            # Update DataFrame
            df_neighborhood_grouped.loc[index, "Latitude"] = latitude
            df_neighborhood_grouped.loc[index, "Longitude"] = longitude
            print("Neighbourhood " + row["Neighbourhood"] + " located in attempt #" + str(number_attempt))
            located = True

        except (GeopyError, AttributeError):
            print("Neighbourhood " + row["Neighbourhood"] + " not located in attempt #" + str(number_attempt))
            number_attempt += 1

Neighbourhood Adelaide located in attempt #1
Neighbourhood Agincourt located in attempt #1
Neighbourhood Agincourt North located in attempt #1
Neighbourhood Albion Gardens located in attempt #1
Neighbourhood Alderwood located in attempt #1
Neighbourhood Bathurst Manor located in attempt #1
Neighbourhood Bathurst Quay located in attempt #1
Neighbourhood Bayview Village located in attempt #1
Neighbourhood Beaumond Heights not located in attempt #1
Neighbourhood Beaumond Heights not located in attempt #2
Neighbourhood Beaumond Heights not located in attempt #3
Neighbourhood Bedford Park located in attempt #1
Neighbourhood Berczy Park located in attempt #1
Neighbourhood Birch Cliff located in attempt #1
Neighbourhood Bloordale Gardens located in attempt #1
Neighbourhood Brockton located in attempt #1
Neighbourhood Business Reply Mail Processing Centre 969 Eastern not located in attempt #1
Neighbourhood Business Reply Mail Processing Centre 969 Eastern not located in attempt #2
Neighbourhoo

Neighbourhood Railway Lands not located in attempt #3
Neighbourhood Rathnelly located in attempt #1
Neighbourhood Richmond located in attempt #1
Neighbourhood Richview Gardens located in attempt #1
Neighbourhood Riverdale located in attempt #1
Neighbourhood Roncesvalles located in attempt #1
Neighbourhood Rosedale located in attempt #1
Neighbourhood Roselawn located in attempt #1
Neighbourhood Rouge located in attempt #1
Neighbourhood Rouge Hill located in attempt #1
Neighbourhood Royal York South East not located in attempt #1
Neighbourhood Royal York South East located in attempt #2
Neighbourhood Royal York South West located in attempt #1
Neighbourhood Runnymede located in attempt #1
Neighbourhood Ryerson located in attempt #1
Neighbourhood Scarborough Town Centre located in attempt #1
Neighbourhood Scarborough Village located in attempt #1
Neighbourhood Scarborough Village West located in attempt #1
Neighbourhood Silver Hills located in attempt #1
Neighbourhood Silverstone located 
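
The retry logic above can also be isolated into a small helper. The following is a minimal sketch, not the notebook's actual code: `geocode_with_retries` and `flaky_geocode` are hypothetical names, and the stand-in geocoder simulates failures so that the example runs without network access. It also catches every exception, which is broader than the `GeopyError` and `AttributeError` handled in the cell above.

```python
import time
from typing import Callable, Optional, Tuple

Coordinates = Tuple[float, float]


def geocode_with_retries(
    geocode: Callable[[str], Optional[Coordinates]],
    query: str,
    max_attempts: int = 3,
    delay_s: float = 1.0,
) -> Optional[Coordinates]:
    """Call geocode(query) up to max_attempts times.

    Returns the first non-None result, or None if every attempt
    raised an exception or returned None.
    """
    for _ in range(max_attempts):
        # Wait before every call to respect a one-request-per-second limit.
        time.sleep(delay_s)
        try:
            result = geocode(query)
        except Exception:  # assumption: any error counts as a failed attempt
            result = None
        if result is not None:
            return result
    return None


# Hypothetical stand-in for a real geocoder: fails twice, then succeeds.
calls = {"count": 0}


def flaky_geocode(query: str) -> Optional[Coordinates]:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated service failure")
    return (43.6017173, -79.5452325)


coords = geocode_with_retries(flaky_geocode, "alderwood, toronto, canada", delay_s=0.0)
print(coords, "after", calls["count"], "attempts")
```

Passing the geocoder as a function keeps the retry policy separate from the service being called, so the same helper could wrap `geolocator.geocode` with a small adapter that extracts the latitude and longitude.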

Next, we check how many neighborhoods were left without coordinates:

In [6]:
# Import numpy library to work with NaN values
import numpy as np

df_neighborhood_not_located = df_neighborhood_grouped[np.isnan(df_neighborhood_grouped["Latitude"].astype(float))]

df_neighborhood_not_located.count()

Neighbourhood    13
Latitude          0
Longitude         0
dtype: int64

In [7]:
df_neighborhood_not_located.head(len(df_neighborhood_not_located))

Unnamed: 0,Neighbourhood,Latitude,Longitude
8,Beaumond Heights,,
14,Business Reply Mail Processing Centre 969 Eastern,,
15,CFB Toronto,,
18,Caledonia-Fairbanks,,
19,Canada Post Gateway Processing Centre,,
35,Del Ray,,
73,Humber Bay Shores,,
77,Humewood-Cedarvale,,
80,Island airport,,
117,North Midtown,,


Now that we know that fewer than 10% of the neighborhoods could not be located, we check for duplicates, that is, neighborhoods that were geocoded to exactly the same coordinates:

In [37]:
df_neighborhood_deduplicate = df_neighborhood_grouped.groupby(by = ["Latitude", "Longitude"]).size().to_frame().reset_index()

df_neighborhood_deduplicate.columns = ["Latitude", "Longitude", "Quantity_Neighborhoods"]

df_neighborhood_deduplicate.sort_values(by = "Quantity_Neighborhoods", ascending = False).head()

Unnamed: 0,Latitude,Longitude,Quantity_Neighborhoods
124,43.7492988,-79.462248,6
48,43.653963,-79.387207,4
5,43.6166773,-79.4968048,3
136,43.7615095,-79.4109234,3
15,43.6400801,-79.3801495,3


In [43]:
for index, row in df_neighborhood_deduplicate.iterrows():
    lat = row["Latitude"]
    lon = row["Longitude"]
    
    df_filtered = df_neighborhood_grouped[(df_neighborhood_grouped["Latitude"] == lat) & (df_neighborhood_grouped["Longitude"] == lon)]
    
    neighborhood = df_filtered.iat[0, 0]
    
    df_neighborhood_deduplicate.at[index, "Neighborhood"] = neighborhood

df_neighborhood_deduplicate.head()

Unnamed: 0,Latitude,Longitude,Quantity_Neighborhoods,Neighborhood
0,43.59200455,-79.5453645065959,1,Long Branch
1,43.6007625,-79.505264,1,New Toronto
2,43.6017173,-79.5452325,1,Alderwood
3,43.6093093,-79.5677317,1,The Queensway East
4,43.6161962,-79.3693582,1,Toronto Islands


In [44]:
# Replace the original DataFrame with the deduplicated one (copy to avoid chained-assignment warnings)
df_neighborhood_grouped = df_neighborhood_deduplicate[["Neighborhood", "Latitude", "Longitude"]].copy()

df_neighborhood_grouped.columns = ["Neighbourhood", "Latitude", "Longitude"]

df_neighborhood_grouped.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Long Branch,43.59200455,-79.5453645065959
1,New Toronto,43.6007625,-79.505264
2,Alderwood,43.6017173,-79.5452325
3,The Queensway East,43.6093093,-79.5677317
4,Toronto Islands,43.6161962,-79.3693582
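
The deduplication above keeps one name per coordinate pair and counts how many names shared it. A minimal plain-Python sketch of the same idea follows; the rows are hypothetical, and unlike the pandas `groupby` used in the notebook, this version keeps the groups in order of first appearance rather than sorted by coordinates.

```python
# Hypothetical geocoding results: (name, latitude, longitude).
rows = [
    ("A", 43.70, -79.40),
    ("B", 43.70, -79.40),   # geocoded to the same point as "A"
    ("C", 43.65, -79.38),
]

# Collect the names that share each exact coordinate pair.
groups = {}
for name, lat, lon in rows:
    groups.setdefault((lat, lon), []).append(name)

# Keep the first name of each group and record the group size.
deduplicated = [
    {"Neighbourhood": names[0], "Latitude": lat, "Longitude": lon, "Quantity": len(names)}
    for (lat, lon), names in groups.items()
]

for record in deduplicated:
    print(record)
```

Note that this relies on exact equality of the coordinates, as the notebook does, so two points that differ in the last decimal place are treated as different locations.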


Finally, with the help of the **folium** library, we plot the located neighborhoods on a map to see whether we need coordinates for the remaining neighborhoods or whether we can leave their areas out of the analysis:

In [45]:
# Import folium library to work with maps
import folium as f

# Create a map in center of Toronto
lat_toronto = 43.734070
lon_toronto = -79.347015

toronto_map = f.Map(location = [lat_toronto, lon_toronto], zoom_start = 11)

# Loop to create one Marker per Neighbourhood
for index, row in df_neighborhood_grouped.iterrows():
    # Check NaN value
    if np.isnan(float(row["Latitude"])) == False:
        f.Marker([row["Latitude"], row["Longitude"]], popup = row["Neighbourhood"]).add_to(toronto_map)

# Show Map
toronto_map

As we can see, few areas of the map are left uncovered. We will therefore leave the 13 neighborhoods that could not be located out of our analysis, since we understand that they will not affect it.

### Neighborhoods Areas of Places

Now that we have the neighborhoods located, we define the radius of analysis for each one, that is, the radius (in meters) within which we will ask Foursquare for information on places.
To do this, we compute, for each neighborhood, the straight-line distance to every other neighborhood and take half of the smallest distance as its radius. In that way, we cover as much of Toronto's territory as possible without overlapping circles.

First, we define the function to calculate distance between 2 points:

In [46]:
# Import math to use trigonometric functions
import math as m

# Earth radius
RADIUS_EARTH_KM = 6371

# Function to calculate distance between 2 coordinates
def calculate_distance_km(lat1, lon1, lat2, lon2):
    dlon = (lon2 - lon1) * (m.pi / 180)
    dlat = (lat2 - lat1) * (m.pi / 180)

    dlat1 = lat1 * (m.pi / 180)
    dlat2 = lat2 * (m.pi / 180)

    res1 = m.sin(dlat / 2) * m.sin(dlat / 2) + m.sin(dlon / 2) * m.sin(dlon / 2) * m.cos(dlat1) * m.cos(dlat2)
    res2 = 2 * m.atan2(m.sqrt(res1), m.sqrt(1 - res1))

    return res2 * RADIUS_EARTH_KM
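
As a sanity check, here is an equivalent sketch of the same haversine formula, applied to two neighborhoods whose coordinates appear in the tables above (Long Branch and Alderwood). The name `haversine_km` is ours and is not used elsewhere in the notebook; the Earth radius is the same 6371 km assumed above.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.atan2(math.sqrt(a), math.sqrt(1 - a))


# Coordinates copied from the deduplicated table in this notebook.
long_branch = (43.59200455, -79.5453645065959)
alderwood = (43.6017173, -79.5452325)

distance_m = haversine_km(*long_branch, *alderwood) * 1000
print(round(distance_m, 1))      # distance between the two centres, in metres
print(round(distance_m / 2, 1))  # half of that distance
```

If these two neighborhoods are each other's nearest neighbor, half of this distance should be close to the `Radius` value that the next step assigns to both of them.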

Now, we use the function defined to calculate distance between every neighborhood:

In [47]:
# We only work with located neighborhoods
df_neighborhood_grouped = df_neighborhood_grouped[~np.isnan(df_neighborhood_grouped["Latitude"].astype(float))].copy()

# Loop to use the function
for index, row in df_neighborhood_grouped.iterrows():
    neighborhood = row["Neighbourhood"]
    lat_1 = float(row["Latitude"])
    lon_1 = float(row["Longitude"])
    
    df_other_neighborhood = df_neighborhood_grouped[df_neighborhood_grouped["Neighbourhood"] != neighborhood][["Neighbourhood", "Latitude", "Longitude"]]

    for index_2, row_2 in df_other_neighborhood.iterrows():
        neighborhood_2 = row_2["Neighbourhood"]
        lat_2 = float(row_2["Latitude"])
        lon_2 = float(row_2["Longitude"])
        
        distance_m = calculate_distance_km(lat_1, lon_1, lat_2, lon_2) * 1000
        
        df_other_neighborhood.at[index_2, "Distance_m"] = distance_m

    # Order by distance to take the least value
    df_other_neighborhood = df_other_neighborhood[df_other_neighborhood["Distance_m"] > 0]

    df_other_neighborhood_ordered = df_other_neighborhood.sort_values("Distance_m", ascending = True, inplace = False)
    
    neighborhood_radius = df_other_neighborhood_ordered.iat[0, 3]
    
    df_neighborhood_grouped.at[index, "Radius"] = neighborhood_radius / 2

df_neighborhood_grouped.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude,Radius
0,Long Branch,43.59200455,-79.5453645065959,540.030419
1,New Toronto,43.6007625,-79.505264,948.090503
2,Alderwood,43.6017173,-79.5452325,540.030419
3,The Queensway East,43.6093093,-79.5677317,999.303163
4,Toronto Islands,43.6161962,-79.3693582,1243.510604
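
The radius rule can be illustrated on a toy example. The sketch below assumes three hypothetical centres on a flat plane (coordinates in meters) and uses `math.dist` instead of the haversine function, purely to keep the example short.

```python
import math

# Hypothetical centres on a flat plane, coordinates in metres.
centres = {
    "A": (0.0, 0.0),
    "B": (1000.0, 0.0),
    "C": (1000.0, 3000.0),
}

radii = {}
for name, point in centres.items():
    # Distance to the closest other centre.
    nearest = min(
        math.dist(point, other)
        for other_name, other in centres.items()
        if other_name != name
    )
    # The radius is half of that distance.
    radii[name] = nearest / 2

print(radii)
```

Because each radius is at most half the distance to any other centre, two circles can touch but never overlap. Isolated centres such as "C" receive a large radius, which is worth keeping in mind when comparing venue counts between neighborhoods.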


With the radii defined, we will add circles to our map to visualize how the defined areas look:

In [48]:
# Loop to add one Circle per neighborhood
for index, row in df_neighborhood_grouped.iterrows():
    f.Circle([row["Latitude"], row["Longitude"]], radius = row["Radius"], fill = True, fill_color = 'red', fill_opacity = 0.6).add_to(toronto_map)

# Show map
toronto_map

### Foursquare

At this point, we are able to use the Foursquare API to obtain different places that make up each neighborhood in Toronto:

In [49]:
# Import requests library to use Foursquare API
import requests as r

# Obtain Foursquare credentials from file
df_credentials = pd.read_csv('foursquare_credentials.csv')

# Foursquare variables
CLIENT_ID = df_credentials.iat[0, 0]
CLIENT_SECRET = df_credentials.iat[0, 1]
VERSION = df_credentials.iat[0, 2]
LIMIT = 100

# Define function to obtain venues
def getNearbyVenues(names, latitudes, longitudes, radii):
    venues_list = []
    for name, lat, lng, radius in zip(names, latitudes, longitudes, radii):
        print("Obtaining venues for " + name)

        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        try:
            # Make the GET request
            results = r.get(url).json()["response"]['groups'][0]['items']

            # Return only relevant information for each nearby venue
            venues_list.append([(
                name, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        except (r.RequestException, KeyError, IndexError, ValueError):
            print("No venues for " + name)

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

    return(nearby_venues)

# Obtain Toronto venues
df_toronto_venues = getNearbyVenues( names = df_neighborhood_grouped['Neighbourhood'],
                                     latitudes = df_neighborhood_grouped['Latitude'],
                                     longitudes = df_neighborhood_grouped['Longitude'],
                                     radii = df_neighborhood_grouped['Radius'])

df_toronto_venues.head()

Obtaining venues for Long Branch
Obtaining venues for New Toronto
Obtaining venues for Alderwood
Obtaining venues for The Queensway East
Obtaining venues for Toronto Islands
Obtaining venues for Mimico NE
Obtaining venues for The Queensway West
Obtaining venues for East Toronto
Obtaining venues for Markland Wood
Obtaining venues for Cloverdale
Obtaining venues for Exhibition Place
Obtaining venues for Bloordale Gardens
Obtaining venues for Bathurst Quay
Obtaining venues for Old Burnhamthorpe
Obtaining venues for Humber Bay
Obtaining venues for Harbourfront
Obtaining venues for Princess Gardens
Obtaining venues for Parkdale
Obtaining venues for Sunnylea
Obtaining venues for South Niagara
Obtaining venues for CN Tower
Obtaining venues for Union Station
Obtaining venues for Swansea
Obtaining venues for Islington
Obtaining venues for King and Spadina
Obtaining venues for North Toronto West
Obtaining venues for Toronto Dominion Centre
Obtaining venues for The Kingsway
Obtaining venues for L

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Long Branch,43.59200455,-79.5453645065959,Woody's Burgers bar and grill,43.592424,-79.541825,Burger Joint
1,Long Branch,43.59200455,-79.5453645065959,Fair Grounds Cafe & Roastery,43.592465,-79.541579,Café
2,Long Branch,43.59200455,-79.5453645065959,Marie Curtis Park,43.58925,-79.544338,Park
3,Long Branch,43.59200455,-79.5453645065959,Empanada Company,43.592213,-79.541679,South American Restaurant
4,Long Branch,43.59200455,-79.5453645065959,Busters Fish House,43.592863,-79.540045,Seafood Restaurant
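
The request URL in `getNearbyVenues` is assembled with `str.format`. As a hedged alternative, the sketch below builds the same query string with `urllib.parse.urlencode` from the standard library, which takes care of escaping. The function name `build_explore_url`, the placeholder credentials, the version date, and the rounded coordinates are assumptions for illustration only; the endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import parse_qs, urlencode, urlparse

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit=100):
    """Build the explore URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }
    return f"{EXPLORE_ENDPOINT}?{urlencode(params)}"


# Placeholder values, not real credentials.
url = build_explore_url("MY_ID", "MY_SECRET", "20180605", 43.592, -79.5454, 540)
print(url)

# Parse the query string back to confirm that every parameter survived the encoding.
params_back = parse_qs(urlparse(url).query)
print(params_back["ll"], params_back["radius"])
```

No request is sent here; the sketch only shows how the URL could be constructed and inspected before passing it to `requests.get`.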


## Methodology <a name="methodology"></a>

In our project we will find the most likely neighborhoods to open our coffee shop, understanding that we are looking for neighborhoods with little competition.

To do this, we located the different neighborhoods of Toronto on a map and defined an analysis radius for each one.

We are then able to use the Foursquare API to obtain the missing information about places that characterize each neighborhood.

At that point, we use k-means clustering to group the neighborhoods and, from there, define at least 5 candidates for our business.

## Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis and derive some additional information from our raw data.

First, let's understand the **different venue categories in Toronto**:

In [50]:
df_toronto_venues.columns = ["Neighborhood", "Neighborhood_Latitude", "Neighborhood_Longitude", "Venue", "Venue_Latitude", "Venue_Longitude", "Venue_Category"]

df_categories = df_toronto_venues.Venue_Category.value_counts().to_frame().reset_index()

df_categories.columns = ["Category", "Count"]

df_categories.head()

Unnamed: 0,Category,Count
0,Coffee Shop,236
1,Pizza Place,99
2,Café,96
3,Park,91
4,Sandwich Place,88


As we can see, there are at least 2 categories we're interested in: "Coffee Shop" and "Café". Let's verify if we have other relevant categories for our analysis:

In [51]:
df_category_begin_c = df_categories[df_categories["Category"].str[0] == 'C']

df_category_begin_c.head(len(df_category_begin_c))

Unnamed: 0,Category,Count
0,Coffee Shop,236
2,Café,96
10,Clothing Store,60
13,Chinese Restaurant,52
28,Convenience Store,26
41,Cosmetics Shop,20
49,Caribbean Restaurant,17
77,Cocktail Bar,10
99,Cantonese Restaurant,7
128,Cheese Shop,4


OK, now we understand that there is no other category relevant to us.

With our 2 categories defined, **Coffee Shop** and **Café**, we're going to analyze how the venues in these categories are distributed across the neighborhoods:

In [52]:
coffee_categories = ["Coffee Shop", "Café"]

df_coffee_venues = df_toronto_venues[df_toronto_venues.Venue_Category.isin(coffee_categories)]

df_venues_grouped = df_coffee_venues.Neighborhood.value_counts().to_frame().reset_index()

df_venues_grouped.columns = ["Neighborhood", "Quantity_Venues"]

df_venues_grouped.count()

Neighborhood       95
Quantity_Venues    95
dtype: int64

In [53]:
# Top ten neighborhoods with the most Coffee Shops and Cafés
df_venues_grouped.head(10)

Unnamed: 0,Neighborhood,Quantity_Venues
0,Willowdale,15
1,Yorkville,14
2,Roselawn,11
3,Newtonbrook,10
4,Harbourfront,9
5,The Junction North,8
6,Deer Park,8
7,The Beaches,7
8,Runnymede,7
9,Kensington Market,7


We can use the "radius" defined for each neighborhood to calculate a "density" of relevant venues (the number of venues divided by the radius in meters), which gives us another metric to use in our analysis:

In [54]:
# Change the name of Neighborhood column
df_neighborhood_grouped.columns = ['Neighborhood', 'Latitude', 'Longitude', 'Radius']

# Obtain the radius for each Neighborhood
df_venues_grouped = pd.merge(df_venues_grouped, df_neighborhood_grouped, on = 'Neighborhood', how = 'inner')

# Calculate the density of interested venues for each Neighborhood
df_venues_grouped["Venues_Density"] = df_venues_grouped["Quantity_Venues"] / df_venues_grouped["Radius"]

df_venues_grouped.head()

Unnamed: 0,Neighborhood,Quantity_Venues,Latitude,Longitude,Radius,Venues_Density
0,Willowdale,15,43.7615095,-79.4109234,986.305994,0.015208
1,Yorkville,14,43.6713861,-79.3901677,414.454103,0.033779
2,Roselawn,11,43.7098517,-79.4042948,704.952281,0.015604
3,Newtonbrook,10,43.7938863,-79.4256790230105,1895.021253,0.005277
4,Harbourfront,9,43.6400801,-79.3801495,256.488743,0.035089


If we order the neighborhoods by the density of relevant venues, we get a better idea of where we will find more competition from other businesses:

In [56]:
df_venues_grouped.sort_values(by = 'Venues_Density', ascending = False, inplace = True)

df_venues_grouped.head(20)

Unnamed: 0,Neighborhood,Quantity_Venues,Latitude,Longitude,Radius,Venues_Density
51,First Canadian Place,3,43.6487681,-79.3816917928303,78.804068,0.038069
4,Harbourfront,9,43.6400801,-79.3801495,256.488743,0.035089
1,Yorkville,14,43.6713861,-79.3901677,414.454103,0.033779
9,Kensington Market,7,43.6552136,-79.4022604,212.47358,0.032945
22,Central Bay Street,5,43.660912,-79.3858973,188.795342,0.026484
26,Queen's Park,4,43.6606092,-79.3905725,188.795342,0.021187
6,Deer Park,8,43.68809,-79.3940935,384.619903,0.0208
93,Commerce Court,1,43.64809515,-79.3790207204735,48.202675,0.020746
41,Union Station,3,43.6446934,-79.3801320058597,156.945027,0.019115
92,Toronto Dominion Centre,1,43.64736955,-79.3813733580709,54.697236,0.018282


Now we have an idea of the 20 non-candidate Neighborhoods for our goal. With that in mind, we continue with our cluster analysis of the Neighborhoods:

In [59]:
# One-hot encoding for venues categories
toronto_onehot = pd.get_dummies(df_toronto_venues[['Venue_Category']], prefix = "", prefix_sep = "")

toronto_onehot['Neighborhood'] = df_toronto_venues['Neighborhood']

fixed_columns = [toronto_onehot.columns[-1]] + list(toronto_onehot.columns[:-1])
toronto_onehot = toronto_onehot[fixed_columns]

toronto_onehot.head()

Unnamed: 0,Zoo Exhibit,Accessories Store,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,Aquarium,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
df_toronto_grouped = toronto_onehot.groupby('Neighborhood').mean().reset_index()

df_toronto_grouped.head()

Unnamed: 0,Neighborhood,Zoo Exhibit,Accessories Store,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Agincourt North,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.071429,0.0,0.0
3,Albion Gardens,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [61]:
# We add the venue density to our DataFrame
df_toronto_grouped = pd.merge(df_toronto_grouped, df_venues_grouped, on = 'Neighborhood', how = 'inner')

df_toronto_grouped.head()

Unnamed: 0,Neighborhood,Zoo Exhibit,Accessories Store,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Quantity_Venues,Latitude,Longitude,Radius,Venues_Density
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1,43.6504863,-79.3794979135182,110.563351,0.009045
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1,43.7853531,-79.2785494,820.253837,0.001219
2,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1,43.6017173,-79.5452325,540.030419,0.001852
3,Bathurst Quay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4,43.6357905,-79.398329,373.965651,0.010696
4,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1,43.64798435,-79.3753959113886,91.656007,0.01091


In [63]:
df_toronto_grouped.drop(columns = ['Latitude', 'Longitude', 'Radius'], inplace = True)

df_toronto_grouped.head()

Unnamed: 0,Neighborhood,Zoo Exhibit,Accessories Store,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Quantity_Venues,Venues_Density
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.009045
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.001219
2,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.001852
3,Bathurst Quay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,0.010696
4,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,0.01091


In [64]:
df_toronto_grouped.drop(columns = ['Quantity_Venues'], inplace = True)

df_toronto_grouped.head()

Unnamed: 0,Neighborhood,Zoo Exhibit,Accessories Store,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Venues_Density
0,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009045
1,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001219
2,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001852
3,Bathurst Quay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010696
4,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01091


In [66]:
# Import scikit-learn to apply the k-means algorithm
from sklearn.cluster import KMeans

# We define 5 clusters for our analysis
kclusters = 5

toronto_grouped_clustering = df_toronto_grouped.drop(columns = 'Neighborhood')

kmeans = KMeans(n_clusters = kclusters, random_state = 0).fit(toronto_grouped_clustering)

kmeans.labels_[0:10]

array([3, 4, 2, 1, 3, 3, 1, 3, 3, 1])
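
To make the clustering step less of a black box, here is a toy sketch of the two alternating steps of k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is a simplified one-dimensional illustration with hypothetical data and caller-supplied starting centroids, not the scikit-learn implementation, which by default chooses its starting centroids with k-means++.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Toy k-means on 1-D data with caller-supplied initial centroids."""
    centroids = list(centroids)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# Hypothetical densities forming two obvious groups.
densities = [0.001, 0.002, 0.003, 0.030, 0.033, 0.036]
labels, centroids = kmeans_1d(densities, centroids=[0.0, 0.05])
print(labels)
print(centroids)
```

The label numbers themselves carry no meaning: they only say which points were grouped together, which is why the cluster numbers in the notebook have to be interpreted afterwards.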

In [72]:
# We add the cluster label to our original df
df_toronto_grouped.insert(0, 'Cluster Labels', kmeans.labels_)

df_toronto_grouped.head()

Unnamed: 0,Cluster Labels,Neighborhood,Zoo Exhibit,Accessories Store,Afghan Restaurant,African Restaurant,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Video Store,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Venues_Density
0,3,Adelaide,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009045
1,4,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001219
2,2,Alderwood,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.001852
3,1,Bathurst Quay,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010696
4,3,Berczy Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01091


OK, we have our clusters defined! Now let's see the neighborhoods on the map:

In [74]:
import matplotlib.cm as cm
import matplotlib.colors as colors

df_toronto_merged = pd.merge(df_toronto_grouped, df_neighborhood_grouped, on = 'Neighborhood', how = 'inner')

map_clusters = f.Map(location = [lat_toronto, lon_toronto], zoom_start = 11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []

for lat, lon, poi, cluster in zip(df_toronto_merged['Latitude'], df_toronto_merged['Longitude'], df_toronto_merged['Neighborhood'], df_toronto_merged['Cluster Labels']):
    if np.isnan(cluster) == False:
        label = f.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        f.CircleMarker(
            [lat, lon],
            radius = 5,
            popup = label,
            color = rainbow[int(cluster) - 1],
            fill = True,
            fill_color = rainbow[int(cluster) - 1],
            fill_opacity = 0.7).add_to(map_clusters)

map_clusters

Remember our list of the 20 neighborhoods where we do not want to open our business to avoid competition. If we now see the cluster where they belong, we will get a better idea of how to choose our candidates:

In [77]:
df_non_candidate = df_venues_grouped.head(20).copy()

for index, row in df_non_candidate.iterrows():
    df_filtered = df_toronto_grouped[df_toronto_grouped["Neighborhood"] == row["Neighborhood"]]
    
    cluster_label = df_filtered.iat[0, 0]
    
    df_non_candidate.at[index, "Cluster_Label"] = cluster_label

df_non_candidate.head(20)

Unnamed: 0,Neighborhood,Quantity_Venues,Latitude,Longitude,Radius,Venues_Density,Cluster_Label
51,First Canadian Place,3,43.6487681,-79.3816917928303,78.804068,0.038069,3.0
4,Harbourfront,9,43.6400801,-79.3801495,256.488743,0.035089,2.0
1,Yorkville,14,43.6713861,-79.3901677,414.454103,0.033779,3.0
9,Kensington Market,7,43.6552136,-79.4022604,212.47358,0.032945,3.0
22,Central Bay Street,5,43.660912,-79.3858973,188.795342,0.026484,1.0
26,Queen's Park,4,43.6606092,-79.3905725,188.795342,0.021187,0.0
6,Deer Park,8,43.68809,-79.3940935,384.619903,0.0208,3.0
93,Commerce Court,1,43.64809515,-79.3790207204735,48.202675,0.020746,3.0
41,Union Station,3,43.6446934,-79.3801320058597,156.945027,0.019115,3.0
92,Toronto Dominion Centre,1,43.64736955,-79.3813733580709,54.697236,0.018282,3.0
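
Counting the labels makes the dominant cluster explicit. The sketch below tallies only the ten rows displayed above (the full list has 20 rows), using `collections.Counter`; the variable names are ours.

```python
from collections import Counter

# Cluster_Label values of the ten rows displayed above, in order.
displayed_labels = [3, 2, 3, 3, 1, 0, 3, 3, 3, 3]

tally = Counter(displayed_labels)
top_label, top_count = tally.most_common(1)[0]
print(f"Cluster {top_label} appears {top_count} times in {len(displayed_labels)} rows")
```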


As we can see, 14 of the initial 20 neighborhoods belong to **cluster number 3**. We therefore understand that the neighborhoods in this cluster are **the most interesting for opening our business**: they have a high degree of competition, which indicates that, if all neighborhoods in cluster 3 are alike and **many other businesses have been successful there, we can be successful too**. So, let's look at all the neighborhoods belonging to this cluster:

In [78]:
df_cluster_3 = df_toronto_grouped[df_toronto_grouped["Cluster Labels"] == 3]

df_cluster_3.count()

Cluster Labels       44
Neighborhood         44
Zoo Exhibit          44
Accessories Store    44
Afghan Restaurant    44
                     ..
Wine Shop            44
Wings Joint          44
Women's Store        44
Yoga Studio          44
Venues_Density       44
Length: 320, dtype: int64

OK, we have a total of 44 neighborhoods that are candidates for our business. For a visual analysis, we plot all of them on a map and draw a circle around each one whose radius grows with the density of competing businesses, so the larger the circle, the more competition there is:

In [89]:
df_cluster_3_merged = pd.merge(df_cluster_3, df_neighborhood_grouped, on = 'Neighborhood', how = 'inner')

map_clusters_3 = f.Map(location = [lat_toronto, lon_toronto], zoom_start = 11)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

markers_colors = []

for lat, lon, poi, density in zip(df_cluster_3_merged['Latitude'], df_cluster_3_merged['Longitude'], df_cluster_3_merged['Neighborhood'], df_cluster_3_merged['Venues_Density']):
    if not np.isnan(density):
        label = f.Popup(str(poi) + " Density: " + str(density), parse_html = True)
        f.Circle(
            [lat, lon],
            radius = density * 25000,
            popup = label,
            color = rainbow[1],
            fill = True,
            fill_color = rainbow[1],
            fill_opacity = 0.7).add_to(map_clusters_3)

map_clusters_3

**At this point we are able to recommend at least 5 candidate neighborhoods for our business!**

## Results and Discussion <a name="results"></a>

Our analysis shows that there are several neighborhoods where our venture could succeed, so the list of candidate neighborhoods could be extensive. However, if we keep in mind the last visualization, which concentrates only on cluster number 3, and our understanding that **it's the cluster where opening a coffee business is an almost assured success**, the list becomes shorter.

We see that the following neighborhoods have low competition density:
- Forest Hill North (Group 1)
- Parkdale (Group 1)
- Leaside (Group 1)
- Riverdale (Group 1)
- High Park (Group 2)
- Silverthorn (Group 2)
- Wexford (Group 2)
- Wilson Heights (Group 2)

The list was also divided into 2 distinct groups: the neighborhoods in group 1 are **close to the densest neighborhoods** (Kensington Market and First Canadian Place, for example), while the neighborhoods in group 2 are located in more remote areas.

Therefore, since our main objective was to find at least 5 neighborhoods in which to open a new coffee business, we can say that **we have met that objective, with the added value of dividing the list into 2 zones to give a better understanding of how our competition is distributed**.

The project does not identify places where success is certain, but it does propose neighborhoods where, based on the analysis of different data, we are closer to success than to failure.

## Conclusion <a name="conclusion"></a>

Those interested in the project (or in starting their own business) now have a better idea of how to use public data to prepare a first analysis of one of the biggest questions they face at the beginning: "Where should I locate my business?" It is not an easy question to answer, and a 100% success rate cannot be guaranteed, but we can get close enough.

The final decision should not rest on our analysis alone. Other factors, such as the culture of each neighborhood, the attractiveness of the building, tourist density, and others, can further narrow the list of candidate neighborhoods.

We hope this project helps others make the difficult decision to enter the world of entrepreneurs and to be able to say the famous phrase "I am my own boss".