# Capstone Project: Battle of the Neighborhoods 

## Introduction/Business Problem:

Independent coffee shop owners in North York struggle to compete with the larger coffee shop chains. These owners aspire to build a brand image, establish customer loyalty, and provide the ambiance found in the larger chains. Successful independent coffee shop owners in Toronto may offer insight into the challenges of competing against those chains. In this project, I will search for an ideal location to open a coffee shop. In particular, this report can serve as a reference for stakeholders who are interested in opening a coffee shop in North York, Toronto, Ontario, Canada.

North York is one of six administrative districts of Toronto, Ontario, Canada. It is located directly north of York, Old Toronto and East York, between Etobicoke to the west and Scarborough to the east. As of the 2016 Census, it had a population of 869,401. It was first created as a township in 1922 out of the northern part of the former township of York, a municipality that was located along the western border of Old Toronto. 

In this report, I will focus on all areas of North York. There are many coffee shops in Toronto, so I will first determine where the existing coffee shops are. Then I will use a clustering model to find similar areas in the municipality, taking into account the demographic data of each borough and region. The preferred area should be distant from existing coffee shops. I will use data science tools to obtain the raw data, visualize it, and then identify a few of the most promising areas based on the above criteria. I will also explain the advantages and characteristics of the candidate areas so that stakeholders can make a final decision based on the analysis.

## Data Acquisition and Cleaning

Based on the definition of the problem, the factors that may affect the decision are:
    
* Demographic information, e.g. density, population, age, income.
* Number of existing coffee shops in the neighborhood and nearby.
* Number of other existing beverage shops in the neighborhood and nearby.

I decided to use a regularly spaced grid of locations covering the whole of North York to define our neighborhoods. Concretely, we will use the popular hexagonal honeycomb layout.
In this project, we will fetch or extract data from the following sources:

* North York census information for 2019.
* The centers of the hexagon neighborhoods will be generated algorithmically, and the approximate addresses of those centers will be obtained using the Google Geocoding API.
* Coffee shop data for every neighborhood will be obtained using the Foursquare API.
* The coordinates of North York will be obtained by geocoding a well-known North York location with the Google Geocoding API.
* The North York borough shapefile is obtained from Wikipedia.
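The algorithmic generation of hexagon centers could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name `hexagon_centers`, the center point, the 600 m spacing, and the flat-earth conversion from metres to degrees are all hypothetical choices, not values taken from this project.

```python
import math

def hexagon_centers(center_lat, center_lon, spacing_m, rings):
    """Return (lat, lon) centers of a honeycomb grid around a center point.

    spacing_m is the distance between neighboring centers and rings is the
    number of hexagon rings around the central cell. A flat-earth
    approximation is used, which is reasonable over a few kilometres.
    """
    meters_per_deg_lat = 111_320.0
    meters_per_deg_lon = meters_per_deg_lat * math.cos(math.radians(center_lat))
    centers = []
    # Axial coordinates (q, r) enumerate every cell within `rings` of the origin.
    for q in range(-rings, rings + 1):
        for r in range(max(-rings, -q - rings), min(rings, -q + rings) + 1):
            x = spacing_m * (q + r / 2.0)              # east offset in metres
            y = spacing_m * (math.sqrt(3) / 2.0) * r   # north offset in metres
            centers.append((center_lat + y / meters_per_deg_lat,
                            center_lon + x / meters_per_deg_lon))
    return centers

# Hypothetical center point and spacing, chosen only for illustration.
grid = hexagon_centers(43.7615, -79.4111, spacing_m=600, rings=2)
print(len(grid))  # 19 cells: 1 center + 6 + 12
```

Each returned pair could then be passed to a geocoder to obtain an approximate address for the cell.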

<b>North York Shape File</b><br/>
To show the North York boundary on the folium map, we need a JSON definition file for Toronto. I scraped this data from Wikipedia and then downloaded the coordinates for each postal code from an external website. After scraping and cleaning, I combined both datasets into one table.

<b>Feature Selection</b><br/>
The data scraped from Wikipedia contained 180 samples and 3 features.

In [9]:
# Import Libraries
import pandas as pd
import numpy as np
import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import works in pandas >= 1.0)
import matplotlib.cm as cm
import matplotlib.colors as colors

In [10]:
df_wiki = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
df_wiki.head()

Unnamed: 0,Postal Code,Borough,Neighbourhood
0,M1A,Not assigned,Not assigned
1,M2A,Not assigned,Not assigned
2,M3A,North York,Parkwoods
3,M4A,North York,Victoria Village
4,M5A,Downtown Toronto,"Regent Park, Harbourfront"


In [11]:
print(df_wiki.shape)

(180, 3)


In [12]:
df_coor = pd.read_csv('http://cocl.us/Geospatial_data')
df_coor.head()

Unnamed: 0,Postal Code,Latitude,Longitude
0,M1B,43.806686,-79.194353
1,M1C,43.784535,-79.160497
2,M1E,43.763573,-79.188711
3,M1G,43.770992,-79.216917
4,M1H,43.773136,-79.239476


In [13]:
# Merge two datasets
df = pd.merge(df_wiki, df_coor, on = 'Postal Code')
df

Unnamed: 0,Postal Code,Borough,Neighbourhood,Latitude,Longitude
0,M3A,North York,Parkwoods,43.753259,-79.329656
1,M4A,North York,Victoria Village,43.725882,-79.315572
2,M5A,Downtown Toronto,"Regent Park, Harbourfront",43.654260,-79.360636
3,M6A,North York,"Lawrence Manor, Lawrence Heights",43.718518,-79.464763
4,M7A,Downtown Toronto,"Queen's Park, Ontario Provincial Government",43.662301,-79.389494
...,...,...,...,...,...
98,M8X,Etobicoke,"The Kingsway, Montgomery Road, Old Mill North",43.653654,-79.506944
99,M4Y,Downtown Toronto,Church and Wellesley,43.665860,-79.383160
100,M7Y,East Toronto,"Business reply mail Processing Centre, South C...",43.662744,-79.321558
101,M8Y,Etobicoke,"Old Mill South, King's Mill Park, Sunnylea, Hu...",43.636258,-79.498509


In [14]:
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address = 'Toronto'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Toronto are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Toronto are 43.6534817, -79.3839347.


<b>Folium</b><br/>
folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the leaflet.js library. Manipulate your data in Python, then visualize it on a Leaflet map via folium.¹
folium is not difficult to use: it takes only a few lines of code to show a map of Toronto.

## Map of Toronto with Neighborhoods

In [80]:
import folium
map_toronto = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(df['Latitude'], df['Longitude'], df['Borough'], df['Neighbourhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_toronto)  
    
map_toronto

## Foursquare API

Now I will get all coffee shop information using the Foursquare API.

From the Foursquare API documentation, we can find the corresponding coffee shop category in Venue Categories. The ID of Coffee Shops in the Foursquare API is 4bf58dd8d48988d1e0931735.<br/><br/>
I'll fetch all the coffee shops in North York first. To do so, I will fetch coffee shop data in each borough and municipality.

In [78]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare client ID (redacted)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare client secret (redacted)
VERSION = '20180602' # Foursquare API version date
LIMIT = 100 # maximum number of venues returned per request
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)
radius = 500 # search radius in metres

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

Your credentials:
CLIENT_ID: YOUR_CLIENT_ID
CLIENT_SECRET:YOUR_CLIENT_SECRET


'https://api.foursquare.com/v2/venues/search?client_id=YOUR_CLIENT_ID&client_secret=YOUR_CLIENT_SECRET&ll=43.6534817,-79.3839347&v=20180602&radius=500&limit=100'
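The request URL in this section does not yet restrict results to the Coffee Shop category. A minimal sketch of how the category ID quoted in this section might be added is given below. It assumes that the `venues/search` endpoint accepts a `categoryId` query parameter; the helper name `build_search_url` and the placeholder credentials are hypothetical.

```python
from urllib.parse import urlencode

COFFEE_SHOP_CATEGORY_ID = '4bf58dd8d48988d1e0931735'  # ID quoted in this section

def build_search_url(client_id, client_secret, lat, lng,
                     version='20180602', radius=500, limit=100):
    """Build a venues/search URL restricted to the coffee shop category."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
        'categoryId': COFFEE_SHOP_CATEGORY_ID,  # assumed filter parameter
    }
    # urlencode escapes special characters such as the comma in `ll`.
    return 'https://api.foursquare.com/v2/venues/search?' + urlencode(params)

# Placeholder credentials; real ones should be kept out of the notebook.
example_url = build_search_url('YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                               43.6534817, -79.3839347)
print(example_url)
```

Building the query string with `urlencode` avoids manual string formatting mistakes and keeps the parameters easy to inspect.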

## Explore the first neighborhood

In [69]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list=[]
    
    for name, lat, lng in zip(names, latitudes, longitudes):
        # print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [81]:
toronto_venues = getNearbyVenues(names=df['Neighbourhood'],
                                   latitudes=df['Latitude'],
                                   longitudes=df['Longitude']
                                  )

In [77]:
toronto_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Agincourt,5,5,5,5,5,5
"Alderwood, Long Branch",8,8,8,8,8,8
"Bathurst Manor, Wilson Heights, Downsview North",23,23,23,23,23,23
Bayview Village,4,4,4,4,4,4
"Bedford Park, Lawrence Manor East",27,27,27,27,27,27
...,...,...,...,...,...,...
"Willowdale, Willowdale East",34,34,34,34,34,34
"Willowdale, Willowdale West",6,6,6,6,6,6
Woburn,3,3,3,3,3,3
Woodbine Heights,6,6,6,6,6,6


## Unique venue categories

In [82]:
print('There are {} unique categories.'.format(len(toronto_venues['Venue Category'].unique())))

There are 272 unique categories.


## Analyze Neighborhood

In [83]:
# one-hot encode the venue categories (toronto_venues is the dataframe built above)
toronto_denc_onehot = pd.get_dummies(toronto_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
toronto_denc_onehot['Neighborhood'] = toronto_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [toronto_denc_onehot.columns[-1]] + list(toronto_denc_onehot.columns[:-1])
toronto_denc_onehot = toronto_denc_onehot[fixed_columns]

toronto_denc_onehot.head()

Unnamed: 0,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Antique Shop,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [85]:
toronto_denc_grouped = toronto_denc_onehot.groupby('Neighborhood').mean().reset_index()
toronto_denc_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,Accessories Store,Afghan Restaurant,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Trail,Train Station,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Warehouse Store,Wine Bar,Wings Joint,Women's Store
0,Agincourt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Alderwood, Long Branch",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Bathurst Manor, Wilson Heights, Downsview North",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bayview Village,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,"Bedford Park, Lawrence Manor East",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037037


## Ten most common venues

In [86]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # beyond 3rd, fall back to the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = toronto_denc_grouped['Neighborhood']

for ind in np.arange(toronto_denc_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(toronto_denc_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Agincourt,Lounge,Latin American Restaurant,Skating Rink,Breakfast Spot,Clothing Store,Donut Shop,Diner,Discount Store,Distribution Center,Dog Run
1,"Alderwood, Long Branch",Pizza Place,Skating Rink,Sandwich Place,Gym,Pub,Coffee Shop,Pharmacy,Department Store,Dessert Shop,Dim Sum Restaurant
2,"Bathurst Manor, Wilson Heights, Downsview North",Bank,Coffee Shop,Sushi Restaurant,Frozen Yogurt Shop,Ice Cream Shop,Middle Eastern Restaurant,Supermarket,Deli / Bodega,Mobile Phone Shop,Chinese Restaurant
3,Bayview Village,Japanese Restaurant,Café,Bank,Chinese Restaurant,Diner,Discount Store,Distribution Center,Dog Run,Doner Restaurant,Women's Store
4,"Bedford Park, Lawrence Manor East",Coffee Shop,Sandwich Place,Restaurant,Italian Restaurant,Pizza Place,Sushi Restaurant,Pharmacy,Butcher,Pub,Café


## Methodology

The business purpose of this project is to find a suitable place in North York to open a coffee shop.

We have now retrieved the following data:

* All coffee shop data in North York.
* All restaurant data in North York.
* Boundary data of each borough and municipality in Toronto.

Based on the above raw data, I will generate new features accordingly, for example census information for every candidate cell, and the number of coffee shops and restaurants in the neighborhood and nearby.

In the final step, I will concentrate on the most promising areas with more restaurants and coffee shops. In addition, we will present the candidate hexagon cells in the map view so that stakeholders can make a final decision.

## Analysis

We obtained the basic census information for every borough and municipality.

We also need census information for every candidate hexagon cell, so we compute it from the boroughs and municipalities that intersect the cell.

If a hexagon lies completely within one borough, we use that borough's census data as the hexagon's. This means that all hexagons inside one borough are treated the same with respect to census features.

Accordingly, if a hexagon overlaps two boroughs by half each, we generate its census data by taking half from each of the two boroughs.

Based on this rule, we can compute the census data for all hexagons.
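The proportional rule described here can be sketched as an area-weighted average. The example below is a minimal illustration, not the project's actual implementation: the function name `interpolate_census`, the borough names, and all figures are made up, and the overlap fractions are assumed to have been computed beforehand from the borough boundaries.

```python
def interpolate_census(overlaps, census):
    """Blend borough census values by the share of the hexagon each borough covers.

    overlaps maps borough name -> fraction of the hexagon's area inside it;
    census maps borough name -> dict of numeric census features.
    """
    total = sum(overlaps.values())
    features = {}
    for borough, share in overlaps.items():
        weight = share / total  # normalise so the weights sum to 1
        for name, value in census[borough].items():
            features[name] = features.get(name, 0.0) + weight * value
    return features

# Made-up figures for two hypothetical boroughs, for illustration only.
census = {
    'Borough A': {'median_income': 60000.0, 'density': 4000.0},
    'Borough B': {'median_income': 80000.0, 'density': 2000.0},
}

inside_one = interpolate_census({'Borough A': 1.0}, census)
half_each = interpolate_census({'Borough A': 0.5, 'Borough B': 0.5}, census)
print(inside_one)  # {'median_income': 60000.0, 'density': 4000.0}
print(half_each)   # {'median_income': 70000.0, 'density': 3000.0}
```

A hexagon fully inside one borough simply inherits that borough's values, while a hexagon split evenly between two boroughs receives their average.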


We will calculate the following features for coffee shops and restaurants:

* The number of coffee shops and restaurants within the current hexagon cell.
* The number of coffee shops and restaurants within 1 km of the center of the hexagon cell.
* The number of coffee shops and restaurants within 3 km of the center of the hexagon cell.
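The distance-based counts in this list could be computed with a great-circle distance, as in the sketch below. The helper names `haversine_km` and `count_within`, the cell center, and the venue coordinates are hypothetical and serve only to illustrate the idea.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def count_within(center, venues, radius_km):
    """Count venues whose distance from the cell center is at most radius_km."""
    return sum(1 for lat, lon in venues
               if haversine_km(center[0], center[1], lat, lon) <= radius_km)

# Hypothetical cell center and venue coordinates, for illustration only.
center = (43.7615, -79.4111)
venues = [(43.7620, -79.4100), (43.7800, -79.4111), (43.8000, -79.4111)]

within_1km = count_within(center, venues, 1.0)
within_3km = count_within(center, venues, 3.0)
print(within_1km, within_3km)  # 1 2
```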

## Results and Discussion

We generated hexagon areas all over Toronto and grouped them into 10 clusters according to census information including population, density, age, education, and income. Shopping center information and existing coffee shop information were also considered when running the clustering algorithm.

From data analysis and visualization, we can see that coffee shops are usually located close to shopping centers and restaurants, which inspired us to look for areas with more shopping centers and fewer coffee shops.

After running the K-Means clustering algorithm, we obtained the cluster with the most shopping centers nearby and the fewest coffee shops on average. We also found other characteristics of this cluster: it has the highest population and density, which implies the highest traffic among all the clusters.
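A clustering step of this kind might be set up with scikit-learn roughly as follows. This is a sketch on made-up data, not the project's actual pipeline: the feature columns, the six toy cells, and the use of `StandardScaler` are assumptions, and only 2 clusters are used because the toy data is so small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up feature rows for six hypothetical hexagon cells:
# [population density, nearby restaurants, nearby coffee shops]
features = np.array([
    [4000.0, 12.0, 1.0],
    [4200.0, 11.0, 2.0],
    [3900.0, 13.0, 1.0],
    [1500.0, 2.0, 6.0],
    [1400.0, 3.0, 7.0],
    [1600.0, 2.0, 5.0],
])

# Standardise so that no single feature dominates the distance computation.
scaled = StandardScaler().fit_transform(features)

# The report uses 10 clusters; 2 are used here because the toy data is tiny.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)
labels = kmeans.labels_
print(labels)
```

The cluster of interest can then be picked by comparing the per-cluster means of the original, unscaled features.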

There are 40 hexagon areas in this cluster. We sort them by shopping center, restaurant, and coffee shop data in descending order, aiming to cover more shopping centers and fewer coffee shops in the local cell or nearby.
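One possible reading of this ranking step is sketched below. The sort key is an assumption, since the exact ordering is not spelled out: cells with more nearby restaurants come first, and ties are broken in favor of fewer nearby coffee shops. The field names and counts are hypothetical.

```python
# Hypothetical candidate cells with made-up counts, for illustration only.
candidates = [
    {'cell': 'A', 'restaurants_1km': 12, 'coffee_shops_1km': 3},
    {'cell': 'B', 'restaurants_1km': 20, 'coffee_shops_1km': 5},
    {'cell': 'C', 'restaurants_1km': 20, 'coffee_shops_1km': 1},
    {'cell': 'D', 'restaurants_1km': 4, 'coffee_shops_1km': 0},
]

def rank_candidates(cells, top_n=5):
    """Order cells by most restaurants first, then by fewest coffee shops."""
    ordered = sorted(
        cells,
        key=lambda c: (-c['restaurants_1km'], c['coffee_shops_1km']),
    )
    return [c['cell'] for c in ordered[:top_n]]

top = rank_candidates(candidates, top_n=3)
print(top)  # ['C', 'B', 'A']
```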

We conclude with the 5 most promising hexagon areas that satisfy all of our conditions. These recommended zones will be a good starting point for further analysis. There are also other factors that could be considered, for example real traffic data, the revenue of each restaurant and coffee shop, and nearby parking areas. These would help to produce more accurate results.

## Conclusion


The purpose of this project is to find an area in Toronto, specifically in North York, in which to open a coffee shop.

After fetching data from several data sources and processing it into a clean data frame, I applied the K-Means clustering algorithm and picked the cluster with more restaurants and coffee shops on average. By sorting all candidate areas in the cluster, we get the 5 most promising zones, which are used as starting points for detailed analysis by stakeholders.

The final decision on the optimal coffee shop location will be made by stakeholders based on the specific characteristics of neighborhoods and locations in each recommended zone, taking into account additional factors such as the parking available at each location, the traffic of existing coffee shops in the cluster, and their current revenue.