# Capstone Project - The Battle of the Neighborhoods

## Introduction/Business Problem
If I go out to eat near my home, where should I head so that there are a lot of restaurants close by to choose from?

When I try a new restaurant, it is sometimes closed, or it doesn’t look as expected from the outside. To avoid going hungry, I then need to quickly find another restaurant nearby.

If we start with a restaurant that has no other restaurants nearby, or whose closest neighbor is more than 20 km away, it is a poor first choice: ditching it leaves us with a long trip to the next option.

If we instead start with a restaurant that has a lot of other restaurants nearby, we don’t need to travel far to find an alternative.


## Data
The data I will be using are the latitude and longitude of my home address (geocoded with Nominatim) and the latitudes and longitudes of the nearby venues returned by the Foursquare API.

Using these data, we can run the k-means clustering algorithm to find clusters of restaurants near my home, then pick the cluster that contains the most restaurants and use its centroid as the starting point to explore. This keeps the trip to the next restaurant short if we decide not to eat at the previous one, which solves the problem.


## Analysis

### Housekeeping work

Import all libraries needed:

In [33]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import folium
import matplotlib.cm as cm
import matplotlib.colors as colors
from geopy.geocoders import Nominatim
import requests

Defining constants:

In [34]:
HOME_ADDRESS = '8 Hillcrest Ave, Toronto, Ontario'
LIMIT = 50 # maximum number of results per query
OFFSET = 50 # amount to advance the offset per query (one page)
RADIUS = 5000  # search radius in meters (5 km)
CLIENT_ID = 'WBYOCNNVXWVIKV1LE3WN4NX5LTHGX1IMWBV5N5DYQKAQW3ZG' # your Foursquare ID
CLIENT_SECRET = 'VPUNU044Q0WOIVOTPGN3ZI42ISIOGSVX4P12KNH2AWUPMWFA' # your Foursquare Secret
VERSION = '20190425' # Foursquare API version
FOOD_CATEGORY_ID = '4d4b7105d754a06374d81259' # Foursquare category ID for "Food"
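
Hard-coding API credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the values could instead be read from environment variables. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_credentials` are hypothetical names chosen for this illustration; they are not defined by Foursquare.

```python
import os


def load_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are assumptions made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key.
        raise RuntimeError(
            "Missing environment variable: {}".format(missing)
        ) from None
```

With both variables set in the shell, `CLIENT_ID, CLIENT_SECRET = load_credentials()` would replace the two hard-coded assignments.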

### Data Preparation

First we need to get the latitude and longitude of my home address.

In [35]:
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(HOME_ADDRESS)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of my home are {}, {}.'.format(latitude, longitude))

The geographical coordinates of my home are 43.7675689, -79.4122151.


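Next, we need to gather all restaurants near my home and save them in a DataFrame. Note that the request does not pass `FOOD_CATEGORY_ID`, so the results can include venues that are not restaurants (a hotel shows up in the first few rows).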

In [36]:
offset = 0
total_result = None
count = 0
restaurant_list = []

# Page through the results until all venues have been collected
while True:
    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&offset={}'.format(
        CLIENT_ID, 
        CLIENT_SECRET, 
        VERSION, 
        latitude, 
        longitude, 
        RADIUS, 
        LIMIT,
        offset,
    )
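    response = requests.get(url).json()["response"]
    if total_result is None:
        total_result = response['totalResults']

    # Reuse the response fetched above instead of sending a second request
    results = response['groups'][0]['items']
    if not results:
        break  # no more venues returned; avoid looping forever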
    for v in results:
        venue_name = v['venue']['name'] 
        lat = v['venue']['location']['lat']
        lng = v['venue']['location']['lng']  
        category_name = v['venue']['categories'][0]['name']

        restaurant_list.append((
            venue_name,
            lat,
            lng,
            category_name,
        ))
    if len(restaurant_list) >= total_result:
        break
    offset += OFFSET  # move to the next page of results
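
The paging logic in the loop above can be tried out without network access. The following is a minimal sketch, not the notebook's actual request code: `collect_all`, `fake_fetch`, and `fake_venues` are hypothetical names, and the stand-in function serves slices of a local list instead of calling Foursquare.

```python
def collect_all(fetch_page, page_size):
    """Collect items from a paged source.

    fetch_page(offset, limit) is a hypothetical callable that returns
    (total_results, items) for one page.
    """
    collected = []
    offset = 0
    while True:
        total, items = fetch_page(offset, page_size)
        collected.extend(items)
        # Stop when everything has arrived, or when a page comes back empty
        # (a guard against looping forever if the reported total is too high).
        if not items or len(collected) >= total:
            break
        offset += page_size
    return collected


# Stand-in for the HTTP request: serves slices of a list of 120 fake venues.
fake_venues = ["venue {}".format(i) for i in range(120)]


def fake_fetch(offset, limit):
    return len(fake_venues), fake_venues[offset:offset + limit]


all_venues = collect_all(fake_fetch, page_size=50)
print(len(all_venues))
```

Separating the paging loop from the request makes the stopping conditions easy to check in isolation.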

In [37]:
df = pd.DataFrame(restaurant_list)
df.columns = [
    'Restaurant Name', 
    'Restaurant Latitude', 
    'Restaurant Longitude', 
    'Category Name',
]
df.head()

| | Restaurant Name | Restaurant Latitude | Restaurant Longitude | Category Name |
|---|---|---|---|---|
| 0 | Konjiki Ramen | 43.766998 | -79.412222 | Ramen Restaurant |
| 1 | The Keg | 43.766579 | -79.412131 | Steakhouse |
| 2 | Maryam Hotel | 43.766961 | -79.401199 | Hotel |
| 3 | Kinka Izakaya | 43.760161 | -79.409827 | Japanese Restaurant |
| 4 | Sushi Moto Sake & Wine Bar | 43.763902 | -79.411559 | Sushi Restaurant |


In [41]:
df.shape

(230, 4)
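
Since the collected venues include at least one hotel, it may be worth filtering on the category column before clustering. The sketch below is one possible approach, shown on a few rows copied from the table above. The exclusion set `NON_FOOD_CATEGORIES` is a hypothetical starting point that would need to be extended after inspecting the categories actually present.

```python
import pandas as pd

# A few rows copied from the df.head() output above.
sample = pd.DataFrame(
    [
        ("Konjiki Ramen", "Ramen Restaurant"),
        ("The Keg", "Steakhouse"),
        ("Maryam Hotel", "Hotel"),
        ("Kinka Izakaya", "Japanese Restaurant"),
    ],
    columns=["Restaurant Name", "Category Name"],
)

# Hypothetical exclusion list; extend it after reviewing the real categories.
NON_FOOD_CATEGORIES = {"Hotel"}

# Keep only the rows whose category is not in the exclusion list.
food_only = sample[~sample["Category Name"].isin(NON_FOOD_CATEGORIES)]
print(food_only["Restaurant Name"].tolist())
```

An exclusion list is used here rather than matching on the word "Restaurant", because categories such as "Steakhouse" would otherwise be dropped.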

### Setting up K-Means

First we will define some constants for k-means:

In [58]:
N_CLUSTERS = 5
N_INIT = 20
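
The value `N_CLUSTERS = 5` is a choice rather than something derived from the data. One common way to sanity-check such a choice is to compare silhouette scores across several values of k. The sketch below illustrates the idea on synthetic points (three artificial blobs), not on the restaurant data, so the k it selects says nothing about the right value for this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack(
    [c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers]
)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0)
    cluster_labels = model.fit_predict(points)
    # Silhouette ranges from -1 to 1; higher means tighter,
    # better-separated clusters.
    scores[k] = silhouette_score(points, cluster_labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

Applied to the restaurant coordinates, the same loop would take `X` in place of `points`.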

Next we will initialize the KMeans object:

In [59]:
k_means = KMeans(
    init="k-means++",
    n_clusters=N_CLUSTERS,
    n_init=N_INIT,
)

Extract the latitude and longitude columns for k-means to fit:

In [60]:
X = df[['Restaurant Latitude', 'Restaurant Longitude']]
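
K-means measures Euclidean distance, and on raw coordinates one degree of longitude covers less ground than one degree of latitude (by a factor of roughly cos(43.77°) ≈ 0.72 at this location). The notebook clusters the raw degrees. As an optional alternative, the coordinates could first be projected to approximate kilometre offsets from home. The helper `to_local_km` below is a hypothetical name, and the projection assumes a spherical Earth with a 6371 km radius.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def to_local_km(lat, lon, lat0, lon0):
    """Project degrees to approximate (east_km, north_km) offsets from (lat0, lon0).

    Equirectangular approximation; reasonable over a few kilometres.
    """
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # East-west distances shrink with the cosine of the reference latitude.
    east = np.radians(lon - lon0) * EARTH_RADIUS_KM * np.cos(np.radians(lat0))
    north = np.radians(lat - lat0) * EARTH_RADIUS_KM
    return np.column_stack([east, north])


# Home coordinates from the geocoding step, plus a point 0.01 degrees
# further north and east.
home_lat, home_lon = 43.7675689, -79.4122151
xy = to_local_km(
    [home_lat, home_lat + 0.01],
    [home_lon, home_lon + 0.01],
    home_lat,
    home_lon,
)
print(xy)
```

The second row shows that the same 0.01 degree step is shorter in the east direction than in the north direction, which is the distortion the projection removes.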

Fitting the data:

In [61]:
k_means.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=5, n_init=20, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

We have trained the model! Let's look at the cluster label assigned to each restaurant:

In [62]:
labels = k_means.labels_
labels

array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4, 0, 4,
       4, 4, 2, 1, 4, 4, 4, 4, 3, 4, 0, 4, 4, 4, 1, 4, 1, 3, 2, 2, 3, 2,
       2, 3, 2, 2, 1, 0, 3, 4, 0, 1, 2, 2, 1, 2, 2, 1, 2, 3, 0, 1, 2, 1,
       2, 3, 2, 2, 1, 2, 3, 3, 1, 3, 0, 2, 0, 3, 2, 1, 2, 1, 3, 2, 1, 2,
       3, 1, 3, 3, 3, 3, 2, 1, 2, 1, 3, 1, 1, 3, 2, 3, 2, 2, 3, 3, 2, 2,
       3, 1, 3, 1, 0, 2, 3, 3, 2, 1, 1, 3, 0, 2, 3, 0, 2, 2, 0, 2, 1, 2,
       3, 3, 2, 0, 2, 1, 3, 1, 3, 2, 3, 3, 3, 1, 1, 1, 3, 3, 2, 3, 3, 2,
       3, 0, 1, 0, 0, 2, 2, 3, 2, 2, 0, 3, 1, 1, 1, 1, 3, 1, 1, 1, 2, 2,
       2, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
       0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 1, 2, 2, 2, 0,
       2, 2, 2, 2, 1, 0, 1, 1, 1, 2], dtype=int32)

Let's also look at the centroids:

In [63]:
centers = k_means.cluster_centers_
centers

array([[ 43.76474749, -79.458995  ],
       [ 43.77838812, -79.37199804],
       [ 43.79776507, -79.42838684],
       [ 43.73322245, -79.4087234 ],
       [ 43.76746644, -79.41214599]])

Let's see how many restaurants there are in each cluster:

In [71]:
from collections import defaultdict
c_dict = defaultdict(int)
for label in labels:
    c_dict[label] += 1
for label, count in c_dict.items():
    print('Cluster {} has {} restaurants'.format(label, count))

Cluster 4 has 32 restaurants
Cluster 3 has 44 restaurants
Cluster 0 has 22 restaurants
Cluster 2 has 86 restaurants
Cluster 1 has 46 restaurants
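
The Data section proposes starting from the cluster with the most restaurants. A short sketch of that selection step, assuming only NumPy, is shown below. The label array is rebuilt from the cluster sizes printed above, and `sample_labels` is a hypothetical stand-in for `k_means.labels_`.

```python
import numpy as np

# Rebuild a label array with the cluster sizes printed above.
sizes = {0: 22, 1: 46, 2: 86, 3: 44, 4: 32}
sample_labels = np.repeat(list(sizes.keys()), list(sizes.values()))

counts = np.bincount(sample_labels)  # counts[i] = number of points in cluster i
largest = int(counts.argmax())       # index of the biggest cluster
print(largest, counts[largest])
```

With these sizes the sketch selects cluster 2, which holds 86 of the 230 venues.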


Let's also find out the address for each centroid:

In [72]:
for i, center in enumerate(centers):
    location = geolocator.reverse("{}, {}".format(center[0], center[1]))
    print('Cluster {} address is: {}'.format(i, location.address))

Cluster 0 address is: 9, Arlstan Drive, York Centre, North York, Toronto, Golden Horseshoe, Ontario, M3H 5K8, Canada
Cluster 1 address is: Newtonbrook Creek Park, Earlywood Court, Bayview Village, Don Valley North, North York, Toronto, Golden Horseshoe, Ontario, M2K 1W4, Canada
Cluster 2 address is: Hilda Avenue, Thornhill, Vaughan, York Region, Golden Horseshoe, Ontario, L4J 2L1, Canada
Cluster 3 address is: 84, Deloraine Avenue, Bedford Park, Eglinton—Lawrence, Old Toronto, Toronto, Golden Horseshoe, Ontario, M5M 3Y8, Canada
Cluster 4 address is: 8, Hillcrest Avenue, Lansing, North York, Willowdale, North York, Toronto, Golden Horseshoe, Ontario, M2N 6C6, Canada


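Very interestingly, the centroid of cluster 4 reverse-geocodes to my home address! That means I can start exploring its 32 restaurants without even travelling to the centroid. By the criterion from the Data section, though, the best starting point is cluster 2, which has the most restaurants (86) and is centered near Hilda Avenue in Thornhill.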
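
To put numbers on how far each centroid is from home, the great-circle distance can be computed with the haversine formula. This is a minimal sketch that assumes a spherical Earth with a 6371 km radius. The helper `haversine_km` is a hypothetical name, and the coordinates are copied from the geocoding and centroid outputs above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


home = (43.7675689, -79.4122151)  # from the geocoding output
centroids = [                     # from k_means.cluster_centers_
    (43.76474749, -79.458995),
    (43.77838812, -79.37199804),
    (43.79776507, -79.42838684),
    (43.73322245, -79.4087234),
    (43.76746644, -79.41214599),
]

distances = [haversine_km(home[0], home[1], lat, lon) for lat, lon in centroids]
for i, d in enumerate(distances):
    print('Cluster {} centroid is {:.2f} km from home'.format(i, d))
```

These distances make it possible to weigh the size of a cluster against how far away it is when choosing where to start.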

### Data Visualization

In [74]:
# create map, with my home address as center
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color schemes
colors = [
    '#ff0000',
    '#ffa500',
    '#ffff00',
    '#008000',
    '#0000ff',
    '#4b0082',
    '#ee82ee',
]

# add restaurants to the map
for lat, lon, name, cluster in zip(df['Restaurant Latitude'], df['Restaurant Longitude'], df['Restaurant Name'], labels):
    label = folium.Popup(name + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colors[cluster],
        fill=True,
        fill_color=colors[cluster],
        fill_opacity=0.7).add_to(map_clusters)
    
# add centroids
for i, (lat, lon) in enumerate(centers):
    label = folium.Popup('Cluster ' + str(i), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=colors[5],
        fill=True,
        fill_color=colors[5],
        fill_opacity=0.7).add_to(map_clusters)

# add home
folium.CircleMarker(
    [latitude, longitude],
    radius=5,
    popup=folium.Popup('Home', parse_html=True),
    color=colors[6],
    fill=True,
    fill_color=colors[6],
    fill_opacity=0.7).add_to(map_clusters)

map_clusters