### IBM Data Science - Applied Data Science Capstone Project
#### 3 April 2021
#### Lim Yi Xiang

# Selecting the best location to open a restaurant in Manhattan

## Introduction/Business Problem
#### Manhattan is a busy place and is famous for a wide range of cuisine available. This si probably due to a diverse background of immirgrant moving into the place
#### In such a location, operating a F&B business could be profitable if done right

I have a client who is passionate about food and sharing it with the people. He grew up in the suburb of Manhattan and wanted to start a business in the region. Hence, my focus will be on the borough. We define a potential type of restaurant based on high frequency of the category (based on the fact that many are opened due to profitability and customer demand) or low frequency of the category (assuming that there will be less competition).

## Business Problem
The goal is to help stakeholders to make the following important decisions: 

**Find appropriate divisions of restaurants to reduce the pickup time and distance for each group of drivers** 

We will gather the information of the venues in populated areas. Then we will filter the restaurants and group them into several groups as driver pickup areas.

# Data

To identify the characteristics of our competitors' venues in Manhattan, we would first need to find out the number of restaurants in Manhattan currently and their location. We then used Google Map API to find their geographic coordinates based on their postal code addresses. In Manhattan, the most common restaurant is sushi bar and there are 1763 sushi bars currently operating.

However, that's not the end as it may not be the wisest to compete with established sushi bars

## Manhattan Neighborhoods List

We have downloaded a neighborhood data in the form json

This json file has information about all the neighborhoods, we will limit the data type to 

1. Borough : Name of Borough
2. Neighborhood : Name of Neighborhood
3. Latitude
4. Longitude

## Geographic Information 

ArcGIS API provides information to connect people, locations, and data using interactive maps. Work with smart, data-driven styles and intuitive analysis tools that deliver location intelligence. 

We use ArcGIS to get the geo locations of the neighborhoods of Manhattan. We need to acquire the following information:

1. Latitude : Latitude for Neighborhood
2. Longitude : Longitude for Neighborhood

## Neighborhood Venues

We will need data about venues in different neighborhoods of some chosen boroughs. In order to gather that information we will use **Foursquare API**.   

Foursquare is a location data provider with information about all manner of venues and events within an area of interest. Such information includes venue names, locations, menus, ratings, and etc. It will provide all the venue information we need in this analysis.

After finding the list of neighborhoods, we then connect to the Foursquare API to gather information about venues inside each neighborhood. We set the radius, which is the specified distance to the longitude and latitude of the neighborhood, to be 500 meters.

We need to acquire the following information of venues:

1. Neighborhood  : Name of Neighborhood
2. Venue : Name of the Venue
3. VLatitude : Latitude of Venue
4. VLongitude : Longitude of Venue
5. VCategory : Category of Venue

## Final Dataframe

We use all the data of neighborhoods and venues to generate the following dataframe:
1. Borough : Name of Borough
2. Neighborhood : Name of Neighborhood
3. Latitude : Latitude for Neighborhood
4. Longitude : Longitude for Neighborhood
5. Venue : Name of the Venue
6. VLatitude : Latitude of Venue
7. VLongitude : Longitude of Venue
8. VCategory : Category of Venue


We now have all the information to do the analysis. 

# Methodology

## Acquire Neighborhoods List


### Download all the dependencies.


In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim
import urllib.request
import json
from bs4 import BeautifulSoup
from urllib.request import urlopen
import requests

from pandas.io.json import json_normalize
from arcgis.geocoding import geocode
from arcgis.gis import GIS
from sklearn.preprocessing import StandardScaler

import matplotlib.cm as cm
import matplotlib.colors as colors
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import matplotlib.colors as colors
%matplotlib inline
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes
import folium


### Create the dataframe


**Get the neighborhoods info for Manhattan**


In [2]:
with open('nyu_2451_34572-geojson.json') as json_data:
    newyork_data = json.load(json_data)
    
neighborhoods_data = newyork_data['features']
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

In [3]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


**Data cleaning**  
We only need the Borough, Post town and Postcode district

In [4]:
neighborhoods.columns = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Strip the [] and number in Borough columns

In [5]:
neighborhoods['Borough'] = neighborhoods['Borough'].map(
    lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


**Print the size of the table**

In [6]:
print("The Manhattan table has :",
      neighborhoods.shape[0], " rows and has :", neighborhoods.shape[1], "columns")

The Manhattan table has : 306  rows and has : 4 columns


## Analyze Manhattan Coordinates

### **Create Neighborhood dataframe**


In [8]:
gis = GIS()
# define a function to get all the postal codes geo

def get_geo(add):
    latcoords = 0
    lngcoords = 0
    g = geocode(address='{}, Manhattan, New York, USA'.format(add))[0]
    lngcoords = g['location']['x']
    latcoords = g['location']['y']
    return [str(latcoords), str(lngcoords)]

**Count the neighborhoods and Borough**


In [9]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
    len(neighborhoods['Borough'].unique()),
    neighborhoods.shape[0]))

The dataframe has 5 boroughs and 306 neighborhoods.


### Get the latitude and longitude values of Manhattan 


In [28]:
latitude = 40.5831
longitude = -73.9712

### Create a map of Manhattan

In [72]:
# create map of Manhattan using latitude and longitude values
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=9)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)

map_manhattan

## Explore Neighborhoods in Manhattan


### Define Foursquare Credentials and Version


In [37]:
CLIENT_ID = '54QXS15IVS1A4TLKL4YH23OQ3U0MOUJZL4PMOKRHEB52Q4NP'
CLIENT_SECRET = '0UU02LTONIX0VIYQVXVCNOCE5H0CWAW4CEFXTJMUZCF5YCJT'
VERSION = '20180605'
LIMIT = 3000

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 54QXS15IVS1A4TLKL4YH23OQ3U0MOUJZL4PMOKRHEB52Q4NP
CLIENT_SECRET:0UU02LTONIX0VIYQVXVCNOCE5H0CWAW4CEFXTJMUZCF5YCJT


### Define function to explore all the neighborhoods 


**we set the limit=3000, radius=1000 to include most venues in each neighborhoods**

In [38]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Latitude',
                             'Longitude',
                             'Venue',
                             'VLatitude',
                             'VLongitude',
                             'VCategory']

    return(nearby_venues)

### Combine geo and venue infomation


In [39]:
part_df = neighborhoods

In [40]:
venues_df = getNearbyVenues(names=part_df['Borough'],
                            latitudes=part_df['Latitude'],
                            longitudes=part_df['Longitude']
                            )

Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Manhattan
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Bronx
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Brooklyn
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manhattan
Manh

In [41]:
venues_df.shape

(20573, 7)

In [42]:
venues_df.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Venue,VLatitude,VLongitude,VCategory
0,Bronx,40.894705,-73.847201,Lollipops Gelato,40.894123,-73.845892,Dessert Shop
1,Bronx,40.894705,-73.847201,Ripe Kitchen & Bar,40.898152,-73.838875,Caribbean Restaurant
2,Bronx,40.894705,-73.847201,Ali's Roti Shop,40.894036,-73.856935,Caribbean Restaurant
3,Bronx,40.894705,-73.847201,Rite Aid,40.896649,-73.844846,Pharmacy
4,Bronx,40.894705,-73.847201,Carvel Ice Cream,40.890487,-73.848568,Ice Cream Shop


**Check how many venues were returned for each neighborhood**


In [43]:
venues_df.groupby('Neighborhood')[['Venue']].count()

Unnamed: 0_level_0,Venue
Neighborhood,Unnamed: 1_level_1
Bronx,3220
Brooklyn,5654
Manhattan,3982
Queens,5278
Staten Island,2439


**Check how many unique categories are returned**


In [44]:
print('There are {} uniques categories.'.format(
    len(venues_df['VCategory'].unique())))
print(venues_df['VCategory'].unique())

There are 487 uniques categories.
['Dessert Shop' 'Caribbean Restaurant' 'Pharmacy' 'Ice Cream Shop'
 'Burger Joint' 'Donut Shop' 'Sandwich Place' 'Mobile Phone Shop' 'Bakery'
 'Fried Chicken Joint' 'Fast Food Restaurant' 'Pizza Place' 'Supermarket'
 'Bank' 'Gas Station' 'Deli / Bodega' 'Food Truck' 'Chinese Restaurant'
 'Spanish Restaurant' 'Check Cashing Service' 'Park' 'Dumpling Restaurant'
 'Seafood Restaurant' 'Discount Store' 'Shopping Mall'
 'Mexican Restaurant' 'Kids Store' 'BBQ Joint' 'Bagel Shop'
 'Department Store' 'Coffee Shop' 'Post Office' 'Shoe Store'
 'Video Game Store' 'Furniture / Home Store' 'Convenience Store'
 'Grocery Store' 'Clothing Store' 'Paper / Office Supplies Store'
 'Accessories Store' 'Nightclub' 'Movie Theater' 'Restaurant'
 'Mattress Store' 'Miscellaneous Shop' "Women's Store" "Men's Store"
 'Bus Station' 'Baseball Field' 'Salon / Barbershop' 'American Restaurant'
 'Hotel' 'Electronics Store' 'Gym / Fitness Center' 'Buffet'
 'Metro Station' 'Music Venue

## Cluster Restaurants


### Choose all venues related to restaurants

In [53]:
restaurant_dict = ['Restaurant', 'Food', 'Coffee', 'Soup', 'Breakfast', 'Pizza', 'BBQ', 'Sandwich', 'Chinese'
                   'Fish', 'Chicken', 'Burger', 'Caf', 'Bagel', 'Taco', 'Donut', 'Cupcake', 'Salad', 'Sake', 'Sushi', 'Japanese']

In [54]:
food_category = pd.DataFrame()
for i in range(0, len(restaurant_dict)):
    if food_category.shape[0] == 0:
        food_category = venues_df['VCategory'].str.contains(restaurant_dict[i])
    else:
        food_category = venues_df['VCategory'].str.contains(
            restaurant_dict[i]) | food_category

food_category = food_category.to_numpy()

In [55]:
venues_food = venues_df.iloc[food_category]
venues_food.shape
venues_food.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Venue,VLatitude,VLongitude,VCategory
1,Bronx,40.894705,-73.847201,Ripe Kitchen & Bar,40.898152,-73.838875,Caribbean Restaurant
2,Bronx,40.894705,-73.847201,Ali's Roti Shop,40.894036,-73.856935,Caribbean Restaurant
5,Bronx,40.894705,-73.847201,Jackie's West Indian Bakery,40.889283,-73.84331,Caribbean Restaurant
6,Bronx,40.894705,-73.847201,Jimbo's,40.89174,-73.858226,Burger Joint
9,Bronx,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


**Check out the top 10 frequently occuring restaurants**

In [56]:
venues_food.groupby(['VCategory'])[['Venue']].count().sort_values(by='Venue',ascending=False).head(10)

Unnamed: 0_level_0,Venue
VCategory,Unnamed: 1_level_1
Pizza Place,992
Coffee Shop,560
Italian Restaurant,525
Donut Shop,435
Chinese Restaurant,424
Sandwich Place,405
Mexican Restaurant,325
Café,310
American Restaurant,303
Fast Food Restaurant,269


<a id='item3'></a>


### Run _k_-means to cluster the restaurants into 5 clusters based on geo locations.


In [58]:
X = venues_food[['VLatitude', 'VLongitude']]
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset

array([[ 2.08077186,  1.00427218],
       [ 2.03288151,  0.81625044],
       [ 1.97758368,  0.95810427],
       ...,
       [-1.09928581, -1.49961091],
       [-1.25667508, -1.59873387],
       [-1.13414485, -1.42596361]])

**Divide restaurants into 5 groups and get the labels and centers**


In [59]:
kclusters = 5

k_means = KMeans(init="k-means++", n_clusters=kclusters, n_init=12)
k_means.fit(cluster_dataset)
k_means_labels = k_means.labels_
k_means_labels

print(k_means_labels)

[3 3 3 ... 2 2 2]


In [60]:
venues_food.insert(0, 'ClusterLabels', k_means_labels + 1)

### Explore the cluster results

In [61]:
venues_food.head()

Unnamed: 0,ClusterLabels,Neighborhood,Latitude,Longitude,Venue,VLatitude,VLongitude,VCategory
1,4,Bronx,40.894705,-73.847201,Ripe Kitchen & Bar,40.898152,-73.838875,Caribbean Restaurant
2,4,Bronx,40.894705,-73.847201,Ali's Roti Shop,40.894036,-73.856935,Caribbean Restaurant
5,4,Bronx,40.894705,-73.847201,Jackie's West Indian Bakery,40.889283,-73.84331,Caribbean Restaurant
6,4,Bronx,40.894705,-73.847201,Jimbo's,40.89174,-73.858226,Burger Joint
9,4,Bronx,40.894705,-73.847201,Dunkin',40.890459,-73.849089,Donut Shop


**Check how many restaurants each group contains and its location** 

In [62]:
venues_food.groupby(['ClusterLabels', 'Neighborhood'])[['Venue']].count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Venue
ClusterLabels,Neighborhood,Unnamed: 2_level_1
1,Brooklyn,405
1,Manhattan,1487
1,Queens,672
2,Brooklyn,57
2,Queens,1598
3,Staten Island,907
4,Bronx,1308
4,Manhattan,233
4,Queens,1
5,Brooklyn,1915


In [63]:
venues_food.groupby('ClusterLabels')[['Venue']].count(
).sort_values(by='Venue', ascending=False)

Unnamed: 0_level_0,Venue
ClusterLabels,Unnamed: 1_level_1
1,2564
5,1953
2,1655
4,1542
3,907


We find the 5 clusters in these neighborhoods:  

1. Cluster1 mainly consists of Brooklyn, Manhattan and Queens
2. Cluster2 mainly consists of few Brooklyn and mainly Queens
3. Cluster3 is Staten Island
4. Cluster4 consist of Bronx and Manhattan
5. Cluster5 mainly consists of Brooklyn and a few Queens

### Examine each cluster


**Find out the top 5 frequently occurring restaurants in each group**

In [66]:
venue_category = venues_food.groupby(['ClusterLabels', 'VCategory'])[
    ['VCategory']].count()
venue_category.columns = ['Numbers']
for i in range(0, kclusters):
    print(venue_category.loc[(i+1, slice(None)),
          :].sort_values(by='Numbers', ascending=False).iloc[0:5, ])

                                   Numbers
ClusterLabels VCategory                   
1             Coffee Shop              262
              Pizza Place              179
              Italian Restaurant       178
              Café                     139
              American Restaurant      125
                                  Numbers
ClusterLabels VCategory                  
2             Pizza Place             199
              Chinese Restaurant      143
              Donut Shop              112
              Sandwich Place           88
              Coffee Shop              72
                                  Numbers
ClusterLabels VCategory                  
3             Pizza Place             119
              Italian Restaurant      101
              Bagel Shop               61
              Donut Shop               59
              Sandwich Place           58
                                    Numbers
ClusterLabels VCategory                    
4             Pizza Pla

<a id='item5'></a>


## Divide the Largest Group

**We notice that the first group has the most restaurants within it. This means we need to deploy more drivers to pick up delivery there.**  
**Based on the number of restaurants, we further divide group 1 in to 2 groups.**

### Choose Cluster 1 for further exploration

**Here we try 2,3,4 groups for division. We choose 4 as the optimal group division number**


In [70]:
subgroup = venues_food.loc[venues_food['ClusterLabels'] == 1, ]
X = subgroup[['VLatitude', 'VLongitude']]
cluster_dataset = StandardScaler().fit_transform(X)
cluster_dataset

kclusters = 4

k_means = KMeans(init="k-means++", n_clusters=kclusters, n_init=12)
k_means.fit(cluster_dataset)
k_means_labels = k_means.labels_

subgroup.insert(0, 'SubLabels', k_means_labels + 1)

**Check out how many restaurants there are in each group**

In [71]:
subgroup.groupby('SubLabels')['ClusterLabels'].count()

SubLabels
1    1059
2     552
3     376
4     577
Name: ClusterLabels, dtype: int64

# Results

We choose the midpoint of the three most populated areas ** Brooklyn, Manhattan and Queens** to start our business.

It is suggested that in this area Pizza Place has less presence than Coffee Shop. In a city where the other 4 clusters are topped by Pizza Place, there is huge potential if we start Pizza Place by assuming the demand is as strong. 

However, we must bear in mind that in Cluster 1, there are a great number of Italian Restaurant too. So we have to have make sure the Pizza is unique and attractive

The 5 most frequently occuring restaurants in these three boroughs are:

| Category | Numbers |
|----------|---------|
|Coffee Shop |	262|
|Pizza Place |	179|
|Italian Restaurant |	178|
|Cafe|	139|
|American Restaurant|	125|

These 5 clusters are in these neighborhoods:  

|Cluster Label| Neighborhood|
|----------|---------|
|Cluster1 | Brooklyn, Manhattan & Queens|
|Cluster2 | Brooklyn (minor) & Queens|
|Cluster3 | Staten Island|
|Cluster4 | Bronx & Manhattan|
|Cluster5 | Brooklyn Queens (minor)|	  

# Discussion

After exploration of F&B shops in Manhattan region, we derive a 5 group division model to decide 
- What is generally popular in great Manhattan region?
- Which cluster has opportunities for a new business to be started

In order to better customize the food offered in our new F&B business, we could further collect information on nearby restaurants' pricing, demographic of the people, effect of holidays on the reigon.

Next we can come up with selection plan for region with high flow, allowing us to get free advertisement. Once we start operating, we can leverage on customer data and have social media to help us promote ourselves to locals.

# Conclusion

This reports gathers information about restaurants in Manhattan major boroughs. Widely regarded as one of the most diverse cities in the world, Manhattan’s culinary scene offers multicultural cuisine. It shows that **starting a Pizza shop will be easily accepted**.

**Brooklyn, Manhattan & Queens** are the most dense areas in New York and has a large amount of various restaurants and diners suitable for starting a F&B business. We recommend starting to look for a suitable venue from the heart of **Cluster 1** and slowly move out the radius, factoring in start-up cost and maintaining cost.

# Reference

1. [Four Square API](https://foursquare.com/)