# Capstone Project - Week 2
### Applied Data Science Capstone - IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)


## Introduction: Business Problem <a name="introduction"></a>


In this project we are going to identify optimal locations for opening a **Mexican restaurant** in **Manhattan, New York City.** This report is targeted to business owners who are looking to open a Mexican restaurant in NYC.

Based on an initial discussion with the business owners, the criteria for identifying target neighborhoods are as follows:
**1. Neighborhoods need to have a high density of restaurants, as higher density reflects higher customer demand for food.**
**2. Neighborhoods need to have a low percentage of Mexican restaurants.**

The reasoning for focusing on these neighborhoods is that a higher density of restaurants is usually a result of higher customer traffic and reflects the demand for eating out. These trends usually persist and get stronger over time. It is better to locate where there is already foot traffic, as it is much harder to attract customers in an area with less traffic. Further, focusing on a low concentration of Mexican restaurants in such high-traffic areas offers customers an alternative cuisine and gives that cuisine a better chance to make inroads, as there will be less competition from other Mexican restaurants.


## Data <a name="data"></a>

Based on our problem definition, we will need the following data:

- NYC data that allows us to map the neighborhoods, that is, the latitude and longitude of each neighborhood.
- Number of total restaurants by neighborhood
- Number of Mexican restaurants by neighborhood

The following data sources will be used to gather the data:
- NYC neighborhood geo data  - https://cocl.us/new_york_dataset
- Restaurant information from Foursquare API
- Geo data to visualize maps - https://data.cityofnewyork.us/City-Government/Borough-Boundaries/tqmj-j8zm


### Import all dependencies

In [319]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


### Neighborhood Data
Import the neighborhood data for NYC along with the latitude and longitude and convert it into a pandas dataframe for further exploration.

In [320]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in a single step
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
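
The loop above relies on the GeoJSON layout of the neighborhood file. As a minimal sketch, assuming each feature carries `properties.borough`, `properties.name` and a `[longitude, latitude]` coordinate pair, the extraction could look like this. The sample feature is hand-written for illustration from the Wakefield values shown in this notebook, not copied from the file, and `feature_to_row` is a hypothetical helper name.

```python
# Hand-written sample in the assumed GeoJSON layout (illustrative only).
sample_features = [
    {
        "properties": {"borough": "Bronx", "name": "Wakefield"},
        # GeoJSON order is [longitude, latitude]
        "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
    }
]


def feature_to_row(feature):
    """Flatten one GeoJSON feature into the four columns used in this notebook."""
    lon, lat = feature["geometry"]["coordinates"][:2]
    return {
        "Borough": feature["properties"]["borough"],
        "Neighborhood": feature["properties"]["name"],
        "Latitude": lat,
        "Longitude": lon,
    }


rows = [feature_to_row(f) for f in sample_features]
print(rows[0])
```

Passing a list like `rows` to `pd.DataFrame` would then yield the same four-column table in one step.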

### Let's explore the neighborhood data

In [321]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


In [322]:
neighborhoods.shape

(306, 4)

In [323]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Manhattan,Marble Hill,40.876551,-73.91066
1,Manhattan,Chinatown,40.715618,-73.994279
2,Manhattan,Washington Heights,40.851903,-73.9369
3,Manhattan,Inwood,40.867684,-73.92121
4,Manhattan,Hamilton Heights,40.823604,-73.949688


### Foursquare API
The dataset lists 306 neighborhoods across New York City; since this study is limited to Manhattan, let's get the restaurant information for each Manhattan neighborhood using the Foursquare API.

In [324]:
# Foursquare credentials (replace the placeholders with your own keys)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

Now we will get the restaurant information for each neighborhood. We will limit the results to 100 food venues within 500 meters of each neighborhood center.
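
The request described above can be sketched without sending anything over the network. The snippet below only assembles the URL; the endpoint, parameter names, version string and food category ID are the ones used in this notebook, while `build_explore_url` and the placeholder credentials are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Endpoint as used in this notebook.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605",
                      category_id="4d4b7105d754a06374d81259",
                      radius=500, limit=100):
    """Return the explore URL for one neighborhood center (no request is sent)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "categoryId": category_id,
        "radius": radius,
        "limit": limit,
    }
    # urlencode percent-encodes the comma in `ll`; the server decodes it again
    return f"{EXPLORE_URL}?{urlencode(params)}"


# Placeholder credentials; Marble Hill coordinates from the Manhattan table.
url = build_explore_url(40.876551, -73.91066, "MY_ID", "MY_SECRET")
print(url)
```

Keeping `radius` and `limit` as arguments makes it easy to check how sensitive the venue counts are to either setting.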

In [325]:


def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    LIMIT = 100 # limit of number of venues returned by Foursquare API

    # the search radius comes from the `radius` argument (default: 500 meters)

    category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues
    # mexican_category = '4bf58dd8d48988d1c1941735' # Category Id for all Mexican restaurants
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            category,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )
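
The function above indexes straight into `["response"]['groups'][0]['items']` and `categories[0]`, which raises an exception if Foursquare returns an error payload, an empty group list, or a venue without a category. A minimal sketch of a more defensive extraction follows, assuming the same response layout the function relies on. `extract_venues` and the sample payload are hypothetical and written for illustration; the second venue is made up to exercise the empty-category case.

```python
def extract_venues(payload):
    """Return (name, lat, lng, category) tuples; skip malformed entries."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    venues = []
    for item in groups[0].get("items", []):
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories", [])
        if "name" not in venue or "lat" not in location or "lng" not in location:
            continue  # incomplete record, skip it
        category = categories[0].get("name", "Unknown") if categories else "Unknown"
        venues.append((venue["name"], location["lat"], location["lng"], category))
    return venues


# Hand-written payload in the assumed layout (illustrative only).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Arturo's",
               "location": {"lat": 40.874412, "lng": -73.910271},
               "categories": [{"name": "Pizza Place"}]}},
    {"venue": {"name": "No Category Cafe",
               "location": {"lat": 40.0, "lng": -73.0},
               "categories": []}},
]}]}}

parsed = extract_venues(sample)
empty = extract_venues({"meta": {"code": 429}})  # error-style payload without "response"
print(parsed)
print(empty)
```

Returning an empty list for a failed call lets the surrounding loop continue with the remaining neighborhoods instead of stopping halfway.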

In [326]:
print(manhattan_venues.shape)
manhattan_venues.head()

(2782, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Marble Hill,40.876551,-73.91066,Arturo's,40.874412,-73.910271,Pizza Place
1,Marble Hill,40.876551,-73.91066,Tibbett Diner,40.880404,-73.908937,Diner
2,Marble Hill,40.876551,-73.91066,Dunkin',40.877136,-73.906666,Donut Shop
3,Marble Hill,40.876551,-73.91066,Land & Sea Restaurant,40.877885,-73.905873,Seafood Restaurant
4,Marble Hill,40.876551,-73.91066,Subway,40.874667,-73.909586,Sandwich Place


In [327]:
df = manhattan_venues[manhattan_venues['Venue Category'].str.contains('Restaurant')]
df.shape

(1653, 7)
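
Note that `str.contains('Restaurant')` keeps only categories whose name includes that word, so food venues such as 'Pizza Place', 'Diner' or 'Sandwich Place' are dropped. That is why the table shrinks from 2782 venues to 1653. The sketch below makes the effect visible on the categories of the five Marble Hill venues shown earlier; it assumes pandas is installed.

```python
import pandas as pd

# Categories of the five Marble Hill venues listed in the earlier output.
categories = pd.Series(
    ["Pizza Place", "Diner", "Donut Shop", "Seafood Restaurant", "Sandwich Place"],
    name="Venue Category",
)

# Same substring filter as in the notebook cell.
is_restaurant = categories.str.contains("Restaurant")

kept = categories[is_restaurant].tolist()
dropped = categories[~is_restaurant].tolist()
print(kept)
print(dropped)
```

Whether such venues should count as restaurants is a modeling decision; the notebook's totals reflect the narrower definition.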

In [328]:
len(df['Neighborhood'].unique())

40

There are **40 neighborhoods** in the borough of Manhattan.

In [329]:
df[df['Venue Category'].str.contains('Mexican')].count()

Neighborhood              110
Neighborhood Latitude     110
Neighborhood Longitude    110
Venue                     110
Venue Latitude            110
Venue Longitude           110
Venue Category            110
dtype: int64

In our Foursquare sample for Manhattan, there are a total of **1653 restaurants**, of which **110 are Mexican restaurants**.

## Methodology  <a name="methodology"></a>

Step 1 - We will focus on the top 25 neighborhoods, ranked by total number of restaurants. Once we have a list of these neighborhoods, we will count the number of Mexican restaurants in each one.

Step 2 - To identify the areas with the most potential, we will cluster these neighborhoods using k-means clustering.

Step 3 - Finally, we will map these clusters to give a better visual picture of where the opportunities lie.

##  Analysis  <a name="analysis"></a>

**Step 1 (a): Now let's extract the top 25 neighborhoods in the Manhattan borough based on the highest number of restaurants. We will call this group manhattan_top25.**

In [330]:
manhattan_top25 = df.groupby('Neighborhood')['Venue'].count().sort_values(ascending=False).head(25)
manhattan_top25 = manhattan_top25.reset_index()

# Rename the columns
manhattan_top25.rename(columns={'Venue':'Total Restaurants'}, inplace=True)
manhattan_top25.head()

Unnamed: 0,Neighborhood,Total Restaurants
0,East Village,76
1,Flatiron,74
2,Greenwich Village,73
3,Chinatown,72
4,West Village,72


In [331]:
# Convert the top 25 neighborhoods to a list
top25 = manhattan_top25['Neighborhood'].to_list()
top25

['East Village',
 'Flatiron',
 'Greenwich Village',
 'Chinatown',
 'West Village',
 'Noho',
 'Soho',
 'Little Italy',
 'Midtown South',
 'Midtown',
 'Clinton',
 'Lenox Hill',
 'Chelsea',
 'Murray Hill',
 'Civic Center',
 'Yorkville',
 'Financial District',
 'Turtle Bay',
 'Upper East Side',
 'Upper West Side',
 'Sutton Place',
 'Tudor City',
 'Carnegie Hill',
 'Washington Heights',
 'Tribeca']

**Step 1(b): Now that we have a list of top 25 neighborhoods where people go to eat, let's find out how many Mexican restaurants are in each of these neighborhoods.**

In [332]:
mexcount = []

for i in top25:
    mexcount.append(df[(df['Neighborhood'] == i) & (df['Venue Category'].str.contains('Mexican'))]['Venue'].count())
    
# Add the mexcount back to the table of top 25 neighborhoods
manhattan_top25['Mexican Restaurants'] = np.array(mexcount)

manhattan_top25.head()

Unnamed: 0,Neighborhood,Total Restaurants,Mexican Restaurants
0,East Village,76,7
1,Flatiron,74,3
2,Greenwich Village,73,1
3,Chinatown,72,4
4,West Village,72,4
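
The loop above filters `df` once per neighborhood. An equivalent count can be obtained in a single pass with `groupby`. The sketch below uses a small made-up frame with the same two columns (the rows are toy data, not Foursquare results) and assumes pandas is available.

```python
import pandas as pd

# Toy data in the same shape as `df` (illustrative rows only).
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Mexican Restaurant", "Italian Restaurant",
                       "Mexican Restaurant", "Thai Restaurant",
                       "Sushi Restaurant"],
})

# Flag Mexican venues, then sum the flags per neighborhood.
is_mexican = toy["Venue Category"].str.contains("Mexican")
mex_counts = is_mexican.groupby(toy["Neighborhood"]).sum()

print(mex_counts.to_dict())
```

With the real data, `manhattan_top25['Neighborhood'].map(mex_counts)` would align these counts with the top-25 table, and neighborhoods without any Mexican restaurant would get 0 rather than being left out.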


To be able to cluster the neighborhoods, let's normalize the data by dividing each column by its maximum.

In [333]:
manhattan_top25['TR'] = manhattan_top25['Total Restaurants']/manhattan_top25['Total Restaurants'].max()
manhattan_top25['MR'] = manhattan_top25['Mexican Restaurants']/manhattan_top25['Mexican Restaurants'].max()
manhattan_top25.head()

Unnamed: 0,Neighborhood,Total Restaurants,Mexican Restaurants,TR,MR
0,East Village,76,7,1.0,1.0
1,Flatiron,74,3,0.973684,0.428571
2,Greenwich Village,73,1,0.960526,0.142857
3,Chinatown,72,4,0.947368,0.571429
4,West Village,72,4,0.947368,0.571429
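
Written out, the scaling used in the cell above divides each count by the column maximum, so the neighborhood with the largest count gets a value of 1. The symbols $T_i$ and $M_i$ (total and Mexican restaurant counts of neighborhood $i$) are introduced here only for illustration.

```latex
\mathrm{TR}_i = \frac{T_i}{\max_j T_j},
\qquad
\mathrm{MR}_i = \frac{M_i}{\max_j M_j}

% Worked example for Flatiron, using the counts in the table:
\mathrm{TR}_{\text{Flatiron}} = \frac{74}{76} \approx 0.9737,
\qquad
\mathrm{MR}_{\text{Flatiron}} = \frac{3}{7} \approx 0.4286
```

Both values match the TR and MR columns in the output, and both features end up on a comparable 0 to 1 scale before clustering.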


In [334]:
manhattan_top25.describe()

Unnamed: 0,Total Restaurants,Mexican Restaurants,TR,MR
count,25.0,25.0,25.0,25.0
mean,54.6,2.88,0.718421,0.411429
std,13.197222,1.810157,0.173648,0.258594
min,34.0,0.0,0.447368,0.0
25%,43.0,2.0,0.565789,0.285714
50%,51.0,3.0,0.671053,0.428571
75%,65.0,4.0,0.855263,0.571429
max,76.0,7.0,1.0,1.0
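
The second criterion in the introduction speaks of a low *percentage* of Mexican restaurants, whereas the features used here are based on counts. As a hedged illustration of how a share could be computed instead (this is not a step the notebook performs), the sketch below uses the five neighborhoods displayed in the earlier tables.

```python
# (neighborhood, total restaurants, Mexican restaurants) from the earlier table
rows = [
    ("East Village", 76, 7),
    ("Flatiron", 74, 3),
    ("Greenwich Village", 73, 1),
    ("Chinatown", 72, 4),
    ("West Village", 72, 4),
]

# Share of Mexican restaurants among all restaurants in each neighborhood.
shares = {name: mexican / total for name, total, mexican in rows}

# Print from lowest to highest share.
for name, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{name}: {share:.1%}")
```

Among these five, Greenwich Village has the lowest share and East Village the highest, which agrees with the ordering by raw count in this case.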


In [335]:
manhattan_top25.drop(['Total Restaurants','Mexican Restaurants'],axis =1, inplace=True)
manhattan_top25.head()

Unnamed: 0,Neighborhood,TR,MR
0,East Village,1.0,1.0
1,Flatiron,0.973684,0.428571
2,Greenwich Village,0.960526,0.142857
3,Chinatown,0.947368,0.571429
4,West Village,0.947368,0.571429


**Step 2: Cluster the neighborhoods using k-means**

In [336]:
# set number of clusters
kclusters = 5

manhattan_grouped_clustering = manhattan_top25.drop(columns='Neighborhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

# add clustering labels
manhattan_top25.insert(0, 'Cluster Labels', kmeans.labels_)

manhattan_top25

Unnamed: 0,Cluster Labels,Neighborhood,TR,MR
0,4,East Village,1.0,1.0
1,0,Flatiron,0.973684,0.428571
2,3,Greenwich Village,0.960526,0.142857
3,0,Chinatown,0.947368,0.571429
4,0,West Village,0.947368,0.571429
5,0,Noho,0.907895,0.714286
6,3,Soho,0.855263,0.285714
7,3,Little Italy,0.855263,0.142857
8,3,Midtown South,0.802632,0.0
9,3,Midtown,0.776316,0.285714
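
The choice of `kclusters = 5` is not justified in the notebook. One common way to sanity-check it is the elbow heuristic: fit k-means for several values of k and look at where the inertia stops dropping quickly. The sketch below is an illustration under stated assumptions, not the author's procedure. It uses only the ten rows displayed above (the full table has 25) and assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.cluster import KMeans

# (TR, MR) for the ten neighborhoods displayed above.
X = np.array([
    [1.0, 1.0], [0.973684, 0.428571], [0.960526, 0.142857],
    [0.947368, 0.571429], [0.947368, 0.571429], [0.907895, 0.714286],
    [0.855263, 0.285714], [0.855263, 0.142857], [0.802632, 0.0],
    [0.776316, 0.285714],
])

k_values = range(1, 7)
inertias = []
for k in k_values:
    # n_init is set explicitly because its default changed across scikit-learn versions
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Inertia = sum of squared distances of points to their nearest cluster center.
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 4))
```

Running the same loop on the full 25-row feature table would show whether five clusters sits near the bend of the curve.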


Analyzing the clusters, here are the things that jump out:
- Cluster 0 has a higher number of total restaurants as well as Mexican restaurants
- Cluster 1 has fewer total restaurants but more Mexican restaurants
- Cluster 2 has fewer total restaurants and fewer Mexican restaurants
- Cluster 3 has a higher number of total restaurants but fewer Mexican restaurants
- Cluster 4 has only one neighborhood.
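
Statements like these can be checked against per-cluster averages. A minimal sketch follows, using only the ten rows displayed in the output above (so Clusters 1 and 2, which do not appear among them, are absent) and assuming pandas is installed.

```python
import pandas as pd

# Cluster label, TR and MR for the ten rows displayed above.
shown = pd.DataFrame({
    "Cluster Labels": [4, 0, 3, 0, 0, 0, 3, 3, 3, 3],
    "TR": [1.0, 0.973684, 0.960526, 0.947368, 0.947368,
           0.907895, 0.855263, 0.855263, 0.802632, 0.776316],
    "MR": [1.0, 0.428571, 0.142857, 0.571429, 0.571429,
           0.714286, 0.285714, 0.142857, 0.0, 0.285714],
})

# Mean of each feature and number of neighborhoods per cluster.
summary = shown.groupby("Cluster Labels").agg(
    TR_mean=("TR", "mean"), MR_mean=("MR", "mean"), size=("TR", "size")
)
print(summary.round(3))
```

On these rows Cluster 0 averages a higher MR than Cluster 3, in line with the description above; the same `groupby` on the full table would cover all five clusters.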

**Step 3: Visualize the distribution of highly trafficked neighborhoods that lack Mexican restaurants across Manhattan**

Before we can map these neighborhoods, we need to add the latitude and longitude to each neighborhood.

In [337]:

lat = []
lon = []
for i in top25:
    lat.append(df[df['Neighborhood'] == i]['Neighborhood Latitude'].to_list()[0])
    lon.append(df[df['Neighborhood'] == i]['Neighborhood Longitude'].to_list()[0])

# Add the Neighborhood Latitude & Longitude back to the table of top 25 neighborhoods
manhattan_top25['Latitude'] = np.array(lat)
manhattan_top25['Longitude'] = np.array(lon)

manhattan_top25.head()

Unnamed: 0,Cluster Labels,Neighborhood,TR,MR,Latitude,Longitude
0,4,East Village,1.0,1.0,40.727847,-73.982226
1,0,Flatiron,0.973684,0.428571,40.739673,-73.990947
2,3,Greenwich Village,0.960526,0.142857,40.726933,-73.999914
3,0,Chinatown,0.947368,0.571429,40.715618,-73.994279
4,0,West Village,0.947368,0.571429,40.734434,-74.00618
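
The coordinates can also be attached with a join instead of two Python lists. A minimal sketch follows, assuming pandas is installed and using three rows from the tables in this notebook; with the real data the right-hand frame would be `manhattan_data`. The names `clustered`, `centers` and `with_coords` are introduced here for illustration.

```python
import pandas as pd

# Three rows of the clustered table and the matching neighborhood centers.
clustered = pd.DataFrame({
    "Cluster Labels": [4, 0, 3],
    "Neighborhood": ["East Village", "Flatiron", "Greenwich Village"],
})
centers = pd.DataFrame({
    "Neighborhood": ["Greenwich Village", "East Village", "Flatiron"],
    "Latitude": [40.726933, 40.727847, 40.739673],
    "Longitude": [-73.999914, -73.982226, -73.990947],
})

# A left join keeps the row order of `clustered` and matches rows by name.
with_coords = clustered.merge(centers, on="Neighborhood", how="left")
print(with_coords)
```

A join matches on the neighborhood name rather than on list position, so it stays correct even if the two tables are sorted differently.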


Let's map these neighborhoods

In [338]:
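# create map, centered on the mean of the neighborhood coordinates
# (`latitude` and `longitude` were never defined in this notebook)
map_center = [manhattan_top25['Latitude'].mean(), manhattan_top25['Longitude'].mean()]
map_clusters = folium.Map(location=map_center, zoom_start=11)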

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_top25['Latitude'], manhattan_top25['Longitude'], manhattan_top25['Neighborhood'], manhattan_top25['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

After mapping these clusters, it is clear that the area of focus needs to be between Canal Street and 23rd Street - where Clusters 0 and 3 are located. 

## Results & Discussion  <a name="results"></a>

Our analysis covers the 40 neighborhoods of the Manhattan borough in New York City. Each neighborhood has its own unique character. To identify the neighborhoods with the most promise for a new Mexican restaurant, we chose to focus on the 25 neighborhoods with the largest number of restaurants. The rationale behind picking these neighborhoods was that it is easier to cater to an existing client base, in this case customers who are already in the area to eat, than to locate in neighborhoods with few restaurants.

After we picked these neighborhoods, we wanted to understand the presence of other Mexican restaurants in these areas. We looked at the distribution of Mexican restaurants by neighborhood.

Finally, we grouped these neighborhoods based on the total number of restaurants and the number of Mexican restaurants. We found that the area between 23rd Street and Canal Street fits the criteria stated by the stakeholders.

## Conclusion  <a name="conclusion"></a>

Based on our analysis of the 40 neighborhoods in Manhattan, NY, we recommend the 7 neighborhoods located between 23rd Street and Canal Street for further exploration as sites for a new Mexican restaurant. Additional considerations such as demographics, site visibility and access, rent and traffic count will play a role in finalizing the site.