# **Applied Data Science Capstone Project**
### **The Battle of Neighborhoods**

## Introduction

For the **IBM Applied Data Science Capstone Project**, we were instructed to apply the skills and tools 
learned in the previous weeks of the course, using Foursquare location data to explore a geographical 
location and solve a problem.

## Business Problem

There are many Chinese restaurants in New York City. This project examines 3 of its boroughs, namely Manhattan,
Brooklyn, and Queens, to find the optimal location for another Chinese restaurant: a neighborhood that does not 
have much competition yet would welcome the restaurant.

## Data

The data used in this project is borrowed from the week 3 exercise. The New York City dataset can be found via the link: https://geo.nyu.edu/catalog/nyu_2451_34572.

For this project, only 3 of the 5 boroughs in the dataset are examined to find the optimal neighborhood in which to open the Chinese restaurant. The neighborhood should have only a few Chinese restaurants, to keep competition at bay.

The Foursquare API is used to explore neighborhoods in these 3 boroughs of New York City. Its explore endpoint is used to get the most common venue categories in each neighborhood of each borough, so that the neighborhoods can be grouped into clusters for examination.

## Analysis

#### Download all dependencies that are needed for the analysis

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#!pip install geopy
from geopy.geocoders import Nominatim  # convert an address into latitude and longitude values

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!pip install folium==0.5.0
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


#### Get the New York data from the URL

In [2]:
import json
import urllib.request
with urllib.request.urlopen("https://cocl.us/new_york_dataset") as url:
    newyork_data = json.loads(url.read().decode())

#### Get only the relevant data in the *features key* to transform the data into pandas dataframe

In [3]:
neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# collect one row per neighborhood, then build the dataframe in one step
rows = []

for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

neighborhoods = pd.DataFrame(rows, columns=column_names)
    
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
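
The loop above can also be expressed with `json_normalize`, which the notebook imports but does not call. A minimal sketch, assuming each feature carries the `properties.borough`, `properties.name`, and `geometry.coordinates` fields that the loop reads. `features_to_frame` is a hypothetical helper, and `sample_features` is a hand-made stand-in for `newyork_data['features']` using two rows from the output above.

```python
import pandas as pd

# Hand-made stand-in for newyork_data['features'] (assumption: same field layout).
sample_features = [
    {"properties": {"borough": "Bronx", "name": "Wakefield"},
     "geometry": {"coordinates": [-73.847201, 40.894705]}},
    {"properties": {"borough": "Bronx", "name": "Co-op City"},
     "geometry": {"coordinates": [-73.829939, 40.874294]}},
]

def features_to_frame(features):
    """Flatten GeoJSON-style features into Borough/Neighborhood/Latitude/Longitude."""
    flat = pd.json_normalize(features)  # nested dicts become dotted column names
    coords = flat["geometry.coordinates"]
    return pd.DataFrame({
        "Borough": flat["properties.borough"],
        "Neighborhood": flat["properties.name"],
        # GeoJSON stores coordinates as [longitude, latitude]
        "Latitude": coords.map(lambda c: c[1]),
        "Longitude": coords.map(lambda c: c[0]),
    })

sample_neighborhoods = features_to_frame(sample_features)
print(sample_neighborhoods)
```

Building the frame in one step avoids growing a dataframe row by row, which gets slow as the number of features increases.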


#### Slice the original dataframe to create new dataframes for borough Manhattan, Brooklyn, and Queens

In [4]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


In [5]:
brooklyn_data = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
brooklyn_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Brooklyn | Bay Ridge | 40.625801 | -74.030621 |
| 1 | Brooklyn | Bensonhurst | 40.611009 | -73.99518 |
| 2 | Brooklyn | Sunset Park | 40.645103 | -74.010316 |
| 3 | Brooklyn | Greenpoint | 40.730201 | -73.954241 |
| 4 | Brooklyn | Gravesend | 40.59526 | -73.973471 |


In [6]:
queens_data = neighborhoods[neighborhoods['Borough'] == 'Queens'].reset_index(drop=True)
queens_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Queens | Astoria | 40.768509 | -73.915654 |
| 1 | Queens | Woodside | 40.746349 | -73.901842 |
| 2 | Queens | Jackson Heights | 40.751981 | -73.882821 |
| 3 | Queens | Elmhurst | 40.744049 | -73.881656 |
| 4 | Queens | Howard Beach | 40.654225 | -73.838138 |
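
The three slices above differ only in the borough name. A small sketch of how they could be produced in one pass; `split_by_borough` is a hypothetical helper, and the toy frame reuses a few rows from the outputs above in place of the full `neighborhoods` dataframe.

```python
import pandas as pd

def split_by_borough(neighborhoods, boroughs):
    """Return {borough: dataframe of its neighborhoods, re-indexed from 0}."""
    return {
        borough: neighborhoods[neighborhoods["Borough"] == borough].reset_index(drop=True)
        for borough in boroughs
    }

# Toy stand-in for the full neighborhoods dataframe.
toy_neighborhoods = pd.DataFrame({
    "Borough": ["Bronx", "Manhattan", "Brooklyn", "Queens", "Manhattan"],
    "Neighborhood": ["Wakefield", "Marble Hill", "Bay Ridge", "Astoria", "Chinatown"],
    "Latitude": [40.894705, 40.876551, 40.625801, 40.768509, 40.715618],
    "Longitude": [-73.847201, -73.91066, -74.030621, -73.915654, -73.994279],
})

borough_frames = split_by_borough(toy_neighborhoods, ["Manhattan", "Brooklyn", "Queens"])
print(borough_frames["Manhattan"])
```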


#### Utilize the Foursquare API to explore the neighborhoods and segment them

##### Define Foursquare Credentials and Version

In [7]:
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20191215' # Foursquare API version

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
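
API secrets are usually kept out of a shared notebook. A minimal sketch that reads the two credentials from environment variables instead. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `load_foursquare_credentials` are hypothetical; they are not part of the Foursquare API or the course lab.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None

# Demonstration with an explicit mapping instead of the real environment.
demo_id, demo_secret = load_foursquare_credentials(
    {"FOURSQUARE_CLIENT_ID": "demo-id", "FOURSQUARE_CLIENT_SECRET": "demo-secret"}
)
```

In the notebook this would be called without arguments, as `CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()`, after exporting the two variables in the shell that starts Jupyter.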

##### Borrow **getNearbyVenues** function from Lab to get venues for all neighborhood in specified borough

In [8]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
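
`getNearbyVenues` indexes straight into the response and into `categories[0]`, so an error payload or a venue without a category would raise an exception partway through a borough. A hedged sketch of a more defensive parser for a single response follows. `parse_explore_items` is a hypothetical helper, the `"Uncategorized"` label is an assumption, and `sample_response` is a hand-made payload that only mimics the keys read by the function above; it is not real Foursquare output.

```python
def parse_explore_items(payload):
    """Return (name, lat, lng, category) tuples from one explore-style payload."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fallback label is an assumption of this sketch.
        category = categories[0]["name"] if categories else "Uncategorized"
        rows.append((venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows

# Hand-made payload with one categorized and one uncategorized venue.
sample_response = {"response": {"groups": [{"items": [
    {"venue": {"name": "Venue A",
               "location": {"lat": 40.7156, "lng": -73.9943},
               "categories": [{"name": "Chinese Restaurant"}]}},
    {"venue": {"name": "Venue B",
               "location": {"lat": 40.7160, "lng": -73.9950},
               "categories": []}},
]}]}}

rows = parse_explore_items(sample_response)
empty = parse_explore_items({"meta": {"code": 400}})  # error-style payload yields no rows
```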

#### Group each neighborhood for Manhattan, Brooklyn and Queens by taking the mean of the frequency of occurrence of each venue category for that neighborhood

In [9]:
manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

In [10]:
# one hot encoding
manhattan_onehot = pd.get_dummies(manhattan_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
manhattan_onehot['Neighborhood'] = manhattan_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [manhattan_onehot.columns[-1]] + list(manhattan_onehot.columns[:-1])
manhattan_onehot = manhattan_onehot[fixed_columns]

manhattan_grouped = manhattan_onehot.groupby('Neighborhood').mean().reset_index()

print('Manhattan Neighborhood grouped')

Manhattan Neighborhood grouped


In [11]:
brooklyn_venues = getNearbyVenues(names=brooklyn_data['Neighborhood'],
                                  latitudes=brooklyn_data['Latitude'],
                                  longitudes=brooklyn_data['Longitude']
                                 )

In [12]:
# one hot encoding
brooklyn_onehot = pd.get_dummies(brooklyn_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
brooklyn_onehot['Neighborhood'] = brooklyn_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [brooklyn_onehot.columns[-1]] + list(brooklyn_onehot.columns[:-1])
brooklyn_onehot = brooklyn_onehot[fixed_columns]

brooklyn_grouped = brooklyn_onehot.groupby('Neighborhood').mean().reset_index()

print('Brooklyn Neighborhood grouped')

Brooklyn Neighborhood grouped


In [13]:
queens_venues = getNearbyVenues(names=queens_data['Neighborhood'],
                                latitudes=queens_data['Latitude'],
                                longitudes=queens_data['Longitude']
                               )

In [14]:
# one hot encoding
queens_onehot = pd.get_dummies(queens_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
queens_onehot['Neighborhood'] = queens_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [queens_onehot.columns[-1]] + list(queens_onehot.columns[:-1])
queens_onehot = queens_onehot[fixed_columns]

queens_grouped = queens_onehot.groupby('Neighborhood').mean().reset_index()

print('Queens Neighborhood grouped')

Queens Neighborhood grouped
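
The Data section mentions grouping the neighborhoods into clusters, and `KMeans` is imported, but the cells shown stop at the grouped frequency tables. A minimal sketch of that step, assuming a frame shaped like `manhattan_grouped` (a `Neighborhood` column followed by category frequencies). The helper name `cluster_neighborhoods`, the toy frame, `n_clusters=2`, and `random_state=0` are illustrative choices, not values from the project.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for a grouped frame such as manhattan_grouped.
toy_grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Chinese Restaurant": [0.50, 0.45, 0.00, 0.05],
    "Coffee Shop": [0.10, 0.15, 0.60, 0.55],
    "Park": [0.40, 0.40, 0.40, 0.40],
})

def cluster_neighborhoods(grouped, n_clusters, random_state=0):
    """Label each neighborhood with a k-means cluster of its category frequencies."""
    features = grouped.drop(columns="Neighborhood")
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(features)
    return grouped[["Neighborhood"]].assign(Cluster=labels)

toy_clusters = cluster_neighborhoods(toy_grouped, n_clusters=2)
print(toy_clusters)
```

The cluster numbers themselves are arbitrary; only which neighborhoods share a label is meaningful.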


#### Borrow the function from the lab to sort the venues in descending order

In [15]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
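
To make the row convention concrete: the function expects a row whose first entry is the neighborhood name and whose remaining entries are category frequencies. A small usage sketch with a made-up row (the frequencies are illustrative, not project data); the function is repeated so the block runs on its own.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]  # skip the leading Neighborhood entry
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Made-up row shaped like one row of a grouped frame.
toy_row = pd.Series({
    "Neighborhood": "Example",
    "Bakery": 0.10,
    "Chinese Restaurant": 0.30,
    "Coffee Shop": 0.20,
    "Park": 0.05,
})

top_three = list(return_most_common_venues(toy_row, 3))
print(top_three)  # ['Chinese Restaurant', 'Coffee Shop', 'Bakery']
```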

#### Get the 10 most common venues for each neighborhood in the 3 boroughs (Manhattan, Brooklyn, Queens) to find the optimal location to open a Chinese restaurant: a location with many different types of restaurants but the least selection of Chinese cuisine.

##### First, examine the 10 most common venues in Manhattan

In [17]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
manhattan_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
manhattan_neighborhoods_venues_sorted['Neighborhood'] = manhattan_grouped['Neighborhood']

for ind in np.arange(manhattan_grouped.shape[0]):
    manhattan_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(manhattan_grouped.iloc[ind, :], num_top_venues)

manhattan_neighborhoods_venues_sorted

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Battery Park City | Park | Coffee Shop | Hotel | Gym | Boat or Ferry | Memorial Site | Ice Cream Shop | Sandwich Place | Italian Restaurant | Beer Garden |
| 1 | Carnegie Hill | Coffee Shop | Pizza Place | Café | Yoga Studio | Bakery | Wine Shop | Bookstore | Gym / Fitness Center | French Restaurant | Gym |
| 2 | Central Harlem | African Restaurant | Public Art | Cosmetics Shop | Bar | French Restaurant | Seafood Restaurant | Chinese Restaurant | American Restaurant | Park | Fried Chicken Joint |
| 3 | Chelsea | Coffee Shop | Bakery | Italian Restaurant | Nightclub | Ice Cream Shop | Hotel | Seafood Restaurant | American Restaurant | Theater | Bookstore |
| 4 | Chinatown | Chinese Restaurant | Vietnamese Restaurant | Cocktail Bar | American Restaurant | Spa | Bakery | Hotpot Restaurant | Optical Shop | Salon / Barbershop | Bubble Tea Shop |
| 5 | Civic Center | Gym / Fitness Center | Coffee Shop | Italian Restaurant | Hotel | French Restaurant | Yoga Studio | Cocktail Bar | Sandwich Place | Park | American Restaurant |
| 6 | Clinton | Theater | Coffee Shop | American Restaurant | Italian Restaurant | Gym / Fitness Center | Wine Shop | Gym | Hotel | Steakhouse | New American Restaurant |
| 7 | East Harlem | Mexican Restaurant | Bakery | Thai Restaurant | Deli / Bodega | Latin American Restaurant | Spanish Restaurant | Cocktail Bar | Beer Bar | French Restaurant | Grocery Store |
| 8 | East Village | Bar | Chinese Restaurant | Wine Bar | Ice Cream Shop | Mexican Restaurant | Pizza Place | Cocktail Bar | Ramen Restaurant | Japanese Restaurant | Coffee Shop |
| 9 | Financial District | Coffee Shop | Pizza Place | American Restaurant | Food Truck | Wine Shop | Hotel | Gym / Fitness Center | Gym | Cocktail Bar | Steakhouse |


If the Chinese restaurant is aimed at budget-conscious patrons, **Morningside Heights** is an ideal location, since it has a variety of cafés and restaurants that seem to cater to budget customers. **Clinton** is a good location for a high-end Chinese restaurant, preferably a fusion restaurant that can cater to theater-goers and to patrons of the nearby gyms, who would probably appreciate a healthy, low-calorie meal.
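
These picks come from reading the top-10 table by eye. One way to make the same idea measurable, sketched under stated assumptions: treat any category whose name ends in "Restaurant" as dining activity, and rank neighborhoods by overall restaurant share minus a weighted Chinese-restaurant share. The function name `rank_candidates`, the `competition_weight` of 3.0, and the toy frequencies are all hypothetical and would need tuning against the real grouped frames.

```python
import pandas as pd

def rank_candidates(grouped, competition_weight=3.0):
    """Rank neighborhoods: high restaurant share is good, high Chinese share is bad."""
    freqs = grouped.set_index("Neighborhood")
    restaurant_cols = [c for c in freqs.columns if c.endswith("Restaurant")]
    restaurant_share = freqs[restaurant_cols].sum(axis=1)
    if "Chinese Restaurant" in freqs.columns:
        chinese_share = freqs["Chinese Restaurant"]
    else:
        chinese_share = pd.Series(0.0, index=freqs.index)
    score = restaurant_share - competition_weight * chinese_share
    table = pd.DataFrame({"Restaurant Share": restaurant_share,
                          "Chinese Share": chinese_share,
                          "Score": score})
    return table.sort_values("Score", ascending=False)

# Toy stand-in for a grouped frame such as manhattan_grouped.
toy_grouped = pd.DataFrame({
    "Neighborhood": ["Busy, no Chinese", "Busy, much Chinese", "Quiet"],
    "Chinese Restaurant": [0.00, 0.30, 0.00],
    "Italian Restaurant": [0.25, 0.10, 0.02],
    "Thai Restaurant": [0.15, 0.10, 0.00],
    "Park": [0.60, 0.50, 0.98],
})

ranking = rank_candidates(toy_grouped)
print(ranking)
```

A score like this only captures venue mix within the search radius. It says nothing about rent, foot traffic, or demand, which the Conclusion lists as further research.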

##### Next, examine the 10 most common venues in Brooklyn

In [18]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
brooklyn_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
brooklyn_neighborhoods_venues_sorted['Neighborhood'] = brooklyn_grouped['Neighborhood']

for ind in np.arange(brooklyn_grouped.shape[0]):
    brooklyn_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(brooklyn_grouped.iloc[ind, :], num_top_venues)

brooklyn_neighborhoods_venues_sorted

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bath Beach | Chinese Restaurant | Pharmacy | Pizza Place | Bubble Tea Shop | Italian Restaurant | Fast Food Restaurant | Donut Shop | Sushi Restaurant | Diner | Mobile Phone Shop |
| 1 | Bay Ridge | Italian Restaurant | Spa | Pizza Place | Grocery Store | Greek Restaurant | American Restaurant | Bar | Sandwich Place | Chinese Restaurant | Thai Restaurant |
| 2 | Bedford Stuyvesant | Bar | Coffee Shop | Deli / Bodega | Pizza Place | Café | Gourmet Shop | Boutique | Thrift / Vintage Store | Bagel Shop | BBQ Joint |
| 3 | Bensonhurst | Chinese Restaurant | Sushi Restaurant | Ice Cream Shop | Flower Shop | Grocery Store | Donut Shop | Italian Restaurant | Bagel Shop | Supermarket | Noodle House |
| 4 | Bergen Beach | Harbor / Marina | Baseball Field | Playground | Park | Donut Shop | Athletics & Sports | Women's Store | Filipino Restaurant | Farm | Farmers Market |
| 5 | Boerum Hill | Dance Studio | Bar | Coffee Shop | Furniture / Home Store | French Restaurant | Sandwich Place | Arts & Crafts Store | Yoga Studio | Middle Eastern Restaurant | Boutique |
| 6 | Borough Park | Bank | Pizza Place | Pharmacy | Hotel | Restaurant | Coffee Shop | Fast Food Restaurant | Café | Metro Station | Deli / Bodega |
| 7 | Brighton Beach | Beach | Russian Restaurant | Restaurant | Eastern European Restaurant | Gourmet Shop | Fast Food Restaurant | Sushi Restaurant | Bank | Pharmacy | Mobile Phone Shop |
| 8 | Broadway Junction | Donut Shop | Diner | Gas Station | Pizza Place | Burger Joint | Fried Chicken Joint | Bus Stop | Caribbean Restaurant | Seafood Restaurant | Sandwich Place |
| 9 | Brooklyn Heights | Yoga Studio | Italian Restaurant | Park | Gym | Bakery | Deli / Bodega | Cosmetics Shop | Pet Store | Ice Cream Shop | Indian Restaurant |


Looking at the other most common venues in Brooklyn neighborhoods, after eliminating the neighborhoods that already have Asian restaurants, Brooklyn does not look like a good candidate for a new restaurant. The surrounding activity in the remaining neighborhoods does not support the need for a new Chinese restaurant.
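
The elimination step described here can be sketched as a filter over a top-venues table such as `brooklyn_neighborhoods_venues_sorted`. The set of categories counted as "Asian" below is an assumption made for illustration, `without_asian_restaurants` is a hypothetical helper, and the three sample rows are copies of Brooklyn rows from the table above, truncated to their first three venue columns.

```python
import pandas as pd

# Assumption: which category names count as Asian cuisine for this filter.
ASIAN_CATEGORIES = ["Chinese Restaurant", "Sushi Restaurant", "Thai Restaurant",
                    "Noodle House", "Japanese Restaurant", "Korean Restaurant",
                    "Vietnamese Restaurant"]

def without_asian_restaurants(venues_sorted, asian_categories=ASIAN_CATEGORIES):
    """Keep neighborhoods whose top-venue columns contain none of the listed categories."""
    venue_cols = venues_sorted.columns.drop("Neighborhood")
    has_asian = venues_sorted[venue_cols].isin(list(asian_categories)).any(axis=1)
    return venues_sorted[~has_asian].reset_index(drop=True)

# Rows copied from the Brooklyn table, truncated to three venue columns.
sample_sorted = pd.DataFrame({
    "Neighborhood": ["Bath Beach", "Bedford Stuyvesant", "Boerum Hill"],
    "1st Most Common Venue": ["Chinese Restaurant", "Bar", "Dance Studio"],
    "2nd Most Common Venue": ["Pharmacy", "Coffee Shop", "Bar"],
    "3rd Most Common Venue": ["Pizza Place", "Deli / Bodega", "Coffee Shop"],
})

remaining = without_asian_restaurants(sample_sorted)
print(remaining["Neighborhood"].tolist())  # ['Bedford Stuyvesant', 'Boerum Hill']
```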

##### Lastly, examine the 10 most common venues in Queens

In [19]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
queens_neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
queens_neighborhoods_venues_sorted['Neighborhood'] = queens_grouped['Neighborhood']

for ind in np.arange(queens_grouped.shape[0]):
    queens_neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(queens_grouped.iloc[ind, :], num_top_venues)

queens_neighborhoods_venues_sorted

| | Neighborhood | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arverne | Surf Spot | Sandwich Place | Metro Station | Beach | Bed & Breakfast | Thai Restaurant | Donut Shop | Coffee Shop | Bus Stop | Board Shop |
| 1 | Astoria | Bar | Greek Restaurant | Middle Eastern Restaurant | Hookah Bar | Seafood Restaurant | Bakery | Pizza Place | Mediterranean Restaurant | Bubble Tea Shop | Ice Cream Shop |
| 2 | Astoria Heights | Food | Bakery | Playground | Plaza | Hostel | Deli / Bodega | Cocktail Bar | Bus Station | Burger Joint | Bowling Alley |
| 3 | Auburndale | Hookah Bar | Gymnastics Gym | Korean Restaurant | Noodle House | Furniture / Home Store | Bar | Italian Restaurant | Athletics & Sports | Fast Food Restaurant | Supermarket |
| 4 | Bay Terrace | Clothing Store | Women's Store | Kids Store | Donut Shop | Mobile Phone Shop | Lingerie Store | Cosmetics Shop | American Restaurant | Coffee Shop | Men's Store |
| 5 | Bayside | Bar | Sushi Restaurant | Pizza Place | Mexican Restaurant | American Restaurant | Indian Restaurant | Spa | Ice Cream Shop | Italian Restaurant | Bakery |
| 6 | Bayswater | Playground | Park | Women's Store | Farm | Egyptian Restaurant | Electronics Store | Empanada Restaurant | Event Space | Falafel Restaurant | Farmers Market |
| 7 | Beechhurst | Yoga Studio | Italian Restaurant | Pizza Place | Deli / Bodega | Dessert Shop | Donut Shop | Chinese Restaurant | Optical Shop | Supermarket | Gym |
| 8 | Bellaire | Convenience Store | Intersection | Moving Target | Bus Station | Bus Stop | Chinese Restaurant | Coffee Shop | Italian Restaurant | Greek Restaurant | Diner |
| 9 | Belle Harbor | Beach | Spa | Deli / Bodega | Chinese Restaurant | Italian Restaurant | Boutique | Bakery | Bagel Shop | Donut Shop | Mexican Restaurant |
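
The Manhattan, Brooklyn, and Queens cells repeat the same steps with a different frame each time. A sketch of one helper that could replace all three; `ordinal` and `top_venues_table` are hypothetical names, and the toy frame stands in for a grouped frame such as `queens_grouped`.

```python
import pandas as pd

def ordinal(n):
    """Format 1 as '1st', 2 as '2nd', 3 as '3rd', 4 as '4th', 11 as '11th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)

def top_venues_table(grouped, num_top_venues=10):
    """Build the 'Nth Most Common Venue' table for one grouped frame.

    Assumes the first column is Neighborhood and that there are at least
    num_top_venues category columns.
    """
    columns = ["{} Most Common Venue".format(ordinal(i + 1))
               for i in range(num_top_venues)]
    rows = []
    for _, row in grouped.iterrows():
        freqs = row.iloc[1:].sort_values(ascending=False)
        rows.append([row.iloc[0]] + list(freqs.index[:num_top_venues]))
    return pd.DataFrame(rows, columns=["Neighborhood"] + columns)

# Toy stand-in for a grouped frame.
toy_grouped = pd.DataFrame({
    "Neighborhood": ["X", "Y"],
    "Bakery": [0.2, 0.5],
    "Chinese Restaurant": [0.7, 0.1],
    "Park": [0.1, 0.4],
})

toy_top = top_venues_table(toy_grouped, num_top_venues=2)
print(toy_top)
```

With such a helper, each borough's table would be a single call, for example `top_venues_table(queens_grouped)`.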


**Middle Village** is a good location to open a low-end Chinese restaurant. The surrounding area has a Pizza Place, Dessert Shop, Bakery, Discount Store, Diner, Sports Bar, and Sandwich Place but no Chinese restaurant. **Queensbridge** could be another good candidate: there are not many types of restaurants in the neighborhood, but there are a basketball court, a baseball field, and an Athletics & Sports venue, whose players will probably be hungry after a game.

### Conclusion

This report only identifies candidate locations for opening a Chinese restaurant in New York. Further research is needed to settle on the optimal location. Space (shop) availability, the cost of operating the restaurant in the selected neighborhood, and the projected profit given the existing patrons in the neighborhood will all need to be considered.

For a high-end Chinese fusion restaurant that caters to health-conscious patrons, **Clinton** in **Manhattan** is an ideal candidate.

**Morningside Heights** in **Manhattan**, and **Middle Village** and **Queensbridge** in **Queens**, are other candidate locations for a Chinese restaurant, although more research is needed to finalize the selection.

### End of Report