# This Notebook is for the IBM Applied Data Science Capstone

# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

In [1]:
# import libraries and functions
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# import standard scaler
from sklearn.preprocessing import StandardScaler

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [2]:
print('Hello Capstone Project Course!')

Hello Capstone Project Course!


## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Some stakeholders want to open a **bubble tea shop** in **New York City**, and are looking for the most strategic area for one based on competition and local demographic. The goal is to spread the love of bubble tea beyond the Asian community; the intent is to **find areas with relatively low direct and indirect competition but with a noticeable Asian population** to have an initial customer base to grow from.

## Data

Based on the definition of our problem, indicators for potential locations include:
* number of existing bubble tea shops in the neighborhood
* percentage of venue population that is sweet refreshment type shops in the neighborhood
* percentage of neighborhood population that is Asian

The following data sources will be needed to extract/generate the required information:
* New York City census data by neighborhood and borough
* number of restaurants and their type and location in every neighborhood will be obtained using **Foursquare API**

### Map Data

Load and transform NYC map data as in the Week 3 lab.

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

neighborhoods_data = newyork_data['features']

# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

### Foursquare

Next, let's get the Foursquare data.

In [4]:
CLIENT_ID = 'KQ4VT2MLRRR1URNCEH3VLZ2TYJFPOGXQPO0U4OWOFJ5AX41X' # your Foursquare ID
CLIENT_SECRET = 'PJPE3X2OICNMD15HY2PNG32IDQVBQTFUZEZIV4V5P1GSMJ4L' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
            
        except:
            print("Couldn't find group for this one",name)
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

nyc_venues = getNearbyVenues(names=neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )


Your credentails:
CLIENT_ID: KQ4VT2MLRRR1URNCEH3VLZ2TYJFPOGXQPO0U4OWOFJ5AX41X
CLIENT_SECRET:PJPE3X2OICNMD15HY2PNG32IDQVBQTFUZEZIV4V5P1GSMJ4L
Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
P

Let's look at counts of potential competitor categories.

In [5]:
nyc_venues['Venue Category'][
        (nyc_venues['Venue Category'] == 'Bubble Tea Shop') |
        (nyc_venues['Venue Category'] == 'Frozen Yogurt Shop') | 
        (nyc_venues['Venue Category'] == 'Smoothie Shop') |
        (nyc_venues['Venue Category'] == 'Juice Bar') | 
        (nyc_venues['Venue Category'] == 'Ice Cream Shop')].value_counts()

Ice Cream Shop        141
Juice Bar              52
Bubble Tea Shop        33
Frozen Yogurt Shop     13
Smoothie Shop           2
Name: Venue Category, dtype: int64

Seems we have some competition, now to begin putting together a dataframe of neighborhood features.

In [75]:
# define the dataframe columns
column_names = ['Neighborhood', 'Total Venues', 'Refreshment Shop Count','Bubble Tea Shop Count', 'Percentage Refreshment Shops'] 

# instantiate the dataframe
nyc_bubble_tea = pd.DataFrame(columns=column_names).set_index('Neighborhood')
nyc_bubble_tea.head()

# pull in neighborhood column in alphabetical order
temp = nyc_venues['Neighborhood'].unique()
temp.sort()
nyc_bubble_tea['Neighborhood'] = temp

nyc_bubble_tea.head()
# len(nyc_bubble_tea)

Unnamed: 0,Total Venues,Refreshment Shop Count,Bubble Tea Shop Count,Percentage Refreshment Shops,Neighborhood
0,,,,,Allerton
1,,,,,Annadale
2,,,,,Arden Heights
3,,,,,Arlington
4,,,,,Arrochar


Calculate overall counts and add them to the table.

In [76]:
# total counts
temp = nyc_venues.groupby('Neighborhood').count()
nyc_bubble_tea['Total Venues'] = temp['Venue'].values

# refreshment shop counts
temp = nyc_venues[
                    (nyc_venues['Venue Category'] == 'Bubble Tea Shop') |
                    (nyc_venues['Venue Category'] == 'Frozen Yogurt Shop') | 
                    (nyc_venues['Venue Category'] == 'Smoothie Shop') |
                    (nyc_venues['Venue Category'] == 'Juice Bar') | 
                    (nyc_venues['Venue Category'] == 'Ice Cream Shop')]
temp = temp.groupby('Neighborhood').count()
temp = temp['Venue'].reset_index()
temp.set_index('Neighborhood')
temp

nyc_bubble_tea = nyc_bubble_tea.merge(temp,on='Neighborhood',how='left')
nyc_bubble_tea['Refreshment Shop Count'] = nyc_bubble_tea['Venue']
nyc_bubble_tea.drop(columns=['Venue'],inplace=True)

# bubble tea shop counts
temp = nyc_venues[(nyc_venues['Venue Category'] == 'Bubble Tea Shop')]
temp = temp.groupby('Neighborhood').count()
temp = temp['Venue'].reset_index()
temp.set_index('Neighborhood')
temp

nyc_bubble_tea = nyc_bubble_tea.merge(temp,on='Neighborhood',how='left')
nyc_bubble_tea['Bubble Tea Shop Count'] = nyc_bubble_tea['Venue']
nyc_bubble_tea.drop(columns=['Venue'],inplace=True)


We have no categorical variables here, so let's replace all our NaNs with 0s and calculate the Refreshment Shop percentage.

In [77]:
# replace nans
nyc_bubble_tea.fillna(0,inplace=True)

# calculate percentage refreshment shops
nyc_bubble_tea['Percentage Refreshment Shops'] = np.round(100 * nyc_bubble_tea['Refreshment Shop Count'] / nyc_bubble_tea['Total Venues'],2)
nyc_bubble_tea.head(10)

Unnamed: 0,Total Venues,Refreshment Shop Count,Bubble Tea Shop Count,Percentage Refreshment Shops,Neighborhood
0,33,0.0,0.0,0.0,Allerton
1,8,0.0,0.0,0.0,Annadale
2,4,0.0,0.0,0.0,Arden Heights
3,5,0.0,0.0,0.0,Arlington
4,21,0.0,0.0,0.0,Arrochar
5,19,0.0,0.0,0.0,Arverne
6,100,5.0,2.0,5.0,Astoria
7,16,0.0,0.0,0.0,Astoria Heights
8,19,1.0,0.0,5.26,Auburndale
9,48,3.0,2.0,6.25,Bath Beach


### Census

Now, pull in census data.

In [152]:
temp = pd.read_csv('asian_demo_by_ntas.csv')
temp = temp.iloc[4:8]
temp.transpose().head(9)

Unnamed: 0,4,5,6,7
DP05 ACS Demographic and Housing Estimates,,,RACE,Asian
Unnamed: 1,BK72 Williamsburg,Estimate,,30
Unnamed: 2,,Estimate MOE+/-,,31
Unnamed: 3,,Percent,,0.1
Unnamed: 4,,Percent MOE+/-,,0.1
Unnamed: 5,BK73 North Side-South Side,Estimate,,2048
Unnamed: 6,,Estimate MOE+/-,,361
Unnamed: 7,,Percent,,4.4
Unnamed: 8,,Percent MOE+/-,,0.8


Need to reformat this as a usable dataframe.

In [79]:
# define the dataframe columns
column_names = ['Neighborhood','Percent Asian','Neighborhood_New'] 

# instantiate the dataframe
demodf = pd.DataFrame(columns=column_names).set_index('Neighborhood')

# add in neighborhood tabulation areas demographic data
temp = temp.transpose()[1:]

demodf['Neighborhood'] = temp[4][temp[4].notnull()]
demodf['Percent Asian'] = temp.iloc[1:].iloc[1::4][7].values

# remove nta code from neighborhood column as we don't need it
demodf['Neighborhood'] = demodf['Neighborhood'].map(lambda x: str(x)[5:])

Some neighborhoods are lumped together for census reporting, and they're separated with dashes so let's clean that up.

In [80]:
# set index to iterate over the key column
demodf.set_index('Neighborhood',inplace=True)

# turn entries lumped together with dashes into lists
for i,j in demodf.iterrows():
    if "-" in i:
        if "Co-op" in i:
            print('Ugh')
        else:
            demodf['Neighborhood_New'].loc[i] = i.split("-")

# expand lists into the row in a new dataframe, can't do inplace at this time
demodf_new = demodf.explode('Neighborhood_New')

# fill non-lumped rows with their original neighborhood and drop the old neighborhood column
demodf_new.reset_index(inplace=True)
demodf_new["Neighborhood_New"].fillna(demodf_new['Neighborhood'],inplace=True)

demodf_new.drop(columns='Neighborhood',inplace=True)

demodf_new.rename(columns = {'Neighborhood_New':'Neighborhood'},inplace=True)
demodf_new.head(10)

Ugh


Unnamed: 0,Percent Asian,Neighborhood
0,0.0,Airport
1,9.3,Allerton
2,9.3,Pelham Gardens
3,3.3,Annadale
4,3.3,Huguenot
5,3.3,Prince's Bay
6,3.3,Eltingville
7,6.3,Arden Heights
8,13.5,Astoria
9,41.9,Auburndale


Some entries still might not match, so filter out the ones that do and take a look. 

In [81]:
n1 = nyc_bubble_tea['Neighborhood'].to_list()
n2 = demodf_new['Neighborhood'].to_list()

matches = 0
match_list = []

for i in n1:
    for j in n2:
        if i == j: 
            matches += 1
            match_list.append(i)
            


In [82]:
for i in match_list:
    try:
        n1.remove(i)
    except:
        print(i,"isn't here")
    try:
        n2.remove(i)
    except:
        print(i, "isn't here")

Elmhurst isn't here
Maspeth isn't here
Murray Hill isn't here
New Brighton isn't here
Soundview isn't here


In [83]:
for i in n1:
    for j in n2:
        if i[0:2] == j[0:2]:
            print(i,'and',j)

Bedford Stuyvesant and Bedford
Bedford Stuyvesant and Bensonhurst East
Bedford Stuyvesant and Bensonhurst West
Beechhurst and Bedford
Beechhurst and Bensonhurst East
Beechhurst and Bensonhurst West
Bellaire and Bedford
Bellaire and Bensonhurst East
Bellaire and Bensonhurst West
Bensonhurst and Bedford
Bensonhurst and Bensonhurst East
Bensonhurst and Bensonhurst West
Broadway Junction and Bruckner
Broadway Junction and Bronx River
Broadway Junction and Bronx
Broadway Junction and Brooklyn
Bulls Head and Bushwick North
Bulls Head and Bushwick South
Bushwick and Bushwick North
Bushwick and Bushwick South
Butler Manor and Bushwick North
Butler Manor and Bushwick South
Central Harlem and Central Harlem North
Central Harlem and Central Harlem South
Claremont Village and Claremont
Claremont Village and Clearview
Concord and Columbia Street
Concord and Cooper Village
Concourse and Columbia Street
Concourse and Cooper Village
Crown Heights and Crotona Park East
Crown Heights and Crown Heights N

Change values in n2 that contain or are contained in items in n1. Also, fix some obvious matches.

In [84]:
n1 = nyc_bubble_tea['Neighborhood'].to_list()
n2 = demodf_new['Neighborhood'].to_list()

for n, i in enumerate(n1):
    for m, j in enumerate(n2):
        if (i in j) or (j in i):
            n2[m] = i
            
for n, i in enumerate(n2):
    if i == 'DUMBO':
        n2[n] = 'Dumbo'
    elif i == 'Flat Iron':
        n2[n] = 'Flatiron'
    elif i == 'Highbridge':
        n2[n] = 'High  Bridge'
    elif i == 'Seagate':
        n2[n] = 'Sea Gate'
    elif i == 'SoHo':
        n2[n] = 'Soho'
    elif i == 'TriBeCa':
        n2[n] = 'Tribeca'



Now rename the whole column in demodf_new.

In [85]:
# sort first
n2.sort()
demodf_new.sort_values(by='Neighborhood',inplace=True)
demodf_new['Neighborhood'] = n2

In [86]:
# check number of matches now
n1 = nyc_bubble_tea['Neighborhood'].to_list()
n2 = demodf_new['Neighborhood'].to_list()

matches = 0

for i in n1:
    for j in n2:
        if i == j: 
            matches += 1

matches

282

Good enough for an exercise. Lets inner join to take care of the difference.

In [87]:
demodf_new.set_index('Neighborhood',inplace=True)
nyc_bubble_tea.set_index('Neighborhood',inplace=True)
bubble_tea_features = demodf_new.merge(nyc_bubble_tea,on='Neighborhood',how='inner')

Isolate the features we are modeling on.

In [88]:
bubble_tea_features.drop(columns = ['Total Venues', 'Refreshment Shop Count'], inplace=True)
bubble_tea_features.reset_index(inplace=True)
bubble_tea_features.head()

Unnamed: 0,Neighborhood,Percent Asian,Bubble Tea Shop Count,Percentage Refreshment Shops
0,Allerton,9.3,0.0,0.0
1,Annadale,3.3,0.0,0.0
2,Arden Heights,6.3,0.0,0.0
3,Arlington,8.0,0.0,0.0
4,Arrochar,10.8,0.0,0.0


In [89]:
len(bubble_tea_features)

282

In [90]:
bubble_tea_features.head()

Unnamed: 0,Neighborhood,Percent Asian,Bubble Tea Shop Count,Percentage Refreshment Shops
0,Allerton,9.3,0.0,0.0
1,Annadale,3.3,0.0,0.0
2,Arden Heights,6.3,0.0,0.0
3,Arlington,8.0,0.0,0.0
4,Arrochar,10.8,0.0,0.0


Check the data types.

In [91]:
bubble_tea_features.dtypes

Neighborhood                     object
Percent Asian                    object
Bubble Tea Shop Count           float64
Percentage Refreshment Shops    float64
dtype: object

Convert Percent Asian from string to float.

In [92]:
bubble_tea_features['Percent Asian'] = pd.to_numeric(bubble_tea_features['Percent Asian'])
bubble_tea_features.dtypes

Neighborhood                     object
Percent Asian                   float64
Bubble Tea Shop Count           float64
Percentage Refreshment Shops    float64
dtype: object

## Methodology <a name="methodology"></a>

In this project we will try to identify optimal neighborhoods for a new bubble tea shop based on number of bubble tea shops (direct competition), percentage of refreshment-type shops (indirect competition), and Asian demographic residing in the neighborhood.

In the first step we have collected the required **data: number of bubble tea shops and percentage of refreshment-style shops from Foursquare, and Asian population data from the census**.

The next step in our analysis will be a quick exploration of our features through observation of the top/bottom neighborhoods to highlight regions of interest.

In the last step we will use **k-means clustering** to assign a grouping to the neighborhoods we have complete data for to determine which areas we should be most interested in, and plot these clusters on a map.

## Analysis <a name="analysis"></a>

First, let's get a high level view of what are likely our top contenders.

10 Lowest Bubble Tea Shop Count

In [93]:
bubble_tea_features[['Neighborhood','Bubble Tea Shop Count']].sort_values(by='Bubble Tea Shop Count').head(10)

Unnamed: 0,Neighborhood,Bubble Tea Shop Count
0,Allerton,0.0
178,New Brighton,0.0
179,New Brighton,0.0
180,New Brighton,0.0
181,New Dorp Beach,0.0
182,New Springville,0.0
183,North Corona,0.0
184,North Corona,0.0
185,North Side,0.0
186,Norwood,0.0


10 Lowest Percentage Refreshment Shops

In [94]:
bubble_tea_features[['Neighborhood','Percentage Refreshment Shops']].sort_values(by='Percentage Refreshment Shops').head(10)

Unnamed: 0,Neighborhood,Percentage Refreshment Shops
0,Allerton,0.0
161,Melrose,0.0
163,Midland Beach,0.0
170,Morris Heights,0.0
173,Mott Haven,0.0
174,Mott Haven,0.0
178,New Brighton,0.0
179,New Brighton,0.0
180,New Brighton,0.0
181,New Dorp Beach,0.0


10 Highest Asian Demographic

In [95]:
bubble_tea_features[['Neighborhood','Percent Asian']].sort_values(by='Percent Asian',ascending=False).head(10)

Unnamed: 0,Neighborhood,Percent Asian
99,Flushing,66.9
48,Chelsea,64.5
208,Queensboro Hill,63.6
75,East Harlem,58.8
89,Erasmus,57.3
158,Maspeth,57.3
175,Mount Hope,54.4
257,Upper East Side,50.3
245,Stuyvesant Town,49.3
88,Emerson Hill,46.0


This is a very small snippet but we can already see that Allerton might be a good place to look. Let's keep that in mind as a reference and begin clustering the neighborhoods.

First, let's extract just the data to be modeled.

In [96]:
# remove neighborhood column and scale dataset
bubble_tea_clustering = bubble_tea_features.drop('Neighborhood', 1)
cluster_dataset = StandardScaler().fit_transform(bubble_tea_clustering)

Now, since we don't have that many data points, let's just use the rule of thumb number of clusters: 
$\sqrt{\frac{N_{samples}}{2}}$ = $\sqrt{\frac{282}{2}} $ $\approx 12 $

In [97]:
# set number of clusters
kclusters = 12

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(cluster_dataset)

# check cluster labels generated in the dataframe
kmeans.labels_[0:10] 

array([ 3, 10, 10,  3,  3, 10,  3,  9,  0,  2])

Insert the labels back into the original dataset, and merge with the ny location data to get coordinates.

In [115]:
# insert labels
bubble_tea_features.insert(0, 'Cluster Labels', kmeans.labels_)

# merge with location data
bubble_tea_data = bubble_tea_features.merge(neighborhoods,on='Neighborhood',how='left')
bubble_tea_data.head()

Unnamed: 0,Cluster Labels,Neighborhood,Percent Asian,Bubble Tea Shop Count,Percentage Refreshment Shops,Borough,Latitude,Longitude
0,3,Allerton,9.3,0.0,0.0,Bronx,40.865788,-73.859319
1,10,Annadale,3.3,0.0,0.0,Staten Island,40.538114,-74.178549
2,10,Arden Heights,6.3,0.0,0.0,Staten Island,40.549286,-74.185887
3,3,Arlington,8.0,0.0,0.0,Staten Island,40.635325,-74.165104
4,3,Arrochar,10.8,0.0,0.0,Staten Island,40.596313,-74.067124


Let's plot the clusters to see if there's any positional similarity between them.

In [99]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(bubble_tea_data['Latitude'], bubble_tea_data['Longitude'], bubble_tea_data['Neighborhood'], bubble_tea_data['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Not much, but it seems certain clusters dominate. Let's see what the clusters might represent with some group metrics.

In [130]:
grouping_data = {'asian':[],'tea':[],'refresh':[],'count':[]}

for i in range(kclusters):
    grouping_data['asian'].append(bubble_tea_data['Percent Asian'][bubble_tea_data['Cluster Labels'] == i].mean())
    grouping_data['tea'].append(bubble_tea_data['Bubble Tea Shop Count'][bubble_tea_data['Cluster Labels'] == i].mean())
    grouping_data['refresh'].append(bubble_tea_data['Percentage Refreshment Shops'][bubble_tea_data['Cluster Labels'] == i].mean())
    grouping_data['count'].append(len(bubble_tea_data['Percentage Refreshment Shops'][bubble_tea_data['Cluster Labels'] == i]))

group_df = pd.DataFrame.from_dict(grouping_data)
group_df.sort_values('asian',ascending=False)
    

Unnamed: 0,asian,tea,refresh,count
4,58.8375,0.0,4.96,8
8,49.6,3.0,6.06,2
1,33.65,0.0,4.575,10
5,32.816667,1.166667,3.56,6
9,29.285714,0.0,0.11,28
3,12.4,0.0,0.090656,61
11,7.4,1.0,20.0,1
7,5.85,1.0,2.156667,12
6,5.691667,0.0,9.820833,12
0,5.639655,0.0,3.949655,58


Cluster 4 seems like a good place to start, with high Asian population, but low competition. Let's take a closer look.

In [136]:
bubble_tea_data[bubble_tea_data['Cluster Labels'] == 4]

Unnamed: 0,Cluster Labels,Neighborhood,Percent Asian,Bubble Tea Shop Count,Percentage Refreshment Shops,Borough,Latitude,Longitude
49,4,Chelsea,64.5,0.0,5.77,Manhattan,40.744035,-74.003116
50,4,Chelsea,64.5,0.0,5.77,Staten Island,40.594726,-74.18956
77,4,East Harlem,58.8,0.0,0.0,Manhattan,40.792249,-73.944182
91,4,Erasmus,57.3,0.0,10.0,Brooklyn,40.646926,-73.948177
160,4,Maspeth,57.3,0.0,3.45,Queens,40.725427,-73.896217
177,4,Mount Hope,54.4,0.0,7.69,Bronx,40.848842,-73.908299
212,4,Queensboro Hill,63.6,0.0,4.0,Queens,40.744572,-73.825809
261,4,Upper East Side,50.3,0.0,3.0,Manhattan,40.775639,-73.960508


This cluster seems to have very large Asian populations, as well as moderate but indirect competition. Looks like Upper East Side/East Harlem would be a place to look into. Harlem is known to be culturally diverse. Let's look at this cluster on a map.

In [153]:
# create reduced dataframe for just our cluster of interest
interest = bubble_tea_data[bubble_tea_data['Cluster Labels'] == 4]

# create map
map_interest = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(interest['Latitude'], interest['Longitude'], interest['Borough'], interest['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_interest)  
    
map_interest

## Results and Discussion <a name="results"></a>

Our brief analysis shows that a few well known neighborhoods in Manhattan and Queens would be good places to look to start the bubble tea business, based on demographic and competition data.

To start, we used Foursquare in conjunction with census data to generate the following features: Asian population percentage, number of bubble tea shops, and percentage of refreshment-type shops out of total venues. To get a high level overview, we looked at highest Asian population and lowest competition. However, many neighborhoods have no competition, so we're not able to glean much from this. 

We took it a step further by using k-means clustering on the aforementioned variables to isolate groupings of neighborhoods that have both high Asian population and low competition. This resulted in discovery of a small cluster of 8 neighborhoods that have $\ge$50% Asian population, no other bubble tea shops, and venues comprising of $\le$10% refreshment shops. Within this cluster are areas that make sense, including a few neighborhoods in Manhattan, like Upper East Side/East Harlem. This is informative, and a good indicator that the features we are using as metrics are working.

There is more to be done here. As a next step, we should look at factors outside of market, like rent and crime rates. Factors like these can prohibit the starting of new businesses of this or any type. This would explain the abundance of low competition areas as well.

Just to note, another improvement that could be made is to thoroughly curate the census data; there were some issues with matching that were caused by inconsistent naming of neighborhoods.

## Conclusion <a name="conclusion"></a>

The purpose of this analysis was to take a first step into finding a suitable neighborhood to open a bubble tea business. By quantifying competition and demographic data from Foursquare and the NYC census, respectively, we have narrowed down our search to 8 neighborhoods.

This information will be used by the stakeholders to proceed with further analysis, involving other metrics like crime rates and rent.