# Capstone Project: Battle of the Neighborhoods

### by Kasey Chang started 2020/02/20

## 1. Overall Business Problem

### 1.1 Background

Boba tea, a sweet tea drink often served with tapioca pearls, has been very popular with Asian populations since its invention in the 1980s. Despite its age, demand for boba tea has continued to grow as it spread to non-Asian countries. It was estimated that the [compound growth rate of the boba tea market from 2017 to 2023 will be 7.3%.](https://journal.businesstoday.org/bt-online/2019/its-quali-tea-how-boba-became-a-craze)



<img src="https://media.istockphoto.com/photos/bubble-tea-in-a-row-picture-id531948035?k=6&m=531948035&s=612x612&w=0&h=42OINPNXpnwtXAuxkTgFomZ9oW9q8r2LtBoxa5qG1pU=">

### 1.2 Problem 

You are a boba tea franchise that is considering opening some shops in San Francisco. You want to know where your competitors are (such as [Quickly USA](http://www.quicklyusa.com/)) and what sort of neighborhoods they are targeting. Are they just setting up near concentrations of Asian residents? Or have they branched out to shadow coffee shops and other venues?


NOTE: In Foursquare, "boba tea shop" is referred to as "Bubble Tea Shop", categoryID 52e81612bcbc57f1066b7a0c

To rephrase the question in data science terms: 

#### Q: What sort of neighborhoods are the competition setting up boba shops in?  


This question rests on the following assumptions:

* that people would not go out of their way for boba tea, so the store is near where they live, play, or work
* we will assume the max distance is 0.5 km / 500 meters, which sounds like a nice walkable distance for a cup of boba tea
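
To make the 500 m assumption concrete, here is a minimal sketch of a great-circle (haversine) distance check. The helper name `haversine_m` and the constant `WALK_LIMIT_M` are hypothetical and not part of the original analysis; the two coordinates are taken from the venue table shown later in this notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

WALK_LIMIT_M = 500  # the walkable distance assumed above (hypothetical constant name)

shop = (37.788175, -122.403615)   # Asha Tea House
venue = (37.788039, -122.401466)  # Flatiron Wine and Spirits

d = haversine_m(*shop, *venue)
print(round(d), d <= WALK_LIMIT_M)
```

A check like this is only a sanity test on straight-line distance; actual walking distance along streets will be somewhat longer.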



### 1.3 Interest

Knowing what competitors are doing is business intelligence that leads to business decisions. 

## 2 Searching for a pattern 

We will analyze where the existing boba tea shops are located and what they have in common. 

### 2.0 Defining the pattern problem

In terms of data science, to discern patterns, we perform k-means clustering. 

Specifically here, we need to perform the following steps

* locate all boba tea shops within San Francisco (see note later)
* get table of venues near each shop 
* flatten the table so we have a row for each shop, listing all the venue counts
* perform k-means analysis to locate clusters and patterns in each cluster 

### 2.1 Get boba tea shop list

We need to query Foursquare for all venues that fit the "Bubble Tea Shop" category ID.


In [1]:
# install folium (map rendering) if it is not already available
!pip install folium



In [2]:
import pandas as pd
import numpy as np
import io
import requests


In [3]:
import json
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

LIMIT=50 #max return, note that Foursquare actually limits this to 50 no matter how high you set it
rad=500 #meters


Next cell has the Foursquare credentials, which you don't need to see. 

In [4]:
# The code was removed by Watson Studio for sharing.

In [5]:
sf_lat=37.7749     #center of san francisco
sf_long=-122.4194  #center of san francisco
sf_rad=15000       #15 km, which should cover ALL of San Francisco
cat_id="52e81612bcbc57f1066b7a0c"    #"bubble tea place"
  
browse_url='https://api.foursquare.com/v2/venues/search?intent=browse&client_id={}&client_secret={}&categoryId={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET,
    cat_id,
    sf_lat, 
    sf_long,
    VERSION,
    sf_rad,
    LIMIT)

results = requests.get(browse_url).json()


In [6]:
venues = results['response']

boba_venues = json_normalize(venues['venues']) # flatten JSON to get the venues

# filter columns
filtered_columns = ['name', 'location.lat', 'location.lng','location.neighborhood','location.postalCode']
boba_venues2 =boba_venues.loc[:, filtered_columns]


In [7]:
boba_venues2.head()

Unnamed: 0,name,location.lat,location.lng,location.neighborhood,location.postalCode
0,Asha Tea House,37.788175,-122.403615,,94108
1,Boba Guys,37.766448,-122.397042,Showplace Square,94107
2,Black Sugar,37.786135,-122.409948,Lower Nob Hill,94102
3,Boba Guys,37.772907,-122.423507,,94102
4,Boba Guys,37.789899,-122.407077,Downtown San Francisco-Union Square,94108


In [8]:
boba_venues2.shape

(50, 5)

Upon casual inspection of the data, it's clear that many shops do NOT list a neighborhood, but at least all of them list a zip code. This matters later. 

It looks reasonable, but let's plot them on a map of SF.

In [9]:
map_clusters = folium.Map(location=[sf_lat,sf_long], zoom_start=12)
for lat, lon, name, zipcode in zip(boba_venues2['location.lat'], boba_venues2['location.lng'],boba_venues2['name'],boba_venues2['location.postalCode']):
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

My radius of 15km was too large! It included shops in Daly City, Oakland, and South San Francisco! 

We re-run the query with the radius set to 10 km instead. NOTE: We will discuss the limitation of this approach later. 

In [10]:
sf_rad=10000       #10 km, which should cover ALL of San Francisco
  
browse_url='https://api.foursquare.com/v2/venues/search?intent=browse&client_id={}&client_secret={}&categoryId={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET,
    cat_id,
    sf_lat, 
    sf_long,
    VERSION,
    sf_rad,
    LIMIT)

results = requests.get(browse_url).json()

venues = results['response']

boba_venues = json_normalize(venues['venues']) # flatten JSON to get the venues

# filter columns
filtered_columns = ['name', 'location.lat', 'location.lng','location.neighborhood','location.postalCode']
boba_venues2 =boba_venues.loc[:, filtered_columns]



In [11]:
map_clusters = folium.Map(location=[sf_lat,sf_long], zoom_start=12)
for lat, lon, name, zipcode in zip(boba_venues2['location.lat'], boba_venues2['location.lng'],boba_venues2['name'],boba_venues2['location.postalCode']):
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [12]:
boba_venues2.shape

(50, 5)

Please note that Foursquare returned exactly 50 results for both radii. This suggests there are MORE than 50 boba tea shops in San Francisco and that we have hit the API cap. This is confirmed by the API documentation: the maximum number of results returned is 50, no matter how high the limit is set. 

A split-and-combine approach was considered: divide San Francisco into four quadrants, query each one, and combine the datasets while eliminating overlap. But the problem again stems from areas that reach beyond the city limits. We also don't need data from every single boba shop in the city to discern a pattern. Therefore, the decision was to use the "top 50" as is, though in the future one may consider this alternate approach. 
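
The split-and-combine idea could be sketched as follows. This is an illustration only: the bounding box coordinates are rough assumptions, the helper names are hypothetical, and the responses are made-up stand-ins rather than real API output. To my understanding, the v2 search endpoint accepts `sw` and `ne` corner parameters with `intent=browse`, but that should be verified against the Foursquare documentation before relying on it.

```python
def split_into_quadrants(south, west, north, east):
    """Split a bounding box into four (sw_corner, ne_corner) quadrants."""
    mid_lat = (south + north) / 2
    mid_lng = (west + east) / 2
    return [
        ((south, west), (mid_lat, mid_lng)),    # south-west quadrant
        ((south, mid_lng), (mid_lat, east)),    # south-east quadrant
        ((mid_lat, west), (north, mid_lng)),    # north-west quadrant
        ((mid_lat, mid_lng), (north, east)),    # north-east quadrant
    ]

def merge_unique(result_lists):
    """Combine venue lists, keeping the first record seen for each venue id."""
    seen = {}
    for venues in result_lists:
        for v in venues:
            seen.setdefault(v['id'], v)
    return list(seen.values())

# Rough, assumed bounding box for San Francisco (not from the original analysis)
quadrants = split_into_quadrants(37.70, -122.52, 37.82, -122.35)

# Made-up responses standing in for two quadrant queries that overlap on one venue
fake_a = [{'id': 'v1', 'name': 'Shop A'}, {'id': 'v2', 'name': 'Shop B'}]
fake_b = [{'id': 'v2', 'name': 'Shop B'}, {'id': 'v3', 'name': 'Shop C'}]

merged = merge_unique([fake_a, fake_b])
print(len(quadrants), len(merged))
```

Each quadrant would become one search request, so four queries could return up to 200 shops before de-duplication. A quadrant that still returns 50 results would need to be split again.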


### 2.2  Venues around each of the boba shops we found

We will set an arbitrary distance of 500m (rad=500 earlier), making an inherent assumption that people would not want to walk very far to get a cup of boba tea. 

We'll borrow a function from a previous project...

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=rad):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

We will manipulate the table to separate the shops that have more than one location, assuming only one location per zip code. This may generate duplicates later, and I know I should have changed all the field names, but it works as is. We do this by creating a composite name called "namezip" by concatenating the name and zip code. 

In [14]:
boba_venues2['namezip']=boba_venues2['name']+boba_venues2['location.postalCode']
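
The composite key only works when a franchise has a single location per zip code. A minimal sketch of how collisions could be detected is shown here in plain Python, using a few (name, zip code) pairs from the shop list printed below; on the real table, `boba_venues2['namezip'].duplicated()` would be the pandas equivalent.

```python
from collections import Counter

# A few (name, postalCode) pairs as they appear in the shop list
shops = [
    ('Boba Guys', '94107'),
    ('Boba Guys', '94102'),
    ('Sharetea', '94103'),
    ('SimplexiTea', '94103'),
    ('Sharetea', '94103'),
]

# Build the same composite key and count how often each one occurs
namezip_counts = Counter(name + zipcode for name, zipcode in shops)
collisions = [key for key, n in namezip_counts.items() if n > 1]
print(collisions)
```

Any key listed here will collapse into a single row once the data is grouped by `namezip`. Keying on the Foursquare venue id instead would avoid this, assuming the id is kept when the columns are filtered.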

In [15]:
boba_venues3 = getNearbyVenues(names=boba_venues2['namezip'],
                             latitudes=boba_venues2['location.lat'],
                             longitudes=boba_venues2['location.lng']
                             )

Asha Tea House94108
Boba Guys94107
Black Sugar94102
Boba Guys94102
Boba Guys94108
Boba Guys94115
Yi Fang Taiwan Fruit Tea94132
Plentea94108
Yi Fang Taiwan Fruit Tea94108
CoCo Fresh Tea & Juice94104
Boba Guys94117
Tea And Others94117
Urban Ritual94102
Boba Guys94110
Little Sweet94122
Sharetea94103
Little Sweet94118
Brew Cha94110
Teaspoon94109
SimplexiTea94103
fifty/fifty94118
Little Sweet94108
Tpumps94122
i-Tea94108
Gong Cha94102
Boba Butt Tea House94108
Boba Bao Bei94110
The Posh Bagel94114
i-Tea94122
Wonderful Desserts & Cafe94122
Purple Kow94121
Enough94108
District Tea94110
Sharetea94103
Mr and Mrs Tea House94118
B&B94122
Qualitea94114
Wonder Tea94122
TJ Brewed Tea and Real Fruit (TJ Cups)94122
Quickly (Kobe Bento) 快可立94133
Happy Cow Creamery & Tea94107
OMG Tea94134
Super Cue Cafe94116
Teapenter94122
Mr. T Cafe94112
Tea Hut94115
CoCo Fresh Tea & Juice94133
Sunday at the Museum94102
STEEP94107
IdentiTea94110


In [16]:
boba_venues3.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Asha Tea House94108,37.788175,-122.403615,Flatiron Wine and Spirits,37.788039,-122.401466,Wine Shop
1,Asha Tea House94108,37.788175,-122.403615,Saint Laurent,37.787774,-122.405412,Boutique
2,Asha Tea House94108,37.788175,-122.403615,Maison Margiela,37.788261,-122.405765,Boutique
3,Asha Tea House94108,37.788175,-122.403615,Crocker Galleria Roof Terrace,37.789146,-122.402447,Roof Deck
4,Asha Tea House94108,37.788175,-122.403615,Cask,37.787114,-122.403092,Liquor Store


In [17]:
boba_venues3.shape

(2375, 7)

In [18]:
boba_venues3.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Asha Tea House94108,50,50,50,50,50,50
B&B94122,50,50,50,50,50,50
Black Sugar94102,50,50,50,50,50,50
Boba Bao Bei94110,50,50,50,50,50,50
Boba Butt Tea House94108,50,50,50,50,50,50
Boba Guys94102,50,50,50,50,50,50
Boba Guys94107,50,50,50,50,50,50
Boba Guys94108,50,50,50,50,50,50
Boba Guys94110,50,50,50,50,50,50
Boba Guys94115,50,50,50,50,50,50


In [19]:
print('There are {} uniques categories.'.format(len(boba_venues3['Venue Category'].unique())))

There are 273 uniques categories.


Let's collapse the table and see what's near each store in summary form...

In [20]:
boba_onehot = pd.get_dummies(boba_venues3[['Venue Category']], prefix="", prefix_sep="")

boba_onehot['namezip'] = boba_venues3['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [boba_onehot.columns[-1]] + list(boba_onehot.columns[:-1])
boba_onehot = boba_onehot[fixed_columns]

boba_onehot.head()

Unnamed: 0,namezip,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Wagashi Place,Watch Shop,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Let us collapse the table, leaving one row per namezip, and average each column so that every value becomes the share of nearby venues in that category.

In [21]:
boba_grouped = boba_onehot.groupby('namezip').mean().reset_index()
boba_grouped.head()

Unnamed: 0,namezip,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Wagashi Place,Watch Shop,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Asha Tea House94108,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.02,...,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02
1,B&B94122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Black Sugar94102,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0
3,Boba Bao Bei94110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
4,Boba Butt Tea House94108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0


In [22]:
boba_grouped.shape

(49, 274)
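
To see what the averaged one-hot values mean, here is a small pure-Python sketch of the same calculation for a single shop. The category list is hypothetical and the function name is an illustration, not part of the notebook. For a shop with 50 nearby venues, each value is a multiple of 1/50 = 0.02, which is why the table above shows values such as 0.02 and 0.04.

```python
from collections import Counter

def category_frequencies(categories):
    """Share of nearby venues in each category (what the mean of one-hot rows computes)."""
    total = len(categories)
    return {cat: n / total for cat, n in Counter(categories).items()}

# Hypothetical list of 50 nearby venue categories for one shop
nearby = ['Boutique'] * 4 + ['Coffee Shop'] * 3 + ['Hotel'] * 3 + ['Other'] * 40

freq = category_frequencies(nearby)
print(freq['Boutique'], freq['Coffee Shop'])
```

Because the values are shares rather than counts, a shop with fewer than 50 nearby venues is still comparable to one with the full 50.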

So what are the most common venues near each store? This is mainly a sanity check. 

In [23]:
num_top_venues = 5

for boba in boba_grouped['namezip']:
    print("----"+boba+"----")
    temp = boba_grouped[boba_grouped['namezip'] == boba].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Asha Tea House94108----
                  venue  freq
0              Boutique  0.08
1           Men's Store  0.06
2           Coffee Shop  0.06
3  Gym / Fitness Center  0.06
4                 Hotel  0.06


----B&B94122----
                   venue  freq
0        Bubble Tea Shop  0.08
1            Coffee Shop  0.08
2          Deli / Bodega  0.06
3  Vietnamese Restaurant  0.06
4    Japanese Restaurant  0.04


----Black Sugar94102----
             venue  freq
0          Theater  0.10
1            Hotel  0.08
2  Thai Restaurant  0.04
3        Speakeasy  0.04
4     Cocktail Bar  0.04


----Boba Bao Bei94110----
                venue  freq
0  Mexican Restaurant  0.12
1  Italian Restaurant  0.06
2         Coffee Shop  0.06
3        Cocktail Bar  0.04
4              Bakery  0.04


----Boba Butt Tea House94108----
                     venue  freq
0              Coffee Shop  0.08
1              Men's Store  0.08
2  New American Restaurant  0.06
3               Restaurant  0.04
4       Chines

The most popular venues near each boba store are restaurants (no matter which type), cafes, or bars. 

Let's see that in a flattened table, borrowing from a previous project... 

In [24]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [25]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['namezip']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:  # positions past the 3rd use the generic 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['namezip'] = boba_grouped['namezip']

for ind in np.arange(boba_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(boba_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,namezip,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House94108,Boutique,Gym / Fitness Center,Hotel,Coffee Shop,Men's Store,Jewelry Store,Café,Plaza,Museum,Gym
1,B&B94122,Bubble Tea Shop,Coffee Shop,Vietnamese Restaurant,Deli / Bodega,Dessert Shop,Grocery Store,Bar,Bakery,Chinese Restaurant,Dumpling Restaurant
2,Black Sugar94102,Theater,Hotel,Speakeasy,Thai Restaurant,Gym / Fitness Center,Cocktail Bar,Cosmetics Shop,Jewelry Store,Optical Shop,Bubble Tea Shop
3,Boba Bao Bei94110,Mexican Restaurant,Italian Restaurant,Coffee Shop,Cocktail Bar,Art Gallery,Bakery,Latin American Restaurant,Yoga Studio,Hot Dog Joint,Hungarian Restaurant
4,Boba Butt Tea House94108,Men's Store,Coffee Shop,New American Restaurant,Chinese Restaurant,Wine Bar,Szechuan Restaurant,Cocktail Bar,Restaurant,Cantonese Restaurant,Neighborhood


In [26]:
neighborhoods_venues_sorted.shape

(49, 11)

### Let's try to find some patterns... 

In [27]:
neighborhoods_venues_sorted

Unnamed: 0,namezip,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House94108,Boutique,Gym / Fitness Center,Hotel,Coffee Shop,Men's Store,Jewelry Store,Café,Plaza,Museum,Gym
1,B&B94122,Bubble Tea Shop,Coffee Shop,Vietnamese Restaurant,Deli / Bodega,Dessert Shop,Grocery Store,Bar,Bakery,Chinese Restaurant,Dumpling Restaurant
2,Black Sugar94102,Theater,Hotel,Speakeasy,Thai Restaurant,Gym / Fitness Center,Cocktail Bar,Cosmetics Shop,Jewelry Store,Optical Shop,Bubble Tea Shop
3,Boba Bao Bei94110,Mexican Restaurant,Italian Restaurant,Coffee Shop,Cocktail Bar,Art Gallery,Bakery,Latin American Restaurant,Yoga Studio,Hot Dog Joint,Hungarian Restaurant
4,Boba Butt Tea House94108,Men's Store,Coffee Shop,New American Restaurant,Chinese Restaurant,Wine Bar,Szechuan Restaurant,Cocktail Bar,Restaurant,Cantonese Restaurant,Neighborhood
5,Boba Guys94102,Sushi Restaurant,Wine Bar,Cocktail Bar,New American Restaurant,Coffee Shop,Gym / Fitness Center,Park,Beer Garden,Salon / Barbershop,Comic Shop
6,Boba Guys94107,Gym,Coffee Shop,Breakfast Spot,Mexican Restaurant,Café,Park,Brewery,Wine Shop,American Restaurant,Pet Store
7,Boba Guys94108,Boutique,Men's Store,Hotel,Jewelry Store,Spa,Coffee Shop,Clothing Store,Café,Candy Store,Breakfast Spot
8,Boba Guys94110,Cocktail Bar,Ice Cream Shop,Mexican Restaurant,Coffee Shop,Bakery,New American Restaurant,Music Venue,Bookstore,Wine Bar,Gift Shop
9,Boba Guys94115,Creperie,Jazz Club,Grocery Store,Cosmetics Shop,Gift Shop,Tea Room,Indian Restaurant,New American Restaurant,Shopping Mall,Lounge


In [28]:
boba_grouped

Unnamed: 0,namezip,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Wagashi Place,Watch Shop,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Asha Tea House94108,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.02,...,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.02
1,B&B94122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Black Sugar94102,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.02,0.0,0.0,0.02,0.0,0.02,0.0,0.0,0.0
3,Boba Bao Bei94110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.04,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02
4,Boba Butt Tea House94108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0
5,Boba Guys94102,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.06,0.02,0.0,0.0,0.0,0.02
6,Boba Guys94107,0.02,0.0,0.0,0.0,0.04,0.02,0.0,0.02,0.0,...,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.02
7,Boba Guys94108,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Boba Guys94110,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0
9,Boba Guys94115,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We try to sort them into 5 clusters using k-means, as computers can often spot trends among so many variables that we humans can't see. 

In [29]:
kclusters = 5

boba_grouped_clustering = boba_grouped.drop(columns='namezip')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(boba_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 2, 1, 3, 4, 4, 4, 1, 3, 4, 4, 3, 1, 4, 3, 1, 1, 4, 3, 1, 4, 4,
       2, 3, 0, 4, 0, 4, 4, 4, 4, 3, 4, 0, 0, 4, 4, 2, 4, 4, 2, 4, 2, 2,
       0, 3, 4, 4, 2], dtype=int32)

In [30]:
# add clustering labels column
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

boba_merged = boba_venues2

# join the shop list with neighborhoods_venues_sorted so each shop's latitude/longitude carries its cluster label and top venues
boba_merged = boba_merged.join(neighborhoods_venues_sorted.set_index('namezip'), on='namezip',how='outer')

boba_merged.head() 

Unnamed: 0,name,location.lat,location.lng,location.neighborhood,location.postalCode,namezip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House,37.788175,-122.403615,,94108,Asha Tea House94108,1,Boutique,Gym / Fitness Center,Hotel,Coffee Shop,Men's Store,Jewelry Store,Café,Plaza,Museum,Gym
1,Boba Guys,37.766448,-122.397042,Showplace Square,94107,Boba Guys94107,4,Gym,Coffee Shop,Breakfast Spot,Mexican Restaurant,Café,Park,Brewery,Wine Shop,American Restaurant,Pet Store
2,Black Sugar,37.786135,-122.409948,Lower Nob Hill,94102,Black Sugar94102,1,Theater,Hotel,Speakeasy,Thai Restaurant,Gym / Fitness Center,Cocktail Bar,Cosmetics Shop,Jewelry Store,Optical Shop,Bubble Tea Shop
3,Boba Guys,37.772907,-122.423507,,94102,Boba Guys94102,4,Sushi Restaurant,Wine Bar,Cocktail Bar,New American Restaurant,Coffee Shop,Gym / Fitness Center,Park,Beer Garden,Salon / Barbershop,Comic Shop
4,Boba Guys,37.789899,-122.407077,Downtown San Francisco-Union Square,94108,Boba Guys94108,1,Boutique,Men's Store,Hotel,Jewelry Store,Spa,Coffee Shop,Clothing Store,Café,Candy Store,Breakfast Spot


### 2.3 Visualize...

In [31]:
boba_merged.shape

(50, 17)

In [32]:
# create map
map_clusters = folium.Map(location=[sf_lat,sf_long], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boba_merged['location.lat'], boba_merged['location.lng'], boba_merged['namezip'], boba_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Looks actually quite reasonable. There are noticeable clusters and as a San Francisco resident, I can see some commonalities for the clusters. 

### Cluster 1 

In [33]:
boba_merged.loc[boba_merged['Cluster Labels'] == 0, boba_merged.columns[[0,4]+ list(range(7, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
8,Yi Fang Taiwan Fruit Tea,94108,Chinese Restaurant,Coffee Shop,Bakery,Tea Room,Cocktail Bar,Dim Sum Restaurant,History Museum,Dive Bar,Bubble Tea Shop,Szechuan Restaurant
30,Purple Kow,94121,Chinese Restaurant,Café,Japanese Restaurant,Bakery,Sporting Goods Shop,Bus Station,Yoga Studio,Ramen Restaurant,Burrito Place,Sandwich Place
38,TJ Brewed Tea and Real Fruit (TJ Cups),94122,Chinese Restaurant,Dim Sum Restaurant,Grocery Store,Vietnamese Restaurant,Bubble Tea Shop,Coffee Shop,Bank,Sandwich Place,Bar,Bus Station
41,OMG Tea,94134,Chinese Restaurant,Vietnamese Restaurant,Bubble Tea Shop,Sandwich Place,Coffee Shop,Bakery,Storage Facility,BBQ Joint,Pizza Place,Cantonese Restaurant
42,Super Cue Cafe,94116,Chinese Restaurant,Sushi Restaurant,Café,Sandwich Place,Park,Pizza Place,Dive Bar,Spa,Optical Shop,Light Rail Station


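Every shop in this cluster has Chinese Restaurant as its most common nearby venue, usually alongside other Asian restaurants, bakeries, and a coffee shop or cafe. Being a San Francisco resident, I can see that most are located in minor concentrations of Chinese shops, sometimes called tertiary Chinatowns. 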

### Cluster 2 

In [34]:
boba_merged.loc[boba_merged['Cluster Labels'] == 1, boba_merged.columns[[0,4]+ list(range(7, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House,94108,Boutique,Gym / Fitness Center,Hotel,Coffee Shop,Men's Store,Jewelry Store,Café,Plaza,Museum,Gym
2,Black Sugar,94102,Theater,Hotel,Speakeasy,Thai Restaurant,Gym / Fitness Center,Cocktail Bar,Cosmetics Shop,Jewelry Store,Optical Shop,Bubble Tea Shop
4,Boba Guys,94108,Boutique,Men's Store,Hotel,Jewelry Store,Spa,Coffee Shop,Clothing Store,Café,Candy Store,Breakfast Spot
9,CoCo Fresh Tea & Juice,94104,Boutique,Hotel,Coffee Shop,Sandwich Place,Cocktail Bar,Café,Gym / Fitness Center,Sushi Restaurant,New American Restaurant,Plaza
21,Little Sweet,94108,Boutique,Jewelry Store,Gym / Fitness Center,Clothing Store,Coffee Shop,Men's Store,Furniture / Home Store,Cosmetics Shop,Electronics Store,Tea Room
24,Gong Cha,94102,Theater,Hotel,Jewelry Store,Toy / Game Store,Furniture / Home Store,Thai Restaurant,Cosmetics Shop,Optical Shop,Electronics Store,Bubble Tea Shop
31,Enough,94108,Boutique,Men's Store,Coffee Shop,Sandwich Place,Sushi Restaurant,Gym / Fitness Center,Yoga Studio,Hotel,Library,Liquor Store


These shops are all in or near downtown (zip codes 94102, 94104, and 94108), where boutiques, hotels, jewelry stores, and theaters dominate the nearby venues.  

### Cluster 3 

In [35]:
boba_merged.loc[boba_merged['Cluster Labels'] == 2, boba_merged.columns[[0,4]+ list(range(7, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Tpumps,94122,Bubble Tea Shop,Vietnamese Restaurant,Deli / Bodega,Bakery,Dumpling Restaurant,Dim Sum Restaurant,Chinese Restaurant,Szechuan Restaurant,Bar,Dessert Shop
28,i-Tea,94122,Bubble Tea Shop,Vietnamese Restaurant,Bakery,Deli / Bodega,Dumpling Restaurant,Dim Sum Restaurant,Szechuan Restaurant,Bar,Bank,Dessert Shop
29,Wonderful Desserts & Cafe,94122,Bubble Tea Shop,Vietnamese Restaurant,Deli / Bodega,Bakery,Dumpling Restaurant,Dessert Shop,Szechuan Restaurant,Bar,Bank,Chinese Restaurant
34,Mr and Mrs Tea House,94118,Japanese Restaurant,Vietnamese Restaurant,Burmese Restaurant,Thai Restaurant,Bakery,Chinese Restaurant,BBQ Joint,Korean Restaurant,Asian Restaurant,Ice Cream Shop
35,B&B,94122,Bubble Tea Shop,Coffee Shop,Vietnamese Restaurant,Deli / Bodega,Dessert Shop,Grocery Store,Bar,Bakery,Chinese Restaurant,Dumpling Restaurant
37,Wonder Tea,94122,Vietnamese Restaurant,Bubble Tea Shop,Bakery,Deli / Bodega,Dim Sum Restaurant,Szechuan Restaurant,Bar,Bank,Dumpling Restaurant,Chinese Restaurant
43,Teapenter,94122,Vietnamese Restaurant,Coffee Shop,Chinese Restaurant,Bakery,Bar,Bubble Tea Shop,Szechuan Restaurant,Dessert Shop,Deli / Bodega,Cantonese Restaurant


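For most of these shops, Bubble Tea Shop is itself among the most common nearby venues, alongside Vietnamese and Chinese restaurants, bakeries, and delis; all but one are in zip code 94122. This cluster is clearly over-saturated with boba tea shops and is to be avoided. 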

### Cluster 4 

In [36]:
boba_merged.loc[boba_merged['Cluster Labels'] == 3, boba_merged.columns[[0,4]+ list(range(7, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
6,Yi Fang Taiwan Fruit Tea,94132,Bakery,Lingerie Store,Cosmetics Shop,Clothing Store,Sandwich Place,Gym,Candy Store,Food Truck,Mobile Phone Shop,Grocery Store
13,Boba Guys,94110,Cocktail Bar,Ice Cream Shop,Mexican Restaurant,Coffee Shop,Bakery,New American Restaurant,Music Venue,Bookstore,Wine Bar,Gift Shop
17,Brew Cha,94110,Mexican Restaurant,Bar,Music Venue,Cocktail Bar,Cuban Restaurant,Chinese Restaurant,Boxing Gym,Street Art,Market,Bubble Tea Shop
19,SimplexiTea,94103,Cocktail Bar,Marijuana Dispensary,Performing Arts Venue,Theater,Beer Bar,Café,Coffee Shop,Taco Place,New American Restaurant,Ethiopian Restaurant
26,Boba Bao Bei,94110,Mexican Restaurant,Italian Restaurant,Coffee Shop,Cocktail Bar,Art Gallery,Bakery,Latin American Restaurant,Yoga Studio,Hot Dog Joint,Hungarian Restaurant
32,District Tea,94110,Mexican Restaurant,Bar,Cocktail Bar,Gym,Vietnamese Restaurant,Music Venue,Yoga Studio,Chinese Restaurant,Chocolate Shop,Cheese Shop
44,Mr. T Cafe,94112,Mexican Restaurant,Bakery,Chinese Restaurant,Latin American Restaurant,Filipino Restaurant,Coffee Shop,Thrift / Vintage Store,Vietnamese Restaurant,Bank,Sandwich Place
49,IdentiTea,94110,Mexican Restaurant,Performing Arts Venue,Coffee Shop,Grocery Store,Fish Market,Bookstore,Pizza Place,Burrito Place,South American Restaurant,Clothing Store


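These stand out in that Asian restaurants are NOT the dominant nearby venues. Mexican Restaurant is the most common nearby venue for most of them, followed by bars, music venues, and galleries; most are in zip code 94110. 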

### Cluster 5 

In [37]:
boba_merged.loc[boba_merged['Cluster Labels'] == 4, boba_merged.columns[[0,4]+ list(range(7, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Boba Guys,94107,Gym,Coffee Shop,Breakfast Spot,Mexican Restaurant,Café,Park,Brewery,Wine Shop,American Restaurant,Pet Store
3,Boba Guys,94102,Sushi Restaurant,Wine Bar,Cocktail Bar,New American Restaurant,Coffee Shop,Gym / Fitness Center,Park,Beer Garden,Salon / Barbershop,Comic Shop
5,Boba Guys,94115,Creperie,Jazz Club,Grocery Store,Cosmetics Shop,Gift Shop,Tea Room,Indian Restaurant,New American Restaurant,Shopping Mall,Lounge
7,Plentea,94108,Coffee Shop,Gym,Restaurant,Hotel,Sushi Restaurant,Café,Sandwich Place,Roof Deck,Russian Restaurant,Salad Place
10,Boba Guys,94117,Bar,Mexican Restaurant,Wine Bar,Sushi Restaurant,Pizza Place,Nightclub,Ethiopian Restaurant,Park,Donut Shop,Dog Run
11,Tea And Others,94117,Coffee Shop,Grocery Store,Ice Cream Shop,Park,Wine Bar,Mediterranean Restaurant,Gift Shop,Cocktail Bar,Yoga Studio,Market
12,Urban Ritual,94102,Wine Bar,Dessert Shop,Clothing Store,French Restaurant,Sushi Restaurant,Pizza Place,Cocktail Bar,Coffee Shop,Ice Cream Shop,Restaurant
14,Little Sweet,94122,Bakery,Ice Cream Shop,Mediterranean Restaurant,Korean Restaurant,Vietnamese Restaurant,Art Gallery,Pizza Place,Sandwich Place,Thai Restaurant,Sushi Restaurant
15,Sharetea,94103,Coffee Shop,Hotel,Cosmetics Shop,Art Museum,Food Truck,Marijuana Dispensary,Women's Store,Tea Room,Museum,Spa
33,Sharetea,94103,Coffee Shop,Hotel,Cosmetics Shop,Art Museum,Food Truck,Marijuana Dispensary,Women's Store,Tea Room,Museum,Spa


These seem to be near coffee shops and cafes, mixed with wine bars, sushi restaurants, parks, and other eateries. No single category dominates this cluster, which is the largest of the five. 

### 2.4 Limitations of the data extracted so far

The search returns at most 50 results per query. This cap is listed in the [Foursquare API documentation](https://developer.foursquare.com/docs/api/venues/search): the limit is capped at 50 even if you specify a higher number. 

One potential alternative approach is to divide San Francisco into 4 quadrants, query each with a bounding box instead of a center and radius, and merge the results into a single table, eliminating duplicates. I doubt we would pick up many more data points, or that the additional effort would tell a substantially different story, but it is worth exploring should we need to go into further detail.   

We have also restricted ourselves to only one shop of each franchise in each zip code, because the composite key "namezip" cannot tell apart two locations of the same franchise in the same zip code. Sharetea in 94103 appears twice in the shop list but collapses into one row, which is why the grouped table has 49 rows rather than 50. We could have indexed on the Foursquare venue_id instead, should we desire more accuracy. 


## 3. Results and Discussion

### 3.1 Meaning of the clusters 

Based on the "first 50" boba places found via Foursquare, we can conclude that most Chinese-heavy neighborhoods already have at least one boba shop. In fact, on Irving Street on the west side you can find 6 shops within 8 blocks, all on the same street. That density is a bit too high. Most of the existing boba shops are, in fact, simply placed near venues and restaurants. Some are even following coffee shops such as Starbucks.

Here are the trends we drew from the five clusters we identified earlier:

* Be where Chinese shops and restaurants are
* Be where Chinese restaurants are 
* Be where coffee shops are 
* Be where many eateries and tourists and workers are, namely, near downtown 
* Be alone and away from competitors, but still near some eateries

**Personally, I do not find these trends that helpful individually, but together they seem to suggest that San Francisco's boba market may be approaching, or may have already reached, saturation.** 


### 3.2 What could be done better

#### 3.2.1 Failure to actually identify specific neighborhoods

We failed to identify actual "neighborhoods" to place the shops in, only general guidelines. 

To take this further, we can choose one (or more) of these trends and search for areas with a similar venue mix within San Francisco. 

The problem with this approach is the computing budget. We need a grid of fine resolution, down to 0.5 km squares or smaller. A rough calculation shows that covering San Francisco at 0.5 km resolution would require over 200 queries. Given that the free Foursquare developer account is limited to 500 queries a day, that can be a bit problematic. The problem gets much worse if we want finer resolution: increasing the resolution to 0.25 km would require over 800 queries. 
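
The rough calculation can be reproduced with a few lines of arithmetic. This is one set of assumptions that lands near the quoted figures, not necessarily the original calculation: a land area of about 121 km² for San Francisco, and each query covering a square inscribed in a circle of the search radius. If each cell were instead a square with 0.5 km sides, the counts would roughly double.

```python
import math

SF_LAND_AREA_KM2 = 121.0   # assumed approximate land area of San Francisco
DAILY_QUERY_LIMIT = 500    # free-tier limit quoted above

def queries_needed(radius_km, area_km2=SF_LAND_AREA_KM2):
    """Queries needed to tile an area with squares inscribed in circles of the given radius."""
    cell_area = 2 * radius_km ** 2   # a square inscribed in a circle of radius r has area 2r^2
    return math.ceil(area_km2 / cell_area)

for r in (0.5, 0.25):
    n = queries_needed(r)
    days = math.ceil(n / DAILY_QUERY_LIMIT)
    print(r, n, days)
```

Halving the radius quadruples the number of queries, so each step up in resolution quickly exceeds a single day's budget.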

And even then, it is possible we may have overfitted the data. 

#### 3.2.2 Failure to account for store popularity

In the prior analysis we only checked physical location. We did not account for number of check-ins (and thus, popularity) or ratings. We considered all boba shops equal. 

We could have included the check-in numbers and used them to weight the results. 
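
As a rough illustration of what weighting by check-ins could look like, the sketch below computes a check-in-weighted average of per-shop category frequencies, which is the weighted centroid of a group of shops. All shop names, frequencies, and check-in counts are made up, and the function name is hypothetical.

```python
def weighted_profile(profiles, weights):
    """Weighted average of per-shop category frequencies."""
    total = sum(weights.values())
    combined = {}
    for shop, freqs in profiles.items():
        w = weights[shop] / total  # this shop's share of all check-ins
        for cat, f in freqs.items():
            combined[cat] = combined.get(cat, 0.0) + w * f
    return combined

# Hypothetical frequencies and check-in counts, for illustration only
profiles = {
    'Shop A': {'Coffee Shop': 0.10, 'Boutique': 0.05},
    'Shop B': {'Coffee Shop': 0.02, 'Chinese Restaurant': 0.12},
}
checkins = {'Shop A': 3000, 'Shop B': 1000}

result = weighted_profile(profiles, checkins)
print(result)
```

In the clustering step itself, scikit-learn's `KMeans.fit` accepts a `sample_weight` argument that could serve the same purpose, so busier shops would pull cluster centers toward their venue mix.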

The problem with this approach is that it assumes check-ins and/or ratings are actually correlated with popularity, and popularity does not always correlate with profitability. 






