# Capstone Project: Battle of the Neighborhoods

### by Kasey Chang started 2020/02/20

## 1. Defining the problem

### 1.1 Background

Boba tea, the sweet tea drink often served with tapioca pearls, has been very popular with Asian populations since its invention in the 1980s. Despite its age, the market has continued to grow as the drink spread to non-Asian countries. It was estimated that the [compound growth rate of the boba tea market from 2017 to 2023 will be 7.3%.](https://journal.businesstoday.org/bt-online/2019/its-quali-tea-how-boba-became-a-craze)



<img src="https://media.istockphoto.com/photos/bubble-tea-in-a-row-picture-id531948035?k=6&m=531948035&s=612x612&w=0&h=42OINPNXpnwtXAuxkTgFomZ9oW9q8r2LtBoxa5qG1pU=">

### 1.2 Problem 

You are a boba tea franchise moving into San Francisco, and you need to know where to open your stores. You don't want to open one where there is no one to appreciate it (few or no boba tea consumers), but you also don't want an area where you face heavy competition from an established franchise such as [Quickly USA](http://www.quicklyusa.com/). 

Indeed, as we will see later, some areas are over-saturated with boba tea shops. 


### Question: Which neighborhood is good for a boba tea store?

NOTE: In Foursquare, a "boba tea shop" is referred to as a "Bubble Tea Shop", categoryID 52e81612bcbc57f1066b7a0c

To rephrase the question in data science terms: 

### Q: How can we leverage Foursquare data to look for areas that have few or no boba tea shops, yet are similar to areas that already have some boba tea shops?


This includes the following assumptions:

* that people would not go out of their way for boba tea, so the store is near where they live, play, or work
* the neighborhoods that have <=N boba tea stores are the ones we are looking for

Where N is still to be determined. 

### 1.3 Interest

If this technique works, it can be generalized to look for areas underserved by any sort of business, provided that Foursquare has that venue type, that the city already has some businesses of that type, and that the city is large enough to have a good variety of venues to explore. 

## 2. Getting the Data

### 2.1 Data Sources

San Francisco venue data is readily available through the Foursquare API once one has signed up (for free) as a developer and received the requisite Client_ID and Client_Secret keys. 

The original plan was to find a set of definitions of the neighborhoods of San Francisco, query their characteristics from Foursquare, compare them to the characteristics around each boba shop based on the venue types in each, and identify neighborhoods that show high correlation yet have no boba shops nearby. 
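As a rough illustration of what "comparing characteristics" could look like, here is a minimal sketch that scores two areas by the cosine similarity of their venue-type frequencies. Cosine similarity is an assumption chosen for illustration (the plan above only says "correlation"), and the frequency values are hypothetical, not taken from Foursquare.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse category -> frequency dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical venue-type frequencies, for illustration only.
near_boba_shop = {"Chinese Restaurant": 0.3, "Bakery": 0.2, "Coffee Shop": 0.1}
candidate_a = {"Chinese Restaurant": 0.25, "Bakery": 0.15, "Bank": 0.1}
candidate_b = {"Gym": 0.4, "Park": 0.3}

print(cosine_similarity(near_boba_shop, candidate_a))  # shares two categories
print(cosine_similarity(near_boba_shop, candidate_b))  # shares none
```

An area with a high score but no boba shop nearby would be a candidate location under this scheme.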

However, this turned out to be far more difficult than anticipated. 

#### 2.1.1 Lack of Universal definition of neighborhood

We will need a definition of what the neighborhoods of San Francisco are. However, as it turns out, there is no "definitive" list of neighborhoods in San Francisco, nor are there official boundaries for many of them. Realtors, city planners, etc. have different definitions. [San Francisco Planning Department identifies 36 neighborhoods](https://data.sfgov.org/Geographic-Locations-and-Boundaries/Planning-Neighborhood-Groups-Map/iacs-ws63), but their boundaries are irregular and impossible to query with Foursquare. And as we will see later, the "neighborhood" field in Foursquare was crowd-sourced and is often wrong or left blank, so we are unable to rely on that field. 

Thus, we are unable to go by "neighborhood". What else can we use?

#### 2.1.2 We can't use zip code either. 

We do have the zip code, which is always available in the venue data returned by Foursquare.  

However, querying a radius around a zip code is problematic because some zip code areas are much smaller than others: 94124 is huge compared to 94108, for example, and the 94104/94105/94111 area is split into three zip codes due to its high business density. 

<img src="http://www.healthysf.org/bdi/outcomes/images/zip-map.jpg">

At best, we get statistics about some arbitrary "center" of the zipcode area, which is a tiny area and would be almost useless. Furthermore, Foursquare queries are based on radius and thus, a circular area, while neighborhoods have irregular boundaries. 

#### 2.1.3 A Grid-based approach

San Francisco can be roughly bounded by the GPS coordinates NE = 37°48.5′N, 122°22′W and SW = 37°42.5′N, 122°31′W.
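A grid over this bounding box could be generated as in the sketch below. The decimal-degree conversion of the corner coordinates, the 500 m spacing, and the helper name `make_grid` are all assumptions made for illustration; they are not part of the original analysis.

```python
import math

# Bounding box from the text, converted to decimal degrees (assumed conversion).
NORTH, SOUTH = 37 + 48.5 / 60, 37 + 42.5 / 60
EAST, WEST = -(122 + 22 / 60), -(122 + 31 / 60)

METERS_PER_DEG_LAT = 111_320.0  # rough average, adequate for a city-sized box


def make_grid(south, north, west, east, spacing_m=500.0):
    """Return (lat, lng) points spaced roughly `spacing_m` meters apart."""
    mid_lat = math.radians((south + north) / 2)
    dlat = spacing_m / METERS_PER_DEG_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlng = spacing_m / (METERS_PER_DEG_LAT * math.cos(mid_lat))
    n_rows = int((north - south) / dlat) + 1
    n_cols = int((east - west) / dlng) + 1
    return [
        (south + r * dlat, west + c * dlng)
        for r in range(n_rows)
        for c in range(n_cols)
    ]


grid = make_grid(SOUTH, NORTH, WEST, EAST)
print(len(grid), "grid points; first:", grid[0])
```

Each grid point could then serve as the center of one radius-based Foursquare query. Note that many points of such a rectangular grid fall in the ocean or the bay and would simply return no venues.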



### 2.2 Data Cleaning

No pre-existing data sets are used, so none require cleaning. 

### Building the boba place table

We need to query Foursquare for all venues that fit the "bubble tea shop" category ID


In [3]:
#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab



In [4]:
import pandas as pd
import numpy as np
import io
import requests


In [5]:
import json
from pandas.io.json import json_normalize
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library
LIMIT=100 #max return
rad=500 #meters

CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (placeholder; never publish real credentials)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version

In [6]:
sf_lat=37.7749     #center of san francisco
sf_long=-122.4194  #center of san francisco
sf_rad=15000       #15 km, which should cover ALL of San Francisco
cat_id="52e81612bcbc57f1066b7a0c"    #"bubble tea place"
  
browse_url='https://api.foursquare.com/v2/venues/search?intent=browse&client_id={}&client_secret={}&categoryId={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET,
    cat_id,
    sf_lat, 
    sf_long,
    VERSION,
    sf_rad,
    LIMIT)

results = requests.get(browse_url).json()


In [7]:
venues = results['response']

boba_venues = json_normalize(venues['venues']) # flatten JSON to get the venues

# filter columns
filtered_columns = ['name', 'location.lat', 'location.lng','location.neighborhood','location.postalCode']
boba_venues2 =boba_venues.loc[:, filtered_columns]


In [8]:
boba_venues2.head()

Unnamed: 0,name,location.lat,location.lng,location.neighborhood,location.postalCode
0,Asha Tea House,37.788175,-122.403615,,94108
1,Boba Guys,37.766448,-122.397042,Showplace Square,94107
2,Black Sugar,37.786135,-122.409948,Lower Nob Hill,94102
3,Boba Guys,37.772907,-122.423507,,94102
4,Boba Guys,37.789899,-122.407077,Downtown San Francisco-Union Square,94108


In [9]:
boba_venues2.shape

(50, 5)

### Data Discoveries about the Boba Shops

A casual inspection of the data (50 boba tea places were returned, more than I thought!) shows that many shops do NOT list a neighborhood, but at least all of them list a zip code. Let's plot them on a map of SF and see what happens.

In [10]:
map_clusters = folium.Map(location=[sf_lat,sf_long], zoom_start=12)

# add markers to the map
markers_colors = []
for lat, lon, name, zipcode in zip(boba_venues2['location.lat'], boba_venues2['location.lng'],boba_venues2['name'],boba_venues2['location.postalCode']):
    #label = folium.Popup(str(name) + ' Zipcode = ' + str(zipcode), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        #popup=label,
        #color=red,
        #fill=True,
        #fill_color=red,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

My radius of 15km was too large! It included shops in Daly City, Oakland, and South San Francisco! 

We re-run the query with the radius set to 10km instead. NOTE: We will discuss the limitation of this approach later. 

In [11]:
sf_rad=10000       #10 km, which should cover ALL of San Francisco
  
browse_url='https://api.foursquare.com/v2/venues/search?intent=browse&client_id={}&client_secret={}&categoryId={}&ll={},{}&v={}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET,
    cat_id,
    sf_lat, 
    sf_long,
    VERSION,
    sf_rad,
    LIMIT)

results = requests.get(browse_url).json()

venues = results['response']

boba_venues = json_normalize(venues['venues']) # flatten JSON to get the venues

# filter columns
filtered_columns = ['name', 'location.lat', 'location.lng','location.neighborhood','location.postalCode']
boba_venues2 = boba_venues.loc[:, filtered_columns].copy()  # copy so that adding columns later does not warn



In [12]:
map_clusters = folium.Map(location=[sf_lat,sf_long], zoom_start=12)

# add markers to the map
markers_colors = []
for lat, lon, name, zipcode in zip(boba_venues2['location.lat'], boba_venues2['location.lng'],boba_venues2['name'],boba_venues2['location.postalCode']):
    label = folium.Popup(str(name) + ' Zipcode = ' + str(zipcode), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        #popup=label,
        #color=red,
        #fill=True,
        #fill_color=red,
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [13]:
boba_venues2.shape

(50, 5)

Please note that Foursquare returned only 50 results for both radii. It is therefore likely that there are MORE than 50 boba tea shops in San Francisco, and that we have hit an API limit (despite setting limit to 100 in the query). 
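One possible way around a per-query result cap is to split the search into several smaller queries and merge the results, dropping any venue that appears more than once. The sketch below shows only the merge logic; `fake_fetch` is a hypothetical stand-in for a real API call, and it assumes each venue record carries a unique `id` field.

```python
def collect_venues(centers, fetch):
    """Query each (lat, lng) center and merge results, de-duplicating by venue id."""
    seen = {}
    for lat, lng in centers:
        for venue in fetch(lat, lng):
            # Keep the first copy of each venue; later duplicates are ignored.
            seen.setdefault(venue["id"], venue)
    return list(seen.values())


# Hypothetical stand-in for a real API call: returns at most `cap` venues
# per query, mimicking a result ceiling.
def fake_fetch(lat, lng, cap=2):
    data = {
        (37.78, -122.41): [{"id": "a"}, {"id": "b"}, {"id": "c"}],
        (37.76, -122.48): [{"id": "b"}, {"id": "d"}],
    }
    return data.get((lat, lng), [])[:cap]


venues = collect_venues([(37.78, -122.41), (37.76, -122.48)], fake_fetch)
print(sorted(v["id"] for v in venues))
```

In this toy run, venue `c` is lost to the cap in the first query, which shows why the sub-areas need to be small enough that no single query reaches the ceiling.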


## We will build a table of venues around each of the boba shops we found

We will set an arbitrary distance of 500m (rad=500 earlier), making an inherent assumption that people would not want to walk very far to get a cup of boba tea. 
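To get a feel for what 500m means on the map, the distance between two coordinates can be estimated with the haversine formula. This is a minimal sketch and not part of the original pipeline; the helper name `haversine_m` is an assumption, and the two coordinates are taken from the shop table printed earlier.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Two shops from the table above: Asha Tea House and the Boba Guys in 94108.
asha = (37.788175, -122.403615)
boba_guys = (37.789899, -122.407077)

distance = haversine_m(*asha, *boba_guys)
print(round(distance), "m apart; within 500 m:", distance <= 500)
```

Shops this close together will have heavily overlapping 500m circles, so their nearby-venue profiles should be expected to look alike.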

We'll borrow a function from a previous project...

In [14]:
def getNearbyVenues(names, latitudes, longitudes, radius=rad):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

We will manipulate the table to separate the shops that have more than one location, assuming only one per zip code. This may generate duplicates later, and I know I should have changed all the field names, but it works as is. We do this by creating a composite name called "namezip" by concatenating name and zip code. 

In [15]:
boba_venues2['namezip']=boba_venues2['name']+boba_venues2['location.postalCode']
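The one-shop-per-zip-code assumption can be checked by counting how often each composite key occurs. The sketch below uses plain Python and a handful of name and zip code pairs from the shop list printed in this notebook; it is an illustration, not a step in the original pipeline.

```python
from collections import Counter

# A few (name, postal code) pairs in the shape of boba_venues2; values taken
# from the shop list printed in this notebook.
shops = [
    ("Boba Guys", "94102"),
    ("Boba Guys", "94108"),
    ("Sharetea", "94103"),
    ("Sharetea", "94103"),
]

keys = [name + zipcode for name, zipcode in shops]
duplicates = {k: n for k, n in Counter(keys).items() if n > 1}
print(duplicates)  # keys shared by more than one shop
```

Here `Sharetea94103` is shared by two shops, which would explain why grouping by `namezip` yields 49 rows from 50 shops.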

In [16]:
boba_venues3 = getNearbyVenues(names=boba_venues2['namezip'],
                             latitudes=boba_venues2['location.lat'],
                             longitudes=boba_venues2['location.lng']
                             )

Asha Tea House94108
Boba Guys94107
Black Sugar94102
Boba Guys94102
Boba Guys94108
Boba Guys94115
Yi Fang Taiwan Fruit Tea94132
Plentea94108
Yi Fang Taiwan Fruit Tea94108
CoCo Fresh Tea & Juice94104
Boba Guys94117
Tea And Others94117
Urban Ritual94102
Boba Guys94110
Little Sweet94122
Sharetea94103
Little Sweet94118
Brew Cha94110
Teaspoon94109
SimplexiTea94103
fifty/fifty94118
Little Sweet94108
Tpumps94122
i-Tea94108
Gong Cha94102
Boba Butt Tea House94108
Boba Bao Bei94110
The Posh Bagel94114
i-Tea94122
Wonderful Desserts & Cafe94122
Purple Kow94121
Enough94108
Sharetea94103
District Tea94110
Mr and Mrs Tea House94118
B&B94122
Qualitea94114
Wonder Tea94122
TJ Brewed Tea and Real Fruit (TJ Cups)94122
Quickly (Kobe Bento) 快可立94133
Happy Cow Creamery & Tea94107
OMG Tea94134
Super Cue Cafe94116
Teapenter94122
Mr. T Cafe94112
Tea Hut94115
CoCo Fresh Tea & Juice94133
Sunday at the Museum94102
STEEP94107
IdentiTea94110


In [17]:
boba_venues3.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Asha Tea House94108,37.788175,-122.403615,Flatiron Wine and Spirits,37.788039,-122.401466,Wine Shop
1,Asha Tea House94108,37.788175,-122.403615,Saint Laurent,37.787774,-122.405412,Boutique
2,Asha Tea House94108,37.788175,-122.403615,Maison Margiela,37.788261,-122.405765,Boutique
3,Asha Tea House94108,37.788175,-122.403615,Crocker Galleria Roof Terrace,37.789146,-122.402447,Roof Deck
4,Asha Tea House94108,37.788175,-122.403615,Cask,37.787114,-122.403092,Liquor Store


In [18]:
boba_venues3.shape

(3958, 7)

In [19]:
boba_venues3.groupby('Neighbourhood').count()

Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Asha Tea House94108,100,100,100,100,100,100
B&B94122,60,60,60,60,60,60
Black Sugar94102,100,100,100,100,100,100
Boba Bao Bei94110,100,100,100,100,100,100
Boba Butt Tea House94108,100,100,100,100,100,100
Boba Guys94102,100,100,100,100,100,100
Boba Guys94107,51,51,51,51,51,51
Boba Guys94108,100,100,100,100,100,100
Boba Guys94110,100,100,100,100,100,100
Boba Guys94115,99,99,99,99,99,99


In [20]:
print('There are {} uniques categories.'.format(len(boba_venues3['Venue Category'].unique())))

There are 307 uniques categories.


Let's collapse the table and see what's near each store in summary form...

In [21]:

boba_onehot = pd.get_dummies(boba_venues3[['Venue Category']], prefix="", prefix_sep="")

boba_onehot['namezip'] = boba_venues3['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [boba_onehot.columns[-1]] + list(boba_onehot.columns[:-1])
boba_onehot = boba_onehot[fixed_columns]

boba_onehot.head()

Unnamed: 0,namezip,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Warehouse,Watch Shop,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Asha Tea House94108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
boba_grouped = boba_onehot.groupby('namezip').mean().reset_index()
boba_grouped.head()

Unnamed: 0,namezip,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Warehouse,Watch Shop,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Asha Tea House94108,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01
1,B&B94122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Black Sugar94102,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,...,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.02,0.0
3,Boba Bao Bei94110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
4,Boba Butt Tea House94108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.02,0.01,0.0,0.0,0.0,0.0


In [23]:
boba_grouped.shape

(49, 308)
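The reason the mean of the one-hot columns is useful: for each shop, it equals the share of nearby venues that fall into each category. A minimal pure-Python sketch, using hypothetical categories for a single shop:

```python
from collections import Counter

# Hypothetical venue categories found around one shop, for illustration only.
categories = ["Coffee Shop", "Bakery", "Coffee Shop", "Bank"]

# One-hot encode: one row per venue, one column per category.
columns = sorted(set(categories))
one_hot = [[1 if c == col else 0 for col in columns] for c in categories]

# Column means of the one-hot rows...
means = {col: sum(row[i] for row in one_hot) / len(one_hot)
         for i, col in enumerate(columns)}

# ...equal each category's share of the venues.
shares = {col: n / len(categories) for col, n in Counter(categories).items()}

print(means)
```

This also explains why shops with 100 nearby venues show frequencies in steps of 0.01, while a shop with 51 venues shows values such as 0.019608 (1/51).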

So what are the most common venues near each store?

In [24]:
num_top_venues = 5

for boba in boba_grouped['namezip']:
    print("----"+boba+"----")
    temp = boba_grouped[boba_grouped['namezip'] == boba].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Asha Tea House94108----
            venue  freq
0        Boutique  0.07
1     Coffee Shop  0.05
2     Men's Store  0.04
3  Clothing Store  0.04
4            Café  0.03


----B&B94122----
                   venue  freq
0            Coffee Shop  0.07
1  Vietnamese Restaurant  0.07
2        Bubble Tea Shop  0.07
3    Szechuan Restaurant  0.05
4          Deli / Bodega  0.05


----Black Sugar94102----
            venue  freq
0         Theater  0.06
1           Hotel  0.05
2  Clothing Store  0.03
3    Cocktail Bar  0.03
4     Music Venue  0.03


----Boba Bao Bei94110----
                venue  freq
0  Mexican Restaurant  0.08
1         Coffee Shop  0.05
2                 Bar  0.05
3                Café  0.04
4   Indian Restaurant  0.04


----Boba Butt Tea House94108----
                     venue  freq
0              Coffee Shop  0.07
1       Chinese Restaurant  0.06
2              Men's Store  0.05
3  New American Restaurant  0.04
4                   Bakery  0.04


----Boba Guys94102----
[... output truncated ...]

                   venue  freq
0        Bubble Tea Shop  0.07
1  Vietnamese Restaurant  0.07
2                 Bakery  0.06
3           Dessert Shop  0.06
4     Chinese Restaurant  0.06


----Wonderful Desserts & Cafe94122----
                   venue  freq
0     Chinese Restaurant  0.06
1  Vietnamese Restaurant  0.06
2        Bubble Tea Shop  0.06
3                 Bakery  0.05
4           Dessert Shop  0.05


----Yi Fang Taiwan Fruit Tea94108----
                venue  freq
0  Chinese Restaurant  0.08
1         Coffee Shop  0.07
2              Bakery  0.06
3         Men's Store  0.05
4        Cocktail Bar  0.04


----Yi Fang Taiwan Fruit Tea94132----
            venue  freq
0          Bakery  0.07
1  Clothing Store  0.05
2  Lingerie Store  0.05
3  Sandwich Place  0.05
4     Candy Store  0.05


----fifty/fifty94118----
             venue  freq
0         Wine Bar  0.10
1  Thai Restaurant  0.10
2             Bank  0.10
3            Diner  0.05
4           Bakery  0.05


----i-Tea94108----
[... output truncated ...]

The most popular venues near each boba store are restaurants (no matter which type), cafes, or bars. 

Let's see that in a flattened table, borrowing from a previous project... 

In [25]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [26]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['namezip']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['namezip'] = boba_grouped['namezip']

for ind in np.arange(boba_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(boba_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,namezip,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House94108,Boutique,Coffee Shop,Clothing Store,Men's Store,Café,Art Museum,Gym / Fitness Center,Hotel,Cocktail Bar,Optical Shop
1,B&B94122,Coffee Shop,Bubble Tea Shop,Vietnamese Restaurant,Japanese Restaurant,Deli / Bodega,Szechuan Restaurant,Grocery Store,Sushi Restaurant,Dessert Shop,Dumpling Restaurant
2,Black Sugar94102,Theater,Hotel,Cocktail Bar,Spa,Cosmetics Shop,Music Venue,Clothing Store,Electronics Store,Speakeasy,Women's Store
3,Boba Bao Bei94110,Mexican Restaurant,Bar,Coffee Shop,Café,Indian Restaurant,Italian Restaurant,Latin American Restaurant,Grocery Store,Record Shop,Fish Market
4,Boba Butt Tea House94108,Coffee Shop,Chinese Restaurant,Men's Store,Bakery,New American Restaurant,Dive Bar,Gym,Szechuan Restaurant,Italian Restaurant,Cocktail Bar


In [27]:
neighborhoods_venues_sorted.shape

(49, 11)

### Let's try to find some patterns... 

In [30]:
neighborhoods_venues_sorted

Unnamed: 0,namezip,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House94108,Boutique,Coffee Shop,Clothing Store,Men's Store,Café,Art Museum,Gym / Fitness Center,Hotel,Cocktail Bar,Optical Shop
1,B&B94122,Coffee Shop,Bubble Tea Shop,Vietnamese Restaurant,Japanese Restaurant,Deli / Bodega,Szechuan Restaurant,Grocery Store,Sushi Restaurant,Dessert Shop,Dumpling Restaurant
2,Black Sugar94102,Theater,Hotel,Cocktail Bar,Spa,Cosmetics Shop,Music Venue,Clothing Store,Electronics Store,Speakeasy,Women's Store
3,Boba Bao Bei94110,Mexican Restaurant,Bar,Coffee Shop,Café,Indian Restaurant,Italian Restaurant,Latin American Restaurant,Grocery Store,Record Shop,Fish Market
4,Boba Butt Tea House94108,Coffee Shop,Chinese Restaurant,Men's Store,Bakery,New American Restaurant,Dive Bar,Gym,Szechuan Restaurant,Italian Restaurant,Cocktail Bar
5,Boba Guys94102,Wine Bar,Cocktail Bar,French Restaurant,New American Restaurant,Sushi Restaurant,Clothing Store,Italian Restaurant,Bubble Tea Shop,Café,Boutique
6,Boba Guys94107,Gym,Coffee Shop,Breakfast Spot,Mexican Restaurant,Café,Park,Brewery,Wine Shop,American Restaurant,Pet Store
7,Boba Guys94108,Boutique,American Restaurant,Café,Clothing Store,Hotel,Coffee Shop,Cocktail Bar,Gym / Fitness Center,Spa,Cosmetics Shop
8,Boba Guys94110,Bar,Ice Cream Shop,Cocktail Bar,Mexican Restaurant,Music Venue,New American Restaurant,Coffee Shop,Gift Shop,Bakery,Deli / Bodega
9,Boba Guys94115,Gift Shop,Tea Room,Japanese Restaurant,Bakery,Shopping Mall,Cosmetics Shop,Grocery Store,Ramen Restaurant,Asian Restaurant,Optical Shop


In [28]:
boba_grouped

Unnamed: 0,namezip,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Arcade,Argentinian Restaurant,Art Gallery,...,Warehouse,Watch Shop,Weight Loss Center,Whisky Bar,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio
0,Asha Tea House94108,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01
1,B&B94122,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Black Sugar94102,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.02,...,0.0,0.01,0.0,0.0,0.01,0.0,0.01,0.0,0.02,0.0
3,Boba Bao Bei94110,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01
4,Boba Butt Tea House94108,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.01,0.02,0.01,0.0,0.0,0.0,0.0
5,Boba Guys94102,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.05,0.02,0.0,0.0,0.0,0.01
6,Boba Guys94107,0.019608,0.0,0.0,0.0,0.039216,0.019608,0.0,0.019608,0.0,...,0.0,0.0,0.0,0.0,0.0,0.039216,0.0,0.0,0.0,0.019608
7,Boba Guys94108,0.0,0.01,0.0,0.0,0.04,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0
8,Boba Guys94110,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.02,...,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01
9,Boba Guys94115,0.0,0.0,0.0,0.0,0.010101,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.010101


We try to sort them into 5 clusters using k-means, as computers can often spot trends among so many variables that we humans can't see. 

In [29]:
kclusters = 5

boba_grouped_clustering = boba_grouped.drop(columns='namezip')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(boba_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([2, 0, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1,
       0, 3, 3, 2, 4, 1, 1, 1, 2, 1, 2, 4, 4, 1, 1, 0, 1, 1, 0, 1, 0, 0,
       2, 1, 1, 2, 0], dtype=int32)
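Before reading too much into the clusters, it helps to see how many shops landed in each one. The sketch below simply counts the labels printed above; the list is copied from that output rather than recomputed.

```python
from collections import Counter

# Cluster labels copied from the kmeans.labels_ output above.
labels = [2, 0, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1,
          0, 3, 3, 2, 4, 1, 1, 1, 2, 1, 2, 4, 4, 1, 1, 0, 1, 1, 0, 1, 0, 0,
          2, 1, 1, 2, 0]

sizes = dict(sorted(Counter(labels).items()))
for cluster, count in sizes.items():
    print("cluster", cluster, "->", count, "shops")
```

The clusters are quite uneven: one holds about half of the 49 groups, while two hold only two or three, so the smaller ones should be interpreted with caution.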

In [31]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

boba_merged = boba_venues2

# merge the cluster labels and top venues into the shop table to keep latitude/longitude for each shop
boba_merged = boba_merged.join(neighborhoods_venues_sorted.set_index('namezip'), on='namezip',how='outer')

# Somehow this join created NaN values in the table, had to drop them
#boba_merged.dropna(inplace=True)

#then recast the column back to numeric
#boba_merged['Cluster Labels']=pd.to_numeric(boba_merged['Cluster Labels'],downcast='integer')

boba_merged.head() 

Unnamed: 0,name,location.lat,location.lng,location.neighborhood,location.postalCode,namezip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House,37.788175,-122.403615,,94108,Asha Tea House94108,2,Boutique,Coffee Shop,Clothing Store,Men's Store,Café,Art Museum,Gym / Fitness Center,Hotel,Cocktail Bar,Optical Shop
1,Boba Guys,37.766448,-122.397042,Showplace Square,94107,Boba Guys94107,1,Gym,Coffee Shop,Breakfast Spot,Mexican Restaurant,Café,Park,Brewery,Wine Shop,American Restaurant,Pet Store
2,Black Sugar,37.786135,-122.409948,Lower Nob Hill,94102,Black Sugar94102,2,Theater,Hotel,Cocktail Bar,Spa,Cosmetics Shop,Music Venue,Clothing Store,Electronics Store,Speakeasy,Women's Store
3,Boba Guys,37.772907,-122.423507,,94102,Boba Guys94102,1,Wine Bar,Cocktail Bar,French Restaurant,New American Restaurant,Sushi Restaurant,Clothing Store,Italian Restaurant,Bubble Tea Shop,Café,Boutique
4,Boba Guys,37.789899,-122.407077,Downtown San Francisco-Union Square,94108,Boba Guys94108,2,Boutique,American Restaurant,Café,Clothing Store,Hotel,Coffee Shop,Cocktail Bar,Gym / Fitness Center,Spa,Cosmetics Shop


### Visualize...

In [32]:
boba_merged.shape

(50, 17)

In [51]:
# create map
map_clusters = folium.Map(location=[sf_lat,sf_long], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(boba_merged['location.lat'], boba_merged['location.lng'], boba_merged['namezip'], boba_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Looks actually quite reasonable. There are noticeable clusters and as a San Francisco resident, I can see some commonalities for the clusters. 

### Cluster 1 (Red)

In [44]:
boba_merged.loc[boba_merged['Cluster Labels'] == 0, boba_merged.columns[[0,4]+ list(range(5, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,namezip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
22,Tpumps,94122,Tpumps94122,0,Bubble Tea Shop,Chinese Restaurant,Vietnamese Restaurant,Bakery,Dessert Shop,Dumpling Restaurant,Deli / Bodega,Szechuan Restaurant,Japanese Restaurant,Bank
28,i-Tea,94122,i-Tea94122,0,Chinese Restaurant,Bubble Tea Shop,Vietnamese Restaurant,Dessert Shop,Bakery,Dumpling Restaurant,Deli / Bodega,Szechuan Restaurant,Dim Sum Restaurant,Asian Restaurant
29,Wonderful Desserts & Cafe,94122,Wonderful Desserts & Cafe94122,0,Chinese Restaurant,Bubble Tea Shop,Vietnamese Restaurant,Dessert Shop,Bakery,Dumpling Restaurant,Deli / Bodega,Szechuan Restaurant,Dim Sum Restaurant,Bar
34,Mr and Mrs Tea House,94118,Mr and Mrs Tea House94118,0,Japanese Restaurant,Chinese Restaurant,Bakery,Burmese Restaurant,Sushi Restaurant,Korean Restaurant,Thai Restaurant,Vietnamese Restaurant,Asian Restaurant,Café
35,B&B,94122,B&B94122,0,Coffee Shop,Bubble Tea Shop,Vietnamese Restaurant,Japanese Restaurant,Deli / Bodega,Szechuan Restaurant,Grocery Store,Sushi Restaurant,Dessert Shop,Dumpling Restaurant
37,Wonder Tea,94122,Wonder Tea94122,0,Vietnamese Restaurant,Bubble Tea Shop,Chinese Restaurant,Dessert Shop,Bakery,Szechuan Restaurant,Dim Sum Restaurant,Deli / Bodega,Thai Restaurant,Light Rail Station
43,Teapenter,94122,Teapenter94122,0,Chinese Restaurant,Coffee Shop,Vietnamese Restaurant,Bakery,Szechuan Restaurant,Bubble Tea Shop,Bar,Deli / Bodega,Dessert Shop,Dumpling Restaurant


These 7 are similar in that six of them are on the same street, Irving; the seventh is on Clement. Both streets are known as second Chinatowns, so that's no surprise. The breakdown shows there is a heavy concentration of boba shops there, except perhaps on Clement Street (the northern red dot). 

The lesson to be learned here is to go where the Chinese restaurants and shops are, if they are in sufficient numbers. 

### Cluster 2 (Purple)

In [45]:
boba_merged.loc[boba_merged['Cluster Labels'] == 1, boba_merged.columns[[0,4]+ list(range(5, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,namezip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Boba Guys,94107,Boba Guys94107,1,Gym,Coffee Shop,Breakfast Spot,Mexican Restaurant,Café,Park,Brewery,Wine Shop,American Restaurant,Pet Store
3,Boba Guys,94102,Boba Guys94102,1,Wine Bar,Cocktail Bar,French Restaurant,New American Restaurant,Sushi Restaurant,Clothing Store,Italian Restaurant,Bubble Tea Shop,Café,Boutique
5,Boba Guys,94115,Boba Guys94115,1,Gift Shop,Tea Room,Japanese Restaurant,Bakery,Shopping Mall,Cosmetics Shop,Grocery Store,Ramen Restaurant,Asian Restaurant,Optical Shop
6,Yi Fang Taiwan Fruit Tea,94132,Yi Fang Taiwan Fruit Tea94132,1,Bakery,Food Truck,Cosmetics Shop,Lingerie Store,Candy Store,Gym,Clothing Store,Sandwich Place,Gift Shop,Grocery Store
10,Boba Guys,94117,Boba Guys94117,1,Bar,Sushi Restaurant,Ethiopian Restaurant,Wine Bar,Café,Grocery Store,Mexican Restaurant,Restaurant,Diner,Nightclub
11,Tea And Others,94117,Tea And Others94117,1,Coffee Shop,Yoga Studio,Mediterranean Restaurant,Boutique,Grocery Store,Dive Bar,Record Shop,Wine Bar,Gift Shop,Park
12,Urban Ritual,94102,Urban Ritual94102,1,Wine Bar,Clothing Store,Boutique,French Restaurant,Optical Shop,Café,Sushi Restaurant,New American Restaurant,Pizza Place,Cocktail Bar
13,Boba Guys,94110,Boba Guys94110,1,Bar,Ice Cream Shop,Cocktail Bar,Mexican Restaurant,Music Venue,New American Restaurant,Coffee Shop,Gift Shop,Bakery,Deli / Bodega
14,Little Sweet,94122,Little Sweet94122,1,Garden,Coffee Shop,Ice Cream Shop,Sushi Restaurant,Pizza Place,Bakery,Vietnamese Restaurant,Gym,Sandwich Place,Thai Restaurant
16,Little Sweet,94118,Little Sweet94118,1,Chinese Restaurant,Pizza Place,Italian Restaurant,Wine Shop,Japanese Restaurant,Korean Restaurant,Burger Joint,Burmese Restaurant,Massage Studio,Bakery


No location feature stands out for this cluster; the shops are spread all over the city. Several of them are in the Mission district, usually known as a Latino neighborhood. 

Looking at the nearest venues shows that they are close to restaurants, eateries, or bars. The southernmost purple dot, in the southwest corner, is actually in the food court of the Stonestown Galleria mall. 

### Cluster 3 (Light Blue)

In [46]:
boba_merged.loc[boba_merged['Cluster Labels'] == 2, boba_merged.columns[[0,4]+ list(range(5, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,namezip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Asha Tea House,94108,Asha Tea House94108,2,Boutique,Coffee Shop,Clothing Store,Men's Store,Café,Art Museum,Gym / Fitness Center,Hotel,Cocktail Bar,Optical Shop
2,Black Sugar,94102,Black Sugar94102,2,Theater,Hotel,Cocktail Bar,Spa,Cosmetics Shop,Music Venue,Clothing Store,Electronics Store,Speakeasy,Women's Store
4,Boba Guys,94108,Boba Guys94108,2,Boutique,American Restaurant,Café,Clothing Store,Hotel,Coffee Shop,Cocktail Bar,Gym / Fitness Center,Spa,Cosmetics Shop
7,Plentea,94108,Plentea94108,2,Coffee Shop,Gym,Gym / Fitness Center,Boutique,French Restaurant,Hotel,Men's Store,Bar,Sushi Restaurant,Café
8,Yi Fang Taiwan Fruit Tea,94108,Yi Fang Taiwan Fruit Tea94108,2,Chinese Restaurant,Coffee Shop,Bakery,Men's Store,Cocktail Bar,Italian Restaurant,Szechuan Restaurant,Dive Bar,Tea Room,Bookstore
9,CoCo Fresh Tea & Juice,94104,CoCo Fresh Tea & Juice94104,2,Coffee Shop,Boutique,Sushi Restaurant,Men's Store,Hotel,Sandwich Place,Gym / Fitness Center,Clothing Store,Cocktail Bar,Art Museum
15,Sharetea,94103,Sharetea94103,2,Coffee Shop,Hotel,Cosmetics Shop,Cocktail Bar,Women's Store,Pizza Place,Furniture / Home Store,Department Store,Art Gallery,Spa
32,Sharetea,94103,Sharetea94103,2,Coffee Shop,Hotel,Cosmetics Shop,Cocktail Bar,Women's Store,Pizza Place,Furniture / Home Store,Department Store,Art Gallery,Spa
21,Little Sweet,94108,Little Sweet94108,2,Boutique,Clothing Store,Hotel,Men's Store,Jewelry Store,Gym / Fitness Center,Cosmetics Shop,Optical Shop,Furniture / Home Store,Cocktail Bar
23,i-Tea,94108,i-Tea94108,2,Coffee Shop,Hotel,Clothing Store,Gym,Boutique,Men's Store,French Restaurant,Sandwich Place,Cocktail Bar,Restaurant


This group is all located near downtown. Two are in or near Chinatown, while the rest are between Chinatown and Market Street, on busy streets with a lot of foot traffic, such as Kearny and O'Farrell. There is an outlier near Civic Center, but that turned out to be a boba shop inside the Asian Art Museum. It is worth noting that most of them have a coffee shop among their most common nearby venues, suggesting they are opening near a Starbucks or Peet's. 

### Cluster 4 (Lime Green)

These dots are very hard to see on the map: look just right of the southernmost purple dot, then further right toward the freeway.

In [47]:
boba_merged.loc[boba_merged['Cluster Labels'] == 3, boba_merged.columns[[0,4]+ list(range(5, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,namezip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
41,OMG Tea,94134,OMG Tea94134,3,Chinese Restaurant,Vietnamese Restaurant,Bubble Tea Shop,Coffee Shop,Sandwich Place,Bakery,Storage Facility,BBQ Joint,Cantonese Restaurant,Brewery
44,Mr. T Cafe,94112,Mr. T Cafe94112,3,Mexican Restaurant,Bakery,Chinese Restaurant,Latin American Restaurant,Coffee Shop,Thrift / Vintage Store,Bank,Filipino Restaurant,Vietnamese Restaurant,Sandwich Place


These two are outliers, lying well to the south of all the others. But they are, again, grouped close to other eateries, with Chinese Restaurant being the most common and third most common venue, respectively. 

### Cluster 5 (Brown)

In [48]:
boba_merged.loc[boba_merged['Cluster Labels'] == 4, boba_merged.columns[[0,4]+ list(range(5, boba_merged.shape[1]))]]

Unnamed: 0,name,location.postalCode,namezip,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
30,Purple Kow,94121,Purple Kow94121,4,Chinese Restaurant,Café,Japanese Restaurant,Bakery,Bus Station,Sporting Goods Shop,Yoga Studio,Dessert Shop,Burrito Place,Sandwich Place
38,TJ Brewed Tea and Real Fruit (TJ Cups),94122,TJ Brewed Tea and Real Fruit (TJ Cups)94122,4,Chinese Restaurant,Dim Sum Restaurant,Grocery Store,Japanese Restaurant,Coffee Shop,Playground,Bubble Tea Shop,Donut Shop,Bus Station,Bar
42,Super Cue Cafe,94116,Super Cue Cafe94116,4,Chinese Restaurant,Pizza Place,Sushi Restaurant,Park,Sandwich Place,Café,Spa,Szechuan Restaurant,Martial Arts Dojo,Breakfast Spot


Geographically, these 3 are by themselves, far to the west and fairly isolated. San Francisco residents, however, would recognize them as tertiary concentrations of Chinese restaurants and shops, namely Balboa, Noriega, and Taraval. And indeed, the first thing you notice is the common venues around them: Chinese restaurants and/or Japanese restaurants. 

## Trend Summary (in no particular order)

* Be where Chinese shops and restaurants are (cluster 1)
* Be where Chinese restaurants are (cluster 5)
* Be where coffee shops are (cluster 3)
* Be where many eateries and tourists and workers are, namely, near downtown (cluster 2)
* Be alone and away from competitors, but still near some eateries (cluster 4)
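
The trends above were read off the per-cluster tables by eye. As a rough cross-check, the sketch below tallies the most frequent venue categories per cluster. It assumes a DataFrame shaped like `boba_merged`, with a `Cluster Labels` column and the ranked `... Most Common Venue` columns shown above; the three-row `sample` frame and the helper name `top_categories_by_cluster` are made up for illustration.

```python
import pandas as pd

def top_categories_by_cluster(df, label_col="Cluster Labels", n=3):
    """Return {cluster label: [(category, count), ...]} across all ranked venue columns."""
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")]
    summary = {}
    for label, group in df.groupby(label_col):
        # Flatten every ranked-venue column into one Series, then count categories.
        counts = pd.Series(group[venue_cols].to_numpy().ravel()).value_counts()
        summary[label] = list(counts.head(n).items())
    return summary

# Hypothetical sample, using a few categories taken from the tables above.
sample = pd.DataFrame({
    "name": ["OMG Tea", "Mr. T Cafe", "Purple Kow"],
    "Cluster Labels": [3, 3, 4],
    "1st Most Common Venue": ["Chinese Restaurant", "Mexican Restaurant", "Chinese Restaurant"],
    "2nd Most Common Venue": ["Vietnamese Restaurant", "Bakery", "Café"],
    "3rd Most Common Venue": ["Bubble Tea Shop", "Chinese Restaurant", "Japanese Restaurant"],
})

summary = top_categories_by_cluster(sample, n=1)
print(summary)
```

Counting across all ten ranked columns ignores the rank itself, so a category that is always 10th weighs the same as one that is always 1st; that is a simplification of this sketch, not something the analysis above relies on.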

### Limitations of the data extracted so far

Despite specifying a higher limit, the maximum number of results returned is 50. Indeed, the [Foursquare API documentation](https://developer.foursquare.com/docs/api/venues/search) states that the limit is capped at 50 even if you specify a higher number. 
    
One potential alternative approach is to divide San Francisco into 4 quadrants and use 4 bounding boxes instead of a center and radius, then merge the results into a single table, eliminating duplicates. I doubt we'll pick up that many more data points, and I doubt the additional effort will tell us a substantially different story, but it is worth exploring should we need to go into further detail.   
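
The quadrant idea can be sketched without touching the API. The code below only splits a bounding box into four and merges per-quadrant result lists by venue id. The San Francisco bounds are rough assumed values, the venue records are invented, and the helper names `split_into_quadrants` and `merge_unique` are hypothetical. How each box is passed to the search call should be checked against the Foursquare documentation.

```python
def split_into_quadrants(south, west, north, east):
    """Split one bounding box into four (south-west corner, north-east corner) boxes."""
    mid_lat = (south + north) / 2
    mid_lng = (west + east) / 2
    return [
        ((south, west), (mid_lat, mid_lng)),    # south-west quadrant
        ((south, mid_lng), (mid_lat, east)),    # south-east quadrant
        ((mid_lat, west), (north, mid_lng)),    # north-west quadrant
        ((mid_lat, mid_lng), (north, east)),    # north-east quadrant
    ]

def merge_unique(result_lists, key="id"):
    """Concatenate per-quadrant results, keeping the first copy of each venue id."""
    seen, merged = set(), []
    for results in result_lists:
        for venue in results:
            if venue[key] not in seen:
                seen.add(venue[key])
                merged.append(venue)
    return merged

# Rough, assumed bounding box for San Francisco (not taken from the notebook).
quadrants = split_into_quadrants(37.70, -122.52, 37.82, -122.35)

# Invented results from two quadrants; venue "b" sits on the shared edge.
fake_results = [
    [{"id": "a", "name": "Shop A"}, {"id": "b", "name": "Shop B"}],
    [{"id": "b", "name": "Shop B"}, {"id": "c", "name": "Shop C"}],
]
merged = merge_unique(fake_results)
print(len(quadrants), [v["id"] for v in merged])
```

If any single quadrant still returns exactly 50 venues, that quadrant has probably hit the cap as well and would need to be split again.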

We have also restricted ourselves to only one shop of each franchise in each zipcode, despite the possibility of multiple shops of the same franchise in the same zipcode. 
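
To see what this restriction costs, the sketch below rebuilds a name-plus-zipcode key and counts how many rows it would drop. It assumes `namezip` is simply the shop name concatenated with the postal code, as values such as `Boba Guys94107` in the tables suggest; the `venues` rows are invented for illustration.

```python
import pandas as pd

# Invented venue rows; column names follow the tables above.
venues = pd.DataFrame({
    "name": ["Boba Guys", "Boba Guys", "Boba Guys", "Plentea"],
    "location.postalCode": ["94108", "94108", "94102", "94108"],
})

# Build the name + zipcode key and keep the first row for each key.
venues["namezip"] = venues["name"] + venues["location.postalCode"]
deduped = venues.drop_duplicates(subset="namezip", keep="first")

# Rows lost per key: every occurrence beyond the first one.
dropped = venues["namezip"].value_counts() - 1
lost = dropped[dropped > 0].to_dict()

print(deduped["namezip"].tolist())
print(lost)
```

A non-empty `lost` dictionary flags franchises with more than one shop in a zipcode, which are exactly the shops this restriction hides.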



### Neighborhood Iteration 1

We will need a definition of what the neighborhoods in San Francisco are. However, as it turns out, there is no "definitive" list of neighborhoods in San Francisco, nor are there official boundaries for many of them. Realtors, city planners, etc. have different definitions. The [San Francisco Planning Department identifies 36 neighborhoods](https://data.sfgov.org/Geographic-Locations-and-Boundaries/Planning-Neighborhood-Groups-Map/iacs-ws63), but the boundaries are irregular and impossible to query with Foursquare. And as we have seen above, the "neighborhood" field in Foursquare was crowd-sourced and is often wrong or left blank. We are unable to rely on that field. 



### Neighborhood Iteration 2

We do have the zipcode, which is always available in the venue data returned by Foursquare.  

However, querying a radius around a zipcode is problematic, as some zipcode areas are smaller and some are bigger. 94124 is huge compared to 94108, for example, and the 94104/94105/94111 area has three zipcodes due to high business density. 

<img src="http://www.healthysf.org/bdi/outcomes/images/zip-map.jpg">

At best, we get statistics about some arbitrary "center" of the zipcode area, which would be almost useless. Furthermore, Foursquare queries are based on radius and thus, a circular area, while neighborhoods have irregular boundaries. 
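
To make the "circular area" point concrete, here is a minimal sketch of what a center-plus-radius query effectively selects: everything within a great-circle distance of one point, regardless of any zipcode or neighborhood boundary. The center, the two venues, and the 500 m radius are illustrative values, not data from this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def within_radius(center, points, radius_m):
    """Names of the points that fall inside the circle around center."""
    return [
        name for name, lat, lng in points
        if haversine_m(center[0], center[1], lat, lng) <= radius_m
    ]

# Illustrative center and venues (assumed coordinates).
center = (37.7900, -122.4070)
points = [
    ("near", 37.7910, -122.4060),   # roughly 140 m from the center
    ("far", 37.7300, -122.3900),    # several kilometres away
]
inside = within_radius(center, points, radius_m=500)
print(inside)
```

A radius large enough to cover a big zipcode such as 94124 would swallow several small downtown zipcodes whole, which is why a per-zipcode circle gives misleading statistics.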


### Neighborhood Iteration 3

We will then train a model using those "boba neighborhood" profiles, then use KNN = 3, which hopefully gets us good / bad / ugly neighborhood ratings. 

We will check each cluster to make sure it makes sense. 

Finally, we will analyze each cluster and determine how suitable it is for a new boba tea place.
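
The plan above does not say which algorithm "KNN = 3" refers to. Since the next step talks about checking clusters, one reading is k-means with three clusters, and the sketch below assumes that reading. The `profiles` matrix is invented: each row stands for one candidate area and each column for the share of one venue category, which is an assumed feature layout rather than the notebook's actual one.

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented venue-category profiles, one row per candidate area.
# Assumed columns: share of Chinese restaurants, coffee shops, bars.
profiles = np.array([
    [0.60, 0.10, 0.05],
    [0.55, 0.15, 0.05],
    [0.10, 0.50, 0.10],
    [0.05, 0.55, 0.15],
    [0.05, 0.10, 0.60],
    [0.10, 0.05, 0.55],
])

# k = 3, matching the good / bad / ugly split described above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(profiles)
labels = kmeans.labels_

# Inspect each cluster's mean profile before attaching any rating to it.
for k in range(3):
    print(k, profiles[labels == k].mean(axis=0).round(2))
```

The cluster numbers themselves carry no meaning, so which cluster counts as "good" has to be decided after inspecting the profiles, for example by comparing them with the areas that already have boba shops.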

