## 1.	Introduction

An L.A. based juice company named GanicJuice (GJ for short) specializes in smoothies made with organic fruits.  It targets customers who are trendy, upper-middle class, relatively young (35 of age or younger), and health conscious.  Since its start three years ago, it has established a strong customer base in L.A and a few other west coast cities.  The company executives are now planning on launching the first GJ branch on the east coast.  The company executives believe that Brooklyn will be a good location since the borough is close to Manhattan, which is considered to be a lucrative market, but without the risk associated with high rent.  

The next important step is deciding which Brooklyn neighborhood should be chosen for the first east coast branch.  Previous experience and marketing research indicate that GJ’s most loyal customers like to visit bubble tea joints, eat out at Japanese restaurants, and participate in sports in a club or gym setting.  Thus, a neighborhood with a combination of at least 15 of these retail sites will be preferred. 


## 2. Data


Main data required:
1.	Geospatial data for all New York City’s (especially Brooklyn’s) neighborhood are needed to identify neighborhoods. Specifically, name of the neighborhoods, their latitude, and longitude.  For example: Neighborhood: Bay Ridge, Latitude: 40.625801, Longitude: -74.030621.
2.	Geospatial data for all retail sites that are tagged “Japanese”, “bubble team”, and “gym” in Brooklyn are needed to categorize each of the Brooklyn neighborhoods. Specifically, name of the venue, venue category, venue latitude, and venue longitude.  For example: Venue: Inaka, Venue Latitude: 40.625141, Venue Longitude: -74.030418, Venue Category: Sushi Restaurant.

Since the GJ executive team express the desire to minimize cost, the data for the project will be sourced from Foursquare.com, as the website offers free data with relatively little limitation.  Using available retail data via the website’s API, retail sites that are tagged “bubble tea”, “Japanese”, and “gym” will be extracted for each of Brooklyn’s neighborhoods.  Afterwards, the neighborhoods will be divided into a few categories based on the distribution and number of the aforementioned sites using k-mean clustering.  The purpose of doing so is to group similar neighbourhoods together to help find the best location for the new branch.  K-mean clustering is selected over other clustering methods because it’s the easiest to understand and implement.  Furthermore, the number of clustering will be selected based on trial-and-error. Once the categories of the neighborhoods are identified, they will be presented to the GJ executive team so that it can decide the final location for the new GJ site.


## 3. Methodology  

### This section describes the steps taken to gather, clean, and analyze the gathered data

Import all the required libraries and software

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    certifi-2020.4.5.1         |   py36h9f0ad1d_0         151 KB  conda-forge
    ca-certificates-2020.4.5.1 |       hecc5488_0         146 KB  conda-forge
    geopy-1.21.0               |             py_0          58 KB  conda-forge
    openssl-1.1.1g             |       h516909a_0         2.1 MB  conda-forge
    geographiclib-1.50         |             py_0          34 KB  conda-forge
    ------------------------------------------------------------
                                           Total:         2.5 MB

The following NEW packages will be INSTALLED:

    geographiclib:   1.50-py_0         conda-forge
    geopy:           1

<a id='item1'></a>

#### 1. Download and Explore Dataset

The purpose of this section is to download and cleans geospecial data for all New York city's neighborhoods from: https://geo.nyu.edu/catalog/nyu_2451_34572

Download the data

In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

Data downloaded!


Let's take a quick look at the data.

In [3]:
newyork_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

Define a new variable that capture the relevenat data from the data file

In [4]:
neighborhoods_data = newyork_data['features']

Double checking captured data 

In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

The next task is to transform these data of nested Python dictionaries into a *pandas* dataframe. 

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Take a look at the empty dataframe to confirm that the columns are as intended.

In [7]:
neighborhoods

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude


Then let's loop through the data and fill the dataframe one row at a time.

In [8]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

Quickly examine the resulting dataframe.

In [9]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Checking the dataframe shape

In [10]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


Use geopy library to get the latitude and longitude values of New York City.

In [11]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of New York City are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of New York City are 40.7127281, -74.0060152.


Create a map of New York with neighborhoods superimposed on top.

In [12]:
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

To simplify the above map, let's segment and cluster only the neighborhoods in Brooklyn. So, slice the original dataframe and create a new dataframe of the Brooklyn data.

In [13]:
brooklyn_data = neighborhoods[neighborhoods['Borough'] == 'Brooklyn'].reset_index(drop=True)
brooklyn_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Brooklyn,Bay Ridge,40.625801,-74.030621
1,Brooklyn,Bensonhurst,40.611009,-73.99518
2,Brooklyn,Sunset Park,40.645103,-74.010316
3,Brooklyn,Greenpoint,40.730201,-73.954241
4,Brooklyn,Gravesend,40.59526,-73.973471
5,Brooklyn,Brighton Beach,40.576825,-73.965094
6,Brooklyn,Sheepshead Bay,40.58689,-73.943186
7,Brooklyn,Manhattan Terrace,40.614433,-73.957438
8,Brooklyn,Flatbush,40.636326,-73.958401
9,Brooklyn,Crown Heights,40.670829,-73.943291


Let's get the geographical coordinates of Brooklyn.

In [14]:
address = 'Brooklyn, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Brooklyn are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Brooklyn are 40.6501038, -73.9495823.


Visualizat Brooklyn and the neighborhoods in it.

In [15]:
# create map of Manhattan using latitude and longitude values
map_brooklyn = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(brooklyn_data['Latitude'], brooklyn_data['Longitude'], brooklyn_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_brooklyn)  
    
map_brooklyn

Start utilizing the Foursquare API to explore the neighborhoods and segment them.

Define Foursquare Credentials and Version

In [16]:
CLIENT_ID = '5UL5E5JYIK2Z5TUUXBZFALZBPCLHP1IBH3O0MYT0GWBNHV1B' # your Foursquare ID
CLIENT_SECRET = 'HK1PSU0I3V3W5JNKH1RDDWXVRWOX4ZFOODUINXGRHYKRX3XP' # your Foursquare Secret

VERSION = '20191101' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 5UL5E5JYIK2Z5TUUXBZFALZBPCLHP1IBH3O0MYT0GWBNHV1B
CLIENT_SECRET:HK1PSU0I3V3W5JNKH1RDDWXVRWOX4ZFOODUINXGRHYKRX3XP


Check the first neighborhood in our dataframe.
Get the neighborhood's name.

In [17]:
brooklyn_data.loc[0, 'Neighborhood']

'Bay Ridge'

Get the neighborhood's latitude and longitude values.

In [18]:
neighborhood_latitude = brooklyn_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = brooklyn_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = brooklyn_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Bay Ridge are 40.625801065010656, -74.03062069353813.


Create the GET request URL. 

In [19]:

LIMIT = 200 # limit of number of venues returned by Foursquare API
radius = 500 # define radius

# create URL
url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId=52e81612bcbc57f1066b7a0c,4bf58dd8d48988d111941735,4bf58dd8d48988d175941735&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
url # display URL

results = requests.get(url).json()
results



{'meta': {'code': 200, 'requestId': '5eafea82be61c900224cdea7'},
 'response': {'venues': [{'id': '519432c7498eae4af99449fd',
    'name': 'Inaka',
    'location': {'lat': 40.625140559775616,
     'lng': -74.03041782134507,
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.625140559775616,
       'lng': -74.03041782134507}],
     'distance': 75,
     'postalCode': '11209',
     'cc': 'US',
     'city': 'Brooklyn',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['Brooklyn, NY 11209', 'United States']},
    'categories': [{'id': '4bf58dd8d48988d1d2941735',
      'name': 'Sushi Restaurant',
      'pluralName': 'Sushi Restaurants',
      'shortName': 'Sushi',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/sushi_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1588587133',
    'hasPerk': False},
   {'id': '4b32d9d5f964a520071525e3',
    'name': 'Bikram Yoga',
    'location': {'address': '8302 5th Ave',

Borrow the **get_category_type** function from the Foursquare lab for data extraction purposes

In [20]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Clean the json and structure it into a *pandas* dataframe.

In [21]:
venues = results['response']['venues']
#venues = results['response']['groups'][0]['items']

nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['name', 'categories', 'location.lat', 'location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Inaka,Sushi Restaurant,40.625141,-74.030418
1,Bikram Yoga,Yoga Studio,40.623657,-74.025093
2,New York Sports Clubs,Gym / Fitness Center,40.622364,-74.027163
3,Workout @ Crowne Plaza,Gym,40.623512,-74.02772
4,Harbor Fitness GO,Gym / Fitness Center,40.621607,-74.02847
5,Sapporro,Japanese Restaurant,40.628985,-74.029006
6,Vivi Bubble Tea,Bubble Tea Shop,40.622092,-74.026058
7,Sakana Sushi & Asian Bistro,Sushi Restaurant,40.623623,-74.02491
8,New York City Tae Kwon Do,Martial Arts Dojo,40.628218,-74.028832
9,HIT Factory,Gym / Fitness Center,40.6283,-74.029205


<a id='item2'></a>

#### 2. Explore Neighborhoods in Brooklyn

Create a function to repeat the same process to all the neighborhoods in Manhattan

In [22]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&categoryId=52e81612bcbc57f1066b7a0c,4bf58dd8d48988d111941735,4bf58dd8d48988d175941735&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['venues']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['name'], 
            v['location']['lat'], 
            v['location']['lng'],  
            v['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

    return(nearby_venues)

In [23]:
results

{'meta': {'code': 200, 'requestId': '5eafea82be61c900224cdea7'},
 'response': {'venues': [{'id': '519432c7498eae4af99449fd',
    'name': 'Inaka',
    'location': {'lat': 40.625140559775616,
     'lng': -74.03041782134507,
     'labeledLatLngs': [{'label': 'display',
       'lat': 40.625140559775616,
       'lng': -74.03041782134507}],
     'distance': 75,
     'postalCode': '11209',
     'cc': 'US',
     'city': 'Brooklyn',
     'state': 'NY',
     'country': 'United States',
     'formattedAddress': ['Brooklyn, NY 11209', 'United States']},
    'categories': [{'id': '4bf58dd8d48988d1d2941735',
      'name': 'Sushi Restaurant',
      'pluralName': 'Sushi Restaurants',
      'shortName': 'Sushi',
      'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/food/sushi_',
       'suffix': '.png'},
      'primary': True}],
    'referralId': 'v-1588587133',
    'hasPerk': False},
   {'id': '4b32d9d5f964a520071525e3',
    'name': 'Bikram Yoga',
    'location': {'address': '8302 5th Ave',

#### Write the code to run the above function on each neighborhood and create a new dataframe called *brooklyn_venues*.

In [24]:
print(brooklyn_data.shape)
brooklyn_data.head(5)


(70, 4)


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Brooklyn,Bay Ridge,40.625801,-74.030621
1,Brooklyn,Bensonhurst,40.611009,-73.99518
2,Brooklyn,Sunset Park,40.645103,-74.010316
3,Brooklyn,Greenpoint,40.730201,-73.954241
4,Brooklyn,Gravesend,40.59526,-73.973471


In [25]:


brooklyn_venues = getNearbyVenues(names=brooklyn_data['Neighborhood'],
                                   latitudes=brooklyn_data['Latitude'],
                                   longitudes=brooklyn_data['Longitude']
                                  )

Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker Heights
Gerritsen Beach
Marine Park
Clinton Hill
Sea Gate
Downtown
Boerum Hill
Prospect Lefferts Gardens
Ocean Hill
City Line
Bergen Beach
Midwood
Prospect Park South
Georgetown
East Williamsburg
North Side
South Side
Ocean Parkway
Fort Hamilton
Ditmas Park
Wingate
Rugby
Remsen Village
New Lots
Paerdegat Basin
Mill Basin
Fulton Ferry
Vinegar Hill
Weeksville
Broadway Junction
Dumbo
Homecrest
Highland Park
Madison
Erasmus


Let's check the size of the resulting dataframe

In [26]:
print(brooklyn_venues.shape)
brooklyn_venues.head()

(969, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bay Ridge,40.625801,-74.030621,Inaka,40.625141,-74.030418,Sushi Restaurant
1,Bay Ridge,40.625801,-74.030621,Bikram Yoga,40.623657,-74.025093,Yoga Studio
2,Bay Ridge,40.625801,-74.030621,New York Sports Clubs,40.622364,-74.027163,Gym / Fitness Center
3,Bay Ridge,40.625801,-74.030621,Workout @ Crowne Plaza,40.623512,-74.02772,Gym
4,Bay Ridge,40.625801,-74.030621,Harbor Fitness GO,40.621607,-74.02847,Gym / Fitness Center


Let's check how many venues were returned for each neighborhood, sorted by number of venues.

In [27]:
brooklyn_venues.groupby('Neighborhood').count()
brooklyn_vcount=brooklyn_venues.groupby('Neighborhood').count()
brooklyn_vcount.sort_values(by='Venue', axis=0, ascending=False, inplace=True, kind='quicksort', na_position='last')

brooklyn_vcount

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
South Side,50,50,50,50,50,50
North Side,50,50,50,50,50,50
Dumbo,49,49,49,49,49,49
Downtown,49,49,49,49,49,49
Boerum Hill,48,48,48,48,48,48
Brooklyn Heights,42,42,42,42,42,42
Park Slope,41,41,41,41,41,41
Prospect Heights,40,40,40,40,40,40
Fort Greene,37,37,37,37,37,37
Greenpoint,35,35,35,35,35,35


Let's find out how many unique categories can be curated from all the returned venues

In [28]:
print('There are {} uniques categories.'.format(len(brooklyn_venues['Venue Category'].unique())))

There are 42 uniques categories.


<a id='item3'></a>

#### 3. Analyze Each Neighborhood

In [29]:
brooklyn_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bay Ridge,40.625801,-74.030621,Inaka,40.625141,-74.030418,Sushi Restaurant
1,Bay Ridge,40.625801,-74.030621,Bikram Yoga,40.623657,-74.025093,Yoga Studio
2,Bay Ridge,40.625801,-74.030621,New York Sports Clubs,40.622364,-74.027163,Gym / Fitness Center
3,Bay Ridge,40.625801,-74.030621,Workout @ Crowne Plaza,40.623512,-74.02772,Gym
4,Bay Ridge,40.625801,-74.030621,Harbor Fitness GO,40.621607,-74.02847,Gym / Fitness Center
5,Bay Ridge,40.625801,-74.030621,Vivi Bubble Tea,40.622092,-74.026058,Bubble Tea Shop
6,Bay Ridge,40.625801,-74.030621,Sapporro,40.628985,-74.029006,Japanese Restaurant
7,Bay Ridge,40.625801,-74.030621,Sakana Sushi & Asian Bistro,40.623623,-74.02491,Sushi Restaurant
8,Bay Ridge,40.625801,-74.030621,New York City Tae Kwon Do,40.628218,-74.028832,Martial Arts Dojo
9,Bay Ridge,40.625801,-74.030621,Dolphin Fitness,40.621754,-74.028562,Gym / Fitness Center


In [30]:
# one hot encoding
brooklyn_onehot = pd.get_dummies(brooklyn_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
brooklyn_onehot['Neighborhood'] = brooklyn_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [brooklyn_onehot.columns[-1]] + list(brooklyn_onehot.columns[:-1])
brooklyn_onehot = brooklyn_onehot[fixed_columns]

brooklyn_onehot.head()

Unnamed: 0,Neighborhood,Asian Restaurant,Athletics & Sports,Baseball Field,Boxing Gym,Bubble Tea Shop,Building,Café,Chinese Restaurant,Chiropractor,Climbing Gym,Club House,College Gym,College Rec Center,Cycle Studio,Dance Studio,Food Stand,Football Stadium,Frozen Yogurt Shop,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Japanese Restaurant,Juice Bar,Kosher Restaurant,Martial Arts Dojo,Massage Studio,Non-Profit,Office,Pilates Studio,Poke Place,Ramen Restaurant,Residential Building (Apartment / Condo),Shabu-Shabu Restaurant,Snack Place,Spiritual Center,Sushi Restaurant,Track,Udon Restaurant,Vietnamese Restaurant,Weight Loss Center,Yoga Studio
0,Bay Ridge,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
1,Bay Ridge,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,Bay Ridge,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Bay Ridge,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Bay Ridge,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


And let's examine the new dataframe size.

In [31]:
brooklyn_onehot.shape

(969, 43)

Next, let's group rows by neighborhood and by taking the mean of the frequency of occurrence of each category

In [32]:
brooklyn_grouped = brooklyn_onehot.groupby('Neighborhood').mean().reset_index()
brooklyn_grouped.head(25)

Unnamed: 0,Neighborhood,Asian Restaurant,Athletics & Sports,Baseball Field,Boxing Gym,Bubble Tea Shop,Building,Café,Chinese Restaurant,Chiropractor,Climbing Gym,Club House,College Gym,College Rec Center,Cycle Studio,Dance Studio,Food Stand,Football Stadium,Frozen Yogurt Shop,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Japanese Restaurant,Juice Bar,Kosher Restaurant,Martial Arts Dojo,Massage Studio,Non-Profit,Office,Pilates Studio,Poke Place,Ramen Restaurant,Residential Building (Apartment / Condo),Shabu-Shabu Restaurant,Snack Place,Spiritual Center,Sushi Restaurant,Track,Udon Restaurant,Vietnamese Restaurant,Weight Loss Center,Yoga Studio
0,Bath Beach,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.111111,0.111111,0.0,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.055556,0.111111
1,Bay Ridge,0.041667,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.291667,0.0,0.0,0.25,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.083333
2,Bedford Stuyvesant,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.6,0.2,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Bensonhurst,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.222222,0.0,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111111,0.0,0.0,0.222222,0.0,0.0,0.0,0.0,0.111111
4,Bergen Beach,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Boerum Hill,0.0,0.0,0.0,0.020833,0.041667,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145833,0.3125,0.020833,0.0,0.0625,0.0,0.0,0.166667,0.0,0.0,0.0,0.020833,0.0,0.020833,0.0,0.0,0.0,0.0,0.020833,0.020833,0.0,0.0,0.0,0.125
6,Borough Park,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0
7,Brighton Beach,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.384615,0.153846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.307692,0.0,0.0,0.0,0.0,0.076923
8,Broadway Junction,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Brooklyn Heights,0.02381,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02381,0.0,0.0,0.0,0.0,0.166667,0.142857,0.0,0.0,0.047619,0.0,0.0,0.0,0.02381,0.0,0.0,0.095238,0.02381,0.02381,0.0,0.0,0.0,0.0,0.119048,0.047619,0.0,0.02381,0.0,0.238095


The table above shows that only a minority of the neighbourhoods have bubble tea shops.  Previous marketing researches by GJ reveal that GJ’s clientele is most strongly correlated with spots that mainly sells bubble tea, therefore, establishments other than “bubble tea shops” should be excluded from the analysis.   Thus, let's filter out the other venues. 

In [33]:
brooklyn_grouped=brooklyn_grouped.loc[(brooklyn_grouped['Bubble Tea Shop'] > 0)]

Let's confirm the new size

In [34]:
print(brooklyn_grouped.shape)
brooklyn_grouped

(14, 43)


Unnamed: 0,Neighborhood,Asian Restaurant,Athletics & Sports,Baseball Field,Boxing Gym,Bubble Tea Shop,Building,Café,Chinese Restaurant,Chiropractor,Climbing Gym,Club House,College Gym,College Rec Center,Cycle Studio,Dance Studio,Food Stand,Football Stadium,Frozen Yogurt Shop,Gym,Gym / Fitness Center,Gym Pool,Gymnastics Gym,Japanese Restaurant,Juice Bar,Kosher Restaurant,Martial Arts Dojo,Massage Studio,Non-Profit,Office,Pilates Studio,Poke Place,Ramen Restaurant,Residential Building (Apartment / Condo),Shabu-Shabu Restaurant,Snack Place,Spiritual Center,Sushi Restaurant,Track,Udon Restaurant,Vietnamese Restaurant,Weight Loss Center,Yoga Studio
0,Bath Beach,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.055556,0.111111,0.111111,0.0,0.0,0.111111,0.0,0.0,0.111111,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.055556,0.111111
1,Bay Ridge,0.041667,0.0,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.291667,0.0,0.0,0.25,0.0,0.0,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.083333
5,Boerum Hill,0.0,0.0,0.0,0.020833,0.041667,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.145833,0.3125,0.020833,0.0,0.0625,0.0,0.0,0.166667,0.0,0.0,0.0,0.020833,0.0,0.020833,0.0,0.0,0.0,0.0,0.020833,0.020833,0.0,0.0,0.0,0.125
13,Clinton Hill,0.0,0.0,0.0,0.032258,0.064516,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.193548,0.16129,0.0,0.0,0.16129,0.0,0.0,0.096774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.096774,0.0,0.0,0.0,0.0,0.193548
19,Downtown,0.0,0.0,0.0,0.020408,0.040816,0.0,0.0,0.0,0.0,0.0,0.0,0.020408,0.020408,0.0,0.0,0.0,0.0,0.0,0.204082,0.428571,0.020408,0.0,0.061224,0.020408,0.0,0.040816,0.0,0.0,0.0,0.020408,0.0,0.020408,0.020408,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.040816
20,Dumbo,0.0,0.0,0.0,0.040816,0.040816,0.0,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.020408,0.0,0.020408,0.0,0.0,0.346939,0.142857,0.020408,0.020408,0.061224,0.0,0.0,0.020408,0.0,0.0,0.0,0.020408,0.0,0.040816,0.0,0.0,0.0,0.0,0.040816,0.0,0.0,0.0,0.0,0.142857
23,East Williamsburg,0.0,0.0,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.222222,0.185185,0.0,0.037037,0.185185,0.0,0.0,0.111111,0.0,0.0,0.0,0.074074,0.0,0.037037,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.074074
34,Greenpoint,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.257143,0.028571,0.028571,0.028571,0.0,0.0,0.028571,0.0,0.0,0.0,0.057143,0.0,0.057143,0.0,0.0,0.0,0.0,0.171429,0.0,0.0,0.0,0.0,0.171429
36,Homecrest,0.0,0.0,0.0,0.0,0.272727,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.090909,0.181818,0.0,0.0,0.0,0.0,0.0,0.090909,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.272727,0.0,0.0,0.0,0.0,0.090909
48,Park Slope,0.0,0.0,0.0,0.0,0.02439,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.170732,0.170732,0.0,0.0,0.170732,0.0,0.0,0.02439,0.0,0.0,0.0,0.02439,0.0,0.04878,0.0,0.0,0.0,0.0,0.146341,0.04878,0.0,0.02439,0.0,0.146341


Only 14 neighborhoods remain. Let's print each neighborhood along with the top 5 most common venues

In [35]:
num_top_venues = 5

for hood in brooklyn_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = brooklyn_grouped[brooklyn_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bath Beach----
                  venue  freq
0      Sushi Restaurant  0.17
1       Bubble Tea Shop  0.17
2           Yoga Studio  0.11
3     Martial Arts Dojo  0.11
4  Gym / Fitness Center  0.11


----Bay Ridge----
                  venue  freq
0  Gym / Fitness Center  0.29
1   Japanese Restaurant  0.25
2                   Gym  0.12
3      Sushi Restaurant  0.12
4           Yoga Studio  0.08


----Boerum Hill----
                  venue  freq
0  Gym / Fitness Center  0.31
1     Martial Arts Dojo  0.17
2                   Gym  0.15
3           Yoga Studio  0.12
4   Japanese Restaurant  0.06


----Clinton Hill----
                  venue  freq
0           Yoga Studio  0.19
1                   Gym  0.19
2  Gym / Fitness Center  0.16
3   Japanese Restaurant  0.16
4     Martial Arts Dojo  0.10


----Downtown----
                  venue  freq
0  Gym / Fitness Center  0.43
1                   Gym  0.20
2   Japanese Restaurant  0.06
3           Yoga Studio  0.04
4     Martial Arts Dojo  0.

Let's put that into a *pandas* dataframe

First, let's write a function to sort the venues in descending order.

In [36]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Now let's create the new dataframe and display the top 10 venues for each neighborhood.

In [37]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
neighborhoods_venues_sorted=[]

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = brooklyn_grouped['Neighborhood']

for ind in np.arange(brooklyn_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(brooklyn_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bath Beach,Bubble Tea Shop,Sushi Restaurant,Yoga Studio,Gym / Fitness Center,Japanese Restaurant,Martial Arts Dojo,Gym,Weight Loss Center,Frozen Yogurt Shop,Building
1,Bay Ridge,Gym / Fitness Center,Japanese Restaurant,Gym,Sushi Restaurant,Yoga Studio,Martial Arts Dojo,Bubble Tea Shop,Asian Restaurant,Shabu-Shabu Restaurant,Chiropractor
5,Boerum Hill,Gym / Fitness Center,Martial Arts Dojo,Gym,Yoga Studio,Japanese Restaurant,Bubble Tea Shop,Ramen Restaurant,Boxing Gym,Club House,Pilates Studio
13,Clinton Hill,Yoga Studio,Gym,Japanese Restaurant,Gym / Fitness Center,Martial Arts Dojo,Sushi Restaurant,Bubble Tea Shop,Boxing Gym,Cycle Studio,Frozen Yogurt Shop
19,Downtown,Gym / Fitness Center,Gym,Japanese Restaurant,Yoga Studio,Bubble Tea Shop,Martial Arts Dojo,Pilates Studio,Boxing Gym,College Gym,College Rec Center
20,Dumbo,Gym,Yoga Studio,Gym / Fitness Center,Japanese Restaurant,Boxing Gym,Bubble Tea Shop,Sushi Restaurant,Ramen Restaurant,Martial Arts Dojo,Climbing Gym
23,East Williamsburg,Gym,Gym / Fitness Center,Japanese Restaurant,Martial Arts Dojo,Bubble Tea Shop,Yoga Studio,Pilates Studio,Ramen Restaurant,Gymnastics Gym,Boxing Gym
34,Greenpoint,Gym / Fitness Center,Yoga Studio,Sushi Restaurant,Gym,Ramen Restaurant,Pilates Studio,Bubble Tea Shop,Gymnastics Gym,Japanese Restaurant,Martial Arts Dojo
36,Homecrest,Sushi Restaurant,Bubble Tea Shop,Gym / Fitness Center,Yoga Studio,Martial Arts Dojo,Gym,Boxing Gym,Building,Café,Chinese Restaurant
48,Park Slope,Gym,Gym / Fitness Center,Japanese Restaurant,Yoga Studio,Sushi Restaurant,Ramen Restaurant,Track,Martial Arts Dojo,Pilates Studio,Bubble Tea Shop


<a id='item4'></a>

#### 4. Cluster Neighborhoods

After trial-and-error, dividing the neighborhoods into three clusters seem to work best.  Thus, run *k*-means, where k =3, to cluster the neighborhoods into 3 categories here.

In [38]:
# set number of clusters
kclusters = 3

brooklyn_grouped_clustering = brooklyn_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(brooklyn_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 1, 1, 1, 1, 2, 1, 1, 0, 1], dtype=int32)

Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.

In [39]:
# add clustering labels
neighborhoods_venues_sorted.insert(1, 'Cluster Labels', kmeans.labels_)

brooklyn_merged = neighborhoods_venues_sorted  #brooklyn_data

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
brooklyn_merged = brooklyn_merged.join(brooklyn_data.set_index('Neighborhood'), on='Neighborhood')

brooklyn_merged=brooklyn_merged[['Borough','Neighborhood','Latitude','Longitude','Cluster Labels','1st Most Common Venue','2nd Most Common Venue','3rd Most Common Venue','4th Most Common Venue','5th Most Common Venue','6th Most Common Venue','7th Most Common Venue','8th Most Common Venue','9th Most Common Venue','10th Most Common Venue']]
#brooklyn_merged = [col] 
brooklyn_merged # check the last columns!


Unnamed: 0,Borough,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Brooklyn,Bath Beach,40.599519,-73.998752,0,Bubble Tea Shop,Sushi Restaurant,Yoga Studio,Gym / Fitness Center,Japanese Restaurant,Martial Arts Dojo,Gym,Weight Loss Center,Frozen Yogurt Shop,Building
1,Brooklyn,Bay Ridge,40.625801,-74.030621,1,Gym / Fitness Center,Japanese Restaurant,Gym,Sushi Restaurant,Yoga Studio,Martial Arts Dojo,Bubble Tea Shop,Asian Restaurant,Shabu-Shabu Restaurant,Chiropractor
5,Brooklyn,Boerum Hill,40.685683,-73.983748,1,Gym / Fitness Center,Martial Arts Dojo,Gym,Yoga Studio,Japanese Restaurant,Bubble Tea Shop,Ramen Restaurant,Boxing Gym,Club House,Pilates Studio
13,Brooklyn,Clinton Hill,40.693229,-73.967843,1,Yoga Studio,Gym,Japanese Restaurant,Gym / Fitness Center,Martial Arts Dojo,Sushi Restaurant,Bubble Tea Shop,Boxing Gym,Cycle Studio,Frozen Yogurt Shop
19,Brooklyn,Downtown,40.690844,-73.983463,1,Gym / Fitness Center,Gym,Japanese Restaurant,Yoga Studio,Bubble Tea Shop,Martial Arts Dojo,Pilates Studio,Boxing Gym,College Gym,College Rec Center
20,Brooklyn,Dumbo,40.703176,-73.988753,2,Gym,Yoga Studio,Gym / Fitness Center,Japanese Restaurant,Boxing Gym,Bubble Tea Shop,Sushi Restaurant,Ramen Restaurant,Martial Arts Dojo,Climbing Gym
23,Brooklyn,East Williamsburg,40.708492,-73.938858,1,Gym,Gym / Fitness Center,Japanese Restaurant,Martial Arts Dojo,Bubble Tea Shop,Yoga Studio,Pilates Studio,Ramen Restaurant,Gymnastics Gym,Boxing Gym
34,Brooklyn,Greenpoint,40.730201,-73.954241,1,Gym / Fitness Center,Yoga Studio,Sushi Restaurant,Gym,Ramen Restaurant,Pilates Studio,Bubble Tea Shop,Gymnastics Gym,Japanese Restaurant,Martial Arts Dojo
36,Brooklyn,Homecrest,40.598525,-73.959185,0,Sushi Restaurant,Bubble Tea Shop,Gym / Fitness Center,Yoga Studio,Martial Arts Dojo,Gym,Boxing Gym,Building,Café,Chinese Restaurant
48,Brooklyn,Park Slope,40.672321,-73.97705,1,Gym,Gym / Fitness Center,Japanese Restaurant,Yoga Studio,Sushi Restaurant,Ramen Restaurant,Track,Martial Arts Dojo,Pilates Studio,Bubble Tea Shop


Finally, let's visualize the resulting clusters

In [40]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(brooklyn_merged['Latitude'], brooklyn_merged['Longitude'], brooklyn_merged['Neighborhood'], brooklyn_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    cluster = int(cluster)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster+0],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<a id='item5'></a>

#### 5. Examine Clusters

.

Cluster 0

In [41]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 0, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]


Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bath Beach,Bubble Tea Shop,Sushi Restaurant,Yoga Studio,Gym / Fitness Center,Japanese Restaurant,Martial Arts Dojo,Gym,Weight Loss Center,Frozen Yogurt Shop,Building
36,Homecrest,Sushi Restaurant,Bubble Tea Shop,Gym / Fitness Center,Yoga Studio,Martial Arts Dojo,Gym,Boxing Gym,Building,Café,Chinese Restaurant


Cluster 1

In [42]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 1, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Bay Ridge,Gym / Fitness Center,Japanese Restaurant,Gym,Sushi Restaurant,Yoga Studio,Martial Arts Dojo,Bubble Tea Shop,Asian Restaurant,Shabu-Shabu Restaurant,Chiropractor
5,Boerum Hill,Gym / Fitness Center,Martial Arts Dojo,Gym,Yoga Studio,Japanese Restaurant,Bubble Tea Shop,Ramen Restaurant,Boxing Gym,Club House,Pilates Studio
13,Clinton Hill,Yoga Studio,Gym,Japanese Restaurant,Gym / Fitness Center,Martial Arts Dojo,Sushi Restaurant,Bubble Tea Shop,Boxing Gym,Cycle Studio,Frozen Yogurt Shop
19,Downtown,Gym / Fitness Center,Gym,Japanese Restaurant,Yoga Studio,Bubble Tea Shop,Martial Arts Dojo,Pilates Studio,Boxing Gym,College Gym,College Rec Center
23,East Williamsburg,Gym,Gym / Fitness Center,Japanese Restaurant,Martial Arts Dojo,Bubble Tea Shop,Yoga Studio,Pilates Studio,Ramen Restaurant,Gymnastics Gym,Boxing Gym
34,Greenpoint,Gym / Fitness Center,Yoga Studio,Sushi Restaurant,Gym,Ramen Restaurant,Pilates Studio,Bubble Tea Shop,Gymnastics Gym,Japanese Restaurant,Martial Arts Dojo
48,Park Slope,Gym,Gym / Fitness Center,Japanese Restaurant,Yoga Studio,Sushi Restaurant,Ramen Restaurant,Track,Martial Arts Dojo,Pilates Studio,Bubble Tea Shop
57,South Side,Gym / Fitness Center,Gym,Yoga Studio,Japanese Restaurant,Sushi Restaurant,Ramen Restaurant,Pilates Studio,Boxing Gym,Bubble Tea Shop,Cycle Studio
62,Williamsburg,Yoga Studio,Gym,Gym / Fitness Center,Bubble Tea Shop,Japanese Restaurant,Ramen Restaurant,Pilates Studio,Baseball Field,College Rec Center,Frozen Yogurt Shop


Cluster 2

In [43]:
brooklyn_merged.loc[brooklyn_merged['Cluster Labels'] == 2, brooklyn_merged.columns[[1] + list(range(5, brooklyn_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,Dumbo,Gym,Yoga Studio,Gym / Fitness Center,Japanese Restaurant,Boxing Gym,Bubble Tea Shop,Sushi Restaurant,Ramen Restaurant,Martial Arts Dojo,Climbing Gym
59,Sunset Park,Gym,Bubble Tea Shop,Yoga Studio,Japanese Restaurant,Baseball Field,Boxing Gym,Frozen Yogurt Shop,Football Stadium,Food Stand,Dance Studio
60,Vinegar Hill,Gym,Yoga Studio,Bubble Tea Shop,Martial Arts Dojo,Pilates Studio,Gym / Fitness Center,Gym Pool,Sushi Restaurant,Football Stadium,Dance Studio


## 4. Results  

An examination of the result and the neighborhood clusters reveals that:
1.	Most of the Brooklyn neighborhoods are crowded with gyms of different sorts and Japanese restaurants.  However, the distribution of bubble tea joints is more concentrated and limited to fewer neighborhoods.  
2.	Establishments tagged with “bubble tea” includes other related eateries, such as “Frozen Yogurt Shops”, “Juice Bar’, and “Café”.  Since previous marketing researches by GJ reveal that GJ’s clientele is most strongly correlated with spots that mainly sells bubble tea, establishments other than “bubble tea shops” are excluded from the analysis.   
3.	Establishments tagged with “Japanese” includes other related eateries, such as “Asian”, “Chinese”, and “Food Stand’, likely due to Japanese fusion restaurants in different formats.  Further, the term includes different styles of Japanese food, such as sushi and ramen as well. Feedback from GJ’s marketing department indicates that the variations are fine for the current objective as they are also observed in some of the west coast markets as well.
4.	Establishments tagged with “gym” includes other related businesses, such as dancing studios, yoga studios, and martial art dojos. Feedback from GJ’s marketing department indicates that the variations are fine for the current objective as they are also observed in some of the west coast markets as well.

After trial-and-error, three categories are found to be best number of clustering for the neighborhoods under consideration because they differentiate the neighborhoods just enough without spread them too thin in too many clusters.  Given the popularity of “Japanese” restaurants and “gyms”, but relative scarcity of “bubble tea shops” in the neighborhoods, the number of “bubble tea shops” in a neighborhood is likely the most significant factor in identifying the best expansion territory.   

Among the three categories of neighborhoods, cluster 0 neighborhoods seem to be best candidates for the expansion since bubble tea shops are most common in those, compared to others.  Cluster 2 neighborhoods seem to be the next best candidates.  


## 5. Discussion  

This is the first time GJ has engaged an external data science analyst on a project like this.  While the company believes that its clientele is strongly correlated with neighborhoods with gyms, bubble tea joints, and Japanese restaurants, it may be worthwhile to gather data and run similar analyses in the currently established markets and compare the results with this one.  In doing so, the company can check on the validity of the correlation between clientele and neighborhoods, as well as comparing the similarity of the proposed new territory  against some of GJ’s most successful locations.  

Further, as noted earlier, the definition of the terms “Japanese”, and “gym” are fairly broad in this context.  While the broad terms are tolerable for this exercise, GJ may want to consider refining the criteria of a neighborhood to see if GJ can find other neighborhood attributes increase the effectiveness of its future marketing campaign.      


## 6. Conclusion  

After discussion with GJ’s executive team, a business case for finding the best Brooklyn neighborhood for GJ’s first east coast branch was established.  Data was sourced from Foursqure.com to minimize cost.  After analyzing the gathered data and grouping all the Brooklyn neighborhoods based on distribution and number of “babble tea”, “Japanese”, and “gym” establishments, a few neighbourhoods that best fit BJ’s criteria were selected and presented to the executive team.  

Two suggestions for improving this analysis’ result in the future are also presented.   


.