# Selecting a place to open a bar in Ho Chi Minh City

## 1. Introduction

My friends are planning to start a new business in Ho Chi Minh City. They would like to open a bar and are looking for a good area for it. This capstone project aims to help them make that decision.

We will target areas that are:
* as close as possible to the center of a neighborhood, where there are more potential customers
* not already crowded with bars, so that there are as few competitors as possible

## 2. Data

To analyze where to open a new bar, I use the following:
* First, the list of boroughs and neighborhoods of Ho Chi Minh City. It is available from the General Statistics Office of Vietnam website, https://www.gso.gov.vn/dmhc2015/Default.aspx, which provides a button to export all boroughs and neighborhoods of the city. After exporting, I stored the data in a file and uploaded it to a cloud service that is easy to access from the notebook.
* Second, the Google Maps API, to get the latitude and longitude of each neighborhood before exploring it with the Foursquare API.
* Next, the Foursquare API, to get bars and similar venues that people usually visit and rate well.
* Finally, the k-means clustering algorithm, to analyze these places.

### Preprocessing

Download the boroughs and neighborhoods of Ho Chi Minh City, which I saved to Google Drive.

In [166]:
!wget -q -O hcmc_data.csv 'https://drive.google.com/uc?export=download&id=1P6_DOzAk1CeEUCkh8X2wOyj1WtjMoS32'

In [167]:
import pandas as pd
import numpy as np

dataset = pd.read_csv("hcmc_data.csv")
dataset.head()

Unnamed: 0,Tỉnh Thành Phố,Mã TP,Quận Huyện,Mã QH,Phường Xã,Mã PX,Cấp
0,Thành phố Hồ Chí Minh,79,Quận 1,760,Phường Tân Định,26734,Phường
1,Thành phố Hồ Chí Minh,79,Quận 1,760,Phường Đa Kao,26737,Phường
2,Thành phố Hồ Chí Minh,79,Quận 1,760,Phường Bến Nghé,26740,Phường
3,Thành phố Hồ Chí Minh,79,Quận 1,760,Phường Bến Thành,26743,Phường
4,Thành phố Hồ Chí Minh,79,Quận 1,760,Phường Nguyễn Thái Bình,26746,Phường


Get the geolocation of these neighborhoods from the Google Maps Platform APIs.

In [168]:
# Get Google API key
google_map_api_key = "removed" # key removed before publishing; supply your own

The following function gets the latitude and longitude from the Google Maps Geocoding API. The addresses passed to the API are the Vietnamese names.

In [None]:
import requests

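def get_coordinates(api_key, address, verbose=False):
    """Return [lat, lng, formatted_address] for an address, or three Nones on failure."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json'
        # pass the query via params so the Vietnamese address is URL-encoded
        response = requests.get(url, params={'key': api_key, 'address': address}).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        fmt_name = results[0]["formatted_address"]
        return [lat, lon, fmt_name]
    except (requests.RequestException, KeyError, IndexError, ValueError):
        # always return three items so callers can safely index [0], [1] and [2]
        return [None, None, None]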
    
address = 'Phường Bến Thành, Quận 1, Thành phố Hồ Chí Minh'
address_coor = get_coordinates(google_map_api_key, address)
print('Coordinate of {}: {}'.format(address, address_coor))

Coordinate of Phường Bến Thành, Quận 1, Thành phố Hồ Chí Minh: [10.7744331, 106.6954544, 'Ben Thanh, District 1, Ho Chi Minh City, Vietnam']


Because the addresses passed to the Google Maps APIs are in Vietnamese, I assign the Vietnamese column headers to the following, more meaningful variables:
* cityname
* boroughname
* neighborhoodname


Each address takes the form neighborhoodname, boroughname, cityname.

In [None]:
columns = dataset.columns     # get all columns names (all in Vietnamese)
cityname = columns[0]         # 1st column is city name (Ho Chi Minh City)
boroughname = columns[2]
neighborhoodname = columns[4]
print(cityname)
print(boroughname)
print(neighborhoodname)
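
As a small illustration of the address format described above, the three columns can be joined into one query string per row. This is only a sketch: the two-row frame is a stand-in for `dataset`, its column names are copied from the CSV header shown earlier, and `build_addresses` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd

def build_addresses(df, neighborhood_col, borough_col, city_col):
    """Join neighborhood, borough and city into 'neighborhood, borough, city'."""
    return df[neighborhood_col] + ", " + df[borough_col] + ", " + df[city_col]

# Stand-in for `dataset`, using column names from the CSV header above.
sample = pd.DataFrame({
    "Tỉnh Thành Phố": ["Thành phố Hồ Chí Minh"] * 2,
    "Quận Huyện": ["Quận 1", "Quận 1"],
    "Phường Xã": ["Phường Tân Định", "Phường Đa Kao"],
})

addresses = build_addresses(sample, "Phường Xã", "Quận Huyện", "Tỉnh Thành Phố")
print(addresses.tolist())
```

Building the strings in one vectorized step keeps the address format in a single place, which makes it easier to change later.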

Now, use the Google API to get the latitude and longitude of all these places.

In [None]:
# get lat long of each neighborhood
lat_array = []
long_array = []
fmt_name_array = []

for n, b, c in zip(dataset[neighborhoodname], dataset[boroughname], dataset[cityname]):
  address = n + ", " + b + ", " + c
  address_coor = get_coordinates(google_map_api_key, address)
  lat_array.append(address_coor[0])
  long_array.append(address_coor[1])
  fmt_name_array.append(address_coor[2])
  print('Coordinate of {}: {}'.format(address, address_coor))

# check results
# print(lat_array)
# print(long_array)
# print(fmt_name_array)

Coordinate of Phường Tân Định, Quận 1, Thành phố Hồ Chí Minh: [10.7930968, 106.6902951, 'Tan Dinh, District 1, Ho Chi Minh City, Vietnam']
Coordinate of Phường Đa Kao, Quận 1, Thành phố Hồ Chí Minh: [10.7878843, 106.6984026, 'Da Kao, District 1, Ho Chi Minh City, Vietnam']
Coordinate of Phường Bến Nghé, Quận 1, Thành phố Hồ Chí Minh: [10.7808334, 106.702825, 'Bến Nghé, District 1, Ho Chi Minh City, Vietnam']
Coordinate of Phường Bến Thành, Quận 1, Thành phố Hồ Chí Minh: [10.7744331, 106.6954544, 'Ben Thanh, District 1, Ho Chi Minh City, Vietnam']
[10.7930968, 10.7878843, 10.7808334, 10.7744331, 10.7693846, 10.7658855, 10.7655446, 10.7616235, 10.7640301, 10.7577834, 10.8834303, 10.8712302, 10.8825023, 10.8760697, 10.866797, 10.8596614, 10.8603672, 10.856544, 10.8384209, 10.8433839, 10.8292885, 10.8804079, 10.8775897, 10.8637312, 10.8676413, 10.8551341, 10.8506683, 10.8339953, 10.8538209, 10.8560516, 10.843909, 10.8467644, 10.832358, 10.8909381, 10.8421949, 10.8569656, 10.8461073, 10.840

Create a new dataset with the latitude and longitude results.

In [None]:
# newdataset = dataset[[cityname, boroughname, neighborhoodname]]
newdataset = pd.DataFrame({})
newdataset.insert(newdataset.shape[1], "lat", lat_array)
newdataset.insert(newdataset.shape[1], "lng", long_array)
newdataset.insert(newdataset.shape[1], "fmt_name", fmt_name_array)

print("original dataset shape:", dataset.shape)
print("     new dataset shape:", newdataset.shape)

original dataset shape: (322, 7)
     new dataset shape: (322, 3)


**NOTES:**
 * In this project, I used the Google Geocoding API to find the latitude and longitude of each location. This Google service charges for each API call, so to avoid unexpected charges I export the new dataset above to a file and upload it to another location.
 * This project was done in Google Colab, so the following code downloads the file from Google Colab.
 * The downloaded file is shared at https://drive.google.com/uc?export=download&id=1On-W6eRHCbQx-Q5qBHLBkzqyNkhnBRq8

In [None]:
from google.colab import files
output_filename = "hcmc_formatted_data.csv"
newdataset.to_csv(output_filename)
files.download(output_filename)


Load the data from the downloaded file in case the Google API is not called.

In [None]:
!wget -q -O hcmc_data_new.csv 'https://drive.google.com/uc?export=download&id=1On-W6eRHCbQx-Q5qBHLBkzqyNkhnBRq8'
reloaddf = pd.read_csv("hcmc_data_new.csv")
reloaddf[["lat", "lng", "fmt_name"]].head()

Unnamed: 0,lat,lng,fmt_name
0,10.793097,106.690295,"Tan Dinh, District 1, Ho Chi Minh City, Vietnam"
1,10.787884,106.698403,"Da Kao, District 1, Ho Chi Minh City, Vietnam"
2,10.780833,106.702825,"Bến Nghé, District 1, Ho Chi Minh City, Vietnam"
3,10.774433,106.695454,"Ben Thanh, District 1, Ho Chi Minh City, Vietnam"
4,10.769385,106.700614,"Nguyen Thai Binh, District 1, Ho Chi Minh City..."
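
The idea of reusing the exported file instead of paying for new geocoding calls can be sketched as a small "load from cache, otherwise geocode" helper. This is a minimal sketch under assumptions: `load_or_geocode`, the cache file name and `fake_geocode` are hypothetical, and `fake_geocode` only stands in for a real lookup such as the geocoding function defined earlier.

```python
import os
import tempfile

import pandas as pd

def load_or_geocode(cache_path, addresses, geocode_fn):
    """Return a DataFrame of lat/lng/fmt_name, reusing a cached CSV when present.

    geocode_fn is any callable mapping an address to [lat, lng, fmt_name];
    it is only called when the cache file does not exist yet.
    """
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path, index_col=0)

    rows = [geocode_fn(address) for address in addresses]
    result = pd.DataFrame(rows, columns=["lat", "lng", "fmt_name"])
    result.to_csv(cache_path)
    return result

calls = []

def fake_geocode(address):
    # Placeholder that avoids any network call; swap in the real lookup.
    calls.append(address)
    return [10.0, 106.0, address]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "geocoded.csv")
    first = load_or_geocode(path, ["A", "B"], fake_geocode)   # geocodes and writes the cache
    second = load_or_geocode(path, ["A", "B"], fake_geocode)  # reads the cache only

print(len(calls), second.shape)
```

With this pattern the paid API is hit at most once per address list, and later runs of the notebook work from the saved file.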


**NOTES:**
* If newdataset is unavailable because the Google API was not called, remember to load the data from the file above (for example, `newdataset = reloaddf[["lat", "lng", "fmt_name"]]`):
https://drive.google.com/uc?export=download&id=1On-W6eRHCbQx-Q5qBHLBkzqyNkhnBRq8

In [None]:
newdataset.head()

Unnamed: 0,lat,lng,fmt_name
0,10.793097,106.690295,"Tan Dinh, District 1, Ho Chi Minh City, Vietnam"
1,10.787884,106.698403,"Da Kao, District 1, Ho Chi Minh City, Vietnam"
2,10.780833,106.702825,"Bến Nghé, District 1, Ho Chi Minh City, Vietnam"
3,10.774433,106.695454,"Ben Thanh, District 1, Ho Chi Minh City, Vietnam"
4,10.769385,106.700614,"Nguyen Thai Binh, District 1, Ho Chi Minh City..."


Now the data is safely available at all times. Let us move on to the next part of the project.

### Get places from Foursquare

Explore the top 100 places around each neighborhood center using the Foursquare API.

In [276]:
# Register a Foursquare account and get credentials for its APIs.
CLIENT_ID = 'removed'
CLIENT_SECRET = 'removed'
VERSION = '20200808'

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: removed
CLIENT_SECRET:removed


The following function retrieves up to 100 popular places within 500 m (the default radius) of each neighborhood center.

In [277]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    LIMIT = 100
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
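
The list comprehension above assumes every returned item has at least one category and a location. A more defensive way to flatten the items could look like the sketch below. It is not the notebook's code: `extract_venue_rows` is a hypothetical helper, the item structure is assumed to match the keys accessed in `getNearbyVenues`, and the sample venues are made up.

```python
def extract_venue_rows(neighborhood, lat, lng, items):
    """Flatten explore-style items into tuples, skipping incomplete ones."""
    rows = []
    for item in items:
        venue = item.get("venue", {})
        location = venue.get("location", {})
        categories = venue.get("categories") or []
        if not categories or "lat" not in location or "lng" not in location:
            continue  # skip venues without a category or coordinates
        rows.append((
            neighborhood, lat, lng,
            venue.get("name"),
            location["lat"], location["lng"],
            categories[0]["name"],
        ))
    return rows

# Hand-written items mimicking the structure used in getNearbyVenues.
sample_items = [
    {"venue": {"name": "Some Bar",
               "location": {"lat": 10.77, "lng": 106.69},
               "categories": [{"name": "Bar"}]}},
    {"venue": {"name": "No Category",
               "location": {"lat": 10.78, "lng": 106.70},
               "categories": []}},
]

rows = extract_venue_rows("Ben Thanh", 10.7744331, 106.6954544, sample_items)
print(rows)
```

Skipping incomplete items keeps one odd venue from aborting a long run over all neighborhoods.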

In [279]:
# Get top 100 popular places of each neighborhood
hcmc_venues = getNearbyVenues(names=newdataset["fmt_name"],
                                latitudes=newdataset["lat"],
                                longitudes=newdataset["lng"])

Tan Dinh, District 1, Ho Chi Minh City, Vietnam
Da Kao, District 1, Ho Chi Minh City, Vietnam
Bến Nghé, District 1, Ho Chi Minh City, Vietnam
Ben Thanh, District 1, Ho Chi Minh City, Vietnam
Nguyen Thai Binh, District 1, Ho Chi Minh City, Vietnam
Pham Ngu Lao, District 1, Ho Chi Minh City, Vietnam
Cầu Ông Lãnh, District 1, Ho Chi Minh City, Vietnam
Co Giang, District 1, Ho Chi Minh City, Vietnam
Nguyen Cu Trinh, District 1, Ho Chi Minh City, Vietnam
Cau Kho, District 1, Ho Chi Minh City, Vietnam
Thạnh Xuân, District 12, Ho Chi Minh City, Vietnam
Thạnh Lộc, Quận 12, Thành phố Hồ Chí Minh, Vietnam
Hiệp Thành, District 12, Ho Chi Minh City, Vietnam
Thoi An, District 12, Ho Chi Minh City, Vietnam
Tân Chánh Hiệp, Quận 12, Thành phố Hồ Chí Minh, Vietnam
An Phú Đông, Quận 12, Thành phố Hồ Chí Minh, Vietnam
Tân Thới Hiệp, District 12, Ho Chi Minh City, Vietnam
Trung My Tay, District 12, Ho Chi Minh City, Vietnam
Tân Hưng Thuận, Quận 12, Thành phố Hồ Chí Minh, Vietnam
Đông Hưng Thuận, Quận 12, 

Because of the Foursquare API call limits, I export the data to a file so that it can be reused.

In [281]:
# Download the data for offline use
# to avoid exhausting the regular API call quota
from google.colab import files
output_nearby_places = "hcmc_nearby_place_nightlife.csv"
hcmc_venues.to_csv(output_nearby_places)
files.download(output_nearby_places)


## 3. Methodology

Now that I have the data for each neighborhood, I apply the k-means algorithm to cluster the neighborhoods, then count the bars that appear in each cluster.

In [324]:
# Import libs
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [284]:
# one hot encoding
hcmc_onehot = pd.get_dummies(hcmc_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
hcmc_onehot['Neighbourhood'] = hcmc_venues['Neighbourhood'] 

# move neighborhood column to the first column
fixed_columns = [hcmc_onehot.columns[-1]] + list(hcmc_onehot.columns[:-1])
hcmc_onehot = hcmc_onehot[fixed_columns]

hcmc_onehot.head(3)

Unnamed: 0,Neighbourhood,Airport Food Court,American Restaurant,Argentinian Restaurant,Art Gallery,Arts & Crafts Store,Arts & Entertainment,Asian Restaurant,Athletics & Sports,Auto Garage,Auto Workshop,BBQ Joint,Badminton Court,Bagel Shop,Bakery,Bar,Baseball Field,Baseball Stadium,Basketball Stadium,Bathing Area,Beach,Bed & Breakfast,Beer Bar,Beer Garden,Bistro,Boarding House,Boat or Ferry,Bookstore,Boutique,Breakfast Spot,Brewery,Bubble Tea Shop,Buffet,Building,Burger Joint,Burrito Place,Bus Station,Business Service,Cafeteria,Café,Cajun / Creole Restaurant,Camera Store,Cantonese Restaurant,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Comfort Food Restaurant,Convenience Store,Convention Center,Cosmetics Shop,Cruise Ship,Cupcake Shop,Deli / Bodega,Department Store,Design Studio,Dessert Shop,Dim Sum Restaurant,Diner,Dive Bar,Dumpling Restaurant,Eastern European Restaurant,Electronics Store,Exhibit,Fabric Shop,Fast Food Restaurant,Fish Market,Fishing Spot,Flea Market,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,Fountain,French Restaurant,Fried Chicken Joint,Frozen Yogurt Shop,Furniture / Home Store,Garden,Gastropub,Gay Bar,German Restaurant,Gift Shop,Greek Restaurant,Grocery Store,Gym / Fitness Center,Hardware Store,Health & Beauty Service,History Museum,Home Service,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Curry Restaurant,Japanese Restaurant,Juice Bar,Karaoke Bar,Korean Restaurant,Lake,Lighthouse,Lounge,Malay Restaurant,Market,Massage Studio,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Miscellaneous Shop,Mobile Phone Shop,Mongolian Restaurant,Motel,Motorcycle Shop,Movie Theater,Multiplex,Museum,Music Venue,Nightclub,Noodle House,North Indian Restaurant,Opera House,Organic Grocery,Outdoors & Recreation,Park,Performing Arts Venue,Pet Café,Pet Store,Pizza Place,Playground,Plaza,Pool,Pool Hall,Ramen Restaurant,Residential Building (Apartment / Condo),Rest Area,Restaurant,Rock Club,Russian Restaurant,Sandwich Place,Scenic Lookout,Seafood Restaurant,Shoe Store,Shop & Service,Shopping Mall,Snack Place,Soccer Field,Soup Place,Spa,Spanish Restaurant,Sporting Goods Shop,Sports Bar,Sports Club,Stadium,Steakhouse,Street Food Gathering,Supermarket,Sushi Restaurant,Tapas Restaurant,Tattoo Parlor,Taxi,Tea Room,Temple,Thai Restaurant,Theater,Theme Park,Toy / Game Store,Track Stadium,Trail,Tunnel,Udon Restaurant,University,Vegetarian / Vegan Restaurant,Video Game Store,Vietnamese Restaurant,Warehouse Store,Water Park,Whisky Bar,Yoga Studio
0,"Tan Dinh, District 1, Ho Chi Minh City, Vietnam",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
1,"Tan Dinh, District 1, Ho Chi Minh City, Vietnam",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,"Tan Dinh, District 1, Ho Chi Minh City, Vietnam",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [285]:
hcmc_onehot.shape
hcmc_grouped = hcmc_onehot.groupby('Neighbourhood').mean().reset_index()
hcmc_grouped
hcmc_grouped.shape

(243, 182)
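
To make the one-hot encoding and the per-neighborhood averaging concrete, here is a toy version of the two steps. The five-row frame and its neighborhood names are invented for illustration; only the column names `Neighbourhood` and `Venue Category` follow the notebook. Passing `dtype=float` to `get_dummies` is a choice made here so that the mean is numeric on any pandas version.

```python
import pandas as pd

toy_venues = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Bar", "Café", "Bar", "Café", "Hotel"],
})

# One column per category, 1.0 where the venue belongs to it.
toy_onehot = pd.get_dummies(
    toy_venues[["Venue Category"]], prefix="", prefix_sep="", dtype=float
)
toy_onehot.insert(0, "Neighbourhood", toy_venues["Neighbourhood"])

# The mean of the indicator columns is the share of each category per neighborhood.
toy_grouped = toy_onehot.groupby("Neighbourhood").mean().reset_index()
print(toy_grouped)
```

Each row of the grouped frame sums to 1 across the category columns, so k-means later compares the venue mix of neighborhoods rather than their absolute venue counts.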

In [286]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
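
One thing worth checking: because this function always returns `num_top_venues` names, a neighborhood with fewer distinct categories than that will have its remaining slots filled by categories whose frequency is zero. The sketch below is a variant, not the notebook's function, that drops zero-frequency categories before ranking; the toy row and the name `top_nonzero_categories` are made up for illustration.

```python
import pandas as pd

def top_nonzero_categories(row, num_top_venues):
    """Return up to num_top_venues category names, ignoring zero frequencies."""
    frequencies = row.iloc[1:].astype(float)        # drop the 'Neighbourhood' label
    frequencies = frequencies[frequencies > 0]      # keep categories actually present
    ranked = frequencies.sort_values(ascending=False)
    return ranked.index.values[:num_top_venues]

toy_row = pd.Series(
    {"Neighbourhood": "A", "Bar": 0.5, "Café": 0.3, "Hotel": 0.2, "Spa": 0.0}
)
print(top_nonzero_categories(toy_row, 4))
```

Here only three names come back even though four were requested, because `Spa` never occurs in the toy neighborhood.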

In [287]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighbourhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighbourhood'] = hcmc_grouped['Neighbourhood']

for ind in np.arange(hcmc_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(hcmc_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"01 Cao Thắng, Phường 2, Quận 3, Thành phố Hồ C...",Vietnamese Restaurant,Café,Asian Restaurant,Bookstore,Bakery,Noodle House,Chinese Restaurant,Food Truck,Pizza Place,Hotel
1,"09 Cao Thắng, Phường 2, Quận 3, Thành phố Hồ C...",Vietnamese Restaurant,Café,Asian Restaurant,Bakery,Bookstore,Food Truck,Noodle House,Chinese Restaurant,Seafood Restaurant,Vegetarian / Vegan Restaurant
2,"107 Cao Văn Lầu, Phường 1, Quận 6, Thành phố H...",Food,Food Truck,Market,Dessert Shop,Yoga Studio,Dive Bar,Fast Food Restaurant,Fabric Shop,Exhibit,Electronics Store
3,"137 Nguyễn Văn Đậu, Phường 7, Bình Thạnh, Thàn...",Café,Vietnamese Restaurant,Asian Restaurant,Brewery,Stadium,Yoga Studio,Diner,Fast Food Restaurant,Fabric Shop,Exhibit
4,"155, Đường Nguyễn Văn Trỗi, Phường 11, Phú Nhu...",Café,Coffee Shop,Vietnamese Restaurant,Hotel,Chinese Restaurant,Juice Bar,BBQ Joint,Flea Market,Bed & Breakfast,Bar


In [288]:
# set number of clusters
kclusters = 5

hcmc_grouped_clustering = hcmc_grouped.drop(columns='Neighbourhood')

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(hcmc_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 3, 0, 1, 3, 1, 3, 2, 3, 3], dtype=int32)
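
The number of clusters is fixed at 5 above. One common way to sanity-check such a choice is to compare silhouette scores for several values of k. The sketch below does this on synthetic data, not on the notebook's features: the three blobs are a stand-in for `hcmc_grouped_clustering`, and the range of k values is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix: three well-separated blobs.
features = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(30, 4))
    for center in (0.0, 1.0, 2.0)
])

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(features)
    scores[k] = silhouette_score(features, labels)  # higher means tighter, better-separated clusters

best_k = max(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

On real venue data the scores are usually much closer together than in this toy case, so the result is a hint rather than a definitive answer.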

In [289]:
neighborhoods_venues_sorted.head(3)

Unnamed: 0,Neighbourhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,"01 Cao Thắng, Phường 2, Quận 3, Thành phố Hồ C...",Vietnamese Restaurant,Café,Asian Restaurant,Bookstore,Bakery,Noodle House,Chinese Restaurant,Food Truck,Pizza Place,Hotel
1,"09 Cao Thắng, Phường 2, Quận 3, Thành phố Hồ C...",Vietnamese Restaurant,Café,Asian Restaurant,Bakery,Bookstore,Food Truck,Noodle House,Chinese Restaurant,Seafood Restaurant,Vegetarian / Vegan Restaurant
2,"107 Cao Văn Lầu, Phường 1, Quận 6, Thành phố H...",Food,Food Truck,Market,Dessert Shop,Yoga Studio,Dive Bar,Fast Food Restaurant,Fabric Shop,Exhibit,Electronics Store


In [290]:
newdataset.head(3)

Unnamed: 0,lat,lng,fmt_name
0,10.793097,106.690295,"Tan Dinh, District 1, Ho Chi Minh City, Vietnam"
1,10.787884,106.698403,"Da Kao, District 1, Ho Chi Minh City, Vietnam"
2,10.780833,106.702825,"Bến Nghé, District 1, Ho Chi Minh City, Vietnam"


In [291]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

hcmc_merged = newdataset

# merge the sorted venue table into the geocoded neighborhoods to add latitude/longitude for each neighborhood
hcmc_merged = hcmc_merged.join(neighborhoods_venues_sorted.set_index('Neighbourhood'), on='fmt_name')

hcmc_merged.head()

Unnamed: 0,lat,lng,fmt_name,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,10.793097,106.690295,"Tan Dinh, District 1, Ho Chi Minh City, Vietnam",3.0,Vietnamese Restaurant,Café,Coffee Shop,Breakfast Spot,Vegetarian / Vegan Restaurant,Asian Restaurant,Yoga Studio,Design Studio,Spa,Snack Place
1,10.787884,106.698403,"Da Kao, District 1, Ho Chi Minh City, Vietnam",3.0,Vietnamese Restaurant,Café,French Restaurant,Coffee Shop,Japanese Restaurant,Vegetarian / Vegan Restaurant,Beer Garden,Dessert Shop,Bubble Tea Shop,Breakfast Spot
2,10.780833,106.702825,"Bến Nghé, District 1, Ho Chi Minh City, Vietnam",0.0,Coffee Shop,Hotel,Café,Spa,Ramen Restaurant,Vietnamese Restaurant,Massage Studio,French Restaurant,Bar,Steakhouse
3,10.774433,106.695454,"Ben Thanh, District 1, Ho Chi Minh City, Vietnam",3.0,Vietnamese Restaurant,Hotel,Park,BBQ Joint,Shoe Store,Sandwich Place,Japanese Restaurant,Noodle House,Spanish Restaurant,Cajun / Creole Restaurant
4,10.769385,106.700614,"Nguyen Thai Binh, District 1, Ho Chi Minh City...",3.0,Vietnamese Restaurant,Café,Hotel,Coffee Shop,Burger Joint,Japanese Restaurant,Tapas Restaurant,Cocktail Bar,Bar,Beer Bar


In [292]:
hcmc_merged_nona = hcmc_merged.dropna()
hcmc_merged_nona.shape
hcmc_merged_nona["Cluster Labels"].values

array([3., 3., 0., 3., 3., 3., 3., 3., 3., 3., 1., 3., 3., 3., 0., 0., 0.,
       1., 0., 4., 1., 0., 0., 1., 1., 1., 3., 0., 0., 0., 0., 3., 0., 0.,
       4., 0., 3., 0., 0., 3., 3., 0., 1., 1., 1., 0., 1., 1., 3., 1., 0.,
       0., 0., 1., 3., 1., 3., 1., 1., 3., 0., 1., 0., 0., 1., 0., 1., 0.,
       0., 0., 3., 3., 3., 3., 0., 3., 3., 3., 3., 0., 3., 0., 3., 3., 3.,
       0., 3., 0., 1., 0., 3., 0., 0., 0., 0., 3., 3., 3., 3., 0., 3., 3.,
       3., 3., 3., 3., 3., 3., 1., 3., 0., 4., 1., 0., 4., 1., 0., 1., 3.,
       3., 3., 3., 3., 3., 0., 3., 3., 3., 0., 3., 3., 3., 3., 3., 3., 3.,
       3., 3., 3., 0., 3., 3., 3., 0., 3., 0., 0., 3., 3., 3., 3., 1., 0.,
       0., 0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 3., 3., 3., 3.,
       1., 3., 3., 3., 0., 3., 3., 0., 3., 3., 0., 0., 3., 0., 0., 0., 0.,
       0., 0., 3., 3., 0., 0., 0., 3., 3., 3., 0., 3., 1., 0., 0., 0., 0.,
       0., 0., 0., 2., 2., 0., 3., 3., 2., 3., 1., 3., 3., 0., 1., 1., 1.,
       1., 3., 0., 1., 1.

In [293]:
hcmc_merged_nona = hcmc_merged_nona.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
hcmc_merged_nona['Cluster Labels'] = hcmc_merged_nona['Cluster Labels'].astype('int')
hcmc_merged_nona["Cluster Labels"].values


array([3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 0, 0, 0, 1, 0, 4, 1, 0,
       0, 1, 1, 1, 3, 0, 0, 0, 0, 3, 0, 0, 4, 0, 3, 0, 0, 3, 3, 0, 1, 1,
       1, 0, 1, 1, 3, 1, 0, 0, 0, 1, 3, 1, 3, 1, 1, 3, 0, 1, 0, 0, 1, 0,
       1, 0, 0, 0, 3, 3, 3, 3, 0, 3, 3, 3, 3, 0, 3, 0, 3, 3, 3, 0, 3, 0,
       1, 0, 3, 0, 0, 0, 0, 3, 3, 3, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 1, 3,
       0, 4, 1, 0, 4, 1, 0, 1, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 0, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 0, 3, 3, 3, 0, 3, 0, 0, 3, 3, 3, 3, 1, 0, 0,
       0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 3, 3, 3, 3, 1, 3, 3, 3, 0, 3,
       3, 0, 3, 3, 0, 0, 3, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 0, 3, 3, 3, 0,
       3, 1, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0, 3, 3, 2, 3, 1, 3, 3, 0, 1, 1,
       1, 1, 3, 0, 1, 1, 0, 1, 4, 1, 0, 0, 3, 0, 4, 3, 0, 0, 0, 3, 1, 1,
       3, 1, 4, 1, 0, 0, 0, 0, 3, 1, 0, 0, 0, 3, 1, 0, 0, 0, 0, 0])

In [294]:
# Ho Chi Minh City
citylat = 10.762622
citylng = 106.660172
map_clusters = folium.Map(location=[citylat, citylng], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(hcmc_merged_nona['lat'], hcmc_merged_nona['lng'], hcmc_merged_nona['fmt_name'], hcmc_merged_nona['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [323]:
clusters = []
for i in range(kclusters):
  print("Checking on cluster %d" % (i))
  clusters.append(hcmc_merged_nona.loc[hcmc_merged_nona['Cluster Labels'] == i, hcmc_merged_nona.columns[[2] + list(range(4, hcmc_merged_nona.shape[1]))]])

popular_bars_per_cluster = []
for i in range(kclusters):
  # print("shape of cluster %d: %s" % (i, clusters[i].shape))
  total = 0
  for col in clusters[i]:
    count = clusters[i][col].str.count("[bB]ar").sum()
    if (count > 0):
      total += count
      # print("found %d from %s" % (count, col))
  popular_bars_per_cluster.append(total)
  print("number of popular bars on cluster %d: %d" % (i, popular_bars_per_cluster[i]))

Checking on cluster 0
Checking on cluster 1
Checking on cluster 2
Checking on cluster 3
Checking on cluster 4
number of popular bars on cluster 0: 38
number of popular bars on cluster 1: 8
number of popular bars on cluster 2: 3
number of popular bars on cluster 3: 26
number of popular bars on cluster 4: 6


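Clusters 0 and 3 have the most bars. Note that each count is the number of times a category containing "bar" appears among the ten most common venue categories of the neighborhoods in that cluster.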
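
Because the clusters contain very different numbers of neighborhoods, raw totals favor the large clusters. A possible refinement, shown here only as a sketch on invented data, is to count an explicit list of bar categories and divide by the number of neighborhoods in each cluster. `BAR_CATEGORIES` and `bar_mentions_per_cluster` are hypothetical names; the category list is an assumption picked from the one-hot column names and deliberately leaves out `Juice Bar`.

```python
import pandas as pd

# Categories treated as bars (an assumption; adjust to taste).
BAR_CATEGORIES = ["Bar", "Beer Bar", "Cocktail Bar", "Dive Bar", "Gay Bar",
                  "Hotel Bar", "Karaoke Bar", "Sports Bar", "Whisky Bar"]

def bar_mentions_per_cluster(df, label_col="Cluster Labels"):
    """Count bar categories in the 'Most Common Venue' columns per cluster."""
    venue_cols = [c for c in df.columns if c.endswith("Most Common Venue")]
    per_row = df[venue_cols].isin(BAR_CATEGORIES).sum(axis=1)
    summary = per_row.groupby(df[label_col]).agg(["sum", "count"])
    summary.columns = ["bar_mentions", "neighborhoods"]
    summary["mentions_per_neighborhood"] = (
        summary["bar_mentions"] / summary["neighborhoods"]
    )
    return summary

# Toy data: cluster 0 has three neighborhoods, cluster 1 has one.
toy = pd.DataFrame({
    "Cluster Labels": [0, 0, 0, 1],
    "1st Most Common Venue": ["Café", "Bar", "Hotel", "Bar"],
    "2nd Most Common Venue": ["Bar", "Juice Bar", "Spa", "Beer Bar"],
})

summary = bar_mentions_per_cluster(toy)
print(summary)
```

In the toy data both clusters have two bar mentions in total, but the per-neighborhood rate differs, which is the kind of distinction a raw total hides.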

## 4. Result

From the clustering result above, clusters 0 and 3 have the most bars, so a new bar in these areas would face more competitors. However, clusters 1, 2 and 4 have only a few bars, so there are fewer competitors in these areas.

## 5. Discussion and Conclusion

Although opening a bar in clusters 0 and 3 will be more of a struggle, these areas also have many customers, because most people go to bars there.
Therefore, the choice depends on how the business owners want their business to be. If they prefer fewer competitors, they should open the bar in an area of cluster 1, 2 or 4. However, if they accept more competitors in exchange for more potential customers, they should open it in an area of cluster 0 or 3.