# IBM Data Science - Final Course Assignment 

### Topic: Analyze the best place in Ho Chi Minh City, Vietnam to open the new coffee shop

## Business Statement

The customer comes to our company and ask for the support to get the business report for finding the suitable place to open the coffee shop in Ho Chi Minh City.

As a date engineer, the project is mainly focused on geospatial analysis of the Ho Chi Minh city to understand which would be the best place and give the report

In [1]:
!pip install geocoder
from bs4 import BeautifulSoup
import urllib
import geocoder
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

print('Libraries imported.')

Collecting geocoder
[?25l  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
[K     |████████████████████████████████| 102kB 7.9MB/s ta 0:00:011
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - geopy


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2020.6.20  |       hecda079_0         145 KB  conda-forge
    python_abi-3.6             |          1_cp36m           4 KB  conda-forge
    geogra

## 1. Getting the Data

Build a dataframe of districts in Ho Chi Minh city, Vietnam by using web scraping the data from Wikipedia page https://en.wikipedia.org/wiki/Category:Districts_of_Ho_Chi_Minh_City

Get the geographical coordinates of the districts by Python Geocoder package.

Obtain the venue data for the districts from Foursquare API Explore and cluster the neighbourhoods.

Select the best cluster to open a new coffee shop.

In [2]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Districts_of_Ho_Chi_Minh_City").text

In [3]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [4]:
# create a list to store neighborhood data
neighborhoodList = []

In [5]:
# append the data into the list
for row in soup.find_all("div",class_="mw-category-generated")[0].findAll("li"):
    neighborhoodList.append(row.text)

In [6]:
# remove the "Template:List of HCMC Administrative Units" in the list - index 0
del(neighborhoodList[0])

In [7]:
# create a new DataFrame from the list
HCM_df = pd.DataFrame({"Neighborhood": neighborhoodList})
HCM_df

Unnamed: 0,Neighborhood
0,Bình Chánh District
1,"Bình Tân District, Ho Chi Minh City"
2,Bình Thạnh District
3,Cần Giờ District
4,Củ Chi District
5,"District 1, Ho Chi Minh City"
6,"District 2, Ho Chi Minh City"
7,"District 3, Ho Chi Minh City"
8,"District 4, Ho Chi Minh City"
9,"District 5, Ho Chi Minh City"


In [8]:
# print the number of rows of the dataframe
HCM_df.shape

(24, 1)

In [9]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Ho Chi Minh City, Vietnam'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

In [10]:
coords = [ get_latlng(neighborhood) for neighborhood in HCM_df["Neighborhood"].tolist() ]

In [11]:
coords

[[10.679220000000043, 106.57654000000008],
 [10.73684000000003, 106.61448000000007],
 [10.805180000000064, 106.69280000000003],
 [10.41566000000006, 106.96130000000005],
 [10.977340000000027, 106.50223000000005],
 [10.78096000000005, 106.69911000000008],
 [10.791990000000055, 106.74985000000004],
 [10.775650000000041, 106.68672000000004],
 [10.766700000000071, 106.70647000000008],
 [10.755690000000072, 106.66637000000009],
 [10.745970000000057, 106.64769000000007],
 [10.70515000000006, 106.73748000000006],
 [10.74771000000004, 106.66334000000006],
 [10.820050000000037, 106.83182000000005],
 [10.768830000000037, 106.66599000000008],
 [10.763160000000028, 106.64314000000007],
 [10.850440000000049, 106.62731000000008],
 [10.833790000000022, 106.66556000000008],
 [10.888360000000034, 106.59640000000007],
 [10.701530000000048, 106.73818000000006],
 [10.795650000000023, 106.67464000000007],
 [10.73684000000003, 106.61448000000007],
 [10.782320000000027, 106.63667000000004],
 [10.846260000000

In [12]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
HCM_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [13]:
# merge the coordinates into the original dataframe
HCM_df['Latitude'] = HCM_coords['Latitude']
HCM_df['Longitude'] = HCM_coords['Longitude']

In [14]:
# check the neighborhoods and the coordinates
print(HCM_df.shape)
HCM_df

(24, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Bình Chánh District,10.67922,106.57654
1,"Bình Tân District, Ho Chi Minh City",10.73684,106.61448
2,Bình Thạnh District,10.80518,106.6928
3,Cần Giờ District,10.41566,106.9613
4,Củ Chi District,10.97734,106.50223
5,"District 1, Ho Chi Minh City",10.78096,106.69911
6,"District 2, Ho Chi Minh City",10.79199,106.74985
7,"District 3, Ho Chi Minh City",10.77565,106.68672
8,"District 4, Ho Chi Minh City",10.7667,106.70647
9,"District 5, Ho Chi Minh City",10.75569,106.66637


In [15]:
# get the coordinates of Ho Chi Minh city
address = 'Ho Chi Minh, Vietnam'

geolocator = Nominatim(user_agent="ntlgiang-capstone-project")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of HO Chi Minh City, Vietnam {}, {}.'.format(latitude, longitude))

The geograpical coordinate of HO Chi Minh City, Vietnam 10.7758439, 106.7017555.


In [16]:
# create map of Ho Chi Minh City using latitude and longitude values
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(HCM_df['Latitude'], HCM_df['Longitude'], HCM_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_kl)  
    
map_kl

In [20]:
# save the map as HTML file
map_kl.save('home\map_kl.html')

## 2. Use the foursquare API to explore the district data in Ho Chi Minh city, Vietnam

In [21]:
CLIENT_ID = '3QIVFINRNQ34QWFTV4TXLDEXP4DOGUOPNZVW4FF35TG5ETLR' 
CLIENT_SECRET = 'NGDNKESZYMT0QVZ13D3UPKSLRCKO3N3BQIRYES504JRKGML3' 
VERSION = '20180604'
print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: 3QIVFINRNQ34QWFTV4TXLDEXP4DOGUOPNZVW4FF35TG5ETLR
CLIENT_SECRET:NGDNKESZYMT0QVZ13D3UPKSLRCKO3N3BQIRYES504JRKGML3


In [22]:
radius = 20000
LIMIT = 1000

venues = []

for lat, long, neighborhood in zip(HCM_df['Latitude'], HCM_df['Longitude'], HCM_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))

In [23]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head(15)

(2306, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Bình Chánh District,10.67922,106.57654,AEON Mall Bình Tân,10.742904,106.611836,Shopping Mall
1,Bình Chánh District,10.67922,106.57654,CGV Cinemas Sư Vạn Hạnh,10.770912,106.66967,Multiplex
2,Bình Chánh District,10.67922,106.57654,The Common Room Project,10.758276,106.679811,Hostel
3,Bình Chánh District,10.67922,106.57654,Fusion Suites Sai Gon,10.772773,106.689894,Hotel
4,Bình Chánh District,10.67922,106.57654,Artinus 3D Painting Gallery,10.742991,106.694927,Art Gallery
5,Bình Chánh District,10.67922,106.57654,Christina's Saigon,10.765303,106.686612,Bed & Breakfast
6,Bình Chánh District,10.67922,106.57654,Starbucks,10.753839,106.669614,Coffee Shop
7,Bình Chánh District,10.67922,106.57654,Sushi Hokkaido Sachi Nguyễn Trãi,10.769229,106.690371,Sushi Restaurant
8,Bình Chánh District,10.67922,106.57654,Pizza 4P's,10.773301,106.697599,Pizza Place
9,Bình Chánh District,10.67922,106.57654,CGV Cinemas SC VivoCity,10.729917,106.703477,Multiplex


In [24]:
# check how many venues were returned for each neighbourhood
venues_df.groupby(["Neighborhood"]).count()

Unnamed: 0_level_0,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bình Chánh District,100,100,100,100,100,100
Bình Thạnh District,100,100,100,100,100,100
"Bình Tân District, Ho Chi Minh City",100,100,100,100,100,100
Cần Giờ District,69,69,69,69,69,69
Củ Chi District,37,37,37,37,37,37
"District 1, Ho Chi Minh City",100,100,100,100,100,100
"District 10, Ho Chi Minh City",100,100,100,100,100,100
"District 11, Ho Chi Minh City",100,100,100,100,100,100
"District 12, Ho Chi Minh City",100,100,100,100,100,100
"District 2, Ho Chi Minh City",100,100,100,100,100,100


In [25]:
#check out how many unique categories can be curated from all the returned values
print('There are {} uniques categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 87 uniques categories.


In [26]:
# displaying the first 50 Venue Category names
venues_df['VenueCategory'].unique()[:50] 

array(['Shopping Mall', 'Multiplex', 'Hostel', 'Hotel', 'Art Gallery',
       'Bed & Breakfast', 'Coffee Shop', 'Sushi Restaurant',
       'Pizza Place', 'Vietnamese Restaurant', 'Dim Sum Restaurant',
       'Food Court', 'Vegetarian / Vegan Restaurant', 'Cocktail Bar',
       'Dessert Shop', 'Park', 'Supermarket', 'Café', 'Whisky Bar',
       'Deli / Bodega', 'Burger Joint', 'Bar', 'Department Store',
       'BBQ Joint', 'Tattoo Parlor', 'Hotel Bar', 'Brewery',
       'Massage Studio', 'Asian Restaurant', 'Mexican Restaurant',
       'Italian Restaurant', 'German Restaurant', 'Japanese Restaurant',
       'Spa', 'Bookstore', 'Noodle House', 'Convention Center',
       'Jazz Club', 'Music Venue', 'Pub', 'Nightclub',
       'Hotpot Restaurant', 'Speakeasy', 'Ramen Restaurant', 'Lounge',
       'French Restaurant', 'Beer Bar', 'Middle Eastern Restaurant',
       'Opera House', 'Clothing Store'], dtype=object)

In [27]:
# check if the results contain "Restaurant"
"Coffee Shop" in venues_df['VenueCategory'].unique()

True

## 3. Analyze each district of Ho Chi Minh city

In [28]:
# one hot encoding
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]

print(kl_onehot.shape)
kl_onehot.head(15)

(2306, 88)


Unnamed: 0,Neighborhoods,Art Gallery,Asian Restaurant,BBQ Joint,Bakery,Bar,Beach,Beach Bar,Bed & Breakfast,Beer Bar,Bistro,Bookstore,Brewery,Burger Joint,Café,Campground,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Convenience Store,Convention Center,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Flea Market,Food,Food Court,French Restaurant,Garden,Garden Center,German Restaurant,Golf Course,Gun Range,Health & Beauty Service,Historic Site,History Museum,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Lighthouse,Lounge,Market,Massage Studio,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Multiplex,Music Venue,Nightclub,Noodle House,Opera House,Other Great Outdoors,Park,Pizza Place,Pool,Pub,Public Art,Racetrack,Ramen Restaurant,Resort,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Snack Place,Soup Place,Spa,Speakeasy,Sports Bar,Street Food Gathering,Supermarket,Surf Spot,Sushi Restaurant,Tattoo Parlor,Tea Room,Theme Park,Travel Agency,Tunnel,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar
0,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Bình Chánh District,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Bình Chánh District,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
8,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [29]:
# group rows by neighborhood and by taking the mean of the frequency of occurrence of each category
kl_grouped = kl_onehot.groupby(["Neighborhoods"]).sum().reset_index()
#df.group/by("state")["last_name"].count()

#kl_onehot.head()
print(kl_grouped.shape)
kl_grouped

(24, 88)


Unnamed: 0,Neighborhoods,Art Gallery,Asian Restaurant,BBQ Joint,Bakery,Bar,Beach,Beach Bar,Bed & Breakfast,Beer Bar,Bistro,Bookstore,Brewery,Burger Joint,Café,Campground,Chinese Restaurant,Clothing Store,Cocktail Bar,Coffee Shop,Convenience Store,Convention Center,Deli / Bodega,Department Store,Dessert Shop,Dim Sum Restaurant,Flea Market,Food,Food Court,French Restaurant,Garden,Garden Center,German Restaurant,Golf Course,Gun Range,Health & Beauty Service,Historic Site,History Museum,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Lighthouse,Lounge,Market,Massage Studio,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Multiplex,Music Venue,Nightclub,Noodle House,Opera House,Other Great Outdoors,Park,Pizza Place,Pool,Pub,Public Art,Racetrack,Ramen Restaurant,Resort,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Snack Place,Soup Place,Spa,Speakeasy,Sports Bar,Street Food Gathering,Supermarket,Surf Spot,Sushi Restaurant,Tattoo Parlor,Tea Room,Theme Park,Travel Agency,Tunnel,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar
0,Bình Chánh District,1,1,2,0,1,0,0,1,1,1,1,1,1,7,0,1,1,1,6,1,1,1,1,1,1,0,0,1,1,0,0,1,0,0,0,0,0,1,12,1,2,0,1,1,1,0,1,0,1,0,1,1,4,1,1,1,1,0,4,3,0,1,0,0,1,0,0,0,0,1,0,0,3,1,0,0,2,0,2,1,0,0,0,0,3,9,2
1,Bình Thạnh District,1,1,2,0,1,0,0,1,1,0,1,1,2,7,0,1,1,1,3,1,1,1,1,1,0,0,0,1,3,0,0,1,0,0,1,0,0,1,12,2,1,0,1,1,1,0,1,0,1,0,2,1,3,1,1,1,1,0,3,4,0,0,1,0,1,1,2,0,0,0,0,0,2,1,0,0,2,0,2,1,0,0,0,0,3,9,2
2,"Bình Tân District, Ho Chi Minh City",1,1,1,0,1,0,0,1,1,1,1,1,1,7,0,0,1,1,6,0,1,1,1,1,1,0,0,1,3,0,0,1,0,0,0,0,0,1,12,1,2,0,1,1,1,0,1,0,1,0,2,1,4,1,1,1,1,0,3,3,0,1,0,0,1,0,0,0,0,1,0,0,3,1,0,0,2,0,2,1,0,0,0,0,3,10,2
3,Cần Giờ District,0,5,1,0,1,3,1,1,0,0,0,0,0,9,1,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,7,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,4,0,0,1,0,7,1,0,8,1,1,0,0,0,1,0,0,1,1,0,0,0,0,0,0,5,0
4,Củ Chi District,0,2,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,1,0,1,1,0,0,1,1,0,1,1,0,1,2,0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,1,2,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,4,0
5,"District 1, Ho Chi Minh City",1,1,1,0,1,0,0,1,2,1,1,1,2,6,0,1,1,1,4,1,1,1,1,1,0,0,0,1,3,0,0,1,0,0,1,0,0,1,12,2,2,0,1,1,1,0,1,0,1,0,2,1,4,1,1,1,1,0,3,4,0,0,1,0,1,0,0,0,0,0,0,0,2,1,0,0,2,0,2,1,0,0,0,0,3,9,2
6,"District 10, Ho Chi Minh City",1,1,1,0,1,0,0,1,1,1,1,1,1,7,0,1,1,1,5,0,1,1,1,1,1,0,0,1,3,0,0,1,0,0,0,0,0,1,12,1,2,0,1,1,1,0,1,0,1,0,2,1,4,1,1,1,1,0,4,4,0,0,0,0,1,0,0,0,0,1,0,0,2,1,0,0,2,0,2,1,0,0,0,0,3,10,2
7,"District 11, Ho Chi Minh City",1,1,1,0,1,0,0,1,1,1,1,1,1,7,0,1,1,1,6,0,1,1,1,1,1,0,0,1,3,0,0,1,0,0,0,0,0,1,12,1,2,0,1,1,1,0,1,0,1,0,2,1,4,1,1,1,1,0,3,4,0,0,0,0,1,0,0,0,0,1,0,0,2,1,0,0,2,0,2,1,0,0,0,0,3,10,2
8,"District 12, Ho Chi Minh City",1,1,1,0,1,0,0,1,1,0,1,1,2,8,0,1,1,1,4,0,1,1,1,1,1,0,0,1,3,0,0,1,1,0,1,0,0,1,11,2,1,0,1,1,1,0,1,0,1,0,2,1,2,1,1,1,1,0,3,4,0,0,1,0,1,1,0,0,0,1,0,0,2,1,0,0,3,0,2,1,0,0,0,0,3,9,2
9,"District 2, Ho Chi Minh City",0,3,3,1,2,0,0,1,1,0,1,1,2,8,0,1,0,1,4,0,1,1,1,1,0,0,0,1,4,0,0,1,0,0,1,0,0,0,10,2,1,0,2,1,1,0,1,0,1,2,2,0,1,0,0,1,0,0,0,4,0,0,1,0,1,1,5,1,0,0,0,0,2,1,0,1,2,0,2,1,1,0,0,0,3,7,2


In [30]:
len((kl_grouped[kl_grouped["Café"]>0]))  

24

In [31]:
len((kl_grouped[kl_grouped["Coffee Shop"]>0]))  

23

There are 47 coffee shops at the centre of Ho Chi Minh City.

So now we want to select a good location where the no of coffee shop are less so that the business opportunity is higher.

## 4. Create a dataframe for coffee shop or café data 

In [32]:
kl_mall = kl_grouped[["Neighborhoods","Coffee Shop","Café"]]

In [33]:
# show the number of coffee shop in Ho Chi Minh city
kl_mall

Unnamed: 0,Neighborhoods,Coffee Shop,Café
0,Bình Chánh District,6,7
1,Bình Thạnh District,3,7
2,"Bình Tân District, Ho Chi Minh City",6,7
3,Cần Giờ District,2,9
4,Củ Chi District,0,8
5,"District 1, Ho Chi Minh City",4,6
6,"District 10, Ho Chi Minh City",5,7
7,"District 11, Ho Chi Minh City",6,7
8,"District 12, Ho Chi Minh City",4,8
9,"District 2, Ho Chi Minh City",4,8


## 5. Cluster the districts

In [46]:
# set number of clusters
kclusters = 3

kl_clustering = kl_mall.drop(["Neighborhoods"], 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 0, 1, 1, 2, 0, 0, 2, 2], dtype=int32)

In [47]:
# create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
kl_merged = kl_mall.copy()

# add clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_

In [48]:
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels
0,Bình Chánh District,6,7,0
1,Bình Thạnh District,3,7,2
2,"Bình Tân District, Ho Chi Minh City",6,7,0
3,Cần Giờ District,2,9,1
4,Củ Chi District,0,8,1
5,"District 1, Ho Chi Minh City",4,6,2
6,"District 10, Ho Chi Minh City",5,7,0
7,"District 11, Ho Chi Minh City",6,7,0
8,"District 12, Ho Chi Minh City",4,8,2
9,"District 2, Ho Chi Minh City",4,8,2


In [49]:
#Add latitude and longitude values by using the join operation(the new dataframe with the old dataframe containing the latitude and longitude values)
kl_merged['Latitude'] = HCM_coords['Latitude']
kl_merged['Longitude'] = HCM_coords['Longitude']
kl_merged.head()

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
0,Bình Chánh District,6,7,0,10.67922,106.57654
1,Bình Thạnh District,3,7,2,10.73684,106.61448
2,"Bình Tân District, Ho Chi Minh City",6,7,0,10.80518,106.6928
3,Cần Giờ District,2,9,1,10.41566,106.9613
4,Củ Chi District,0,8,1,10.97734,106.50223


In [50]:
# sorting the results by Cluster Labels
print(kl_merged.shape)
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged

(24, 6)


Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
0,Bình Chánh District,6,7,0,10.67922,106.57654
19,Nhà Bè District,5,7,0,10.70153,106.73818
15,"District 8, Ho Chi Minh City",6,7,0,10.76316,106.64314
14,"District 7, Ho Chi Minh City",5,7,0,10.76883,106.66599
13,"District 6, Ho Chi Minh City",6,7,0,10.82005,106.83182
12,"District 5, Ho Chi Minh City",6,7,0,10.74771,106.66334
22,Tân Bình District,6,7,0,10.78232,106.63667
10,"District 3, Ho Chi Minh City",5,7,0,10.74597,106.64769
11,"District 4, Ho Chi Minh City",6,6,0,10.70515,106.73748
7,"District 11, Ho Chi Minh City",6,7,0,10.77565,106.68672


In [51]:
kl_merged["Café"].max()

9

## 6. Visualize the result clusters

In [52]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [53]:
# save the map as HTML file
map_clusters.save('home\map_cluster.html')

## 7. Analyze the clusters

In [54]:
# cluster 0
kl_merged.loc[kl_merged['Cluster Labels'] == 0]
#len(kl_merged.loc[kl_merged['Cluster Labels'] == 0])# 

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
0,Bình Chánh District,6,7,0,10.67922,106.57654
19,Nhà Bè District,5,7,0,10.70153,106.73818
15,"District 8, Ho Chi Minh City",6,7,0,10.76316,106.64314
14,"District 7, Ho Chi Minh City",5,7,0,10.76883,106.66599
13,"District 6, Ho Chi Minh City",6,7,0,10.82005,106.83182
12,"District 5, Ho Chi Minh City",6,7,0,10.74771,106.66334
22,Tân Bình District,6,7,0,10.78232,106.63667
10,"District 3, Ho Chi Minh City",5,7,0,10.74597,106.64769
11,"District 4, Ho Chi Minh City",6,6,0,10.70515,106.73748
7,"District 11, Ho Chi Minh City",6,7,0,10.77565,106.68672


In [55]:
# cluster 1
kl_merged.loc[kl_merged['Cluster Labels'] == 1] 

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
4,Củ Chi District,0,8,1,10.97734,106.50223
3,Cần Giờ District,2,9,1,10.41566,106.9613


In [56]:
# cluster 2
kl_merged.loc[kl_merged['Cluster Labels'] == 2] 

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
9,"District 2, Ho Chi Minh City",4,8,2,10.75569,106.66637
5,"District 1, Ho Chi Minh City",4,6,2,10.78096,106.69911
16,"District 9, Ho Chi Minh City",3,8,2,10.85044,106.62731
17,Gò Vấp District,3,8,2,10.83379,106.66556
18,Hóc Môn District,4,7,2,10.88836,106.5964
1,Bình Thạnh District,3,7,2,10.73684,106.61448
20,Phú Nhuận District,4,7,2,10.79565,106.67464
21,Thủ Đức District,3,8,2,10.73684,106.61448
8,"District 12, Ho Chi Minh City",4,8,2,10.7667,106.70647
23,"Tân Phú District, Ho Chi Minh City",4,6,2,10.84626,106.76992


## 8. Conclusion

Based on the analysis report

1. Most of the coffee shop are concentrated in the centre of Ho Chi Minh which having the highest number of coffee shops in cluster 0 and cluster 2. The customer should not setup the business at these areas since the business opportunity is small because of having many coffee shops. 


2. The customer should consider to open the coffee shop in cluster 1 which have very less number of coffee shop.