# Capstone Project - Final Assignment
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In Ho Chi Minh City, in the south of Vietnam, people usually drink coffee in the morning, and the coffee shop is a common place to meet customers or hang out with friends. There are many kinds of coffee shops in Vietnam, from small shops on the pavement to luxury coffee shops with large gardens where children can play or groups can gather for big events.

Opening a coffee shop does not require a large investment or a big team, so many start-ups and franchises enter this business. Competition is therefore a major challenge to consider before starting. On the other hand, the location of a coffee shop plays an important role in determining whether it succeeds or fails.

The objective of this capstone project is to **analyze and select the best locations** in Ho Chi Minh City, Vietnam, and to report the findings as advice to the customer. Using **data science methodology and machine learning techniques**, this project aims to answer the business question: **If an investor is looking to open a new coffee shop in Ho Chi Minh City, Vietnam, where would you recommend?**

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* the list of districts in Ho Chi Minh City
* the number of existing coffee shops in each district
* the location and distribution of coffee shops across Ho Chi Minh City

We use the districts of Ho Chi Minh City, each represented by a single geocoded point, to define our neighborhoods.

The following data sources are needed to extract or generate the required information:
* the list of districts is scraped from the Wikipedia category page for the districts of Ho Chi Minh City
* approximate coordinates of each district are obtained with the Python **geocoder** package (ArcGIS provider)
* the number of venues, with their type and location, in every district is obtained using the **Foursquare API**
* the coordinates of the Ho Chi Minh City center are obtained by geocoding the city name with **Nominatim** (via geopy)

**1. Import libraries**

In [1]:
!pip install geocoder
from bs4 import BeautifulSoup
import urllib
import geocoder
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment this line out if folium is already installed
import folium # map rendering library

print('Libraries imported.')


**2. Collect the Data**

* Build a dataframe of the districts of Ho Chi Minh City, Vietnam, by scraping the Wikipedia page: https://en.wikipedia.org/wiki/Category:Districts_of_Ho_Chi_Minh_City
* Get the geographical coordinates of the districts with the Python Geocoder package.
* Obtain the venue data for the districts from the Foursquare API.
* Explore and cluster the neighbourhoods.
* Select the best cluster in which to open a new coffee shop.

In [2]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Districts_of_Ho_Chi_Minh_City").text

In [3]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [4]:
# create a list to store neighborhood data
neighborhoodList = []

In [5]:
# append the data into the list
for row in soup.find_all("div", class_="mw-category-generated")[0].find_all("li"):
    neighborhoodList.append(row.text)

In [6]:
# remove the "Template:List of HCMC Administrative Units" entry at index 0
del neighborhoodList[0]

In [7]:
# create a new DataFrame from the list
HCM_df = pd.DataFrame({"Neighborhood": neighborhoodList})
HCM_df

Unnamed: 0,Neighborhood
0,Bình Chánh District
1,"Bình Tân District, Ho Chi Minh City"
2,Bình Thạnh District
3,Cần Giờ District
4,Củ Chi District
5,"District 1, Ho Chi Minh City"
6,"District 3, Ho Chi Minh City"
7,"District 4, Ho Chi Minh City"
8,"District 5, Ho Chi Minh City"
9,"District 6, Ho Chi Minh City"
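
Several scraped names already end in ", Ho Chi Minh City", and the geocoding query built later in `get_latlng` appends the city name again. A minimal sketch of stripping that suffix first, assuming bare district names are wanted; `clean_name` is a hypothetical helper that is not part of the original notebook:

```python
def clean_name(raw_name, suffix=", Ho Chi Minh City"):
    """Strip the disambiguation suffix that some Wikipedia titles carry."""
    name = raw_name.strip()
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


# two names copied from the dataframe above
sample = ["Bình Chánh District", "District 1, Ho Chi Minh City"]
cleaned = [clean_name(n) for n in sample]
print(cleaned)
```

Whether the shorter names geocode better is not something this notebook tests, so treat it as an option rather than a required step.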


In [8]:
# print the number of rows of the dataframe
HCM_df.shape

(22, 1)

In [9]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Ho Chi Minh City, Vietnam'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
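
The `while` loop above never exits if a name cannot be geocoded. A minimal sketch of a bounded retry, assuming it is acceptable to return `None` after a few failed attempts; `retry_lookup` and `flaky_lookup` are hypothetical names, and `flaky_lookup` merely stands in for a call such as `geocoder.arcgis(...).latlng`:

```python
import time


def retry_lookup(lookup, query, max_attempts=3, delay_seconds=0.0):
    """Call lookup(query) until it returns a non-None value or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = lookup(query)
        if result is not None:
            return result
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    return None


# stand-in for the real geocoder: fails twice, then succeeds
calls = {"count": 0}


def flaky_lookup(query):
    calls["count"] += 1
    return [10.78096, 106.69911] if calls["count"] >= 3 else None


coords_found = retry_lookup(flaky_lookup, "District 1, Ho Chi Minh City, Vietnam")
coords_missing = retry_lookup(lambda q: None, "Unknown place", max_attempts=2)
print(coords_found, coords_missing)
```

Rows that come back as `None` can then be inspected or dropped explicitly instead of stalling the notebook.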

In [10]:
coords = [ get_latlng(neighborhood) for neighborhood in HCM_df["Neighborhood"].tolist() ]

In [11]:
coords

[[10.679220000000043, 106.57654000000008],
 [10.73684000000003, 106.61448000000007],
 [10.806080000000065, 106.69297000000006],
 [10.41566000000006, 106.96130000000005],
 [10.977340000000027, 106.50223000000005],
 [10.78096000000005, 106.69911000000008],
 [10.775650000000041, 106.68672000000004],
 [10.766700000000071, 106.70647000000008],
 [10.755690000000072, 106.66637000000009],
 [10.745970000000057, 106.64769000000007],
 [10.70515000000006, 106.73748000000006],
 [10.74771000000004, 106.66334000000006],
 [10.768830000000037, 106.66599000000008],
 [10.763160000000028, 106.64314000000007],
 [10.850440000000049, 106.62731000000008],
 [10.833790000000022, 106.66556000000008],
 [10.888360000000034, 106.59640000000007],
 [10.701530000000048, 106.73818000000006],
 [10.795650000000023, 106.67464000000007],
 [10.73684000000003, 106.61448000000007],
 [10.782320000000027, 106.63667000000004],
 [10.861779986287589, 106.79610692772711]]

In [12]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
HCM_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [13]:
# merge the coordinates into the original dataframe
HCM_df['Latitude'] = HCM_coords['Latitude']
HCM_df['Longitude'] = HCM_coords['Longitude']

In [14]:
# check the neighborhoods and the coordinates
print(HCM_df.shape)
HCM_df

(22, 3)


Unnamed: 0,Neighborhood,Latitude,Longitude
0,Bình Chánh District,10.67922,106.57654
1,"Bình Tân District, Ho Chi Minh City",10.73684,106.61448
2,Bình Thạnh District,10.80608,106.69297
3,Cần Giờ District,10.41566,106.9613
4,Củ Chi District,10.97734,106.50223
5,"District 1, Ho Chi Minh City",10.78096,106.69911
6,"District 3, Ho Chi Minh City",10.77565,106.68672
7,"District 4, Ho Chi Minh City",10.7667,106.70647
8,"District 5, Ho Chi Minh City",10.75569,106.66637
9,"District 6, Ho Chi Minh City",10.74597,106.64769


In [16]:
# get the coordinates of Ho Chi Minh city
address = 'Ho Chi Minh, Vietnam'

geolocator = Nominatim(user_agent="ntlgiang-capstone-project")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Ho Chi Minh City, Vietnam are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Ho Chi Minh City, Vietnam are 10.7758439, 106.7017555.


In [17]:
map_kl = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(HCM_df['Latitude'], HCM_df['Longitude'], HCM_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_kl)  
    
map_kl

In [18]:
# save the map as HTML file
map_kl.save('home/map_kl.html')

**3. Use the Foursquare API to explore the district data in Ho Chi Minh city, Vietnam**

In [19]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # replace with your own Foursquare client ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # replace with your own secret; do not print or publish it
VERSION = '20180604'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)

Your credentials:
CLIENT_ID: YOUR_FOURSQUARE_CLIENT_ID
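
One way to keep API keys out of a shared notebook is to read them from environment variables. A minimal sketch, assuming variables named `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` (both names are chosen here for illustration, as are the helpers `load_credentials` and `mask`):

```python
import os


def load_credentials(env=os.environ):
    """Read the Foursquare keys from a mapping, by default the process environment."""
    client_id = env.get("FOURSQUARE_CLIENT_ID")
    client_secret = env.get("FOURSQUARE_CLIENT_SECRET")
    if not client_id or not client_secret:
        raise RuntimeError("Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first")
    return client_id, client_secret


def mask(secret, visible=4):
    """Show only the first few characters of a secret when logging it."""
    return secret[:visible] + "*" * max(len(secret) - visible, 0)


# demo with a fake environment, so no real keys are needed
fake_env = {"FOURSQUARE_CLIENT_ID": "abc123", "FOURSQUARE_CLIENT_SECRET": "defghijk"}
demo_id, demo_secret = load_credentials(fake_env)
print("CLIENT_ID:", demo_id)
print("CLIENT_SECRET:", mask(demo_secret))
```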


In [20]:
radius = 20000
LIMIT = 1000

venues = []

for lat, long, neighborhood in zip(HCM_df['Latitude'], HCM_df['Longitude'], HCM_df['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
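
The request loop above formats the URL by hand and indexes `categories[0]` directly, which raises `IndexError` if an item has an empty category list. A minimal sketch of both steps using only the standard library; `build_explore_url` and `first_category_name` are hypothetical helpers, and the credentials are placeholders:

```python
from urllib.parse import parse_qs, urlencode, urlparse

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"  # same endpoint as above


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble the explore URL, letting urlencode escape each parameter."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def first_category_name(item, default="Unknown"):
    """Return the first category name of a venue item, or default if it has none."""
    categories = item.get("venue", {}).get("categories") or []
    return categories[0]["name"] if categories else default


url = build_explore_url("ID", "SECRET", "20180604", 10.78096, 106.69911, 20000, 1000)
query = parse_qs(urlparse(url).query)

# hand-written items that mimic the shape of the response used above
with_category = {"venue": {"name": "Starbucks", "categories": [{"name": "Coffee Shop"}]}}
without_category = {"venue": {"name": "Unnamed place", "categories": []}}
print(first_category_name(with_category), first_category_name(without_category))
```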

In [21]:
# convert the venues list into a new DataFrame
venues_df = pd.DataFrame(venues)

# define the column names
venues_df.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(venues_df.shape)
venues_df.head(15)

(2107, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Bình Chánh District,10.67922,106.57654,AEON Mall Bình Tân,10.742904,106.611836,Shopping Mall
1,Bình Chánh District,10.67922,106.57654,CGV Cinemas Sư Vạn Hạnh,10.770912,106.66967,Multiplex
2,Bình Chánh District,10.67922,106.57654,Christina's Saigon,10.765303,106.686612,Bed & Breakfast
3,Bình Chánh District,10.67922,106.57654,Tường Phong,10.751563,106.66499,Dessert Shop
4,Bình Chánh District,10.67922,106.57654,Starbucks,10.753839,106.669614,Coffee Shop
5,Bình Chánh District,10.67922,106.57654,Fusion Suites Sai Gon,10.772773,106.689894,Hotel
6,Bình Chánh District,10.67922,106.57654,Artinus 3D Painting Gallery,10.742991,106.694927,Art Gallery
7,Bình Chánh District,10.67922,106.57654,CGV Cinemas SC VivoCity,10.729917,106.703477,Multiplex
8,Bình Chánh District,10.67922,106.57654,Plan K BBQ,10.723099,106.710083,BBQ Joint
9,Bình Chánh District,10.67922,106.57654,Pizza 4P's,10.773301,106.697599,Pizza Place
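
With `radius = 20000`, each request covers a circle of 20 km around the district point, so venues far from a district can be attributed to it. A sketch of checking one row from the table above with the haversine formula; `haversine_km` is a helper introduced here, and 6371 km is the usual mean-radius approximation for the Earth:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


district_point = (10.67922, 106.57654)  # Bình Chánh District, from the table above
venue_point = (10.773301, 106.697599)   # Pizza 4P's, from the table above
distance_km = haversine_km(*district_point, *venue_point)
print(round(distance_km, 1))
```

The distance comes out at roughly 17 km, which is inside the search radius but well away from the district point. A smaller radius would make the per-district counts more local, at the cost of fewer venues per request.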


In [22]:
# check how many venues were returned for each neighbourhood
venues_df.groupby(["Neighborhood"]).count()

Unnamed: 0_level_0,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Bình Chánh District,100,100,100,100,100,100
Bình Thạnh District,100,100,100,100,100,100
"Bình Tân District, Ho Chi Minh City",100,100,100,100,100,100
Cần Giờ District,67,67,67,67,67,67
Củ Chi District,40,40,40,40,40,40
"District 1, Ho Chi Minh City",100,100,100,100,100,100
"District 10, Ho Chi Minh City",100,100,100,100,100,100
"District 11, Ho Chi Minh City",100,100,100,100,100,100
"District 12, Ho Chi Minh City",100,100,100,100,100,100
"District 3, Ho Chi Minh City",100,100,100,100,100,100


In [23]:
# check how many unique categories appear in the returned venues
print('There are {} unique categories.'.format(len(venues_df['VenueCategory'].unique())))

There are 83 unique categories.


In [24]:
# displaying the first 50 Venue Category names
venues_df['VenueCategory'].unique()[:50] 

array(['Shopping Mall', 'Multiplex', 'Bed & Breakfast', 'Dessert Shop',
       'Coffee Shop', 'Hotel', 'Art Gallery', 'BBQ Joint', 'Pizza Place',
       'Deli / Bodega', 'Park', 'Sushi Restaurant',
       'Vietnamese Restaurant', 'Indian Restaurant', 'Café', 'Whisky Bar',
       'Noodle House', 'Vegetarian / Vegan Restaurant', 'Flower Shop',
       'Asian Restaurant', 'Beer Bar', 'Department Store', 'Brewery',
       'Hotel Bar', 'Tapas Restaurant', 'Bar', 'Steakhouse',
       'Massage Studio', 'Supermarket', 'French Restaurant',
       'German Restaurant', 'Italian Restaurant', 'Spa',
       'Japanese Restaurant', 'Bookstore', 'Seafood Restaurant',
       'Hotpot Restaurant', 'Bistro', 'Clothing Store', 'Lounge',
       'Ramen Restaurant', 'Middle Eastern Restaurant',
       'Korean Restaurant', 'Opera House', 'Golf Course',
       'Mexican Restaurant', 'Public Art', 'Chinese Restaurant',
       'Burger Joint', 'Health & Beauty Service'], dtype=object)

In [25]:
# check if the results contain "Coffee Shop"
"Coffee Shop" in venues_df['VenueCategory'].unique()

True

**4. Analyze each district of Ho Chi Minh city**

In [26]:
# one hot encoding
kl_onehot = pd.get_dummies(venues_df[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
kl_onehot['Neighborhoods'] = venues_df['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [kl_onehot.columns[-1]] + list(kl_onehot.columns[:-1])
kl_onehot = kl_onehot[fixed_columns]

print(kl_onehot.shape)
kl_onehot.head(15)

(2107, 84)


Unnamed: 0,Neighborhoods,American Restaurant,Art Gallery,Asian Restaurant,BBQ Joint,Bakery,Bar,Beach,Beach Bar,Bed & Breakfast,Beer Bar,Bistro,Bookstore,Brewery,Buffet,Burger Joint,Café,Campground,Chinese Restaurant,Clothing Store,Coffee Shop,Convenience Store,Deli / Bodega,Department Store,Dessert Shop,Flea Market,Flower Shop,Food,French Restaurant,Garden,Garden Center,German Restaurant,Golf Course,Gun Range,Health & Beauty Service,Historic Site,History Museum,Hotel,Hotel Bar,Hotel Pool,Hotpot Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lighthouse,Lounge,Market,Massage Studio,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Multiplex,Noodle House,Opera House,Other Great Outdoors,Park,Pizza Place,Pool,Public Art,Racetrack,Ramen Restaurant,Resort,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Snack Place,Soup Place,Spa,Sports Bar,Steakhouse,Street Food Gathering,Supermarket,Surf Spot,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Theater,Theme Park,Tunnel,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar
0,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,Bình Chánh District,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
5,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,Bình Chánh District,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,Bình Chánh District,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,Bình Chánh District,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [27]:
# group rows by neighborhood and sum the occurrences of each category
kl_grouped = kl_onehot.groupby(["Neighborhoods"]).sum().reset_index()

print(kl_grouped.shape)
kl_grouped

(22, 84)


Unnamed: 0,Neighborhoods,American Restaurant,Art Gallery,Asian Restaurant,BBQ Joint,Bakery,Bar,Beach,Beach Bar,Bed & Breakfast,Beer Bar,Bistro,Bookstore,Brewery,Buffet,Burger Joint,Café,Campground,Chinese Restaurant,Clothing Store,Coffee Shop,Convenience Store,Deli / Bodega,Department Store,Dessert Shop,Flea Market,Flower Shop,Food,French Restaurant,Garden,Garden Center,German Restaurant,Golf Course,Gun Range,Health & Beauty Service,Historic Site,History Museum,Hotel,Hotel Bar,Hotel Pool,Hotpot Restaurant,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Korean Restaurant,Lighthouse,Lounge,Market,Massage Studio,Mediterranean Restaurant,Mexican Restaurant,Middle Eastern Restaurant,Multiplex,Noodle House,Opera House,Other Great Outdoors,Park,Pizza Place,Pool,Public Art,Racetrack,Ramen Restaurant,Resort,Restaurant,Sandwich Place,Seafood Restaurant,Shopping Mall,Snack Place,Soup Place,Spa,Sports Bar,Steakhouse,Street Food Gathering,Supermarket,Surf Spot,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Theater,Theme Park,Tunnel,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Whisky Bar
0,Bình Chánh District,0,1,1,2,0,1,0,0,1,2,1,1,1,0,0,6,0,0,1,4,0,1,1,3,0,1,0,2,0,0,1,0,0,0,0,0,14,2,0,1,1,2,3,1,0,1,0,3,0,0,1,4,3,1,0,3,4,0,0,0,1,0,0,0,1,1,0,0,2,0,3,0,2,0,1,1,0,0,0,0,4,7,2
1,Bình Thạnh District,0,0,1,1,0,1,0,0,1,2,0,1,1,0,1,7,0,1,1,2,0,0,1,2,0,0,0,4,0,0,1,0,0,1,0,0,16,3,0,1,1,2,3,1,0,2,0,3,0,1,1,3,2,0,0,1,4,0,1,0,1,1,2,0,0,0,0,0,2,0,3,0,1,0,1,1,0,0,0,0,4,8,2
2,"Bình Tân District, Ho Chi Minh City",0,1,1,2,0,1,0,0,1,2,1,1,1,0,0,6,0,0,1,4,0,1,1,3,0,1,0,2,0,0,1,1,0,0,0,0,14,2,0,1,1,2,3,1,0,1,0,3,0,0,1,4,3,0,0,3,4,0,0,0,1,0,0,0,1,1,0,0,2,0,3,0,2,0,1,1,0,0,0,0,4,7,2
3,Cần Giờ District,0,0,4,1,0,0,3,1,1,0,0,0,0,0,0,8,1,0,0,2,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,7,0,1,0,1,1,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,3,0,1,0,6,1,0,9,1,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,6,0
4,Củ Chi District,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,10,0,0,0,0,0,0,0,1,1,0,1,0,1,1,0,2,1,0,2,2,2,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,1,0,0,1,0,0,0,0,0,0,1,0,1,2,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0,3,0
5,"District 1, Ho Chi Minh City",0,1,1,1,0,1,0,0,1,2,1,1,1,0,1,6,0,1,1,2,1,0,1,2,0,1,0,4,0,0,1,0,0,0,0,0,13,3,0,1,1,2,3,1,0,2,0,3,0,1,1,3,2,1,0,1,4,0,0,0,1,0,0,0,0,0,0,0,2,0,3,0,1,0,1,1,1,1,0,0,5,9,2
6,"District 10, Ho Chi Minh City",0,1,1,2,0,1,0,0,1,2,1,1,1,0,0,6,0,0,1,3,0,1,1,3,0,1,0,2,0,0,1,0,0,0,0,0,14,2,0,1,1,2,3,1,0,1,0,3,0,1,1,4,3,1,0,2,4,0,0,0,1,0,0,0,1,1,0,0,2,0,3,0,1,0,1,1,0,0,0,0,4,9,2
7,"District 11, Ho Chi Minh City",0,1,1,2,0,1,0,0,1,2,1,1,1,0,0,6,0,0,1,3,0,1,1,3,0,1,0,2,0,0,1,0,0,0,0,0,14,2,0,1,1,2,3,1,0,1,0,3,0,1,1,4,3,0,0,3,4,0,0,0,1,0,0,0,1,1,0,0,2,0,3,0,2,0,1,1,0,0,0,0,4,8,2
8,"District 12, Ho Chi Minh City",0,1,1,1,0,1,0,0,1,1,0,1,1,0,1,8,0,1,1,3,0,0,1,3,0,1,0,3,0,0,1,1,0,1,0,0,12,3,0,1,1,2,3,0,0,1,0,3,0,1,1,3,2,0,0,3,4,0,1,0,1,1,0,0,1,1,0,0,2,0,2,0,2,0,1,1,0,0,0,0,4,8,2
9,"District 3, Ho Chi Minh City",0,1,1,1,0,1,0,0,1,2,1,1,1,0,0,6,0,1,1,3,0,0,1,3,0,1,0,3,0,0,1,0,0,0,0,0,13,2,0,1,1,2,3,1,0,2,0,3,0,1,1,4,3,1,0,2,3,0,0,0,1,0,0,0,1,0,0,0,2,0,3,0,1,0,1,1,0,1,0,0,5,9,2
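
One-hot encoding followed by a per-district sum amounts to counting how often each category occurs in each district. A pure-Python sketch of the same idea, using the (district, category) pairs from the first ten rows of `venues_df` shown earlier; `category_counts` is a name introduced here:

```python
from collections import Counter, defaultdict

# (district, category) pairs from the first ten venues_df rows
rows = [
    ("Bình Chánh District", "Shopping Mall"),
    ("Bình Chánh District", "Multiplex"),
    ("Bình Chánh District", "Bed & Breakfast"),
    ("Bình Chánh District", "Dessert Shop"),
    ("Bình Chánh District", "Coffee Shop"),
    ("Bình Chánh District", "Hotel"),
    ("Bình Chánh District", "Art Gallery"),
    ("Bình Chánh District", "Multiplex"),
    ("Bình Chánh District", "BBQ Joint"),
    ("Bình Chánh District", "Pizza Place"),
]

# one Counter per district: the equivalent of one row of the grouped table
category_counts = defaultdict(Counter)
for district, category in rows:
    category_counts[district][category] += 1

print(category_counts["Bình Chánh District"].most_common(3))
```

Because only ten rows are used, these counts are smaller than the values in the grouped table above.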


In [28]:
# count the districts with at least one venue in the Café category
len(kl_grouped[kl_grouped["Café"] > 0])

22

In [29]:
# count the districts with at least one venue in the Coffee Shop category
len(kl_grouped[kl_grouped["Coffee Shop"] > 0])

21

All 22 districts have at least one venue in the Café category and 21 have at least one in the Coffee Shop category, so coffee venues appear in every district of Ho Chi Minh City. Next, we cluster the districts by these counts to find the right location for the new coffee shop.

**5. Create the Dataframe for "Coffee shop" or "Café" data**

In [30]:
kl_mall = kl_grouped[["Neighborhoods","Coffee Shop","Café"]]

In [31]:
# show the number of Coffee Shop and Café venues per district
kl_mall

Unnamed: 0,Neighborhoods,Coffee Shop,Café
0,Bình Chánh District,4,6
1,Bình Thạnh District,2,7
2,"Bình Tân District, Ho Chi Minh City",4,6
3,Cần Giờ District,2,8
4,Củ Chi District,0,10
5,"District 1, Ho Chi Minh City",2,6
6,"District 10, Ho Chi Minh City",3,6
7,"District 11, Ho Chi Minh City",3,6
8,"District 12, Ho Chi Minh City",3,8
9,"District 3, Ho Chi Minh City",3,6


**6. Use k-means clustering to group the districts by their "Coffee Shop" and "Café" counts**

In [32]:
# set number of clusters
kclusters = 3

kl_clustering = kl_mall.drop(["Neighborhoods"], axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(kl_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([0, 2, 0, 2, 1, 2, 0, 0, 2, 0], dtype=int32)

In [33]:
# create a new dataframe that includes the cluster label alongside the Coffee Shop and Café counts for each neighborhood
kl_merged = kl_mall.copy()

# add clustering labels
kl_merged["Cluster Labels"] = kmeans.labels_

In [34]:
kl_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
kl_merged

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels
0,Bình Chánh District,4,6,0
1,Bình Thạnh District,2,7,2
2,"Bình Tân District, Ho Chi Minh City",4,6,0
3,Cần Giờ District,2,8,2
4,Củ Chi District,0,10,1
5,"District 1, Ho Chi Minh City",2,6,2
6,"District 10, Ho Chi Minh City",3,6,0
7,"District 11, Ho Chi Minh City",3,6,0
8,"District 12, Ho Chi Minh City",3,8,2
9,"District 3, Ho Chi Minh City",3,6,0
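
k-means assigns each district to its nearest centroid, and each centroid is the mean of the points in its cluster. A sketch that recomputes the centroids from the ten labelled rows shown above; since the full dataframe has 22 rows, these values will differ from the fitted `kmeans.cluster_centers_`:

```python
from collections import defaultdict

# (Coffee Shop, Café, Cluster Label) for the ten rows shown above
rows = [
    (4, 6, 0), (2, 7, 2), (4, 6, 0), (2, 8, 2), (0, 10, 1),
    (2, 6, 2), (3, 6, 0), (3, 6, 0), (3, 8, 2), (3, 6, 0),
]

# collect the points belonging to each cluster
members = defaultdict(list)
for coffee, cafe, label in rows:
    members[label].append((coffee, cafe))

# a centroid is the coordinate-wise mean of its members
centroids = {
    label: (
        sum(p[0] for p in points) / len(points),
        sum(p[1] for p in points) / len(points),
    )
    for label, points in members.items()
}

for label in sorted(centroids):
    print(label, centroids[label])
```

Reading the centroids this way makes it easier to describe what separates the clusters, for example more Coffee Shop venues versus more Café venues.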


In [35]:
# add latitude and longitude values from HCM_coords (assigned by row index, not by district name)
kl_merged['Latitude'] = HCM_coords['Latitude']
kl_merged['Longitude'] = HCM_coords['Longitude']
kl_merged.head()

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
0,Bình Chánh District,4,6,0,10.67922,106.57654
1,Bình Thạnh District,2,7,2,10.73684,106.61448
2,"Bình Tân District, Ho Chi Minh City",4,6,0,10.80608,106.69297
3,Cần Giờ District,2,8,2,10.41566,106.9613
4,Củ Chi District,0,10,1,10.97734,106.50223
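
The assignment above matches rows by index position, not by district name. Because `groupby` returns the districts in sorted order, that order can differ from `HCM_df`: Bình Thạnh District has latitude 10.80608 in the earlier coordinates table but 10.73684 here. A sketch of joining on the name instead, using three rows copied from the tables above; the frame names `coords_df` and `counts_df` are illustrative:

```python
import pandas as pd

# coordinates in the order of HCM_df
coords_df = pd.DataFrame({
    "Neighborhood": [
        "Bình Chánh District",
        "Bình Tân District, Ho Chi Minh City",
        "Bình Thạnh District",
    ],
    "Latitude": [10.67922, 10.73684, 10.80608],
    "Longitude": [106.57654, 106.61448, 106.69297],
})

# counts in the order produced by groupby
counts_df = pd.DataFrame({
    "Neighborhood": [
        "Bình Chánh District",
        "Bình Thạnh District",
        "Bình Tân District, Ho Chi Minh City",
    ],
    "Coffee Shop": [4, 2, 4],
    "Café": [6, 7, 6],
})

# match rows on the district name rather than on position
merged = counts_df.merge(coords_df, on="Neighborhood", how="left")
print(merged)
```

With a key-based merge, each district keeps its own coordinates regardless of row order.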


In [36]:
# sorting the results by Cluster Labels
print(kl_merged.shape)
kl_merged.sort_values(["Cluster Labels"], inplace=True)
kl_merged

(22, 6)


Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
0,Bình Chánh District,4,6,0,10.67922,106.57654
18,Phú Nhuận District,3,6,0,10.79565,106.67464
17,Nhà Bè District,4,6,0,10.70153,106.73818
16,Hóc Môn District,3,6,0,10.88836,106.5964
14,"District 8, Ho Chi Minh City",4,6,0,10.85044,106.62731
13,"District 7, Ho Chi Minh City",4,6,0,10.76316,106.64314
12,"District 6, Ho Chi Minh City",4,6,0,10.76883,106.66599
11,"District 5, Ho Chi Minh City",4,6,0,10.74771,106.66334
20,Tân Bình District,4,6,0,10.78232,106.63667
21,"Tân Phú District, Ho Chi Minh City",3,6,0,10.86178,106.796107


In [37]:
kl_merged["Café"].max()

10

**7. Visualize the clusters on Ho Chi Minh city map**

In [38]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(kl_merged['Latitude'], kl_merged['Longitude'], kl_merged['Neighborhood'], kl_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [39]:
# save the map as HTML file
map_clusters.save('home/map_cluster.html')

**8. Analyze the clusters**

In [40]:
# cluster 0
kl_merged.loc[kl_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
0,Bình Chánh District,4,6,0,10.67922,106.57654
18,Phú Nhuận District,3,6,0,10.79565,106.67464
17,Nhà Bè District,4,6,0,10.70153,106.73818
16,Hóc Môn District,3,6,0,10.88836,106.5964
14,"District 8, Ho Chi Minh City",4,6,0,10.85044,106.62731
13,"District 7, Ho Chi Minh City",4,6,0,10.76316,106.64314
12,"District 6, Ho Chi Minh City",4,6,0,10.76883,106.66599
11,"District 5, Ho Chi Minh City",4,6,0,10.74771,106.66334
20,Tân Bình District,4,6,0,10.78232,106.63667
21,"Tân Phú District, Ho Chi Minh City",3,6,0,10.86178,106.796107


In [41]:
# cluster 1
kl_merged.loc[kl_merged['Cluster Labels'] == 1] 

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
4,Củ Chi District,0,10,1,10.97734,106.50223


In [42]:
# cluster 2
kl_merged.loc[kl_merged['Cluster Labels'] == 2] 

Unnamed: 0,Neighborhood,Coffee Shop,Café,Cluster Labels,Latitude,Longitude
8,"District 12, Ho Chi Minh City",3,8,2,10.75569,106.66637
5,"District 1, Ho Chi Minh City",2,6,2,10.78096,106.69911
15,Gò Vấp District,2,8,2,10.83379,106.66556
3,Cần Giờ District,2,8,2,10.41566,106.9613
1,Bình Thạnh District,2,7,2,10.73684,106.61448
19,Thủ Đức,2,7,2,10.73684,106.61448
10,"District 4, Ho Chi Minh City",2,6,2,10.70515,106.73748


## Results and Discussion <a name="results"></a>

Our analysis shows that a large number of coffee shops are located in the centre of Ho Chi Minh City, in the districts of Clusters 0 and 2. Opening a new business there is not advisable because it would face strong competition.

In Củ Chi District (Cluster 1), Foursquare lists 10 venues in the Café category and none in the Coffee Shop category, and the district returned only 40 venues in total, compared with 100 for most of the districts shown. We therefore recommend this district to our customer as the location for the new business.


## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify which districts have fewer coffee shops, in order to help the customer narrow down the search for an optimal location for a new coffee shop. Using venue data from Foursquare, we counted the Coffee Shop and Café venues in each district and then clustered the districts by these counts, to show how coffee shops are distributed and which zones contain the greatest number of them. This helps the customer make the decision.

The final decision on the optimal coffee shop location will be made by the customer, based on the specific characteristics of the neighborhoods and locations in every recommended zone.