# Week 5 - Capstone Project - The Battle of Neighborhoods

## About this project: "Opening a New Restaurant in Hanoi, Vietnam"
Problem: In Hanoi, if someone is looking to open a restaurant, where would you recommend that they open it? To solve this problem, we should:
- Build a dataframe of neighborhoods in Hanoi, Vietnam by scraping the data from a Wikipedia page
- Get the geographical coordinates of the neighborhoods
- Obtain the venue data for the neighborhoods from the Foursquare API
- Explore and cluster the neighborhoods
- Select the best cluster to open a new restaurant

### 1. Import libraries

In [2]:
pip install geopy

Collecting geopy
  Downloading https://files.pythonhosted.org/packages/0c/67/915668d0e286caa21a1da82a85ffe3d20528ec7212777b43ccd027d94023/geopy-2.1.0-py3-none-any.whl (112kB)
Collecting geographiclib<2,>=1.49 (from geopy)
  Downloading https://files.pythonhosted.org/packages/8b/62/26ec95a98ba64299163199e95ad1b0e34ad3f4e176e221c40245f211e425/geographiclib-1.50-py3-none-any.whl
Installing collected packages: geographiclib, geopy
Successfully installed geographiclib-1.50 geopy-2.1.0
Note: you may need to restart the kernel to use updated packages.


In [17]:
pip install geocoder

Collecting geocoder
  Downloading https://files.pythonhosted.org/packages/4f/6b/13166c909ad2f2d76b929a4227c952630ebaf0d729f6317eb09cbceccbab/geocoder-1.38.1-py2.py3-none-any.whl (98kB)
Collecting ratelim (from geocoder)
  Downloading https://files.pythonhosted.org/packages/f2/98/7e6d147fd16a10a5f821db6e25f192265d6ecca3d82957a4fdd592cad49c/ratelim-0.1.6-py2.py3-none-any.whl
Installing collected packages: ratelim, geocoder
Successfully installed geocoder-1.38.1 ratelim-0.1.6
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install BeautifulSoup4

Collecting BeautifulSoup4
  Downloading https://files.pythonhosted.org/packages/d1/41/e6495bd7d3781cee623ce23ea6ac73282a373088fcd0ddc809a047b18eae/beautifulsoup4-4.9.3-py3-none-any.whl (115kB)
Collecting soupsieve>1.2; python_version >= "3.0" (from BeautifulSoup4)
  Downloading https://files.pythonhosted.org/packages/02/fb/1c65691a9aeb7bd6ac2aa505b84cb8b49ac29c976411c6ab3659425e045f/soupsieve-2.1-py3-none-any.whl
Installing collected packages: soupsieve, BeautifulSoup4
Successfully installed BeautifulSoup4-4.9.3 soupsieve-2.1
Note: you may need to restart the kernel to use updated packages.


In [18]:
# Import libraries
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import geocoder 

import requests # library to handle requests

from bs4 import BeautifulSoup # library to parse HTML and XML documents

from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

print("Libraries imported.")

Libraries imported.


### 2. Scrape data from Wikipedia page into DataFrame

In [5]:
# send the GET request
data = requests.get("https://en.wikipedia.org/wiki/Category:Districts_of_Hanoi").text

In [6]:
# parse data from the html into a beautifulsoup object
soup = BeautifulSoup(data, 'html.parser')

In [10]:
# create a list to store neighborhood data
neighborhoodList = []

In [11]:
# append the data into the list
for row in soup.find_all("div", class_="mw-category")[0].find_all("li"):
    neighborhoodList.append(row.text.replace(" District",""))
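
The outputs further down show names such as "Hoàng Mai, Hanoi" and "Sơn Tây, Hanoi", which suggests that some Wikipedia titles carry a ", Hanoi" disambiguation suffix that the `replace(" District", "")` call leaves in place. A minimal sketch of a stricter cleaning step follows. The helper name `clean_district_name` and the raw titles in the usage lines are hypothetical, and the notebook itself does not apply this step.

```python
def clean_district_name(raw_name):
    """Drop the " District" label and a trailing ", Hanoi" disambiguator."""
    name = raw_name.replace(" District", "").strip()
    suffix = ", Hanoi"
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


# Hypothetical raw titles, shaped like the entries in the category page
for raw in ["Ba Đình District", "Hoàng Mai District, Hanoi", "Sơn Tây, Hanoi"]:
    print(clean_district_name(raw))
```

If a step like this were used, the cleaned names would also change the geocoding query built in the next section, so the coordinates might differ from the ones shown in this notebook.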

In [12]:
# create a new DataFrame from the list
df_hanoi = pd.DataFrame({"Neighborhood": neighborhoodList})

df_hanoi.head()

Unnamed: 0,Neighborhood
0,Ba Đình
1,Ba Vì
2,Bắc Từ Liêm
3,Cầu Giấy
4,Chương Mỹ


In [13]:
# shape of our dataframe
df_hanoi.shape

(30, 1)

That means Hanoi has 30 neighborhoods (districts). Let's continue.

### 3. Get the geographical coordinates

In [14]:
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Hanoi, Vietnam'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords
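
The `while` loop above never exits if the geocoder keeps returning `None` for a name. A bounded variant is sketched below as one possible safeguard. The function name `get_latlng_bounded`, the `geocode` callable, and the `max_attempts` and `delay_seconds` parameters are assumptions introduced for illustration. In this notebook, `geocode` could be `lambda q: geocoder.arcgis(q).latlng`.

```python
import time


def get_latlng_bounded(neighborhood, geocode, max_attempts=5, delay_seconds=1.0):
    """Return [lat, lng] for a neighborhood, or None after max_attempts failures.

    `geocode` is any callable that takes a query string and returns
    [lat, lng] or None (an assumption, injected so the sketch runs offline).
    """
    query = "{}, Hanoi, Vietnam".format(neighborhood)
    for attempt in range(max_attempts):
        latlng = geocode(query)
        if latlng is not None:
            return latlng
        # wait before retrying, but not after the final attempt
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)
    return None
```

Returning `None` instead of looping forever means the caller has to handle missing coordinates, for example by dropping those rows before mapping.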

In [19]:
# call the function to get the coordinates, store in a new list using list comprehension
coords = [ get_latlng(neighborhood) for neighborhood in df_hanoi["Neighborhood"].tolist() ]

In [20]:
coords

[[21.022010000000023, 105.81934000000007],
 [21.083330000000046, 105.38333000000006],
 [20.99831000000006, 105.75457000000006],
 [21.035596667556106, 105.80597932307698],
 [21.029090000000053, 105.82682000000005],
 [21.089800000000025, 105.66401000000008],
 [21.072940000000074, 105.78559000000007],
 [20.99702961960129, 105.88268682260897],
 [21.0113810087877, 105.91717007525097],
 [20.97031000000004, 105.78181000000006],
 [21.155050000000074, 105.73429000000004],
 [21.08627000000007, 105.77028000000007],
 [21.07884000000007, 105.81941000000006],
 [21.02843000000007, 105.83206000000007],
 [21.043110522938534, 105.86527602847573],
 [20.973070000000064, 105.77827000000008],
 [21.014580000000024, 105.85160000000008],
 [20.99831000000006, 105.75457000000006],
 [21.233560000000068, 105.43031000000008],
 [21.03227000000004, 105.85244000000006],
 [21.00334130793233, 105.65226374280645],
 [21.27515421204921, 105.88907311722683],
 [21.032794055797336, 105.83013917169016],
 [21.066670000000045, 1

In [21]:
# create temporary dataframe to populate the coordinates into Latitude and Longitude
df_coords = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

In [22]:
# merge the coordinates into the original dataframe
df_hanoi['Latitude'] = df_coords['Latitude']
df_hanoi['Longitude'] = df_coords['Longitude']

In [24]:
# check neighborhoods and coordinates
df_hanoi

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Ba Đình,21.02201,105.81934
1,Ba Vì,21.08333,105.38333
2,Bắc Từ Liêm,20.99831,105.75457
3,Cầu Giấy,21.035597,105.805979
4,Chương Mỹ,21.02909,105.82682
5,Đan Phượng,21.0898,105.66401
6,Đông Anh,21.07294,105.78559
7,Đống Đa,20.99703,105.882687
8,Gia Lâm,21.011381,105.91717
9,Hà Đông,20.97031,105.78181
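
The raw `coords` list above contains the pair (20.99831, 105.75457) twice, and the merged table later in the notebook lists that same pair for both Bắc Từ Liêm and Nam Từ Liêm. Identical coordinates for two districts suggest that the geocoder fell back to a shared location. The sketch below is one way to surface such cases. The helper name `find_shared_coordinates` and the `precision` parameter are assumptions, and the sample rows are copied from the notebook's outputs.

```python
from collections import defaultdict


def find_shared_coordinates(rows, precision=5):
    """Group (name, lat, lng) rows by rounded coordinates; keep groups of 2+."""
    groups = defaultdict(list)
    for name, lat, lng in rows:
        key = (round(lat, precision), round(lng, precision))
        groups[key].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


sample_rows = [
    ("Ba Vì", 21.08333, 105.38333),
    ("Bắc Từ Liêm", 20.99831, 105.75457),
    ("Nam Từ Liêm", 20.99831, 105.75457),
]
print(find_shared_coordinates(sample_rows))
```

Neighborhoods that share a coordinate pair will also receive identical Foursquare results, so they are worth checking by hand before clustering.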


In [25]:
# save the DataFrame as CSV file
df_hanoi.to_csv("Hanoi_district_coordinates.csv", index=False)

### 4. Create a map of Hanoi with neighborhoods superimposed on top

In [27]:
# get the coordinates of Hanoi
address = 'Hanoi, Vietnam'

geolocator = Nominatim(user_agent="hanoi_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Hanoi, Vietnam are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Hanoi, Vietnam are 21.0294498, 105.8544441.


In [28]:
# create map of Hanoi using latitude and longitude values
map_hanoi = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, neighborhood in zip(df_hanoi['Latitude'], df_hanoi['Longitude'], df_hanoi['Neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_hanoi)  
    
map_hanoi

In [29]:
# save the map as HTML file
map_hanoi.save('Hanoi_district_coordinates_map.html')

### 5. Use the Foursquare API to explore the neighborhoods

In [30]:
# define Foursquare credential
CLIENT_ID = 'LGQWG0TYWVC3YYGB5DNMW5GIKIQD2OCZJJQ0SH0N0SFFFLJT' # your Foursquare ID
CLIENT_SECRET = '5EODL4DTA4XXL40GVKHXV0CM422X3HZPVEES0AHI2JKHY5AF' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: LGQWG0TYWVC3YYGB5DNMW5GIKIQD2OCZJJQ0SH0N0SFFFLJT
CLIENT_SECRET:5EODL4DTA4XXL40GVKHXV0CM422X3HZPVEES0AHI2JKHY5AF


#### Now, let's get the top 100 venues that are within a radius of 2000 meters

In [31]:
radius = 2000

venues = []

for lat, long, neighborhood in zip(df_hanoi['Latitude'], df_hanoi['Longitude'], df_hanoi['Neighborhood']):
    
    # create the API request URL
    url = "https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}".format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        lat,
        long,
        radius, 
        LIMIT)
    
    # make the GET request
    results = requests.get(url).json()["response"]['groups'][0]['items']
    
    # return only relevant information for each nearby venue
    for venue in results:
        venues.append((
            neighborhood,
            lat, 
            long, 
            venue['venue']['name'], 
            venue['venue']['location']['lat'], 
            venue['venue']['location']['lng'],  
            venue['venue']['categories'][0]['name']))
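
The loop above indexes straight into `["response"]['groups'][0]['items']`, so it raises an exception if a request fails or a venue has no category. The sketch below shows one defensive way to build the same explore URL and read the payload. The helper names `build_explore_url` and `extract_venues` are hypothetical, and the payload shape is assumed from the keys the loop already uses.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


def extract_venues(payload):
    """Return (name, lat, lng, category) tuples, or [] if the payload is empty."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        return []
    rows = []
    for item in groups[0].get("items", []):
        venue = item.get("venue", {})
        location = venue.get("location", {})
        # fall back to an empty category when the list is missing or empty
        categories = venue.get("categories") or [{}]
        rows.append((venue.get("name"), location.get("lat"),
                     location.get("lng"), categories[0].get("name")))
    return rows
```

With these helpers, the loop body could call `extract_venues(requests.get(url).json())` and skip neighborhoods for which the list comes back empty.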

In [32]:
# convert the venues list into a new DataFrame
df_venues = pd.DataFrame(venues)

# define the column names
df_venues.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueLatitude', 'VenueLongitude', 'VenueCategory']

print(df_venues.shape)
df_venues.head()

(1071, 7)


Unnamed: 0,Neighborhood,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
0,Ba Đình,21.02201,105.81934,1946,21.01883,105.821899,Vietnamese Restaurant
1,Ba Đình,21.02201,105.81934,Nhật Cường Mobile 12 Láng Hạ,21.020113,105.817417,Tiki Bar
2,Ba Đình,21.02201,105.81934,割烹 㐂六(キロク),21.027993,105.81013,Japanese Restaurant
3,Ba Đình,21.02201,105.81934,博多幸龍,21.020768,105.817985,Ramen Restaurant
4,Ba Đình,21.02201,105.81934,Chợ Thành Công,21.022261,105.812759,Market
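
To see how the 2000-meter radius relates to the returned rows, the distance between a neighborhood's coordinates and a venue can be checked with the haversine formula. This is a minimal sketch: the function name `haversine_m` and the mean Earth radius of 6,371 km are assumptions, and the two points are taken from the first row of the table above.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in meters (assumed value)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Ba Đình coordinates and the venue "1946" from the table above
distance = haversine_m(21.02201, 105.81934, 21.01883, 105.821899)
print(round(distance), "meters; inside the 2000 m radius:", distance <= 2000)
```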


As the shape shows, the API returned 1071 venues across the Hanoi neighborhoods.

#### Let's check how many venues were returned for each neighborhood

In [34]:
df_venues.groupby(['Neighborhood']).count()

Unnamed: 0_level_0,Latitude,Longitude,VenueName,VenueLatitude,VenueLongitude,VenueCategory
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Ba Vì,7,7,7,7,7,7
Ba Đình,100,100,100,100,100,100
Bắc Từ Liêm,5,5,5,5,5,5
Chương Mỹ,100,100,100,100,100,100
Cầu Giấy,100,100,100,100,100,100
Gia Lâm,4,4,4,4,4,4
Hai Bà Trưng,6,6,6,6,6,6
Hoài Đức,6,6,6,6,6,6
Hoàn Kiếm,48,48,48,48,48,48
"Hoàng Mai, Hanoi",100,100,100,100,100,100

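The counts above vary widely: several neighborhoods hit the limit of 100 venues, while others return fewer than ten. Frequencies computed from a handful of venues are coarse, because a single venue in a four-venue neighborhood already accounts for 0.25. The sketch below flags such neighborhoods. The helper name `flag_sparse_neighborhoods` and the threshold of 10 are assumptions, and the counts are copied from the table above.

```python
def flag_sparse_neighborhoods(venue_counts, min_venues=10):
    """Return the names of neighborhoods with fewer than min_venues venues."""
    return sorted(name for name, count in venue_counts.items() if count < min_venues)


# counts taken from the groupby output above
venue_counts = {"Ba Vì": 7, "Ba Đình": 100, "Bắc Từ Liêm": 5, "Gia Lâm": 4, "Hoàn Kiếm": 48}
print(flag_sparse_neighborhoods(venue_counts))
```
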

#### Let's find out how many unique categories can be curated from all the returned venues

In [35]:
print('There are {} unique categories.'.format(len(df_venues['VenueCategory'].unique())))

There are 146 unique categories.


In [36]:
# print out the list categories
df_venues['VenueCategory'].unique()

array(['Vietnamese Restaurant', 'Tiki Bar', 'Japanese Restaurant',
       'Ramen Restaurant', 'Market', 'Lake', 'Pizza Place', 'Beer Bar',
       'Restaurant', 'Coffee Shop', 'Hotpot Restaurant', 'Hotel',
       'Rock Club', 'Multiplex', 'Escape Room', 'Movie Theater',
       'Russian Restaurant', 'Scenic Lookout', 'Sushi Restaurant',
       'Massage Studio', 'Dessert Shop', 'Wine Bar', 'Café',
       'Confucian Temple', 'Steakhouse', 'BBQ Joint',
       'Chinese Restaurant', 'Ice Cream Shop', 'Wings Joint', 'Tea Room',
       'Supermarket', 'Korean Restaurant', 'Shopping Mall', 'Park',
       'Noodle House', 'Vegetarian / Vegan Restaurant',
       'Fried Chicken Joint', 'Bar', 'Wedding Hall', 'Spa',
       'Himalayan Restaurant', 'Bookstore', 'Peruvian Restaurant',
       'Mongolian Restaurant', 'Sandwich Place', 'History Museum',
       'Soccer Stadium', 'Bakery', 'Bulgarian Restaurant', 'Karaoke Bar',
       'Fast Food Restaurant', 'Food Court', 'Arepa Restaurant', 'Bridge',
       

### 6. Analyze each neighborhood

In [38]:
# one hot encoding
hanoi_onehot = pd.get_dummies(df_venues[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
hanoi_onehot['Neighborhoods'] = df_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [hanoi_onehot.columns[-1]] + list(hanoi_onehot.columns[:-1])
hanoi_onehot = hanoi_onehot[fixed_columns]

hanoi_onehot.head()

Unnamed: 0,Neighborhoods,Arepa Restaurant,Armenian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Australian Restaurant,BBQ Joint,Baby Store,Bakery,Bar,Bed & Breakfast,Beer Bar,Beer Garden,Bistro,Bookstore,Breakfast Spot,Brewery,Bridge,Bubble Tea Shop,Buddhist Temple,Buffet,Building,Bulgarian Restaurant,Burger Joint,Bus Station,Café,Camera Store,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,Confucian Temple,Cultural Center,Czech Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Donut Shop,Electronics Store,Escape Room,Event Space,Farm,Fast Food Restaurant,Fish & Chips Shop,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Garden,Gastropub,Grocery Store,Gym,Gym / Fitness Center,Himalayan Restaurant,History Museum,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Housing Development,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Karaoke Bar,Kebab Restaurant,Korean Restaurant,Lake,Latin American Restaurant,Lounge,Malay Restaurant,Market,Massage Studio,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Mongolian Restaurant,Monument / Landmark,Motel,Movie Theater,Multiplex,Museum,National Park,Neighborhood,Nightclub,Noodle House,Opera House,Pakistani Restaurant,Park,Pastry Shop,Peruvian Restaurant,Photography Studio,Pizza Place,Polish Restaurant,Pub,Ramen Restaurant,Resort,Restaurant,Rock Club,Roof Deck,Russian Restaurant,Salad Place,Sandwich Place,Satay Restaurant,Scenic Lookout,Seafood Restaurant,Shopping Mall,Smoothie Shop,Snack Place,Soccer Field,Soccer Stadium,Soup Place,Spa,Sports Bar,Stadium,Steakhouse,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tapas Restaurant,Tea Room,Temple,Thai Restaurant,Thrift / Vintage Store,Tiki Bar,Train Station,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Village,Water Park,Wedding Hall,Wine Bar,Wings Joint,Women's Store,Zoo
0,Ba Đình,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,Ba Đình,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,Ba Đình,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,Ba Đình,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Ba Đình,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


#### Let's group rows by neighborhood and take the mean frequency of occurrence of each category

In [39]:
hanoi_grouped = hanoi_onehot.groupby(["Neighborhoods"]).mean().reset_index()

hanoi_grouped

Unnamed: 0,Neighborhoods,Arepa Restaurant,Armenian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Australian Restaurant,BBQ Joint,Baby Store,Bakery,Bar,Bed & Breakfast,Beer Bar,Beer Garden,Bistro,Bookstore,Breakfast Spot,Brewery,Bridge,Bubble Tea Shop,Buddhist Temple,Buffet,Building,Bulgarian Restaurant,Burger Joint,Bus Station,Café,Camera Store,Chinese Restaurant,Chocolate Shop,Church,Climbing Gym,Clothing Store,Cocktail Bar,Coffee Shop,Confucian Temple,Cultural Center,Czech Restaurant,Department Store,Dessert Shop,Dim Sum Restaurant,Donut Shop,Electronics Store,Escape Room,Event Space,Farm,Fast Food Restaurant,Fish & Chips Shop,Flower Shop,Food,Food & Drink Shop,Food Court,Food Truck,French Restaurant,Fried Chicken Joint,Garden,Gastropub,Grocery Store,Gym,Gym / Fitness Center,Himalayan Restaurant,History Museum,Hostel,Hotel,Hotel Bar,Hotpot Restaurant,Housing Development,Ice Cream Shop,Indian Restaurant,Italian Restaurant,Japanese Restaurant,Jazz Club,Juice Bar,Karaoke Bar,Kebab Restaurant,Korean Restaurant,Lake,Latin American Restaurant,Lounge,Malay Restaurant,Market,Massage Studio,Mexican Restaurant,Middle Eastern Restaurant,Modern European Restaurant,Mongolian Restaurant,Monument / Landmark,Motel,Movie Theater,Multiplex,Museum,National Park,Neighborhood,Nightclub,Noodle House,Opera House,Pakistani Restaurant,Park,Pastry Shop,Peruvian Restaurant,Photography Studio,Pizza Place,Polish Restaurant,Pub,Ramen Restaurant,Resort,Restaurant,Rock Club,Roof Deck,Russian Restaurant,Salad Place,Sandwich Place,Satay Restaurant,Scenic Lookout,Seafood Restaurant,Shopping Mall,Smoothie Shop,Snack Place,Soccer Field,Soccer Stadium,Soup Place,Spa,Sports Bar,Stadium,Steakhouse,Supermarket,Supplement Shop,Sushi Restaurant,Taco Place,Tapas Restaurant,Tea Room,Temple,Thai Restaurant,Thrift / Vintage Store,Tiki Bar,Train Station,Ukrainian Restaurant,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Village,Water Park,Wedding Hall,Wine Bar,Wings Joint,Women's Store,Zoo
0,Ba Vì,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.285714,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Ba Đình,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.03,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.06,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.09,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.04,0.0,0.02,0.0,0.02,0.0,0.0,0.05,0.0,0.0,0.01,0.0,0.04,0.02,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.02,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.01,0.0,0.04,0.0,0.0,0.01,0.0,0.01,0.03,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.03,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.07,0.0,0.0,0.01,0.01,0.01,0.0,0.0
2,Bắc Từ Liêm,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Chương Mỹ,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.01,0.01,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.01,0.01,0.01,0.0,0.0,0.01,0.08,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.08,0.0,0.02,0.0,0.01,0.01,0.0,0.03,0.0,0.0,0.0,0.0,0.02,0.02,0.0,0.0,0.01,0.01,0.01,0.0,0.0,0.0,0.01,0.02,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.02,0.02,0.0,0.01,0.0,0.01,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.02,0.0,0.0,0.01,0.0,0.0,0.02,0.1,0.0,0.0,0.01,0.01,0.01,0.0,0.0
4,Cầu Giấy,0.01,0.01,0.01,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.12,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.05,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.01,0.0,0.0,0.0,0.0,0.08,0.0,0.0,0.0,0.01,0.0,0.0,0.08,0.0,0.0,0.0,0.0,0.04,0.0,0.01,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.02,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.01,0.01,0.01,0.01,0.0,0.01,0.0,0.01,0.02,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,0.0,0.03,0.01,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.0,0.01,0.0,0.0,0.01
5,Gia Lâm,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Hai Bà Trưng,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Hoài Đức,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,Hoàn Kiếm,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.020833,0.041667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.041667,0.0,0.020833,0.0,0.020833,0.0,0.020833,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.020833,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0625,0.0,0.020833,0.020833,0.041667,0.0,0.0,0.041667,0.020833,0.020833,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0,0.0,0.020833,0.0,0.020833,0.020833,0.0,0.020833,0.0,0.020833,0.020833,0.0,0.0,0.0,0.0,0.0,0.020833,0.0,0.0625,0.0,0.020833,0.0,0.020833,0.0,0.020833,0.0
9,"Hoàng Mai, Hanoi",0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.02,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.09,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.15,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.02,0.0,0.0,0.01,0.0,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.07,0.0,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.01,0.0,0.0,0.01,0.01,0.0,0.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.03,0.0,0.0,0.01,0.01,0.0,0.01,0.0,0.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.02,0.1,0.0,0.0,0.0,0.01,0.01,0.0,0.0
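
Each value in the grouped table is the share of a neighborhood's returned venues that fall in a category, because the mean of a 0/1 one-hot column is the count of ones divided by the number of rows. The dependency-free sketch below reproduces that arithmetic on toy data. The function name `category_shares` and the sample venues are hypothetical, and the notebook itself uses `pd.get_dummies` followed by `groupby(...).mean()`.

```python
from collections import Counter, defaultdict


def category_shares(venues):
    """Map each neighborhood to {category: share of its venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    shares = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        shares[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return shares


# Toy data: a six-venue neighborhood with three cafés
toy_venues = [("Toy District", "Café")] * 3 + [
    ("Toy District", "Asian Restaurant"),
    ("Toy District", "Electronics Store"),
    ("Toy District", "Vietnamese Restaurant"),
]
print(category_shares(toy_venues)["Toy District"])
```

Three cafés out of six venues give 0.5, and each single venue gives 1/6, which is the same pattern as the 0.5 and 0.166667 values that appear in the six-venue rows of the table.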


In [41]:
len(hanoi_grouped[hanoi_grouped["Restaurant"] > 0])

8

#### Create a new DataFrame for Restaurant only

In [42]:
hanoi_restaurant = hanoi_grouped[['Neighborhoods', 'Restaurant']]

hanoi_restaurant

Unnamed: 0,Neighborhoods,Restaurant
0,Ba Vì,0.0
1,Ba Đình,0.01
2,Bắc Từ Liêm,0.0
3,Chương Mỹ,0.02
4,Cầu Giấy,0.01
5,Gia Lâm,0.0
6,Hai Bà Trưng,0.0
7,Hoài Đức,0.0
8,Hoàn Kiếm,0.0
9,"Hoàng Mai, Hanoi",0.01
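
The `Restaurant` column only covers venues that Foursquare labels with the generic "Restaurant" category. Specific cuisines such as "Vietnamese Restaurant" or "Japanese Restaurant" sit in their own columns. As a possible alternative feature, the sketch below sums every category whose name ends in "Restaurant". The helper name `total_restaurant_share` and the sample shares are illustrative assumptions, and the notebook's clustering does not use this feature.

```python
def total_restaurant_share(shares_by_category):
    """Sum the shares of all categories whose name ends with 'Restaurant'."""
    return sum(
        share
        for category, share in shares_by_category.items()
        if category.endswith("Restaurant")
    )


# Illustrative shares for one neighborhood
sample_shares = {
    "Restaurant": 0.01,
    "Vietnamese Restaurant": 0.07,
    "Japanese Restaurant": 0.05,
    "Coffee Shop": 0.09,
}
print(round(total_restaurant_share(sample_shares), 2))
```

A suffix match still misses food venues such as "Noodle House", "Pizza Place", or "BBQ Joint", so a hand-curated category list may be a better fit if a broader measure is needed.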


### 7. Cluster neighborhoods
Run k-means to group the Hanoi neighborhoods into 3 clusters, using the frequency of the generic "Restaurant" category as the only feature.

In [43]:
# set number of clusters
kclusters = 3

hanoi_clustering = hanoi_restaurant.drop(columns=["Neighborhoods"])

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(hanoi_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0, 2, 0, 2, 2, 0, 0, 0, 0, 2], dtype=int32)
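
Because the clustering runs on a single numeric feature, the idea behind k-means can be shown with a few lines of plain Python. The sketch below is a simplified Lloyd's algorithm on scalars with a deterministic initialization. It is not scikit-learn's implementation, so its labels can differ from those of `KMeans(n_clusters=3, random_state=0)`. The function name `kmeans_1d` and the sample values are assumptions for illustration.

```python
def kmeans_1d(values, k, max_iter=100):
    """Simplified Lloyd's algorithm on scalars; returns (labels, centroids)."""
    distinct = sorted(set(values))
    if len(distinct) < k:
        raise ValueError("need at least k distinct values")
    # deterministic start: k distinct values spread evenly over the sorted range
    step = (len(distinct) - 1) / max(k - 1, 1)
    centroids = [distinct[round(i * step)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(max_iter):
        # assignment step: nearest centroid for each value
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # update step: each centroid becomes the mean of its members
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


sample_values = [0.0, 0.01, 0.0, 0.02, 0.01, 0.0, 0.125, 0.03]
labels, centroids = kmeans_1d(sample_values, k=3)
print(labels, centroids)
```

With one feature, the clusters are simply contiguous ranges of the restaurant frequency, so an outlying value such as 0.125 ends up in a cluster of its own.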

In [44]:
# create a new dataframe that will hold the restaurant frequency and the cluster label for each neighborhood
hanoi_merged = hanoi_restaurant.copy()

# add clustering labels
hanoi_merged["Cluster Labels"] = kmeans.labels_

In [45]:
hanoi_merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
hanoi_merged.head()

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels
0,Ba Vì,0.0,0
1,Ba Đình,0.01,2
2,Bắc Từ Liêm,0.0,0
3,Chương Mỹ,0.02,2
4,Cầu Giấy,0.01,2


In [46]:
# join hanoi_merged with df_hanoi to add latitude/longitude for each neighborhood
hanoi_merged = hanoi_merged.join(df_hanoi.set_index("Neighborhood"), on="Neighborhood")

hanoi_merged.head() # check the last columns!

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
0,Ba Vì,0.0,0,21.08333,105.38333
1,Ba Đình,0.01,2,21.02201,105.81934
2,Bắc Từ Liêm,0.0,0,20.99831,105.75457
3,Chương Mỹ,0.02,2,21.02909,105.82682
4,Cầu Giấy,0.01,2,21.035597,105.805979


In [47]:
# sort the results by Cluster Labels
hanoi_merged.sort_values(["Cluster Labels"], inplace=True)
hanoi_merged

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
0,Ba Vì,0.0,0,21.08333,105.38333
22,Đan Phượng,0.0,0,21.0898,105.66401
19,Thanh Xuân,0.0,0,21.0376,105.77507
17,Sóc Sơn,0.0,0,21.275154,105.889073
16,Quốc Oai,0.0,0,21.003341,105.652264
15,Phúc Thọ,0.0,0,21.03227,105.85244
14,Nam Từ Liêm,0.0,0,20.99831,105.75457
23,Đông Anh,0.0,0,21.07294,105.78559
11,Long Biên,0.0,0,21.043111,105.865276
10,Hà Đông,0.0,0,20.97031,105.78181


#### Finally, let's visualize the resulting clusters

In [48]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(hanoi_merged['Latitude'], hanoi_merged['Longitude'], hanoi_merged['Neighborhood'], hanoi_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [49]:
# save the map as HTML file
map_clusters.save('hanoi_restaurant_clusters_map.html')

### 8. Examine clusters

#### Cluster 0

In [50]:
hanoi_merged.loc[hanoi_merged['Cluster Labels'] == 0]

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
0,Ba Vì,0.0,0,21.08333,105.38333
22,Đan Phượng,0.0,0,21.0898,105.66401
19,Thanh Xuân,0.0,0,21.0376,105.77507
17,Sóc Sơn,0.0,0,21.275154,105.889073
16,Quốc Oai,0.0,0,21.003341,105.652264
15,Phúc Thọ,0.0,0,21.03227,105.85244
14,Nam Từ Liêm,0.0,0,20.99831,105.75457
23,Đông Anh,0.0,0,21.07294,105.78559
11,Long Biên,0.0,0,21.043111,105.865276
10,Hà Đông,0.0,0,20.97031,105.78181


#### Cluster 1

In [51]:
hanoi_merged.loc[hanoi_merged['Cluster Labels'] == 1]

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
20,Thường Tín,0.125,1,20.870261,105.879775


#### Cluster 2

In [52]:
hanoi_merged.loc[hanoi_merged['Cluster Labels'] == 2]

Unnamed: 0,Neighborhood,Restaurant,Cluster Labels,Latitude,Longitude
13,Mỹ Đức,0.03,2,21.01458,105.8516
4,Cầu Giấy,0.01,2,21.035597,105.805979
3,Chương Mỹ,0.02,2,21.02909,105.82682
18,"Sơn Tây, Hanoi",0.01,2,21.032794,105.830139
21,Tây Hồ,0.01,2,21.06667,105.83333
1,Ba Đình,0.01,2,21.02201,105.81934
9,"Hoàng Mai, Hanoi",0.01,2,21.02843,105.83206


### 9. Conclusion

The values compared here are the share of each neighborhood's returned venues that fall in the generic "Restaurant" category. Most of these restaurants are concentrated in the central area of Hanoi, with the highest frequency in cluster 1 (0.125) and a moderate frequency in cluster 2 (0.01 to 0.03). Cluster 0, on the other hand, has no such restaurants in any of its neighborhoods (0.0 for every one). These neighborhoods represent a strong opportunity for opening a new restaurant, as there is little to no competition from existing restaurants.