# IBM Data Science Coursera Capstone Project

# Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)
___

# Introduction: Business Problem<a id='introduction'></a>

Jakarta is the capital of Indonesia, with a population of 10.5 million, and is the heart of the second densest metropolitan area in the world behind Tokyo, Japan. Having hosted the Asian Games in 2018, it has seen heavy investment in transportation infrastructure, including the opening of Indonesia's first MRT line just this year.

Given that, the city is likely to see further growth, and information about the lay of the land will be invaluable for investors and entrepreneurs making strategic decisions about where to invest or where to locate a business.

This project explores patterns among the subdistricts of Jakarta by grouping them into clusters in order to identify existing trends within its neighborhoods. From there, recommendations can be made on which category of neighborhood is most suitable for opening a certain type of venue.

The results are aimed at entrepreneurs in general, but may be most useful for those in the food and beverage sector, where location can be the deciding factor for success.
___

# Data<a id='data'></a>

To analyze trends in Jakarta's subdistricts, the list of subdistricts is obtained from the [Subdistricts of Jakarta Wikipedia page](https://en.wikipedia.org/wiki/Subdistricts_of_Jakarta).

Venue queries are then made for each subdistrict using the Foursquare API. The venue categories in the resulting data are used to find what the subdistricts have in common. The resulting clusters can then provide insight into which type of venue is likely to thrive in which cluster. The K-means clustering algorithm is used to find patterns among the subdistricts.

In summary, the following data is required to meet the objective:

- Subdistricts of Jakarta
- Coordinates of these subdistricts
- Recommended venues in each area
- Venue categories

___

## Data Gathering

Import the required libraries.

In [1]:
# Load needed libraries for data collection

# HTML request and scraper library
import requests
from bs4 import BeautifulSoup

# Geocoding library
#!conda install -c conda-forge geopy --yes # Uncomment to install geopy
from geopy.geocoders import ArcGIS # converts an address into latitude and longitude values

# Libraries for data analysis
import pandas as pd
from pandas import json_normalize # flattens nested JSON into a dataframe (pandas >= 1.0)
import numpy as np

#!conda install -c conda-forge folium=0.5.0 --yes # Uncomment to install folium
import folium # map plotting library
import matplotlib.cm as cm
import matplotlib.colors as colors

# Import collapsible JSON for exploration
from IPython.display import JSON

# k-means for categorization
from sklearn.cluster import KMeans

# Pretty print
from pprint import pprint

Parse the list of Jakarta's subdistricts from the Wikipedia page.

In [2]:
# Scrape the list of subdistricts from Wikipedia

page = requests.get("https://en.wikipedia.org/wiki/Subdistricts_of_Jakarta")
soup = BeautifulSoup(page.content, 'html.parser')

# Parse the soup for subdistrict names; skip list items of 50 characters or more
subdistrict = [i.text for i in soup.select('table.multicol tbody tr td ul li') if len(i.text) < 50]

pprint(subdistrict)

['Cengkareng',
 'Grogol Petamburan',
 'Kalideres',
 'Kebon Jeruk',
 'Kembangan',
 'Palmerah',
 'Taman Sari',
 'Tambora',
 'Cempaka Putih',
 'Gambir',
 'Johar Baru',
 'Kemayoran',
 'Sawah Besar',
 'Senen',
 'Tanah Abang',
 'Cilandak',
 'Jagakarsa',
 'Kebayoran Baru',
 'Kebayoran Lama',
 'Mampang Prapatan',
 'Pancoran',
 'Pasar Minggu',
 'Pesanggrahan',
 'Setiabudi',
 'Tebet',
 'Cakung',
 'Cipayung',
 'Ciracas',
 'Duren Sawit',
 'Jatinegara',
 'Kramat Jati',
 'Makasar',
 'Matraman',
 'Menteng',
 'Pasar Rebo',
 'Pulo Gadung',
 'Cilincing',
 'Kelapa Gading',
 'Koja',
 'Pademangan',
 'Penjaringan',
 'Tanjung Priok',
 'Kepulauan Seribu Selatan',
 'Kepulauan Seribu Utara']


In [3]:
# Obtain geocoding for each subdistrict

arcg = ArcGIS() # instantiate the geolocator from geopy


# Define a function to query coordinates
def get_coord_jkt(addr):
    'Take a list of addresses and return a dictionary of address-coordinate pairs'
    dic = {}
    for i in addr:
        try:
            location = arcg.geocode(i + ", Jakarta") # Append "Jakarta" to the query as a specifier
            print(i, "queried, returned as", location[0]) # Check the matched address
            dic[i] = location[1] # (latitude, longitude) tuple
        except Exception as E:
            print("ERROR: occurred at", i, E)
    print("Query complete, total query:", len(addr))
    return dic

subdistcoord = get_coord_jkt(subdistrict)
pprint(subdistcoord)

Cengkareng queried, returned as Cengkareng, Jakarta, DKI Jakarta
Grogol Petamburan queried, returned as Grogol Petamburan, Jakarta, DKI Jakarta
Kalideres queried, returned as Kalideres, Jakarta, DKI Jakarta
Kebon Jeruk queried, returned as Kebon Jeruk, Jakarta, DKI Jakarta
Kembangan queried, returned as Kembangan, Jakarta, DKI Jakarta
Palmerah queried, returned as Palmerah, Jakarta, DKI Jakarta
Taman Sari queried, returned as Taman Sari, Tamansari, Jakarta, DKI Jakarta
Tambora queried, returned as Tambora, Jakarta, DKI Jakarta
Cempaka Putih queried, returned as Cempaka Putih, Jakarta, DKI Jakarta
Gambir queried, returned as Gambir, Jakarta, DKI Jakarta
Johar Baru queried, returned as Johar Baru, Jakarta, DKI Jakarta
Kemayoran queried, returned as Kemayoran, Jakarta, DKI Jakarta
Sawah Besar queried, returned as Sawah Besar, Jakarta, DKI Jakarta
Senen queried, returned as Senen, Jakarta, DKI Jakarta
Tanah Abang queried, returned as Tanah Abang, Jakarta, DKI Jakarta
Cilandak queried, retu

In [4]:
# Put the dictionary into data frame
jakarta_df = pd.DataFrame.from_dict(subdistcoord, orient='index', columns=['Latitude','Longitude'])
jakarta_df

| | Latitude | Longitude |
|---|---|---|
| Cengkareng | -6.1306 | 106.74559 |
| Grogol Petamburan | -6.16777 | 106.78457 |
| Kalideres | -6.1221 | 106.70727 |
| Kebon Jeruk | -6.19702 | 106.77308 |
| Kembangan | -6.21823 | 106.73749 |
| Palmerah | -6.19153 | 106.79556 |
| Taman Sari | -6.15327 | 106.82357 |
| Tambora | -6.15194 | 106.80777 |
| Cempaka Putih | -6.176 | 106.8706 |
| Gambir | -6.17299 | 106.81571 |


As we are more interested in the city area, drop the 2 island subdistricts from the data frame.

Also convert dataframe's index into its own column as *Subdistrict*.

In [5]:
# Drop the 2 subdistricts located on separate islands (Kepulauan Seribu)
k = ['Kepulauan' in i for i in jakarta_df.index] # boolean mask of the island subdistricts
jakarta_df.drop(jakarta_df[k].index, inplace=True)

# Convert index to its own column under Subdistrict
jakarta_df.reset_index(inplace=True)
jakarta_df.columns = ['Subdistrict','Latitude','Longitude']
jakarta_df

| | Subdistrict | Latitude | Longitude |
|---|---|---|---|
| 0 | Cengkareng | -6.1306 | 106.74559 |
| 1 | Grogol Petamburan | -6.16777 | 106.78457 |
| 2 | Kalideres | -6.1221 | 106.70727 |
| 3 | Kebon Jeruk | -6.19702 | 106.77308 |
| 4 | Kembangan | -6.21823 | 106.73749 |
| 5 | Palmerah | -6.19153 | 106.79556 |
| 6 | Taman Sari | -6.15327 | 106.82357 |
| 7 | Tambora | -6.15194 | 106.80777 |
| 8 | Cempaka Putih | -6.176 | 106.8706 |
| 9 | Gambir | -6.17299 | 106.81571 |


Draw them on a map to verify.

In [6]:
# Draw the map centered on Jakarta
jakarta = get_coord_jkt(["DKI Jakarta"])
jktmap = folium.Map(location=jakarta["DKI Jakarta"], zoom_start=11) 

# Add a red circle marker to represent the center of Jakarta
folium.CircleMarker(
    jakarta["DKI Jakarta"],
    radius=10,
    color='red',
    popup='Jakarta',
    fill = True,
    fill_color = 'red',
    fill_opacity = 1
    ).add_to(jktmap)

# Add the subdistricts as blue circle markers
for subdist, coord in subdistcoord.items():
    folium.CircleMarker(
        coord,
        radius=5,
        color='blue',
        popup=subdist,
        fill = True,
        fill_color='blue',
        fill_opacity=1
    ).add_to(jktmap)

# display map
jktmap

DKI Jakarta queried, returned as Jakarta, DKI Jakarta
Query complete, total query: 1
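
Besides the visual check on the map, the geocoded points can also be sanity-checked numerically. The sketch below is not part of the original notebook: it computes the great-circle (haversine) distance between two coordinates, so that a subdistrict geocoded implausibly far from the city could be flagged. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration; the two coordinates are copied from the dataframe above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Coordinates copied from the geocoding output above
gambir = (-6.17299, 106.81571)
cengkareng = (-6.1306, 106.74559)

distance = haversine_km(*gambir, *cengkareng)
print("Gambir to Cengkareng: {:.1f} km".format(distance))
```

A point that lands tens of kilometres away from every other subdistrict would be a hint that the geocoder matched a different place with the same name.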


With the subdistricts geocoded, we can now pull venue data for each subdistrict using the Foursquare API.

## Foursquare call

Load the credentials. They are read from environment variables so that the client secret is never stored or printed in the notebook.

In [9]:
import os

# The environment variable names are illustrative; use whatever names you set
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20180604'
LIMIT = 100
print('Credentials loaded.')

Credentials loaded.


Query the Gambir area first to verify that the call works.

In [7]:
# Check Gambir area recommended venues
latitude = jakarta_df['Latitude'][9]
longitude = jakarta_df['Longitude'][9]
radius = 1500

In [10]:
# define URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude, 
    longitude, 
    radius, 
    LIMIT)

# send GET request and get recommended venues
results = requests.get(url).json()
print('There are {} recommended venues.'.format(len(results['response']['groups'][0]['items'])))

JSON(results)

There are 100 recommended venues.


<IPython.core.display.JSON object>
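
The request URL above is assembled by string formatting. As a hedged alternative, the sketch below builds the same query string with the standard library's `urlencode`, which escapes each parameter and keeps the parameter list readable. The helper name `build_explore_url` and the placeholder credentials are hypothetical; only the endpoint and parameter names come from the call above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lng, radius, limit, version="20180604"):
    """Return the explore URL with every query parameter URL-encoded."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    return EXPLORE_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials, for illustration only
example_url = build_explore_url("MY_ID", "MY_SECRET", -6.17299, 106.81571, 1500, 100)
print(example_url)
```

Note that `urlencode` escapes the comma in `ll` as `%2C`, which is the standard encoding of a comma in a query string.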

Process and convert JSON response into a dataframe.

In [11]:
# Part adapted from example exercise
def get_category_type(row):
    'Extract the name of the first category of a venue from the Foursquare JSON'
    try:
        categories_list = row['categories']
    except KeyError: # flattened rows use the prefixed column name
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']


# Get relevant part of FourSquare's explore call JSON
items = results['response']['groups'][0]['items']

dataframe = json_normalize(items) # flatten JSON

# Filter columns
filtered_columns = ['venue.name', 'venue.categories'] + [col for col in dataframe.columns if col.startswith('venue.location.')] + ['venue.id']
dataframe_filtered = dataframe.loc[:, filtered_columns].copy() # copy so the assignment below does not touch a view

# Extract the category name for each row
dataframe_filtered['venue.categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# Clean columns
dataframe_filtered.columns = [col.split('.')[-1] for col in dataframe_filtered.columns]

dataframe_filtered.head(10)

Unnamed: 0,name,categories,address,lat,lng,labeledLatLngs,distance,postalCode,cc,city,state,country,formattedAddress,crossStreet,neighborhood,id
0,Roemah Noni,Indonesian Restaurant,Jl. Kesehatan No. 12,-6.17259,106.815097,"[{'label': 'display', 'lat': -6.17258972867655...",81,DKI Jakarta,ID,Jakarta Barat,Jakarta,Indonesia,"[Jl. Kesehatan No. 12, Jakarta, Jakarta DKI Ja...",,,4c9598ba03413704bda47def
1,Starbucks,Coffee Shop,Jalan Tanah Abang II No. 76,-6.175441,106.81222,"[{'label': 'display', 'lat': -6.175441, 'lng':...",472,10160,ID,Jakarta Pusat,Jakarta,Indonesia,[Jalan Tanah Abang II No. 76 (Jalan Cideng Tim...,Jalan Cideng Timur,,4e500fd81fc7e04d29e3287e
2,Focus Nusantara - Camera & Photography Shop,Camera Store,Tanah Abang II No.7,-6.175639,106.812901,"[{'label': 'display', 'lat': -6.17563886742587...",428,,ID,Jakarta,Jakarta,Indonesia,"[Tanah Abang II No.7, Jakarta, Jakarta, Indone...",,,4ef30c5930f8e7873c876660
3,RM. Adem Ayem,Indonesian Restaurant,Jl. AM Sangaji 27,-6.168778,106.81447,"[{'label': 'display', 'lat': -6.16877788919146...",488,,ID,Jakarta,Jakarta,Indonesia,"[Jl. AM Sangaji 27, Jakarta, Jakarta, Indonesia]",,,4ca5810a76d3a093f9c5f86a
4,McDonald's Cideng,Fast Food Restaurant,Jl. Cideng,-6.173391,106.811314,"[{'label': 'display', 'lat': -6.17339057589308...",488,,ID,Jakarta Pusat,Jakarta,Indonesia,"[Jl. Cideng, Jakarta Pusat, Jakarta, Indonesia]",,,54460633498e1a99f7831268
5,Ibis Budget Jakarta Tanah Abang,Vacation Rental,Jalan Tanah Abang II No.35,-6.175735,106.816305,"[{'label': 'display', 'lat': -6.17573545874999...",312,10160,ID,Jakarta Pusat,Jakarta,Indonesia,"[Jalan Tanah Abang II No.35 (Jalan Kesehatan),...",Jalan Kesehatan,,57bcfedf498eb883b3f76285
6,Beautika,Manadonese Restaurant,Jl. Abdul Muis No. 70A,-6.179549,106.818076,"[{'label': 'display', 'lat': -6.17954884800084...",775,,ID,Jakarta Pusat,Jakarta,Indonesia,"[Jl. Abdul Muis No. 70A (Jalan Tanah Abang 5),...",Jalan Tanah Abang 5,,4e34fbb9fa7656ba316c914e
7,Bubur Ayam Musi,Breakfast Spot,Jl. Musi 19,-6.17442,106.809554,"[{'label': 'display', 'lat': -6.17441955029394...",699,,ID,Jakarta,Jakarta,Indonesia,"[Jl. Musi 19 (Depan gereja GPIB Petrus Musi), ...",Depan gereja GPIB Petrus Musi,,4bcf9f78462cb7138827d707
8,Petrof Piano,Music Venue,Cideng,-6.172733,106.811713,"[{'label': 'display', 'lat': -6.17273256156308...",443,,ID,Jakarta,Jakarta,Indonesia,"[Cideng, Jakarta, Jakarta, Indonesia]",,,4bee70e5d355a593d3e20a60
9,Wita Tour Head Office,Office,Jl. Balikpapan No.5,-6.170505,106.81411,"[{'label': 'display', 'lat': -6.17050476836452...",328,,ID,Jakarta Pusat,Jakarta,Indonesia,"[Jl. Balikpapan No.5, Jakarta Pusat, Jakarta, ...",,,4dca3e56152052a40fa72418


Visualize venues around the subdistrict to verify.

In [12]:
# Generate map centred around Gambir
venues_map = folium.Map(location=[latitude, longitude], zoom_start=15) 

# Add center of Gambir subdistrict as a red circle mark
folium.CircleMarker(
    [latitude, longitude],
    radius=10,
    popup='Gambir',
    fill=True,
    color='red',
    fill_color='red',
    fill_opacity=0.6
    ).add_to(venues_map)


# Add popular spots to the map as blue circle markers
for lat, lng, cat, venue in zip(dataframe_filtered.lat, dataframe_filtered.lng, dataframe_filtered.categories, dataframe_filtered.name):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup= venue + ", " + cat,
        fill=True,
        color='blue',
        fill_color='blue',
        fill_opacity=0.6
        ).add_to(venues_map)

# display map
venues_map

The venue distribution looks good. Keep in mind that the Foursquare API returns at most 100 venues per call.

Now get the venues for each subdistrict.

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    '''
    A function to pull nearby venues for each of the subdistricts
    Adapted from previous exercise
    '''
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Subdistrict', 
                  'Subdistrict Latitude', 
                  'Subdistrict Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
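
One fragile spot in the function above is `v['venue']['categories'][0]['name']`, which raises an `IndexError` if a venue comes back with an empty category list. The sketch below shows one possible guard. The helper name `first_category_name`, the `"Uncategorized"` default, and the sample items are all hypothetical and only mimic the shape of the response used above.

```python
def first_category_name(venue, default="Uncategorized"):
    """Return the name of the venue's first category, or `default` if it has none."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else default


# Hand-made items mimicking the shape of the explore response used above
sample_items = [
    {"venue": {"name": "Starbucks", "categories": [{"name": "Coffee Shop"}]}},
    {"venue": {"name": "Unnamed stall", "categories": []}},
]

names = [first_category_name(item["venue"]) for item in sample_items]
print(names)
```

Inside `getNearbyVenues`, the last element of the tuple could then be written as `first_category_name(v['venue'])`.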

Call the function to make the repeated query.

In [14]:
jakarta_venues_cat = getNearbyVenues(names=jakarta_df['Subdistrict'],
                                 latitudes=jakarta_df['Latitude'],
                                 longitudes=jakarta_df['Longitude'],
                                 radius = 1500)

jakarta_venues_cat.head(10)

Cengkareng
Grogol Petamburan
Kalideres
Kebon Jeruk
Kembangan
Palmerah
Taman Sari
Tambora
Cempaka Putih
Gambir
Johar Baru
Kemayoran
Sawah Besar
Senen
Tanah Abang
Cilandak
Jagakarsa
Kebayoran Baru
Kebayoran Lama
Mampang Prapatan
Pancoran
Pasar Minggu
Pesanggrahan
Setiabudi
Tebet
Cakung
Cipayung
Ciracas
Duren Sawit
Jatinegara
Kramat Jati
Makasar
Matraman
Menteng
Pasar Rebo
Pulo Gadung
Cilincing
Kelapa Gading
Koja
Pademangan
Penjaringan
Tanjung Priok


| | Subdistrict | Subdistrict Latitude | Subdistrict Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Cengkareng | -6.1306 | 106.74559 | Ikana - Ikana Seafood & Cafe | -6.124557 | 106.749859 | Seafood Restaurant |
| 1 | Cengkareng | -6.1306 | 106.74559 | Bakso Aan | -6.123921 | 106.743012 | Asian Restaurant |
| 2 | Cengkareng | -6.1306 | 106.74559 | MYSTIQUE Pool - Lounge - Dine | -6.123594 | 106.740567 | Sports Bar |
| 3 | Cengkareng | -6.1306 | 106.74559 | Fins Recipe | -6.124033 | 106.743319 | Dessert Shop |
| 4 | Cengkareng | -6.1306 | 106.74559 | Bakmi pejagalan AMI | -6.128617 | 106.754013 | Noodle House |
| 5 | Cengkareng | -6.1306 | 106.74559 | MeaterS PIK | -6.123683 | 106.741391 | Steakhouse |
| 6 | Cengkareng | -6.1306 | 106.74559 | Bakmi Bintang Gading PIK | -6.123673 | 106.741235 | Noodle House |
| 7 | Cengkareng | -6.1306 | 106.74559 | PVBLIC Bistro & BAR | -6.123571 | 106.738523 | Bar |
| 8 | Cengkareng | -6.1306 | 106.74559 | Uncle Tjhin Bistro | -6.123666 | 106.741271 | Bistro |
| 9 | Cengkareng | -6.1306 | 106.74559 | Tea Garden | -6.132895 | 106.734944 | Asian Restaurant |


Check the resulting data frame size and content.

In [15]:
print("Shape of df:",jakarta_venues_cat.shape)
jakarta_venues_cat.head()

Shape of df: (2968, 7)


| | Subdistrict | Subdistrict Latitude | Subdistrict Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Cengkareng | -6.1306 | 106.74559 | Ikana - Ikana Seafood & Cafe | -6.124557 | 106.749859 | Seafood Restaurant |
| 1 | Cengkareng | -6.1306 | 106.74559 | Bakso Aan | -6.123921 | 106.743012 | Asian Restaurant |
| 2 | Cengkareng | -6.1306 | 106.74559 | MYSTIQUE Pool - Lounge - Dine | -6.123594 | 106.740567 | Sports Bar |
| 3 | Cengkareng | -6.1306 | 106.74559 | Fins Recipe | -6.124033 | 106.743319 | Dessert Shop |
| 4 | Cengkareng | -6.1306 | 106.74559 | Bakmi pejagalan AMI | -6.128617 | 106.754013 | Noodle House |


Check how many venues were returned for each subdistrict.

In [16]:
jakarta_venues_cat.groupby('Subdistrict').count()

| Subdistrict | Subdistrict Latitude | Subdistrict Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Cakung | 7 | 7 | 7 | 7 | 7 | 7 |
| Cempaka Putih | 100 | 100 | 100 | 100 | 100 | 100 |
| Cengkareng | 32 | 32 | 32 | 32 | 32 | 32 |
| Cilandak | 100 | 100 | 100 | 100 | 100 | 100 |
| Cilincing | 7 | 7 | 7 | 7 | 7 | 7 |
| Cipayung | 11 | 11 | 11 | 11 | 11 | 11 |
| Ciracas | 29 | 29 | 29 | 29 | 29 | 29 |
| Duren Sawit | 55 | 55 | 55 | 55 | 55 | 55 |
| Gambir | 100 | 100 | 100 | 100 | 100 | 100 |
| Grogol Petamburan | 100 | 100 | 100 | 100 | 100 | 100 |
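
Several subdistricts return exactly 100 venues, which is the API limit noted earlier, so their true venue counts may be higher and their category frequencies are computed over a truncated sample. The sketch below is an illustrative way to list the subdistricts that hit the cap. The dictionary is copied by hand from the ten rows shown above and `capped` is a hypothetical variable name.

```python
LIMIT = 100  # same cap as the notebook's LIMIT

# Venue counts copied from the table above (first ten subdistricts only)
venue_counts = {
    "Cakung": 7, "Cempaka Putih": 100, "Cengkareng": 32, "Cilandak": 100,
    "Cilincing": 7, "Cipayung": 11, "Ciracas": 29, "Duren Sawit": 55,
    "Gambir": 100, "Grogol Petamburan": 100,
}

# Subdistricts whose result set was probably truncated by the limit
capped = sorted(name for name, n in venue_counts.items() if n >= LIMIT)
print(capped)
```

On the full dataframe, the same dictionary could presumably be obtained with `jakarta_venues_cat.groupby('Subdistrict').Venue.count().to_dict()`.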


Drop the subdistricts with fewer than 10 venues. These subdistricts are considered not popular enough for our purpose.

In [17]:
# Find the subdistricts with fewer than 10 venues
lowvenue_subdistrict = jakarta_venues_cat.groupby('Subdistrict').Venue.count() < 10
lowvenue_subdistrict = list(lowvenue_subdistrict[lowvenue_subdistrict].index)

# Exclude those subdistricts; boolean indexing returns a new dataframe
jakarta_venues = jakarta_venues_cat[~jakarta_venues_cat.Subdistrict.isin(lowvenue_subdistrict)]

Recheck how many venues were returned for each subdistrict.

In [18]:
jakarta_venues.groupby('Subdistrict').count()

| Subdistrict | Subdistrict Latitude | Subdistrict Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|
| Cempaka Putih | 100 | 100 | 100 | 100 | 100 | 100 |
| Cengkareng | 32 | 32 | 32 | 32 | 32 | 32 |
| Cilandak | 100 | 100 | 100 | 100 | 100 | 100 |
| Cipayung | 11 | 11 | 11 | 11 | 11 | 11 |
| Ciracas | 29 | 29 | 29 | 29 | 29 | 29 |
| Duren Sawit | 55 | 55 | 55 | 55 | 55 | 55 |
| Gambir | 100 | 100 | 100 | 100 | 100 | 100 |
| Grogol Petamburan | 100 | 100 | 100 | 100 | 100 | 100 |
| Jagakarsa | 12 | 12 | 12 | 12 | 12 | 12 |
| Jatinegara | 65 | 65 | 65 | 65 | 65 | 65 |


Check the number of unique categories.

In [19]:
print('There are {} unique categories.'.format(len(jakarta_venues['Venue Category'].unique())))

There are 252 unique categories.


Visualize the venues on the map.

In [20]:
jktmap_venue = folium.Map(location=jakarta["DKI Jakarta"], zoom_start=11) # generate map centered on Jakarta

# add a red circle marker to represent center of Jakarta
folium.CircleMarker(
    jakarta["DKI Jakarta"],
    radius=10,
    color='red',
    popup='Jakarta',
    fill = True,
    fill_color = 'red',
    fill_opacity = 1
    ).add_to(jktmap_venue)

# add the subdistricts as blue circle markers
for subdist, coord in subdistcoord.items():
    folium.CircleMarker(
        coord,
        radius=5,
        color='blue',
        popup=subdist,
        fill = True,
        fill_color='blue',
        fill_opacity=1
    ).add_to(jktmap_venue)

# add venues to the map as green circle markers
for lat, lng, label, cat in zip(jakarta_venues["Venue Latitude"], jakarta_venues["Venue Longitude"], 
                                jakarta_venues["Venue"], jakarta_venues["Venue Category"]):
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label +", " + cat,
        fill=True,
        color='green',
        fill_color='green',
        fill_opacity=0.6
        ).add_to(jktmap_venue)


# display map
jktmap_venue

As can be seen from how the venues cluster on the map, there is a strong skew toward central Jakarta.

Also, the two excluded subdistricts are located quite far from the city center, which supports the decision to exclude them.

# Methodology<a id='methodology'></a>

Given that our objective is to categorize the subdistricts in general terms, we will use the K-means clustering algorithm to group the subdistricts of Jakarta.

One-hot encoding will be applied to the venue dataframe, producing one column per venue category. The encoded rows will then be grouped by subdistrict and averaged, which gives the relative frequency of each venue category in each subdistrict.

The top venues of each subdistrict will be extracted for inspection, and the K-means clustering algorithm will be run over the grouped frequency dataframe. This returns a cluster label for each subdistrict. The clusters will then be examined manually, one by one, to characterize their content.

Recommendations will be made based on the clustering.
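
To make the encoding step concrete, here is a minimal sketch on a toy dataframe. The rows and the subdistrict names `A` and `B` are invented for illustration and are not part of the Jakarta data. It shows that the mean of each one-hot column within a subdistrict equals the share of venues in that category.

```python
import pandas as pd

# Toy venue list; the rows are invented for illustration
toy_venues = pd.DataFrame({
    "Subdistrict": ["A", "A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Noodle House", "Hotel",
                       "Noodle House", "Noodle House"],
})

# One column per category, one row per venue
toy_onehot = pd.get_dummies(toy_venues["Venue Category"], dtype=int)
toy_onehot.insert(0, "Subdistrict", toy_venues["Subdistrict"])

# The mean of each 0/1 column is the share of venues in that category
toy_grouped = toy_onehot.groupby("Subdistrict").mean()
print(toy_grouped)
```

Each row of the grouped frame sums to 1, so every subdistrict is described by a frequency profile regardless of how many venues it returned.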

# Analysis<a id='analysis'></a>

Apply one-hot encoding to the dataframe.

In [21]:
# one hot encoding
jakarta_onehot = pd.get_dummies(jakarta_venues[['Venue Category']], prefix="", prefix_sep="")

# add subdistrict column back to dataframe
jakarta_onehot['Subdistrict'] = jakarta_venues['Subdistrict'] 

# move subdistrict column to the first column
fixed_columns = [jakarta_onehot.columns[-1]] + list(jakarta_onehot.columns[:-1])
jakarta_onehot = jakarta_onehot[fixed_columns]

jakarta_onehot.head()

Unnamed: 0,Subdistrict,Accessories Store,Acehnese Restaurant,African Restaurant,Airport,American Restaurant,Aquarium,Arcade,Art Gallery,Art Museum,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Water Park,Wine Bar,Winery,Wings Joint,Women's Store,Yoga Studio
0,Cengkareng,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Cengkareng,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Cengkareng,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Cengkareng,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Cengkareng,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Check shape.

In [22]:
jakarta_onehot.shape

(2954, 253)

Group the encoded dataframe by subdistrict, taking the mean of each category column.

In [23]:
jakarta_grouped = jakarta_onehot.groupby('Subdistrict').mean().reset_index()
jakarta_grouped

Unnamed: 0,Subdistrict,Accessories Store,Acehnese Restaurant,African Restaurant,Airport,American Restaurant,Aquarium,Arcade,Art Gallery,Art Museum,...,Vegetarian / Vegan Restaurant,Veterinarian,Video Game Store,Vietnamese Restaurant,Water Park,Wine Bar,Winery,Wings Joint,Women's Store,Yoga Studio
0,Cempaka Putih,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Cengkareng,0.0,0.0,0.0,0.0,0.03125,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Cilandak,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,0.0
3,Cipayung,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Ciracas,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.0,0.0
5,Duren Sawit,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Gambir,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Grogol Petamburan,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0,...,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.0,0.01,0.0
8,Jagakarsa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Jatinegara,0.0,0.0,0.0,0.015385,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [24]:
jakarta_grouped.shape

(40, 253)

In [25]:
num_top_venues = 5

for hood in jakarta_grouped['Subdistrict']:
    print("----"+hood+"----")
    temp = jakarta_grouped[jakarta_grouped['Subdistrict'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Cempaka Putih----
                   venue  freq
0  Indonesian Restaurant  0.10
1            Pizza Place  0.08
2            Coffee Shop  0.06
3   Fast Food Restaurant  0.06
4                   Café  0.05


----Cengkareng----
              venue  freq
0  Asian Restaurant  0.09
1      Noodle House  0.09
2               Bar  0.06
3              Café  0.06
4       Pizza Place  0.06


----Cilandak----
                   venue  freq
0            Coffee Shop  0.10
1  Indonesian Restaurant  0.06
2       Asian Restaurant  0.06
3             Food Truck  0.06
4                   Café  0.06


----Cipayung----
                   venue  freq
0  Indonesian Restaurant  0.18
1       Asian Restaurant  0.09
2                 Garden  0.09
3          Grocery Store  0.09
4             Food Truck  0.09


----Ciracas----
                   venue  freq
0  Indonesian Restaurant   0.1
1       Asian Restaurant   0.1
2   Fast Food Restaurant   0.1
3            Coffee Shop   0.1
4           Noodle House   0.1



Define a function to return the most common venues.

In [26]:
# Adapted from previous exercise
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Create a new dataframe and display the top 10 venues for each subdistrict.

In [27]:
# Adapted from previous exercise
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Subdistrict']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError: # ranks beyond 3rd use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Subdistrict'] = jakarta_grouped['Subdistrict']

for ind in np.arange(jakarta_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(jakarta_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

| | Subdistrict | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cempaka Putih | Indonesian Restaurant | Pizza Place | Fast Food Restaurant | Coffee Shop | Café | Hotel | Indonesian Meatball Place | Restaurant | Asian Restaurant | Flea Market |
| 1 | Cengkareng | Asian Restaurant | Noodle House | Café | Indonesian Restaurant | Bistro | Bar | Pizza Place | Clothing Store | Coffee Shop | Seafood Restaurant |
| 2 | Cilandak | Coffee Shop | Food Truck | Asian Restaurant | Café | Indonesian Restaurant | Fast Food Restaurant | Steakhouse | Motorcycle Shop | Chinese Restaurant | Bakery |
| 3 | Cipayung | Indonesian Restaurant | Food & Drink Shop | Noodle House | High School | Pizza Place | Asian Restaurant | Grocery Store | Convenience Store | Garden | Food Truck |
| 4 | Ciracas | Fast Food Restaurant | Asian Restaurant | Noodle House | Indonesian Restaurant | Coffee Shop | Convenience Store | Farmers Market | Bakery | High School | Dumpling Restaurant |


## Cluster Subdistricts

Run *k*-means to group the subdistricts into 5 clusters.

In [28]:
# set number of clusters
kclusters = 5

jakarta_grouped_clustering = jakarta_grouped.drop(columns='Subdistrict') # keep only the numeric frequency columns

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(jakarta_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]

array([1, 1, 1, 0, 0, 3, 1, 4, 3, 1], dtype=int32)
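
The notebook fixes the number of clusters at 5 without comparing alternatives. One common, though not definitive, way to check such a choice is to compare silhouette scores across several values of *k*. The sketch below does this on synthetic data, not on the Jakarta dataframe: the three blobs, the range of *k*, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs (not the Jakarta data)
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + 0.2 * rng.randn(20, 2) for c in centers])

# Fit k-means for each candidate k and record the mean silhouette score
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

To try this on the notebook's data, `points` would be replaced by `jakarta_grouped_clustering`. With only 40 rows and a few hundred sparse columns, the scores may be low and close together, so the result should be read as a rough guide rather than a verdict.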

Let's create a new dataframe that includes the cluster label as well as the top 10 venues for each subdistrict.

In [29]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

# remove the subdistricts that were excluded for having too few venues
jakarta_merged = jakarta_df[~jakarta_df.Subdistrict.isin(lowvenue_subdistrict)]

# merge df to add latitude/longitude for each subdistrict
jakarta_merged = jakarta_merged.join(neighborhoods_venues_sorted.set_index('Subdistrict'), on='Subdistrict')

# Shift label to start from index 1
jakarta_merged['Cluster Labels'] = jakarta_merged['Cluster Labels'] + 1

jakarta_merged.head()

| | Subdistrict | Latitude | Longitude | Cluster Labels | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cengkareng | -6.1306 | 106.74559 | 2 | Asian Restaurant | Noodle House | Café | Indonesian Restaurant | Bistro | Bar | Pizza Place | Clothing Store | Coffee Shop | Seafood Restaurant |
| 1 | Grogol Petamburan | -6.16777 | 106.78457 | 5 | Noodle House | Coffee Shop | Chinese Restaurant | Clothing Store | Seafood Restaurant | Indonesian Restaurant | Steakhouse | Asian Restaurant | Restaurant | Sushi Restaurant |
| 2 | Kalideres | -6.1221 | 106.70727 | 2 | Noodle House | Coffee Shop | Pizza Place | Asian Restaurant | Chinese Restaurant | Café | Food Truck | Convenience Store | Japanese Restaurant | Fast Food Restaurant |
| 3 | Kebon Jeruk | -6.19702 | 106.77308 | 2 | Asian Restaurant | Convenience Store | Indonesian Restaurant | Coffee Shop | Steakhouse | Pizza Place | Fast Food Restaurant | Noodle House | Café | Seafood Restaurant |
| 4 | Kembangan | -6.21823 | 106.73749 | 4 | Convenience Store | Food Court | Department Store | Park | Indonesian Restaurant | Music Venue | Spa | Plaza | Japanese Restaurant | Snack Place |


Visualize the resulting clusters.

In [30]:
# create map
map_clusters = folium.Map(location=jakarta["DKI Jakarta"], zoom_start=11)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(jakarta_merged['Latitude'], jakarta_merged['Longitude'], jakarta_merged['Subdistrict'], jakarta_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters


## Examine Clusters

Examine each cluster.

#### Cluster 1

Cluster 1 contains a concentration of noodle houses and other Asian restaurants. Geographically, its members are all located in the southeast corner of the city, with a golf course among the common venues (it is the top venue in Makasar). The cluster appears to revolve around the golf course.

In [31]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 1, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

Unnamed: 0,Subdistrict,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
26,Cipayung,Indonesian Restaurant,Food & Drink Shop,Noodle House,High School,Pizza Place,Asian Restaurant,Grocery Store,Convenience Store,Garden,Food Truck
27,Ciracas,Fast Food Restaurant,Asian Restaurant,Noodle House,Indonesian Restaurant,Coffee Shop,Convenience Store,Farmers Market,Bakery,High School,Dumpling Restaurant
31,Makasar,Golf Course,Indonesian Restaurant,Fast Food Restaurant,Pizza Place,Noodle House,Asian Restaurant,Department Store,Shopping Mall,Supermarket,Bookstore


#### Cluster 2

With the most members, cluster 2 contains a good number of coffee shops and hotels. It is located in the central and southern parts of the city, where a high concentration of places to hang out and stay is expected.

In [32]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 2, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

Unnamed: 0,Subdistrict,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Cengkareng,Asian Restaurant,Noodle House,Café,Indonesian Restaurant,Bistro,Bar,Pizza Place,Clothing Store,Coffee Shop,Seafood Restaurant
2,Kalideres,Noodle House,Coffee Shop,Pizza Place,Asian Restaurant,Chinese Restaurant,Café,Food Truck,Convenience Store,Japanese Restaurant,Fast Food Restaurant
3,Kebon Jeruk,Asian Restaurant,Convenience Store,Indonesian Restaurant,Coffee Shop,Steakhouse,Pizza Place,Fast Food Restaurant,Noodle House,Café,Seafood Restaurant
5,Palmerah,Coffee Shop,Asian Restaurant,Indonesian Restaurant,Hotel,Food Truck,Convenience Store,Café,Chinese Restaurant,Pizza Place,Soup Place
8,Cempaka Putih,Indonesian Restaurant,Pizza Place,Fast Food Restaurant,Coffee Shop,Café,Hotel,Indonesian Meatball Place,Restaurant,Asian Restaurant,Flea Market
9,Gambir,Hotel,Indonesian Restaurant,Coffee Shop,Seafood Restaurant,Asian Restaurant,Camera Store,Noodle House,Pharmacy,Padangnese Restaurant,Fast Food Restaurant
10,Johar Baru,Indonesian Restaurant,Hotel,Pizza Place,Pharmacy,Furniture / Home Store,Seafood Restaurant,Coffee Shop,Restaurant,Convenience Store,Food Truck
11,Kemayoran,Hotel,Indonesian Restaurant,Indonesian Meatball Place,Asian Restaurant,Food Court,Seafood Restaurant,Convenience Store,Donut Shop,Comedy Club,Mosque
13,Senen,Indonesian Restaurant,Coffee Shop,Hotel,Café,Fast Food Restaurant,Bookstore,Asian Restaurant,Food Truck,Restaurant,Pizza Place
14,Tanah Abang,Coffee Shop,Hotel,Indonesian Restaurant,Japanese Restaurant,Restaurant,Lounge,Building,Italian Restaurant,Chinese Restaurant,Train Station
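
The claim that cluster 2 has the most members can be checked by counting labels rather than reading the tables. The following is a minimal sketch: it uses only the labels of the first five subdistricts from the merged table shown before the map, and the names `labels`, `cluster_sizes` and `largest_cluster` are illustrative. On the full data, `jakarta_merged['Cluster Labels']` would take the place of `labels`.

```python
import pandas as pd

# Cluster labels of the first five subdistricts in the merged table
labels = pd.Series(
    [2, 5, 2, 2, 4],
    index=["Cengkareng", "Grogol Petamburan", "Kalideres", "Kebon Jeruk", "Kembangan"],
    name="Cluster Labels",
)

# Number of subdistricts per cluster, ordered by cluster label
cluster_sizes = labels.value_counts().sort_index()

# Label of the cluster with the most members
largest_cluster = cluster_sizes.idxmax()

print(cluster_sizes)
print("Largest cluster:", largest_cluster)
```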


#### Cluster 3

Cluster 3 contains only one subdistrict, Koja, located in the north-east of the city. It probably forms its own group because of its low count of trending venues (15), so little can be gathered from this cluster.

In [33]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 3, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

Unnamed: 0,Subdistrict,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
38,Koja,Indonesian Restaurant,Indonesian Meatball Place,Pizza Place,Restaurant,Convenience Store,Bakery,High School,Government Building,Cosmetics Shop,Art Museum


#### Cluster 4

Cluster 4 shares Indonesian restaurants as a common venue with cluster 3, but generally has a higher count of convenience stores. Its subdistricts are more scattered, dispersed along the border surrounding cluster 2.

In [34]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 4, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

Unnamed: 0,Subdistrict,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Kembangan,Convenience Store,Food Court,Department Store,Park,Indonesian Restaurant,Music Venue,Spa,Plaza,Japanese Restaurant,Snack Place
16,Jagakarsa,Convenience Store,Department Store,College Residence Hall,Noodle House,Burger Joint,Other Great Outdoors,Soccer Field,Fast Food Restaurant,Seafood Restaurant,Food Truck
22,Pesanggrahan,Food Truck,Noodle House,Fast Food Restaurant,Indonesian Restaurant,Convenience Store,Food Court,Café,Grocery Store,Electronics Store,Indonesian Meatball Place
28,Duren Sawit,Indonesian Meatball Place,Fast Food Restaurant,Food Truck,Noodle House,Convenience Store,Salon / Barbershop,Ice Cream Shop,Indonesian Restaurant,Asian Restaurant,Seafood Restaurant
34,Pasar Rebo,Indonesian Restaurant,Grocery Store,Factory,Food Court,Fast Food Restaurant,Bakery,Seafood Restaurant,Café,Food,Baseball Stadium


#### Cluster 5

Cluster 5 contains a higher concentration of noodle houses and Chinese restaurants. Located in the north of Jakarta, this is likely the cluster for Chinese restaurants.

In [35]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 5, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

Unnamed: 0,Subdistrict,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Grogol Petamburan,Noodle House,Coffee Shop,Chinese Restaurant,Clothing Store,Seafood Restaurant,Indonesian Restaurant,Steakhouse,Asian Restaurant,Restaurant,Sushi Restaurant
6,Taman Sari,Chinese Restaurant,Noodle House,Asian Restaurant,Hotel,Seafood Restaurant,Bakery,Coffee Shop,Fast Food Restaurant,Hotel Bar,Steakhouse
7,Tambora,Chinese Restaurant,Noodle House,Asian Restaurant,Hotel,Coffee Shop,Bakery,Indonesian Restaurant,Food Truck,Massage Studio,Steakhouse
12,Sawah Besar,Chinese Restaurant,Noodle House,Hotel,Coffee Shop,Indonesian Restaurant,Seafood Restaurant,Café,Padangnese Restaurant,Restaurant,Steakhouse
35,Pulo Gadung,Noodle House,Indonesian Restaurant,Coffee Shop,Food Truck,Chinese Restaurant,Asian Restaurant,Café,Soup Place,Seafood Restaurant,Convenience Store
39,Pademangan,Seafood Restaurant,Hotel,Noodle House,Chinese Restaurant,Coffee Shop,Theme Park,Asian Restaurant,Theme Park Ride / Attraction,Indonesian Restaurant,Food Truck
40,Penjaringan,Seafood Restaurant,Chinese Restaurant,Noodle House,Coffee Shop,Café,Bakery,Indonesian Restaurant,Restaurant,Balinese Restaurant,Asian Restaurant
41,Tanjung Priok,Asian Restaurant,Noodle House,Chinese Restaurant,Indonesian Restaurant,Pizza Place,Café,Beach,Seafood Restaurant,Fast Food Restaurant,Massage Studio
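
The cluster descriptions above are read off the tables by eye. One way to back them with numbers is to tally how often each category appears among a cluster's most common venues. This is a minimal sketch, not part of the original analysis: `venue_tally` is a hypothetical helper, and the sample frame holds only the first three venue columns of the cluster 1 and cluster 3 rows listed above. It assumes the venue columns all end in "Most Common Venue", as they do in `jakarta_merged`.

```python
import pandas as pd

def venue_tally(merged, cluster, label_col="Cluster Labels"):
    """Count category occurrences across the 'Most Common Venue' columns of one cluster."""
    venue_cols = [c for c in merged.columns if c.endswith("Most Common Venue")]
    rows = merged.loc[merged[label_col] == cluster, venue_cols]
    # stack() flattens the selected cells into a single Series before counting
    return rows.stack().value_counts()

# First three venue columns for the cluster 1 and cluster 3 rows listed above
sample = pd.DataFrame({
    "Subdistrict": ["Cipayung", "Ciracas", "Makasar", "Koja"],
    "Cluster Labels": [1, 1, 1, 3],
    "1st Most Common Venue": ["Indonesian Restaurant", "Fast Food Restaurant",
                              "Golf Course", "Indonesian Restaurant"],
    "2nd Most Common Venue": ["Food & Drink Shop", "Asian Restaurant",
                              "Indonesian Restaurant", "Indonesian Meatball Place"],
    "3rd Most Common Venue": ["Noodle House", "Noodle House",
                              "Fast Food Restaurant", "Pizza Place"],
})

cluster1_tally = venue_tally(sample, 1)
print(cluster1_tally)
```

The tally ignores rank, so a category in 1st place counts the same as one in 10th place. Weighting by rank would be a reasonable refinement if the ordering matters for the comparison.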


# Results and Discussion<a id='results'></a>

The groupings produced by the K-means clustering algorithm tally with how Jakarta developed historically. Most of cluster 5, which has a high count of Chinese restaurants, lies on the north side of the city, matching the location of the city's Chinatown. Cluster 2, the dominant type of subdistrict, is located in the middle of the city, which also fits reality. The sparseness of trending venues in the north-east is consistent with that area being more industrial and therefore having fewer venues.

Using the FourSquare API has definite limitations. The limit of 100 venues per query might skew the results for the more densely populated subdistricts. Conversely, some subdistricts have so few venues that the data may be insufficient to determine their characteristics. The FourSquare user base may also be skewed towards foodies, which might explain the limited trending venues in the north-east of the city.
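
Both count-related limitations can be made visible by flagging subdistricts whose venue count hits the 100-venue cap or falls below a minimum. The sketch below is illustrative only. It assumes a long-format venue table with one row per venue and a `Subdistrict` column. The helper `flag_venue_counts`, the 20-venue threshold and the subdistricts "Example A" and "Example B" are all hypothetical. Only Koja's count of 15 comes from the analysis above.

```python
import pandas as pd

VENUE_LIMIT = 100  # per-query cap mentioned above
MIN_VENUES = 20    # hypothetical threshold for "too few venues"

def flag_venue_counts(venues, limit=VENUE_LIMIT, min_venues=MIN_VENUES):
    """Count venues per subdistrict and flag capped or sparse subdistricts."""
    counts = venues.groupby("Subdistrict").size().rename("Venue Count").to_frame()
    counts["Hit Limit"] = counts["Venue Count"] >= limit
    counts["Too Few"] = counts["Venue Count"] < min_venues
    return counts

# Toy data: Koja's 15 venues are from the analysis; the other two are made up
toy_venues = pd.DataFrame(
    {"Subdistrict": ["Koja"] * 15 + ["Example A"] * 100 + ["Example B"] * 40}
)

flags = flag_venue_counts(toy_venues)
print(flags)
```

Subdistricts flagged in either column could then be interpreted with more caution, or excluded before clustering.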

For most subdistricts, restaurants and coffee shops are the dominant venue types, with cluster 2 showing more variation in cuisine.

# Conclusion<a id='conclusion'></a>

A new Western restaurant may be best opened in cluster 5, where there are fewer such restaurants to compete with. Businesses that do not rely on foot traffic may choose to locate themselves in the north-east of Jakarta.