# IBM Data Science Coursera Capstone Project

# Table of Contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)
___

# Introduction: Business Problem<a id='introduction'></a>

Jakarta is the capital of Indonesia, with a population of 10.5 million, and is the heart of the second most populous metropolitan area in the world after Tokyo, Japan. Having hosted the Asian Games in 2018, it has seen heavy investment in transportation infrastructure, including the opening of Indonesia's first MRT line this year.

Given this, the city will likely see further growth, and information about the lay of the land will be invaluable for investors and entrepreneurs making strategic decisions about where to invest or where to locate business operations.

This project explores patterns among the subdistricts of Jakarta by grouping them into clusters, in order to identify existing trends within the city's neighborhoods. From there, recommendations can be made on which category of neighborhood is most suitable for opening a given type of venue.

The results of this project are aimed at entrepreneurs in general, but may be most useful to those in the food and beverage sector, where location can be the deciding factor for success.
___

# Data<a id='data'></a>

To analyze trends across Jakarta's subdistricts, the list of subdistricts is obtained from the [Subdistricts of Jakarta Wikipedia page](https://en.wikipedia.org/wiki/Subdistricts_of_Jakarta).

Venue queries are then made for each subdistrict using the Foursquare API. The resulting venue-category data is used to find commonalities between subdistricts, and the resulting clusters can provide insight into which type of venue is likely to thrive in which cluster. The K-means clustering algorithm is used to find patterns among the subdistricts.
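The per-subdistrict venue query can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the legacy Foursquare v2 `venues/explore` endpoint, and the function names, the `radius` value, the version date, the placeholder coordinates, and the hand-written sample payload are all hypothetical.

```python
from urllib.parse import urlencode

# Legacy Foursquare v2 "explore" endpoint (an assumption; the text only says "Foursquare API").
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(client_id, client_secret, lat, lon,
                      radius=1000, limit=100, version="20180605"):
    """Build a venue-explore request URL for one subdistrict centre."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, YYYYMMDD (assumed value)
        "ll": f"{lat},{lon}",    # latitude,longitude of the search centre
        "radius": radius,        # search radius in metres (assumed value)
        "limit": limit,          # the report mentions a 100-venue limit
    }
    return f"{EXPLORE_URL}?{urlencode(params)}"


def extract_venues(payload, subdistrict):
    """Flatten an explore response into (subdistrict, venue name, category) rows."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or [{"name": "Unknown"}]
            rows.append((subdistrict, venue["name"], categories[0]["name"]))
    return rows


# Hand-written payload in the shape of a v2 explore response; not real data.
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Noodle Bar", "categories": [{"name": "Noodle House"}]}},
        {"venue": {"name": "Example Coffee", "categories": [{"name": "Coffee Shop"}]}},
    ]}]}
}

# Placeholder credentials and coordinates, for illustration only.
url = build_explore_url("YOUR_ID", "YOUR_SECRET", -6.2, 106.8)
venue_rows = extract_venues(sample_payload, "Example Subdistrict")
```

In a real run, the URL would be fetched once per subdistrict and the extracted rows concatenated into a single venue table.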

In summary, the following data is required to meet the objective:

- Subdistricts of Jakarta
- Coordinates of these subdistricts
- Trending venues in each subdistrict
- Venue categories

___

# Methodology<a id='methodology'></a>

Because the objective is a general categorization of the subdistricts, the K-means clustering algorithm is used to assign each subdistrict of Jakarta to a cluster.

The venue dataframe is one-hot encoded, producing one column per venue category. The encoded rows are then grouped by subdistrict and aggregated to give a weighting of how often each venue type occurs in each subdistrict.
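The encoding and grouping step might look like the sketch below. The column names and the tiny `venues` table are hypothetical, and taking the mean of the indicator columns is an assumption about how the weighting is computed; it yields the share of each category within a subdistrict.

```python
import pandas as pd

# Hypothetical venue rows: one row per venue returned for a subdistrict.
venues = pd.DataFrame({
    "Subdistrict": ["A", "A", "A", "B", "B"],
    "Venue Category": ["Coffee Shop", "Coffee Shop", "Hotel", "Noodle House", "Hotel"],
})

# One-hot encode the category: one indicator column per venue category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Subdistrict", venues["Subdistrict"])

# Averaging the indicators per subdistrict gives the share of each category
# (assumed aggregation; the report only says the rows are "grouped").
grouped = onehot.groupby("Subdistrict").mean().reset_index()
print(grouped)
```

Using shares rather than raw counts keeps subdistricts with many venues from dominating the distance calculation used later by the clustering step.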

The encoded dataframe is further filtered down to the top venues before the K-means clustering algorithm is run over it. This returns a cluster label for each subdistrict. Each cluster is then inspected manually to characterize its contents.
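To make the ranking and clustering steps concrete, here is a small pure-Python sketch of both. It is an illustration under stated assumptions, not the notebook's implementation: the frequency values are invented, the helper names are hypothetical, and the K-means routine is a plain version of Lloyd's algorithm where a library implementation would normally be used. The sketch clusters on the full frequency vectors and uses the top-venue ranking only to describe each subdistrict.

```python
import random


def top_categories(freqs, n=3):
    """Return the n most frequent categories, highest share first."""
    ranked = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ranked[:n]]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns one cluster label (0..k-1) per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical per-subdistrict category shares (not taken from the report).
frequencies = {
    "A": {"Coffee Shop": 0.6, "Hotel": 0.3, "Noodle House": 0.1},
    "B": {"Coffee Shop": 0.5, "Hotel": 0.4, "Noodle House": 0.1},
    "C": {"Coffee Shop": 0.1, "Hotel": 0.1, "Noodle House": 0.8},
    "D": {"Coffee Shop": 0.0, "Hotel": 0.2, "Noodle House": 0.8},
}

categories = sorted({c for f in frequencies.values() for c in f})
names = list(frequencies)
points = [[frequencies[name][c] for c in categories] for name in names]

labels = kmeans(points, k=2)  # labels here are 0-based; the report numbers clusters from 1
cluster_of = dict(zip(names, labels))
top_a = top_categories(frequencies["A"], n=2)
```

Because the initial centroids are drawn at random, the numeric label given to a cluster is arbitrary; only the grouping itself is meaningful.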

Recommendations are made based on the clustering.

# Analysis<a id='analysis'></a>

## Map of Clusters

In [36]:
# imports needed by this cell
import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=jakarta["DKI Jakarta"], zoom_start=11)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(jakarta_merged['Latitude'], jakarta_merged['Longitude'], jakarta_merged['Subdistrict'], jakarta_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
    
map_clusters

## Examine Clusters

Examine each cluster in turn.

#### Cluster 1

Cluster 1 has a higher concentration of noodle houses and Chinese restaurants. Located in the north of Jakarta, it appears to be the cluster for Chinese restaurants.

In [37]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 1, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

| Index | Subdistrict | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Grogol Petamburan | Noodle House | Coffee Shop | Chinese Restaurant | Clothing Store | Asian Restaurant | Seafood Restaurant | Restaurant | Indonesian Restaurant | Steakhouse | Korean Restaurant |
| 6 | Taman Sari | Chinese Restaurant | Noodle House | Asian Restaurant | Hotel | Seafood Restaurant | Bakery | Coffee Shop | Fast Food Restaurant | Hotel Bar | Steakhouse |
| 7 | Tambora | Chinese Restaurant | Noodle House | Asian Restaurant | Hotel | Coffee Shop | Bakery | Food Truck | Indonesian Restaurant | Steakhouse | Massage Studio |
| 12 | Sawah Besar | Chinese Restaurant | Noodle House | Hotel | Seafood Restaurant | Coffee Shop | Indonesian Restaurant | BBQ Joint | Asian Restaurant | Café | Restaurant |
| 40 | Penjaringan | Seafood Restaurant | Chinese Restaurant | Noodle House | Coffee Shop | Café | Indonesian Restaurant | Bakery | Restaurant | Balinese Restaurant | Vegetarian / Vegan Restaurant |
| 41 | Tanjung Priok | Asian Restaurant | Chinese Restaurant | Noodle House | Seafood Restaurant | Indonesian Restaurant | Pizza Place | Beach | Massage Studio | Café | Restaurant |


#### Cluster 2

With the most members, cluster 2 contains a good number of coffee shops and hotels. Located in the central and southern parts of the city, it is expected to have a high concentration of places to hang out and stay.

In [38]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 2, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

| Index | Subdistrict | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Palmerah | Coffee Shop | Asian Restaurant | Indonesian Restaurant | Hotel | Convenience Store | Food Truck | Café | Chinese Restaurant | Pizza Place | Clothing Store |
| 8 | Cempaka Putih | Indonesian Restaurant | Pizza Place | Coffee Shop | Fast Food Restaurant | Hotel | Asian Restaurant | Café | Restaurant | Indonesian Meatball Place | Food Truck |
| 9 | Gambir | Hotel | Indonesian Restaurant | Coffee Shop | Seafood Restaurant | Asian Restaurant | Fast Food Restaurant | Camera Store | Chinese Restaurant | Restaurant | Soup Place |
| 10 | Johar Baru | Indonesian Restaurant | Hotel | Pharmacy | Seafood Restaurant | Pizza Place | Restaurant | Convenience Store | BBQ Joint | Furniture / Home Store | Coffee Shop |
| 11 | Kemayoran | Hotel | Indonesian Restaurant | Indonesian Meatball Place | Asian Restaurant | Food Court | Seafood Restaurant | Chinese Restaurant | Convenience Store | Cupcake Shop | Gym |
| 13 | Senen | Indonesian Restaurant | Coffee Shop | Hotel | Café | Fast Food Restaurant | Bookstore | Restaurant | Asian Restaurant | Bakery | Chinese Restaurant |
| 14 | Tanah Abang | Coffee Shop | Hotel | Indonesian Restaurant | Japanese Restaurant | Restaurant | Lounge | Building | Italian Restaurant | Chinese Restaurant | Breakfast Spot |
| 15 | Cilandak | Coffee Shop | Asian Restaurant | Café | Food Truck | Indonesian Restaurant | Fast Food Restaurant | Steakhouse | Motorcycle Shop | Chinese Restaurant | Bakery |
| 17 | Kebayoran Baru | Indonesian Restaurant | Coffee Shop | Japanese Restaurant | Hotel | Noodle House | Food Truck | Sushi Restaurant | BBQ Joint | Breakfast Spot | Burger Joint |
| 18 | Kebayoran Lama | Coffee Shop | Indonesian Restaurant | Clothing Store | Café | Bakery | Shopping Mall | Pizza Place | Sushi Restaurant | Japanese Restaurant | Dessert Shop |


#### Cluster 3

Containing only one subdistrict, cluster 3 forms its own group in the south of the city. This may be due to its low count of only 10 trending venues. Little can be inferred from this cluster.

In [39]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 3, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

| Index | Subdistrict | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | Jagakarsa | Convenience Store | Burger Joint | Food Truck | Theme Park | Restaurant | Noodle House | Fast Food Restaurant | College Residence Hall | Other Great Outdoors | Flower Shop |


#### Cluster 4

Like cluster 2, cluster 4 has Indonesian restaurants among its common venues, but it has a higher count of convenience stores overall. The subdistricts in this cluster are more scattered, dispersed around the border surrounding cluster 2.

In [40]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 4, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

| Index | Subdistrict | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Cengkareng | Noodle House | Asian Restaurant | Café | Pizza Place | Bar | Bistro | Seafood Restaurant | Indonesian Restaurant | Steakhouse | Gym |
| 2 | Kalideres | Noodle House | Asian Restaurant | Coffee Shop | Pizza Place | Fast Food Restaurant | Food Court | Food Truck | Convenience Store | Japanese Restaurant | Café |
| 3 | Kebon Jeruk | Asian Restaurant | Convenience Store | Indonesian Restaurant | Coffee Shop | Pizza Place | Steakhouse | Noodle House | Café | Seafood Restaurant | Fast Food Restaurant |
| 4 | Kembangan | Shop & Service | Japanese Restaurant | Snack Place | Soup Place | Gas Station | Plaza | Diner | Park | Convenience Store | Indonesian Restaurant |
| 22 | Pesanggrahan | Food Truck | Noodle House | Fast Food Restaurant | Indonesian Restaurant | Convenience Store | Food Court | Café | Electronics Store | Asian Restaurant | Bakery |
| 26 | Cipayung | Indonesian Restaurant | Convenience Store | Garden | Arcade | High School | Pizza Place | Asian Restaurant | Food & Drink Shop | Noodle House | Food Stand |
| 27 | Ciracas | Fast Food Restaurant | Coffee Shop | Noodle House | Convenience Store | Asian Restaurant | Indonesian Restaurant | Water Park | Rest Area | Garden | Dim Sum Restaurant |
| 28 | Duren Sawit | Indonesian Meatball Place | Fast Food Restaurant | Salon / Barbershop | Convenience Store | Food Truck | Noodle House | Indonesian Restaurant | Seafood Restaurant | Asian Restaurant | Department Store |
| 31 | Makasar | Indonesian Restaurant | Fast Food Restaurant | Golf Course | Noodle House | Pizza Place | Asian Restaurant | Bookstore | Monument / Landmark | Food Truck | Music Venue |
| 32 | Matraman | Indonesian Restaurant | Asian Restaurant | Pizza Place | Convenience Store | Coffee Shop | Pharmacy | Seafood Restaurant | Café | Fast Food Restaurant | Noodle House |


#### Cluster 5

Like cluster 3, cluster 5 contains only one subdistrict and forms its own group, located in the north-east of the city. This may be due to its low count of only 15 trending venues. Little can be inferred from this cluster.

In [41]:
jakarta_merged.loc[jakarta_merged['Cluster Labels'] == 5, jakarta_merged.columns[[0] + list(range(4, jakarta_merged.shape[1]))]]

| Index | Subdistrict | 1st Most Common Venue | 2nd Most Common Venue | 3rd Most Common Venue | 4th Most Common Venue | 5th Most Common Venue | 6th Most Common Venue | 7th Most Common Venue | 8th Most Common Venue | 9th Most Common Venue | 10th Most Common Venue |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | Koja | Indonesian Meatball Place | Indonesian Restaurant | Pizza Place | Restaurant | Convenience Store | Satay Restaurant | Bakery | High School | Government Building | Food & Drink Shop |


# Results and Discussion<a id='results'></a>

The groupings produced by the K-means clustering algorithm tally with how Jakarta has developed historically. Most of cluster 1, which contains a high count of Chinese restaurants, lies in the north of the city, which matches the location of its Chinatown. Cluster 2, the dominant type of subdistrict, is located in the middle of the city, which also fits reality. The north-eastern part being sparse in trending venues is consistent with the area being more industrial and therefore having fewer venues.

There are clear limitations to using the Foursquare API: the limit of 100 venues per query might skew the results for the more densely populated subdistricts. In addition, some subdistricts have so few venues that the data may be insufficient to determine their characteristics. The Foursquare user base may also be skewed toward food enthusiasts, which might explain the limited number of trending venues in the north-east of the city.
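One way to make these limitations visible is to flag the subdistricts whose venue counts hit the query cap or fall below a minimum. The sketch below assumes venue rows of the form `(subdistrict, category)`; the function name, the `MIN_VENUES` threshold, and the sample rows are hypothetical.

```python
from collections import Counter

VENUE_LIMIT = 100   # per-query cap mentioned in the text
MIN_VENUES = 20     # hypothetical threshold for "too few venues to characterize"


def flag_unreliable(venue_rows, limit=VENUE_LIMIT, minimum=MIN_VENUES):
    """Split subdistricts into those that hit the cap and those with sparse data."""
    counts = Counter(subdistrict for subdistrict, _category in venue_rows)
    capped = sorted(s for s, n in counts.items() if n >= limit)
    sparse = sorted(s for s, n in counts.items() if n < minimum)
    return capped, sparse


# Hypothetical rows, for illustration only.
rows = ([("Dense", "Coffee Shop")] * 100
        + [("Typical", "Hotel")] * 45
        + [("Quiet", "Convenience Store")] * 10)

capped, sparse = flag_unreliable(rows)
```

Subdistricts in `capped` may be under-sampled relative to their true venue mix, while those in `sparse` are candidates for single-member clusters like clusters 3 and 5.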

For most subdistricts, restaurants and coffee shops are the dominant venue types, with cluster 2 showing more variety in cuisine.

# Conclusion<a id='conclusion'></a>

A new Western restaurant may be best opened in cluster 5, where there are fewer such restaurants to compete with. Businesses that do not rely on foot traffic may choose to locate themselves in the north-east of Jakarta.