# IBM Data Science Capstone Project

**By: Linh Bao Pham**

## A. Problem statement

Vietnam experienced rapid retail growth of +10% between 2013 and 2020. Because the country has the fastest-growing middle class in Southeast Asia, this strong growth in its retail sector is expected to continue.

![alt text](https://i.ibb.co/KyMNM6X/Retail-VN.png)

However, although Vietnam's retail sector holds immense potential for growth, competition is intense. Across the different store formats (commercial centres, supermarkets, grocery stores and convenience stores), domestic and foreign retail giants in Vietnam are battling for dominance through aggressive expansion strategies. At the same time, despite the rise of digital channels, physical channels continue to dominate the retail scene.

Two interesting questions therefore arise: (1) what is the current retail footprint of these retail giants in Vietnam, and (2) which province/city/district should be the next target for new stores? The answers matter to the retail giants already in Vietnam, as well as to any new players planning to enter the Vietnamese retail market.
* For retail giants in Vietnam, the answers provide critical input for deciding where to expand the store network, and possibly which current stores to close (because of the number of competing stores in the same area). Making these decisions with a data-driven approach allows these companies to position themselves well in the market and to capture new demand.
* For new players planning to enter the Vietnamese retail market, understanding the current retail footprint in Vietnam can generate valuable insights for their market entry strategy.


To answer the two questions above, this project aims to:
* Present the retail footprint of retail giants (both domestic and foreign companies) in Vietnam
* Cluster the provinces/districts based on the retail footprint (number of stores available) and census statistics (population, area, GDP per capita, economic growth, etc.)
* Interpret the resulting clusters to recommend the next city/province/district for store expansion, or to generate key insights on the retail footprint across Vietnam.


## B. Data and analytical approaches

### 1. Data

The following data was obtained for the analysis:
* **List of Vietnam's administrative divisions:** at provincial and district level. This information was obtained from the General Statistics Office of Vietnam (retrieved on April 4, 2020, from the [GSO database](https://www.gso.gov.vn/dmhc2015/)).
* **Census data:** each province's population, area, population density, Human Development Index and GDP per capita (retrieved on April 4, 2020, from [Wikipedia](https://vi.wikipedia.org/wiki/T%E1%BB%89nh_th%C3%A0nh_Vi%E1%BB%87t_Nam#cite_note-5)).
* **Geospatial coordinates:** polygon datasets containing the boundaries of provincial and district administrative divisions in Vietnam, from the [OpenDevelopmentMekong](https://data.opendevelopmentmekong.net/en/dataset/a-phn-huyn?type=dataset) database (retrieved on April 5, 2020). These datasets (in JSON format) are the input for visualizing the project results on a map of Vietnam.
* **Store lists of major retail chains:** the retail chains below were identified from the [Deloitte report on Vietnam Retail 2019](https://www2.deloitte.com/vn/en/pages/consumer-business/articles/vietnam-consumer-retail-2019.html), and their store lists were obtained from their main websites by web scraping. The result of the web scraping is a store list with (1) store name, (2) store address, and (3) latitude and longitude of each store.

![alt text](https://www.vir.com.vn/e-mag/images/demo.jpg)

 a. Vingroup: with  
    [VinMart](https://www.vincommerce.com/vinmart/he-thong-cua-hang) - supermarkets, and  
    [VinMart+](http://www.vinmartplus.vn/he-thong-cua-hang) - convenience stores (cvs)


 b. Saigon Co.Op: including  
    [Co.Op Mart](http://www.co-opmart.com.vn/lienhe/hethongcoopmart.aspx) - supermarkets,  
    [Co.Op Smile](https://momo.vn/thanh-toan-momo-coop-smile) - cvs, and  
    [Co.Op Food](http://www.saigonco-op.com.vn/linhvuchoatdong/banle/chuoicuahangCoopFood/chuoi-cua-hang-thuc-pham-coop-food_442.html) - grocery stores


 c. Satra: with  
    SatraMart - supermarkets, and  
    [SatraFood](https://satrafoods.com.vn/vn/cua-hang) - cvs
    
    
 d. Thegioididong: [Bachhoaxanh](https://www.bachhoaxanh.com/he-thong-sieu-thi) - grocery stores


 e. [7-Eleven](https://www.7-eleven.vn/cua-hang-7-eleven-viet-nam/): cvs (only in Ho Chi Minh City)


 f. Aeon: [MiniStop](http://ministop.vn/ms/all) - cvs (only in Ho Chi Minh City and Binh Duong province)


 g. [BigC](https://www.bigc.vn/en/store.html): supermarket chain


 h. Circle K: cvs


 i. [B's mart](http://www.bsmartvina.com/bsmart_store/en): cvs


 k. Mom-and-Pop shops and small retail stores: obtained via Foursquare API. 

*Note: for [Shop&Go](https://www.vir.com.vn/vingroup-to-acquire-shop-go-grocery-store-chain-for-1-66816.html), the chain was sold to VinGroup in April 2019, and [Auchan](https://www.vir.com.vn/saigon-coop-acquires-auchan-vietnam-68923.html) was acquired by Saigon Co.Op in July, 2019.*

With the store lists from the above sources, the number of stores (supermarkets, grocery stores and convenience stores) per province/district can be determined. Based on these store counts and basic census statistics, clustering analyses can then be conducted to understand the differences between clusters.
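
The counting step could be sketched as follows with pandas. This is a minimal illustration, not the project's actual pipeline: the `stores` table, its column names (`chain`, `store_type`, `province`) and its rows are hypothetical placeholders for the scraped store lists.

```python
import pandas as pd

# Hypothetical store list: column names and rows are illustrative only,
# standing in for the scraped store lists described above.
stores = pd.DataFrame(
    {
        "chain": ["VinMart", "VinMart+", "Co.Op Mart", "Circle K", "VinMart+"],
        "store_type": ["supermarket", "cvs", "supermarket", "cvs", "cvs"],
        "province": ["Ha Noi", "Ha Noi", "Ho Chi Minh", "Ho Chi Minh", "Ho Chi Minh"],
    }
)

# Count stores per province and store type, with one column per store type.
store_counts = (
    stores.groupby(["province", "store_type"])
    .size()
    .unstack(fill_value=0)
)

# Add the total number of stores per province.
store_counts["total"] = store_counts.sum(axis=1)
print(store_counts)
```

The resulting table has one row per province, so it can be joined with the census data on the province name before clustering.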

### 2. Analytical approach

For this project, the K-Means Clustering method will be used to profile provinces/districts based on their retail footprint and census statistics. According to [Towards Data Science](https://towardsdatascience.com/machine-learning-algorithms-part-9-k-means-example-in-python-f2ad05ed5203), K-Means Clustering is an unsupervised machine learning algorithm that attempts to group data without first being trained on labeled data. Since the purpose of this project is to understand the retail footprint across different administrative divisions in Vietnam, for which no predefined labels exist, this method was selected.
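
A minimal sketch of such a clustering with scikit-learn is shown below. The feature values are made up, and standardizing the features before clustering is an assumption of this sketch rather than a step stated in the project description.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per province,
# columns = [number of stores, population density]. Values are made up.
features = np.array(
    [
        [500.0, 4000.0],
        [450.0, 3800.0],
        [20.0, 150.0],
        [25.0, 200.0],
        [15.0, 120.0],
    ]
)

# Assumption of this sketch: standardize the features so that no single
# feature dominates the Euclidean distance because of its scale.
scaled = StandardScaler().fit_transform(features)

# Fit K-Means with two clusters and assign each province to a cluster.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
print(labels)
```

In this toy data the two store-dense provinces end up in one cluster and the three sparse provinces in the other; the numeric cluster ids themselves are arbitrary.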

To identify the optimal number of clusters, we will evaluate the relationship between the number of clusters and the Within-Cluster Sum of Squares (WCSS), and select the number of clusters at which the decrease in WCSS begins to level off (the elbow method). An example elbow plot:

![alt text](https://miro.medium.com/max/1400/1*vLTnh9xdgHvyC8WDNwcQQw.png)
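
The elbow method can be sketched as a loop over candidate cluster counts. The example below uses synthetic data with three well-separated groups instead of the project data, and relies on scikit-learn's `inertia_` attribute, which holds the sum of squared distances of samples to their closest cluster centre.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three well-separated groups (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit K-Means for k = 1..6 and record the WCSS of each fit.
candidate_ks = range(1, 7)
wcss = []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # inertia_ is scikit-learn's name for WCSS

for k, value in zip(candidate_ks, wcss):
    print(k, round(value, 1))
```

Plotting `wcss` against `candidate_ks` gives the elbow curve. For this synthetic data the drop in WCSS is large up to three clusters and small afterwards, so the elbow sits at k = 3.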

WCSS is defined as the sum of the squared distances between each member of a cluster and that cluster's centroid, summed over all clusters.

![alt text](https://miro.medium.com/max/610/1*bgpKrYZIVBuDirYk0JMnGg.png)
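
To make the definition concrete, the sketch below computes WCSS by hand with NumPy for four points on a line. The function name `within_cluster_sum_of_squares` and the data are hypothetical and chosen only for illustration.

```python
import numpy as np

def within_cluster_sum_of_squares(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    diffs = X - centroids[labels]  # pair each point with its own centroid
    return float(np.sum(diffs ** 2))

# Four points in two clusters (illustrative values).
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [14.0, 0.0]])
labels = np.array([0, 0, 1, 1])

# Each centroid is the mean of the points assigned to that cluster.
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])

total = within_cluster_sum_of_squares(X, labels, centroids)
print(total)  # 1 + 1 + 4 + 4 = 10.0
```

The centroids here are (1, 0) and (12, 0), so the points lie at distances 1, 1, 2 and 2 from their centroids, which gives the squared sum of 10.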
