---
# Applied Data Science Capstone - The Battle of Neighborhoods


# **Plan Model for Decathlon Business Expansion in Singapore**
## (Segmenting Singapore)
----


1. [Introduction](#Introduction)
2. [Data](#Data)
3. [Methodology](#Methodology)
4. [Results](#Results)
5. [Discussion](#Discussion)
6. [Conclusion](#Conclusion)


-----
## Introduction
-------

**Decathlon** is a French sporting goods retailer. With over 1500 stores in 49 countries, it is the largest sporting goods retailer in the world. It stocks a wide range of sporting goods, from tennis rackets to advanced scuba diving equipment. Decathlon Group also owns over 20 brands and runs research and development facilities that develop the latest innovative designs, registering up to 40 patents per year. Each brand represents a different sport or group of sports, with a dedicated product development and design team.

In a recent global management discussion, it was decided to expand the company's presence in South East Asia, especially Singapore. Plans are underway to open more retail outlets across the island. To support this, it is necessary to understand each locality in Singapore better, based on the top outdoor/recreational activities available in the surrounding area, so that each planned outlet can sell the right products in the right place.

Marketing consultants have determined that special attention must be given to the top outdoor/recreational activities available in specific areas, so that the products and brands to emphasize in each area can be planned and advertised accordingly to reach the public effectively.

This project obtains information about neighborhoods in Singapore and makes recommendations that help Decathlon management understand each locality better when expanding the business.
  

------
## Data
--------

Required data can be gathered from:


- Singapore regions information can be obtained from **Wikipedia**: [Regions of Singapore](https://en.wikipedia.org/wiki/Regions_of_Singapore)

    - In order to recommend product brands for the new retail outlets, Singapore will be segmented by neighborhood.

    - The full list of neighborhoods can be obtained from the Wikipedia page on Regions of Singapore, but it provides only their names. They must be geolocated before Foursquare services can be used to obtain venues.



- For geolocation of neighborhoods, **Python geocoder** will be used.

    - Geocoder returns the latitude and longitude of every neighborhood center, which then serve as the main input to the Foursquare API.



- To obtain the top attractions/recreational facilities in each locality, we will use the **[FOURSQUARE](https://foursquare.com/) API.**

    - Using the services provided by Foursquare, we can obtain the top outdoor/recreational facilities for every neighborhood. These services require a geolocation as input, i.e. the latitude and longitude obtained in the previous step.
  
  



-----------
## Methodology
-----------

A Jupyter Notebook will be used to process the data and segment the neighborhoods in Singapore.

The following steps will be implemented:


**a) Build Neighborhoods List**  

A list of localities in Singapore, grouped by region, is obtained from the Wikipedia page. The list contains the names of the localities in every region.  
The output is a dataset of _"region,locality"_ pairs.
  
  
**b) Neighborhoods GeoLocation**  

Every element in the neighborhoods dataset is geolocated using _Python Geolocator_, and two new columns containing latitude and longitude are added for each locality.

The geolocator service has frequent connection issues that result in time-out errors. To handle this, the information obtained is saved as a text file in CSV format.

This step can therefore be run many times, invoking the geolocator only for missing data (requests that timed out in previous executions). After several executions the geolocation of every neighborhood is obtained, and the text file is used as the source of geolocation data.
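
The re-runnable step described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the function names, the CSV column names (`locality`, `latitude`, `longitude`) and the `lookup` callable are all assumptions. `lookup` stands in for whatever geocoding call is used and is expected to return a `(latitude, longitude)` pair, return `None`, or raise on a time-out.

```python
import csv
import os
from typing import Callable, Dict, Iterable, Optional, Tuple

Coords = Tuple[float, float]


def load_cache(path: str) -> Dict[str, Coords]:
    """Read previously geolocated localities from the CSV file, if it exists."""
    cache: Dict[str, Coords] = {}
    if os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                cache[row["locality"]] = (float(row["latitude"]),
                                          float(row["longitude"]))
    return cache


def save_cache(path: str, cache: Dict[str, Coords]) -> None:
    """Write every known locality and its coordinates back to the CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["locality", "latitude", "longitude"])
        for name, (lat, lon) in sorted(cache.items()):
            writer.writerow([name, lat, lon])


def geolocate_missing(localities: Iterable[str],
                      lookup: Callable[[str], Optional[Coords]],
                      path: str) -> Dict[str, Coords]:
    """Call `lookup` only for localities that are not in the cache yet."""
    cache = load_cache(path)
    for name in localities:
        if name in cache:
            continue  # already geolocated in an earlier run
        try:
            coords = lookup(name)
        except Exception:
            # Assumption: any failure (e.g. a time-out) means "retry next run".
            coords = None
        if coords is not None:
            cache[name] = coords
    save_cache(path, cache)
    return cache
```

Calling `geolocate_missing` repeatedly with the same file path fills in only the localities that failed before, so successful lookups are never repeated.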
  
  
**c) Venues Compilation (Using Foursquare API)**

As the next step, Foursquare API services are used to obtain venues for every neighborhood locality. The output is a new dataset with the top three outdoor/recreational places in every neighborhood locality.

A Foursquare Developer (Personal Account) with 99,500 regular calls per day is used to extract the required information. To minimize usage of Foursquare, the extracted information is saved in a text (CSV) file, on the assumption that it does not change over a short period of time. If the analysis needs to be redone after a long while, delete the existing files and regenerate them by calling the Foursquare API service.
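
One way to derive a "top three" table from the saved venue data is sketched below. It assumes the CSV has been loaded into a pandas DataFrame with one row per venue and the hypothetical columns `locality` and `category`, and it ranks categories by how often they occur in each locality. The ranking rule and the function name are illustrative assumptions, not necessarily what the notebook does.

```python
import pandas as pd


def top_venue_categories(venues: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return the n most frequent venue categories per locality, one row each."""
    counts = (
        venues.groupby(["locality", "category"])
        .size()
        .reset_index(name="count")
        # Most frequent first; ties are broken alphabetically by category.
        .sort_values(["locality", "count", "category"],
                     ascending=[True, False, True])
    )
    top = counts.groupby("locality").head(n).copy()
    top["rank"] = top.groupby("locality").cumcount() + 1
    # Columns 1..n hold the ranked categories; missing ranks are NaN.
    return top.pivot(index="locality", columns="rank", values="category")
```

Localities with fewer than `n` distinct categories simply have empty (NaN) cells in the later rank columns.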


**d) Neighborhoods Segmentation**  

The business case at hand is an unsupervised machine learning problem, because the localities carry no predefined labels, so the K-means clustering algorithm was chosen for the analysis.


 * Because the venue information obtained from Foursquare is categorical, it must be preprocessed before it can be handled by the K-means algorithm. For this, _"pandas.get_dummies"_ is used to create dummy variables.
 
 
 * The dummy variables are then grouped by neighborhood locality to form the features of each locality.
 
    - K-means is executed for a range of K values and the "Elbow Curve" is plotted in order to find the _best K_. From the change in the slope of the curve, K=15 is determined to be a good value.
    
    - The K-means algorithm is then executed with this K.
    
    - In the next step, a segmented data frame is built, composed of the top three venues for every neighborhood locality plus a segment label determined by the K-means algorithm.
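
The segmentation steps above can be sketched as follows, assuming a venues DataFrame with the hypothetical columns `locality` and `category` and using scikit-learn's `KMeans`. The function names, the use of the per-locality mean of the dummy variables as features, and the `n_init`/`seed` settings are illustrative assumptions rather than the notebook's exact choices.

```python
import pandas as pd
from sklearn.cluster import KMeans


def locality_features(venues: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode venue categories and average them per locality."""
    onehot = pd.get_dummies(venues["category"], dtype=float)
    # Each row becomes the share of a locality's venues in each category.
    return onehot.groupby(venues["locality"]).mean()


def elbow_inertias(features: pd.DataFrame, k_values, seed: int = 0) -> dict:
    """Fit K-means once per candidate K and collect the inertia for each."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed)
        .fit(features)
        .inertia_
        for k in k_values
    }


def segment(features: pd.DataFrame, k: int, seed: int = 0) -> pd.Series:
    """Run K-means with the chosen K and return one label per locality."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    return pd.Series(model.labels_, index=features.index, name="segment")
```

Plotting the values returned by `elbow_inertias` against K gives the elbow curve. The labels from `segment` can then be joined to the per-locality top-venue table to build the segmented data frame.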


**e) Segment Cluster Analysis**  

Every segment is listed individually and analyzed further to derive meaningful insights, as described in the next section.



----------
## Results
----------

The sports-product categories currently sold by **Decathlon** worldwide are listed below.


**CYCLING** (City Bikes, Hybrid Bikes, Kids Bikes, Mountain Bikes, Road Bikes, Triathlon)

**FITNESS SPORTS** (Bodybuilding, Cross Training, Dance, Fitness Cardio, Gymnastics, Pilates, Yoga)

**RUNNING SPORTS** (Athletics, Fitness & Nordic Walking, Running, Trail Running, Triathlon)

**URBAN SPORTS** (Inline Skates, Scooters, Skateboards)

**TEAM SPORTS** (Baseball, Basketball, Cricket, Floorball, Football, Handball, Rugby, US Football, Volleyball & Beach Volley)

**OUTDOOR SPORTS** (Camping, Canyoning, Climbing & Mountaineering, Fishing, Hiking & Trekking, Horse Riding, Ski & Snowboard, Wildlife Exploration)

**WATER SPORTS** (Aqua Aerobics, Kayak, Kites, Sailing, Scuba Diving, Snorkeling, Stand-Up Paddle (SUP), Surf & Bodyboard, Swimming, Water Sport Games)

**RACKET SPORTS** (Badminton, Squash, Table Tennis, Tennis, Turnball)

**TARGET SPORTS** (Archery, Billiards, Darts, Golf, Petanque)

**COMBAT SPORTS** (Boxing, Judo, Karate, Taekwondo)



The **suggested sports-product categories (top 3, in order of importance)** for the localities in each cluster, based on the segmentation result, are listed below.


**Cluster 1**  

1. TARGET SPORTS
2. FITNESS SPORTS
3. RUNNING SPORTS

**Cluster 2**

1. WATER SPORTS
2. RACKET SPORTS
3. OUTDOOR SPORTS

**Cluster 3**

1. FITNESS SPORTS
2. TEAM SPORTS
3. URBAN SPORTS

**Cluster 4**

1. OUTDOOR SPORTS
2. WATER SPORTS
3. FITNESS SPORTS

**Cluster 5**

1. WATER SPORTS
2. TEAM SPORTS
3. FITNESS SPORTS

**Cluster 6**

1. FITNESS SPORTS
2. OUTDOOR SPORTS
3. RUNNING SPORTS

**Cluster 7**

1. COMBAT SPORTS 
2. FITNESS SPORTS
3. TEAM SPORTS

**Cluster 8**

1. TARGET SPORTS 
2. FITNESS SPORTS
3. WATER SPORTS

**Cluster 9**

1. RUNNING SPORTS 
2. TEAM SPORTS
3. FITNESS SPORTS

**Cluster 10**

1. TEAM SPORTS 
2. RUNNING SPORTS
3. FITNESS SPORTS

**Cluster 11**

1. WATER SPORTS
2. OUTDOOR SPORTS
3. FITNESS SPORTS

**Cluster 12**

1. OUTDOOR SPORTS
2. CYCLING
3. FITNESS SPORTS

**Cluster 13**

1. TEAM SPORTS
2. FITNESS SPORTS
3. OUTDOOR SPORTS

**Cluster 14**

1. FITNESS SPORTS
2. RUNNING SPORTS
3. URBAN SPORTS

**Cluster 15**

1. RUNNING SPORTS
2. FITNESS SPORTS
3. OUTDOOR SPORTS


--------
## Discussion
--------

The objective of this project is to identify the localities where specific product brands and categories should be the focus. It also helps Decathlon management consider opening new outlets that serve a specific region based on its top suggested products.
  
By applying the K-means algorithm, it was possible to achieve this objective by segmenting the localities in Singapore based on their top venues, which helps in understanding the local market better.

We are now ready to present our findings to Decathlon management to help them make decisions about expanding their business in Singapore.
  

--------
## Conclusion
--------

We gathered data from trustworthy sources and applied a widely recognized clustering algorithm (K-means) to segment the localities and derive meaningful insights. We hope top management will consider these insights in their decision making with a considerable level of confidence.

The proposed approach can be applied in other domains as well, for example for baseline studies on competitor insight or market research analysis, depending on the requirement.
