# Data Science Capstone Project 

## Battle of the Neighborhoods - Singapore: Easties vs Westies 

### Problem Statement

Which region is the better region in Singapore to live in? Is it the East region or the West region? 

For someone looking to buy and move into a new home in Singapore, can we recommend a region based on their personal preferences?

### Background 

Singaporeans have debated for years whether the East or the West region of Singapore is the better one to live in. The debate, known as "Easties vs Westies", usually centres on attributes of the neighborhoods in the two regions, such as the number of food options, the number of amenities, the accessibility of key points of interest, etc.


### Data Description

Data requirements for this project:
- Name, coordinates and area of the neighborhoods in the West and East regions of Singapore (to be obtained via web scraping)
- Name and categories of venues in each of the neighborhoods (to be obtained using the Foursquare API)

How is the data used to answer the question:
- Information on the neighborhoods in the West and East region is required as input to the Foursquare API
- Output of the API call will then give us a list of venues and the corresponding category for each venue
- Group the venues based on the neighborhood to calculate the profile of the neighborhood (weightage of each category in the neighborhood)
- Given a person's preferences for the different categories, we can then recommend the best region and neighborhood for the person 
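
The last step above can be sketched as a weighted score over the neighborhood profiles. This is only a minimal illustration under assumptions, not the method implemented later in this notebook: the profile values, the `preferences` dictionary and the `recommend` helper are all made up for the example.

```python
import pandas as pd

# Hypothetical neighborhood profiles: share of venues in each category.
profiles = pd.DataFrame(
    {
        'Neighborhood': ['A', 'B', 'C'],
        'Region': ['East', 'West', 'West'],
        'Coffee Shop': [0.30, 0.10, 0.05],
        'Park': [0.05, 0.20, 0.10],
        'Food Court': [0.10, 0.10, 0.40],
    }
)

def recommend(profiles, preferences):
    """Rank neighborhoods by the preference-weighted sum of their category shares."""
    weights = pd.Series(preferences)
    # Categories missing from the profile table contribute nothing to the score.
    shares = profiles.reindex(columns=weights.index, fill_value=0.0)
    scored = profiles[['Neighborhood', 'Region']].copy()
    scored['Score'] = shares.mul(weights, axis=1).sum(axis=1)
    return scored.sort_values('Score', ascending=False).reset_index(drop=True)

# Hypothetical preferences: this person cares most about parks.
preferences = {'Coffee Shop': 1.0, 'Park': 3.0}
ranking = recommend(profiles, preferences)
print(ranking)
```

With these made-up numbers, neighborhood B ranks first because its park share dominates once parks are weighted three times as heavily as coffee shops.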

### Import libraries

In [1]:
import pandas as pd
import numpy as np
import math
import requests
import types
from botocore.client import Config
import ibm_boto3
!pip install folium
import folium

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

# Import StandardScaler required to normalize the dataset
from sklearn.preprocessing import StandardScaler

print('Libraries imported')

Libraries imported


### Import neighborhood data

In [2]:
# The code was removed by Watson Studio for sharing.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18 entries, 0 to 17
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Neighborhood  18 non-null     object 
 1   Region        18 non-null     object 
 2   Area          18 non-null     float64
 3   Population    18 non-null     int64  
 4   Density       18 non-null     float64
 5   Latitude      18 non-null     float64
 6   Longitude     18 non-null     float64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.1+ KB


In [3]:
# Drop rows where population = 0
df_neighborhoods = df_neighborhoods[df_neighborhoods['Population']!=0].reset_index(drop=True)
df_neighborhoods

,Neighborhood,Region,Area,Population,Density,Latitude,Longitude
0,Boon Lay,West,8.23,30,3.6,1.3386,103.7058
1,Bukit Batok,West,11.13,139270,12513.0,1.359,103.7637
2,Bukit Panjang,West,8.99,139030,15466.7,1.3774,103.7719
3,Choa Chu Kang,West,6.11,174330,28513.2,1.384,103.747
4,Clementi,West,9.49,91630,9650.3,1.3162,103.7649
5,Jurong East,West,17.83,84980,4766.9,1.3329,103.7436
6,Jurong West,West,14.69,272660,18563.5,1.3404,103.709
7,Pioneer,West,12.1,100,8.3,1.3376,103.6974
8,Tengah,West,7.4,10,1.4,1.3555,103.7308
9,Tuas,West,30.04,70,2.3,1.2949,103.6305


### Get Venues Info using Foursquare API

#### Input API credentials

In [4]:
# Foursquare API credentials (replace the placeholders with your own)
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
VERSION = '20190605' # Foursquare API version
LIMIT = 200 # A default Foursquare API limit value


#### Create a function to get nearby venues 

In [5]:
def getNearbyVenues(names, latitudes, longitudes, radius=1000):
    """Return a DataFrame of Foursquare venues within `radius` metres of each neighborhood."""
    url = 'https://api.foursquare.com/v2/venues/explore'

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # build the query parameters for the API request
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'limit': LIMIT,
        }

        # make the GET request and stop on HTTP errors instead of failing on a missing key
        response = requests.get(url, params=params)
        response.raise_for_status()
        results = response.json()['response']['groups'][0]['items']

        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-neighborhood lists into one DataFrame
    nearby_venues = pd.DataFrame(
        [item for venue_list in venues_list for item in venue_list],
        columns=['Neighborhood',
                 'Neighborhood Latitude',
                 'Neighborhood Longitude',
                 'Venue',
                 'Venue Latitude',
                 'Venue Longitude',
                 'Venue Category'])

    return nearby_venues
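
One fragile spot in this function is `v['venue']['categories'][0]`, which raises an `IndexError` if a venue comes back without any category. Below is a minimal sketch of a more defensive parsing step, run against a hand-written response fragment rather than a real API call. The `parse_items` helper and the sample data are hypothetical; only the nesting of `venue`, `location` and `categories` follows the fields accessed in the function above.

```python
import pandas as pd

def parse_items(name, lat, lng, items):
    """Flatten Foursquare-style items into rows; venues without a category get None."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        rows.append({
            'Neighborhood': name,
            'Neighborhood Latitude': lat,
            'Neighborhood Longitude': lng,
            'Venue': venue['name'],
            'Venue Latitude': venue['location']['lat'],
            'Venue Longitude': venue['location']['lng'],
            'Venue Category': categories[0]['name'] if categories else None,
        })
    return pd.DataFrame(rows)

# Hand-written sample: the first item mirrors a Starbucks row from this notebook's
# venue output, the second is a made-up venue with an empty category list.
sample_items = [
    {'venue': {'name': 'Starbucks',
               'location': {'lat': 1.338906, 'lng': 103.705642},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Uncategorised Venue',
               'location': {'lat': 1.3390, 'lng': 103.7060},
               'categories': []}},
]

parsed = parse_items('Boon Lay', 1.3386, 103.7058, sample_items)
print(parsed[['Venue', 'Venue Category']])
```

Rows with a missing category could then be dropped or labelled before one-hot encoding, depending on how they should count towards the neighborhood profile.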

#### Use the function to get the nearby venue information for all the neighborhoods in df_neighborhoods

In [6]:
names = df_neighborhoods['Neighborhood']
latitudes = df_neighborhoods['Latitude']
longitudes = df_neighborhoods['Longitude']

df_venues = getNearbyVenues(names, latitudes, longitudes)
print(df_venues.shape)
df_venues.head()

Boon Lay
Bukit Batok
Bukit Panjang
Choa Chu Kang
Clementi
Jurong East
Jurong West
Pioneer
Tengah
Tuas
Western Water Catchment
Bedok
Changi
Pasir Ris
Paya Lebar
Tampines
(935, 7)


,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Boon Lay,1.3386,103.7058,Carl's Jr.,1.340575,103.706384,Fast Food Restaurant
1,Boon Lay,1.3386,103.7058,Subway,1.340278,103.706548,Sandwich Place
2,Boon Lay,1.3386,103.7058,Din Tai Fung 鼎泰豐 (Din Tai Fung),1.339029,103.705765,Chinese Restaurant
3,Boon Lay,1.3386,103.7058,Ya Kun Kaya Toast,1.339619,103.705786,Breakfast Spot
4,Boon Lay,1.3386,103.7058,Starbucks,1.338906,103.705642,Coffee Shop


In [7]:
df_venues.groupby('Neighborhood').count()

Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Bedok,98,98,98,98,98,98
Boon Lay,70,70,70,70,70,70
Bukit Batok,50,50,50,50,50,50
Bukit Panjang,47,47,47,47,47,47
Changi,40,40,40,40,40,40
Choa Chu Kang,44,44,44,44,44,44
Clementi,92,92,92,92,92,92
Jurong East,95,95,95,95,95,95
Jurong West,85,85,85,85,85,85
Pasir Ris,56,56,56,56,56,56


## Most common venue categories in the neighborhoods

#### One-Hot encode the venues category to flatten the DataFrame

In [8]:
# one hot encoding
venues_cat_onehot = pd.get_dummies(df_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
venues_cat_onehot['Neighborhood'] = df_venues['Neighborhood'] 

# Merge region to venues_cat_onehot
regions = df_neighborhoods[['Neighborhood', 'Region']]
venues_cat_onehot = venues_cat_onehot.merge(regions, how='inner', on='Neighborhood')

# move neighborhood column to the first column
fixed_columns = ['Neighborhood', 'Region'] + list(venues_cat_onehot.columns[:-2])
venues_cat_onehot = venues_cat_onehot[fixed_columns]

venues_cat_onehot.head()

,Neighborhood,Region,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Apres Ski Bar,...,Thrift / Vintage Store,Track,Trail,Tunnel,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Water Park,Wings Joint
0,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Group by neighborhood and calculate weightage of each category by taking the mean of the frequency of occurrence of each category

In [9]:
neigh_venues_cat_grouped = venues_cat_onehot.groupby(['Neighborhood', 'Region']).mean().reset_index()
neigh_venues_cat_grouped.head()

,Neighborhood,Region,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Apres Ski Bar,...,Thrift / Vintage Store,Track,Trail,Tunnel,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Water Park,Wings Joint
0,Bedok,East,0.0,0.0,0.0,0.0,0.0,0.0,0.010204,0.0,...,0.010204,0.0,0.0,0.0,0.020408,0.0,0.0,0.0,0.0,0.010204
1,Boon Lay,West,0.0,0.0,0.0,0.0,0.0,0.0,0.014286,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.014286
2,Bukit Batok,West,0.0,0.0,0.0,0.0,0.0,0.0,0.02,0.02,...,0.02,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0
3,Bukit Panjang,West,0.0,0.0,0.0,0.0,0.0,0.0,0.021277,0.0,...,0.0,0.0,0.021277,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Changi,East,0.025,0.05,0.025,0.075,0.1,0.05,0.0,0.0,...,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0
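
To see why the mean of the one-hot columns can be read as a weightage, note that averaging 0/1 indicators within a group gives the share of that group's venues in each category. A minimal sketch on made-up toy data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Toy venue list (hypothetical): 4 venues in X, 2 venues in Y.
toy = pd.DataFrame({
    'Neighborhood': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Gym', 'Park', 'Park'],
})

# One-hot encode, then average the indicators per neighborhood.
onehot = pd.get_dummies(toy['Venue Category'], dtype=int)
onehot.insert(0, 'Neighborhood', toy['Neighborhood'])
weights = onehot.groupby('Neighborhood').mean()

# The same numbers computed directly as row-normalised counts.
direct = pd.crosstab(toy['Neighborhood'], toy['Venue Category'], normalize='index')

print(weights)
```

In the toy data, half of X's venues are cafes, so its `Cafe` weight is 0.5, and each row of `weights` sums to 1.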


#### Calculate no. of unique categories for each neighborhood as another feature in our dataset

In [10]:
# Instantiate a new dataframe with column for number of unique categories
#df_unique_cats = pd.DataFrame(columns=['Neighborhood', 'No. of Unique Categories'])
#df_unique_cats['Neighborhood'] = df_neighborhoods['Neighborhood']
#df_unique_cats.set_index('Neighborhood', inplace=True)

# For each neighborhood, count the number of unique categories for each neighborhood
#for neighborhood in df_venues['Neighborhood'].unique():
#    df_neighborhood_venues = df_venues[df_venues['Neighborhood']==neighborhood]
#    unique_cats_count = df_neighborhood_venues['Venue Category'].nunique()
#    df_unique_cats.loc[neighborhood, 'No. of Unique Categories'] = unique_cats_count
#df_unique_cats.reset_index(inplace=True)

# Merge df_unique_cats to neigh_venues_cat_grouped
#neigh_venues_cat_grouped = neigh_venues_cat_grouped.merge(df_unique_cats, on='Neighborhood', how='left')

#print(neigh_venues_cat_grouped.head())
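
If this feature were re-enabled, the commented-out loop could likely be replaced by a single `groupby(...).nunique()` call. The sketch below runs on a made-up stand-in for `df_venues`; the `toy_venues` frame and its values are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_venues (hypothetical values).
toy_venues = pd.DataFrame({
    'Neighborhood': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'Venue Category': ['Cafe', 'Cafe', 'Park', 'Gym', 'Park', 'Bar'],
})

# Count distinct venue categories per neighborhood in one pass.
unique_cats = (
    toy_venues.groupby('Neighborhood')['Venue Category']
    .nunique()
    .rename('No. of Unique Categories')
    .reset_index()
)
print(unique_cats)
```

The resulting frame has one row per neighborhood and could be merged on `Neighborhood` in the same way as the commented-out code intends.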

#### Define a function to sort the venues

In [11]:
def return_most_common_venues(row, num_top_venues, col_num):
    row_categories = row.iloc[col_num:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
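
A quick check of how `col_num` works: it is the number of leading non-category entries to skip before sorting. The row below is made up for illustration and is shaped like one row of `neigh_venues_cat_grouped`, where `Neighborhood` and `Region` occupy the first two positions.

```python
import pandas as pd

def return_most_common_venues(row, num_top_venues, col_num):
    # Skip the first col_num entries, then sort the remaining weights descending.
    row_categories = row.iloc[col_num:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]

# Hypothetical row: two label entries followed by three category weights.
row = pd.Series({'Neighborhood': 'X', 'Region': 'East',
                 'Cafe': 0.5, 'Gym': 0.1, 'Park': 0.4})

top2 = return_most_common_venues(row, 2, 2)
print(list(top2))  # ['Cafe', 'Park']
```

If `col_num` is larger than the number of label entries, the first category columns are silently left out of the ranking, so it should match the layout of the frame being passed in.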

#### Use function on each neighborhood to get the top 10 venue categories for each neighborhood

In [12]:
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood', 'Region']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = neigh_venues_cat_grouped['Neighborhood']
neighborhoods_venues_sorted['Region'] = neigh_venues_cat_grouped['Region']

for ind in np.arange(neigh_venues_cat_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 2:] = return_most_common_venues(neigh_venues_cat_grouped.iloc[ind, :], num_top_venues, 2)

neighborhoods_venues_sorted

,Neighborhood,Region,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bedok,East,Chinese Restaurant,Food Court,Coffee Shop,Bakery,Supermarket,Asian Restaurant,Fast Food Restaurant,Noodle House,Japanese Restaurant,Sandwich Place
1,Boon Lay,West,Japanese Restaurant,Asian Restaurant,Fast Food Restaurant,Chinese Restaurant,Dessert Shop,Coffee Shop,Indian Restaurant,Food Court,Convenience Store,Playground
2,Bukit Batok,West,Italian Restaurant,Coffee Shop,Supermarket,Historic Site,Indian Restaurant,Chinese Restaurant,Café,Bus Station,Shopping Mall,Gym / Fitness Center
3,Bukit Panjang,West,Food Court,Gym,Park,Coffee Shop,Fast Food Restaurant,Sushi Restaurant,Noodle House,Asian Restaurant,Supermarket,Shopping Mall
4,Changi,East,Coffee Shop,Airport Service,Airport Lounge,Airport,Airport Terminal,BBQ Joint,Fast Food Restaurant,Ice Cream Shop,Public Art,Noodle House
5,Choa Chu Kang,West,Fast Food Restaurant,Coffee Shop,Playground,Sushi Restaurant,Asian Restaurant,Food Court,Gym / Fitness Center,Café,Sandwich Place,Miscellaneous Shop
6,Clementi,West,Food Court,Bakery,Chinese Restaurant,Coffee Shop,Asian Restaurant,Dessert Shop,Dim Sum Restaurant,Supermarket,Gym,Café
7,Jurong East,West,Coffee Shop,Café,Japanese Restaurant,Chinese Restaurant,Food Court,Bus Station,Shopping Mall,Playground,Clothing Store,Multiplex
8,Jurong West,West,Japanese Restaurant,Asian Restaurant,Fast Food Restaurant,Chinese Restaurant,Food Court,Dessert Shop,Playground,Indian Restaurant,Coffee Shop,Shopping Mall
9,Pasir Ris,East,Food Court,Fast Food Restaurant,Asian Restaurant,Pool,Park,Sushi Restaurant,Bus Line,Italian Restaurant,Coffee Shop,Playground


## Clustering Neighborhoods

#### Use k-means clustering to cluster neighborhoods

In [13]:
# Define number of clusters
k_clusters = 6

# Drop neighborhood column from neigh_venues_cat_grouped
neighborhoods_venues_clustering = neigh_venues_cat_grouped.drop(['Neighborhood', 'Region'], axis=1)

# instantiate kmeans model and cluster neighborhood
kmeans = KMeans(n_clusters=k_clusters, random_state=0).fit(neighborhoods_venues_clustering)

# Print cluster labels for all neighborhoods
kmeans.labels_

array([1, 0, 1, 2, 5, 2, 0, 1, 0, 2, 0, 0, 1, 0, 4, 3], dtype=int32)
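
The value `k_clusters = 6` is fixed by hand in this notebook. One common sanity check, which is not performed here, is to compare the within-cluster sum of squares (`inertia_`) across several values of k and look for the point where it stops dropping sharply. The sketch below uses synthetic two-dimensional data with three obvious groups, not the neighborhood profiles, so the numbers say nothing about the right k for this dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption for illustration): three well-separated groups.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.2, size=(10, 2)) for c in centers])

# Fit k-means for a range of k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

On this synthetic data the inertia falls steeply up to k = 3 and only marginally afterwards; applying the same loop to `neighborhoods_venues_clustering` would be one way to check whether 6 is a reasonable choice.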

#### Add cluster labels to DataFrame containing the top 10 venue categories 

In [14]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

#### Add coordinates to the DataFrame containing the top 10 venue categories and cluster labels

In [15]:
# Merge coordinates data and cluster labels into a new dataframe
neighborhoods_venues_merged = neighborhoods_venues_sorted.merge(df_neighborhoods[['Neighborhood', 'Latitude', 'Longitude']], on='Neighborhood', how='left').set_index('Neighborhood', drop=True)
neighborhoods_venues_merged

Neighborhood,Cluster Labels,Region,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue,Latitude,Longitude
Bedok,1,East,Chinese Restaurant,Food Court,Coffee Shop,Bakery,Supermarket,Asian Restaurant,Fast Food Restaurant,Noodle House,Japanese Restaurant,Sandwich Place,1.3236,103.9273
Boon Lay,0,West,Japanese Restaurant,Asian Restaurant,Fast Food Restaurant,Chinese Restaurant,Dessert Shop,Coffee Shop,Indian Restaurant,Food Court,Convenience Store,Playground,1.3386,103.7058
Bukit Batok,1,West,Italian Restaurant,Coffee Shop,Supermarket,Historic Site,Indian Restaurant,Chinese Restaurant,Café,Bus Station,Shopping Mall,Gym / Fitness Center,1.359,103.7637
Bukit Panjang,2,West,Food Court,Gym,Park,Coffee Shop,Fast Food Restaurant,Sushi Restaurant,Noodle House,Asian Restaurant,Supermarket,Shopping Mall,1.3774,103.7719
Changi,5,East,Coffee Shop,Airport Service,Airport Lounge,Airport,Airport Terminal,BBQ Joint,Fast Food Restaurant,Ice Cream Shop,Public Art,Noodle House,1.345,103.9832
Choa Chu Kang,2,West,Fast Food Restaurant,Coffee Shop,Playground,Sushi Restaurant,Asian Restaurant,Food Court,Gym / Fitness Center,Café,Sandwich Place,Miscellaneous Shop,1.384,103.747
Clementi,0,West,Food Court,Bakery,Chinese Restaurant,Coffee Shop,Asian Restaurant,Dessert Shop,Dim Sum Restaurant,Supermarket,Gym,Café,1.3162,103.7649
Jurong East,1,West,Coffee Shop,Café,Japanese Restaurant,Chinese Restaurant,Food Court,Bus Station,Shopping Mall,Playground,Clothing Store,Multiplex,1.3329,103.7436
Jurong West,0,West,Japanese Restaurant,Asian Restaurant,Fast Food Restaurant,Chinese Restaurant,Food Court,Dessert Shop,Playground,Indian Restaurant,Coffee Shop,Shopping Mall,1.3404,103.709
Pasir Ris,2,East,Food Court,Fast Food Restaurant,Asian Restaurant,Pool,Park,Sushi Restaurant,Bus Line,Italian Restaurant,Coffee Shop,Playground,1.3721,103.9474


#### Create map for visualization of the neighborhood clustering

In [22]:
# create map
sg_latitude = 1.3521
sg_longitude = 103.8198
map_clusters = folium.Map(location=[sg_latitude, sg_longitude], zoom_start=12)

# set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# note: rainbow[cluster-1] maps cluster 0 to the last color (index -1) and cluster 1 to the first
for lat, lon, poi, cluster in zip(neighborhoods_venues_merged['Latitude'],
                                  neighborhoods_venues_merged['Longitude'],
                                  neighborhoods_venues_merged.index,
                                  neighborhoods_venues_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

#### Deep-diving into the clusters' characteristics

In [20]:
# Label the onehot-encoded dataframe with cluster label
df_neigh_cluster = neighborhoods_venues_sorted[['Neighborhood', 'Cluster Labels']]
venues_cat_onehot_labelled = venues_cat_onehot.merge(df_neigh_cluster, on='Neighborhood', how='left')
venues_cat_onehot_labelled.head()

,Neighborhood,Region,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Apres Ski Bar,...,Track,Trail,Tunnel,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Water Park,Wings Joint,Cluster Labels
0,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9,Boon Lay,West,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Group rows by the cluster label; numeric_only=True skips the text columns (Neighborhood, Region)
cluster_venues_onehot = venues_cat_onehot_labelled.groupby('Cluster Labels').mean(numeric_only=True)
cluster_venues_onehot

Cluster Labels,Accessories Store,Airport,Airport Food Court,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,Apres Ski Bar,Arts & Crafts Store,Asian Restaurant,...,Thrift / Vintage Store,Track,Trail,Tunnel,Vegetarian / Vegan Restaurant,Video Game Store,Video Store,Vietnamese Restaurant,Water Park,Wings Joint
0,0.0,0.0,0.0,0.0,0.0,0.0,0.007059,0.0,0.002353,0.075294,...,0.0,0.0,0.002353,0.0,0.007059,0.002353,0.0,0.002353,0.0,0.007059
1,0.003185,0.0,0.0,0.0,0.0,0.0,0.009554,0.003185,0.003185,0.015924,...,0.006369,0.003185,0.006369,0.0,0.009554,0.0,0.003185,0.0,0.0,0.003185
2,0.0,0.0,0.0,0.0,0.0,0.0,0.006803,0.0,0.0,0.054422,...,0.0,0.0,0.006803,0.0,0.0,0.0,0.0,0.0,0.006803,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.025,0.05,0.025,0.075,0.1,0.05,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0


## Results section

In [23]:
# List top 10 common venue categories in each cluster
num_top_venues = 10
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Cluster']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
cluster_venues_top10 = pd.DataFrame(columns=columns)
cluster_venues_top10['Cluster'] = cluster_venues_onehot.index

# the cluster label is the index here, so the category columns start at position 0
for ind in np.arange(cluster_venues_onehot.shape[0]):
    cluster_venues_top10.iloc[ind, 1:] = return_most_common_venues(cluster_venues_onehot.iloc[ind, :], num_top_venues, 0)

cluster_venues_top10

,Cluster,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,0,Asian Restaurant,Chinese Restaurant,Food Court,Fast Food Restaurant,Coffee Shop,Japanese Restaurant,Dessert Shop,Indian Restaurant,Supermarket,Grocery Store
1,1,Coffee Shop,Chinese Restaurant,Bus Station,Food Court,Café,Japanese Restaurant,Fast Food Restaurant,Shopping Mall,Playground,Supermarket
2,2,Fast Food Restaurant,Food Court,Coffee Shop,Asian Restaurant,Park,Gym,Sushi Restaurant,Playground,Pool,Sandwich Place
3,3,Food Court,Bus Station,Flower Shop,Gun Range,Wings Joint,Furniture / Home Store,Fried Chicken Joint,French Restaurant,Food Truck,Food Service
4,4,Food Court,Gas Station,Food Truck,Asian Restaurant,Wings Joint,Filipino Restaurant,Fried Chicken Joint,French Restaurant,Food Service,Flower Shop
5,5,Airport Service,Coffee Shop,Airport Lounge,Airport,Airport Terminal,BBQ Joint,Fast Food Restaurant,Convenience Store,Sandwich Place,Public Art


## Observations and Discussion

#### Interesting observations
Observation 1: Neighborhoods in both regions that are closer to the city (central area) are similar to each other, as shown by the purple (cluster 1), red (cluster 0) and blue (cluster 2) markers on the map.\
Observation 2: Neighborhoods at the extreme ends of Singapore are highly distinctive.

#### Based on the coordinates of the neighborhoods within each cluster and the top 10 most common venue categories in each cluster, we can describe each cluster as follows:
Cluster 0: Neighborhoods with a huge variety of restaurants\
Cluster 1: Neighborhoods with good bus network accessibility and a good variety of restaurants and cafes\
Cluster 2: Neighborhoods with a well-balanced profile of food establishments and outdoor facilities\
Cluster 3: Neighborhoods with military-related facilities and furniture stores in close proximity, but with limited food options\
Cluster 4: Neighborhoods with cheap food options and great accessibility to gas stations\
Cluster 5: Neighborhoods with the airport in close proximity

#### Back to the problem statement - Which region is the better region in Singapore to live in? Is it the East region or the West region?
Based on the cluster descriptions, the East region seems to be the more comfortable place to live in: it contains clusters 0, 1, 2 and 5, which means great accessibility to restaurants, cafes, outdoor facilities and the international airport. The West region also contains clusters 0, 1 and 2, so people living there likewise have great accessibility to restaurants, cafes and outdoor facilities. However, it also contains clusters 3 and 4, which seem to be highly industrialised and militarised areas. These areas are likely to have higher levels of air and noise pollution, potentially resulting in a poorer quality of life for people living in the West region, to an extent that depends on how close their neighborhood is to these areas.
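
The cluster membership of each region can be checked with a short tally. The sketch below copies the labels from the `kmeans.labels_` output and pairs them with the neighborhoods in alphabetical order, which is the order `groupby` produces. The regions of the neighborhoods that do not appear in the truncated tables (Paya Lebar, Tampines, Western Water Catchment) are an assumption inferred from their names and their position in the API call log.

```python
import pandas as pd

# Neighborhoods in alphabetical order, matching the row order of the grouped profiles.
neighborhoods = ['Bedok', 'Boon Lay', 'Bukit Batok', 'Bukit Panjang', 'Changi',
                 'Choa Chu Kang', 'Clementi', 'Jurong East', 'Jurong West', 'Pasir Ris',
                 'Paya Lebar', 'Pioneer', 'Tampines', 'Tengah', 'Tuas',
                 'Western Water Catchment']
# Regions: partly taken from the tables in this notebook, partly assumed (see lead-in).
regions = ['East', 'West', 'West', 'West', 'East',
           'West', 'West', 'West', 'West', 'East',
           'East', 'West', 'East', 'West', 'West',
           'West']
# Cluster labels copied from the kmeans.labels_ output.
labels = [1, 0, 1, 2, 5, 2, 0, 1, 0, 2, 0, 0, 1, 0, 4, 3]

summary = pd.DataFrame({'Neighborhood': neighborhoods,
                        'Region': regions,
                        'Cluster Labels': labels})

# Distinct cluster labels present in each region.
clusters_by_region = summary.groupby('Region')['Cluster Labels'].apply(
    lambda s: sorted(set(s.tolist())))
print(clusters_by_region)
```

Under these assumptions the tally gives clusters 0, 1, 2 and 5 for the East and clusters 0, 1, 2, 3 and 4 for the West.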

Based purely on the data used in this notebook, I would recommend the East region as the better region to live in. Within the East region, the choice of neighborhood then depends on one's personal preferences. For example, someone who likes cafe-hopping should live in Bedok, while a frequent flyer should live in Changi.

## Conclusion

In conclusion, this report set out to answer the question 'Which region is the better region in Singapore to live in? Is it the East region or the West region?' using a dataset that combines coordinate data from Wikipedia with location data from the Foursquare API. Using the location data on nearby venues, we were able to identify the venue categories and the weightage of each venue category for each neighborhood. The k-means clustering algorithm with n_clusters set to 6 was then applied to the dataset of category weightages, resulting in 6 clusters of neighborhoods. By visualizing the clusters on a folium map and deep diving into the characteristics of each cluster, we gained key observations and insights into the neighborhoods.

Based purely on the data used in this notebook, the East region seems to be the better region to live in. Within the East region, the choice of neighborhood then depends on one's personal preferences. Note that this conclusion is drawn from a limited dataset containing only location data. To provide a more holistic analysis, we would need more data such as housing prices, temperature, humidity, Google ratings, etc. This will provide the clustering algorithms with more features to work with and