# Capstone Project - Best District for New Coffee Shop in Hong Kong

### Applied Data Science Capstone by IBM/Coursera

## 1. Introduction


### 1.1 Background

Hong Kong is widely regarded as a gateway to international markets because it scores highly on several factors: a strategic location, a productive workforce, an attractive tax regime, world-class infrastructure, and an effective legal system. Many investors and entrepreneurs have chosen to set up their businesses in Hong Kong. Although Hong Kong's land area is small, its population is comparatively large, which creates substantial business opportunities.

### 1.2 Problem

Hong Kong is small, only around 1,100 km², and most of its landscape consists of steep, undeveloped mountains and hills, which explains why land for development is limited. Officially, Hong Kong has 18 districts. The first question for a startup in Hong Kong is where to locate the shop. For a new coffee shop, the question is which district to choose, given the business opportunities and the competition in each.

### 1.3 Stakeholders

This quantitative analysis aims to give potential investors and startup entrepreneurs, especially those interested in opening a new coffee shop, a guide for approaching this problem scientifically. Supplementary information, such as the rental prices of candidate retail units and the community facilities nearby, is needed for a more thorough assessment. In addition, government authorities can refer to the analysis to better understand the city's cultural diversity.

## 2. Data


The analysis to find the best districts for new coffee shops is based on the following aspects:

* number of existing coffee shops in the districts;
* population density in the districts.

The data come from the following sources, each serving a specific purpose:

* **Wikipedia**: To obtain the district data, including names of regions, names of districts, population density;
* **OpenCage Geocoder API**: To look up the latitudes and longitudes of all districts;
* **Foursquare API**: To obtain the number of coffee shops, their types and locations in all districts.

Several web-scraping libraries and packages are available in Python. To scrape the table from Wikipedia, `pandas` alone is enough: it reads the table directly into a pandas dataframe. Then OpenCage Geocoder, a free API, is used to find the latitude and longitude of each district in Hong Kong.

### Scraping District Data (Names of Regions, Names of Districts, Population Density)

Before scraping and exploring the data, all required dependencies should be installed and imported.

In [None]:
import pandas as pd
!pip install lxml

!pip install opencage
from opencage.geocoder import OpenCageGeocode

!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim

!conda install -c conda-forge folium=0.5.0 --yes
import folium

import requests

import numpy as np

import matplotlib.cm as cm
import matplotlib.colors as colors

from sklearn.cluster import KMeans

# Top-level import; the pandas.io.json path is deprecated since pandas 1.0
from pandas import json_normalize

print('Libraries imported.')

Next, `pandas` reads the tables on the Wikipedia page, and a `for` loop picks out the district table. One column is renamed, giving a dataframe with the columns Region, District, Population, Area(km²), and Density(/km²).

In [2]:
tables = pd.read_html('https://en.wikipedia.org/wiki/Districts_of_Hong_Kong', header=0)

headings = ['District']

for table in tables:
    current_headings = table.columns.values[:1]
    if len(current_headings) != len(headings):
        continue
    if all(current_headings == headings):
        break

df = table.rename(columns={"Population[when?] [6]":"Population",})
df = df[['Region','District','Population','Area(km²)','Density(/km²)']]

df

| | Region | District | Population | Area(km²) | Density(/km²) |
|---|---|---|---|---|---|
| 0 | Hong Kong Island | Central and Western | 244600 | 12.44 | 19983.92 |
| 1 | Hong Kong Island | Eastern | 574500 | 18.56 | 31217.67 |
| 2 | Hong Kong Island | Southern | 269200 | 38.85 | 6962.68 |
| 3 | Hong Kong Island | Wan Chai | 150900 | 9.83 | 15300.1 |
| 4 | Kowloon | Sham Shui Po | 390600 | 9.35 | 41529.41 |
| 5 | Kowloon | Kowloon City | 405400 | 10.02 | 40194.7 |
| 6 | Kowloon | Kwun Tong | 641100 | 11.27 | 56779.05 |
| 7 | Kowloon | Wong Tai Sin | 426200 | 9.3 | 45645.16 |
| 8 | Kowloon | Yau Tsim Mong | 318100 | 6.99 | 44864.09 |
| 9 | New Territories | Islands | 146900 | 175.12 | 825.14 |
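
The lookup loop in the cell above relies on the loop variable still holding the right table after `break`. If no table matches, `table` silently ends up as the last table on the page. A stricter lookup might look like the sketch below. The helper name `find_table_by_first_column` and the toy dataframes are hypothetical and only stand in for the list that `pd.read_html` returns.

```python
import pandas as pd


def find_table_by_first_column(tables, first_column):
    """Return the first DataFrame whose first column is named `first_column`.

    Hypothetical helper: raises instead of silently returning the wrong table.
    """
    for table in tables:
        if len(table.columns) > 0 and table.columns[0] == first_column:
            return table
    raise ValueError(f"No table starts with a {first_column!r} column")


# Toy stand-ins for the list returned by pd.read_html (illustrative values only)
toy_tables = [
    pd.DataFrame({"Year": [2016], "Note": ["illustrative"]}),
    pd.DataFrame({"District": ["Wan Chai"], "Region": ["Hong Kong Island"]}),
]

district_table = find_table_by_first_column(toy_tables, "District")
print(district_table)
```

Raising an error makes a change in the page layout visible immediately, rather than letting an unrelated table flow into the later steps.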


To obtain the latitude and longitude of each district, OpenCage Geocoder is used. It is a free API that can look up the coordinates of a place and, conversely, find the place that a set of coordinates corresponds to.

In [3]:
#Geocoding Tutorial from Amaral Lab: https://amaral.northwestern.edu/blog/getting-long-lat-list-cities

key = '1cfb1dbb86d54891a7c74a57c4761949'
geocoder = OpenCageGeocode(key)
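
Hard-coding an API key in a notebook exposes it to anyone who reads the file. One common alternative, sketched here, is to read the key from an environment variable. The variable name `OPENCAGE_API_KEY` and the helper `load_api_key` are assumptions made for illustration; they are not defined by the OpenCage library.

```python
import os


def load_api_key(env_var="OPENCAGE_API_KEY"):
    """Read an API key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Demo only: a placeholder value so the sketch runs on its own
os.environ.setdefault("OPENCAGE_API_KEY", "your-key-here")
key = load_api_key()
# geocoder = OpenCageGeocode(key)  # then proceed as in the cell above
```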

In [4]:
list_lat = []
list_long = []

for index, row in df.iterrows():
    # Build a query such as "Wan Chai,Hong Kong Island"
    query = str(row['District']) + ',' + str(row['Region'])

    geo_results = geocoder.geocode(query)
    if geo_results:
        # Use the first result returned for the query
        list_lat.append(geo_results[0]['geometry']['lat'])
        list_long.append(geo_results[0]['geometry']['lng'])
    else:
        # No match: store a missing value so the lists stay aligned with df
        list_lat.append(np.nan)
        list_long.append(np.nan)

df['Latitude'] = list_lat
df['Longitude'] = list_long

df

| | Region | District | Population | Area(km²) | Density(/km²) | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 0 | Hong Kong Island | Central and Western | 244600 | 12.44 | 19983.92 | 22.281938 | 114.158077 |
| 1 | Hong Kong Island | Eastern | 574500 | 18.56 | 31217.67 | 22.273078 | 114.233594 |
| 2 | Hong Kong Island | Southern | 269200 | 38.85 | 6962.68 | 22.244541 | 114.205376 |
| 3 | Hong Kong Island | Wan Chai | 150900 | 9.83 | 15300.1 | 22.279015 | 114.172483 |
| 4 | Kowloon | Sham Shui Po | 390600 | 9.35 | 41529.41 | 22.32819 | 114.160854 |
| 5 | Kowloon | Kowloon City | 405400 | 10.02 | 40194.7 | 22.33016 | 114.189937 |
| 6 | Kowloon | Kwun Tong | 641100 | 11.27 | 56779.05 | 22.312937 | 114.22561 |
| 7 | Kowloon | Wong Tai Sin | 426200 | 9.3 | 45645.16 | 22.341654 | 114.193859 |
| 8 | Kowloon | Yau Tsim Mong | 318100 | 6.99 | 44864.09 | 22.302857 | 114.182032 |
| 9 | New Territories | Islands | 146900 | 175.12 | 825.14 | 22.230076 | 113.986785 |
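
Geocoders occasionally return a match in the wrong place, so it can be worth checking that every coordinate falls inside Hong Kong. The sketch below uses a rough bounding box (latitude 22.1 to 22.6, longitude 113.8 to 114.5). These bounds are an approximation chosen for illustration, and only three rows from the table above are reproduced.

```python
import pandas as pd

# Rough bounding box for Hong Kong (approximate; an assumption for this sketch)
LAT_MIN, LAT_MAX = 22.1, 22.6
LON_MIN, LON_MAX = 113.8, 114.5

# Three rows copied from the geocoded table above
coords = pd.DataFrame({
    "District": ["Central and Western", "Kwun Tong", "Islands"],
    "Latitude": [22.281938, 22.312937, 22.230076],
    "Longitude": [114.158077, 114.22561, 113.986785],
})

# True where both latitude and longitude fall inside the box
inside = (
    coords["Latitude"].between(LAT_MIN, LAT_MAX)
    & coords["Longitude"].between(LON_MIN, LON_MAX)
)
outliers = coords[~inside]
print(f"{len(outliers)} district(s) outside the bounding box")
```

Any district listed in `outliers` would deserve a manual look, for example by retrying the query with a more specific place name.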


In [5]:
print('The dataframe has {} regions and {} districts.'.format(
        len(df['Region'].unique()),
        df.shape[0]
    )
)

The dataframe has 3 regions and 18 districts.
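
To see how the districts are spread across the regions, the dataframe can also be grouped by `Region`. A minimal sketch, using only the ten rows displayed in the tables above rather than the full 18-district dataframe (the name `shown` is hypothetical):

```python
import pandas as pd

# Region/District pairs from the ten rows displayed above (not the full table)
shown = pd.DataFrame({
    "Region": ["Hong Kong Island"] * 4 + ["Kowloon"] * 5 + ["New Territories"],
    "District": [
        "Central and Western", "Eastern", "Southern", "Wan Chai",
        "Sham Shui Po", "Kowloon City", "Kwun Tong", "Wong Tai Sin",
        "Yau Tsim Mong", "Islands",
    ],
})

# Count distinct districts within each region
districts_per_region = shown.groupby("Region")["District"].nunique()
print(districts_per_region)
```

Applied to the full `df`, the same call would give the per-region counts behind the totals printed above.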
