<h1>Coursera Capstone Hong Kong Codebook</h1>

<h2>Introduction</h2>

The aim of this project is to locate the optimum areas for setting up a vegetarian restaurant in Hong Kong. This notebook shows the code for this project.

The most important factors considered in finding a good location for the restaurant are:
* The area should have few restaurants, especially vegetarian ones.
* The area should be away from densely populated residential blocks.
* It should be near a scenic spot.

<h2>Initial Preparation</h2>

<b>First of all, import required libraries.</b>

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
#from pygeocoder import Geocoder

#pip install reverse_geocoder
#pip install pprint

#!conda install reverse_geocoder
#!conda install pprint

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

#Mathematical Functions
from math import radians, cos, sin, asin, sqrt, atan2, degrees

from bs4 import BeautifulSoup

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library



print('Libraries imported.')

Libraries imported.


<h2>Data Acquisition and Data Cleaning</h2>

<b>Get the neighborhoods of Hong Kong and their coordinates.</b>

In [2]:
url = requests.get('https://www.geodatos.net/en/coordinates/hong-kong').text
soup = BeautifulSoup(url,'lxml')

In [3]:
#Filter Out Required Table
My_table = soup.find('div',{'class':'col-md-12 panel-body overflowauto'})
#Fetch out table body
My_table = My_table.find('tbody')
#Initialise lists
city_list = []
coordinate_list = []
#Fetch out city names and their respective coordinates and store them in lists
for row in My_table.findAll('tr'):
    city_list.append(row.find('td').text.strip())
    coordinate_list.append(row.find('a').text.strip())

<b>Store them in a dataframe.</b>

In [4]:
hongkong_df = pd.DataFrame(columns = ['City','Coordinates'])
hongkong_df = hongkong_df.assign(City = city_list, Coordinates = coordinate_list)
hongkong_df.head()

Unnamed: 0,City,Coordinates
0,Hong Kong,"22.2783203, 114.1746902"
1,Kowloon,"22.3166695, 114.1833267"
2,Tsuen Wan,"22.3706608, 114.1047897"
3,Yuen Long Kau Hui,"22.4500008, 114.0333328"
4,Tung Chung,"22.2878304, 113.9424286"


<b>The coordinates are in a single column. Let's split them into two separate columns, Latitude and Longitude.</b>

In [5]:
Latitude = []
Longitude = []
for x in hongkong_df['Coordinates']:
    x = x.split(', ')
    Latitude.append(x[0])
    Longitude.append(x[1])
print(Latitude[:5])
print(Longitude[0:5])

['22.2783203', '22.3166695', '22.3706608', '22.4500008', '22.2878304']
['114.1746902', '114.1833267', '114.1047897', '114.0333328', '113.9424286']
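The split above leaves the values as strings, which is why a numeric conversion is needed later. As a minimal sketch, assuming every entry has the form `'lat, lng'`, the split and the conversion can be done in one pass. The helper name `parse_coordinates` is hypothetical and not part of the notebook.

```python
def parse_coordinates(text):
    """Split a 'lat, lng' string into a (latitude, longitude) pair of floats."""
    lat_text, lng_text = text.split(',')
    return float(lat_text.strip()), float(lng_text.strip())


# Two sample rows taken from the table above
rows = ["22.2783203, 114.1746902", "22.3166695, 114.1833267"]
parsed = [parse_coordinates(row) for row in rows]
print(parsed)
```

Stripping each part separately makes the parsing tolerant of a missing or extra space after the comma.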


In [6]:
hongkong_df = hongkong_df.assign(Latitude = Latitude, Longitude = Longitude)

In [7]:
hongkong_df.head()

Unnamed: 0,City,Coordinates,Latitude,Longitude
0,Hong Kong,"22.2783203, 114.1746902",22.2783203,114.1746902
1,Kowloon,"22.3166695, 114.1833267",22.3166695,114.1833267
2,Tsuen Wan,"22.3706608, 114.1047897",22.3706608,114.1047897
3,Yuen Long Kau Hui,"22.4500008, 114.0333328",22.4500008,114.0333328
4,Tung Chung,"22.2878304, 113.9424286",22.2878304,113.9424286


<b>Get rid of Coordinates column.</b>

In [7]:
hongkong_df.drop('Coordinates', axis = 1, inplace =True)

In [8]:
hongkong_df.head()

Unnamed: 0,City,Latitude,Longitude
0,Hong Kong,22.2783203,114.1746902
1,Kowloon,22.3166695,114.1833267
2,Tsuen Wan,22.3706608,114.1047897
3,Yuen Long Kau Hui,22.4500008,114.0333328
4,Tung Chung,22.2878304,113.9424286


In [10]:
hongkong_df.dtypes

City         object
Latitude     object
Longitude    object
dtype: object

<b>The Latitude and Longitude columns are of type object, so we need to convert them to float.</b>

In [9]:
hongkong_df['Latitude'] =pd.to_numeric(hongkong_df['Latitude'])
hongkong_df['Longitude'] =pd.to_numeric(hongkong_df['Longitude'])

In [11]:
hongkong_df.dtypes

City          object
Latitude     float64
Longitude    float64
dtype: object

<b>Get Coordinates of Hong Kong using geopy.</b>

In [10]:
address = 'HONG KONG, HKG'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Hong Kong are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Hong Kong are 22.30742895, 113.917059658642.


<b>Let's visualize the neighborhoods of Hong Kong</b>

In [11]:
map_hongkong = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, city in zip(hongkong_df['Latitude'], hongkong_df['Longitude'], hongkong_df['City']):
    label = '{}'.format(city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_hongkong)  
    
map_hongkong

<h2>Analysis of Hong Kong Neighbourhoods</h2>

<b>Define Foursquare Credentials.</b>

<b>These credentials are deliberately hidden for security reasons.</b>

In [100]:
CLIENT_ID = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' #Foursquare ID
CLIENT_SECRET = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' #Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
CLIENT_SECRET:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


<b>Function for exploring venues of all the neighbourhoods.</b>

In [13]:
def getNearbyVenues(categoryid, names, latitudes, longitudes, radius=500):
    
    LIMIT = 100
    venues_list=[]
    category = categoryid    
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        if(category == ''):
            url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
              CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        else:
            url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
           CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            category,
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
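The two URL branches in the function differ only in the optional `categoryId` parameter. A minimal sketch of how they could be collapsed into one, assuming the same endpoint and parameter names as above, is shown below. The helper `build_explore_url` is hypothetical, and the credentials are placeholders.

```python
from urllib.parse import urlencode

# Endpoint taken from the function above
EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, lat, lng,
                      radius=500, limit=100, category_id=''):
    """Build the explore URL, adding categoryId only when one is given."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    if category_id:
        params['categoryId'] = category_id
    # urlencode escapes special characters, e.g. the comma in 'll'
    return EXPLORE_URL + '?' + urlencode(params)


plain_url = build_explore_url('ID', 'SECRET', '20180605', 22.27832, 114.17469)
food_url = build_explore_url('ID', 'SECRET', '20180605', 22.27832, 114.17469,
                             category_id='4d4b7105d754a06374d81259')
print(plain_url)
print(food_url)
```

Keeping the parameters in a dictionary avoids maintaining two nearly identical format strings, and the same dictionary could instead be passed to `requests.get` through its `params` argument.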

<b>Use the above function to check out nearby food venues.</b>

In [14]:
hongkong_venues = getNearbyVenues(categoryid ='4d4b7105d754a06374d81259',
                                  names = hongkong_df['City'],
                                   latitudes = hongkong_df['Latitude'],
                                   longitudes = hongkong_df['Longitude'])
print(hongkong_venues.shape)

Hong Kong
Kowloon
Tsuen Wan
Yuen Long Kau Hui
Tung Chung
Sha Tin
Tuen Mun
Tai Po
Sai Kung
Yung Shue Wan
Ngong Ping
Sok Kwu Wan
Tai O
Wong Tai Sin
Wan Chai
Sham Shui Po
Central
(449, 7)


In [15]:
hongkong_venues.head(5)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hong Kong,22.27832,114.17469,Zahrabel,22.278194,114.175912,Middle Eastern Restaurant
1,Hong Kong,22.27832,114.17469,The Optimist,22.278049,114.175854,Spanish Restaurant
2,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant
3,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant
4,Hong Kong,22.27832,114.17469,Sang Kee Seafood Restaurant (生記海鮮飯店),22.277755,114.172093,Seafood Restaurant


In [16]:
print('There are {} unique venue categories.'.format(len(hongkong_venues['Venue Category'].unique())))

There are 68 unique venue categories.


In [17]:
hongkong_venues['Venue Category'].value_counts()

Chinese Restaurant               43
Café                             27
Hong Kong Restaurant             26
Japanese Restaurant              26
Cantonese Restaurant             22
Noodle House                     20
Seafood Restaurant               19
Italian Restaurant               18
Thai Restaurant                  15
Sushi Restaurant                 14
Bakery                           14
Fast Food Restaurant             12
Steakhouse                       11
Dumpling Restaurant              10
Korean Restaurant                10
French Restaurant                 8
Snack Place                       8
Sandwich Place                    8
Cha Chaan Teng                    8
Restaurant                        7
Dim Sum Restaurant                7
Shanghai Restaurant               7
Vegetarian / Vegan Restaurant     7
Indian Restaurant                 7
Szechuan Restaurant               6
Asian Restaurant                  6
Burger Joint                      6
Ramen Restaurant            

The list above shows the various types of food venues. Although there are a large number of restaurants, fortunately only 7 of them are vegetarian. Apart from restaurants, there are also other food venues such as cafés, bakeries, and burger joints.

<b>But we are interested only in restaurants</b>

In [18]:
hongkong_restaurants = hongkong_venues[hongkong_venues['Venue Category'].str.contains('Restaurant')]
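The filter above keeps only categories whose name contains the exact substring `Restaurant`. The plain-Python sketch below, using a few category names from the counts listed earlier, illustrates which kinds of venues pass and which are dropped. Whether dropping categories such as `Noodle House` or `Steakhouse` is desirable is a judgment call left to the analysis.

```python
# A few category names taken from the value counts above
categories = [
    'Chinese Restaurant',
    'Café',
    'Noodle House',
    'Steakhouse',
    'Vegetarian / Vegan Restaurant',
    'Burger Joint',
]

# Same rule as str.contains('Restaurant'): a case-sensitive substring match
kept = [c for c in categories if 'Restaurant' in c]
dropped = [c for c in categories if 'Restaurant' not in c]

print('kept:   ', kept)
print('dropped:', dropped)
```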

In [19]:
hongkong_restaurants.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hong Kong,22.27832,114.17469,Zahrabel,22.278194,114.175912,Middle Eastern Restaurant
1,Hong Kong,22.27832,114.17469,The Optimist,22.278049,114.175854,Spanish Restaurant
2,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant
3,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant
4,Hong Kong,22.27832,114.17469,Sang Kee Seafood Restaurant (生記海鮮飯店),22.277755,114.172093,Seafood Restaurant


In [20]:
hongkong_restaurants.shape

(322, 7)

So let's first visualise these restaurants.

In [21]:
map_rest = folium.Map(location=[latitude, longitude], zoom_start=10)
folium.TileLayer('cartodbpositron').add_to(map_rest)

# add markers to map
for lat, lng, ven in zip(hongkong_restaurants['Venue Latitude'], hongkong_restaurants['Venue Longitude'], 
                          hongkong_restaurants['Venue']):
    label = '{}'.format(ven)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_rest)  
    
map_rest

The map shows the restaurants tightly packed around each neighborhood's central point. This is partly a consequence of the 500 m search radius used in each query, and probably also because the surrounding regions are less suitable. So we need to locate regions that are not far from a central point and are also close to a scenic spot.

<h3>Exploratory Analysis</h3>

Let's first see the distribution of these restaurants across the various neighborhoods.

In [22]:
hongkong_restaurants['Neighbourhood'].value_counts()

Hong Kong            76
Central              66
Tai Po               38
Tung Chung           30
Sham Shui Po         22
Wan Chai             18
Yung Shue Wan        11
Tsuen Wan            10
Kowloon              10
Tai O                 9
Sok Kwu Wan           9
Ngong Ping            7
Tuen Mun              6
Sha Tin               5
Sai Kung              3
Yuen Long Kau Hui     1
Wong Tai Sin          1
Name: Neighbourhood, dtype: int64

We can see that Hong Kong and Central have far more restaurants than the rest, while most of the neighborhoods have very few. So, opening a restaurant in those sparse areas should not be a big issue. We will restrict our analysis to the neighborhoods where the number of restaurants is above the average per neighborhood (322 restaurants over 17 neighborhoods, roughly 19).

<b>So get rid of all those neighborhoods where number of restaurants is less than 20.</b>
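As a quick check of the cut-off, the sketch below recomputes the average from the counts listed above and applies the threshold of 20 in plain Python. The variable names are illustrative only.

```python
# Restaurant counts per neighbourhood, copied from the output above
restaurant_counts = {
    'Hong Kong': 76, 'Central': 66, 'Tai Po': 38, 'Tung Chung': 30,
    'Sham Shui Po': 22, 'Wan Chai': 18, 'Yung Shue Wan': 11,
    'Tsuen Wan': 10, 'Kowloon': 10, 'Tai O': 9, 'Sok Kwu Wan': 9,
    'Ngong Ping': 7, 'Tuen Mun': 6, 'Sha Tin': 5, 'Sai Kung': 3,
    'Yuen Long Kau Hui': 1, 'Wong Tai Sin': 1,
}

total = sum(restaurant_counts.values())
average = total / len(restaurant_counts)

# Keep only neighbourhoods with at least 20 restaurants
threshold = 20
dense = {name: n for name, n in restaurant_counts.items() if n >= threshold}

print(total, round(average, 2))
print(list(dense))
```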

In [23]:
values = hongkong_restaurants['Neighbourhood'].value_counts().keys().tolist()
counts = hongkong_restaurants['Neighbourhood'].value_counts().tolist()
dicto = dict(zip(values,counts))

In [24]:
hongkong_restaurants_sp = hongkong_restaurants.copy()

In [25]:
for x in dicto:
    if dicto.get(x) < 20:
        hongkong_restaurants_sp.drop(hongkong_restaurants_sp[hongkong_restaurants_sp.Neighbourhood == x].index, inplace =True)    

In [26]:
hongkong_restaurants_sp = hongkong_restaurants_sp.reset_index(drop = True)

In [27]:
hongkong_restaurants_sp['Neighbourhood'].value_counts()

Hong Kong       76
Central         66
Tai Po          38
Tung Chung      30
Sham Shui Po    22
Name: Neighbourhood, dtype: int64

In [28]:
hongkong_restaurants_sp.shape

(232, 7)

In [29]:
hongkong_restaurants_sp.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hong Kong,22.27832,114.17469,Zahrabel,22.278194,114.175912,Middle Eastern Restaurant
1,Hong Kong,22.27832,114.17469,The Optimist,22.278049,114.175854,Spanish Restaurant
2,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant
3,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant
4,Hong Kong,22.27832,114.17469,Sang Kee Seafood Restaurant (生記海鮮飯店),22.277755,114.172093,Seafood Restaurant


<b>Function for calculating distance between two points.</b>

In [30]:
def calcdistance(lat1, long1, lat2, long2):
    # convert decimal degrees to radians 
    lat1, long1, lat2, long2 = map(radians, [lat1, long1, lat2, long2])
    # haversine formula 
    dlong = long2 - long1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlong/2)**2
    c = 2 * asin(sqrt(a)) 
    # Multiply by the radius of the Earth, i.e. 6371 km
    d = 6371* c
    # Convert it into meters and Round off the result
    d = round(d * 1000)
    return d
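A few sanity checks can build confidence in a haversine implementation. The sketch below is a self-contained variant of the function above (without the final rounding, under the hypothetical name `haversine_m`) and checks three properties: one degree of latitude is about 111.2 km, a point is at distance zero from itself, and the distance is symmetric.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371  # same mean radius as used in calcdistance


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * 1000 * asin(sqrt(a))


# One degree of latitude along a meridian
one_degree = haversine_m(0, 0, 1, 0)
# A point compared with itself
same_point = haversine_m(22.27832, 114.17469, 22.27832, 114.17469)
# The same pair of points in both directions
forward = haversine_m(22.27832, 114.17469, 22.278194, 114.175912)
backward = haversine_m(22.278194, 114.175912, 22.27832, 114.17469)

print(one_degree, same_point, forward, backward)
```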

In [31]:
Focaldistance = []
l = hongkong_restaurants_sp['Venue'].size
for i in range(hongkong_restaurants_sp['Venue'].size):
    d = calcdistance(hongkong_restaurants_sp['Neighbourhood Latitude'][i], 
                     hongkong_restaurants_sp['Neighbourhood Longitude'][i],
                     hongkong_restaurants_sp['Venue Latitude'][i], 
                     hongkong_restaurants_sp['Venue Longitude'][i])
    Focaldistance.append(d)

In [32]:
hongkong_restaurants_sp['Distance From Focus'] = Focaldistance

In [33]:
hongkong_restaurants_sp.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Distance From Focus
0,Hong Kong,22.27832,114.17469,Zahrabel,22.278194,114.175912,Middle Eastern Restaurant,127
1,Hong Kong,22.27832,114.17469,The Optimist,22.278049,114.175854,Spanish Restaurant,123
2,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant,102
3,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant,56
4,Hong Kong,22.27832,114.17469,Sang Kee Seafood Restaurant (生記海鮮飯店),22.277755,114.172093,Seafood Restaurant,275


In [34]:
max(hongkong_restaurants_sp['Distance From Focus'])

497

The farthest venue is less than 500 m from its central point, as expected given the 500 m search radius. So it would not be wise to divide these small neighborhoods further by distance. We need to find some other approach.

<b>Let's find the locations of our venues with respect to central point like North, North West etc.</b>

<b>Function for getting directions of a venue according to focal point</b>

In [36]:
def checkposition(foclat, foclong, venlat, venlong): 
    x1 = venlat
    y1 = venlong
    x2 = foclat
    y2 = foclong

    # Angle of the venue as seen from the focal point, measured clockwise
    # from north: atan2(longitude difference, latitude difference)
    theta1 = atan2((y1 - y2), (x1 - x2))
    # Convert to degrees

    compassReading = degrees(theta1)
    
    coordNames = ["N", "NE", "E", "SE", "S", "SW", "W", "NW", "N"]
    coordIndex = round(compassReading / 45)
    if (coordIndex < 0):
        coordIndex = coordIndex + 8
        
    return coordNames[coordIndex] # returns the coordinate value
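The idea behind the function can be sketched more compactly: compute the angle clockwise from north, divide the circle into eight 45-degree sectors, and wrap negative indices with a modulo. The version below uses a hypothetical name, `compass_direction`, and is only an illustration. Like the function above, it treats a degree of latitude and a degree of longitude as equally long, which is an approximation (at Hong Kong's latitude a degree of longitude is roughly 0.93 times as long).

```python
from math import atan2, degrees

COMPASS = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']


def compass_direction(focus_lat, focus_lng, venue_lat, venue_lng):
    """Return the 8-point compass direction of the venue seen from the focus."""
    # atan2(east-west offset, north-south offset): 0 is north, 90 is east
    angle = degrees(atan2(venue_lng - focus_lng, venue_lat - focus_lat))
    # Each sector spans 45 degrees; the modulo maps negative angles to 4..7
    return COMPASS[round(angle / 45) % 8]


print(compass_direction(0, 0, 1, 0))    # venue directly north of the focus
print(compass_direction(0, 0, 0, 1))    # venue directly east of the focus
print(compass_direction(0, 0, -1, -1))  # venue to the south-west of the focus
```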


In [37]:
pos = []
for i in range(hongkong_restaurants_sp['Venue'].size):
    pos.append(checkposition(hongkong_restaurants_sp['Neighbourhood Latitude'][i], hongkong_restaurants_sp['Neighbourhood Longitude'][i],
                   hongkong_restaurants_sp['Venue Latitude'][i], hongkong_restaurants_sp['Venue Longitude'][i]))

In [38]:
hongkong_restaurants_sp['Location'] = pos

In [39]:
hongkong_restaurants_sp.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Distance From Focus,Location
0,Hong Kong,22.27832,114.17469,Zahrabel,22.278194,114.175912,Middle Eastern Restaurant,127,E
1,Hong Kong,22.27832,114.17469,The Optimist,22.278049,114.175854,Spanish Restaurant,123,E
2,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant,102,SE
3,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant,56,W
4,Hong Kong,22.27832,114.17469,Sang Kee Seafood Restaurant (生記海鮮飯店),22.277755,114.172093,Seafood Restaurant,275,W


<b>Let's view the distribution of Hong Kong neighborhood according to location areas</b>

In [40]:
hklcount = hongkong_restaurants_sp[hongkong_restaurants_sp['Neighbourhood'] == 'Hong Kong']['Location'].value_counts()
hklcount

SW    31
E     13
W     10
S      9
SE     6
NW     3
NE     2
N      2
Name: Location, dtype: int64

This distribution clearly shows that south-western Hong Kong is densely crowded with restaurants, while the northern directions (NW, N, NE) have at most 3 restaurants each, against 31 in the south-west.

So, we will focus only on those areas in which the number of restaurants is at or below the average count per direction. We will call these regions 'Low Restaurant Density' areas.

In [41]:
hkavg = round(sum(hklcount)/len(hklcount))
hkavg

10

In [42]:
hklarea = hklcount[lambda x: x <= hkavg]
hklarea

W     10
S      9
SE     6
NW     3
NE     2
N      2
Name: Location, dtype: int64
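The threshold logic can be reproduced in plain Python from the counts above. One detail worth noting in this sketch: Python's built-in `round` rounds exact halves to the nearest even integer, so an average of 9.5 becomes 10 while an average of 8.5 would become 8.

```python
# Restaurant counts per direction for the Hong Kong neighbourhood (from above)
direction_counts = {'SW': 31, 'E': 13, 'W': 10, 'S': 9,
                    'SE': 6, 'NW': 3, 'NE': 2, 'N': 2}

mean = sum(direction_counts.values()) / len(direction_counts)
threshold = round(mean)

# Directions whose count is at or below the rounded average
low_density = [d for d, n in direction_counts.items() if n <= threshold]

print(mean, threshold)
print(low_density)
print(round(8.5))  # halves round to the nearest even integer
```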

The same is done for the other four neighborhoods in the following cells.

<b>Central</b>

In [43]:
centralcount = hongkong_restaurants_sp[hongkong_restaurants_sp['Neighbourhood'] == 'Central']['Location'].value_counts()
centralavg = round(sum(centralcount)/len(centralcount))
centralarea = centralcount[lambda x: x <= centralavg]
centralarea

N     8
SE    3
NE    3
E     2
NW    1
Name: Location, dtype: int64

<b>Tung Chung</b>

In [44]:
tungcount = hongkong_restaurants_sp[hongkong_restaurants_sp['Neighbourhood'] == 'Tung Chung']['Location'].value_counts()
tungavg = round(sum(tungcount)/len(tungcount))
tunglarea = tungcount[lambda x: x <= tungavg]
tunglarea

NE    3
Name: Location, dtype: int64

<b>Sham Shui Po</b>

In [45]:
shamcount = hongkong_restaurants_sp[hongkong_restaurants_sp['Neighbourhood'] == 'Sham Shui Po']['Location'].value_counts()
shamavg = round(sum(shamcount)/len(shamcount))
shamlarea = shamcount[lambda x: x <= shamavg]
shamlarea

E     5
SE    4
N     3
Name: Location, dtype: int64

<b>Tai Po</b>

In [46]:
taicount = hongkong_restaurants_sp[hongkong_restaurants_sp['Neighbourhood'] == 'Tai Po']['Location'].value_counts()
taiavg = round(sum(taicount)/len(taicount))
taiarea = taicount[lambda x: x <= taiavg]
taiarea

NW    5
N     4
W     1
Name: Location, dtype: int64

<b>Now we have defined the low-density areas in these neighborhoods. So, let's create another dataframe that shows the low restaurant density areas, low residence density areas, and scenic location areas.</b>

In [49]:
sparea = pd.DataFrame()

In [50]:
sparea['Neighbourhood'] = hongkong_restaurants_sp['Neighbourhood'].unique().tolist()

Get Latitudes and Longitudes of these neighborhoods as well.

In [51]:
sparea['Latitude'] = hongkong_restaurants_sp['Neighbourhood Latitude'].unique().tolist()
sparea['Longitude'] = hongkong_restaurants_sp['Neighbourhood Longitude'].unique().tolist()
sparea

Unnamed: 0,Neighbourhood,Latitude,Longitude
0,Hong Kong,22.27832,114.17469
1,Tung Chung,22.28783,113.942429
2,Tai Po,22.450069,114.16877
3,Sham Shui Po,22.330231,114.159447
4,Central,22.282989,114.158462


In [52]:
# Ensure the coordinates are numeric
sparea['Latitude'] =pd.to_numeric(sparea['Latitude'])
sparea['Longitude'] =pd.to_numeric(sparea['Longitude'])

<b>Add Low Restaurant Density Area Column</b>

In [53]:
sparea['Low Restaurant Density']  = [hklarea.keys().tolist(), tunglarea.keys().tolist(),
                          taiarea.keys().tolist(), shamlarea.keys().tolist(),
                          centralarea.keys().tolist()]

In [54]:
sparea.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude,Low Restaurant Density
0,Hong Kong,22.27832,114.17469,"[W, S, SE, NW, NE, N]"
1,Tung Chung,22.28783,113.942429,[NE]
2,Tai Po,22.450069,114.16877,"[NW, N, W]"
3,Sham Shui Po,22.330231,114.159447,"[E, SE, N]"
4,Central,22.282989,114.158462,"[N, SE, NE, E, NW]"


<b>Let's look for scenic spots and parks.</b>

<b>Scenes</b>

In [55]:
hongkong_scenes = getNearbyVenues(categoryid = '4bf58dd8d48988d165941735' ,
                                  names = hongkong_df['City'],
                                   latitudes = hongkong_df['Latitude'],
                                   longitudes = hongkong_df['Longitude'])
print(hongkong_scenes.shape)

Hong Kong
Kowloon
Tsuen Wan
Yuen Long Kau Hui
Tung Chung
Sha Tin
Tuen Mun
Tai Po
Sai Kung
Yung Shue Wan
Ngong Ping
Sok Kwu Wan
Tai O
Wong Tai Sin
Wan Chai
Sham Shui Po
Central
(10, 7)


Get location areas of these spots as well.

In [56]:
spos = []
for i in range(hongkong_scenes['Venue'].size):
    spos.append(checkposition(hongkong_scenes['Neighbourhood Latitude'][i], hongkong_scenes['Neighbourhood Longitude'][i],
                   hongkong_scenes['Venue Latitude'][i], hongkong_scenes['Venue Longitude'][i]))
hongkong_scenes['Location'] = spos

In [57]:
hongkong_scenes.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Location
0,Hong Kong,22.27832,114.17469,Central Plaza Sky Lobby,22.280109,114.17371,Scenic Lookout,NW
1,Hong Kong,22.27832,114.17469,Крыша,22.276177,114.170997,Scenic Lookout,SW
2,Yung Shue Wan,22.226231,114.112411,Waterfront Bar & Grill,22.225468,114.111306,Scenic Lookout,SW
3,Yung Shue Wan,22.226231,114.112411,Pavilion @ O Tsai (澳仔觀景亭),22.228115,114.108419,Scenic Lookout,NW
4,Ngong Ping,22.25556,113.903908,Tian Tan Buddha (Giant Buddha) (天壇大佛),22.253953,113.905011,Scenic Lookout,SE


Add location areas corresponding to neighborhoods in sparea dataframe. 

In [58]:
t =[]
temp = []
for x in sparea['Neighbourhood']:
    temp = hongkong_scenes[hongkong_scenes['Neighbourhood'] == x]['Location'].values
    if(len(temp) != 0):
        l = []
        for s in temp:
            l.append(s)
        t.append(list(set(l)))
    else:
        t.append('None')    
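The loop above collects the distinct directions of scenic spots for each neighbourhood. A minimal sketch of the same grouping with the standard library is given below, using the five sample rows shown in the table above. The names `rows`, `targets`, and `summary` are illustrative, and a real `None` marks a neighbourhood without any scenic spot.

```python
from collections import defaultdict

# (neighbourhood, direction) pairs from the sample rows shown above
rows = [
    ('Hong Kong', 'NW'),
    ('Hong Kong', 'SW'),
    ('Yung Shue Wan', 'SW'),
    ('Yung Shue Wan', 'NW'),
    ('Ngong Ping', 'SE'),
]

# Group directions by neighbourhood; a set removes duplicates
directions = defaultdict(set)
for neighbourhood, direction in rows:
    directions[neighbourhood].add(direction)

# Look up the neighbourhoods of interest, using None when nothing was found
targets = ['Hong Kong', 'Tung Chung']
summary = {name: sorted(directions[name]) if name in directions else None
           for name in targets}
print(summary)
```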

In [59]:
sparea['Scenes'] = t

In [60]:
sparea.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude,Low Restaurant Density,Scenes
0,Hong Kong,22.27832,114.17469,"[W, S, SE, NW, NE, N]","[SW, NW]"
1,Tung Chung,22.28783,113.942429,[NE],
2,Tai Po,22.450069,114.16877,"[NW, N, W]",
3,Sham Shui Po,22.330231,114.159447,"[E, SE, N]",
4,Central,22.282989,114.158462,"[N, SE, NE, E, NW]","[N, NE]"


<b>Parks</b>

In [61]:
hongkong_parks = getNearbyVenues(categoryid = '4bf58dd8d48988d163941735' ,
                                  names = hongkong_df['City'],
                                   latitudes = hongkong_df['Latitude'],
                                   longitudes = hongkong_df['Longitude'])
print(hongkong_parks.shape)

Hong Kong
Kowloon
Tsuen Wan
Yuen Long Kau Hui
Tung Chung
Sha Tin
Tuen Mun
Tai Po
Sai Kung
Yung Shue Wan
Ngong Ping
Sok Kwu Wan
Tai O
Wong Tai Sin
Wan Chai
Sham Shui Po
Central
(25, 7)


Get location areas for these parks as well.

In [62]:
spos = []
for i in range(hongkong_parks['Venue'].size):
    spos.append(checkposition(hongkong_parks['Neighbourhood Latitude'][i], hongkong_parks['Neighbourhood Longitude'][i],
                   hongkong_parks['Venue Latitude'][i], hongkong_parks['Venue Longitude'][i]))
hongkong_parks['Location'] = spos

In [63]:
hongkong_parks.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Location
0,Hong Kong,22.27832,114.17469,Lee Tung Avenue (利東街),22.275942,114.17254,Shopping Plaza,SW
1,Hong Kong,22.27832,114.17469,Wan Chai Park (灣仔公園),22.275421,114.176166,Park,SE
2,Hong Kong,22.27832,114.17469,Harbour Road Garden 港灣道花園,22.2805,114.175626,Garden,NE
3,Kowloon,22.316669,114.183327,Red Signal Hill 紅燈山,22.316208,114.183516,Park,S
4,Kowloon,22.316669,114.183327,Ko Shan Road Park (高山道公園),22.314341,114.185913,Park,SE


Add location areas corresponding to neighborhoods in sparea dataframe. 

In [64]:
t =[]
temp = []
for x in sparea['Neighbourhood']:
    temp = hongkong_parks[hongkong_parks['Neighbourhood'] == x]['Location'].values
    if(len(temp) != 0):
        l = []
        for s in temp:
             l.append(s)
        if len(l) == 0:
            t.append('None')
        else:
            t.append(list(set(l)))
    else:
        t.append('None')    

In [65]:
sparea['Parks'] = t

In [66]:
sparea.head()

Unnamed: 0,Neighbourhood,Latitude,Longitude,Low Restaurant Density,Scenes,Parks
0,Hong Kong,22.27832,114.17469,"[W, S, SE, NW, NE, N]","[SW, NW]","[NE, SW, SE]"
1,Tung Chung,22.28783,113.942429,[NE],,[N]
2,Tai Po,22.450069,114.16877,"[NW, N, W]",,[NW]
3,Sham Shui Po,22.330231,114.159447,"[E, SE, N]",,"[W, S, SE]"
4,Central,22.282989,114.158462,"[N, SE, NE, E, NW]","[N, NE]","[N, E, SE]"


<b>Residential Area</b>

In [67]:
hongkong_residence = getNearbyVenues(categoryid = '4e67e38e036454776db1fb3a' ,
                                  names = hongkong_df['City'],
                                   latitudes = hongkong_df['Latitude'],
                                   longitudes = hongkong_df['Longitude'])
print(hongkong_residence.shape)

Hong Kong
Kowloon
Tsuen Wan
Yuen Long Kau Hui
Tung Chung
Sha Tin
Tuen Mun
Tai Po
Sai Kung
Yung Shue Wan
Ngong Ping
Sok Kwu Wan
Tai O
Wong Tai Sin
Wan Chai
Sham Shui Po
Central
(59, 7)


Get Location areas for these residential areas too.

In [68]:
pos = []
for i in range(hongkong_residence['Venue'].size):
    pos.append(checkposition(hongkong_residence['Neighbourhood Latitude'][i], hongkong_residence['Neighbourhood Longitude'][i],
                   hongkong_residence['Venue Latitude'][i], hongkong_residence['Venue Longitude'][i]))

In [69]:
hongkong_residence['Location'] = pos

In [70]:
hongkong_residence.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Location
0,Hong Kong,22.27832,114.17469,Tai Wo Court 泰和閣,22.276413,114.173652,Residential Building (Apartment / Condo),SW
1,Hong Kong,22.27832,114.17469,Alliance Française 香港法國文化協會,22.277386,114.172557,Office,SW
2,Hong Kong,22.27832,114.17469,Convention Plaza Apartments 會景閣,22.280194,114.17256,Residential Building (Apartment / Condo),NW
3,Hong Kong,22.27832,114.17469,Kapok Apartment (木棉花),22.280524,114.176478,Residential Building (Apartment / Condo),NE
4,Hong Kong,22.27832,114.17469,The Oakhill 萃峯,22.276143,114.176717,Residential Building (Apartment / Condo),SE


<b>Identify lower residence density areas</b>

In [71]:
hkresi = hongkong_residence[hongkong_residence['Neighbourhood'] == 'Hong Kong']['Location'].value_counts()
hkresi

SW    4
S     2
E     2
SE    1
W     1
NE    1
NW    1
Name: Location, dtype: int64

The south-west seems to be the most crowded direction. We will focus only on those directions where the residential count is at or below the average count per direction.

The cell below does the same job for all neighborhoods.

In [72]:
t =[]
temp = []
for x in sparea['Neighbourhood']:
    #All values
    neighloc = hongkong_residence[hongkong_residence['Neighbourhood'] == x]['Location'].values 
    #Only low residential area values
    #Residential Areas loc values
    neighloc_count = hongkong_residence[hongkong_residence['Neighbourhood'] == x]['Location'].value_counts()
    neighavg = round(sum(neighloc_count)/len(neighloc_count))
    
    neighlresi = neighloc_count[lambda x : x <= neighavg]
    
    if(len(neighlresi.keys().tolist()) != 0):
        l =[]
        #for s in neighloc:
        for s in neighlresi.keys().tolist():
            l.append(s)
        if len(l) == 0:
            t.append('None')
        else:
            t.append(list(set(l)))
    else:
        t.append('None')

In [73]:
sparea['Low Residence'] = t

In [74]:
sparea

Unnamed: 0,Neighbourhood,Latitude,Longitude,Low Restaurant Density,Scenes,Parks,Low Residence
0,Hong Kong,22.27832,114.17469,"[W, S, SE, NW, NE, N]","[SW, NW]","[NE, SW, SE]","[NW, SE, E, NE, W, S]"
1,Tung Chung,22.28783,113.942429,[NE],,[N],"[NE, W, S, SE]"
2,Tai Po,22.450069,114.16877,"[NW, N, W]",,[NW],"[NE, N, NW]"
3,Sham Shui Po,22.330231,114.159447,"[E, SE, N]",,"[W, S, SE]","[NW, W, NE, SE]"
4,Central,22.282989,114.158462,"[N, SE, NE, E, NW]","[N, NE]","[N, E, SE]","[N, SW]"


<b>Combine Scenes and Parks for ease of analysis</b>

In [75]:
temp = []
t =[]
for x in sparea['Neighbourhood']:
    rest = sparea[sparea['Neighbourhood'] == x]['Low Restaurant Density'].values
    resi = sparea[sparea['Neighbourhood'] == x]['Low Residence'].values
    park = sparea[sparea['Neighbourhood'] == x]['Parks'].values
    scene = sparea[sparea['Neighbourhood'] == x]['Scenes'].values   
    #a = set(rest[0]).intersection(resi[0],park[0])    
    if(scene != 'None'):
        a = park + scene
    else:
        a = park
    temp.append(list(set(a[0])))   
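Merging the two columns amounts to a set union that tolerates a missing value. The sketch below shows this with a hypothetical helper, `combine_directions`, and assumes a missing column entry is represented by `None` rather than the string `'None'` used above.

```python
def combine_directions(parks, scenes):
    """Union of two direction lists, treating None as an empty list."""
    combined = set(parks or []) | set(scenes or [])
    return sorted(combined)


# Hong Kong has both parks and scenic spots; Tung Chung has parks only
hong_kong = combine_directions(['NE', 'SW', 'SE'], ['SW', 'NW'])
tung_chung = combine_directions(['N'], None)

print(hong_kong)
print(tung_chung)
```

Sorting the result is optional; it only makes the output order deterministic.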
    


In [76]:
sparea['Scenes/Parks'] = temp

In [77]:
sparea.drop(['Scenes','Parks'], axis = 1, inplace =True)
sparea

Unnamed: 0,Neighbourhood,Latitude,Longitude,Low Restaurant Density,Low Residence,Scenes/Parks
0,Hong Kong,22.27832,114.17469,"[W, S, SE, NW, NE, N]","[NW, SE, E, NE, W, S]","[NE, NW, SW, SE]"
1,Tung Chung,22.28783,113.942429,[NE],"[NE, W, S, SE]",[N]
2,Tai Po,22.450069,114.16877,"[NW, N, W]","[NE, N, NW]",[NW]
3,Sham Shui Po,22.330231,114.159447,"[E, SE, N]","[NW, W, NE, SE]","[W, S, SE]"
4,Central,22.282989,114.158462,"[N, SE, NE, E, NW]","[N, SW]","[N, SE, NE, E]"


<b>Optimum Regions</b>

<br>An area counts as optimum if it satisfies at least two of the three factors, that is, if it appears in at least one of these pairs:</br>
* Low Restaurant Density and Scenes/Parks
* Low Residence and Scenes/Parks
* Low Restaurant Density and Low Residence

<b>The union of these pairwise intersections gives our optimum zones.</b>

In [78]:
temp = []
for x in sparea['Neighbourhood']:
    #set_rest = set(sparea[sparea['Neighbourhood'] == x]['Low Restaurant Density'].values)
    rest = sparea[sparea['Neighbourhood'] == x]['Low Restaurant Density'].values
    resi = sparea[sparea['Neighbourhood'] == x]['Low Residence'].values
    sp = sparea[sparea['Neighbourhood'] == x]['Scenes/Parks'].values
    
    set_rest = set(rest[0])
    set_resi = set(resi[0])
    set_sp = set(sp[0])
    
    cl1 = list(set_rest & set_resi)
    cl2 = list(set_rest & set_sp)
    cl3 = list(set_resi & set_sp)
    
    temp.append(list(set(cl1 + cl2 + cl3)))
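As a worked illustration of the rule above, the following sketch applies the pairwise intersections to the Tai Po and Tung Chung rows of the table. The helper name `optimum_regions` is hypothetical and not part of the notebook; it only restates the set logic with plain Python lists.

```python
def optimum_regions(rest, resi, sp):
    """Return the directions that appear in at least two of the three lists."""
    set_rest, set_resi, set_sp = set(rest), set(resi), set(sp)
    # Union of the three pairwise intersections
    common = (set_rest & set_resi) | (set_rest & set_sp) | (set_resi & set_sp)
    return sorted(common)

# Arguments: Low Restaurant Density, Low Residence, Scenes/Parks
tai_po = optimum_regions(['NW', 'N', 'W'], ['NE', 'N', 'NW'], ['NW'])
tung_chung = optimum_regions(['NE'], ['NE', 'W', 'S', 'SE'], ['N'])
print(tai_po)      # ['N', 'NW']
print(tung_chung)  # ['NE']
```

For Tai Po, N comes from the restaurant/residence pair only, while NW is shared by all three lists.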

In [79]:
sparea['Optimum Regions'] = temp


In [80]:
sparea

Unnamed: 0,Neighbourhood,Latitude,Longitude,Low Restaurant Density,Low Residence,Scenes/Parks,Optimum Regions
0,Hong Kong,22.27832,114.17469,"[W, S, SE, NW, NE, N]","[NW, SE, E, NE, W, S]","[NE, NW, SW, SE]","[SE, NW, NE, W, S]"
1,Tung Chung,22.28783,113.942429,[NE],"[NE, W, S, SE]",[N],[NE]
2,Tai Po,22.450069,114.16877,"[NW, N, W]","[NE, N, NW]",[NW],"[N, NW]"
3,Sham Shui Po,22.330231,114.159447,"[E, SE, N]","[NW, W, NE, SE]","[W, S, SE]","[W, SE]"
4,Central,22.282989,114.158462,"[N, SE, NE, E, NW]","[N, SW]","[N, SE, NE, E]","[N, E, NE, SE]"


We are now close to our goal: the preferred zones have been identified.

<b>Get Venues Of These Optimum Regions</b>

In [81]:
prefered_locs = getNearbyVenues(categoryid='',
                                names=sparea['Neighbourhood'],
                                latitudes=sparea['Latitude'],
                                longitudes=sparea['Longitude'])
print(prefered_locs.shape)

Hong Kong
Tung Chung
Tai Po
Sham Shui Po
Central
(347, 7)


In [82]:
prefered_locs.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hong Kong,22.27832,114.17469,The Fleming (芬名酒店),22.279033,114.174722,Hotel
1,Hong Kong,22.27832,114.17469,Zahrabel,22.278194,114.175912,Middle Eastern Restaurant
2,Hong Kong,22.27832,114.17469,The Optimist,22.278049,114.175854,Spanish Restaurant
3,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant
4,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant


Next, compute the compass direction of each venue relative to its neighbourhood centre.

In [83]:
spos = []
for i in range(prefered_locs['Venue'].size):
    spos.append(checkposition(prefered_locs['Neighbourhood Latitude'][i], prefered_locs['Neighbourhood Longitude'][i],
                   prefered_locs['Venue Latitude'][i], prefered_locs['Venue Longitude'][i]))
prefered_locs['Location'] = spos
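`checkposition` is defined earlier in the notebook and is not repeated here. As a rough sketch of the idea only, one plausible way to turn two coordinate pairs into an 8-point compass direction is to compute the initial bearing and bucket it into 45-degree sectors. The names `compass_direction` and `SECTORS` are hypothetical, and the actual `checkposition` may work differently.

```python
from math import radians, degrees, sin, cos, atan2

SECTORS = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']

def compass_direction(lat1, lon1, lat2, lon2):
    """Return the 8-point compass direction from point 1 to point 2."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dlon = radians(lon2 - lon1)
    # Initial bearing on a sphere, measured clockwise from north
    x = sin(dlon) * cos(phi2)
    y = cos(phi1) * sin(phi2) - sin(phi1) * cos(phi2) * cos(dlon)
    bearing = (degrees(atan2(x, y)) + 360) % 360
    # Each sector spans 45 degrees and is centred on its compass point
    return SECTORS[int((bearing + 22.5) // 45) % 8]

# Hong Kong neighbourhood centre and Kam's Roast Goose, both from the table above
print(compass_direction(22.27832, 114.17469, 22.277647, 114.175361))  # SE
```

The venue lies south-east of the centre, which agrees with the `Location` value listed for it in the notebook output.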

In [84]:
prefered_locs.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Location
0,Hong Kong,22.27832,114.17469,The Fleming (芬名酒店),22.279033,114.174722,Hotel,N
1,Hong Kong,22.27832,114.17469,Zahrabel,22.278194,114.175912,Middle Eastern Restaurant,E
2,Hong Kong,22.27832,114.17469,The Optimist,22.278049,114.175854,Spanish Restaurant,E
3,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant,SE
4,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant,W


<b>The table above contains venues from every direction around each neighbourhood. We now drop the venues that fall outside the optimum regions defined previously, leaving only those that lie in the preferred zones.</b>

In [86]:
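# Map each neighbourhood to its list of optimum directions
optimum_by_neigh = dict(zip(sparea['Neighbourhood'], sparea['Optimum Regions']))

# Keep only the venues whose direction is an optimum region of their neighbourhood
keep = [loc in optimum_by_neigh[neigh]
        for neigh, loc in zip(prefered_locs['Neighbourhood'], prefered_locs['Location'])]
prefered_locs = prefered_locs[keep]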
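To make the filter concrete, here is a minimal plain-Python sketch applied to the first five Hong Kong venues listed earlier. The variable names `optimum`, `venues` and `kept` are illustrative only; the values are copied from the notebook output.

```python
# Optimum directions for Hong Kong, taken from the Optimum Regions column
optimum = {'Hong Kong': ['SE', 'NW', 'NE', 'W', 'S']}

# (neighbourhood, venue, direction) for the first five venues listed earlier
venues = [
    ('Hong Kong', 'The Fleming', 'N'),
    ('Hong Kong', 'Zahrabel', 'E'),
    ('Hong Kong', 'The Optimist', 'E'),
    ('Hong Kong', "Kam's Roast Goose", 'SE'),
    ('Hong Kong', 'Seorae', 'W'),
]

# Keep a venue only if its direction is an optimum region of its neighbourhood
kept = [name for neigh, name, loc in venues if loc in optimum[neigh]]
print(kept)  # ["Kam's Roast Goose", 'Seorae']
```

The three venues to the north and east are dropped because neither N nor E is an optimum region for Hong Kong.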

In [87]:
prefered_locs = prefered_locs.reset_index(drop = True)

In [88]:
prefered_locs.shape

(101, 8)

In [89]:
prefered_locs.head()

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Location
0,Hong Kong,22.27832,114.17469,Kam's Roast Goose (甘牌燒鵝),22.277647,114.175361,Cantonese Restaurant,SE
1,Hong Kong,22.27832,114.17469,Seorae (喜來稀肉),22.27828,114.174143,Korean Restaurant,W
2,Hong Kong,22.27832,114.17469,Sang Kee Seafood Restaurant (生記海鮮飯店),22.277755,114.172093,Seafood Restaurant,W
3,Hong Kong,22.27832,114.17469,Wooloomooloo Steakhouse,22.277696,114.17614,Steakhouse,SE
4,Hong Kong,22.27832,114.17469,Hong Zhou Restaurant (杭州酒家),22.27703,114.175606,Jiangsu Restaurant,SE


In [225]:
prefered_locs['Venue Category'].value_counts()

Cantonese Restaurant             7
Japanese Restaurant              7
Hong Kong Restaurant             6
Lounge                           4
Hotel                            4
Café                             4
Steakhouse                       4
Chinese Restaurant               4
Hotel Bar                        4
Thai Restaurant                  4
Korean Restaurant                3
Sandwich Place                   3
Italian Restaurant               2
Spa                              2
Dim Sum Restaurant               2
Szechuan Restaurant              2
Sushi Restaurant                 2
Noodle House                     2
Bookstore                        2
Bakery                           2
French Restaurant                1
Bus Station                      1
Dessert Shop                     1
Jiangsu Restaurant               1
Hotpot Restaurant                1
Pool                             1
Mediterranean Restaurant         1
Massage Studio                   1
Supermarket         

<b>Let's see how many restaurants there are in the optimum regions.</b>

In [221]:
optimum_rest = prefered_locs[prefered_locs['Venue Category'].str.contains('Rest')]
optimum_rest = optimum_rest.reset_index(drop = True)

In [222]:
orest = optimum_rest.groupby('Neighbourhood', sort = False).size()
orest

Neighbourhood
Hong Kong       25
Tung Chung       3
Tai Po           7
Sham Shui Po     4
Central         12
dtype: int64

Put these counts into a table.

In [228]:
optimum_df = pd.DataFrame()

In [229]:
optimum_df['Neighbourhood'] = orest.keys()
optimum_df['Optimum Regions'] = sparea['Optimum Regions']
optimum_df['Restaurants'] = orest.values
optimum_df

Unnamed: 0,Neighbourhood,Optimum Regions,Restaurants
0,Hong Kong,"[SE, NW, NE, W, S]",25
1,Tung Chung,[NE],3
2,Tai Po,"[N, NW]",7
3,Sham Shui Po,"[W, SE]",4
4,Central,"[N, E, NE, SE]",12


Let's see how many vegetarian restaurants there are in the optimum regions.

In [230]:
optimum_df['Veg'] = 0  # one counter per neighbourhood, initialised to zero

In [231]:
for index, row in optimum_rest.iterrows():
    if 'Veg' in row['Venue Category']:
        # Increment the vegetarian count for this venue's neighbourhood
        optimum_df.loc[optimum_df['Neighbourhood'] == row['Neighbourhood'], 'Veg'] += 1

In [232]:
optimum_df

Unnamed: 0,Neighbourhood,Optimum Regions,Restaurants,Veg
0,Hong Kong,"[SE, NW, NE, W, S]",25,1
1,Tung Chung,[NE],3,0
2,Tai Po,"[N, NW]",7,0
3,Sham Shui Po,"[W, SE]",4,0
4,Central,"[N, E, NE, SE]",12,0


Our optimum zones contain only one vegetarian restaurant, in Hong Kong, out of 51 restaurants in total. Restaurants are present in every region of Hong Kong, so no zone is free of them; the optimum regions are those that combine at least two of the following: relatively few restaurants, relatively few residential areas, and a nearby scenic spot or park.

Let's visualise our optimum regions before proceeding further.

In [91]:
map_pref = folium.Map(location=[latitude, longitude], zoom_start=10)
folium.TileLayer('cartodbpositron').add_to(map_pref)

# add markers to map
for lat, lng, ven in zip(prefered_locs['Venue Latitude'], prefered_locs['Venue Longitude'], 
                          prefered_locs['Venue']):
    label = '{}'.format(ven)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_pref)  
    
map_pref

<h2>K-means Clustering</h2>

In [92]:
from sklearn.cluster import KMeans

number_of_clusters = 5

pref_lat_long = prefered_locs[['Venue Latitude', 'Venue Longitude']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(pref_lat_long)

# check cluster labels generated for each row in the dataframe
#kmeans.labels_[0:10] 


Coordinates for cluster centers

In [93]:
clat = []
clong = []
for cc in kmeans.cluster_centers_:
    # Each centre is a [latitude, longitude] pair
    clat.append(cc[0])
    clong.append(cc[1])
    print(cc)

[ 22.29027682 113.94351962]
[ 22.28332206 114.15918593]
[ 22.45305701 114.16780213]
[ 22.27802033 114.1732967 ]
[ 22.32812909 114.16167531]


Let's put this information into a dataframe.

In [94]:
cluster_df = pd.DataFrame()

In [95]:
cluster_df['Clusters'] = ['Cluster {}'.format(i + 1) for i in range(number_of_clusters)]

In [96]:
cluster_df['Latitude'] = clat[:5]
cluster_df['Longitude'] = clong[:5]

In [97]:
cluster_df

Unnamed: 0,Clusters,Latitude,Longitude
0,Cluster 1,22.290277,113.94352
1,Cluster 2,22.283322,114.159186
2,Cluster 3,22.453057,114.167802
3,Cluster 4,22.27802,114.173297
4,Cluster 5,22.328129,114.161675


<h2>Final Visualisation</h2>

In [99]:
from folium.plugins import HeatMap

map_hk = folium.Map(location=[latitude, longitude], zoom_start=11)
folium.TileLayer('cartodbpositron').add_to(map_hk)
folium.Circle(location=[latitude, longitude], color='white', fill=True, fill_opacity=0.4).add_to(map_hk)
folium.Marker([latitude, longitude]).add_to(map_hk)

for lat,lon in zip(clat,clong):
    folium.Circle([lat, lon], radius=800, color='green', fill=True, fill_opacity=0.25).add_to(map_hk) 
for lat, lon in zip(prefered_locs['Venue Latitude'], prefered_locs['Venue Longitude']):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_hk)
map_hk
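The green circles on the map have a radius of 800 m around each cluster centre. As a rough check that does not rely on the map, the sketch below uses the haversine formula to test whether a venue falls inside such a circle. The function name `haversine_m` and the constant `EARTH_RADIUS_M` are hypothetical and assume a spherical Earth; the coordinates are the Cluster 4 centre and one venue from the tables above.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (spherical approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# Cluster 4 centre and Kam's Roast Goose, both taken from the notebook output
centre = (22.27802, 114.173297)
venue = (22.277647, 114.175361)

distance = haversine_m(*centre, *venue)
inside_circle = distance <= 800  # same radius as the green circles
print(round(distance), inside_circle)
```

This venue is a little over 200 m from the Cluster 4 centre, so it sits well inside the 800 m circle.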