# Capstone Project - The Battle of the Neighborhoods (Week 1)
### Applied Data Science Capstone by IBM/Coursera

## Data <a name="data"></a>

To assess a business case, the well-known SWOT business analysis methodology tells us to look at four aspects: Strengths, Weaknesses, Opportunities and Threats. As strengths and weaknesses are internal to the business itself, here we concentrate on opportunities and threats.

High restaurant density can represent a threat, as it means more competition. On the other hand, it can also represent an opportunity. For example, a cluster of French restaurants may have a “branding” effect, fostering the area's reputation as a go-to place for French food.

The other type of opportunity/threat comes from the potential customer base of the restaurant location: is it in a high-traffic area, i.e. busy and bustling with people, and are those people ready to spend money in restaurants?
Therefore, the prediction model will need to take the above ‘features’ of a location into account in order to predict whether it is a good or bad location.

For this purpose, firstly, the datasets I am going to use to represent restaurant density are:
* For a given location, how many *Chinese restaurants* are nearby and how close are they?
* For a given location, how many *similar types of restaurants* are nearby and how close are they? By similar restaurants, I mean Asian cuisines such as Japanese, Korean, Thai, Indian, Malay, etc.
* For a given location, how many *non-similar types of restaurants* are nearby and how close are they?

The above datasets can be obtained by using the **Foursquare API**.

Secondly, for data about the potential customer base, there is no readily available source. However, I found two data sources that I believe are good proxies for the potential customer base:
1. ATM (i.e. bank automated teller machine) location data. Banks place ATMs where there is a high volume of people who want to withdraw cash. Therefore, the proximity of ATMs can give a good indication of traffic volume and of whether people are spending money. The source of ATM data I use for this project is the **Overpass API (OpenStreetMap)**.
2. Public transportation link data. The Underground and buses are the two major public transportation links within London. **London Underground passenger counts** published by Transport for London (TfL) can be used as a good indication of traffic volume.

Finally, we also need historical data for the prediction target (i.e. how good or bad a location is for a Chinese restaurant) to train and verify the prediction model. Ideally, this would be the past profitability of existing Chinese restaurants, but such data are not readily available. Another good proxy would be the venue statistics from Foursquare; unfortunately, I found that Foursquare does not provide such statistics for many Chinese restaurants in London. Therefore, for the purpose of this project, I use the “number of likes” in the Foursquare venue information.
To recap, the data I am going to use to train and verify the prediction model are:
* Data for the prediction target: “number of likes” in the **Foursquare venues information**; 
* Data for the features: 
  * number of nearby Chinese restaurants and average distance (**Foursquare API**)
  * number of nearby similar types of restaurants and average distance (**Foursquare API**)
  * number of nearby non-similar types of restaurants and average distance (**Foursquare API**)
  * number of nearby ATM locations and average distance (**Overpass API - OpenStreetMap**)
  * **London Underground passenger counts**.
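
As an illustration of how each "number nearby and average distance" feature could be derived once venue coordinates are projected to metres, here is a minimal sketch. The helper name `nearby_count_and_mean_distance`, the 300 m cutoff, and the sample points are assumptions made for this example, not part of the notebook's pipeline.

```python
import math

def nearby_count_and_mean_distance(x, y, venues_xy, radius=300.0):
    """Return (count, mean distance in metres) of venues within `radius` of (x, y).

    `radius` is a hypothetical cutoff chosen for illustration only.
    """
    distances = [math.hypot(vx - x, vy - y) for vx, vy in venues_xy]
    nearby = [d for d in distances if d <= radius]
    mean_distance = sum(nearby) / len(nearby) if nearby else None
    return len(nearby), mean_distance

# Made-up projected coordinates (metres) relative to the candidate location
venues = [(100.0, 0.0), (0.0, 200.0), (1000.0, 1000.0)]
count, mean_distance = nearby_count_and_mean_distance(0.0, 0.0, venues)
print(count, mean_distance)
```

Returning `None` for the mean when nothing is nearby keeps "no venues" distinct from "venues at distance zero"; how to encode that case for the model is left open here.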


Now, let's start extracting the data and presenting some **examples of the data**.

In [None]:
###
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis

!conda install -c conda-forge geopy --yes # comment this line out if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
import json

from pandas import json_normalize # transform JSON data into a pandas dataframe (pandas >= 1.0)

import pickle


Firstly, I set out to investigate an area of ~6 km radius around central London. The area of interest can be changed according to stakeholder requirements by adjusting the location of the center point and the length of the radius.

London Chinatown is at the very heart of London, right next to the National Gallery, so I choose it as the center point for London.
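
To make the "center point plus radius" definition concrete, the sketch below tests whether a point falls inside the area of interest. It uses the haversine great-circle formula rather than the UTM projection used later in this notebook, and the helper names `haversine_m` and `in_area` are hypothetical. The center coordinates are the ones reported by the geocoding cell in this notebook.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius; an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def in_area(lat, lon, center, radius_m):
    """True if (lat, lon) lies within radius_m of center = (lat, lon)."""
    return haversine_m(lat, lon, center[0], center[1]) <= radius_m

center = (51.5110319, -0.1315618)  # Chinatown, as reported by the geocoding cell
radius_m = 6000                    # adjustable per stakeholder requirement

print(in_area(center[0] + 0.01, center[1], center, radius_m))  # roughly 1.1 km north
print(in_area(center[0] + 0.10, center[1], center, radius_m))  # roughly 11 km north
```

Changing `center` or `radius_m` is all that is needed to move or resize the study area in this sketch.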

In [2]:
###
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
address = '10 Wardour St, London W1D 6BZ'  # London Chinatown

geolocator = Nominatim(user_agent="me")
location = geolocator.geocode(address)
london_center_lat = location.latitude
london_center_lon = location.longitude
print('The geographical coordinates of London are {}, {}.'.format(london_center_lat, london_center_lon))

The geographical coordinates of London are 51.5110319, -0.1315618.


Foursquare credentials:

In [51]:
foursquare_client_id = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID
foursquare_client_secret = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret
foursquare_version = '20190318'

I initially used a single Foursquare API call with a radius of 6 km, but it returned only 100 restaurants:

In [4]:
chinese_restaurant_category = '4bf58dd8d48988d145941735'
radius = 6000
limit = 600
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        foursquare_client_id, foursquare_client_secret, foursquare_version, london_center_lat, london_center_lon, chinese_restaurant_category, radius, limit)
results = requests.get(url).json()['response']['groups'][0]['items']
chinese_restaurants = [(item['venue']['id'],
                        item['venue']['name'],
                        (item['venue']['location']['lat'], item['venue']['location']['lng'])) for item in results] 
len(chinese_restaurants)

100

Since a 150m radius around Chinatown alone already contains nearly 100 restaurants, an area of 6 km radius should contain more than 100 restaurants!

**It seems the Foursquare API limits the number of venues it returns**. Therefore, I’ll divide the (large) area of interest into smaller ones and make a Foursquare API call for each.
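
When the smaller areas overlap, the same venue can come back from more than one call. A minimal sketch of merging per-area results while dropping duplicates is shown below. The function name `merge_venues` and the sample tuples are made up for illustration; real results would come from the API.

```python
def merge_venues(batches):
    """Merge per-area result lists into one dict keyed by venue id.

    Keying on the id means a venue returned by two overlapping areas
    is stored only once.
    """
    merged = {}
    for batch in batches:
        for venue_id, name, latlon in batch:
            merged[venue_id] = (venue_id, name, latlon)
    return merged

# Hypothetical results from two overlapping areas; "v2" appears in both
area_a = [("v1", "Restaurant A", (51.510, -0.130)),
          ("v2", "Restaurant B", (51.512, -0.131))]
area_b = [("v2", "Restaurant B", (51.512, -0.131)),
          ("v3", "Restaurant C", (51.515, -0.128))]

merged = merge_venues([area_a, area_b])
print(len(merged))
```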


### *\*Acknowledgement\**
*The Python code for dividing the area is very similar to that of the example notebook provided as part of the Capstone assignment. So* ***a massive thanks to the original author of the notebook***. 

*However,* ***the similarity stops here, as I’ll be using a completely different inferential statistical method*** *to address a similar business problem.*

In [5]:
###
#!pip install shapely
#import shapely.geometry

#!pip install pyproj
import pyproj

import math

def latlon_to_xy(lat, lon):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=30, datum='WGS84')
    x, y = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return x, y

def xy_to_latlon(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=30, datum='WGS84')
    lon, lat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lat, lon

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('London center latitude={}, longitude={}'.format(london_center_lat, london_center_lon))
london_center_x, london_center_y = latlon_to_xy(london_center_lat, london_center_lon)
print('Projected London UTM X={}, Y={}'.format(london_center_x, london_center_y))
lat, lon = xy_to_latlon(london_center_x, london_center_y)
print('Projected back London center latitude={}, longitude={}'.format(lat,lon))


Coordinate transformation check
-------------------------------
London center latitude=51.5110319, longitude=-0.1315618
Projected London UTM X=699039.3970373612, Y=5710557.318131826
Projected back London center latitude=51.51103189999999, longitude=-0.1315617999999997


Let's create a **hexagonal grid of cells** whose centers are 600m apart (i.e. cells of 300m radius): we offset every other row, and adjust the vertical row spacing so that **every cell center is equally distant from all its neighbors**.
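
The equal-distance property can be checked with a few lines of arithmetic. The sketch below assumes a 600 m horizontal step, a half-step offset on alternate rows, and a row spacing of step × √3/2; the variable names are chosen for this example.

```python
import math

step = 600.0              # horizontal spacing between cell centers (metres)
k = math.sqrt(3) / 2      # row spacing as a fraction of the step
row_step = step * k

# Relative to a center at the origin:
same_row_neighbor = (step, 0.0)           # next center in the same row
next_row_neighbor = (step / 2, row_step)  # adjacent row is shifted by half a step

d_same = math.hypot(*same_row_neighbor)
d_next = math.hypot(*next_row_neighbor)
print(d_same, d_next)
```

Both distances come out at the step length (up to floating-point rounding), because (step/2)² + (step·√3/2)² = step².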

In [6]:
### 
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = london_center_x - 6000
x_step = 600
y_min = london_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 41):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(london_center_x, london_center_y, x, y)
        if (distance_from_center <= 6001):
            lat, lon = xy_to_latlon(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

364 candidate neighborhood centers generated.


Let's visualize the data we have so far: city center location and candidate neighborhood centers:

In [None]:
#!pip install folium

import folium
from folium.plugins import HeatMap
london_center=[london_center_lat,london_center_lon]
map_london = folium.Map(location=london_center, zoom_start=12)
folium.Marker(london_center, popup='Chinatown').add_to(map_london)
for lat, lon in zip(latitudes, longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_london) 
    folium.Circle([lat, lon], radius=300, color='yellow', fill=False).add_to(map_london)
    #folium.Marker([lat, lon]).add_to(map_london)
map_london


In [15]:
### 

#def get_primary_category(categories):
#    cat = []
#    for cat in categories:
#        if cat['primary']:
#            cat = [cat['id'], cat['name']]
#    return cat

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, foursquare_version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   (item['venue']['location']['lat'], item['venue']['location']['lng'])) for item in results]       
    except Exception:
        # Network errors or unexpected response shapes yield an empty result
        venues = []
    return venues

# get venues information from a list of latitudes and longitudes
def get_venues(lats, lons, cat_id):
    venues = {}
    count = 0
    print('Start', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to make sure we have overlaps/full coverage so we don't miss any restaurant 
        # (we're using dictionaries to remove any duplicates resulting from area overlaps)
        results = get_venues_near_location(lat, lon, cat_id, foursquare_client_id, foursquare_client_secret, radius=350, limit=150)
        for venue in results:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_latlon = venue[2]
            x, y = latlon_to_xy(venue_latlon[0], venue_latlon[1])
            count += 1 
            print(count, end='.')
            venue = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], x, y)                
            venues[venue_id] = venue
        print('.', end='')
    print(' done.')
    return venues
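
The `radius=350` choice in `get_venues` can be sanity-checked with a little geometry. Assuming the grid centers form a regular hexagonal lattice with 600 m spacing, the point farthest from every center is a vertex of the hexagonal cell surrounding a center, at a distance of spacing/√3. The variable names below are chosen for this sketch.

```python
import math

spacing = 600.0        # distance between neighboring grid centers (metres)
search_radius = 350.0  # radius passed to each API call

# Farthest any point can be from its nearest center on a hexagonal lattice
worst_case = spacing / math.sqrt(3)

fully_covered = worst_case <= search_radius
print(round(worst_case, 2), fully_covered)
```

Under these assumptions the worst case is a little under 350 m, so the search circles leave no gaps, at the cost of overlaps that the dictionary keyed by venue id removes.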

In [25]:
### Get CHINESE restaurants data

chinese_restaurants = {}
# Try to load from local file system first in case there is a saved dataset
loaded = False
try:
    with open('chinese_restaurants.pkl', 'rb') as f:
        chinese_restaurants = pickle.load(f)
    print('Chinese restaurant data loaded from local file.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    # According to Foursquare web documentation (https://developer.foursquare.com/docs/resources/categories),
    # the category for Chinese restaurants include: Chinese Restaurant, Cantonese Restaurant, 
    # Cha Chaan Teng, Dim Sum Restaurant, and Szechuan Restaurant. Their IDs are:
    chinese_restaurant_categories = ['4bf58dd8d48988d145941735', '52af3a7c3cf9994f4e043bed', '58daa1558bbb0b01f18ec1d3',
                                     '4bf58dd8d48988d1f5931735', '52af3b773cf9994f4e043c03'] 
    for category_id in chinese_restaurant_categories:  # avoid shadowing the built-in id()
        chinese_restaurants.update(get_venues(latitudes, longitudes, category_id))

    # save this in local file system
    with open('chinese_restaurants.pkl', 'wb') as f:
        pickle.dump(chinese_restaurants, f)

print('There are {} Chinese restaurants in the dataset.'.format(len(chinese_restaurants)))

Chinese restaurant data loaded from local file.
There are 369 Chinese restaurants in the dataset.


In [34]:
### Now let's get the number of "likes" for each Chinese restaurant using Foursquare venue API 

# Try to load from local file system first in case there is a saved dataset
loaded = False
try:
    with open('chinese_restaurants_with_likes.pkl', 'rb') as f:
        chinese_restaurants_with_likes = pickle.load(f)
    print('Chinese restaurant data (with likes) loaded from local file.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    chinese_restaurants_with_likes = {}
    result_jsons = []
    for venue_id in list(chinese_restaurants):
        url = 'https://api.foursquare.com/v2/venues/{}?client_id={}&client_secret={}&v={}'.format(venue_id, foursquare_client_id, foursquare_client_secret, foursquare_version)
        results = requests.get(url).json()
        likes = results['response']['venue']['likes']['count']
        result_jsons.append(results)
        chinese_restaurants_with_likes[venue_id] = chinese_restaurants[venue_id]+(likes,)

    # save this in local file system
    with open('chinese_restaurants_with_likes.pkl', 'wb') as f:
        pickle.dump(chinese_restaurants_with_likes, f)

len(chinese_restaurants_with_likes)

In [21]:
### Get data of similar types of restaurants

similar_restaurants = {}
# Try to load from local file system first in case there is a saved dataset
loaded = False
try:
    with open('similar_restaurants.pkl', 'rb') as f:
        similar_restaurants = pickle.load(f)
    print('Similar restaurant data loaded from local file.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    # Restaurant categories that are similar to Chinese mainly include: Indonesian, Japanese, Korean Restaurant, Malay,  
    # Thai, Vietnamese,Indian, Pakistani, and Sri Lankan Restaurants. According to Foursquare web documentation
    # (https://developer.foursquare.com/docs/resources/categories), their IDs are:
    similar_restaurant_categories = ['4deefc054765f83613cdba6f','4bf58dd8d48988d111941735','4bf58dd8d48988d113941735',
                                     '4bf58dd8d48988d156941735','4bf58dd8d48988d149941735','4bf58dd8d48988d14a941735',
                                     '4bf58dd8d48988d10f941735','52e81612bcbc57f1066b79f8','5413605de4b0ae91d18581a9']
    for category_id in similar_restaurant_categories:  # avoid shadowing the built-in id()
        similar_restaurants.update(get_venues(latitudes, longitudes, category_id))

    # save this in local file system
    with open('similar_restaurants.pkl', 'wb') as f:
        pickle.dump(similar_restaurants, f)

print('There are {} similar restaurants in the dataset.'.format(len(similar_restaurants)))

Similar restaurant data loaded from local file.
There are 1608 similar restaurants in the dataset.


In [24]:
### Get data of non-similar types of restaurants

non_similar_restaurants = {}
# Try to load from local file system first in case there is a saved dataset
loaded = False
try:
    with open('non_similar_restaurants.pkl', 'rb') as f:
        non_similar_restaurants = pickle.load(f)
    print('Non-similar restaurant data loaded from local file.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    # Restaurant categories that are not similar to Chinese mainly include: American, Caribbean, Eastern European, English, French, 
    # German, Greek, Italian, Jewish, Latin American, Mexican, Middle Eastern, Modern European, Russian, Spanish, and Turkish Restaurants.
    # According to Foursquare web documentation (https://developer.foursquare.com/docs/resources/categories), their IDs are:
    non_similar_restaurant_categories = ['4bf58dd8d48988d14e941735','4bf58dd8d48988d144941735','4bf58dd8d48988d109941735',
                                         '52e81612bcbc57f1066b7a05','4bf58dd8d48988d10c941735','4bf58dd8d48988d10d941735',
                                         '4bf58dd8d48988d10e941735','4bf58dd8d48988d110941735','52e81612bcbc57f1066b79fd',
                                         '4bf58dd8d48988d1be941735','4bf58dd8d48988d1c1941735','4bf58dd8d48988d115941735',
                                         '52e81612bcbc57f1066b79f9','5293a7563cf9994f4e043a44','4bf58dd8d48988d150941735',
                                         '4f04af1f2fb6e1c99f3db0bb']
    for category_id in non_similar_restaurant_categories:  # avoid shadowing the built-in id()
        non_similar_restaurants.update(get_venues(latitudes, longitudes, category_id))

    # save this in local file system
    with open('non_similar_restaurants.pkl', 'wb') as f:
        pickle.dump(non_similar_restaurants, f)

print('There are {} non-similar restaurants in the dataset.'.format(len(non_similar_restaurants)))

Non-similar restaurant data loaded from local file.
There are 4065 non-similar restaurants in the dataset.


In [37]:
###
# Use the Overpass API (OpenStreetMap) to obtain ATM locations
radius=6150
overpass_url = "http://overpass-api.de/api/interpreter"
overpass_atm_query = """
[out:json][timeout:180];
node[amenity=atm](around:{},{},{});
out;
""".format(radius,london_center[0],london_center[1])
response = requests.get(overpass_url, params={'data': overpass_atm_query})
atm_data = response.json()
atm_latlons = [[item['lat'],item['lon']] for item in atm_data['elements']]

In [48]:
###
# Chinese restaurant heatmap overlaid with ATM
chinese_restaurant_latlons = [(res[2],res[3]) for res in chinese_restaurants.values()]
map_london = folium.Map(london_center, zoom_start=12)
# folium.Marker(london_center, popup='Chinatown').add_to(map_london)
folium.CircleMarker(london_center, radius=2, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_london) 
HeatMap(chinese_restaurant_latlons,radius=10).add_to(map_london)
for latlon in atm_latlons:
    folium.CircleMarker(latlon, radius=1, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_london)

map_london

### Examples of the data
The following are some **examples of the data**:

In [41]:
print('Examples of Chinese restaurants')
print('-------------------------------')
for r in list(chinese_restaurants_with_likes.values())[:10]:
    print(r)
print('...')
print('Total:', len(chinese_restaurants_with_likes))

Examples of Chinese restaurants
-------------------------------
('51be1a81498e7e9475a2847c', 'Mama Lan', 51.46158222955795, -0.1386893677680702, 698759.924395796, 5705040.036369441, 47)
('57433804498e6896a9ac6c3d', 'On Cafe', 51.461159936690485, -0.13606309890747068, 698944.1605157366, 5705000.219283052, 8)
('4fa44009e4b0cfe54c0e2cfb', 'Courtesan Dim Sum', 51.46115994755113, -0.11122012921135563, 700669.5588322084, 5705068.029273553, 45)
('4c0e9eb87189c92802ccd8b6', 'Big Fat Panda', 51.46406135523888, -0.16523420365737124, 696905.6586762026, 5705243.931458072, 5)
('4c93c6e458d4b60c7f6f2229', 'Ku Do', 51.465477250000006, -0.127758, 699502.1307681849, 5705502.82779718, 0)
('4bf4369dff90c9b687fa5428', 'Sun', 51.463309237975686, -0.13375971722526808, 699094.777378501, 5705245.444308283, 1)
('5543caad498ed0aa60d91ec9', 'Fu Manchu', 51.46450966067314, -0.12982931791815347, 699362.5056508451, 5705389.606454721, 29)
('4e7dccf5d3e3294a67ba2f07', 'Mama Lan', 51.462523456612985, -0.11227912704606

In [43]:
print('Examples of similar restaurants')
print('-------------------------------')
for r in list(similar_restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(similar_restaurants))

Examples of similar restaurants
-------------------------------
('4b4cd41bf964a5208dc026e3', "Nancy Lam's Enak Enak", 51.465976, -0.153701, 697698.3406066891, 5705487.89209157)
('5a8058b99de23b09d875b87e', 'Warung Rumpi London', 51.488782, -0.092265, 701863.9928464844, 5708191.203633217)
('5ad9e177a92d981f70847a1c', 'Sambel', 51.501529, -0.111248, 700490.4209270055, 5709556.237222514)
('55eb0642498e36ccf4d34004', 'Japindo', 51.49874093205115, -0.06749286354492272, 703539.0299959745, 5709367.073062456)
('548cc33d498e7cd8fda0e079', 'Soda Coda Cafe', 51.498748779296875, -0.06757054477930069, 703533.6043903355, 5709367.729417515)
('4e4a70b4d4c0dae7bfcb7fc0', 'Banana Tree', 51.51312243247239, -0.13374866254940396, 698878.5714444991, 5710783.800307268)
('4ac518dcf964a52023a920e3', 'Bali Bali', 51.51385268807065, -0.12840184944050295, 699246.3092473993, 5710879.5419694455)
('561007a5498e8b545895df6a', 'Nusa Dua', 51.51259012437537, -0.13079917077626516, 699085.5090451674, 5710732.640214789)
(

In [45]:
print('Examples of non-similar restaurants')
print('-----------------------------------')
for r in list(non_similar_restaurants.values())[10:19]:
    print(r)
print('...')
print('Total:', len(non_similar_restaurants))

Examples of non-similar restaurants
-----------------------------------
('4d02a1c7e350b60c291b7842', 'Venn Street Records', 51.4623889239294, -0.1378505864819057, 698814.6717795614, 5705132.003742321)
('57fcc1d7498e78978286c511', 'Phase Four', 51.46534239225995, -0.12877569412468998, 699432.0445504743, 5705485.0608908115)
('53444abe498eba6f0db62b87', 'Red Dog Saloon', 51.46445481434657, -0.12926999840357697, 699401.5879960831, 5705385.0320397625)
('4ec2bcbb8231a83de8c751f7', 'Chicken & ribs', 51.46594512202947, -0.11445495837103714, 700423.9171643042, 5705591.187734496)
('532c9031498e13befbe4469e', 'Brixton Diner', 51.46288842728365, -0.10830529972110091, 700864.4065985312, 5705268.197783859)
('56deb685cd10bb2f76fdc1f3', 'Hip Hop Chip Shop', 51.461915912148626, -0.11195728013338438, 700615.0456041377, 5705150.057719219)
('4c7384e38efc3704b9ca157d', 'Love Walk Cafe', 51.47086901877433, -0.09286352926981543, 701901.5784266564, 5706197.9714792315)
('4d3b0ba6039eb60c0157f19c', 'The Old Dis

In [46]:
print('Examples of ATM data')
print('--------------------')
for r in atm_latlons[0:9]:
    print(r)
print('...')
print('Total:', len(atm_latlons))

Examples of ATM data
--------------------
[51.5168043, -0.1041936]
[51.5337252, -0.2044338]
[51.5441105, -0.0899055]
[51.5434742, -0.0908125]
[51.5298413, -0.0803599]
[51.528965, -0.0776711]
[51.530544, -0.0789771]
[51.5177608, -0.1092027]
[51.5144211, -0.0567841]
...
Total: 537


Note: The London Underground passenger counts dataset has also been downloaded as an Excel file to be used in the next phase of this project.