# Gathering Bangalore Venues Data

This notebook collects points-of-interest (POI) data for each cell in the hex grid created earlier, using the Foursquare Places API.

In [1]:
import json
import os

import folium
import geopandas as gpd
import numpy as np
import pandas as pd
import plotly.express as px
import requests
import requests_cache
from diskcache import Cache
from nested_lookup import nested_lookup
from ratelimiter import RateLimiter
from tqdm.notebook import tqdm

In [2]:
# First import the list of hexes
df_hex = gpd.read_feather('../data/bangalore_hex_addresses.feather')
df_hex.head()

Unnamed: 0,id,hex_id,ward_no,centre_lat,centre_lon,resolution,pop_total,ward_name,geometry,address
0,8860169669fffff,8860169669fffff,1,13.09817,77.600839,8,1186.041718,Kempegowda Ward,"POLYGON ((77.60501 13.09567, 77.60510 13.10065...","Yelahanka Sante, Shivanahalli Main Road, Gandh..."
1,8860169661fffff,8860169661fffff,1,13.105662,77.59675,8,1845.480809,Kempegowda Ward,"POLYGON ((77.60092 13.10316, 77.60101 13.10814...","Yelahanka, Kempegowda, Yelahanka Zone, Bengalu..."
2,8860169753fffff,8860169753fffff,1,13.128114,77.592914,8,1786.194897,Kempegowda Ward,"POLYGON ((77.59709 13.12561, 77.59717 13.13060...","Century Artizan, Kempegowda, Yelahanka Zone, B..."
3,8861892db3fffff,8861892db3fffff,1,13.113109,77.60952,8,1413.21988,Kempegowda Ward,"POLYGON ((77.61369 13.11060, 77.61378 13.11559...","Yelahanka, Kempegowda, Yelahanka Zone, Bengalu..."
4,8860169645fffff,8860169645fffff,1,13.090701,77.596498,8,2635.116823,Kempegowda Ward,"POLYGON ((77.60067 13.08820, 77.60076 13.09318...","Bellary Road, Amruthnagar, Byatarayanapura, Ye..."


## Foursquare API calls

We will use the search endpoint of the Foursquare Places API to get locations in each hexagonal cell area.

In [3]:
# Get Foursquare API Keys from environment variables
FOURSQUARE_CLIENT_ID = os.environ.get('FOURSQUARE_CLIENT_ID')
FOURSQUARE_CLIENT_SECRET = os.environ.get('FOURSQUARE_CLIENT_SECRET')
# Fail if either key is missing (both are required)
if not FOURSQUARE_CLIENT_ID or not FOURSQUARE_CLIENT_SECRET:
    raise Exception('Could not find environment variables FOURSQUARE_CLIENT_ID or FOURSQUARE_CLIENT_SECRET.')

In [4]:
# Enable caching of requests
requests_cache.install_cache(
    '../data/cache/requests',
    backend='sqlite',
    ignored_parameters = ['client_id', 'client_secret'] #! Don't include API Keys in cache (names must match the request parameters)
)

Define a function to query the API for venues near each point in a list of coordinates.

In [5]:
def foursquareSearch(
    lats,
    lons,
    client_id,
    client_secret,
    version = None,
    radius = None,
    category_id = None,
    limit = None,
):
    """Use Foursquare's Places API to obtain a list of venues near each point in a list

    Args:
        lats (list): List of latitudes
        lons (list): List of longitudes
        client_id (str): Foursquare Client ID
        client_secret (str): Foursquare Client Secret
        version (str, YYYYMMDD format): API version to use. Defaults to None (current)
        radius (int, optional): Radius (metres) around point to search. Defaults to auto-suggested radius.
        category_id (list, optional): List of Foursquare Category IDs. Defaults to all categories.
        limit (int, optional): Number of results to return per point. Defaults to None
    
    Returns:
        [list] List of Response objects
    """
    
    # Check that there are equal numbers of latitudes and longitudes
    if len(lats) != len(lons):
        raise ValueError('Lengths of latitude and longitude lists must be equal')
    length = len(lats)
    # Prepare parameter strings; requests omits parameters whose value is None
    version = str(version) if version else None
    radius = str(radius) if radius else None
    categories = ','.join(category_id) if category_id else None
    limit = str(limit) if limit else None
    
    url = 'https://api.foursquare.com/v2/venues/search' # URL for the endpoint
    
    # Constant parameters
    parameters = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'radius': radius,
        'categoryId': categories,
        'limit': limit
    }
    with requests.Session() as s:
        new_requests = 0 # Counter to keep track of how many new API requests are being made
        responses = [] # Empty list to store responses
        for lat, lon in tqdm(zip(lats, lons), total = length):
            query = dict(parameters) # Copy the constant parameters so the shared dict is not mutated
            query['ll'] = ','.join((str(lat), str(lon))) # Convert the points into a comma separated list and add to params
            response = s.get(url=url, params = query)
            responses.append(response)
            new_requests += 0 if response.from_cache else 1
            
        print('{} new API requests made.'.format(new_requests))
    return responses
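
The loop in the function reuses one set of constant parameters and changes only `ll` for each point. As a minimal sketch of that step in isolation, with no network access, the snippet below builds one query per point. `build_query` is a hypothetical helper introduced for illustration and is not part of the notebook.

```python
def build_query(constants, lat, lon):
    """Return a fresh parameter dict for one point (hypothetical helper)."""
    # Build a new dict so the shared constants are never mutated,
    # and skip None values, which requests would leave out of the URL anyway
    query = {k: v for k, v in constants.items() if v is not None}
    # The search endpoint expects the point as "lat,lon" in the ll parameter
    query['ll'] = '{},{}'.format(lat, lon)
    return query

# Toy constants; the API keys are deliberately left out of this sketch
constants = {'v': '20210315', 'radius': '462', 'limit': None}
first = build_query(constants, 13.09817, 77.600839)
second = build_query(constants, 13.105662, 77.59675)
print(first['ll'])
print(second['ll'])
```

Keeping the per-point dictionary separate from the constants makes it easier to log or replay an individual request later.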

We now apply this function to the centre points of all hexes in our list, with the search radius set to the circumradius of the hexagon. The search circle therefore covers the entire hexagon, but it also overlaps neighbouring cells, so some venues will be returned more than once and must be de-duplicated later. The circumradius of a resolution 8 hexagon is around 462 metres.
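
Since cell sizes can vary across the grid, it may be worth checking the search radius against the geometry of an actual cell. The sketch below is one way to do that, assuming a spherical Earth and the haversine formula. `haversine_m` and `circumradius_m` are hypothetical helpers, and the sample points are the first row's centre and the two vertices visible in the truncated `geometry` output above.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres on a sphere of mean Earth radius."""
    r = 6371008.8  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def circumradius_m(centre, vertices):
    """Largest centre-to-vertex distance; all inputs are (lat, lon) tuples."""
    return max(haversine_m(centre[0], centre[1], lat, lon) for lat, lon in vertices)

# First hex in df_hex: centre and the two vertices shown in the output above
centre = (13.09817, 77.600839)
visible_vertices = [(13.09567, 77.60501), (13.10065, 77.60510)]
print(round(circumradius_m(centre, visible_vertices)))
```

Note that Shapely stores polygon coordinates as (lon, lat), so they need to be swapped before being passed to these helpers.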

We will also use the following category IDs in our search. See the full list of categories [here](https://developer.foursquare.com/docs/build-with-foursquare/categories/).

| Category Name               | Category ID              |
|-----------------------------|--------------------------|
| Arts & Entertainment        | 4d4b7104d754a06370d81259 |
| College & University        | 4d4b7105d754a06372d81259 |
| Event                       | 4d4b7105d754a06373d81259 |
| Food                        | 4d4b7105d754a06374d81259 |
| Nightlife Spot              | 4d4b7105d754a06376d81259 |
| Outdoors & Recreation       | 4d4b7105d754a06377d81259 |
| Professional & Other Places | 4d4b7105d754a06375d81259 |
| Residence                   | 4e67e38e036454776db1fb3a |
| Shop & Service              | 4d4b7105d754a06378d81259 |

In [6]:
# Obtain a nested JSON of categories from the Foursquare API
p = {
    'client_id': FOURSQUARE_CLIENT_ID,
    'client_secret': FOURSQUARE_CLIENT_SECRET,
    'v': '20210315',
}
foursquare_categories = requests.get('https://api.foursquare.com/v2/venues/categories', params=p)
top_level_cats = foursquare_categories.json()['response']['categories']
with open('../data/foursquare_categories.json', 'w') as out_file:
    json.dump(top_level_cats, out_file)

In [7]:
# Select categories other than Travel & Transport
category_list = []
for cat in top_level_cats:
    if cat['name'] not in ['Travel & Transport']:
        category_list.append(cat['id'])

In [8]:
coords = df_hex[['centre_lat','centre_lon']].to_records(index=False)
lats, lons = zip(*coords)

# Make the requests
responses = foursquareSearch(
    lats= lats,
    lons= lons,
    client_id= FOURSQUARE_CLIENT_ID,
    client_secret= FOURSQUARE_CLIENT_SECRET,
    version= '20210315',
    radius= 462,
    category_id= category_list,
    limit = 100,
)

statuses = [r.status_code for r in responses]
print('{} of {} responses OK.'.format(statuses.count(200),len(statuses)))

0 new API requests made.
942 of 942 responses OK.


We now need to process the information and create a dataframe containing all the venue information.

In [9]:
def processFoursquareResponse(response):
    """Extracts list of venues from foursquare API response.

    Args:
        response (Response): requests Response object

    Returns:
        [list]: List containing sub-lists with each individual venue details
    """
    venues = response.json()['response']['venues']
    venue_details = [] # List to store processed data
    for v in venues:
        venue_id = v['id'] # Avoid shadowing the built-in id()
        name = v['name']
        lat = v['location']['lat']
        lon = v['location']['lng']
        address = ','.join(v['location']['formattedAddress'])
        # Use the primary category, falling back to the first one listed.
        # An empty string marks a venue with no category; such rows are dropped later.
        categories = v['categories']
        category = next(
            (c['name'] for c in categories if c.get('primary')),
            categories[0]['name'] if categories else '',
        )
        # Create venue row
        venue_details.append([
            venue_id,
            name,
            lat,
            lon,
            address,
            category,
        ])
    # Return list of venues
    return venue_details
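
The function above assumes every response carries a `response.venues` list. As a hedged sketch, the helper below guards against payloads that do not, using mock dictionaries instead of live responses. `venues_from_payload` is a hypothetical name, the mock values are made up, and the check on `meta.code` assumes the v2 response envelope reports its own status there.

```python
def venues_from_payload(payload):
    """Return the venue list from a parsed search payload, or [] if unusable."""
    # Assumption: the v2 envelope carries its own status in meta.code
    if payload.get('meta', {}).get('code') != 200:
        return []
    return payload.get('response', {}).get('venues', [])

# Mock payloads with made-up values, standing in for response.json()
ok = {'meta': {'code': 200}, 'response': {'venues': [{'id': 'x1', 'name': 'Toy Cafe'}]}}
failed = {'meta': {'code': 429}, 'response': {}}

print(len(venues_from_payload(ok)))
print(len(venues_from_payload(failed)))
```

Returning an empty list keeps the downstream loop unchanged, at the cost of silently skipping failed cells, so logging the skipped hex IDs may be worthwhile.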

In [10]:
# Extract all venues
venues_list = []
for hex_id, response in zip(df_hex['id'], responses):
    venues = processFoursquareResponse(response)
    # Add hexID to each venue in list
    for v in venues:
        v.append(hex_id)
    venues_list.extend(venues)
    
df_venues = pd.DataFrame(venues_list, columns = [
    'venue_id',
    'name',
    'lat',
    'lon',
    'address',
    'category',
    'hex_id'
]).drop_duplicates(subset='venue_id')\
    .replace('',np.nan)\
    .dropna()\
    .reset_index(drop=True)

print(df_venues.shape)
print(df_venues.dtypes)
df_venues.head()

(19985, 7)
venue_id     object
name         object
lat         float64
lon         float64
address      object
category     object
hex_id       object
dtype: object


Unnamed: 0,venue_id,name,lat,lon,address,category,hex_id
0,507a3692e4b02a5a82ef0742,Yelahanka Vegetable Market,13.097353,77.59971,"Yelahanka,Bangalore,Karnātaka,India",Farmers Market,8860169669fffff
1,511487e4e4b0da5e8b41ffac,HDFC Bank,13.099437,77.597802,"MVM Cplx, Gr Flr, Santhe Circle, NH 7, BB Rd (...",Bank,8860169669fffff
2,508bedd3e4b0702005e81b24,Sumathi Jewellers,13.096672,77.598305,"Yelahanka,India",Jewelry Store,8860169669fffff
3,51dd021c498e7cb7a88516df,Pizza Corner,13.099742,77.597595,India,Pizza Place,8860169669fffff
4,5bfaf31a1c0b34002c8639ce,Urban's Restaurant,13.099497,77.59963,"Bangalore 560064,Karnātaka,India",Restaurant,8860169669fffff


In [11]:
df_venues.describe(include = 'all')

Unnamed: 0,venue_id,name,lat,lon,address,category,hex_id
count,19985,19985,19985.0,19985.0,19985,19985,19985
unique,19985,17473,,,8589,530,875
top,505d82a4e4b06d1d2b0ea3fa,HDFC Bank ATM,,,India,Residential Building (Apartment / Condo),8861892421fffff
freq,1,97,,,8000,1389,50
mean,,,12.972925,77.610759,,,
std,,,0.057107,0.060998,,,
min,,,12.834263,77.461716,,,
25%,,,12.926858,77.567362,,,
50%,,,12.971981,77.602513,,,
75%,,,13.011358,77.651119,,,


## Cleaning Foursquare data

We should check to see how many locations we have per hexagon and per category. Also, using the category tree from Foursquare, we can map the lower level categories to higher level ones where appropriate, and reduce the total number of categories to a more manageable amount.

In [12]:
venues_per_hex = df_venues.groupby('hex_id')['venue_id'].count()

# Plot histogram
fig = px.histogram(
    x = venues_per_hex,
    #// nbins = 10,
    histnorm = 'percent',
    #// cumulative = True,
    template = 'plotly',
)

fig.update_layout(
    title = "Venues per hexagon",
    title_x = 0.5,
    xaxis_title = 'Number of unique venues',
    yaxis_title = 'Percentage of hexes',
    bargap = 0.01,
)
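
One caveat with the grouped counts: hexes with no venues at all do not appear in the `groupby` result (the venue table has 875 unique `hex_id` values, against 942 hexes queried). The sketch below shows one way to add them back as zeros before computing a share, using toy data with made-up values in place of `df_hex` and `df_venues`.

```python
import pandas as pd

# Toy stand-ins for df_hex['id'] and df_venues (values are made up)
all_hex_ids = ['a', 'b', 'c', 'd']
toy_venues = pd.DataFrame({
    'venue_id': ['v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7'],
    'hex_id': ['a', 'a', 'a', 'a', 'a', 'a', 'b'],
})

counts = toy_venues.groupby('hex_id')['venue_id'].count()
# Hexes without venues are missing from the groupby result, so add them as 0
counts_all = counts.reindex(all_hex_ids, fill_value=0)
# Fraction of hexes with fewer than 5 venues
share_sparse = (counts_all < 5).mean()
print(share_sparse)
```

With the real data, the same `reindex` call would take `df_hex['id']` as the full list of cells.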

Around 20% of our hexes have fewer than 5 venues, which is not ideal, but overall the distribution appears reasonable. We now check the same for categories.

In [13]:
print('Total number of categories is {}'.format(len(df_venues['category'].unique())))

Total number of categories is 530


In [14]:
venues_per_cat = df_venues.groupby('category')['venue_id'].count()\
    .sort_values(ascending=False)
venues_per_cat[0:5]

category
Residential Building (Apartment / Condo)    1389
Indian Restaurant                           1344
Office                                       957
Bank                                         571
Bakery                                       403
Name: venue_id, dtype: int64

There are far too many categories in the current data, and several of them have only a few representative data points. We need to map these to a category list that is better suited to our use case.

For each top-level category, we now list the first-level sub-categories that have a reasonable number of occurrences in our set (more than 50 venues, counting their child categories). We can then decide which of these sub-categories to retain; the rest are mapped to their parent category.

In [15]:
sub_categories = []
top_level = []
for top in top_level_cats:
    top_level.append(top['name'])
    cat_list = []
    for sub_level in top['categories']:
        try:
            venue_count = venues_per_cat[sub_level['name']]
        except KeyError:
            venue_count = 0
        child_cats = nested_lookup('name', sub_level['categories'])
        for child in child_cats:
            try:
                venue_count += venues_per_cat[child]
            except KeyError:
                continue # If key not found in dataset, skip iteration
        # Display in table only if more than threshold number of venues found
        if venue_count > 50:
            cat_list.append(sub_level['name'])
    sub_categories.append(','.join(cat_list))

pd.options.display.max_colwidth = 500

pd.DataFrame({
    'section': top_level,
    'sub-categories': sub_categories,
})

Unnamed: 0,section,sub-categories
0,Arts & Entertainment,"General Entertainment,Movie Theater,Performing Arts Venue"
1,College & University,"College Academic Building,College Classroom,General College & University,Student Center"
2,Event,
3,Food,"Asian Restaurant,Bakery,Breakfast Spot,Burger Joint,Cafeteria,Café,Coffee Shop,Dessert Shop,Diner,Fast Food Restaurant,Food Court,Food Truck,Fried Chicken Joint,Indian Restaurant,Italian Restaurant,Juice Bar,Pizza Place,Restaurant,Snack Place,Tea Room,Vegetarian / Vegan Restaurant"
4,Nightlife Spot,"Bar,Lounge"
5,Outdoors & Recreation,"Athletics & Sports,Lake,Other Great Outdoors,Park,Playground,Pool"
6,Professional & Other Places,"Building,Convention Center,Event Space,Factory,Government Building,Medical Center,Office,Post Office,School,Spiritual Center"
7,Residence,"Assisted Living,Housing Development,Residential Building (Apartment / Condo)"
8,Shop & Service,"ATM,Auto Dealership,Automotive Shop,Bank,Bike Shop,Car Wash,Clothing Store,Convenience Store,Department Store,Electronics Store,Food & Drink Shop,Furniture / Home Store,Gas Station,Hardware Store,Jewelry Store,Market,Miscellaneous Shop,Mobile Phone Shop,Motorcycle Shop,Optical Shop,Pharmacy,Salon / Barbershop,Shopping Mall,Smoke Shop,Spa"
9,Travel & Transport,
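
The roll-up above counts a sub-category together with all of its descendants. As an illustration only, here is a standard-library sketch of the same idea on a toy tree. `subtree_names` and `subtree_count` are hypothetical helpers, and the tree and counts are made up; the tree only mimics the nested `name`/`categories` layout used above.

```python
def subtree_names(node):
    """All category names in a node's subtree, including the node itself."""
    names = [node['name']]
    for child in node.get('categories', []):
        names.extend(subtree_names(child))
    return names

def subtree_count(node, counts):
    """Total venues for a node and its descendants; missing names count as 0."""
    return sum(counts.get(name, 0) for name in subtree_names(node))

# Toy tree and counts with made-up values
toy_tree = {'name': 'Food', 'categories': [
    {'name': 'Asian Restaurant', 'categories': [
        {'name': 'Chinese Restaurant', 'categories': []},
        {'name': 'Thai Restaurant', 'categories': []},
    ]},
    {'name': 'Bakery', 'categories': []},
]}
toy_counts = {'Asian Restaurant': 10, 'Chinese Restaurant': 45, 'Bakery': 30}

for sub in toy_tree['categories']:
    print(sub['name'], subtree_count(sub, toy_counts))
# prints "Asian Restaurant 55" then "Bakery 30"
```

In this toy case a threshold of 50 would keep Asian Restaurant only because its children are counted with it.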


In [16]:
# List sub-categories to retain, grouping sub-categories where possible
# Dictionary where each value (list of labels) will be mapped to the corresponding key
sub_category_list = {
    'Movie Theater': ['Movie Theater'],
    # Food
    'Asian Restaurant': ['Asian Restaurant'],
    'Indian Restaurant': ['Indian Restaurant'],
    'Vegetarian / Vegan Restaurant': ['Vegetarian / Vegan Restaurant'],
    'Bakery & Dessert': ['Bakery', 'Dessert Shop'],
    'Cafeteria': ['Cafeteria', 'Food Court'],
    'Fast Food': ['Fast Food Restaurant', 'Fried Chicken Joint', 'Burger Joint'],
    'Quick Bites': ['Food Truck', 'Snack Place'], # 'Bakery' is already assigned to 'Bakery & Dessert'
    'Coffee & Tea': ['Café', 'Coffee Shop', 'Tea Room', 'Juice Bar'],
    # Recreation
    'Athletics & Sports': ['Athletics & Sports', 'Pool'],
    # Professional
    'Medical Center': ['Medical Center'],
    'School': ['School'],
    'Spiritual Center': ['Spiritual Center'],
    'Factory': ['Factory'],
    'Office': ['Office'],
    # Shops
    'ATM': ['ATM'],
    'Automotive Shop': ['Auto Dealership', 'Automotive Shop', 'Bike Shop', 'Car Wash', 'Motorcycle Shop'],
    'Groceries' : ['Convenience Store', 'Department Store', 'Smoke Shop', 'Food & Drink Shop', 'Market'],
    'Bank': ['Bank'],
    'Electronics Store': ['Electronics Store', 'Mobile Phone Shop'],
    'Medical Store': ['Pharmacy', 'Optical Shop'],
    'Salon': ['Salon / Barbershop', 'Spa'],
    'Shopping Mall': ['Shopping Mall'],
    'Clothing & Jewelry': ['Clothing Store', 'Jewelry Store'],
    'Gas Station': ['Gas Station'],
}

print('With this structure, we should end up with {} unique categories'.format(
    len(sub_category_list) + len(category_list)
))

With this structure, we should end up with 34 unique categories


In [17]:
custom_category_map = {}

# Adding sub-category sets
for key, val in sub_category_list.items():
    sub_cats = []
    for top in top_level_cats:
        for sub in top['categories']:
            if sub['name'] in val:
                sub_cats.extend(nested_lookup('name', sub))
    custom_category_map[key] = sub_cats

# Map all other to top level cats
for top in top_level_cats:
    key = top['name']
    if key == 'Food':
        key = 'Restaurant' # Rename Food to Restaurant
    children = set(nested_lookup('name', top))
    # Remove all children categories that have already been assigned
    for v in custom_category_map.values():
        children = children - set(v) # Convert to set to find difference
    custom_category_map[key] = list(children) # Convert to list for saving to JSON

# Save to file
with open('../data/fourquare_categories_custom_map.json', 'w') as out_file:
    json.dump(custom_category_map, out_file)

We will now map the venues in our table to this defined set of categories.

In [18]:
new_categories = []
for old_category in df_venues['category']:
    for key, val in custom_category_map.items():
        if old_category in val:
            new_categories.append(key)
            break

# Assign new categories to dataframe and display
df_venues['category'] = new_categories
df_venues.head()

Unnamed: 0,venue_id,name,lat,lon,address,category,hex_id
0,507a3692e4b02a5a82ef0742,Yelahanka Vegetable Market,13.097353,77.59971,"Yelahanka,Bangalore,Karnātaka,India",Groceries,8860169669fffff
1,511487e4e4b0da5e8b41ffac,HDFC Bank,13.099437,77.597802,"MVM Cplx, Gr Flr, Santhe Circle, NH 7, BB Rd (Yelahanka),Bangalore 560064,Karnātaka,India",Bank,8860169669fffff
2,508bedd3e4b0702005e81b24,Sumathi Jewellers,13.096672,77.598305,"Yelahanka,India",Clothing & Jewelry,8860169669fffff
3,51dd021c498e7cb7a88516df,Pizza Corner,13.099742,77.597595,India,Restaurant,8860169669fffff
4,5bfaf31a1c0b34002c8639ce,Urban's Restaurant,13.099497,77.59963,"Bangalore 560064,Karnātaka,India",Restaurant,8860169669fffff
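
The mapping loop scans every list in the map for each venue. A possible alternative, sketched here under the assumption that the first matching key should win (as it does with the `break` in the loop), is to invert the map once into a flat lookup. `invert_category_map` is a hypothetical helper and the toy map is made up.

```python
def invert_category_map(category_map):
    """Flatten {new_name: [old_names]} into {old_name: new_name}."""
    lookup = {}
    for new_name, old_names in category_map.items():
        for old in old_names:
            # setdefault keeps the first assignment, so earlier keys win
            lookup.setdefault(old, new_name)
    return lookup

# Toy map in which 'Bakery' appears under two keys
toy_map = {
    'Bakery & Dessert': ['Bakery', 'Dessert Shop'],
    'Quick Bites': ['Bakery', 'Food Truck'],
    'Restaurant': ['Pizza Place', 'Restaurant'],
}
lookup = invert_category_map(toy_map)

old = ['Bakery', 'Pizza Place', 'Food Truck', 'Unknown']
new = [lookup.get(c) for c in old]
print(new)
# prints ['Bakery & Dessert', 'Restaurant', 'Quick Bites', None]
```

Unmatched categories come back as `None` instead of being skipped, which keeps the result the same length as the input. A flat dictionary like this can also be passed to pandas `Series.map`.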


In [19]:
venues_per_cat = df_venues.groupby('category')['venue_id'].count()\
    .sort_values(ascending=False)
venues_per_cat[0:5]

category
Indian Restaurant    1870
Residence            1643
Office               1451
Shop & Service       1390
Restaurant           1142
Name: venue_id, dtype: int64

In [20]:
# Plot histogram
fig = px.histogram(
    x = venues_per_cat,
    nbins  = 10,
    #// histnorm = 'percent',
    #// cumulative = True,
    template = 'plotly',
)

fig.update_layout(
    title = "Venues per category",
    title_x = 0.5,
    xaxis_title = 'Number of unique venues',
    yaxis_title = 'Number of categories',
    bargap = 0.01,
)

## Plotting locations

Let's plot a choropleth map showing the density of venues across Bangalore from our collected data.

In [21]:
map_centre = (12.9792,77.5916)

m = folium.Map(location = map_centre, zoom_start = 11)

folium.Choropleth(
    geo_data = '../data/BBMP_hex.geojson',
    data = venues_per_hex,
    key_on = 'feature.properties.id',
    fill_color = 'YlOrRd',
    fill_opacity = 0.6,
    nan_fill_opacity = 0.3,
    line_opacity = 0.9,
    legend_name = 'Number of venues'
).add_to(m)

m # Display map
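
The choropleth joins each GeoJSON feature to the data through `key_on`, so it can help to check which keys fail to match before plotting. The sketch below does this with plain sets on a toy GeoJSON dictionary. `unmatched_keys` is a hypothetical helper, and the toy features only assume the `properties.id` layout implied by `key_on` above.

```python
def unmatched_keys(geojson, data_keys, prop='id'):
    """Return (feature ids with no data, data keys with no feature)."""
    feature_ids = {f['properties'][prop] for f in geojson['features']}
    data_keys = set(data_keys)
    return feature_ids - data_keys, data_keys - feature_ids

# Toy GeoJSON with made-up ids; geometry is omitted for brevity
toy_geojson = {'type': 'FeatureCollection', 'features': [
    {'type': 'Feature', 'properties': {'id': 'hex_a'}, 'geometry': None},
    {'type': 'Feature', 'properties': {'id': 'hex_b'}, 'geometry': None},
    {'type': 'Feature', 'properties': {'id': 'hex_c'}, 'geometry': None},
]}
toy_counts = {'hex_a': 12, 'hex_b': 3, 'hex_z': 7}

no_data, no_feature = unmatched_keys(toy_geojson, toy_counts.keys())
print(sorted(no_data))
print(sorted(no_feature))
```

With the real inputs, the GeoJSON file would be loaded with `json.load` and `venues_per_hex.index` passed as the data keys.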

In [22]:
df_venues.to_feather('../data/bangalore_foursquare_data.feather') # Save to file