In [1]:
import requests
import pandas as pd
import h3
import json


- In this notebook, we first collect point-of-interest (POI) data.
- In the second step, we count the points of interest in each category for selected areas and for each trip.


# Points of interest - data collection

We use [OpenStreetMap (OSM)](https://en.wikipedia.org/wiki/OpenStreetMap) to collect data about different points of interest. OpenStreetMap is a collaborative, free-of-charge project that provides geographical data for the whole world. Its data is generally considered to be of good quality and is often compared to that of Google Maps.

OpenStreetMap divides the area considered in our research into three cities: Los Angeles, Santa Monica and Burbank. Hence, in the following, we collect data for all three of them using the Overpass API, which was created to retrieve custom selections of OSM data for chosen parts of the world. The queries that we wrote for the Overpass API can be found under /utils/overpass_meta.py. They are based on the various map features documented in the OSM [wiki](https://wiki.openstreetmap.org/wiki/Map_features#Public_Transport). Queries can also be tested with a tool called [overpass turbo](https://overpass-turbo.eu/), which directly visualises the results.
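
To give an idea of what an Overpass query can look like, here is a minimal sketch of a query builder in Overpass QL. The `build_query` helper, the tag values and the bounding box are illustrative assumptions; they are not the contents of /utils/overpass_meta.py.

```python
# Hypothetical helper (not part of this project): builds an Overpass QL query
# for nodes whose `amenity` tag matches one of several values inside a
# bounding box given as (south, west, north, east).
def build_query(amenity_values, bbox, timeout=60):
    south, west, north, east = bbox
    pattern = "|".join(amenity_values)
    return (
        f"[out:json][timeout:{timeout}];"
        f'node["amenity"~"^({pattern})$"]({south},{west},{north},{east});'
        "out body;"
    )


# Assumed bounding box, roughly covering the Los Angeles area.
example_query = build_query(
    ["restaurant", "fast_food", "cafe"], (33.9, -118.5, 34.2, -118.1)
)
print(example_query)
```

A string like this can be pasted into overpass turbo to inspect the matching nodes before using it in code.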

In [2]:
# following code is required to import the get_meta_data function from /utils/overpass_meta.py
import sys, os

sys.path.append(os.path.abspath(os.path.join("..", "utils")))
from overpass_meta import get_meta_data

overpass_meta = get_meta_data()


In [3]:
# following functions can be used to check for existence, read and save poi json files
# as well as fetch poi data from Overpass API
def poi_file_exists(category):
    return os.path.isfile(overpass_meta[category]["filepath"])


def read_poi_file(category):
    with open(overpass_meta[category]["filepath"]) as f:
        data = json.load(f)
    return data


def save_poi_file(category, data):
    with open(overpass_meta[category]["filepath"], "w") as f:
        json.dump(data, f)


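def fetch_poi_data(category):
    response = requests.get(
        "http://overpass-api.de/api/interpreter",
        params={"data": overpass_meta[category]["query"]},
        timeout=180,
    )
    # fail loudly on HTTP errors instead of caching an error response
    response.raise_for_status()
    data = response.json()
    save_poi_file(category, data)
    return data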


def get_poi_data(category):
    if poi_file_exists(category):
        return read_poi_file(category)
    return fetch_poi_data(category)
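
The functions above follow a cache-first pattern: the network is only contacted when no local file exists. The sketch below isolates that pattern with a stand-in fetcher and a temporary directory. `load_or_fetch` and `fake_fetch` are hypothetical names used only for this illustration.

```python
import json
import os
import tempfile


def load_or_fetch(filepath, fetch):
    """Return cached JSON if the file exists; otherwise call fetch() and cache it."""
    if os.path.isfile(filepath):
        with open(filepath) as f:
            return json.load(f)
    data = fetch()
    with open(filepath, "w") as f:
        json.dump(data, f)
    return data


calls = []


def fake_fetch():
    # Stand-in for the Overpass request; records every invocation.
    calls.append(1)
    return {"elements": []}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.json")
    first = load_or_fetch(path, fake_fetch)   # file missing: fetch and save
    second = load_or_fetch(path, fake_fetch)  # file present: read from disk

print(f"fetch was called {len(calls)} time(s)")
```

One consequence of this design is that a stale or faulty cached file is reused silently, so the file has to be deleted to force a fresh download.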


Next, we extract POI data for each category we identified:
- sustenance
- public transport
- education
- arts and culture
- sports

There are many more possible categories, and each category can be defined in many ways depending on which tagged elements are included. The element types and the possible values of their tags are documented in the OSM [wiki](https://wiki.openstreetmap.org/wiki/Map_features#Public_Transport). However, we decided to focus on these five example categories to keep our findings clear and to avoid the curse of dimensionality.
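
Since categories are defined through tags, not every element carries the same keys. The following plain-Python sketch shows one way to flatten elements into rows while tolerating a missing `amenity` tag. The two elements are hand-written illustrations in the shape of an Overpass JSON response, and `to_row` is a hypothetical helper.

```python
# Illustrative elements only; real responses contain many more fields.
elements = [
    {"type": "node", "id": 1, "lat": 34.0762, "lon": -118.2160,
     "tags": {"amenity": "fast_food", "cuisine": "burger"}},
    {"type": "node", "id": 2, "lat": 34.1405, "lon": -118.3612,
     "tags": {"name": "Universal City", "network": "LACMTA"}},
]


def to_row(element, category):
    # dict.get avoids a KeyError when an element has no 'amenity' tag
    tags = element.get("tags", {})
    return {
        "lat": element["lat"],
        "lon": element["lon"],
        "category": category,
        "amenity": tags.get("amenity"),
    }


rows = [
    to_row(elements[0], "sustenance"),
    to_row(elements[1], "public_transport"),
]
for row in rows:
    print(row)
```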

In [4]:
sustenance_data = get_poi_data("sustenance")
sustenance_df = pd.DataFrame(sustenance_data["elements"])
sustenance_df["category"] = "sustenance"
sustenance_df["amenity"] = sustenance_df["tags"].apply(lambda tags: tags["amenity"])
sustenance_df.head(2)


Unnamed: 0,type,id,lat,lon,tags,category,amenity
0,node,72448982,34.076217,-118.21602,"{'amenity': 'fast_food', 'cuisine': 'japanese'...",sustenance,fast_food
1,node,72448995,34.076693,-118.216013,"{'amenity': 'fast_food', 'cuisine': 'burger', ...",sustenance,fast_food


In [5]:
public_transport_data = get_poi_data("public_transport")
public_transport_df = pd.DataFrame(public_transport_data["elements"])
public_transport_df["category"] = "public_transport"
public_transport_df.head(2)


Unnamed: 0,type,id,lat,lon,tags,category
0,node,16298122,34.140526,-118.361246,"{'name': 'Universal City', 'network': 'LACMTA'...",public_transport
1,node,18660357,34.090696,-118.291703,"{'name': 'Vermont/Santa Monica', 'network': 'L...",public_transport


In [7]:
education_data = get_poi_data("education")
education_df = pd.DataFrame(education_data["elements"])
education_df["category"] = "education"
education_df["amenity"] = education_df["tags"].apply(lambda tags: tags["amenity"])
education_df.head(2)


Unnamed: 0,type,id,lat,lon,tags,category,amenity
0,node,243805625,33.959419,-118.417117,"{'amenity': 'library', 'created_by': 'Potlatch...",education,library
1,node,344327189,34.258253,-118.301348,"{'addr:state': 'CA', 'amenity': 'library', 'el...",education,library


In [8]:
arts_and_culture_data = get_poi_data("arts_and_culture")
arts_and_culture_df = pd.DataFrame(arts_and_culture_data["elements"])
arts_and_culture_df["category"] = "arts_and_culture"
arts_and_culture_df["amenity"] = arts_and_culture_df["tags"].apply(
    lambda tags: tags["amenity"]
)
arts_and_culture_df.head(2)


Unnamed: 0,type,id,lat,lon,tags,category,amenity
0,node,368167434,34.084167,-118.482222,"{'addr:state': 'CA', 'amenity': 'arts_centre',...",arts_and_culture,arts_centre
1,node,368167436,34.129722,-118.209722,"{'addr:state': 'CA', 'amenity': 'arts_centre',...",arts_and_culture,arts_centre


In [9]:
sports_data = get_poi_data("sports")
sports_df = pd.DataFrame(sports_data["elements"])
sports_df["category"] = "sports"
sports_df.head(2)


Unnamed: 0,type,id,lat,lon,tags,category
0,node,358826475,34.047789,-118.334798,"{'ele': '46', 'gnis:county_id': '037', 'gnis:c...",sports
1,node,358826622,34.166808,-118.485205,"{'ele': '213', 'gnis:county_id': '037', 'gnis:...",sports


In [10]:
poi_df = pd.concat(
    [sustenance_df, public_transport_df, education_df, arts_and_culture_df, sports_df],
    ignore_index=True,  # avoid duplicate index labels across categories
)
poi_df = poi_df.drop(columns={"type", "id", "tags"})
poi_df.head(2)


Unnamed: 0,lat,lon,category,amenity
0,34.076217,-118.21602,sustenance,fast_food
1,34.076693,-118.216013,sustenance,fast_food


# Points of interest - data processing
In this section, we use a geographical indexing system called [H3](https://h3geo.org/). It was developed by Uber and is now an open-source project. Unlike most other geographical indexing systems, it partitions the map into hexagons rather than squares. The advantage of hexagons is that the distance from the centre of a hexagon to the centre of each of its six neighbours is (approximately) the same, whereas a square has closer edge neighbours and more distant corner neighbours. This is of great importance in our case.
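
The neighbour-distance property can be checked with a small plain-Python sketch on idealised flat grids. It does not use H3 itself, whose cells on the globe are only approximately regular. The axial coordinate layout is an assumption chosen for the illustration.

```python
import math

# Square grid with unit spacing: 8 neighbours (edge- and corner-adjacent).
square_offsets = [
    (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
]
square_distances = sorted(
    {round(math.hypot(dx, dy), 6) for dx, dy in square_offsets}
)

# Hexagonal grid in axial coordinates (q, r): 6 neighbours.
hex_offsets = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def axial_to_xy(q, r):
    # Centre of a pointy-top hexagon with circumradius 1.
    return math.sqrt(3) * (q + r / 2), 1.5 * r


hex_distances = sorted(
    {round(math.hypot(*axial_to_xy(q, r)), 6) for q, r in hex_offsets}
)

print("square grid neighbour distances:", square_distances)
print("hexagonal grid neighbour distances:", hex_distances)
```

The square grid yields two distinct centre-to-centre distances, the hexagonal grid only one.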

We want to determine the likely destination of a trip or, in the case of a return trip, the original destination. Ideally, we would count all points of interest within a given radius. However, this is computationally intensive. Therefore, we divide the map into areas and calculate POI statistics for each of them. Then we only need to identify the areas in which a trip started and ended. For this purpose, hexagons approximate a circle better than squares do.
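
One way to quantify how well each shape approximates a circle is the share of a circle's area covered by the regular polygon inscribed in it. The sketch below uses idealised flat geometry, and `inscribed_polygon_fill` is a name introduced only for this illustration.

```python
import math


def inscribed_polygon_fill(n_sides):
    """Fraction of a circle's area covered by a regular n-gon inscribed in it."""
    # Regular n-gon with circumradius 1 has area (n / 2) * sin(2 * pi / n);
    # the circle with radius 1 has area pi.
    return (n_sides / 2) * math.sin(2 * math.pi / n_sides) / math.pi


square_fill = inscribed_polygon_fill(4)
hexagon_fill = inscribed_polygon_fill(6)
print(f"square:  {square_fill:.3f}")
print(f"hexagon: {hexagon_fill:.3f}")
```

The hexagon covers roughly 83% of its circumscribed circle, the square roughly 64%.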

H3 is a hierarchical system, which means each hexagon has a larger parent and smaller children. The hexagon size is determined by the resolution, which ranges from 0 to 15; a higher resolution means a smaller hexagon area. We decided to use resolution 9 in our research, which corresponds to an average hexagon area of 0.1053325 km^2 and an average edge length of 0.174375668 km. Information about other resolutions can be found [here](https://h3geo.org/docs/core-library/restable/).
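
As a rough guide to how the resolution affects cell size: each H3 cell has seven children at the next finer resolution, so the average area changes by a factor of about 7 per step. The sketch below extrapolates from the resolution 9 value quoted above. `approx_area_km2` is a hypothetical helper, and its results are approximations; the linked table remains the reference for exact values.

```python
RES9_AREA_KM2 = 0.1053325  # average cell area at resolution 9, as quoted above


def approx_area_km2(resolution, ref_resolution=9, ref_area=RES9_AREA_KM2):
    """Rough average cell area, assuming a factor of 7 per resolution step."""
    return ref_area * 7 ** (ref_resolution - resolution)


for res in (8, 9, 10):
    print(f"resolution {res}: about {approx_area_km2(res):.4f} km^2")
```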

In [11]:
resolution = 9

# return the H3 hex id for a given latitude and longitude
# (h3-py v3 API; in h3-py v4 the equivalent call is h3.latlng_to_cell)
def convert_to_hex(latitude, longitude):
    return h3.geo_to_h3(lat=latitude, lng=longitude, resolution=resolution)


In [12]:
# compute the hexagon id for each point of interest
poi_df["hex"] = poi_df.apply(lambda poi: convert_to_hex(poi["lat"], poi["lon"]), axis=1)
poi_df.head(2)


Unnamed: 0,lat,lon,category,amenity,hex
0,34.076217,-118.21602,sustenance,fast_food,8929a1d73c7ffff
1,34.076693,-118.216013,sustenance,fast_food,8929a1d731bffff


In [13]:
# create a dataframe with the number of points of interest for each
# hexagon and category that occurs in the poi dataframe
all_hexagons_with_poi = poi_df.groupby(["hex", "category"]).size().to_frame()
all_hexagons_with_poi = all_hexagons_with_poi.reset_index()
all_hexagons_with_poi = all_hexagons_with_poi.rename(columns={0: "number of poi"})
all_hexagons_with_poi.head(2)


Unnamed: 0,hex,category,number of poi
0,891f9c344dbffff,sustenance,1
1,892664501b7ffff,sustenance,1


In [14]:
# import trips data and compute start and end hexagons for each trip
trips_df = pd.read_pickle("../00_data/trips.pkl")

trips_df["start_hex"] = trips_df.apply(
    lambda trip: convert_to_hex(trip["start_latitude"], trip["start_longitude"]), axis=1
)

trips_df["end_hex"] = trips_df.apply(
    lambda trip: convert_to_hex(trip["end_latitude"], trip["end_longitude"]), axis=1
)


In [15]:
number_of_unique_hexagons = (
    pd.concat([trips_df["start_hex"], trips_df["end_hex"]]).unique().size
)
print(
    f"We have identified {number_of_unique_hexagons} hexagons in total with "
    f"resolution {resolution} where at least one trip started or ended."
)


We have identified 130 hexagons in total with resolution 9 where at least one trip started or ended.


In [16]:
# create a dataframe with all hexagons where at least one trip started or ended
hexagons_df = pd.DataFrame()
hexagons_df["hex"] = pd.concat([trips_df["start_hex"], trips_df["end_hex"]]).unique()
hexagons_df.head(2)


Unnamed: 0,hex
0,8929a1d7577ffff
1,8929a1d7543ffff


In [17]:
# create a column 'hex_and_neighbors' which contains a list of hexagons:
# the hexagon from column 'hex' and its 6 neighbours (k_ring with k=1)
hexagons_df["hex_and_neighbors"] = hexagons_df.apply(
    lambda row: list(h3.k_ring(row["hex"], 1)), axis=1
)
hexagons_df.head(2)


Unnamed: 0,hex,hex_and_neighbors
0,8929a1d7577ffff,"[8929a1d7567ffff, 8929a1d7577ffff, 8929a1d7573..."
1,8929a1d7543ffff,"[8929a1d7553ffff, 8929a1d7557ffff, 8929a1d754b..."


In [18]:
# this is an example of what one entry in the 'hex_and_neighbors' column looks like
hexagons_df["hex_and_neighbors"][0]


['8929a1d7567ffff',
 '8929a1d7577ffff',
 '8929a1d7573ffff',
 '8929a1d7563ffff',
 '8929a1d750fffff',
 '8929a1d752bffff',
 '8929a1d753bffff']
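
Counting POIs over a hexagon and its six neighbours stands in for the radius search described earlier. A back-of-the-envelope sketch of the equivalent radius, using the average resolution 9 area quoted earlier and ignoring that real cells vary slightly in size:

```python
import math

RES9_AREA_KM2 = 0.1053325    # average cell area at resolution 9
CELLS_IN_NEIGHBOURHOOD = 7   # the hexagon itself plus its 6 neighbours

neighbourhood_area = CELLS_IN_NEIGHBOURHOOD * RES9_AREA_KM2
# Radius of a circle with the same area as the 7-cell neighbourhood.
equivalent_radius_km = math.sqrt(neighbourhood_area / math.pi)

print(f"neighbourhood area: {neighbourhood_area:.3f} km^2")
print(f"equivalent radius:  {equivalent_radius_km:.2f} km")
```

Under these assumptions, the neighbourhood corresponds to a circle with a radius of just under half a kilometre.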

In [19]:
# this function will return the sum of points of interest in a given category for a given set of hexagons
def calculate_poi(hex_and_neighbors, category):
    return all_hexagons_with_poi[
        (
            (all_hexagons_with_poi["hex"].isin(hex_and_neighbors))
            & (all_hexagons_with_poi["category"] == category)
        )
    ]["number of poi"].sum()
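
To make explicit what `calculate_poi` computes, here is a plain-Python sketch of the same aggregation using a `Counter` keyed by (hexagon, category). The hexagon ids are taken from the examples in this notebook, but the POI records themselves are made up for the illustration, and `count_poi` is a hypothetical name.

```python
from collections import Counter

# Made-up POI records as (hex id, category) pairs.
poi_records = [
    ("8929a1d7577ffff", "sustenance"),
    ("8929a1d7577ffff", "sustenance"),
    ("8929a1d7567ffff", "sustenance"),
    ("8929a1d7567ffff", "education"),
    ("8929a1d7543ffff", "sustenance"),
]
counts = Counter(poi_records)


def count_poi(hexagons, category):
    # Missing (hex, category) keys count as 0 in a Counter.
    return sum(counts[(h, category)] for h in hexagons)


neighbourhood = ["8929a1d7567ffff", "8929a1d7577ffff", "8929a1d7573ffff"]
for category in ("sustenance", "education", "sports"):
    print(category, count_poi(neighbourhood, category))
```

A dictionary lookup per hexagon avoids filtering the whole dataframe for every call, which may matter when the number of hexagons grows.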


In [20]:
# compute the number of poi in each category for each hexagon and its neighbours
categories = [
    'sustenance',
    'public_transport',
    'education',
    'arts_and_culture',
    'sports'
]

for category in categories:
    hexagons_df[f"{category}_poi"] = hexagons_df["hex_and_neighbors"].apply(
        lambda row: calculate_poi(row, category)
    )

hexagons_df.head(2)


Unnamed: 0,hex,hex_and_neighbors,sustenance_poi,public_transport_poi,education_poi,arts_and_culture_poi,sports_poi
0,8929a1d7577ffff,"[8929a1d7567ffff, 8929a1d7577ffff, 8929a1d7573...",80,28,2,4,0
1,8929a1d7543ffff,"[8929a1d7553ffff, 8929a1d7557ffff, 8929a1d754b...",40,1,2,0,1


We will now add the newly acquired data to the trips dataset. Performing cluster analysis with these new features might yield interesting results.

In [21]:
trips_df = pd.merge(trips_df, hexagons_df, left_on="start_hex", right_on="hex")
trips_df = trips_df.drop(columns={"hex", "hex_and_neighbors"})

# add '_start' suffix to poi columns
trips_df = trips_df.rename(
    columns={
        "sustenance_poi": "sustenance_poi_start",
        "public_transport_poi": "public_transport_poi_start",
        "education_poi": "education_poi_start",
        "arts_and_culture_poi": "arts_and_culture_poi_start",
        "sports_poi": "sports_poi_start",
    }
)
trips_df.head(2)


Unnamed: 0,start_time,end_time,start_station_id,end_station_id,bike_id,user_type,start_station_name,end_station_name,duration,start_latitude,...,end_longitude,distance,speed,start_hex,end_hex,sustenance_poi_start,public_transport_poi_start,education_poi_start,arts_and_culture_poi_start,sports_poi_start
0,2019-01-01 00:18:00,2019-01-01 00:50:00,3030,3075,5992,Walk-up,Main & 1st,Broadway & 9th,0 days 00:32:00,34.05194,...,-118.25619,1.498844,2.810332,8929a1d7577ffff,8929a1d75a3ffff,80,28,2,4,0
1,2019-01-01 00:20:00,2019-01-01 00:50:00,3030,3075,5860,Walk-up,Main & 1st,Broadway & 9th,0 days 00:30:00,34.05194,...,-118.25619,1.498844,2.997688,8929a1d7577ffff,8929a1d75a3ffff,80,28,2,4,0


In [22]:
trips_df = pd.merge(trips_df, hexagons_df, left_on="end_hex", right_on="hex")
trips_df = trips_df.drop(columns={"hex", "hex_and_neighbors"})

# add '_end' suffix to poi columns
trips_df = trips_df.rename(
    columns={
        "sustenance_poi": "sustenance_poi_end",
        "public_transport_poi": "public_transport_poi_end",
        "education_poi": "education_poi_end",
        "arts_and_culture_poi": "arts_and_culture_poi_end",
        "sports_poi": "sports_poi_end",
    }
)
trips_df.head(2)


Unnamed: 0,start_time,end_time,start_station_id,end_station_id,bike_id,user_type,start_station_name,end_station_name,duration,start_latitude,...,sustenance_poi_start,public_transport_poi_start,education_poi_start,arts_and_culture_poi_start,sports_poi_start,sustenance_poi_end,public_transport_poi_end,education_poi_end,arts_and_culture_poi_end,sports_poi_end
0,2019-01-01 00:18:00,2019-01-01 00:50:00,3030,3075,5992,Walk-up,Main & 1st,Broadway & 9th,0 days 00:32:00,34.05194,...,80,28,2,4,0,67,51,0,4,4
1,2019-01-01 00:20:00,2019-01-01 00:50:00,3030,3075,5860,Walk-up,Main & 1st,Broadway & 9th,0 days 00:30:00,34.05194,...,80,28,2,4,0,67,51,0,4,4


In [23]:
poi_df.to_pickle("../00_data/poi.pkl")
hexagons_df.to_pickle("../00_data/hexagons.pkl")
# note: this overwrites the trips.pkl file that was read above
trips_df.to_pickle("../00_data/trips.pkl")