<a href="https://colab.research.google.com/github/predicthq/phq-data-science-docs/blob/master/venues/venues-example.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Venues-Example DATA SCIENCE GUIDES

# Venues Example

**A how-to guide for extracting venue information from PredictHQ's Attended Events data (conferences, expos, concerts, festivals, performing-arts, sports, community).**

    
The aim of this notebook is to showcase how the PHQ Events API can be used to extract venue information for a location and time period of your choice, and to produce basic map visuals.

- [Setup](#setup)
- [Access Token](#access_token)
- [SDK Parameters](#setting_params) 
- [Calling the PredictHQ API and Fetching Events](#calling_api)
- [Exploring the Result DataFrame and extract venue data](#exploring_df)
- [Displaying all venues on a map](#display_venue_loc)
- [What type of events occurs most often at venues?](#venue_most_freq_type)
- [What's the estimated capacity of venues?](#venue_est_capacity)
- [Appendix - Finding place_id](#appendix)

<a id='setup'></a>
# Setup


- If you're using Google Colab, uncomment and run the following code block.

In [1]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/venues
# !pip install timezonefinder==5.2.0 requests==2.26.0 predicthq==2.0.2 plotly==5.2.1 pandas==1.3.5


- Alternatively, if you're running this notebook on a local machine, set up a Python environment using the [requirements.txt](https://github.com/predicthq/phq-data-science-docs/blob/master/venues-example/requirements.txt) file, which is shared alongside the notebook.
These requirements can be installed by running the command `pip install -r requirements.txt`.


In [2]:
import pandas as pd
import plotly.express as px
import requests
from predicthq import Client
from timezonefinder import TimezoneFinder

# To display more columns and with a larger width in the DataFrame
pd.set_option("display.max_columns", 50)
pd.options.display.max_colwidth = 100


<a id='access_token'></a>
# Access Token
An Access Token is required to query the API. You can check out our [API Quickstart](../../guides/quickstart/) page if you want to create an account or an access token.


In [3]:
ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'
phq = Client(access_token=ACCESS_TOKEN)

<a id='setting_params'></a>
# SDK Parameters
To search for Attended Events, start by building a parameter dictionary and adding the required filters.

In [4]:
parameters = dict()

#### Location
Specifying the location ensures you will see the relevant events for the location.

The notebook provides a default location in Chicago.

This can be adjusted to suit a location that is of interest to you.

We can do this in two ways:  

  1) Using the `within` parameter, which contains the `latitude` and `longitude` of the location of interest, together with a `radius` and a `unit` for the radius.
  
  2) Using a list of `place_id` values.
   
Note: This notebook uses the city of ```Chicago``` as an example location. You can modify this to suit your location(s) of interest.

In [5]:
# Using latitude, longitude and a radius
latitude, longitude = (41.881832, -87.623177) # Latitude, longitude for Chicago
radius = 5
radius_unit = "km"

within = f"{radius}{radius_unit}@{latitude},{longitude}"
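
The string format used above can be wrapped in a small helper that catches obviously bad input before it reaches the API. This is a minimal sketch: `build_within` is a hypothetical helper, not part of the PredictHQ SDK, and the set of accepted units is an assumption that should be checked against the Events API documentation.

```python
def build_within(radius, unit, latitude, longitude):
    """Format a radius and centre point as '<radius><unit>@<lat>,<lon>'."""
    # Assumed unit list; confirm against the Events API documentation.
    allowed_units = {"m", "km", "ft", "mi"}
    if unit not in allowed_units:
        raise ValueError(f"unsupported unit: {unit!r}")
    if radius <= 0:
        raise ValueError("radius must be positive")
    if not -90 <= latitude <= 90 or not -180 <= longitude <= 180:
        raise ValueError("latitude/longitude out of range")
    return f"{radius}{unit}@{latitude},{longitude}"


# Same Chicago example as in the cell above.
within_example = build_within(5, "km", 41.881832, -87.623177)
print(within_example)  # 5km@41.881832,-87.623177
```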

Alternatively, we could have used a list of ```place_id``` for our search (See our Appendix on Place IDs for detailed explanation).

In [6]:
place_ids = [4887398] # This is the place_id for Chicago. See the Appendix for how to find place_ids.

You can use either ```within``` or ```place_id``` as a filter, but you cannot use both.

In [7]:
parameters.update(within=within)  # Comment if you want to use place_ids
# parameters.update(place__scope=place_ids)  # Comment if you want to use lat and long

#### Date "YYYY-MM-DD"

To define the period of time for which Attended Events will be returned, set the greater than or equal (`active.gte`) and less than or equal (`active.lte`) parameters. This will select all Attended Events that are active within this period.

You could also use any of the following suffixes, depending on your time period of interest:

```gte - Greater than or equal.``` <br>
```gt - Greater than.```<br>
```lte - Less than or equal.```<br>
```lt - Less than.```<br>


The default example in this notebook is to search for the whole of 2021.

In [8]:
start_time = "2021-01-01"
end_time = "2021-12-31"
parameters.update(active__gte=start_time)
parameters.update(active__lte=end_time)
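
A malformed or reversed date range is easy to miss until the query returns nothing. The sketch below validates the two dates before building the filter. `active_range` is a hypothetical helper written for this guide, not an SDK function; it only produces the same `active__gte` / `active__lte` keys used in the cell above.

```python
from datetime import date


def active_range(start, end):
    """Return inclusive `active` filters for two ISO dates (YYYY-MM-DD)."""
    start_day = date.fromisoformat(start)  # raises ValueError on a malformed date
    end_day = date.fromisoformat(end)
    if end_day < start_day:
        raise ValueError("end date is before start date")
    return {"active__gte": start, "active__lte": end}


date_filters = active_range("2021-01-01", "2021-12-31")
print(date_filters)
```

The returned dictionary can be passed to `parameters.update(...)` in place of the two separate calls.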

#### Timezone 
Setting the timezone for the location of interest ensures that the date filters are applied in local time, so the appropriate events are returned. Timezone names follow the <a href="https://en.wikipedia.org/wiki/List_of_tz_database_time_zones">tz database</a>.

For our Chicago example, the timezone would be ```America/Chicago```. 
Use the `TimezoneFinder()` to find it for our location of interest.

In [9]:
timezone = TimezoneFinder().timezone_at(lat=latitude, lng=longitude)
print(timezone)

America/Chicago


In [10]:
parameters.update(active__tz=timezone)

#### Rank range

Similar to the date period, a rank range can be set to filter events. The `rank_type` can be set to `rank`, `local_rank` or `aviation_rank`. The `rank` reflects the [estimated impact of an event](https://docs.predicthq.com/start/ranks/). The `local_rank` reflects the [estimated impact of an event in its local area](https://docs.predicthq.com/start/ranks/) by considering population density, which makes it useful when comparing events from multiple locations. The `aviation_rank` reflects the number of passengers who travel to the event by air. As a rule of thumb, typical `rank_type` and `rank_threshold` settings correspond to the following approximate numbers of attendees or passengers:

rank_type |rank_threshold |Approximate number of attendees/passengers
:-----:|:-----:|:-----:
rank|$20$|$\sim30$
rank|$30$|$\sim100$
rank|$40$|$\sim300$
rank|$50$|$\sim1000$
rank|$60$|$\sim3000$
aviation_rank|$30$|$\sim20$
aviation_rank|$40$|$\sim40$
aviation_rank|$50$|$\sim100$
aviation_rank|$60$|$\sim200$

In [11]:
# Select events according to rank_type, rank threshold.
rank_type = "rank" # Set to be either "rank", "local_rank" or "aviation_rank".
rank_threshold = 40 
filter_parameter = "gte"
parameters.update({f"{rank_type}__{filter_parameter}": rank_threshold})
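
The rule-of-thumb table above can also be read in reverse: given a minimum audience size, which threshold should be used? The sketch below copies the table values into a dictionary and looks up the highest tabulated threshold that does not exclude events of the requested size. `APPROX_SIZE` and `threshold_for` are hypothetical names for illustration, and the numbers are only the approximations from the table, not an official mapping.

```python
# Approximate attendance / passenger counts copied from the table above.
APPROX_SIZE = {
    "rank": {20: 30, 30: 100, 40: 300, 50: 1000, 60: 3000},
    "aviation_rank": {30: 20, 40: 40, 50: 100, 60: 200},
}


def threshold_for(rank_type, min_size):
    """Return the highest tabulated threshold whose approximate size is <= min_size."""
    candidates = [t for t, size in APPROX_SIZE[rank_type].items() if size <= min_size]
    return max(candidates) if candidates else None


print(threshold_for("rank", 500))            # 40
print(threshold_for("aviation_rank", 100))   # 50
print(threshold_for("rank", 10))             # None (below the smallest table entry)
```

The table does not cover `local_rank`, so the lookup raises a `KeyError` for that rank type.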

#### Categories
Set `category` to the list of attendance-based categories.


In [12]:
categories = [
    "community",
    "concerts",
    "conferences",
    "expos",
    "festivals",
    "performing-arts",
    "sports",
]
parameters.update(category=categories)

#### Limits 
When pulling historical data for a large time period, many results are returned. To speed up the execution, set `limit` to the highest available setting (500). Each call to the API then returns up to 500 results, so fewer calls are needed to retrieve a large dataset.

In [13]:
parameters.update(limit=500)
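
The effect of the page size on the number of API calls is simple arithmetic, sketched below. The total of 12,000 events and the comparison page size of 10 are hypothetical values chosen only to illustrate the difference.

```python
import math


def pages_needed(total_events, limit):
    """Number of API calls required to page through `total_events` results."""
    return math.ceil(total_events / limit)


# 12,000 is a made-up result count, used only for illustration.
print(pages_needed(12_000, 500))  # 24
print(pages_needed(12_000, 10))   # 1200
```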

#### Checking the parameters
Finally, let's take a look at the parameters we have set for our search.

In [14]:
parameters

{'within': '5km@41.881832,-87.623177',
 'active__gte': '2021-01-01',
 'active__lte': '2021-12-31',
 'active__tz': 'America/Chicago',
 'rank__gte': 40,
 'category': ['community',
  'concerts',
  'conferences',
  'expos',
  'festivals',
  'performing-arts',
  'sports'],
 'limit': 500}

You can check out the full list of parameters available for querying Attended Events at our [Events Resource page](../../resources/events/).

<a id='calling_api'></a>
# Calling the PredictHQ API and Fetching Events

In this step, we use PHQ's Python SDK Client to query and fetch events using the parameters we defined above.

In [None]:
results = []

# Iterating through all the events that match our criteria and adding them to our results
for event in phq.events.search(parameters).iter_all():
    results.append(event.to_dict())

# Converting the results to a DataFrame
event_df = pd.DataFrame(results)

<a id='exploring_df'></a>
# Exploring the Result DataFrame and extract venue data
We take a look at the result data and extract venue information for our use case.

In [None]:
# Extract venue locations
event_df['lat'] = event_df.location.apply(lambda x: x[1])
event_df['lon'] = event_df.location.apply(lambda x: x[0])

In [None]:
# Extract entity information: id, name and type
def _filter_venues(entities):
    venues = [e for e in entities if e["type"] == "venue"]
    if not venues:
        return None
    return venues[0]

venue_field = lambda entity, fieldname: entity[fieldname] if entity is not None else None


event_df["venues"] = event_df.entities.apply(_filter_venues)

event_df["entity_name"] = event_df["venues"].apply(venue_field, args=("name",))
event_df["entity_id"] = event_df["venues"].apply(venue_field, args=("entity_id",))
event_df["entity_type"] = event_df["venues"].apply(venue_field, args=("type",))
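
To see what the extraction does without calling the API, the same logic can be applied to a single record in plain Python. The event below is entirely hypothetical (made-up names, ids and coordinates), and `first_venue` is an illustrative helper that mirrors `_filter_venues` above. Note that `location` is stored as `[longitude, latitude]`, which is why the cells above read the latitude from index 1.

```python
# Hypothetical event record; the field values are made up for illustration.
sample_event = {
    "title": "Example concert",
    "location": [-87.6298, 41.8781],  # [longitude, latitude]
    "entities": [
        {"entity_id": "abc123", "name": "Example Hall", "type": "venue"},
        {"entity_id": "def456", "name": "Example Series", "type": "event-group"},
    ],
}


def first_venue(entities):
    """Return the first entity of type 'venue', or None if there is none."""
    return next((e for e in entities if e.get("type") == "venue"), None)


venue = first_venue(sample_event["entities"])
lon, lat = sample_event["location"]
print(venue["name"], lat, lon)
print(first_venue([]))  # events without a venue entity yield None
```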

In [None]:
event_df.head()

<a id='display_venue_loc'></a>
# Displaying all venues on a map

In [None]:
# Specify the entity as venue and remove duplicate records
entity_loc = event_df[['lat','lon','entity_id','entity_name','entity_type']][event_df.entity_type=='venue'].drop_duplicates()
entity_loc

In [None]:
fig = px.scatter_mapbox(entity_loc, lat="lat", lon="lon",
                        hover_name="entity_name",  color_discrete_sequence=["fuchsia"],    
                        zoom=12)

fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(height=500, margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

<a id='venue_most_freq_type'></a>
# What type of events occurs most often at venues?

The map shows, for each venue, the category of event that occurred most frequently during the selected period. Colour indicates the category, and larger circles indicate venues with more events in that category.

In [None]:
agg_df = event_df[event_df.entity_type=='venue'].groupby(['lat','lon','entity_name','category']).agg({'entity_id':'count','phq_attendance':'max'}).reset_index()
venue_type = agg_df.loc[agg_df.reset_index().groupby(['entity_name'])['entity_id'].idxmax()]
venue_type.head()
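
The group-and-pick-the-largest step above can be cross-checked on a handful of rows in plain Python. This is only a sketch of the same idea using `collections.Counter`; the venue names and categories are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical (venue, category) pairs, one per event.
events = [
    ("Example Arena", "sports"),
    ("Example Arena", "sports"),
    ("Example Arena", "concerts"),
    ("Example Theatre", "performing-arts"),
]

# Count events per category for each venue.
counts = defaultdict(Counter)
for venue_name, category in events:
    counts[venue_name][category] += 1

# most_common(1) returns [(category, count)] for the most frequent category.
top_category = {name: c.most_common(1)[0] for name, c in counts.items()}
print(top_category)
```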

In [None]:
fig = px.scatter_mapbox(venue_type, lat="lat", lon="lon", color="category", size='entity_id',
                        hover_name="entity_name", hover_data=['phq_attendance'],      
                        zoom=12)

fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(height=500, margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

<a id='venue_est_capacity'></a>
# What's the estimated capacity of venues?

The capacity is estimated based on the maximum attendance in the specified time period.

In [None]:
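# For each venue, keep the row with the highest recorded attendance
has_attendance = agg_df.dropna(subset=['phq_attendance'])
estimated_venue_capacity = has_attendance.loc[has_attendance.groupby('entity_name')['phq_attendance'].idxmax()]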
estimated_venue_capacity
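
The idea behind this estimate is that a venue's capacity is at least as large as the biggest audience recorded there. A plain-Python sketch on hypothetical data (made-up venues and attendance figures):

```python
# Hypothetical (venue, phq_attendance) pairs, one per event.
attendance = [
    ("Example Arena", 18000),
    ("Example Arena", 21000),
    ("Example Theatre", 900),
    ("Example Theatre", 1200),
]

capacity = {}
for venue_name, attended in attendance:
    # Keep the largest attendance seen so far for each venue.
    capacity[venue_name] = max(attended, capacity.get(venue_name, 0))

print(capacity)  # {'Example Arena': 21000, 'Example Theatre': 1200}
```

Because the estimate depends on the events observed, a short time period or a high rank threshold may understate the true capacity.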

In [None]:
fig = px.scatter_mapbox(estimated_venue_capacity, lat="lat", lon="lon", color="category", size="phq_attendance",
                        hover_name="entity_name",
                        zoom=12)

fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(height=500, margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

<a id='appendix'></a>
## Appendix: Finding `place_id` 

Here is a guide on how to map between locations and `place_id` values. We present three options:

 1. Query `place_id` based on location name. Here the location name could be a city, a state, a country or a continent.
 2. Query `place_hierarchies` based on latitude and longitude.
 3. Query location name based on `place_id`.

The full list of parameters that you could use in your query is documented at our [Places API page](https://docs.predicthq.com/resources/places/).<br>PredictHQ uses the [geonames places](https://www.geonames.org/) convention. 

#### 1) Query `place_id` based on location name

By using PredictHQ's Places SDK, you can find the `place_id` for a specific location name. Call the SDK with `q` set to the location name and the API returns the matching places, most relevant first, so the top result gives the most relevant `place_id`. You can also narrow the results with parameters such as `country` and `type` to get the closest match. All the parameter fields can be found at [Places](https://docs.predicthq.com/resources/places/#fields).

In [None]:
# Example: retrieve place_id
results = phq.places.search(q='Chicago', country='US', type='locality')

for place in results:
    print(place.id, place.name, place.type, place.location)

`phq.places.search` retrieves all places that match the parameter `q`. Take the top result's `id` as the most relevant `place_id`.

In [None]:
# Place id for Chicago, US
results.results[0].id

#### 2) Query `place_hierarchies` based on  latitude and longitude

By using PredictHQ's Python SDK, you can find the `place_hierarchies` for a specific latitude and longitude. Call the SDK with `location` set to `@{latitude},{longitude}` and the API will return all relevant `place_hierarchies`.

In [None]:
# Example: retrieve place_hierarchies
results = phq.places.search(location="@41.85003,-87.65005")

for place in results:
    print(place.id, place.name, place.type, place.location, place.county, place.region, place.country)

For a `latitude,longitude` query, the response might include more than one hierarchy. This is because we try to match the closest place's hierarchy, but we also include the closest major city's hierarchy within a radius of 50km. This only applies if the level is below region. You can limit the results to different levels through `type`.

#### 3) Query location name based on `place_id`

By using PredictHQ's SDK, you can find the location name for a specific `place_id`. Call the SDK with `id` set to the `place_id` and take the first result to get the matching location name.

In [None]:
# Example: retrieve place name by place_id
ny_state = phq.places.search(id="5128638").results[0]
print(ny_state.id, ny_state.name, ny_state.type, ny_state.location)