<a href="https://colab.research.google.com/github/predicthq/phq-data-science-docs/blob/master/venues/venues-example.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Venues Example

**A how-to guide to extracting venue information from PredictHQ's Attended Events data (conferences, expos, concerts, festivals, performing-arts, sports, community).**


This notebook shows how to use the PredictHQ Events API to extract venue information for a location and time period of your choice, and how to plot the venues on basic maps.

- [Setup](#setup)
- [Access Token](#access_token)
- [SDK Parameters](#setting_params) 
- [Calling the PredictHQ API and Fetching Events](#calling_api)
- [Exploring the Result DataFrame and Extracting Venue Data](#exploring_df)
- [Displaying all venues on a map](#display_venue_loc)
- [What type of events occurs most often at venues?](#venue_most_freq_type)
- [What's the estimated capacity of venues?](#venue_est_capacity)
- [Appendix - Finding place_id](#appendix)

<a id='setup'></a>
# Setup


- If you're using Google Colab, uncomment and run the following code block.

In [1]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/venues
# !pip install timezonefinder==5.2.0 predicthq==2.0.2 plotly==5.2.1 pandas==1.3.5


- Alternatively if you're running this notebook on a local machine, set up a Python environment using the [requirements.txt](https://github.com/predicthq/phq-data-science-docs/blob/master/venues/requirements.txt) file which is shared alongside the notebook.
These requirements can be installed by running the command `pip install -r requirements.txt`.


In [2]:
import pandas as pd
import plotly.express as px
from predicthq import Client
from timezonefinder import TimezoneFinder

# To display more columns and with a larger width in the DataFrame
pd.set_option("display.max_columns", 50)
pd.options.display.max_colwidth = 100


<a id='access_token'></a>
# Access Token
An Access Token is required to query the API. Check out our [API Quickstart](https://docs.predicthq.com/guides/api-quickstart) page if you want to create an account or an access token.


In [3]:
ACCESS_TOKEN = "REPLACE_WITH_ACCESS_TOKEN"
phq = Client(access_token=ACCESS_TOKEN)
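
If you would rather not paste a token into the notebook, one option is to read it from an environment variable. The sketch below assumes a variable named `PHQ_ACCESS_TOKEN`, which is a hypothetical name chosen for this example and not something the SDK requires. It falls back to the placeholder used above when the variable is unset.

```python
import os

PLACEHOLDER = "REPLACE_WITH_ACCESS_TOKEN"


def load_access_token(env_var="PHQ_ACCESS_TOKEN"):
    """Return the token stored in env_var, or the placeholder if it is unset."""
    # PHQ_ACCESS_TOKEN is an assumed variable name chosen for this example.
    return os.environ.get(env_var, PLACEHOLDER)


ACCESS_TOKEN = load_access_token()
```

The resulting `ACCESS_TOKEN` can then be passed to `Client(access_token=ACCESS_TOKEN)` exactly as in the cell above.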


<a id='setting_params'></a>
# SDK Parameters
We will use PredictHQ's Python SDK to search for Attended Events. Start by building a parameter dictionary and adding the required filters.

In [4]:
parameters = dict()

#### Location
Specifying the location ensures you will see the relevant events for the location.

This notebook uses the city of `Chicago`, Illinois, USA as an example location. You can modify this to suit your location(s) of interest.

We can do this in two ways:  

  1) Using the `within` parameter, which takes the `latitude` and `longitude` of the location of interest, together with a `radius` and a `unit` for the radius. The format for this parameter is detailed in our [Events API technical documentation](https://docs.predicthq.com/resources/events/#search-events).
  
  2) Using a list of `place_id` values.

In [5]:
# Using latitude, longitude and a radius
latitude, longitude = (41.881832, -87.623177) # latitude, longitude for Chicago
radius = 5
radius_unit = "km"

within = f"{radius}{radius_unit}@{latitude},{longitude}"
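
When the location comes from user input, it can help to validate the values before formatting them. The helper below is a minimal sketch: `build_within` is a hypothetical name, and the set of accepted units is an assumption that should be checked against the Events API documentation.

```python
def build_within(latitude, longitude, radius, unit="km"):
    """Format a `within` filter as {radius}{unit}@{latitude},{longitude}."""
    # Assumed set of units; check the Events API documentation for the full list.
    allowed_units = {"m", "km", "ft", "mi"}
    if unit not in allowed_units:
        raise ValueError(f"unsupported unit: {unit!r}")
    if not -90 <= latitude <= 90 or not -180 <= longitude <= 180:
        raise ValueError("latitude/longitude out of range")
    if radius <= 0:
        raise ValueError("radius must be positive")
    return f"{radius}{unit}@{latitude},{longitude}"


print(build_within(41.881832, -87.623177, 5))  # 5km@41.881832,-87.623177
```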

Alternatively, we could use a list of `place_id`s for our search (See our Appendix on Place IDs for a detailed explanation).

In [6]:
place_ids = [4887398] # This is the place_id for Chicago. See the Appendix for how to find place_ids.

You can use either `within` or `place_id` as a filter, but you cannot use both.

In [7]:
parameters.update(within=within)  # Comment out if you want to use place_ids
# parameters.update(place__scope=place_ids)  # Uncomment if you want to use place_ids instead of lat and long

#### Date "YYYY-MM-DD"

To define the period for which Attended Events are returned, set the greater-than-or-equal (`active.gte`) and less-than-or-equal (`active.lte`) parameters. In the Python SDK these are written with a double underscore (`active__gte`, `active__lte`), as in the cell below. This selects all Attended Events that are active within the period.

You could also use these parameters depending on your time period of interest:

`gte - Greater than or equal.` <br/>
`gt - Greater than.`<br/>
`lte - Less than or equal.`<br/>
`lt - Less than.`<br/>

The default example in this notebook is to search for the whole of 2021.

In [8]:
start_date = "2021-01-01"
end_date = "2021-12-31"
parameters.update(active__gte=start_date)
parameters.update(active__lte=end_date)
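
If you query several years, the two date filters can be generated rather than typed. This is a small sketch using only the standard library. `year_range_params` is a hypothetical helper name, and the double-underscore keys simply mirror the ones used in the cell above.

```python
from datetime import date


def year_range_params(year, field="active"):
    """Build inclusive gte/lte filters covering one calendar year."""
    return {
        f"{field}__gte": date(year, 1, 1).isoformat(),
        f"{field}__lte": date(year, 12, 31).isoformat(),
    }


print(year_range_params(2021))
# {'active__gte': '2021-01-01', 'active__lte': '2021-12-31'}
```

The returned dictionary can be merged into the search parameters with `parameters.update(year_range_params(2021))`.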

#### Timezone 
The default timezone for date and time ranges provided as parameters is UTC. But if we want events that occur between 2021-01-01 and 2021-12-31 in Chicago's local time then we can specify Chicago's timezone as a parameter. Timezones are specified as [TZ database names](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones).

For our Chicago location, the timezone is `America/Chicago`. 

You can use the `TimezoneFinder()` to find timezones for your locations of interest.

In [9]:
timezone = TimezoneFinder().timezone_at(lat=latitude, lng=longitude)
print(timezone)

America/Chicago


In [10]:
parameters.update(active__tz=timezone)
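
To see why the timezone matters, the sketch below converts the first and last instants of 2021 on a Chicago wall clock into UTC. It uses the standard-library `zoneinfo` module (Python 3.9 or later, with time zone data available on the system) and does not call the API.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

chicago = ZoneInfo("America/Chicago")

# First and last instants of 2021 on a Chicago wall clock.
local_start = datetime(2021, 1, 1, 0, 0, 0, tzinfo=chicago)
local_end = datetime(2021, 12, 31, 23, 59, 59, tzinfo=chicago)

utc_start = local_start.astimezone(timezone.utc)
utc_end = local_end.astimezone(timezone.utc)

print(utc_start.isoformat())  # 2021-01-01T06:00:00+00:00
print(utc_end.isoformat())    # 2022-01-01T05:59:59+00:00
```

The six-hour offset is consistent with the sample events shown later in this notebook, whose UTC start times fall on 2022-01-01 even though they take place on 31 December 2021 in Chicago.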

#### Rank range

Similar to the date period, the rank range can be set to filter events. `rank_type` can be set to `rank`, `local_rank` or `aviation_rank`. `rank` reflects the [estimated impact of an event](https://docs.predicthq.com/start/ranks/). `local_rank` reflects the [estimated impact of an event in its local area](https://docs.predicthq.com/start/ranks/) by taking population density into account, which makes it useful for comparing events across multiple locations. `aviation_rank` reflects the number of passengers who travel to the event by air. As a rule of thumb, typical `rank_type` and `rank_threshold` settings correspond to the following approximate attendance or passenger numbers:

rank_type |rank_threshold |Approximate attendance / passengers
:-----:|:-----:|:-----:
rank|$20$|$\sim30$
rank|$30$|$\sim100$
rank|$40$|$\sim300$
rank|$50$|$\sim1000$
rank|$60$|$\sim3000$
aviation_rank|$30$|$\sim20$
aviation_rank|$40$|$\sim40$
aviation_rank|$50$|$\sim100$
aviation_rank|$60$|$\sim200$

In [11]:
# Select events according to rank_type, rank threshold.
rank_type = "rank"  # Set to be either "rank", "local_rank" or "aviation_rank".
rank_threshold = 40
filter_parameter = "gte"
parameters.update({f"{rank_type}__{filter_parameter}": rank_threshold})
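
The rule-of-thumb table can also be used in reverse, to pick a threshold from a minimum audience size. The lookup below is only a sketch: the numbers are copied from the table above, and `threshold_for` is a hypothetical helper, not part of the SDK.

```python
# Approximate values copied from the table above.
APPROX_ATTENDANCE = {
    "rank": {20: 30, 30: 100, 40: 300, 50: 1000, 60: 3000},
    "aviation_rank": {30: 20, 40: 40, 50: 100, 60: 200},
}


def threshold_for(rank_type, min_people):
    """Return the lowest tabulated threshold whose estimate reaches min_people."""
    for threshold, people in sorted(APPROX_ATTENDANCE[rank_type].items()):
        if people >= min_people:
            return threshold
    return None  # min_people exceeds every tabulated estimate


print(threshold_for("rank", 250))  # 40
```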


#### Categories
[Attended events](https://docs.predicthq.com/categoryinfo/attended-events) are those from specific categories. We specify them here.


In [12]:
categories = [
    "community",
    "concerts",
    "conferences",
    "expos",
    "festivals",
    "performing-arts",
    "sports",
]
parameters.update(category=categories)

#### Limit-per-call 
When pulling historical data for a long time period, the query can return many results. Setting `limit` to its highest available value (500) means each API call returns up to 500 results, which reduces the number of calls and generally shortens the time needed to retrieve large datasets.

In [13]:
parameters.update(limit=500)
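
The effect of the page size is simple arithmetic: the number of requests is the result count divided by `limit`, rounded up. The sketch below uses a hypothetical result count of 1,200 events and an illustrative smaller page size of 10 for comparison.

```python
import math


def calls_needed(total_events, limit=500):
    """Number of paginated requests needed to fetch total_events results."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return math.ceil(total_events / limit)


# Hypothetical result count of 1,200 events.
print(calls_needed(1200, limit=10))   # 120
print(calls_needed(1200, limit=500))  # 3
```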

#### Checking the parameters
Finally, let's take a look at the parameters we have set for our search.

In [14]:
parameters

{'within': '5km@41.881832,-87.623177',
 'active__gte': '2021-01-01',
 'active__lte': '2021-12-31',
 'active__tz': 'America/Chicago',
 'rank__gte': 40,
 'category': ['community',
  'concerts',
  'conferences',
  'expos',
  'festivals',
  'performing-arts',
  'sports'],
 'limit': 500}

You can check out the full list of available parameters that you could use in querying attended events on our [Events API documentation](https://docs.predicthq.com/resources/events).

<a id='calling_api'></a>
# Calling the PredictHQ API and Fetching Events

In this step, we use PredictHQ's Python SDK Client to query and fetch events using the parameters we defined above.

In [15]:
results = []

# Iterating through all the events that match our criteria and adding them to our results
for event in phq.events.search(parameters).iter_all():
    results.append(event.to_dict())

# Converting the results to a DataFrame
event_df = pd.DataFrame(results)
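
While experimenting, you may want to cap the number of events pulled so that a broad query does not run for a long time. The sketch below wraps the loop above in a hypothetical `collect_events` helper. It only assumes that each item has a `to_dict()` method, and it is demonstrated with a stand-in class instead of the real SDK.

```python
from itertools import islice


def collect_events(event_iter, max_events=None):
    """Turn an iterable of event objects into a list of dicts, optionally capped."""
    return [event.to_dict() for event in islice(event_iter, max_events)]


class FakeEvent:
    """Stand-in for an SDK event object; only to_dict() is assumed."""

    def __init__(self, event_id):
        self.event_id = event_id

    def to_dict(self):
        return {"id": self.event_id}


sample = collect_events((FakeEvent(i) for i in range(5)), max_events=3)
print(sample)  # [{'id': 0}, {'id': 1}, {'id': 2}]
```

With the real client, the same helper would be called as `collect_events(phq.events.search(parameters).iter_all(), max_events=100)`.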


<a id='exploring_df'></a>
# Exploring the Result DataFrame and Extracting Venue Data
We examine the retrieved events and extract venue information.

In [16]:
# Extract venue locations
event_df["lat"] = event_df.location.apply(lambda x: x[1])
event_df["lon"] = event_df.location.apply(lambda x: x[0])
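
Note that `location` is stored in `[longitude, latitude]` order, which is why index 1 is the latitude. If some rows might lack a location, a tolerant variant such as the sketch below avoids an exception (`split_location` is a hypothetical helper name).

```python
def split_location(location):
    """Return (lat, lon) from a [lon, lat] pair, or (None, None) if unusable."""
    if not isinstance(location, (list, tuple)) or len(location) != 2:
        return None, None
    lon, lat = location
    return lat, lon


print(split_location([-87.6570519, 41.884637]))  # (41.884637, -87.6570519)
print(split_location(None))                      # (None, None)
```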


In [17]:
# Extract entity information: id, name and type
def _filter_venues(entities):
    venues = [e for e in entities if e["type"] == "venue"]
    if not venues:
        return None
    return venues[0]


venue_field = lambda entity, fieldname: entity[fieldname] if entity is not None else None


event_df["venues"] = event_df.entities.apply(_filter_venues)

event_df["entity_name"] = event_df["venues"].apply(venue_field, args=("name",))
event_df["entity_id"] = event_df["venues"].apply(venue_field, args=("entity_id",))
event_df["entity_type"] = event_df["venues"].apply(venue_field, args=("type",))
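
The cell above assumes every row has a list in `entities`. As a hedged alternative, the sketch below combines the filter and the field lookup in one hypothetical function, `first_venue_field`, that also tolerates missing values and missing keys.

```python
def first_venue_field(entities, fieldname):
    """Return fieldname from the first venue entity, or None if there is none."""
    if not isinstance(entities, list):
        return None  # covers None and NaN cells
    venue = next((e for e in entities if e.get("type") == "venue"), None)
    return venue.get(fieldname) if venue else None


# Trimmed copy of an entity list like the ones in the `entities` column.
entities = [{"entity_id": "xLVCdVzxtmahfdDMPEDXZB", "name": "City Winery", "type": "venue"}]
print(first_venue_field(entities, "name"))  # City Winery
print(first_venue_field(None, "name"))      # None
```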


In [18]:
event_df.head()

Unnamed: 0,cancelled,category,country,deleted_reason,description,duplicate_of_id,duration,end,first_seen,geo,id,impact_patterns,labels,location,parent_event,place_hierarchies,postponed,relevance,scope,start,state,timezone,title,updated,aviation_rank,brand_safe,entities,local_rank,phq_attendance,predicted_end,private,rank,lat,lon,venues,entity_name,entity_id,entity_type
0,,concerts,US,,,,0,2022-01-01 05:00:00+00:00,2021-11-04 18:14:17+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-87.6570519, 41.884637]}}",APLYr2znpcekmv6xbG,,"[concert, music]","[-87.6570519, 41.884637]",,"[[6295630, 6255149, 6252001, 4896861, 4888671, 4887398]]",,0.0,locality,2022-01-01 05:00:00+00:00,active,America/Chicago,Los Lobos,2021-11-04 22:22:35+00:00,0.0,,"[{'entity_id': 'xLVCdVzxtmahfdDMPEDXZB', 'name': 'City Winery', 'type': 'venue', 'formatted_addr...",51,300,NaT,False,40,41.884637,-87.657052,"{'entity_id': 'xLVCdVzxtmahfdDMPEDXZB', 'name': 'City Winery', 'type': 'venue', 'formatted_addre...",City Winery,xLVCdVzxtmahfdDMPEDXZB,venue
1,,concerts,US,,,,0,2022-01-01 04:00:00+00:00,2021-11-25 18:07:55+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-87.6275718, 41.8854802]}}",78RYvKCwmbQXLDw9SC,,"[concert, music]","[-87.6275718, 41.8854802]",,"[[6295630, 6255149, 6252001, 4896861, 4888671, 4887398]]",,0.0,locality,2022-01-01 04:00:00+00:00,active,America/Chicago,Trey Songz,2022-04-04 22:47:30+00:00,0.0,,"[{'entity_id': 'UniKDgC373FLJBdMmghKZc', 'name': 'The Chicago Theatre', 'type': 'venue', 'format...",69,3314,NaT,False,60,41.88548,-87.627572,"{'entity_id': 'UniKDgC373FLJBdMmghKZc', 'name': 'The Chicago Theatre', 'type': 'venue', 'formatt...",The Chicago Theatre,UniKDgC373FLJBdMmghKZc,venue
2,,concerts,US,,,,0,2022-01-01 03:00:00+00:00,2021-10-06 01:11:56+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-87.6267529, 41.8538782]}}",AMg4KnuD7wjBRsJXvf,,"[concert, music]","[-87.6267529, 41.8538782]",,"[[6295630, 6255149, 6252001, 4896861, 4888671, 4906683], [6295630, 6255149, 6252001, 4896861, 48...",,0.0,locality,2022-01-01 03:00:00+00:00,active,America/Chicago,Reggies Rooftop New Years Eve Package,2021-10-06 01:35:25+00:00,0.0,,"[{'entity_id': '3AqCHbCHCKhLNSJsVeRijxk', 'name': 'Reggie's Rock Club', 'type': 'venue', 'format...",50,300,NaT,False,40,41.853878,-87.626753,"{'entity_id': '3AqCHbCHCKhLNSJsVeRijxk', 'name': 'Reggie's Rock Club', 'type': 'venue', 'formatt...",Reggie's Rock Club,3AqCHbCHCKhLNSJsVeRijxk,venue
3,,community,US,,21+ / www.joesbar.comGrab your tickets now for a rockin New Years Eve night with one of our FAVO...,,0,2022-01-01 03:00:00+00:00,2021-12-04 00:43:56+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-87.652182, 41.910098]}}",EM74tQatxzf4QoQFzu,,"[concert, music]","[-87.652182, 41.910098]",,"[[6295630, 6255149, 6252001, 4896861, 4888671, 4887398]]",,0.0,locality,2022-01-01 03:00:00+00:00,active,America/Chicago,New Years Eve Party with Jukebox Luke (SUGGESTED RETAIL $95),2021-12-10 01:40:12+00:00,,,"[{'entity_id': 'YhUBPfQsr43bjc6ADDqjwF', 'name': 'Joe's Bar', 'type': 'venue', 'formatted_addres...",63,1500,NaT,False,54,41.910098,-87.652182,"{'entity_id': 'YhUBPfQsr43bjc6ADDqjwF', 'name': 'Joe's Bar', 'type': 'venue', 'formatted_address...",Joe's Bar,YhUBPfQsr43bjc6ADDqjwF,venue
4,,concerts,US,,,,0,2022-01-01 02:30:00+00:00,2021-09-28 06:20:06+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-87.62913, 41.888233]}}",4VA7eXFetqy3zHgUzN,,"[concert, music]","[-87.62913, 41.888233]",,"[[6295630, 6255149, 6252001, 4896861, 4888671, 4887398]]",,0.0,locality,2022-01-01 02:30:00+00:00,active,America/Chicago,Motion City Soundtrack with All Get Out,2022-01-01 03:39:45+00:00,0.0,,"[{'entity_id': 'AfJwpDskcMhXWJWRjtALNV', 'name': 'House of Blues Chicago', 'type': 'venue', 'for...",60,1300,NaT,False,52,41.888233,-87.62913,"{'entity_id': 'AfJwpDskcMhXWJWRjtALNV', 'name': 'House of Blues Chicago', 'type': 'venue', 'form...",House of Blues Chicago,AfJwpDskcMhXWJWRjtALNV,venue


<a id='display_venue_loc'></a>
# Displaying all venues on a map

In [19]:
# Extract unique venues
entity_loc = event_df[["lat", "lon", "entity_id", "entity_name", "entity_type"]][
    event_df.entity_type == "venue"
].drop_duplicates()

entity_loc


Unnamed: 0,lat,lon,entity_id,entity_name,entity_type
0,41.884637,-87.657052,xLVCdVzxtmahfdDMPEDXZB,City Winery,venue
1,41.885480,-87.627572,UniKDgC373FLJBdMmghKZc,The Chicago Theatre,venue
2,41.853878,-87.626753,3AqCHbCHCKhLNSJsVeRijxk,Reggie's Rock Club,venue
3,41.910098,-87.652182,YhUBPfQsr43bjc6ADDqjwF,Joe's Bar,venue
4,41.888233,-87.629130,AfJwpDskcMhXWJWRjtALNV,House of Blues Chicago,venue
...,...,...,...,...,...
1077,41.876452,-87.624869,33AY8zr5TjqT8yeHVmJucuR,Fine Arts Building,venue
1122,41.875210,-87.658001,QJjfX3mJEReLU4USjyzyKc,University Of Illinois At Chicago,venue
1191,41.849998,-87.650002,y8UT486CTjp4B6UhbGqxaL,Windy City Ribfest,venue
1205,41.914127,-87.628799,dCcKAP7DqkXXJuKDRvvwMu,Lincoln Park South Fields,venue


In [20]:
fig = px.scatter_mapbox(
    entity_loc, lat="lat", lon="lon", hover_name="entity_name", color_discrete_sequence=["fuchsia"], zoom=12
)

fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(height=500, margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show()


<a id='venue_most_freq_type'></a>
# What type of events occurs most often at venues?

The map colours each venue by the event category that occurred there most often. A bigger circle indicates that more events of that category took place at the venue.

In [21]:
agg_df = (
    event_df[event_df.entity_type == "venue"]
    .groupby(["lat", "lon", "entity_name", "category"])
    .agg({"entity_id": "count", "phq_attendance": "max"})
    .reset_index()
)
venue_type = agg_df.loc[agg_df.groupby("entity_name")["entity_id"].idxmax()]
venue_type.head()


Unnamed: 0,lat,lon,entity_name,category,entity_id,phq_attendance
21,41.863956,-87.663828,Addams/Medill Park,festivals,1,64000
6,41.852094,-87.61185,Arie Crown Theater,performing-arts,5,2561
36,41.875824,-87.625113,Auditorium Theatre,concerts,10,3875
72,41.88524,-87.661725,Bottom Lounge,concerts,22,672
117,41.89819,-87.622225,Broadway Playhouse,performing-arts,4,331
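
The aggregation above boils down to two steps: count events per (venue, category) pair, then keep the category with the highest count for each venue. The pure-Python sketch below illustrates the same logic on made-up data. On ties it keeps the first category encountered.

```python
from collections import Counter

# Made-up (venue, category) pairs, one per event.
events = [
    ("Venue A", "concerts"),
    ("Venue A", "concerts"),
    ("Venue A", "sports"),
    ("Venue B", "festivals"),
]

counts = Counter(events)  # events per (venue, category)

most_frequent = {}
for (venue, category), n in counts.items():
    # Keep the category with the highest count seen so far for this venue.
    if venue not in most_frequent or n > most_frequent[venue][1]:
        most_frequent[venue] = (category, n)

print(most_frequent)
# {'Venue A': ('concerts', 2), 'Venue B': ('festivals', 1)}
```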


In [22]:
fig = px.scatter_mapbox(
    venue_type,
    lat="lat",
    lon="lon",
    color="category",
    size="entity_id",
    hover_name="entity_name",
    hover_data=["phq_attendance"],
    zoom=12,
)

fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(height=500, margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show()


<a id='venue_est_capacity'></a>
# What's the estimated capacity of venues?

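The capacity is estimated from the maximum `phq_attendance` of events at the venue within the specified time period. The cell below keeps one row per distinct attendance value, so a venue with events in several categories (for example, Grant Park) can appear more than once.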

In [23]:
estimated_venue_capacity = agg_df.loc[agg_df.reset_index().groupby(["phq_attendance"])["entity_id"].idxmax()]
estimated_venue_capacity


Unnamed: 0,lat,lon,entity_name,category,entity_id,phq_attendance
13,41.853878,-87.626753,Reggie's Rock Club,community,1,300
112,41.896840,-87.636805,Parliament Chicago,concerts,1,320
117,41.898190,-87.622225,Broadway Playhouse,performing-arts,4,331
40,41.878045,-87.627434,DePaul University Loop Campus DePaul Center,conferences,1,343
96,41.889042,-87.631627,Untitled,concerts,5,360
...,...,...,...,...,...,...
3,41.851219,-87.617028,McCormick Place,expos,15,90000
61,41.883182,-87.621860,Jay Pritzker Pavilion,festivals,5,240000
24,41.872172,-87.618750,Grant Park,festivals,2,385000
25,41.872172,-87.618750,Grant Park,sports,1,960000
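
If you prefer exactly one row per venue, a possible variant is to group by venue name and keep the row with the largest attendance. The sketch below runs on made-up rows shaped like `agg_df`. It is an alternative under that assumption, not the method used to produce the table above.

```python
import pandas as pd

# Made-up rows in the same shape as agg_df (one row per venue and category).
sample_agg = pd.DataFrame(
    {
        "entity_name": ["Venue A", "Venue A", "Venue B"],
        "category": ["festivals", "sports", "concerts"],
        "entity_id": [2, 1, 5],
        "phq_attendance": [385, 960, 360],
    }
)

# Keep, for each venue, the row with the largest attendance.
per_venue_capacity = (
    sample_agg.loc[sample_agg.groupby("entity_name")["phq_attendance"].idxmax()]
    .sort_values("phq_attendance")
    .reset_index(drop=True)
)

print(per_venue_capacity[["entity_name", "category", "phq_attendance"]])
```

Applied to the real data, the same pattern would be `agg_df.loc[agg_df.groupby("entity_name")["phq_attendance"].idxmax()]`.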


In [24]:
fig = px.scatter_mapbox(
    estimated_venue_capacity,
    lat="lat",
    lon="lon",
    color="category",
    size="phq_attendance",
    hover_name="entity_name",
    zoom=12,
)

fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(height=500, margin={"r": 0, "t": 0, "l": 0, "b": 0})
fig.show()


<a id='appendix'></a>
## Appendix: Finding `place_id` 

This appendix shows how to map between locations and `place_id` values. There are three options:

 1. Query `place_id` based on location name. Here the location name could be a city, a state, a country or a continent.
 2. Query `place_hierarchies` based on latitude and longitude.
 3. Query location name based on `place_id`.

The full list of parameters that you could use in your query is documented at our [Places API page](https://docs.predicthq.com/resources/places/).<br>PredictHQ uses the [geonames places](https://www.geonames.org/) convention. 

#### 1) Query `place_id` based on location name

Using PredictHQ's Places SDK, you can find the `place_id` for a location name. Set `q` to the location name and the API returns the matching places, most relevant first, so the first result gives the most relevant `place_id`. You can narrow the results with parameters such as `country` and `type` to get the closest match. All the parameter fields can be found at [Places](https://docs.predicthq.com/resources/places/#fields).

In [25]:
# Example: retrieve place_id
results = phq.places.search(q="Chicago", country="US", type="locality")

for place in results:
    print(place.id, place.name, place.type, place.county, place.region, place.country, place.location)


4887398 Chicago locality Cook County Illinois United States [-87.65005, 41.85003]
8922857 Chicago locality Mezquitic Jalisco Mexico [-104.15972, 22.33944]
4903862 North Chicago locality Lake County Illinois United States [-87.84118, 42.32558]
4887442 Chicago Heights locality Cook County Illinois United States [-87.6356, 41.50615]
4919857 East Chicago locality Lake County Indiana United States [-87.45476, 41.6392]
4915963 West Chicago locality DuPage County Illinois United States [-88.20396, 41.88475]
4887492 Chicago Ridge locality Cook County Illinois United States [-87.77922, 41.70142]
4924095 New Chicago locality Lake County Indiana United States [-87.27448, 41.55837]
4911868 South Chicago Heights locality Cook County Illinois United States [-87.63782, 41.48087]
8966977 Case Chicago locality Provincia di Ravenna Emilia-Romagna Italy [11.80487, 44.51449]


`phq.places.search` retrieves all places that match the `q` parameter. The first result is the most relevant match, so we take its `place_id`.

In [26]:
# Place id for Chicago, US
results.results[0].id


'4887398'
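
Because several places can share a name, you may prefer to select the match explicitly instead of relying on result order. The sketch below works on a few rows copied from the search output above. `pick_place_id` is a hypothetical helper, and the plain tuples stand in for the SDK's place objects.

```python
# (id, name, type, region, country) tuples copied from the search output above.
places = [
    ("4887398", "Chicago", "locality", "Illinois", "United States"),
    ("8922857", "Chicago", "locality", "Jalisco", "Mexico"),
    ("4903862", "North Chicago", "locality", "Illinois", "United States"),
]


def pick_place_id(places, name, region):
    """Return the id of the first place matching name and region exactly."""
    for place_id, place_name, _type, place_region, _country in places:
        if place_name == name and place_region == region:
            return place_id
    return None


print(pick_place_id(places, "Chicago", "Illinois"))  # 4887398
```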

#### 2) Query `place_hierarchies` based on  latitude and longitude

Using PredictHQ's Python SDK, you can find the `place_hierarchies` for a specific latitude and longitude. Set `location` to `@{latitude},{longitude}` and the API returns all relevant `place_hierarchies`.

In [27]:
# Example: retrieve place_hierarchies
results = phq.places.search(location="@41.85003,-87.65005")

for place in results:
    print(place.id, place.name, place.type, place.location, place.county, place.region, place.country)


4887398 Chicago locality [-87.65005, 41.85003] Cook County Illinois United States
4911838 South Branch Addition neighbourhood [-87.64116, 41.85031] Cook County Illinois United States
4905971 Pilsen neighbourhood [-87.65755, 41.85753] Cook County Illinois United States
4885565 Bridgeport neighbourhood [-87.65116, 41.83809] Cook County Illinois United States
4900306 Locks neighbourhood [-87.66339, 41.84281] Cook County Illinois United States
4900611 Lower West Side neighbourhood [-87.66561, 41.8542] Cook County Illinois United States
4907805 Robert Brooks Homes neighbourhood [-87.6595, 41.86531] Cook County Illinois United States
4883580 Armour Square neighbourhood [-87.63311, 41.84003] Cook County Illinois United States
4888642 Conleys Patch neighbourhood [-87.63616, 41.8642] Cook County Illinois United States
4894177 Grace Abbott Homes neighbourhood [-87.66478, 41.86337] Cook County Illinois United States


For `latitude,longitude`, the response might include more than one hierarchy. This is because we try to match the closest place's hierarchy, but we also include the hierarchy of the closest major city within a radius of 50km. This only applies if the level is below region. You can limit the results to different levels through `type`.
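
Events returned by the Events API carry a `place_hierarchies` field as well (see the DataFrame preview earlier in this notebook), so a `place_id` can be used to check whether an event falls inside a given place. The sketch below is a hypothetical helper that compares ids as strings, since they may appear as either integers or strings.

```python
def in_place(place_hierarchies, place_id):
    """True if place_id appears in any of the event's place hierarchies."""
    target = str(place_id)
    return any(target in map(str, hierarchy) for hierarchy in place_hierarchies)


# Hierarchy copied from the first event shown earlier in the notebook.
hierarchies = [[6295630, 6255149, 6252001, 4896861, 4888671, 4887398]]
print(in_place(hierarchies, 4887398))  # True
print(in_place(hierarchies, 5128638))  # False
```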

#### 3) Query location name based on `place_id`

Using PredictHQ's SDK, you can find the location name for a specific `place_id`. Set `id` to the `place_id` and take the first result, which is the place that the `place_id` refers to.

In [28]:
# Example: retrieve place name by place_id
ny_state = phq.places.search(id="5128638").results[0]
print(ny_state.id, ny_state.name, ny_state.type, ny_state.location)


5128638 State of New York region [-75.4999, 43.00035]
