<a href="https://colab.research.google.com/github/predicthq/phq-data-science-docs/blob/master/unattended-events/part_1_data_engineering.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### NON-ATTENDANCE-BASED EVENTS DATA SCIENCE GUIDES

Non-Attendance-Based Events have a start and end date but a more fluid impact than attended events. Examples include observances, public holidays and school holidays. This How-To series shows you how to extract the data (Part 1), explore the data (Part 2) and experiment with different aggregations (Part 3).

# Part 1 Data Engineering

<b>A How-To Guide to extracting PredictHQ's Non-Attendance-Based Events data (public-holidays, observances and school-holidays).</b>

This notebook will guide you through how to extract Non-Attendance-Based Events for a location and time of your choice.

- [Setup](#setup)
- [Access Token](#access_token)
- [SDK Parameters](#setting_params) 
- [Calling the PredictHQ API and Fetching Events](#calling_api)
- [Exploring the Result DataFrame and Storing it](#exploring_df)
- [Appendix - Finding place_id](#appendix)


<a id='setup'></a>
# Setup


If you are using Google Colab, uncomment the following code block.

In [None]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/unattended-events
# !pip install predicthq timezonefinder pandas==1.0.5


If running locally, set up a Python environment using `requirements.txt` shared alongside the notebook to install the required dependencies. 



In [8]:
import pandas as pd
from predicthq import Client
from timezonefinder import TimezoneFinder
import requests
import passwords
from shapely.geometry import shape
import folium
import geopandas as gpd
# To display more columns and with a larger width in the DataFrame
pd.set_option("display.max_columns", 50)
pd.options.display.max_colwidth = 100

<a id='access_token'></a>
# Access Token
An Access Token is required to query the API.

The following link will guide you through creating an account and an access token. 

 - https://docs.predicthq.com/guides/quickstart/

In [9]:
# Replace Access Token with own access token.
ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'
phq = Client(access_token=ACCESS_TOKEN)

<a id='setting_params'></a>
# SDK Parameters
To search for Non-Attendance-Based Events, start by building a parameter dictionary and adding the required filters.

In [10]:
parameters = dict()

#### Location
Observances, public holidays and school holidays change by location. Specifying the location ensures you will see the relevant events for the location.

The notebook provides a default location in ```Los Angeles, California```.

This can be adjusted to suit a location that is of interest to you.

We can do this in two ways:  

  1) Using the ```within``` parameter, which combines the ```latitude``` and ```longitude``` of the location of interest with a ```radius``` and a ```unit``` for the radius.
  
  2) Using a list of ```place_id``` values.
    
The result is relatively insensitive to the radius setting because Non-Attendance-Based Events have large location scopes. We recommend a default radius of 10 km.

Note: Radius is a key parameter when retrieving events from other categories, for example Attendance-Based Events such as concerts and sports games.


In [11]:
# Using latitude, longitude and a radius
latitude, longitude = (34.07, -118.25)
radius = 10
radius_unit = "km"

within = f"{radius}{radius_unit}@{latitude},{longitude}"

Alternatively, we could use a list of ```place_id``` values for our search (see the Appendix on finding ```place_id``` for a detailed explanation).

In [12]:
# Using a list of place_id
# place_ids = [5368361]

You can use either ```within``` or ```place_id``` as a filter, but you cannot use both.
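
The either/or rule above can be enforced before the parameters are sent. The following is a minimal sketch, not part of the PredictHQ SDK: `build_location_filter` is a hypothetical helper, and it only reuses the ```within``` and ```place__scope``` keys that appear in this notebook.

```python
def build_location_filter(within=None, place_ids=None):
    """Return exactly one location filter for the search parameters.

    Hypothetical helper for illustration only; it is not part of the SDK.
    """
    # Reject the two invalid cases: both filters given, or neither given.
    if (within is None) == (place_ids is None):
        raise ValueError("Provide either within or place_ids, but not both.")
    if within is not None:
        return {"within": within}
    return {"place__scope": list(place_ids)}


# Example: the Los Angeles default used in this notebook.
location_filter = build_location_filter(within="10km@34.07,-118.25")
print(location_filter)
```

The returned dictionary could then be merged into the search parameters with `parameters.update(**location_filter)`.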

In [13]:
parameters.update(within=within)  # Comment if you want to use place_ids
#parameters.update(place__scope=place_ids)  # Comment if you want to use lat and long

#### Date "YYYY-MM-DD"

To define the period of time for which Non-Attendance-Based Events will be returned, set the greater than or equal (```active__gte```) and less than or equal (```active__lte```) parameters. This selects all Non-Attendance-Based Events that are active within this period.

Depending on your time period of interest, you could also use any of these suffixes:

```gte - Greater than or equal.``` <br>
```gt - Greater than.```<br>
```lte - Less than or equal.```<br>
```lt - Less than.```<br>


The default example in this notebook is to search for the whole of 2020.
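
If you want to search other whole years, the two date filters can be generated rather than typed by hand. This is a small sketch under the assumption that you always want a full calendar year; `year_range` is a hypothetical helper name, and only the ```active__gte``` and ```active__lte``` keys come from this notebook.

```python
from datetime import date


def year_range(year):
    """Build the active__gte / active__lte filters covering one calendar year.

    Hypothetical helper for illustration only.
    """
    return {
        "active__gte": date(year, 1, 1).isoformat(),    # first day, "YYYY-MM-DD"
        "active__lte": date(year, 12, 31).isoformat(),  # last day, "YYYY-MM-DD"
    }


date_filter = year_range(2020)
print(date_filter)
```

The result could be passed to `parameters.update(**date_filter)` in place of the two separate updates.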

In [14]:
start_time = "2020-01-01"
end_time = "2020-12-31"
parameters.update(active__gte=start_time)
parameters.update(active__lte=end_time)

#### Timezone 
Setting the timezone for the location of interest ensures that the appropriate events are returned. Timezone names follow the <a href="https://en.wikipedia.org/wiki/List_of_tz_database_time_zones">tz database</a>.

For our Los Angeles example it is ```America/Los_Angeles```. 
Use `TimezoneFinder()` to find the timezone for your location of interest.

See the appendix on how to find the timezone using ```place_id```.

In [15]:
timezone = TimezoneFinder().timezone_at(lat=latitude, lng=longitude)
print(timezone)

America/Los_Angeles


In [16]:
parameters.update(active__tz=timezone)

#### Categories
Specify a list of Non-Attendance-Based Events categories to return.
```['school-holidays', 'public-holidays', 'observances']```

In [17]:
categories = ["school-holidays", "public-holidays", "observances"]
parameters.update(category=categories)

#### Checking the parameters
Finally, let's take a look at the parameters we have set for our search.

In [18]:
parameters

{'within': '10km@34.07,-118.25',
 'active__gte': '2020-01-01',
 'active__lte': '2020-12-31',
 'active__tz': 'America/Los_Angeles',
 'category': ['school-holidays', 'public-holidays', 'observances']}

You can check out the full list of available parameters that you could use in querying Non-Attendance-Based Events at our [Events Resource page](https://docs.predicthq.com/resources/events/).

<a id='calling_api'></a>
# Calling the PredictHQ API and Fetching Events

In this step, we use the PredictHQ Python SDK client to query and fetch events based on the parameters we defined above.

In [19]:
results = []

# Iterating through all the events that match our criteria and adding them to our results
for event in phq.events.search(**parameters).iter_all():
    results.append(event.model_dump())

# Converting the results to a DataFrame
event_df = pd.DataFrame(results)

<a id='exploring_df'></a>
# Exploring the Result DataFrame and Storing it
We take a look at the result data and select the most important fields for our use case.

In [21]:
event_df.head(n=10)

Unnamed: 0,cancelled,category,country,deleted_reason,description,duplicate_of_id,duration,end,first_seen,geo,id,labels,location,parent_event,place_hierarchies,postponed,relevance,scope,start,state,timezone,title,updated,aviation_rank,brand_safe,entities,local_rank,phq_attendance,predicted_end,private,rank
0,,observances,US,,New Year's Eve is the last day of the year in the Gregorian calendar. Many parties to welcome th...,,86399,2020-12-31 23:59:59+00:00,2017-01-04 23:07:03+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",BJmqLY9kQNqw,"[holiday, observance]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-31 00:00:00+00:00,active,,New Year's Eve,2021-11-17 02:36:07+00:00,0,,"[{'entity_id': 'huZ4EThvNDrFWjjAbDcLy4', 'name': 'New Year's Eve observed', 'type': 'event-group...",,,,False,90
1,,observances,US,,,,86399,2020-12-27 23:59:59+00:00,2021-07-28 00:17:16+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",D82cvkxrTUk8BSLGbr,"[holiday, observance, observance-united-nations]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-27 00:00:00+00:00,active,,International Day of Epidemic Preparedness,2021-11-13 03:20:18+00:00,0,,"[{'entity_id': 'dHUNwAjmgnZ9AvqFtn44wG', 'name': 'International Day of Epidemic Preparedness', '...",,,,False,50
2,,observances,US,,Kwanzaa is a week-long holiday honoring the culture and traditions of African people and their d...,,86399,2020-12-26 23:59:59+00:00,2017-01-04 23:06:54+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",QDPqxb7ll7mM,"[holiday, observance]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-26 00:00:00+00:00,active,,Kwanzaa (first day),2021-11-13 17:07:38+00:00,0,,"[{'entity_id': 'hwGSnBtjr2YsYVWTAjVUy4', 'name': 'Kwanzaa (first day)', 'type': 'event-group', '...",,,,False,50
3,,public-holidays,US,,Christmas Day celebrates Jesus Christ's birth.,,86399,2020-12-25 23:59:59+00:00,2021-10-17 01:08:09+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-119.4179324, 36.778261]}}",AEUSEm4cunYtwfPwPe,"[holiday, holiday-christian, holiday-local, holiday-religious]","[-119.4179324, 36.778261]",,"[[6295630, 6255149, 6252001, 5332921]]",,0.0,region,2020-12-25 00:00:00+00:00,active,,Christmas Day,2021-11-17 01:50:33+00:00,0,,"[{'entity_id': 'huZZt9gWmDyBzwcHEmNiCc', 'name': 'Christmas Day', 'type': 'event-group', 'format...",,,,False,70
4,,public-holidays,US,,Christmas Day celebrates Jesus Christ's birth.,,86399,2020-12-25 23:59:59+00:00,2017-01-04 23:06:52+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",b3xEqLza0Nz0,"[holiday, holiday-christian, holiday-national, holiday-religious]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-25 00:00:00+00:00,active,,Christmas Day,2021-11-16 05:57:13+00:00,100,,"[{'entity_id': 'huZZt9gWmDyBzwcHEmNiCc', 'name': 'Christmas Day', 'type': 'event-group', 'format...",,,,False,90
5,,public-holidays,US,,Christmas Eve in the United States is on December 24 each year.,,86399,2020-12-24 23:59:59+00:00,2020-12-23 00:01:21+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",2GzBLK5LKTFQqdMj7n,"[holiday, holiday-national]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-24 00:00:00+00:00,active,,Christmas Eve,2021-11-17 02:08:57+00:00,100,,"[{'entity_id': 'hnmD6GjKBwXLdLKVBxtmcc', 'name': 'Christmas Eve observed', 'type': 'event-group'...",,,,False,90
6,,observances,US,,Christmas Eve in the United States is on December 24 each year.,,86399,2020-12-24 23:59:59+00:00,2017-01-04 23:06:51+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",lljqM1zoRGVk,"[holiday, holiday-christian, holiday-religious, observance]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-24 00:00:00+00:00,active,,Christmas Eve,2021-11-17 02:11:04+00:00,0,,"[{'entity_id': 'hnmD6GjKBwXLdLKVBxtmcc', 'name': 'Christmas Eve observed', 'type': 'event-group'...",,,,False,50
7,,observances,US,,,,86399,2020-12-21 23:59:59+00:00,2017-01-04 23:06:45+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",YeVAbxBEe8ZQ,"[holiday, observance, observance-season]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-21 00:00:00+00:00,active,,December Solstice,2021-11-13 12:51:42+00:00,0,,"[{'entity_id': 'Ht8gVTuSkVD5YynPa7etkx', 'name': 'December Solstice', 'type': 'event-group', 'fo...",,,,False,50
8,,observances,US,,The United Nations' (UN) International Human Solidarity Day is celebrated on December 20 each ye...,,86399,2020-12-20 23:59:59+00:00,2017-01-04 23:06:43+00:00,"{'geometry': {'type': 'Point', 'coordinates': [-95.712891, 37.09024]}}",NwJ2xyYGDZAA,"[holiday, observance, observance-united-nations]","[-95.712891, 37.09024]",,"[[6295630, 6255149, 6252001]]",,0.0,country,2020-12-20 00:00:00+00:00,active,,International Human Solidarity Day,2021-11-13 18:42:37+00:00,0,,"[{'entity_id': 'huYxbqRcEVtqeSu3bBtKVk', 'name': 'International Human Solidarity Day', 'type': '...",,,,False,50
9,,school-holidays,US,,,,1468799,2021-01-04 23:59:59+00:00,2021-08-13 05:51:01+00:00,"{'geometry': {'type': 'Polygon', 'coordinates': [[[-118.14199500000001, 34.11953000000002], [-11...",4kKqorpAcBk9tHzssU,"[holiday, school]","[-118.1572935851, 34.1102483727]",,"[[6295630, 6255149, 6252001, 5332921, 5368381]]",,0.0,county,2020-12-19 00:00:00+00:00,active,,South Pasadena Unified School District - Christmas Break,2021-10-27 03:10:23+00:00,0,,[],59.0,4160.0,,False,62


It is important to understand the output data. The most useful fields are the following:
- ```id``` The unique id of each event.
- ```title``` The title of each event.
- ```description``` The description of each event.
- ```start``` The start time of each event.
- ```end``` The end time of each event.
- ```duration``` Duration of event in seconds.
- ```category``` Category of events. e.g. school-holidays, public-holidays, observances.
- ```labels``` Labels of each event.
- ```country``` Country of each event.
- ```rank``` PHQ rank of each event.
- ```aviation_rank``` Aviation rank of each event.
- ```local_rank``` For school holidays in the US and the UK, it represents the proportion of students in a school district/LEA.
- ```location``` Latitude and longitude of each event.
- ```place_hierarchies``` The hierarchies of place ids for each event.
- ```scope``` The scope of each event.
- ```first_seen``` The time when we received this event.
- ```phq_attendance``` For school holidays in the US and the UK, it represents the number of students/LEA.
- ```geo``` Whether an event is at a point or covers a polygon area, including the coordinates.
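
To make the relationship between ```start```, ```end``` and ```duration``` concrete, the sketch below recomputes the duration for two rows copied from the output above. It uses only the standard library; `seconds_between` is a hypothetical helper, and the check is an illustration on these two rows rather than a guarantee about every event.

```python
from datetime import datetime

# Two rows copied from the sample output above.
events = [
    {
        "title": "Christmas Day",
        "start": "2020-12-25 00:00:00+00:00",
        "end": "2020-12-25 23:59:59+00:00",
        "duration": 86399,
    },
    {
        "title": "South Pasadena Unified School District - Christmas Break",
        "start": "2020-12-19 00:00:00+00:00",
        "end": "2021-01-04 23:59:59+00:00",
        "duration": 1468799,
    },
]


def seconds_between(start, end):
    """Return the number of whole seconds between two ISO timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return int(delta.total_seconds())


for event in events:
    computed = seconds_between(event["start"], event["end"])
    # Events end at 23:59:59, so one extra second gives whole calendar days.
    days = (computed + 1) / 86400
    print(event["title"], computed == event["duration"], days)
```

For these rows the recomputed value matches the ```duration``` field, which is a quick sanity check you could run before aggregating by day.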

In [23]:
# Selecting the target fields
event_df = event_df[
    [
        "id",
        "title",
        "description",
        "start",
        "end",
        "duration",
        "category",
        "labels",
        "country",
        "rank",
        "local_rank",
        "aviation_rank",
        "phq_attendance",
        "location",
        "place_hierarchies",
        "scope",
        "first_seen",
        "geo"
    ]
]

In [34]:
# Creating a filename for our DataFrame and saving our final DataFrame as a CSV file
if "within" in parameters:
    file_name = (
        f"radius{radius}{radius_unit}_{latitude}_{longitude}_{start_time}_{end_time}"
    )
else:
    # place_ids may hold integers, so convert them to strings before joining
    file_name = f"place_ids_{'_'.join(map(str, place_ids))}_{start_time}_{end_time}"
# The data/event_data directory must already exist
event_df.to_csv(f"data/event_data/{file_name}.csv", index=False)
print(f"DataFrame saved to data/event_data/{file_name}.csv")

DataFrame saved to data/event_data/radius10km_34.07_-118.25_2020-01-01_2020-12-31.csv


<a id='appendix'></a>
## Appendix: Finding ```place_id``` 

Here is a guide on how to link store locations to ```place_id```. Here the ```location``` could be a city, a state, a country or a continent. 

 - Query ```place_id``` based on ```location```
 - Query ```place_hierarchies``` based on ```latitude, longitude```
 - Query ```location``` based on ```place_id```
 - Query ```timezone``` based on ```place_id```

The full list of parameters that you could use in your query is documented on our [Places API page](https://docs.predicthq.com/resources/places/).<br>PredictHQ uses the GeoNames places convention: https://www.geonames.org/ 

#### 1) Query ```place_id``` based on ```location```

By using the PredictHQ Places API, you can find the ```place_id``` for a specific ```location```. Call the API with ```q``` set to the ```location```; the API returns the most relevant places, so taking the top result gives the most relevant ```place_id``` for that ```location```.
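
Taking the top result assumes that the query matched at least one place. The sketch below shows one way to guard against an empty result list. `top_place` is a hypothetical helper, and `sample_payload` is a hand-written stand-in for a Places API response that only mirrors the ```id```, ```name``` and ```type``` fields shown in this appendix.

```python
def top_place(payload):
    """Return the first (most relevant) place in a payload, or None if empty.

    Hypothetical helper for illustration only.
    """
    results = payload.get("results", [])
    return results[0] if results else None


# Hand-written sample standing in for a real API response.
sample_payload = {
    "results": [{"id": "5332921", "name": "California", "type": "region"}]
}

print(top_place(sample_payload))
print(top_place({"results": []}))  # no match, so None instead of an IndexError
```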

In [37]:
# Example locations.
locations = ["California"]

place_id_lookup = pd.DataFrame()

for location in locations:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        params={"q": location},
    )

    data = response.json()
    df = pd.json_normalize(data["results"])
    # Keep only the top (most relevant) result.
    place_id_lookup = pd.concat([place_id_lookup, df.iloc[[0]]], ignore_index=True)

In [38]:
place_id_lookup[["id", "name", "type"]]

Unnamed: 0,id,name,type
0,5332921,California,region


#### 2) Query ```place_hierarchies``` based on ```latitude, longitude```

By using PredictHQ Places Hierarchies API, you can find the  ```place_hierarchies``` for a specific ```latitude, longitude```. By calling the API and setting ```location.origin``` to ```latitude, longitude```, the API will return the most relevant ```place_hierarchies```.

In [27]:
# Example locations.
latitude_longitudes = [[34.07, -118.25]]

place_hierarchies_lookup = pd.DataFrame()

for latitude_longitude in latitude_longitudes:
    latitude, longitude = latitude_longitude
    response = requests.get(
        url="https://api.predicthq.com/v1/places/hierarchies",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        params={"location.origin": f"{latitude},{longitude}"},
    )

    data = response.json()
    df = pd.DataFrame(data)
    df["latitude"] = latitude
    df["longitude"] = longitude
    place_hierarchies_lookup = pd.concat([place_hierarchies_lookup, df], ignore_index=True)

In [28]:
place_hierarchies_lookup

Unnamed: 0,place_hierarchies,latitude,longitude
0,"[{'place_id': '6295630', 'type': 'planet'}, {'place_id': '6255149', 'type': 'continent'}, {'plac...",34.07,-118.25
1,"[{'place_id': '6295630', 'type': 'planet'}, {'place_id': '6255149', 'type': 'continent'}, {'plac...",34.07,-118.25


For each ```latitude, longitude```, the response might include more than one hierarchy. The API matches the closest place's hierarchy and also includes the hierarchy of the closest major city within a radius of 50km. This only applies if the level is below region. If it exists, the major city's hierarchy will always be the second row of the DataFrame.
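
Each hierarchy is a list of ```place_id``` and ```type``` pairs, so it can be easier to inspect as one row per place. The sketch below flattens a response of that shape. `flatten_hierarchies` is a hypothetical helper, and `sample_response` is a hand-written, shortened stand-in that keeps only the first two levels visible in the output above.

```python
# Shortened, hand-written sample: only the first two levels of each hierarchy.
sample_response = {
    "place_hierarchies": [
        [
            {"place_id": "6295630", "type": "planet"},
            {"place_id": "6255149", "type": "continent"},
        ],
        [
            {"place_id": "6295630", "type": "planet"},
            {"place_id": "6255149", "type": "continent"},
        ],
    ]
}


def flatten_hierarchies(response):
    """Turn nested hierarchies into one row per (hierarchy, level).

    Hypothetical helper for illustration only.
    """
    rows = []
    for index, hierarchy in enumerate(response["place_hierarchies"]):
        for level, place in enumerate(hierarchy):
            rows.append(
                {
                    "hierarchy": index,  # 0 = closest place, 1 = major city if present
                    "level": level,      # position within the hierarchy
                    "place_id": place["place_id"],
                    "type": place["type"],
                }
            )
    return rows


for row in flatten_hierarchies(sample_response):
    print(row)
```

The resulting list of dictionaries could be passed straight to `pd.DataFrame` if a tabular view is preferred.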

#### 3) Query ```location``` based on ```place_id```

By using the PredictHQ Places API, you can find the ```location``` for a specific ```place_id```. Call the API with ```id``` set to the ```place_id```; taking the top result gives the ```location``` for that ```place_id```.
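
The ```id``` parameter could also be a comma-separated list of place ids, which allows several places to be looked up in one request instead of one request per id. The sketch below only builds the request parameters. `build_places_params` is a hypothetical helper, and how the API orders or pages the combined results is not covered here.

```python
def build_places_params(place_ids):
    """Join place ids into the comma-separated form used by the id parameter.

    Hypothetical helper for illustration only.
    """
    # str() lets the helper accept both integer and string ids.
    return {"id": ",".join(str(place_id) for place_id in place_ids)}


params = build_places_params(["6295630", "6255148", "2510769", "2513413"])
print(params)
```

The dictionary could be passed as `params=` to `requests.get`, in the same way as the single-id calls in this appendix.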

In [29]:
# Example locations.
place_ids = ["6295630", "6255148", "2510769", "2513413"]

location_lookup = pd.DataFrame()

for place_id in place_ids:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        # The id could be a comma-separated list of place_ids. In this example, the
        # places are queried one place_id at a time.
        params={"id": place_id},
    )

    data = response.json()
    df = pd.json_normalize(data["results"])
    # Keep only the top (most relevant) result.
    location_lookup = pd.concat([location_lookup, df.iloc[[0]]], ignore_index=True)

In [30]:
location_lookup

Unnamed: 0,id,type,name,county,region,country,country_alpha2,country_alpha3,location
0,6295630,planet,Earth,,,,,,"[0, 0]"
1,6255148,continent,Europe,,,,,,"[9.14062, 48.69096]"
2,2510769,country,Spain,,,Spain,ES,ESP,"[-4, 40]"
3,2513413,region,Murcia,,Murcia,Spain,ES,ESP,"[-1.5, 38]"


#### 4) Query ```timezone``` based on ```place_id```

Here is an example of how to find the ```timezone``` using ```place_id```.

In [32]:
timezone_lookup = pd.DataFrame()
for place_id in [5368361, 4887398]:
    response = requests.get(
        url="https://api.predicthq.com/v1/places/",
        headers={
            "Authorization": "Bearer {}".format(ACCESS_TOKEN),
            "Accept": "application/json",
        },
        # The id could be a comma-separated list of place_ids. In this example, the
        # places are queried one place_id at a time.
        params={"id": place_id},
    )

    data = response.json()
    df = pd.json_normalize(data["results"])
    # Keep only the top (most relevant) result.
    timezone_lookup = pd.concat([timezone_lookup, df.iloc[[0]]], ignore_index=True)

# The location field is ordered [longitude, latitude].
timezone_lookup[["lon", "lat"]] = pd.DataFrame(
    timezone_lookup.location.tolist(), index=timezone_lookup.index
)
func = TimezoneFinder().timezone_at

timezone_lookup["timezone"] = timezone_lookup.apply(
    lambda x: func(lng=x["lon"], lat=x["lat"]), axis=1
)
print(timezone_lookup[["name", "id", "timezone"]])



          name       id             timezone
0  Los Angeles  5368361  America/Los_Angeles
1      Chicago  4887398      America/Chicago
