<a href="https://colab.research.google.com/github/predicthq/phq-data-science-docs/blob/master/unattended-events/part_3_feature_engineering.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### NON-ATTENDANCE-BASED EVENTS DATA SCIENCE GUIDES

# Part 3 Aggregation and Feature Engineering

<b>A How To Guide to aggregating data and creating features for forecasting from PredictHQ's Non-Attendance-Based Events data (public-holidays, observances and school-holidays).</b>

Designing features for forecasting will be affected by what you are forecasting and what you are trying to optimise.

The findings of Part 2 can be used to inform these decisions, depending on your business domain.

This notebook addresses some of the key considerations.
    
- Impact Measurement
    - PHQ Rank
    - Aviation Rank
    - PHQ attendance
    - Event count
    - Event flag
- Business type 
   - Using certain types of events

- Impact pattern
   - For events lasting more than one day, the impact of each day might be different.

The rest of this notebook takes you through four fictional use cases showing how you could aggregate the data depending on what's important to your company. It should provide a framework for creating your own features or aggregations.


## [Case 1](#case_1): 
For this case study, consider the example of a coffee shop. Most Non-Attendance-Based Events nearby are likely to impact demand. PHQ Rank is used to indicate the relative impact of different events. The events are aggregated based on PHQ Rank. 

When there are multiple events from the same category, the event with the largest rank is assumed to have the leading impact. Thus, the maximum PHQ rank from each category is used as a feature.
For events with multiple days duration, each of the days has an equal impact on the demand.

For the school holiday category we will also look at using the sum of phq_attendance on each day. This is equivalent to the number of students on holiday. The phq_attendance field for school holidays is only available for the USA and the UK.


Summary for the example case study: 
- Use the maximum rank in each category as impact. 
- Include all Non-Attendance-Based Events.
- Each day has the same impact for multiple day events.
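The "maximum rank per category" idea can be sketched on a toy table with one row per event per day. The rows below are made-up illustration values, not PredictHQ data, and `toy_days` and `max_rank` are hypothetical names.

```python
import pandas as pd

# Toy per-day event rows (made-up values for illustration only)
toy_days = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"]
        ),
        "category": [
            "school-holidays",
            "school-holidays",
            "public-holidays",
            "school-holidays",
        ],
        "rank": [78, 100, 90, 62],
    }
)

# Keep the strongest event per (date, category), then spread the
# categories into columns; days without an event in a category get 0
max_rank = (
    toy_days.groupby(["date", "category"])["rank"]
    .max()
    .unstack("category")
    .fillna(0)
)
print(max_rank)
```

Two school holidays overlap on 2020-01-01, and only the higher rank of the two survives as the feature value for that day.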
 

## [Case 2](#case_2): 
For this case study, consider a DIY shop. They might notice that certain kinds of Non-Attendance-Based Events drive positive impact on footfall and sales.

In this case, the impact of events is measured by the number of events happening on each day. Based on previous experience, the shop owner might notice that sales increase on religious holidays, nationwide holidays and school holidays. Thus, only events from the related labels are included. 

The length of event duration does not impact the sales. 

Summary for the example case study:
- Event count is relevant to the business.
- Interested in events with certain labels. 
- Each day of a multi-day event has an equal impact. 
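Filtering by label comes down to a set intersection between the labels of interest and each event's labels. A minimal sketch, assuming each event carries a list of labels; the events and the `observance-local` label are hypothetical examples.

```python
# Hypothetical events, each with a list of labels
toy_events = [
    {"title": "New Year's Day", "labels": ["holiday", "holiday-national"]},
    {"title": "Local observance", "labels": ["observance", "observance-local"]},
    {"title": "District spring break", "labels": ["holiday", "school"]},
]

# Labels the business cares about
label_set = {"holiday-national", "holiday-religious", "school"}

# An event is kept when it shares at least one label with label_set
kept = [e for e in toy_events if label_set.intersection(e["labels"])]
event_count = len(kept)

print([e["title"] for e in kept], event_count)
```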

## [Case 3](#case_3): 
For this case study, consider a transportation business such as an airline. Thus, Aviation Rank is relevant for evaluating the event impact. 

This kind of business is highly likely to notice that events with 'holiday-national' and 'school' labels impact its demand.

For events lasting multiple days, especially school holidays, the beginning and the end of the holiday are busier than usual. 

Summary for the example case study:
- Aviation Rank is relevant to customers' business. 
- Interested in events with certain labels. 
- The first and last two days have an impact for multiple day events.
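A "first and last two days" pattern can be written as a set of day indexes, where non-negative indexes count from the start of the event and negative indexes count from the end. The helper below, `days_with_impact`, is a hypothetical name used only to illustrate that selection rule.

```python
def days_with_impact(event_duration, impact_pattern=None):
    """Return the 0-based day offsets of an event that carry impact."""
    # Single-day events, or no pattern at all: every day counts
    if event_duration == 1 or impact_pattern is None:
        return list(range(event_duration))
    return [
        day
        for day in range(event_duration)
        # day matches a start index, or (day - duration) matches an end index
        if day in impact_pattern or (day - event_duration) in impact_pattern
    ]


first_and_last_two = {0, 1, -2, -1}
print(days_with_impact(10, first_and_last_two))
print(days_with_impact(3, first_and_last_two))
print(days_with_impact(1, first_and_last_two))
```

For a 10-day holiday this keeps offsets 0, 1, 8 and 9. Short events can match both ends at once, so a 3-day event keeps all of its days.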
 
There are numerous other approaches that may be appropriate for your data once you start to examine the relationship between Non-Attendance-Based Events data and your demand. For example, events could be filtered by rank thresholds, and time information, such as the season or the day of the week, could be used to adjust event impact. 
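As a rough sketch of those two ideas, the snippet below drops low-rank events and attaches simple calendar fields. The threshold of 60, the toy rows, and the `is_weekend` and `month` column names are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy events (made-up ranks for illustration only)
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-07-03", "2020-07-04", "2020-12-25"]),
        "rank": [45, 90, 100],
    }
)

RANK_THRESHOLD = 60  # hypothetical cut-off; tune it against your own demand data

# Keep only events at or above the threshold
filtered = toy[toy["rank"] >= RANK_THRESHOLD].copy()

# Calendar information that could be used to adjust event impact
filtered["is_weekend"] = filtered["date"].dt.dayofweek >= 5  # Saturday=5, Sunday=6
filtered["month"] = filtered["date"].dt.month

print(filtered)
```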


## [Case 4](#case_4):
For this case study, we will again consider a coffee shop. The coffee shop wonders whether the number of students on holiday will affect their business.

In this case we will look at the school holiday category. We will use the sum of phq_attendance on each day for the school districts that fall within our 10km radius. This is equivalent to the number of students on holiday. The phq_attendance field for school holidays is only available for the USA and the UK.
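The daily sum can be sketched with two overlapping holidays. The districts, dates and student numbers below are invented for illustration.

```python
import pandas as pd

# Two hypothetical school districts with overlapping holidays
districts = [
    {"start": "2020-01-01", "end": "2020-01-03", "phq_attendance": 1000},
    {"start": "2020-01-02", "end": "2020-01-04", "phq_attendance": 250},
]

# Expand each holiday into one row per day
rows = []
for district in districts:
    for day in pd.date_range(district["start"], district["end"]):
        rows.append({"date": day, "phq_attendance": district["phq_attendance"]})

# Students on holiday per day = sum over the districts on holiday that day
students_on_holiday = pd.DataFrame(rows).groupby("date")["phq_attendance"].sum()
print(students_on_holiday)
```

On the days where both holidays overlap the feature is 1250, and it drops to 250 once the first district goes back to school.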

Summary for the example case study:
- phq_attendance for school holidays refers to the number of students on holiday 
- Looking at the sum of phq_attendance per day 
- Each day of a multi-day event has an equal impact.

## [Compare Features](#compare) 

Comparison of the different aggregation methods applied in the above case studies. 


If using Google Colab, uncomment the following code block.

In [49]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/unattended-events
# !pip install predicthq timezonefinder pandas==1.0.5


If running locally, set up a Python environment using `requirements.txt` shared alongside the notebook to install the required dependencies. 



In [50]:
from predicthq import Client
from timezonefinder import TimezoneFinder
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import pandas as pd
from datetime import timedelta

# To display more columns in the dataframe
pd.set_option("display.max_columns", 50)

<a id='csv_sdk'></a>
# SDK or CSV Data Access
This notebook can be run with either the example CSV data provided or, if you have access to the PredictHQ Client Endpoint, the code provided to call the SDK for the locations of interest to you. 

As using the SDK is not the focus of Part 3, a function is created to call the SDK. For guidance on how to use the SDK refer to Part 1. If you do not have access to the SDK the notebook also works with a number of CSV files that are provided alongside the notebook. 

In [51]:
# Set whether to run with SDK or using provided CSV files
# Set to either 'CSV' or 'SDK'
RUN_SETTING = "SDK"

if RUN_SETTING == "SDK":
    # Replace Access Token with own access token.
    ACCESS_TOKEN = 'REPLACE_WITH_ACCESS_TOKEN'
    phq = Client(access_token=ACCESS_TOKEN)


def query_unattended_events(
    start_time,
    end_time,
    #place_ids,
    radius,
    radius_unit,
    latitude,
    longitude,
    categories,
):
    """
    Query Non-Attendance-Based Events based on time, location and category info.
    Args:
        start_time: start of the period for querying Non-Attendance-Based Events.
                    Format "YYYY-MM-DD"
        end_time: end of the period for querying Non-Attendance-Based Events.
                    Format of "YYYY-MM-DD"
        radius: radius for querying Non-Attendance-Based Events.
        radius_unit: unit of the radius.
        latitude: latitude of the interested location.
        longitude: longitude of the interested location.
        categories: list of categories, such as
                                    ['school-holidays',
                                    'public-holidays',
                                    'observances']
    return:
        event_df: pandas DataFrames of non-attendance-based events
    """
    within = f"{radius}{radius_unit}@{latitude},{longitude}"
    #timezone = TimezoneFinder().timezone_at(lat=latitude, lng=longitude)
    timezone = 'America/Los_Angeles'
    params = {
        "active__gte": start_time,
        "active__lte": end_time,
        "active__tz": timezone,
        "within": within,
        "category": categories,
        # "place__scope": place_ids,  # enable together with the place_ids argument
    }

    result_list = []

    # Iterating through all the events that match our criteria and
    # adding them to our result_list
    for event in phq.events.search(params).iter_all():
        result_list.append(event.to_dict())

    event_df = pd.DataFrame(result_list)
    # Selecting the target fields
    event_df = event_df[
        [
            "id",
            "title",
            "description",
            "start",
            "end",
            "duration",
            "category",
            "labels",
            "country",
            "rank",
            "local_rank",
            "aviation_rank",
            "location",
            "place_hierarchies",
            "phq_attendance",
            "scope",
            "first_seen",
        ]
    ]
    return event_df

<a id='query_unattended_events'></a>
# Query Non-Attendance-Based Events

In [52]:
start_time = "2020-01-01"
end_time = "2020-12-31"
radius = 10
radius_unit = "km"
latitude, longitude = (34.07, -118.25)
#place_ids = [5332921]
categories = ["school-holidays", "public-holidays", "observances"]
file_name = f"data/event_data/radius{radius}{radius_unit}_{latitude}_{longitude}_{start_time}_{end_time}.csv"
#file_name = f"data/event_data/radius{place_ids}_{end_time}.csv"
if RUN_SETTING == "SDK":
    event_df = query_unattended_events(
        start_time,
        end_time,
        radius,
        radius_unit,
        latitude,
        longitude,
        #place_ids,
        categories
    )
    
    event_df.to_csv(
        file_name,
        index=False,
    )
elif RUN_SETTING == "CSV":
    event_df = pd.read_csv(file_name)
    # Convert the stringified label lists back to Python lists.
    # literal_eval only accepts Python literals, unlike eval.
    import ast
    event_df["labels"] = event_df["labels"].apply(ast.literal_eval)
    event_df["start"] = pd.to_datetime(event_df["start"])
    event_df["end"] = pd.to_datetime(event_df["end"])
else:
    raise ValueError("RUN_SETTING must be set to either 'SDK' or 'CSV'")

event_df = event_df.sort_values("start")
event_df.head()

Unnamed: 0,id,title,description,start,end,duration,category,labels,country,rank,local_rank,aviation_rank,location,place_hierarchies,phq_attendance,scope,first_seen
415,AFwZU36DodT2PW4k9J,Pasadena Unified School District - Christmas B...,,2019-12-20 00:00:00+00:00,2020-01-05 23:59:59+00:00,1468799,school-holidays,"[holiday, school]",US,78,42.0,0,"[-118.0958544725, 34.1957128825]","[[6295630, 6255149, 6252001, 5332921, 5368381]]",25424.0,county,2021-08-13 05:21:29+00:00
410,7eM6jzhyKCjoBmZMwH,South Pasadena Unified School District - Chris...,,2019-12-21 00:00:00+00:00,2020-01-05 23:59:59+00:00,1382399,school-holidays,"[holiday, school]",US,62,59.0,0,"[-118.1572935851, 34.1102483727]","[[6295630, 6255149, 6252001, 5332921, 5368381]]",4160.0,county,2021-08-13 05:19:29+00:00
411,BDYLkauWPG3jfND8yf,Los Angeles Unified School District - Christma...,,2019-12-21 00:00:00+00:00,2020-01-12 23:59:59+00:00,1987199,school-holidays,"[holiday, school]",US,100,54.0,85,"[-118.3984876888, 34.0249894925]","[[6295630, 6255149, 6252001, 5332921, 5368381]...",688725.0,county,2021-08-13 06:51:38+00:00
414,HzEtZx4vGYmikJzjKy,Montebello Unified School District - Christmas...,,2019-12-21 00:00:00+00:00,2020-01-13 23:59:59+00:00,2073599,school-holidays,"[holiday, school]",US,80,70.0,0,"[-118.1290708535, 34.0073885633]","[[6295630, 6255149, 6252001, 5332921, 5368381]]",30377.0,county,2021-08-13 06:53:40+00:00
413,HsXVHKy2gzaDaGHfWw,Glendale Unified School District - Christmas B...,,2019-12-21 00:00:00+00:00,2020-01-06 23:59:59+00:00,1468799,school-holidays,"[holiday, school]",US,79,43.0,0,"[-118.2431304629, 34.1930749973]","[[6295630, 6255149, 6252001, 5332921, 5368381]]",27450.0,county,2021-08-13 05:18:08+00:00
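When events are saved to CSV, list-valued columns such as `labels` are read back as strings. A minimal sketch of that round trip on a one-row inline CSV (the row is made up for illustration), using the standard library's `ast.literal_eval`, which only evaluates Python literals:

```python
import ast
import io

import pandas as pd

# A tiny stand-in for a saved events file (made-up row)
csv_text = '''id,labels
abc,"['holiday', 'school']"
'''

df = pd.read_csv(io.StringIO(csv_text))

# After reading, each labels value is a plain string
print(type(df.loc[0, "labels"]))

# Parse the string back into a real Python list
df["labels"] = df["labels"].apply(ast.literal_eval)
print(df.loc[0, "labels"])
```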


# Non-Attendance-Based Events Aggregations

In [53]:
# Set the measurement for aggregation and feature engineering
# Set either to "phq_rank", "aviation_rank", "event_count", "event_flag","phq_attendance"
impact_measurement = "phq_rank"

# Filter event based on labels of interest.
# The  labels are summarized in a label set such as {"holiday-national","observance-worldwide"},
# Set to None, when all of the labels are required.
label_set = None

# For multi-day events, the days that carry impact are given as a set of indexes.
# An index counts from either the start or the end of the event:
# from the start it runs 0, 1, 2, ...,
# from the end it runs -1, -2, ...
# Set to None when all of the days have the same impact.
# impact_pattern is only applied to events lasting more than one day.
impact_pattern = None


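def unattended_events_aggregation_and_features(
    event_df,
    impact_measurement="phq_rank",
    label_set=None,
    impact_pattern=None,
):
    """
    Aggregate events into one row per day and one column per category.
    Args:
        event_df: DataFrame of events with datetime "start" and "end" columns.
        impact_measurement: one of "phq_rank", "aviation_rank", "event_count",
                    "event_flag" or "phq_attendance".
        label_set: set of labels to keep, or None to keep all events.
        impact_pattern: set of day indexes that carry impact for multi-day
                    events, or None when every day counts.
    return:
        DataFrame indexed by date, covering the global start_time to end_time.
    """
    event_split = []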

    for _, row in event_df.iterrows():
        start_date = row["start"].date()
        end_date = row["end"].date()
        event_duration = (end_date - start_date).days + 1

        # Ignore event if labels don't match what we've specified
        if label_set is not None and not label_set.intersection(set(row["labels"])):
            continue

        # For multi-day events we may want to consider only specific days
        for day in range(event_duration):
            if (
                (event_duration == 1)
                or (impact_pattern is None)
                or (day in impact_pattern)
                or ((day - event_duration) in impact_pattern)
            ):
                event_split.append(
                    {
                        "date": start_date + timedelta(days=day),
                        "category": row["category"],
                        "rank": row["rank"],
                        "aviation_rank": row["aviation_rank"],
                        "phq_attendance": row["phq_attendance"],
                        "count": 1,
                    }
                )

    event_split = pd.DataFrame(event_split)
    
    event_split_count = (
        event_split.groupby(["date", "category"])
        .agg(
            phq_rank=("rank", "max"),
            aviation_rank=("aviation_rank", "max"),
            event_count=("count", "sum"),
            event_flag=("count", "max"),
            phq_attendance=("phq_attendance","sum")
            # if day = start day then 1, 0
            # if day = end day then 1, 0 
        )
        .reset_index()
    )
    
    event_split_count = event_split_count.pivot(
        index="date", columns="category", values=impact_measurement
    ).fillna(0)
    
    event_count_per_day = pd.DataFrame(
        index=pd.date_range(start=start_time, end=end_time)
    )
    
    event_count_per_day = pd.merge(
        event_count_per_day,
        event_split_count,
        how="left",
        left_index=True,
        right_index=True,
    ).fillna(0)

    return event_count_per_day
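The last two steps of the function, pivoting categories into columns and then padding out to a full calendar, can be sketched in isolation. This toy version uses `reindex` rather than the left merge used above, which gives the same kind of result for a date-indexed frame. The values and the `agg`, `wide` and `features` names are made up for illustration.

```python
import pandas as pd

# Toy aggregated rows: one value per (date, category)
agg = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-03"]),
        "category": ["public-holidays", "school-holidays", "school-holidays"],
        "phq_rank": [90, 100, 100],
    }
)

# One column per category, indexed by date
wide = agg.pivot(index="date", columns="category", values="phq_rank")

# Pad to every day in the period; days or categories with no events become 0
calendar = pd.date_range("2020-01-01", "2020-01-04")
features = wide.reindex(calendar).fillna(0)

print(features)
```

Padding to the full calendar matters for forecasting because the model needs a feature row for every day, including the days with no events.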

# Non-Attendance-Based Event Aggregations

<a id='case_1'></a>
## Case 1: 
For this case study, consider the example of a coffee shop. Most Non-Attendance-Based Events nearby are likely to impact demand. We try using PHQ Rank and PHQ attendance to indicate the relative impact of different events. First we will aggregate the events based on PHQ Rank. 

When there are multiple events from the same category, the event with the largest rank is assumed to have the leading impact. Thus, the maximum PHQ rank from each category is used as a feature.

For events with multiple days duration, each of the days has an equal impact on the demand.
    
Given the above, the parameters are set to:
- ```impact_measurement = "phq_rank"```\
   Use the maximum rank in each category as impact.
- ```label_set = None```\
   Include all Non-Attendance-Based Events.
- ```impact_pattern = None```\
   Each day has the same impact for multiple day events.
   
We will also aggregate the events based on phq_attendance by taking the sum - this represents the number of students on holiday per day. Only US and UK school holidays have phq_attendance. From the resulting dataframe we can see that some school districts end their holidays earlier than others as the phq_attendance starts to drop off. 
- ```impact_measurement = "phq_attendance"```\
   Use the sum of phq_attendance as the impact (number of students on holiday).
- ```label_set = {'school'}```\
   Include only school holidays.


In [54]:
df_case_1 = unattended_events_aggregation_and_features(
    event_df, impact_measurement="phq_rank", label_set=None, impact_pattern=None
)

df_case_1

Unnamed: 0,observances,public-holidays,school-holidays
2020-01-01,0.0,90.0,100.0
2020-01-02,0.0,0.0,100.0
2020-01-03,0.0,0.0,100.0
2020-01-04,50.0,0.0,100.0
2020-01-05,0.0,0.0,100.0
...,...,...,...
2020-12-27,50.0,0.0,100.0
2020-12-28,0.0,0.0,100.0
2020-12-29,0.0,0.0,100.0
2020-12-30,0.0,0.0,100.0


<a id='case_2'></a>
## Case 2: 
For this case study, consider a DIY shop. They might notice that certain kinds of Non-Attendance-Based Events drive positive impact on footfall and sales.

In this case, the impact of events is measured by the number of events happening on each day. Based on previous experience, the shop owner might notice that sales increase on religious holidays, nationwide holidays and school holidays. Thus, only events from the related labels are included. 

The length of event duration does not impact the sales. 
    
Given the above, the parameters are set to:
- ```impact_measurement = "event_count"``` \
    Event count is relevant to the business. 
- ```label_set = {'holiday-national', 'observance-united-nations', 'holiday-religious', 'school'}``` \
    Interested in events with certain labels. 
- ```impact_pattern = None```  \
    Each day of a multi-day event has an equal impact. 

In [55]:
df_case_2 = unattended_events_aggregation_and_features(
    event_df,
    impact_measurement="event_count",
    label_set={
        "holiday-national",
        "observance-united-nations",
        "holiday-religious",
        "school",
    },
    impact_pattern=None,
)

df_case_2

Unnamed: 0,observances,public-holidays,school-holidays
2020-01-01,0.0,1.0,6.0
2020-01-02,0.0,0.0,6.0
2020-01-03,0.0,0.0,6.0
2020-01-04,1.0,0.0,6.0
2020-01-05,0.0,0.0,6.0
...,...,...,...
2020-12-27,1.0,0.0,6.0
2020-12-28,0.0,0.0,6.0
2020-12-29,0.0,0.0,6.0
2020-12-30,0.0,0.0,6.0


<a id='case_3'></a>
## Case 3: 
For this case study, consider a transportation business such as an airline. Thus, Aviation Rank is relevant for evaluating the event impact. 

This kind of business is highly likely to notice that events with 'holiday-national' and 'school' labels impact its demand. 

For events lasting multiple days, especially school holidays, the beginning and the end of the holiday are busier than usual.
    
Given the above, the parameters are set to:  
- ```impact_measurement = "aviation_rank"``` \
  Aviation Rank is relevant to customers' business. 
- ```label_set = {'holiday-national', 'school'}```\
  Interested in events with certain labels. 
- ```impact_pattern = {0,1,-2,-1}```\
  The first and last two days have an impact for multiple day events.

In [56]:
df_case_3 = unattended_events_aggregation_and_features(
    event_df,
    impact_measurement="aviation_rank",
    label_set={"holiday-national", "school"},
    impact_pattern={0, 1, -2, -1},
)

df_case_3

Unnamed: 0,public-holidays,school-holidays
2020-01-01,100.0,0.0
2020-01-02,0.0,0.0
2020-01-03,0.0,0.0
2020-01-04,0.0,0.0
2020-01-05,0.0,0.0
...,...,...
2020-12-27,0.0,0.0
2020-12-28,0.0,0.0
2020-12-29,0.0,0.0
2020-12-30,0.0,0.0


<a id='case_4'></a>
## Case 4: 

For this case study, consider the example of a coffee shop again. Students being on holiday in nearby school districts may affect the demand in store. This effect may be different depending on the total number of nearby students on holiday. We will aggregate the events by adding the PHQ attendance of school holiday events per day. 

For events with multiple days duration, each of the days has an equal impact on the demand.
    
Given the above, the parameters are set to:
- ```impact_measurement = "phq_attendance"```\
   Use the sum of phq_attendance as the impact (number of students on holiday).
- ```label_set = {'school'}```\
   Include only school holidays.
- ```impact_pattern = None```\
   Each day has the same impact for multiple day events.

In [57]:
df_case_4 = unattended_events_aggregation_and_features(
    event_df, impact_measurement="phq_attendance", label_set={'school'}, impact_pattern=None
)

df_case_4.head(n=10)

Unnamed: 0,school-holidays
2020-01-01,792734.0
2020-01-02,792734.0
2020-01-03,792734.0
2020-01-04,792734.0
2020-01-05,792734.0
2020-01-06,746552.0
2020-01-07,719102.0
2020-01-08,719102.0
2020-01-09,719102.0
2020-01-10,719102.0


<a id='compare'></a>
# Compare Features

By comparing the four different approaches taken in this notebook, you can see how different the features are. All four cases use the same underlying data but are aggregated with a different use case in mind.

In [64]:
def plot_difference_per_category(feature_column):
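    """
    Plot the given category's feature from each case in stacked subplots.
    Args:
        feature_column: category column to plot, such as "public-holidays".
    """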
    if feature_column in df_case_4.columns:
        sub_plot_count = 4
        fig = make_subplots(rows=sub_plot_count, cols=1, shared_xaxes=True)
    elif feature_column in df_case_3.columns:
        sub_plot_count = 3
        fig = make_subplots(rows=sub_plot_count, cols=1, shared_xaxes=True)
    else:
        sub_plot_count = 2
        fig = make_subplots(rows=sub_plot_count, cols=1, shared_xaxes=True)

    fig.add_trace(
        go.Scatter(
            x=df_case_1.index,
            y=df_case_1[feature_column],
            name=f"Case1_{feature_column}",
            mode="lines+markers",
        ),
        row=1,
        col=1,
    )
    
    

    fig.add_trace(
        go.Scatter(
            x=df_case_2.index,
            y=df_case_2[feature_column],
            name=f"Case2_{feature_column}",
            mode="lines+markers",
        ),
        row=2,
        col=1,
    )

    if feature_column in df_case_3.columns:
        fig.add_trace(
            go.Scatter(
                x=df_case_3.index,
                y=df_case_3[feature_column],
                name=f"Case3_{feature_column}",
                mode="lines+markers",
            ),
            row=3,
            col=1,
        )
    if feature_column in df_case_4.columns:
        fig.add_trace(
            go.Scatter(
                x=df_case_4.index,
                y=df_case_4[feature_column],
                name=f"Case4_{feature_column}",
                mode="lines+markers",
            ),
            row=4,
            col=1,
        )

    fig.update_layout(
        title="Non-Attendance-Based Events Aggregation Comparison - {}".format(
            feature_column
        ),
        title_x=0.45,
        yaxis=dict(
            title="Feature Value",
        ),
        yaxis2=dict(
            title="Feature Value",
        ),
        yaxis3=dict(
            title="Feature Value",
        ),
        yaxis4=dict(
            title="Feature Value",
        ),
    )
    # Label the x-axis of the bottom subplot, whichever row that is
    fig.update_xaxes(title_text="Date", row=sub_plot_count, col=1)
    
    fig.show()

In [61]:
plot_difference_per_category("public-holidays")

In [62]:
plot_difference_per_category("observances")

In [65]:
plot_difference_per_category("school-holidays")