<a href="https://colab.research.google.com/github/predicthq/phq-data-science-docs/blob/master/unattended-events/part_3_feature_engineering.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### NON-ATTENDANCE-BASED EVENTS DATA SCIENCE GUIDES

# Part 3 Aggregation and Feature Engineering

<b>A How To Guide to aggregating data and creating features for forecasting from PredictHQ's Non-Attendance-Based Events data (public-holidays, observances and school-holidays).</b>

Designing features for forecasting will be affected by what you are forecasting and what you are trying to optimise.

The findings of Part 2 can inform these decisions, depending on your business domain.

This notebook addresses some of the key considerations.
    
- Impact Measurement
    - PHQ Rank
    - Aviation Rank
    - Event count
    - Event flag
- Business type 
   - Using certain types of events

- Impact pattern
   - For events lasting more than one day, the impact of each day might differ.
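
As a rough illustration of how the four impact measurements differ, the sketch below computes each of them for the events active on a single day. The event values are made up for the example and are not taken from PredictHQ data.

```python
# Hypothetical events active on the same day (illustrative values only)
events_on_day = [
    {"title": "Holiday A", "rank": 90, "aviation_rank": 100},
    {"title": "Observance B", "rank": 50, "aviation_rank": 0},
    {"title": "Observance C", "rank": 30, "aviation_rank": 0},
]

# PHQ Rank / Aviation Rank: take the strongest event on the day
phq_rank = max((e["rank"] for e in events_on_day), default=0)
aviation_rank = max((e["aviation_rank"] for e in events_on_day), default=0)

# Event count: how many events are active; event flag: whether any is active
event_count = len(events_on_day)
event_flag = int(event_count > 0)

print(phq_rank, aviation_rank, event_count, event_flag)
```

The count grows with every additional event, while the flag stays at 1 however many events overlap, which is why the two can lead to quite different features.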

The rest of this notebook takes you through three fictional use cases that show how you could aggregate the data depending on what is important to your company. They should provide a framework for creating your own features or aggregations.


## [Case 1](#case_1): 
For this case study, consider the example of a coffee shop. Most Non-Attendance-Based Events nearby are likely to impact demand. PHQ Rank is used to indicate the relative impact of different events. The events are aggregated based on PHQ Rank. 

When there are multiple events from the same category, the event with the largest rank is assumed to have the leading impact. Thus, the maximum PHQ rank from each category is used as a feature.

For multi-day events, each day has an equal impact on demand.

Summary for the example case study: 
- Use the maximum rank in each category as impact. 
- Include all Non-Attendance-Based Events.
- Each day has the same impact for multiple day events.

The Features API is an alternative endpoint provided by PredictHQ that aggregates event data to simplify the preparation of features for machine learning models. More documentation can be found at https://docs.predicthq.com/start/features-api/. As an example, the data preparation for Case 1 is repeated using the Features API.

## [Case 2](#case_2): 
For this case study, consider a DIY shop. They might notice that certain kinds of Non-Attendance-Based Events drive positive impact on footfall and sales.

In this case, the impact of events is measured by the number of events happening on each day. Based on previous experience, the shop owner might notice that sales increase on religious holidays, nationwide holidays and school holidays. Thus, only events from the related labels are included. 

The duration of an event does not affect sales.

Summary for the example case study:
- Event count is relevant to the business.
- Interested in events with certain labels. 
- Each day of a multi-day event has an equal impact. 

As with Case 1, the data preparation for Case 2 is repeated using the Features API.

## [Case 3](#case_3): 
For this case study, consider a transportation business such as an airline. Aviation Rank is therefore the relevant measure of event impact.

This kind of business is highly likely to notice that events with 'holiday-national' and 'school' labels impact its demand.

For multi-day events, especially school holidays, the beginning and the end of the holiday are busier than usual.

Summary for the example case study:
- Aviation Rank is relevant to customers' business. 
- Interested in events with certain labels. 
- Only the first two and last two days of a multi-day event have an impact.
 
There are numerous other approaches that may be appropriate for your data when you start to examine the relationship between Non-Attendance-Based Events data and your demand. For example, events could be filtered out by rank thresholds, and time information, such as whether the event falls in a particular season or on a weekday, could be used to adjust event impact.
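
A minimal sketch of those two ideas, filtering by a rank threshold and attaching day-of-week information, could look like the following. The events, the `RANK_THRESHOLD` value and the `add_time_features` helper are all hypothetical and only serve the illustration.

```python
from datetime import date

# Hypothetical daily events (illustrative values only)
events = [
    {"date": date(2020, 1, 1), "rank": 90},
    {"date": date(2020, 1, 4), "rank": 50},
    {"date": date(2020, 1, 6), "rank": 25},
]

RANK_THRESHOLD = 40  # assumed cut-off; tune it against your own demand data


def add_time_features(event):
    """Return a copy of the event with weekday and weekend indicators."""
    weekday = event["date"].weekday()  # Monday is 0, Sunday is 6
    return {**event, "weekday": weekday, "is_weekend": weekday >= 5}


# Drop low-rank events, then attach the time information
filtered = [add_time_features(e) for e in events if e["rank"] >= RANK_THRESHOLD]

for row in filtered:
    print(row["date"], row["rank"], row["weekday"], row["is_weekend"])
```

The weekend indicator could then be used, for example, to scale or split the event features before they are passed to a model.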
   

## [Compare Features](#compare) 

Comparison of the different aggregation methods applied in the above case studies. 


If using Google Colab, uncomment the following code block.

In [None]:
# %%capture
# !git clone https://github.com/predicthq/phq-data-science-docs.git
# %cd phq-data-science-docs/unattended-events
# !pip install predicthq timezonefinder


If running locally, set up a Python environment using `requirements.txt` shared alongside the notebook to install the required dependencies. 



In [238]:
from predicthq import Client
from timezonefinder import TimezoneFinder
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import pandas as pd
from datetime import timedelta

import requests
import json

import asyncio
import aiohttp

import uvloop
import iso8601  
import backoff

# To display more columns in the dataframe
pd.set_option("display.max_columns", 50)

<a id='csv_sdk'></a>
# SDK or CSV Data Access
This notebook can be run either with the example CSV data provided or, if you have access to the PredictHQ API, with the code provided to call the SDK for the locations of interest to you.

As using the SDK is not the focus of Part 3, a function is created to call the SDK. For guidance on how to use the SDK refer to Part 1. If you do not have access to the SDK the notebook also works with a number of CSV files that are provided alongside the notebook. 

In [188]:
# Set whether to run with SDK or using provided CSV files
# Set to either 'CSV' or 'SDK'
RUN_SETTING = "SDK"

if RUN_SETTING == "SDK":
    # Replace Access Token with own access token.
    ACCESS_TOKEN = ''
    phq = Client(access_token=ACCESS_TOKEN)


def query_unattended_events(
    start_time,
    end_time,
    radius,
    radius_unit,
    latitude,
    longitude,
    categories,
):
    """
    Query Non-Attendance-Based Events based on time, location and category info.
    Args:
        start_time: start of the period for querying Non-Attendance-Based Events.
                    Format "YYYY-MM-DD"
        end_time: end of the period for querying Non-Attendance-Based Events.
                    Format of "YYYY-MM-DD"
        radius: radius for querying Non-Attendance-Based Events.
        radius_unit: unit of the radius.
        latitude: latitude of the interested location.
        longitude: longitude of the interested location.
        categories: list of categories, such as
                                    ['school-holidays',
                                    'public-holidays',
                                    'observances']
    return:
        event_df: pandas DataFrames of non-attendance-based events
    """
    within = f"{radius}{radius_unit}@{latitude},{longitude}"
    timezone = TimezoneFinder().timezone_at(lat=latitude, lng=longitude)

    params = {
        "active__gte": start_time,
        "active__lte": end_time,
        "active__tz": timezone,
        "within": within,
        "category": categories,
    }

    result_list = []

    # Iterating through all the events that match our criteria and
    # adding them to our result_list
    for event in phq.events.search(params).iter_all():
        result_list.append(event.to_dict())

    event_df = pd.DataFrame(result_list)
    # Selecting the target fields
    event_df = event_df[
        [
            "id",
            "title",
            "description",
            "start",
            "end",
            "duration",
            "category",
            "labels",
            "country",
            "rank",
            "aviation_rank",
            "location",
            "place_hierarchies",
            "scope",
            "first_seen",
        ]
    ]
    return event_df

<a id='query_unattended_events'></a>
# Query Non-Attendance-Based Events

In [223]:
start_time = "2020-01-01"
end_time = "2020-12-31"
radius = 10
radius_unit = "km"
latitude, longitude = (34.07, -118.25)
categories = ["school-holidays", "public-holidays", "observances"]
file_name = f"data/event_data/radius{radius}{radius_unit}_{latitude}_{longitude}_{start_time}_{end_time}.csv"

if RUN_SETTING == "SDK":
    event_df = query_unattended_events(
        start_time,
        end_time,
        radius,
        radius_unit,
        latitude,
        longitude,
        categories,
    )
    
    event_df.to_csv(
        file_name,
        index=False,
    )
elif RUN_SETTING == "CSV":
    event_df = pd.read_csv(file_name)
    # Parse the stringified label lists; literal_eval only accepts Python literals
    import ast
    event_df["labels"] = event_df["labels"].apply(ast.literal_eval)
    event_df["start"] = pd.to_datetime(event_df["start"])
    event_df["end"] = pd.to_datetime(event_df["end"])
else:
    raise ValueError("RUN_SETTING must be set to either 'SDK' or 'CSV'")

event_df = event_df.sort_values("start")


# Non-Attendance-Based Events Aggregations

In [191]:
# Set the measurement for aggregation and feature engineering
# Set either to "phq_rank", "aviation_rank", "event_count", "event_flag"
impact_measurement = "phq_rank"

# Filter events based on the labels of interest.
# The labels are given as a set, such as {"holiday-national", "observance-worldwide"}.
# Set to None when all labels are required.
label_set = None

# For multi-day events, the days with impact are specified as a set of indices.
# Each index is relative to either the start or the end of the event:
# from the start, the indices are 0, 1, 2, ...;
# from the end, the indices are -1, -2, ...
# Set to None when all of the days have the same impact.
# The impact_pattern is only applied to multi-day events.
impact_pattern = None


def unattended_events_aggregation_and_features(
    event_df,
    impact_measurement="phq_rank",
    label_set=None,
    impact_pattern=None,
):
    event_split = []

    for _, row in event_df.iterrows():
        start_date = row["start"].date()
        end_date = row["end"].date()
        event_duration = (end_date - start_date).days + 1

        # Ignore event if labels don't match what we've specified
        if label_set is not None and not label_set.intersection(set(row["labels"])):
            continue

        # For multi-day events we may want to consider only specific days
        for day in range(event_duration):
            if (
                (event_duration == 1)
                or (impact_pattern is None)
                or (day in impact_pattern)
                or ((day - event_duration) in impact_pattern)
            ):
                event_split.append(
                    {
                        "date": start_date + timedelta(days=day),
                        "category": row["category"],
                        "rank": row["rank"],
                        "aviation_rank": row["aviation_rank"],
                        "count": 1,
                    }
                )

    event_split = pd.DataFrame(event_split)
    
    event_split_count = (
        event_split.groupby(["date", "category"])
        .agg(
            phq_rank=("rank", "max"),
            aviation_rank=("aviation_rank", "max"),
            event_count=("count", "sum"),
            event_flag=("count", "max"),
        )
        .reset_index()
    )
    
    event_split_count = event_split_count.pivot(
        index="date", columns="category", values=impact_measurement
    ).fillna(0)
    
    event_count_per_day = pd.DataFrame(
        index=pd.date_range(start=start_time, end=end_time)
    )
    
    event_count_per_day = pd.merge(
        event_count_per_day,
        event_split_count,
        how="left",
        left_index=True,
        right_index=True,
    ).fillna(0)

    return event_count_per_day

# Non-Attendance-Based Event Aggregations

<a id='case_1'></a>
## Case 1: 
For this case study, consider the example of a coffee shop. Most Non-Attendance-Based Events nearby are likely to impact demand. PHQ Rank is used to indicate the relative impact of different events. The events are aggregated based on PHQ Rank. 

When there are multiple events from the same category, the event with the largest rank is assumed to have the leading impact. Thus, the maximum PHQ rank from each category is used as a feature.

For multi-day events, each day has an equal impact on demand.
    
Given the above, the parameters are set to:
- ```impact_measurement = "phq_rank"```\
   Use the maximum rank in each category as impact.
- ```label_set = None```\
   Include all Non-Attendance-Based Events.
- ```impact_pattern = None```\
   Each day has the same impact for multiple day events.
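
To make the "maximum rank per category" idea concrete, here is a small sketch on toy data. The rows and rank values are invented for the example; only the group-then-pivot pattern is the point.

```python
import pandas as pd

# Toy per-day event rows (hypothetical values, not PredictHQ data)
toy = pd.DataFrame(
    [
        {"date": "2020-01-01", "category": "public-holidays", "rank": 90},
        {"date": "2020-01-01", "category": "observances", "rank": 50},
        {"date": "2020-01-01", "category": "observances", "rank": 30},
        {"date": "2020-01-02", "category": "school-holidays", "rank": 86},
    ]
)

# Keep the largest rank per (date, category), then spread categories into columns
daily_max = (
    toy.groupby(["date", "category"])["rank"]
    .max()
    .unstack(fill_value=0)
)
print(daily_max)
```

On 2020-01-01 the two observances collapse into a single value of 50, and days or categories without events are filled with 0.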


In [192]:
df_case_1 = unattended_events_aggregation_and_features(
    event_df, impact_measurement="phq_rank", label_set=None, impact_pattern=None
)

df_case_1

Unnamed: 0,observances,public-holidays,school-holidays
2020-01-01,0.0,90.0,90.0
2020-01-02,0.0,0.0,90.0
2020-01-03,0.0,0.0,90.0
2020-01-04,50.0,0.0,90.0
2020-01-05,0.0,0.0,90.0
...,...,...,...
2020-12-27,0.0,0.0,90.0
2020-12-28,0.0,0.0,90.0
2020-12-29,0.0,0.0,90.0
2020-12-30,0.0,0.0,90.0


### Features API Example for Case 1

The Features API is an alternative endpoint provided by PredictHQ that aggregates event data to simplify the preparation of features for machine learning models. More documentation can be found at https://docs.predicthq.com/start/features-api/. As an example, the data preparation for Case 1 is repeated using the Features API.

The Features API uses predefined aggregations. For public-holidays, school-holidays and observances, it reports the PHQ Rank Level of events rather than their PHQ Rank. This means the Features API returns different feature values from the method above.

The Features API can be queried for a date range of up to 90 days per request. To cover a wider date range, multiple requests are sent. Here is an example that uses asyncio to do this.
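
Independently of the asyncio code, the date-splitting step can be sketched on its own. The helper below, `split_date_range`, is hypothetical and assumes the limit means at most 90 calendar days per window with both ends inclusive; check the exact limit against the Features API documentation.

```python
from datetime import date, timedelta


def split_date_range(start, end, max_days=90):
    """Split [start, end] into consecutive windows of at most max_days days.

    Both ends of every window are inclusive.
    """
    windows = []
    window_start = start
    while window_start <= end:
        window_end = min(window_start + timedelta(days=max_days - 1), end)
        windows.append((window_start, window_end))
        window_start = window_end + timedelta(days=1)
    return windows


windows = split_date_range(date(2020, 1, 1), date(2020, 12, 31))
for window_start, window_end in windows:
    print(window_start.isoformat(), window_end.isoformat())
```

For the year 2020 this yields five windows, the last of which is shorter than the others. Each window can then be sent as a separate request.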

In [278]:
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
    'Authorization': f'Bearer {ACCESS_TOKEN}'
}

url = 'https://api.predicthq.com/v1/features'

asyncio.set_event_loop(uvloop.new_event_loop())

For this case, we consider public-holidays, school-holidays and observances. All of them are based on the PHQ Rank. In the Features API, the corresponding features are:
- phq_rank_public_holidays
- phq_rank_school_holidays
- phq_rank_observances

The location can be specified by latitude and longitude together with a radius, or by place ID as in the query below. For `stats`, we use the default setting, which returns the count of events in each `phq_rank` bucket.
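
To give a feel for what those per-bucket counts look like, the sketch below works on a single hypothetical daily record. Its shape (feature name, then `rank_levels`, then a count per level) is inferred from how this notebook reads the response, and the level keys `"1"` to `"5"` are an assumption. The helper `highest_rank_level` is likewise illustrative.

```python
# Hypothetical single-day record (shape assumed for illustration)
record = {
    "date": "2020-01-04",
    "phq_rank_observances": {
        "rank_levels": {"1": 0, "2": 1, "3": 2, "4": 0, "5": 0}
    },
}


def highest_rank_level(record, feature_key):
    """Return the highest rank level with at least one event, or 0 if none."""
    rank_levels = record[feature_key]["rank_levels"]
    present = [int(level) for level, count in rank_levels.items() if count > 0]
    return max(present, default=0)


print(highest_rank_level(record, "phq_rank_observances"))
```

Here level 3 is returned: it is the highest level with a non-zero count, regardless of how many events sit in the lower levels.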

In [326]:
# We can use the place_id for query.
query = {
    'location': {
        'place_id': [5368361] 
        },
    'phq_rank_observances': True,
    'phq_rank_public_holidays': True,
    'phq_rank_school_holidays': True
}

In [292]:
format_date = lambda x: x.strftime('%Y-%m-%d')
parse_date = lambda x: iso8601.parse_date(x)

# The API has rate limits so failed requests should be retried automatically
@backoff.on_exception(backoff.expo, (aiohttp.ClientError), max_time=60)
async def get(
    session: aiohttp.ClientSession,
    query: dict,
    start: str,
    end: str,
    **kwargs ) -> dict:
    
    payload = {
        'active': {
            'gte': start,
            'lte': end
        },
        **query
    }

    resp = await session.request('POST', url=url, headers=headers, raise_for_status=True, json=payload, **kwargs)
    data = await resp.json()
    return data

In [293]:
async def gather_with_concurrency(n, *tasks):
    semaphore = asyncio.Semaphore(n)

    async def sem_task(task):
        async with semaphore:
            return await task
    
    return await asyncio.gather(*(sem_task(task) for task in tasks))


async def gather_stats(query: dict, start_date: str, end_date: str, **kwargs):
    date_ranges = []
    start_date = parse_date(start_date)
    end_date = parse_date(end_date)
    start_ref = start_date
    
    while start_ref + timedelta(days=90) < end_date:
        date_ranges.append({'start': format_date(start_ref), 
                            'end': format_date(start_ref + timedelta(days=90))})
        start_ref = start_ref + timedelta(days=91)
    
    date_ranges.append({'start': format_date(start_ref), 
                        'end': format_date(end_date)})
    
    async with aiohttp.ClientSession() as session:
        tasks = []
        for date_range in date_ranges:
            tasks.append(
                get(
                    session=session,
                    query=query,
                    start=date_range['start'],
                    end=date_range['end'], **kwargs))
        
        responses = await gather_with_concurrency(5, *tasks)
        results = []
        
        for response in responses:
            results.extend(response['results'])
        
        return results

In [294]:
responses = await gather_stats(
    query=query,
    start_date=start_time,
    end_date=end_time)

In [295]:
def get_phq_rank_max(record, feature_key):
    """Get the maximum PHQ Rank Level for a non-attendance-based category.

    Args:
        record: a daily record fetched from the Features API
        feature_key: feature name of the non-attendance-based category
    return:
        The highest PHQ Rank Level with at least one event, or 0 if there are none
    """
    # Event counts per rank level, assumed to be ordered from the lowest level up
    rank_count_list = list(record.get(feature_key).get('rank_levels').values())

    rank_max = 0
    for level, count in enumerate(rank_count_list, start=1):
        if count > 0:
            # Later (higher) levels overwrite earlier ones
            rank_max = level

    return rank_max
    

Here, the maximum PHQ Rank Level of each category is used as the feature.

In [304]:
event_list = []
for record in responses:
    event_dict = {'date': record.get('date'),
                  'observances': get_phq_rank_max(record, 'phq_rank_observances'), 
                  'public-holidays': get_phq_rank_max(record, 'phq_rank_public_holidays'), 
                  'school-holidays': get_phq_rank_max(record, 'phq_rank_school_holidays')}
    event_list.append(event_dict)
    
df_case_1_featureapi = pd.DataFrame(event_list)
df_case_1_featureapi

Unnamed: 0,date,observances,public-holidays,school-holidays
0,2020-01-01,0,5,5
1,2020-01-02,0,0,5
2,2020-01-03,0,0,5
3,2020-01-04,3,0,5
4,2020-01-05,0,0,5
...,...,...,...,...
361,2020-12-27,0,0,5
362,2020-12-28,0,0,5
363,2020-12-29,0,0,5
364,2020-12-30,0,0,5


<a id='case_2'></a>
## Case 2: 
For this case study, consider a DIY shop. They might notice that certain kinds of Non-Attendance-Based Events drive positive impact on footfall and sales.

In this case, the impact of events is measured by the number of events happening on each day. Based on previous experience, the shop owner might notice that sales increase on religious holidays, nationwide holidays and school holidays. Thus, only events from the related labels are included. 

The duration of an event does not affect sales.
    
Given the above, the parameters are set to:
- ```impact_measurement = "event_count"``` \
    Event count is relevant to the business. 
- ```label_set = {'holiday-national', 'observance-united-nations', 'holiday-religious', 'school'}``` \
    Interested in events with certain labels. 
- ```impact_pattern = None```  \
    Each day of a multi-day event has an equal impact. 
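
The label filter and the per-day count can be sketched with plain Python as follows. The event rows are hypothetical; the rule itself (keep an event when it shares at least one label with `label_set`) mirrors the set intersection used in `unattended_events_aggregation_and_features`.

```python
from collections import Counter
from datetime import date

# Hypothetical per-day event rows (illustrative values only)
events = [
    {"date": date(2020, 12, 18), "labels": ["holiday-religious"]},
    {"date": date(2020, 12, 18), "labels": ["observance-united-nations"]},
    {"date": date(2020, 12, 18), "labels": ["observance-worldwide"]},
    {"date": date(2020, 12, 20), "labels": ["school"]},
]

label_set = {
    "holiday-national",
    "observance-united-nations",
    "holiday-religious",
    "school",
}

# An event is kept when it shares at least one label with label_set
kept = [e for e in events if label_set.intersection(e["labels"])]

# Event count per day
event_count = Counter(e["date"] for e in kept)
print(event_count[date(2020, 12, 18)], event_count[date(2020, 12, 20)])
```

The event labelled only `observance-worldwide` is dropped, so 18 December counts two events rather than three.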

In [251]:
df_case_2 = unattended_events_aggregation_and_features(
    event_df,
    impact_measurement="event_count",
    label_set={
        "holiday-national",
        "observance-united-nations",
        "holiday-religious",
        "school",
    },
    impact_pattern=None,
)

df_case_2[df_case_2['observances']!=0]

Unnamed: 0,observances,public-holidays,school-holidays
2020-01-06,1.0,0.0,1.0
2020-01-07,1.0,0.0,1.0
2020-01-14,1.0,0.0,0.0
2020-01-27,1.0,0.0,0.0
2020-02-04,1.0,0.0,0.0
...,...,...,...
2020-12-11,2.0,0.0,0.0
2020-12-12,1.0,0.0,0.0
2020-12-18,3.0,0.0,0.0
2020-12-20,1.0,0.0,1.0


<a id='case_3'></a>
## Case 3: 
For this case study, consider a transportation business such as an airline. Aviation Rank is therefore the relevant measure of event impact.

This kind of business is highly likely to notice that events with 'holiday-national' and 'school' labels impact its demand.

For multi-day events, especially school holidays, the beginning and the end of the holiday are busier than usual.
    
Given the above, the parameters are set to:
- ```impact_measurement = "aviation_rank"``` \
  Aviation Rank is relevant to customers' business. 
- ```label_set = {'holiday-national', 'school'}```\
  Interested in events with certain labels. 
- ```impact_pattern = {0,1,-2,-1}```\
  Only the first two and last two days of a multi-day event have an impact.
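
The way an index set such as `{0, 1, -2, -1}` selects days can be sketched in isolation. The `impact_days` helper and the holiday dates are hypothetical; the selection rule follows the one in `unattended_events_aggregation_and_features`.

```python
from datetime import date, timedelta


def impact_days(start, end, impact_pattern=None):
    """Return the dates of an event that are kept under an impact pattern.

    Non-negative indices count from the start (0 is the first day);
    negative indices count from the end (-1 is the last day).
    """
    duration = (end - start).days + 1
    kept = []
    for day in range(duration):
        if (
            duration == 1
            or impact_pattern is None
            or day in impact_pattern
            or (day - duration) in impact_pattern
        ):
            kept.append(start + timedelta(days=day))
    return kept


# Hypothetical school holiday lasting twelve days
days = impact_days(date(2020, 4, 6), date(2020, 4, 17), {0, 1, -2, -1})
print([d.isoformat() for d in days])
```

Only 6, 7, 16 and 17 April are kept; the eight days in between contribute nothing to the feature.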

In [40]:
df_case_3 = unattended_events_aggregation_and_features(
    event_df,
    impact_measurement="aviation_rank",
    label_set={"holiday-national", "school"},
    impact_pattern={0, 1, -2, -1},
)

df_case_3

Unnamed: 0,public-holidays,school-holidays
2020-01-01,100.0,0.0
2020-01-02,0.0,0.0
2020-01-03,0.0,0.0
2020-01-04,0.0,0.0
2020-01-05,0.0,0.0
...,...,...
2020-12-27,0.0,0.0
2020-12-28,0.0,0.0
2020-12-29,0.0,0.0
2020-12-30,0.0,0.0


<a id='compare'></a>
# Compare Features

By comparing the three different approaches taken in this notebook, you can see how different the features are. These three different cases all use the same underlying data but are aggregated with a different use case in mind.

In [322]:
def plot_difference_per_category(feature_column):
    """Plot one category's feature under each aggregation approach.

    Args:
        feature_column: category column to plot, such as "public-holidays"
    """
    # Case 3 only has a column for categories that survive its label filter
    sub_plot_count = 4 if feature_column in df_case_3.columns else 3
    fig = make_subplots(rows=sub_plot_count, cols=1, shared_xaxes=True)

    fig.add_trace(
        go.Scatter(
            x=df_case_1.index,
            y=df_case_1[feature_column],
            name=f"Case1_{feature_column}",
            mode="lines+markers",
        ),
        row=1,
        col=1,
    )
    
    fig.add_trace(
        go.Scatter(
            x=df_case_1_featureapi.date,
            y=df_case_1_featureapi[feature_column],
            name=f"Case1_featureapi_{feature_column}",
            mode="lines+markers",
        ),
        row=2,
        col=1,
    )

    fig.add_trace(
        go.Scatter(
            x=df_case_2.index,
            y=df_case_2[feature_column],
            name=f"Case2_{feature_column}",
            mode="lines+markers",
        ),
        row=3,
        col=1,
    )

    if feature_column in df_case_3.columns:
        fig.add_trace(
            go.Scatter(
                x=df_case_3.index,
                y=df_case_3[feature_column],
                name=f"Case3_{feature_column}",
                mode="lines+markers",
            ),
            row=4,
            col=1,
        )
    # Put the date axis title on the bottom subplot, whose index equals the row count
    fig.update_layout(
        title=f"Non-Attendance-Based Events Aggregation Comparison - {feature_column}",
        title_x=0.45,
        yaxis2=dict(title="Feature Value"),
        **{f"xaxis{sub_plot_count}": dict(title="Date")},
    )
    
    fig.show()

In [323]:
plot_difference_per_category("public-holidays")

In [324]:
plot_difference_per_category("observances")

In [325]:
plot_difference_per_category("school-holidays")