# Impact Analyzer

This script reproduces the Impact Analyzer numbers from the data in VBD Actuals / Scenario Planner Actuals.

The direct data export from Impact Analyzer gives exactly the numbers that drive the plots. The VBD data is larger and allows you to select different time ranges or different subsets than the Pega UI offers.

Caveats:

1. Impact Analyzer only looks at *active* actions. This notion of active / not active is not in the VBD data, or at least is not currently used by this script.

This script is a work in progress. It currently reproduces only the impression counts. Value, lift and uncertainties still need to be added.

In [None]:
import polars as pl
import pathlib
import pdstools
import polars.selectors as cs
from pdstools import ImpactAnalyzer
from pathlib import Path

In [None]:
# Replace with your own file. This example is part of the PDS tools repository. TODO: just pull from the GH site directly.

sample_pdc_data = Path("~/dev/pega-datascientist-tools/data/ia/CDH_Metrics_ImpactAnalyzer.json").expanduser()

The raw input data looks like this. The PDC format is very verbose and contains more fields than the ones shown here.

In [None]:
ImpactAnalyzer.from_pdc(
    sample_pdc_data, return_df=True
).head(10).collect()

When reading from PDC, our ImpactAnalyzer class keeps only the counts of impressions and accepts and the action value per impression, and re-calculates all derived values on demand. It drops inactive experiments and adds rows for the "NBA" group. The "All channels" entry is dropped. ValueLift and ValueLiftInterval are copied from the PDC data, as they currently cannot be re-calculated from the available raw numbers (ValuePerImpression is empty).

In [None]:
ia = ImpactAnalyzer.from_pdc(
    sample_pdc_data,
)
ia.ia_data.head(10).collect().to_pandas().style

All channels

In [None]:
ia.summarize_control_groups().collect()

In [None]:
ia.summarize_experiments().collect()

There are convenient summarization functions that pivot the lift metrics overall or per channel.

In [None]:
ia.overall_summary().collect()

In [None]:
ia.summary_by_channel().collect()

There is also some (basic) support for plotting.

In [None]:
ia.plot.overview()

In [None]:
ia.plot.trend()

# TODO

The above seems OK, although control and test may still be mixed up.

- Double check the numbers against the PDC data also in tests
- check All Channel values
- Check why there is a null channel
- resurrect the calculations from VBD and/or data extracts from the IA UI
- put in the client intelligence code for the summaries, generate cache over all clients, not sure about summaries, could be done on-the-fly
- make sample notebook more useful with some other sample data




# Data

Export the VBD Actuals dataset (in production) or VBD Scenario Planner Actuals (in BOE) from Pega Dev or App Studio.

In [None]:
from pdstools import read_ds_export
from pdstools.utils import cdh_utils
# vbd_export_path = pathlib.Path(
#     "~/Downloads/Data-pxStrategyResult_ActualsExport_20240221T204009_GMT.zip"
# ).expanduser()

# GRR NONE of these have the required fields - we could assert this...

vbd_export_path = pathlib.Path(
    "~/Library/CloudStorage/OneDrive-PegasystemsInc/AI Chapter/projects/Impact Analyzer/VBD_Exports/Data-pxStrategyResult_ScenarioPlannerActuals_20220720T143616_GMT/data.json"
).expanduser()

vbd_export = read_ds_export(vbd_export_path).with_columns(
    cdh_utils.parse_pega_date_time_formats("pxOutcomeTime").dt.date(),
)

In [None]:
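# Sorted column names; collect_schema() avoids the warning that .columns gives on a LazyFrame
cols = sorted(vbd_export.collect_schema().names())
cols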

Fix up some data. **FinalPropensity** is not always present in IH; it is only there through customization or from v xxx onwards.

In [None]:
if "FinalPropensity" not in vbd_export.columns:
    # Typed null column, so the sums and averages further down still work
    vbd_export = vbd_export.with_columns(FinalPropensity=pl.lit(None, dtype=pl.Float64))

# Control Groups in Impact Analyzer

IA uses **pyReason**, **MktType**, **MktValue** and **ModelControlGroup** to define the various experiments. For the standard NBA decisions (no experiment), values are left empty (null). 

Before Impact Analyzer, or when it is turned off, Predictions from Prediction Studio manage two groups through the **ModelControlGroup** property. A value of **Test** is used for model-driven arbitration, **Control** for the random control group (2% by default).

When IA is on, the distinct values from just **MktValue** are sufficient to identify the different experiments. In the future, more and custom experiments may be supported.

For the full NBA interactions the value of the marker fields is left empty.
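
As a rough illustration of the rules above, the sketch below labels a decision record with its group. The records and the `experiment_group` helper are made up for this example and the combinations of values are arbitrary; only the field names and the two `NBAHealth_...` values appear in this notebook. It assumes an empty marker is represented as `None`.

```python
from collections import Counter

# Made-up records; real VBD rows carry many more fields.
records = [
    {"MktValue": None, "ModelControlGroup": "Test"},
    {"MktValue": None, "ModelControlGroup": "Test"},
    {"MktValue": "NBAHealth_PropensityPriority", "ModelControlGroup": "Test"},
    {"MktValue": "NBAHealth_ModelControl_2", "ModelControlGroup": "Control"},
]


def experiment_group(record, ia_enabled=True):
    """Label a record with its group (hypothetical helper)."""
    if ia_enabled:
        # With IA on, MktValue alone identifies the experiment;
        # an empty value means a standard NBA decision.
        return record["MktValue"] or "NBA"
    # With IA off, only the Prediction Studio Test/Control split exists.
    return record["ModelControlGroup"]


counts_ia_on = Counter(experiment_group(r) for r in records)
counts_ia_off = Counter(experiment_group(r, ia_enabled=False) for r in records)
print(counts_ia_on)
print(counts_ia_off)
```

The same idea applies to the aggregation further down, where the marker fields are simply used as grouping keys.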

TODO: NBAHealth_ModelControl_2 is conceptually the same as NBAHealth_PropensityPriority and will be phased out in Pega 24.1/24.2. 


# No-Action 

The usage of "Default" issues and groups indicates that there is no action. These need to be filtered out for proper reporting.
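
A minimal sketch of such a filter on plain Python dictionaries, assuming the no-action rows are exactly those with a `Default` issue or group. The rows and the `is_no_action` helper are invented for illustration; only the field names are taken from the data.

```python
# Invented rows; only the field names come from the VBD export.
rows = [
    {"pyIssue": "Sales", "pyGroup": "Cards", "AggregateCount": 10},
    {"pyIssue": "Default", "pyGroup": "Default", "AggregateCount": 4},
    {"pyIssue": "Retention", "pyGroup": "Default", "AggregateCount": 1},
]


def is_no_action(row):
    """True when the issue or the group is the 'Default' placeholder."""
    return row["pyIssue"] == "Default" or row["pyGroup"] == "Default"


# Rows that represent real actions and should be kept for reporting
real_actions = [r for r in rows if not is_no_action(r)]

# Share of the action volume that would be dropped by the filter
total = sum(r["AggregateCount"] for r in rows)
dropped = sum(r["AggregateCount"] for r in rows if is_no_action(r))
no_action_share = dropped / total
print(len(real_actions), f"{no_action_share:.1%}")
```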

TODO: should we exclude these from analysis?

TODO: what about things with inactive status? And how can we know?

In [None]:
vbd_export.group_by(["pyChannel", "pyDirection", "pyIssue", "pyGroup"]).agg(
    pl.len().alias("VBD Records"),  # pl.count() is deprecated in recent Polars
    pl.col("AggregateCount").cast(pl.Int64).sum().alias("Actions"),
).with_columns(
    (pl.col("VBD Records") / pl.sum("VBD Records"))
    .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
    .alias("VBD Records Percentage (per channel)"),
    (pl.col("Actions") / pl.sum("Actions"))
    .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
    .alias("Actions Percentage (per channel)"),
).filter(
    (pl.col("pyIssue") == "Default") | (pl.col("pyGroup") == "Default")
).collect()

# Lookback Period

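Impact Analyzer counts its lookback period back from today's date, even when the data is older. The cell below therefore computes the start of the window both from today and from the last date in the data, and uses the latter.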
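
The difference between the two reference points can be sketched with the standard library. The dates are made up, and `window_start` is a hypothetical helper that handles only day-based lookbacks, unlike the richer Polars offset strings.

```python
from datetime import date, timedelta


def window_start(reference, lookback_days):
    """Start of a lookback window that ends at `reference` (hypothetical helper)."""
    return reference - timedelta(days=lookback_days)


# Made-up outcome dates from an old export.
outcome_dates = [date(2022, 6, 1), date(2022, 7, 1), date(2022, 7, 20)]
today = date(2024, 2, 21)  # fixed here so the example is reproducible

from_today = window_start(today, 51)
from_last_date = window_start(max(outcome_dates), 51)

# With old data, a window anchored on today contains no records at all
in_window_today = [d for d in outcome_dates if d >= from_today]
in_window_data = [d for d in outcome_dates if d >= from_last_date]

print(from_today, len(in_window_today))
print(from_last_date, len(in_window_data))
```

Anchoring the window on the last date in the data keeps an old export usable, at the cost of no longer matching what the UI would show today.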

In [None]:
from datetime import datetime


lookback = "-51d"  # "-1mo", "-2y", "-2w" etc https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.dt.offset_by.html

AvailableDates = (
    vbd_export.select(
        From=pl.col("pxOutcomeTime").min(),
        To=pl.col("pxOutcomeTime").max(),
        LookbackFromLastDateInData=pl.col("pxOutcomeTime").max().dt.offset_by(lookback),
        LookbackFromNow=pl.lit(datetime.now()).dt.offset_by(lookback).dt.date(),
    ).collect()
    # .item()
)

lookback_time = AvailableDates.select("LookbackFromLastDateInData").item()

AvailableDates

# Impact Analyzer counts by Channel

In [None]:
optional_mcg = (
    ["ModelControlGroup"] if "ModelControlGroup" in vbd_export.columns else []
)

ia_summary_by_channel = (
    vbd_export.filter(pl.col("pxOutcomeTime") >= lookback_time)
    .group_by(
        [
            "pyChannel",
            "pyDirection",
            "MktType",
            "MktValue",
            "pyReason",
            "pyOutcome",
        ]
        + optional_mcg
    )
    .agg(
        pl.col("pxOutcomeTime").max().alias("Most Recent Update"),
        pl.len().alias("VBD Records"),  # pl.count() is deprecated in recent Polars
        pl.col("AggregateCount").cast(pl.Int64).sum().alias("Actions"),
        pl.sum("FinalPropensity"),
        pl.sum("pyPropensity"),
        pl.sum("pxPriority"),
    )
    .with_columns(
        (pl.col("VBD Records") / pl.sum("VBD Records"))
        .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
        .alias("VBD Records Percentage (per channel)"),
        (pl.col("Actions") / pl.sum("Actions"))
        .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
        .alias("Actions Percentage (per channel)"),
        (pl.col("FinalPropensity") / pl.col("Actions")).alias("Avg FinalPropensity"),
        (pl.col("pyPropensity") / pl.col("Actions")).alias("Avg pyPropensity"),
        (pl.col("pxPriority") / pl.col("Actions")).alias("Avg pxPriority"),
    )
    .drop(["FinalPropensity", "pyPropensity", "pxPriority"])
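    # NOTE: default_ia_experiments is not defined in this notebook. It must be a
    # DataFrame keyed by MktValue, MktType and pyReason that supplies the
    # Experiment and Description columns used below.
    .join(
        default_ia_experiments.lazy(),
        how="left",
        on=["MktValue", "MktType", "pyReason"],
        nulls_equal=True,
    )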
    .sort(
        [
            "pyChannel",
            "pyDirection",
            "Experiment",
        ]
        + optional_mcg,
        nulls_last=True,
    )
)


def highlight(s):
    if s.Experiment is None:
        return ["background-color: orange"] * len(s)
    else:
        return ["background-color: white"] * len(s)


ia_summary_by_channel_formatted = (
    ia_summary_by_channel.filter(pl.col("pyChannel") == "Web")
    .collect()
    .to_pandas()
    .style.format(
        {
            "Avg FinalPropensity": "{:.2%}",
            "Avg pyPropensity": "{:.2%}",
            "Avg pxPriority": "{:.3f}",
            "VBD Records Percentage (per channel)": "{:.2%}",
            "Actions Percentage (per channel)": "{:.2%}",
            "Most Recent Update": "{:%d %b '%y}",
        }
    )
    .hide(axis="index")
    .hide(["Description", "VBD Records Percentage (per channel)"], axis="columns")
    .set_caption("Experiment Summary:")
    .apply(highlight, axis=1)
)

ia_summary_by_channel_formatted

# KPIs per Channel

In [None]:
# ia_summary_by_channel.collect().pivot("pyOutcome")
# xxx = set(ia_summary_by_channel.columns)
# xxx.remove('pyOutcome')
import pandas as pd

group_by_cols = [
    "pyChannel",
    "pyDirection",
    # "MktType",
    "Experiment",
    "MktValue",
    # "pyReason",
] + optional_mcg

engagement_overview = (
    ia_summary_by_channel.collect()
    .pivot(
        on="pyOutcome",  # "columns=" was renamed to "on=" in Polars 1.0
        index=group_by_cols,
        values="Actions",
        aggregate_function="sum",
        sort_columns=True,
    )
    .with_columns(cs.numeric().fill_null(0))
)

positive_labels = [
    label
    for label in ["Clicked", "Accept", "Accepted"]
    if label in engagement_overview.columns
]
negative_labels = [
    label for label in ["Impression"] if label in engagement_overview.columns
]

pos_expr = pl.lit(0.0)
for label in positive_labels:
    pos_expr = pos_expr + pl.col(label)
neg_expr = pl.lit(0.0)
for label in negative_labels:
    neg_expr = neg_expr + pl.col(label)


engagement_overview = (
    engagement_overview.with_columns(
        Positives=pos_expr.cast(pl.Int64),
        Negatives=neg_expr.cast(pl.Int64),
        CTR=(pos_expr / (pos_expr + neg_expr)),
    ).with_columns(
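        CTR_Lift_vs_NBA=(
            pl.col("CTR")
            # CTR of the NBA row (null MktValue); .first() makes it a scalar
            # that broadcasts over the rows of each channel/direction window
            / pl.col("CTR").filter(pl.col("MktValue").is_null()).first()
            - 1.0
        ).over(["pyChannel", "pyDirection"])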
    )
    # .with_columns(
    #     CTR_Lift_vs_NBA=(pl.col("CTR")
    #     / (pl.col("CTR").filter(pl.col("MktValue").is_null()))
    #     - 1.0).over(["pyChannel", "pyDirection"])
    # )
    .sort(["pyChannel", "pyDirection", "Experiment"] + optional_mcg)
)


def set_background_col(s, color):
    return "background-color: %s" % color


def set_font_weight(val, weight="bold"):
    return "font-weight: %s" % weight


engagement_overview.to_pandas().style.hide().applymap(
    set_background_col,
    subset=pd.IndexSlice[:, positive_labels + ["Positives"]],
    color="mediumseagreen",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, negative_labels + ["Negatives"]],
    color="tomato",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, ["CTR", "CTR_Lift_vs_NBA"]],
    color="orange",
).applymap(
    set_font_weight, subset=pd.IndexSlice[:, ["pyChannel", "pyDirection", "Experiment"]]
).format(
    {"CTR": "{:,.3%}".format}
)

# Aggregated over all channels

In [None]:
import pandas as pd


top_level_dashboard = (
    engagement_overview.group_by(["Experiment", "MktValue"] + optional_mcg)
    .agg(cs.numeric().sum())
    .with_columns(
        Positives=pos_expr.cast(pl.Int64),
        Negatives=neg_expr.cast(pl.Int64),
        CTR=(pos_expr / (pos_expr + neg_expr)),
    )
    .with_columns(
        # .first() reduces the NBA row (null MktValue) to a scalar baseline
        CTR_Lift_vs_NBA=pl.col("CTR")
        / pl.col("CTR").filter(pl.col("MktValue").is_null()).first()
        - 1.0
    )
    .sort(["Experiment"] + optional_mcg)
)

top_level_dashboard.to_pandas().style.hide().applymap(
    set_background_col,
    subset=pd.IndexSlice[:, positive_labels + ["Positives"]],
    color="mediumseagreen",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, negative_labels + ["Negatives"]],
    color="tomato",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, ["CTR", "CTR_Lift_vs_NBA"]],
    color="orange",
).applymap(
    set_font_weight, subset=pd.IndexSlice[:, ["Experiment"]]
).format(
    {"CTR": "{:,.3%}".format}
)

# Lift

Engagement Lift is calculated as (SuccessRate(test) - SuccessRate(control))/SuccessRate(control)

Value Lift is calculated as (ValueCapture(test) - ValueCapture(control))/ValueCapture(control)
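
A small worked example of both formulas, with invented counts. The `lift` helper is hypothetical. Taking the success rate as accepts divided by impressions, and the value capture as total value divided by impressions, are assumptions made for this sketch.

```python
def lift(test, control):
    """Relative lift of a test metric over a control metric."""
    if control == 0:
        return None  # lift is undefined without a control baseline
    return (test - control) / control


# Invented numbers for one experiment.
test = {"impressions": 1000, "accepts": 50, "value": 200.0}
control = {"impressions": 500, "accepts": 20, "value": 80.0}

engagement_lift = lift(
    test["accepts"] / test["impressions"],
    control["accepts"] / control["impressions"],
)
value_lift = lift(
    test["value"] / test["impressions"],
    control["value"] / control["impressions"],
)
print(f"Engagement lift: {engagement_lift:.1%}")
print(f"Value lift: {value_lift:.1%}")
```

Because both metrics are normalized per impression, the different sizes of the test and control groups do not distort the comparison.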

TODO replicate IA tiles

For value, aggregate the Value property.
