# Impact Analyzer

This script reproduces the Impact Analyzer numbers from the data in VBD Actuals / Scenario Planner Actuals.

The direct data export from Impact Analyzer gives exactly the same numbers that drive the plots in the Pega UI. The VBD data is larger and lets you choose time ranges and selections that the Pega UI does not offer.

Caveats:

1. Impact Analyzer only looks at *active* actions. This notion of active versus inactive is either absent from the VBD data or, at least, not currently used by this script.

This script is a work in progress. It currently reproduces only the impression counts. Value, lift, uncertainties etc. still need to be added.

In [1]:
import polars as pl
import pathlib
import pdstools
import polars.selectors as cs
from pdstools import ImpactAnalyzer
from pathlib import Path

In [2]:
# Replace by your own file. This example is part of the PDS tools repository. TODO: just pull from the GH site directly.

sample_pdc_data = Path("~/dev/pega-datascientist-tools/data/ia/CDH_Metrics_ImpactAnalyzer.json").expanduser()

The raw input data looks like this. The PDC format is very verbose and contains more fields than the ones shown here.

In [3]:
input_as_tbl = ImpactAnalyzer.from_pdc(
    sample_pdc_data, return_input_df=True
).collect()
# Temporary: dump the raw input to Excel for manual inspection
input_as_tbl.write_excel(Path("~/Downloads/ia_testing.xlsx").expanduser(), autofit=True)
input_as_tbl.to_pandas()

Unnamed: 0,IsActive,Accepts_Control,EngagementLiftInterval,ConfidenceLevel,SnapshotTime,ExperimentName,AcceptRate_Control,AcceptRate_NBA,LastDataReceived,IsSignificant,...,AggregationFrequency,ActionValuePerImp_NBA,Accepts_NBA,Impressions_NBA,Impressions_Control,ChannelName,ValueLift,ActionValuePerImp_Control,EngagementLift,ValueLiftInterval
0,True,36108,0.01,,1970-01-01T00:00:00.000Z,NBAHealth_NBAPrioritization,0.0,0.0,Yesterday,True,...,Daily,0.0,3991815,39942446,362100,All channels,0.911865,0.0,0.002215,0.74
1,True,41455,0.01,,1970-01-01T00:00:00.000Z,NBAHealth_PropensityPriority,0.0,0.0,Yesterday,True,...,Daily,0.0,3991815,39942446,416372,All channels,0.018921,0.0,0.003784,0.19
2,True,39189,0.01,,1970-01-01T00:00:00.000Z,NBAHealth_LeverPriority,0.0,0.0,Yesterday,True,...,Daily,0.0,3991815,39942446,393169,All channels,0.002653,0.0,0.002653,0.07
3,False,0,0.0,,1970-01-01T00:00:00.000Z,NBAHealth_EngagementPolicy,0.0,0.0,Yesterday,False,...,Daily,0.0,3991815,39942446,0,All channels,0.0,0.0,0.0,0.0
4,True,16676,0.02,,1970-01-01T00:00:00.000Z,NBAHealth_ModelControl,0.0,0.0,Yesterday,True,...,Daily,0.0,22544,224211,167437,All channels,1.032273,0.0,0.009563,1.42
5,True,9068,0.02,,1970-01-01T00:00:00.000Z,NBAHealth_NBAPrioritization,0.0,0.0,Yesterday,True,...,Daily,0.0,996905,9985661,90525,DirectMail,0.881899,0.0,-0.00337,0.79
6,True,10433,0.02,,1970-01-01T00:00:00.000Z,NBAHealth_PropensityPriority,0.0,0.0,Yesterday,True,...,Daily,0.0,996905,9985661,104093,DirectMail,0.006902,0.0,-0.003931,0.19
7,True,9778,0.02,,1970-01-01T00:00:00.000Z,NBAHealth_LeverPriority,0.0,0.0,Yesterday,True,...,Daily,0.0,996905,9985661,98292,DirectMail,0.003564,0.0,0.003564,0.08
8,False,0,0.0,,1970-01-01T00:00:00.000Z,NBAHealth_EngagementPolicy,0.0,0.0,Yesterday,False,...,Daily,0.0,996905,9985661,0,DirectMail,0.0,0.0,0.0,0.0
9,True,4228,0.04,,1970-01-01T00:00:00.000Z,NBAHealth_ModelControl,0.0,0.0,Yesterday,True,...,Daily,0.0,5581,56051,41860,DirectMail,0.915669,0.0,-0.014191,1.34


When reading from PDC, our ImpactAnalyzer class keeps only the impression counts, the accept counts and the action value per impression, and re-calculates all derived values on demand. It drops inactive experiments and adds rows for the "NBA" group. The "All channels" rows are dropped. ValueLift and ValueLiftInterval are copied from the PDC data because they currently cannot be re-calculated from the available raw numbers (ValuePerImpression is empty).

When reading multiple PDC files from S3 we can use

`ImpactAnalyzer.from_pdc(
    sample_pdc_data,
    return_df=True
)
`

and stack up the returned dataframes to pass them on collectively to the ImpactAnalyzer class.

In [4]:
ia = ImpactAnalyzer.from_pdc(
    sample_pdc_data,
)
ia.ia_data.head(10).collect().to_pandas().style

Unnamed: 0,SnapshotTime,Channel,ControlGroup,Impressions,Accepts,ValuePerImpression,Pega_ValueLift,Pega_ValueLiftInterval
0,2025-03-01 00:00:00,DirectMail,NBAHealth_LeverPriority,98292,9778,,0.003564,0.08
1,2025-03-01 00:00:00,DirectMail,NBAHealth_ModelControl_1,41860,4228,,0.915669,1.34
2,2025-03-01 00:00:00,DirectMail,NBAHealth_ModelControl_2,56051,5581,,1.0,0.0
3,2025-03-01 00:00:00,DirectMail,NBAHealth_NBA,9985661,996905,,1.0,0.0
4,2025-03-01 00:00:00,DirectMail,NBAHealth_NBAPrioritization,90525,9068,,0.881899,0.79
5,2025-03-01 00:00:00,DirectMail,NBAHealth_PropensityPriority,104093,10433,,0.006902,0.19
6,2025-03-01 00:00:00,Email,NBAHealth_LeverPriority,98293,9755,,0.007617,0.08
7,2025-03-01 00:00:00,Email,NBAHealth_ModelControl_1,41860,4198,,0.908051,1.34
8,2025-03-01 00:00:00,Email,NBAHealth_ModelControl_2,56055,5570,,1.0,0.0
9,2025-03-01 00:00:00,Email,NBAHealth_NBA,9985677,998568,,1.0,0.0


All the control groups with counts aggregated over all the channels
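Aggregating over channels pools the counts: impressions and accepts are summed first and the CTR is taken from the totals, rather than averaging the per-channel CTRs. A minimal plain-Python check, using the per-channel `NBAHealth_NBA` counts that appear in the outputs of this notebook (the variable names are only for illustration):

```python
# Per-channel (impressions, accepts) for NBAHealth_NBA, copied from this notebook's outputs.
nba_counts = {
    "DirectMail": (9985661, 996905),
    "Email": (9985677, 998568),
    "Push": (9985651, 997925),
    "SMS": (9985457, 998417),
}

total_impressions = sum(imp for imp, _ in nba_counts.values())
total_accepts = sum(acc for _, acc in nba_counts.values())

# Pooled CTR: ratio of the sums, not the mean of the per-channel CTRs.
pooled_ctr = total_accepts / total_impressions

print(total_impressions, total_accepts, round(pooled_ctr, 6))
```

The totals agree with the `NBAHealth_NBA` row of the control group summary (39942446 impressions, 3991815 accepts, CTR 0.099939).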

In [5]:
ia.summarize_control_groups().collect()

ControlGroup,Impressions,Accepts,CTR,ValuePerImpression
str,i64,i64,f64,f64
"""NBAHealth_LeverPriority""",393169,39189,0.099675,
"""NBAHealth_ModelControl_1""",167437,16676,0.099596,
"""NBAHealth_ModelControl_2""",224211,22544,0.100548,
"""NBAHealth_NBA""",39942446,3991815,0.099939,
"""NBAHealth_NBAPrioritization""",362100,36108,0.099718,
"""NBAHealth_PropensityPriority""",416372,41455,0.099562,


All the experiments, split by channel

In [6]:
ia.summarize_experiments("Channel").collect()

Experiment,Test,Control,Channel,Impressions_Test,Accepts_Test,CTR_Test,ValuePerImpression_Test,Impressions_Control,Accepts_Control,CTR_Control,ValuePerImpression_Control,Control_Fraction,CTR_Lift,Value_Lift
str,str,str,str,i64,i64,f64,f64,i64,i64,f64,f64,f64,f64,f64
"""Adaptive Models vs Random Prop…","""NBAHealth_ModelControl_2""","""NBAHealth_ModelControl_1""","""DirectMail""",56051,5581,0.09957,,41860,4228,0.101003,,0.427531,-0.014191,0.915669
"""Adaptive Models vs Random Prop…","""NBAHealth_ModelControl_2""","""NBAHealth_ModelControl_1""","""Email""",56055,5570,0.099367,,41860,4198,0.100287,,0.427514,-0.009173,0.908051
"""Adaptive Models vs Random Prop…","""NBAHealth_ModelControl_2""","""NBAHealth_ModelControl_1""","""Push""",56053,5792,0.103331,,41860,4143,0.098973,,0.427522,0.044032,1.043275
"""Adaptive Models vs Random Prop…","""NBAHealth_ModelControl_2""","""NBAHealth_ModelControl_1""","""SMS""",56052,5601,0.099925,,41857,4107,0.09812,,0.427509,0.018399,1.312399
"""NBA vs No Levers""","""NBAHealth_NBA""","""NBAHealth_LeverPriority""","""DirectMail""",9985661,996905,0.099834,,98292,9778,0.099479,,0.009747,0.003564,0.003564
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""NBA vs Propensity Only""","""NBAHealth_NBA""","""NBAHealth_PropensityPriority""","""SMS""",9985457,998417,0.099987,,104094,10328,0.099218,,0.010317,0.007752,0.032556
"""NBA vs Random""","""NBAHealth_NBA""","""NBAHealth_NBAPrioritization""","""DirectMail""",9985661,996905,0.099834,,90525,9068,0.100171,,0.008984,-0.00337,0.881899
"""NBA vs Random""","""NBAHealth_NBA""","""NBAHealth_NBAPrioritization""","""Email""",9985677,998568,0.1,,90525,8940,0.098757,,0.008984,0.012584,0.924631
"""NBA vs Random""","""NBAHealth_NBA""","""NBAHealth_NBAPrioritization""","""Push""",9985651,997925,0.099936,,90525,8987,0.099276,,0.008984,0.006643,0.881671


There are convenient summarization functions that pivot the lift metrics overall or per channel.

In [7]:
ia.overall_summary().collect()

CTR_Lift Adaptive Models vs Random Propensity,CTR_Lift NBA vs No Levers,CTR_Lift NBA vs Only Eligibility Rules,CTR_Lift NBA vs Propensity Only,CTR_Lift NBA vs Random,Value_Lift Adaptive Models vs Random Propensity,Value_Lift NBA vs No Levers,Value_Lift NBA vs Only Eligibility Rules,Value_Lift NBA vs Propensity Only,Value_Lift NBA vs Random
f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
0.009563,0.002653,,0.003784,0.002215,1.044844,0.00274,,0.018999,0.912425


In [8]:
ia.summary_by_channel().collect()

Channel,CTR_Lift Adaptive Models vs Random Propensity,CTR_Lift NBA vs No Levers,CTR_Lift NBA vs Only Eligibility Rules,CTR_Lift NBA vs Propensity Only,CTR_Lift NBA vs Random,Value_Lift Adaptive Models vs Random Propensity,Value_Lift NBA vs No Levers,Value_Lift NBA vs Only Eligibility Rules,Value_Lift NBA vs Propensity Only,Value_Lift NBA vs Random
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""DirectMail""",-0.014191,0.003564,,-0.003931,-0.00337,0.915669,0.003564,,0.006902,0.881899
"""Email""",-0.009173,0.007617,,0.007882,0.012584,0.908051,0.007617,,0.020145,0.924631
"""Push""",0.044032,-0.012555,,0.003514,0.006643,1.043275,-0.012555,,0.016393,0.881671
"""SMS""",0.018399,0.012333,,0.007752,-0.006767,1.312399,0.012333,,0.032556,0.961498


There is also some (basic) support for plotting

In [9]:
ia.plot.overview()

In [10]:
ia.plot.trend()

# Data

Export the VBD Actuals dataset (in production) or VBD Scenario Planner Actuals (in BOE) from Pega Dev or App Studio.

In [11]:
from pdstools import read_ds_export
from pdstools.utils import cdh_utils
# vbd_export_path = pathlib.Path(
#     "~/Downloads/Data-pxStrategyResult_ActualsExport_20240221T204009_GMT.zip"
# ).expanduser()

# GRR NONE of these have the required fields - we could assert this...

vbd_export_path = pathlib.Path(
    "~/Library/CloudStorage/OneDrive-PegasystemsInc/AI Chapter/projects/Impact Analyzer/VBD_Exports/Data-pxStrategyResult_ScenarioPlannerActuals_20220720T143616_GMT/data.json"
).expanduser()

vbd_export = read_ds_export(vbd_export_path).with_columns(
    cdh_utils.parse_pega_date_time_formats("pxOutcomeTime").dt.date(),
)

In [12]:
cols = vbd_export.columns
cols.sort()
cols





['ActionContext',
 'AggregateCount',
 'BundleHead',
 'BundleName',
 'Category',
 'Cost',
 'Value',
 'pxFactID',
 'pxOutcomeTime',
 'pxRank',
 'pyApplication',
 'pyApplicationVersion',
 'pyChannel',
 'pyDirection',
 'pyGroup',
 'pyIssue',
 'pyLabel',
 'pyName',
 'pyOutcome',
 'pyPropensity',
 'pyTreatment']

Fix up some data. **FinalPropensity** is not always present in IH: it is only there through customization or from v xxx onwards.

In [13]:
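if "FinalPropensity" not in vbd_export.columns:
    # Add an all-null placeholder, typed as Float64 so it can be summed like a real propensity.
    vbd_export = vbd_export.with_columns(
        pl.lit(None, dtype=pl.Float64).alias("FinalPropensity")
    )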





# Control Groups in Impact Analyzer

IA uses **pyReason**, **MktType**, **MktValue** and **ModelControlGroup** to define the various experiments. For the standard NBA decisions (no experiment), values are left empty (null). 

Prior to Impact Analyzer, or when it is turned off, Predictions from Prediction Studio manage two groups through the **ModelControlGroup** property. A value of **Test** is used for model-driven arbitration and **Control** for the random control group (which defaults to 2%).

When IA is on, the distinct values from just **MktValue** are sufficient to identify the different experiments. In the future, more and custom experiments may be supported.

For the full NBA interactions the value of the marker fields is left empty.
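As a rough illustration of the rule that **MktValue** alone identifies the group when IA is on, the sketch below counts records per control group in plain Python. The records are made up, the `"NBA"` label for empty markers is a naming choice for this example only, and the marker values are assumed to look like the control group names seen in the PDC data.

```python
from collections import Counter


def control_group_label(mkt_value):
    """An empty (null) marker means a standard NBA decision; otherwise MktValue names the group."""
    return "NBA" if mkt_value is None else mkt_value


# Hypothetical records: only the MktValue marker field is shown.
records = [
    {"MktValue": None},
    {"MktValue": None},
    {"MktValue": "NBAHealth_LeverPriority"},
    {"MktValue": "NBAHealth_PropensityPriority"},
    {"MktValue": None},
]

counts = Counter(control_group_label(r["MktValue"]) for r in records)
print(counts)
```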

TODO: NBAHealth_ModelControl_2 is conceptually the same as NBAHealth_PropensityPriority and will be phased out in Pega 24.1/24.2. 


# No-Action 

A "Default" issue or group indicates that there is no action. These records need to be filtered out for proper reporting.
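A minimal sketch of such a filter in plain Python, on made-up records. The field names follow the VBD export, but the issue and group values other than "Default" are hypothetical.

```python
def is_no_action(record: dict) -> bool:
    """A record is a 'no action' placeholder when its issue or its group is "Default"."""
    return record.get("pyIssue") == "Default" or record.get("pyGroup") == "Default"


# Hypothetical records with the pyIssue / pyGroup fields from the VBD export.
records = [
    {"pyIssue": "Sales", "pyGroup": "Cards", "AggregateCount": 10},
    {"pyIssue": "Default", "pyGroup": "Default", "AggregateCount": 4},
    {"pyIssue": "Retention", "pyGroup": "Default", "AggregateCount": 1},
]

# Keep only the records that represent a real action.
real_actions = [r for r in records if not is_no_action(r)]
print(sum(r["AggregateCount"] for r in real_actions))
```

This uses the same condition as the polars filter in this section, negated so that the "Default" records are dropped instead of selected.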

TODO: should we exclude these from analysis?

TODO: what about things with inactive status? And how can we know?

In [14]:
vbd_export.group_by(["pyChannel", "pyDirection", "pyIssue", "pyGroup"]).agg(
    pl.len().alias("VBD Records"),  # pl.len() replaces the deprecated pl.count()
    pl.col("AggregateCount").cast(pl.Int64).sum().alias("Actions"),
).with_columns(
    (pl.col("VBD Records") / pl.sum("VBD Records"))
    .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
    .alias("VBD Records Percentage (per channel)"),
    (pl.col("Actions") / pl.sum("Actions"))
    .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
    .alias("Actions Percentage (per channel)"),
).filter(
    # Keep only the "no action" records so their share per channel is visible
    (pl.col("pyIssue") == "Default") | (pl.col("pyGroup") == "Default")
).collect()

pyChannel,pyDirection,pyIssue,pyGroup,VBD Records,Actions,VBD Records Percentage (per channel),Actions Percentage (per channel)
str,str,str,str,u32,i64,f64,f64


# Lookback Period

Impact Analyzer counts its lookback period back from today's date, even when the data is from an earlier date.

In [15]:
from datetime import datetime


lookback = "-51d"  # "-1mo", "-2y", "-2w" etc https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.dt.offset_by.html

AvailableDates = (
    vbd_export.select(
        From=pl.col("pxOutcomeTime").min(),
        To=pl.col("pxOutcomeTime").max(),
        LookbackFromLastDateInData=pl.col("pxOutcomeTime").max().dt.offset_by(lookback),
        LookbackFromNow=pl.lit(datetime.now()).dt.offset_by(lookback).dt.date(),
    ).collect()
    # .item()
)

lookback_time = AvailableDates.select("LookbackFromLastDateInData").item()

AvailableDates

From,To,LookbackFromLastDateInData,LookbackFromNow
date,date,date,date
2022-07-20,2022-07-20,2022-05-30,2025-04-05
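The difference between the two anchors can be checked with plain `datetime` arithmetic. This is a small sketch that uses the 51-day window and the export date shown above. The variable names are illustrative only.

```python
from datetime import date, timedelta

lookback_days = 51  # same window as the "-51d" offset used above
last_date_in_data = date(2022, 7, 20)  # the "To" date in the output above

# Anchoring on the data keeps the window on top of the export ...
start_from_data = last_date_in_data - timedelta(days=lookback_days)

# ... while anchoring on today (as Impact Analyzer does) can miss an old export entirely.
start_from_today = date.today() - timedelta(days=lookback_days)
export_is_outside_window = last_date_in_data < start_from_today

print(start_from_data, export_is_outside_window)
```

For this export the data-anchored window starts on 2022-05-30, matching `LookbackFromLastDateInData`, whereas a window anchored on today no longer contains any of the data.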


# Impact Analyzer counts by Channel

In [16]:
optional_mcg = (
    ["ModelControlGroup"] if "ModelControlGroup" in vbd_export.columns else []
)

ia_summary_by_channel = (
    vbd_export.filter(pl.col("pxOutcomeTime") >= lookback_time)
    .group_by(
        [
            "pyChannel",
            "pyDirection",
            "MktType",
            "MktValue",
            "pyReason",
            "pyOutcome",
        ]
        + optional_mcg
    )
    .agg(
        pl.col("pxOutcomeTime").max().alias("Most Recent Update"),
        pl.len().alias("VBD Records"),  # pl.len() replaces the deprecated pl.count()
        pl.col("AggregateCount").cast(pl.Int64).sum().alias("Actions"),
        pl.sum("FinalPropensity"),
        pl.sum("pyPropensity"),
        pl.sum("pxPriority"),
    )
    .with_columns(
        (pl.col("VBD Records") / pl.sum("VBD Records"))
        .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
        .alias("VBD Records Percentage (per channel)"),
        (pl.col("Actions") / pl.sum("Actions"))
        .over(["pyChannel", "pyDirection"])  # Percentages relative to channel
        .alias("Actions Percentage (per channel)"),
        (pl.col("FinalPropensity") / pl.col("Actions")).alias("Avg FinalPropensity"),
        (pl.col("pyPropensity") / pl.col("Actions")).alias("Avg pyPropensity"),
        (pl.col("pxPriority") / pl.col("Actions")).alias("Avg pxPriority"),
    )
    .drop(["FinalPropensity", "pyPropensity", "pxPriority"])
    # NOTE: default_ia_experiments is not defined in this notebook. The join expects a
    # DataFrame with the MktValue, MktType and pyReason keys plus the Experiment and
    # Description columns that are used further down.
    .join(
        default_ia_experiments.lazy(),
        how="left",
        on=["MktValue", "MktType", "pyReason"],
        nulls_equal=True,
    )
    .sort(
        [
            "pyChannel",
            "pyDirection",
            "Experiment",
        ]
        + optional_mcg,
        nulls_last=True,
    )
)


def highlight(s):
    # Rows without a matching experiment are highlighted in orange
    if s.Experiment is None:
        return ["background-color: orange"] * len(s)
    else:
        return ["background-color: white"] * len(s)


ia_summary_by_channel_formatted = (
    ia_summary_by_channel.filter(pl.col("pyChannel") == "Web")
    .collect()
    .to_pandas()
    .style.format(
        {
            "Avg FinalPropensity": "{:.2%}",
            "Avg pyPropensity": "{:.2%}",
            "Avg pxPriority": "{:.3f}",
            "VBD Records Percentage (per channel)": "{:.2%}",
            "Actions Percentage (per channel)": "{:.2%}",
            "Most Recent Update": "{:%d %b '%y}",
        }
    )
    .hide(axis="index")
    .hide(["Description", "VBD Records Percentage (per channel)"], axis="columns")
    .set_caption("Experiment Summary:")
    .apply(highlight, axis=1)
)

ia_summary_by_channel_formatted




NameError: name 'default_ia_experiments' is not defined

# KPIs per Channel

In [None]:
# ia_summary_by_channel.collect().pivot("pyOutcome")
# xxx = set(ia_summary_by_channel.columns)
# xxx.remove('pyOutcome')
import pandas as pd

group_by_cols = [
    "pyChannel",
    "pyDirection",
    # "MktType",
    "Experiment",
    "MktValue",
    # "pyReason",
] + optional_mcg

engagement_overview = (
    ia_summary_by_channel.collect()
    .pivot(
        on="pyOutcome",  # `columns=` was renamed to `on=` in polars 1.0
        index=group_by_cols,
        values="Actions",
        aggregate_function="sum",
        sort_columns=True,
    )
    .with_columns(cs.numeric().fill_null(0))
)

positive_labels = [
    label
    for label in ["Clicked", "Accept", "Accepted"]
    if label in engagement_overview.columns
]
negative_labels = [
    label for label in ["Impression"] if label in engagement_overview.columns
]

pos_expr = pl.lit(0.0)
for label in positive_labels:
    pos_expr = pos_expr + pl.col(label)
neg_expr = pl.lit(0.0)
for label in negative_labels:
    neg_expr = neg_expr + pl.col(label)


engagement_overview = (
    engagement_overview.with_columns(
        Positives=pos_expr.cast(pl.Int64),
        Negatives=neg_expr.cast(pl.Int64),
        CTR=(pos_expr / (pos_expr + neg_expr)),
    ).with_columns(
        CTR_Lift_vs_NBA=(
            pl.col("CTR")
            / pl.repeat(
                (pl.col("CTR").filter(pl.col("MktValue").is_null())), pl.len()
            )
            - 1.0
        ).over(["pyChannel", "pyDirection"])
    )
    # .with_columns(
    #     CTR_Lift_vs_NBA=(pl.col("CTR")
    #     / (pl.col("CTR").filter(pl.col("MktValue").is_null()))
    #     - 1.0).over(["pyChannel", "pyDirection"])
    # )
    .sort(["pyChannel", "pyDirection", "Experiment"] + optional_mcg)
)


def set_background_col(s, color):
    return "background-color: %s" % color


def set_font_weight(val, weight="bold"):
    return "font-weight: %s" % weight


engagement_overview.to_pandas().style.hide().applymap(
    set_background_col,
    subset=pd.IndexSlice[:, positive_labels + ["Positives"]],
    color="mediumseagreen",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, negative_labels + ["Negatives"]],
    color="tomato",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, ["CTR", "CTR_Lift_vs_NBA"]],
    color="orange",
).applymap(
    set_font_weight, subset=pd.IndexSlice[:, ["pyChannel", "pyDirection", "Experiment"]]
).format(
    {"CTR": "{:,.3%}".format}
)

# Aggregated over all channels

In [None]:
import pandas as pd


top_level_dashboard = (
    engagement_overview.group_by(["Experiment", "MktValue"] + optional_mcg)
    .agg(cs.numeric().sum())
    .with_columns(
        Positives=pos_expr.cast(pl.Int64),
        Negatives=neg_expr.cast(pl.Int64),
        CTR=(pos_expr / (pos_expr + neg_expr)),
    )
    .with_columns(
        CTR_Lift_vs_NBA=pl.col("CTR")
        / (pl.col("CTR").filter(pl.col("MktValue").is_null()))
        - 1.0
    )
    .sort(["Experiment"] + optional_mcg)
)

top_level_dashboard.to_pandas().style.hide().applymap(
    set_background_col,
    subset=pd.IndexSlice[:, positive_labels + ["Positives"]],
    color="mediumseagreen",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, negative_labels + ["Negatives"]],
    color="tomato",
).applymap(
    set_background_col,
    subset=pd.IndexSlice[:, ["CTR", "CTR_Lift_vs_NBA"]],
    color="orange",
).applymap(
    set_font_weight, subset=pd.IndexSlice[:, ["Experiment"]]
).format(
    {"CTR": "{:,.3%}".format}
)

# Lift

Engagement Lift is calculated as (SuccessRate(test) - SuccessRate(control))/SuccessRate(control)

Value Lift is calculated as (ValueCapture(test) - ValueCapture(control))/ValueCapture(control)
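Both formulas share the same shape, so a single helper covers them. The sketch below applies it to the "NBA vs No Levers" experiment on DirectMail, with the counts taken from the experiment summary earlier in this notebook. The function name and the choice to return NaN for a zero control rate are assumptions made for this example, not the library's behaviour.

```python
def lift(test_rate: float, control_rate: float) -> float:
    """Relative lift of test over control: (test - control) / control."""
    if control_rate == 0:
        return float("nan")  # assumption: an undefined lift is reported as NaN
    return (test_rate - control_rate) / control_rate


# "NBA vs No Levers" on DirectMail: accepts / impressions per group.
ctr_test = 996905 / 9985661  # NBAHealth_NBA
ctr_control = 9778 / 98292  # NBAHealth_LeverPriority

engagement_lift = lift(ctr_test, ctr_control)
print(round(engagement_lift, 6))
```

The result rounds to 0.003564, the `CTR_Lift` reported for this experiment and channel. Value lift would use the same helper with value per impression instead of CTR, which cannot be shown here because ValuePerImpression is empty in the sample data.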

TODO replicate IA tiles

For value, this means aggregating the Value property.
