# Predictions Overview

__Pega__

__2024-12-04__

This small notebook reports on and analyses Prediction Studio data on Predictions. The underlying data comes from the Data-DM-Snapshot table, which populates the Prediction Studio screen with Prediction Performance, Lift, CTR, etc.

As input, this notebook accepts data exported from PDC, which has a slightly altered format, as well as data exported directly from the pyGetSnapshot dataset in Pega.

For a description of the datamart tables see https://docs-previous.pega.com/decision-management/87/database-tables-monitoring-models.

Disclaimer: this is not a canned, robust and customer-facing notebook (yet). It's mostly used internally to validate Prediction data. Column names and file formats may need some more review to make it more robust.

## Raw data

First, we show the raw data. The raw data is in a "long" format with, for example, test and control groups in separate rows.

In [1]:
from pathlib import Path
import sys
import polars as pl
from pdstools import read_ds_export, Prediction

# path to dataset export here
# e.g. PR_DATA_DM_SNAPSHOTS.parquet
data_export = "<Your Export Here>"

prediction = None
predictions_raw_data = None
if data_export.endswith(".parquet"):
    predictions_raw_data = pl.scan_parquet(Path(data_export).expanduser())
    prediction = Prediction(predictions_raw_data)
elif data_export.endswith(".json"):
    print("Import of PDC JSON data not supported")
    sys.exit()
elif data_export.endswith(".zip"):
    predictions_raw_data = read_ds_export(data_export)
    prediction = Prediction(predictions_raw_data)
else:
    prediction = Prediction.from_mock_data(days=60)

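# Preview the raw data. A bare expression inside an `if` block is not
# rendered by the notebook, so print it explicitly.
if predictions_raw_data is not None:
    print(predictions_raw_data.head(5).collect())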

## Prediction Data

The actual prediction data is in a "wide" format with separate fields for the Test and Control groups. It also contains only the "daily" snapshots, and the numbers and dates are converted to regular Polars types.
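
The reshaping from long to wide can be pictured with a small, dependency-free sketch. The column names (`Group`, `Positives`, `Negatives`) and the toy rows are assumptions made for illustration; they are not the actual snapshot schema, and `Prediction` does this work internally with Polars.

```python
from collections import defaultdict

# Hypothetical "long" rows: one row per model, day and group.
long_rows = [
    {"ModelName": "PREDICTWEBPROPENSITY", "SnapshotTime": "2025-05-12",
     "Group": "Test", "Positives": 250, "Negatives": 6000},
    {"ModelName": "PREDICTWEBPROPENSITY", "SnapshotTime": "2025-05-12",
     "Group": "Control", "Positives": 120, "Negatives": 6000},
]


def to_wide(rows):
    """Fold the per-group rows into one record per (model, day)."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["ModelName"], row["SnapshotTime"])
        for measure in ("Positives", "Negatives"):
            # Suffix each measure with its group, e.g. Positives_Test.
            wide[key][f"{measure}_{row['Group']}"] = row[measure]
    return dict(wide)


wide_rows = to_wide(long_rows)
print(wide_rows)
```

Each (model, day) pair ends up as a single record carrying `Positives_Test`, `Positives_Control` and so on, which is the shape of the prediction data in this section.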

In [2]:
prediction.predictions.head().collect()

| pyModelId | SnapshotTime | Positives | Negatives | ResponseCount | Performance | Positives_Test | Negatives_Test | ResponseCount_Test | Positives_Control | Negatives_Control | ResponseCount_Control | Positives_NBA | Negatives_NBA | ResponseCount_NBA | Class | ModelName | CTR | CTR_Test | CTR_Control | CTR_NBA | CTR_Lift | isValidPrediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | date | f64 | i64 | f64 | f32 | f64 | i64 | f64 | f64 | i64 | f64 | f64 | i64 | f64 | str | str | f64 | f64 | f64 | f64 | f64 | bool |
| DATA-DECISION-REQUEST-CUSTOMER… | 2025-05-12 | 150.0 | 6000 | 6150.0 | 70.0 | 250.0 | 6000 | 6250.0 | 120.0 | 6000 | 6120.0 | 150.0 | 6000 | 6150.0 | DATA-DECISION-REQUEST-CUSTOMER | PREDICTMOBILEPROPENSITY | 0.02439 | 0.04 | 0.019608 | 0.02439 | 1.04 | True |
| DATA-DECISION-REQUEST-CUSTOMER… | 2025-05-12 | 250.0 | 6000 | 6250.0 | 70.0 | 250.0 | 6000 | 6250.0 | 120.0 | 6000 | 6120.0 | 150.0 | 6000 | 6150.0 | DATA-DECISION-REQUEST-CUSTOMER | PREDICTMOBILEPROPENSITY | 0.04 | 0.04 | 0.019608 | 0.02439 | 1.04 | True |
| DATA-DECISION-REQUEST-CUSTOMER… | 2025-05-12 | 120.0 | 6000 | 6120.0 | 70.0 | 250.0 | 6000 | 6250.0 | 120.0 | 6000 | 6120.0 | 150.0 | 6000 | 6150.0 | DATA-DECISION-REQUEST-CUSTOMER | PREDICTMOBILEPROPENSITY | 0.019608 | 0.04 | 0.019608 | 0.02439 | 1.04 | True |
| DATA-DECISION-REQUEST-CUSTOMER… | 2025-05-13 | 120.0 | 6000 | 6120.0 | 70.05085 | 250.847458 | 6000 | 6250.847458 | 120.0 | 6000 | 6120.0 | 150.0 | 6000 | 6150.0 | DATA-DECISION-REQUEST-CUSTOMER | PREDICTMOBILEPROPENSITY | 0.019608 | 0.04013 | 0.019608 | 0.02439 | 1.046638 | True |
| DATA-DECISION-REQUEST-CUSTOMER… | 2025-05-13 | 250.847458 | 6000 | 6250.847458 | 70.05085 | 250.847458 | 6000 | 6250.847458 | 120.0 | 6000 | 6120.0 | 150.0 | 6000 | 6150.0 | DATA-DECISION-REQUEST-CUSTOMER | PREDICTMOBILEPROPENSITY | 0.04013 | 0.04013 | 0.019608 | 0.02439 | 1.046638 | True |

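The derived columns in this output can be reproduced by hand. The sketch below assumes, based only on the numbers shown above, that CTR is positives divided by response count and that `CTR_Lift` is the relative difference between test and control CTR. The helper names are hypothetical and this is not the pdstools implementation.

```python
def ctr(positives, response_count):
    """Click-through rate: share of responses that are positive."""
    return positives / response_count


def lift(ctr_test, ctr_control):
    """Relative uplift of the test group over the control group."""
    return (ctr_test - ctr_control) / ctr_control


# Test and control counts taken from the 2025-05-12 rows of the output above.
ctr_test = ctr(250.0, 6250.0)
ctr_control = ctr(120.0, 6120.0)
ctr_lift = lift(ctr_test, ctr_control)

print(round(ctr_test, 6), round(ctr_control, 6), round(ctr_lift, 6))
```

Rounded to six decimals, these values agree with the `CTR_Test`, `CTR_Control` and `CTR_Lift` columns for that day (0.04, 0.019608 and 1.04).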

## Summary by Channel

Standard functionality exists to summarize the predictions per channel. Note that the data does not contain the prediction-to-channel mapping (this is an outstanding product issue), so we apply the implicit naming conventions of NBAD. For a specific customer, custom mappings can be passed into the summarization function.
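
To make the idea of a name-based mapping concrete, here is a minimal sketch of a lookup from prediction name to channel and direction, with customer-specific overrides taking precedence. The three default entries are taken from the summary output in this notebook; the function `channel_for`, its `custom_mapping` argument and the `Unknown` fallback are hypothetical and do not reflect the pdstools API or its actual mapping table.

```python
# Default mapping, limited to the three predictions that appear in this notebook.
DEFAULT_MAPPING = {
    "PREDICTMOBILEPROPENSITY": ("Mobile", "Inbound"),
    "PREDICTOUTBOUNDEMAILPROPENSITY": ("E-mail", "Outbound"),
    "PREDICTWEBPROPENSITY": ("Web", "Inbound"),
}


def channel_for(prediction_name, custom_mapping=None):
    """Return (channel, direction); custom entries override the defaults."""
    name = prediction_name.upper()
    if custom_mapping and name in custom_mapping:
        return custom_mapping[name]
    return DEFAULT_MAPPING.get(name, ("Unknown", "Unknown"))


print(channel_for("PredictWebPropensity"))
print(channel_for("MYCUSTOMPREDICTION", {"MYCUSTOMPREDICTION": ("Web", "Inbound")}))
```

Checking the overrides first means a customer can both add new predictions and redefine a standard one without touching the defaults.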

In [3]:
prediction.summary_by_channel().collect()

| Prediction | Channel | Direction | isStandardNBADPrediction | isMultiChannelPrediction | DateRange Min | DateRange Max | Duration | Performance | Positives | Negatives | Responses | Positives_Test | Positives_Control | Positives_NBA | Negatives_Test | Negatives_Control | Negatives_NBA | usesImpactAnalyzer | ControlPercentage | TestPercentage | CTR | CTR_Test | CTR_Control | CTR_NBA | ChannelDirectionGroup | isValid | Lift |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| str | str | str | bool | bool | date | date | i64 | f64 | f64 | i64 | f64 | f64 | f64 | f64 | i64 | i64 | i64 | bool | f64 | f64 | f64 | f64 | f64 | f64 | str | bool | f64 |
| PREDICTMOBILEPROPENSITY | Mobile | Inbound | True | False | 2025-05-12 | 2025-07-10 | 5097600 | 71.500697 | 32700.0 | 1080000 | 1112700.0 | 49500.0 | 21600.0 | 27000.0 | 1080000 | 1080000 | 1080000 | True | 33.000809 | 33.836614 | 0.029388 | 0.043825 | 0.019608 | 0.02439 | Mobile/Inbound | True | 1.23506 |
| PREDICTOUTBOUNDEMAILPROPENSITY | E-mail | Outbound | True | False | 2025-05-12 | 2025-07-10 | 5097600 | 62.500567 | 24000.0 | 1800000 | 1824000.0 | 32400.0 | 18000.0 | 21600.0 | 1800000 | 1800000 | 1800000 | True | 33.223684 | 33.486842 | 0.013158 | 0.017682 | 0.009901 | 0.011858 | E-mail/Outbound | True | 0.785855 |
| PREDICTWEBPROPENSITY | Web | Inbound | True | False | 2025-05-12 | 2025-07-10 | 5097600 | 67.001637 | 379200.0 | 7200000 | 7579200.0 | 612000.0 | 252000.0 | 273600.0 | 7200000 | 7200000 | 7200000 | True | 32.773908 | 34.357188 | 0.050032 | 0.078341 | 0.033816 | 0.036609 | Web/Inbound | True | 1.316656 |

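Some of the summary columns can be cross-checked from the others. The sketch below rests on assumptions inferred from the Mobile row above rather than from the pdstools source: that `Duration` is the date range expressed in seconds, and that `ControlPercentage` and `TestPercentage` are each group's share of the combined Test, Control and NBA responses.

```python
from datetime import date

# Mobile row: date range and per-group counts copied from the output above.
duration_seconds = (date(2025, 7, 10) - date(2025, 5, 12)).days * 24 * 60 * 60

# Responses per group = positives + negatives for that group.
responses = {
    "Test": 49500.0 + 1080000,
    "Control": 21600.0 + 1080000,
    "NBA": 27000.0 + 1080000,
}
total = sum(responses.values())
control_pct = 100 * responses["Control"] / total
test_pct = 100 * responses["Test"] / total

print(duration_seconds, round(control_pct, 6), round(test_pct, 6))
```

Under these assumptions the results line up with the reported `Duration`, `ControlPercentage` and `TestPercentage` for the Mobile prediction.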

## Prediction Trends

By default, summarization is over all time. You can pass in an argument to summarize by day, week, or any other period supported by the [Polars time offset string language](https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.offset_by.html).

This trend data can then easily be visualized.
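
As an illustration of what summarizing by period means, the following dependency-free sketch buckets daily counts into weeks starting on Monday and computes one CTR per week. The daily figures are made up and the helper names are hypothetical; the notebook itself delegates this to pdstools and Polars.

```python
from collections import defaultdict
from datetime import date, timedelta

# Made-up daily snapshots: (day, positives, responses).
daily = [
    (date(2025, 5, 12), 150, 6150),
    (date(2025, 5, 13), 160, 6200),
    (date(2025, 5, 19), 200, 6500),
]


def week_start(day):
    """Monday of the week that contains `day`."""
    return day - timedelta(days=day.weekday())


def weekly_ctr(rows):
    """Sum positives and responses per week, then divide."""
    totals = defaultdict(lambda: [0, 0])
    for day, positives, responses in rows:
        bucket = totals[week_start(day)]
        bucket[0] += positives
        bucket[1] += responses
    return {week: pos / resp for week, (pos, resp) in sorted(totals.items())}


trend = weekly_ctr(daily)
for week, value in trend.items():
    print(week, round(value, 6))
```

Summing the counts before dividing weights each day by its volume, which is usually preferable to averaging the daily CTR values directly.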

In [4]:
prediction.plot.performance_trend("1w")

In [5]:
# Add return_df=True and call .collect() to get the underlying data instead of the plot
prediction.plot.lift_trend("1w")

In [6]:
prediction.plot.ctr_trend("1w", facetting=False)

In [7]:
prediction.plot.responsecount_trend("1w", facetting=False)