# ADM Explained

__Pega__

__2023-03-15__

This notebook shows exactly how all the values in an ADM model report are calculated. It also shows how the propensity is calculated for a particular customer.

We use one of the shipped datamart exports for the example. This is a model very similar to one used in some of the ADM PowerPoint/Excel deep dive examples. To load your own data, see the vignette on ADM reporting for examples.

For the example we use one particular model: AutoNew36Months over SMS. You can use your own data and select a different model.

To explain the ADM model report, we use one of the IH predictors as an example. Swap for any other predictor when using different data.



In [None]:
# These lines are only for rendering in the docs, and are hidden through Jupyter tags
# Do not run if you're running the notebook separately

import plotly.io as pio

pio.renderers.default = "notebook_connected"

import sys

sys.path.append("../../../")
import pandas as pd
pd.set_option('display.max_colwidth', 0)


In [None]:
import polars as pl
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from typing import List

from pdstools import datasets, cdh_utils

pl.Config.set_fmt_str_lengths(100);

In [None]:
dm = datasets.CDHSample(subset=False)

model = dm.combinedData.filter(
    (pl.col("Name") == "AutoNew36Months") & (pl.col("Channel") == "SMS")
)

modelpredictors = (
    dm.combinedData.join(
        model.select(pl.col("ModelID").unique()), on="ModelID", how="inner"
    )
    .filter(pl.col("EntryType") != "Inactive")
    .with_columns(Action=pl.concat_str(["Issue", "Group"], separator="/"),
                  PredictorName=pl.col("PredictorName").cast(pl.Utf8))
    .collect()
)

predictorbinning = modelpredictors.filter(
    pl.col("PredictorName") == "IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount"
).sort("BinIndex")

## Model Overview

In [None]:
modelpredictors.select(
    pl.col("Action").unique(),
    pl.col("Channel").unique(),
    pl.col("Name").unique(),
    pl.col("PredictorName").unique().sort().implode()
    .arr.join(", ").alias("Active Predictors"),
    (pl.col("Performance").unique() * 100).alias("Model Performance (AUC)"),
).to_pandas().T.set_axis(["Values"], axis=1)


## Predictor binning for IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount


The ADM model report will show predictor binning similar to this, with all displayed data coming from fields in the ADM data mart. In subsequent sections we’ll show how all the data is derived from the number of positives and negatives in each of the bins.



In [None]:
predictorbinning.groupby("PredictorName").agg(
    pl.first("ResponseCount").alias("# Responses"),
    pl.n_unique("BinIndex").alias("# Bins"),
    (pl.first("PerformanceBin") * 100).alias("Predictor Performance(AUC)"),
).rename({"PredictorName": "Predictor Name"}).transpose(include_header=True).rename(
    {"column": "", "column_0": "Value"}
)


In [None]:
BinPositives = pl.col("BinPositives")
BinNegatives = pl.col("BinNegatives")
sumPositives = pl.sum("BinPositives")
sumNegatives = pl.sum("BinNegatives")

predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    # Multiply by 100 so the values match the "(%)" column labels
    (((BinPositives + BinNegatives) / (sumPositives + sumNegatives)) * 100)
    .round(2)
    .alias("Responses (%)"),
    BinPositives.alias("Positives"),
    ((BinPositives / sumPositives) * 100).round(2).alias("Positives (%)"),
    BinNegatives.alias("Negatives"),
    ((BinNegatives / sumNegatives) * 100).round(2).alias("Negatives (%)"),
    ((BinPositives / (BinPositives + BinNegatives)) * 100).round(4).alias("Propensity (%)"),
    pl.col("ZRatio"),
    pl.col("Lift"),
)


## Bin Statistics

### Positive and Negative ratios

Internally, ADM only keeps track of the total counts of positive and negative responses in each bin. Everything else is derived from those numbers. The percentages and totals are trivially derived, and the propensity is just the number of positives divided by the total. The numbers calculated here match the numbers from the datamart table exactly.

In [None]:
binningDerived = predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    BinPositives.alias("Positives"),
    BinNegatives.alias("Negatives"),
    (((BinPositives + BinNegatives) / (sumPositives + sumNegatives)) * 100)
    .round(2)
    .alias("Responses %"),
    ((BinPositives / sumPositives) * 100).round(2).alias("Positives %"),
    ((BinNegatives/sumNegatives) * 100).round(2).alias("Negatives %"),
    (BinPositives / (BinPositives + BinNegatives))
    .round(4)
    .alias("Propensity"),
)
binningDerived

### Lift

Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity:

In [None]:
Positives = pl.col('Positives')
Negatives = pl.col('Negatives')
sumPositives = pl.sum("Positives")
sumNegatives = pl.sum("Negatives")
binningDerived.select(
    "Range/Symbol",
    "Positives",
    "Negatives",
    (
        (Positives / (Positives + Negatives))
        / (sumPositives / (Positives + Negatives).sum())
    ).alias("Lift"),
)
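
As a sanity check on the definition above, here is a minimal pure-Python sketch of the same lift calculation. The bin counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the datamart export.

```python
# Hypothetical bin counts, chosen for illustration only
positives = [10, 30, 60]
negatives = [990, 970, 940]


def lift(positives, negatives):
    """Return, per bin, the bin propensity divided by the overall propensity."""
    overall = sum(positives) / (sum(positives) + sum(negatives))
    return [(p / (p + n)) / overall for p, n in zip(positives, negatives)]


lifts = lift(positives, negatives)
for p, n, value in zip(positives, negatives, lifts):
    print(f"positives={p:3d}  negatives={n:3d}  lift={value:.2f}")
```

With these counts the overall propensity is 100/3000, so the first bin (propensity 1%) has a lift below 1 and the last bin (propensity 6%) has a lift above 1. The response-weighted average of the lift values is 1 by construction.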


### Z-Ratio

The Z-Ratio is also a measure of how the propensity in a bin differs from the average, but it takes the size of the bin into account and is therefore statistically more meaningful. It represents the number of standard deviations from the average, so it centers around 0. The wider the spread of the Z-Ratio values across the bins, the better the predictor is. In the formula, posFraction is the bin's share of all positives and negFraction is the bin's share of all negatives.
$$\frac{posFraction-negFraction}{\sqrt{\frac{posFraction \cdot (1-posFraction)}{\sum positives}+\frac{negFraction \cdot (1-negFraction)}{\sum negatives}}}$$

See the calculation here, which is also included in [`cdh_utils`' `zRatio()`](https://pegasystems.github.io/pega-datascientist-tools/Python/autoapi/pdstools/utils/cdh_utils/index.html#pdstools.utils.cdh_utils.zRatio) function.

In [None]:
def zRatio(
    posCol: pl.Expr = pl.col("BinPositives"), negCol: pl.Expr = pl.col("BinNegatives")
) -> pl.Expr:
    def getFracs(posCol=pl.col("BinPositives"), negCol=pl.col("BinNegatives")):
        return posCol / posCol.sum(), negCol / negCol.sum()

    def zRatioimpl(
        posFractionCol=pl.col("posFraction"),
        negFractionCol=pl.col("negFraction"),
        PositivesCol=pl.sum("BinPositives"),
        NegativesCol=pl.sum("BinNegatives"),
    ):
        return (
            (posFractionCol - negFractionCol)
            / (
                (posFractionCol * (1 - posFractionCol) / PositivesCol)
                + (negFractionCol * (1 - negFractionCol) / NegativesCol)
            ).sqrt()
        ).alias("ZRatio")

    return zRatioimpl(*getFracs(posCol, negCol), posCol.sum(), negCol.sum())


binningDerived.select(
    "Range/Symbol", "Positives", "Negatives", "Positives %", "Negatives %"
).with_columns(zRatio(Positives, Negatives))
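
To make the formula concrete without Polars, the following is a small pure-Python sketch of the same calculation. The counts are hypothetical, and the function name `z_ratio` is only used here for illustration; it is not the pdstools function.

```python
import math


def z_ratio(positives, negatives):
    """Z-Ratio per bin, computed from raw positive and negative counts."""
    total_pos, total_neg = sum(positives), sum(negatives)
    result = []
    for p, n in zip(positives, negatives):
        pos_fraction = p / total_pos  # share of all positives in this bin
        neg_fraction = n / total_neg  # share of all negatives in this bin
        std_err = math.sqrt(
            pos_fraction * (1 - pos_fraction) / total_pos
            + neg_fraction * (1 - neg_fraction) / total_neg
        )
        result.append((pos_fraction - neg_fraction) / std_err)
    return result


# Hypothetical counts, for illustration only
small = z_ratio([10, 30, 60], [990, 970, 940])
# Same propensities per bin, but ten times as many responses
large = z_ratio([100, 300, 600], [9900, 9700, 9400])

print([round(z, 2) for z in small])
print([round(z, 2) for z in large])
```

Both calls use the same propensity in every bin, yet the second set of Z-Ratios is larger in magnitude by a factor of the square root of 10. This is the sense in which the Z-Ratio accounts for the amount of evidence, where lift does not. The sketch assumes at least two non-empty bins; with a single bin the denominator is zero.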


## Predictor AUC


The predictor AUC is the univariate performance of this predictor against the outcome. This too can be derived from the positives and negatives, and
pdstools has a convenient function to calculate it directly from the bin counts.

Again, this function is implemented in cdh_utils: [`cdh_utils.auc_from_bincounts()`](https://pegasystems.github.io/pega-datascientist-tools/Python/autoapi/pdstools/utils/cdh_utils/index.html#pdstools.utils.cdh_utils.auc_from_bincounts)

In [None]:
pos=binningDerived.get_column("Positives").to_numpy()
neg=binningDerived.get_column("Negatives").to_numpy()

o = np.argsort((pos / (pos + neg)))

TNR = np.cumsum(neg[o]) / np.sum(neg)
FPR = np.flip(np.cumsum(neg[o]) / np.sum(neg), axis=0)
TPR = np.flip(np.cumsum(pos[o]) / np.sum(pos), axis=0)
Area = (FPR - np.append(FPR[1:], 0)) * (TPR + np.append(TPR[1:], 0)) / 2
auc = 0.5 + np.abs(0.5-np.sum(Area))

fig = px.line(
    x=TPR, y=TNR,
    labels=dict(x='Specificity', y='Sensitivity'),
    title = f"AUC = {auc.round(3)}",
    width=700, height=700,
    range_x=[1,0],
    template='none'
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=1, x1=0, y0=0, y1=1
)
fig.show()
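
The same AUC can also be read as a probability: the chance that a randomly chosen positive falls in a bin with a higher propensity than a randomly chosen negative, with ties counted as one half. The sketch below uses that pair-counting formulation in pure Python. It is an alternative way to arrive at the number, not the pdstools implementation, and the counts are hypothetical.

```python
def auc_from_counts(positives, negatives):
    """AUC from per-bin counts via pair counting (assumes no empty bins)."""
    # Order the bins by increasing propensity
    bins = sorted(zip(positives, negatives), key=lambda b: b[0] / (b[0] + b[1]))
    total_pos, total_neg = sum(positives), sum(negatives)

    concordant = 0.0  # (positive, negative) pairs ranked correctly, ties count half
    negatives_below = 0  # negatives in bins with a lower propensity
    for p, n in bins:
        concordant += p * (negatives_below + n / 2)
        negatives_below += n

    auc = concordant / (total_pos * total_neg)
    # Fold around 0.5, mirroring the calculation in the cell above
    return 0.5 + abs(0.5 - auc)


# Hypothetical counts, for illustration only
example_auc = auc_from_counts([10, 30, 60], [990, 970, 940])
print(round(example_auc, 4))
```

For binned data this pair count gives the same value as the trapezoid area, because each trapezoid corresponds to one bin's positives paired with the negatives below it plus half of the negatives in the same bin.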

## Predictor score and log odds

The score is calculated from the log odds, which are simply the logarithm of the ratio of the probabilities of positives and negatives. For the actual calculation in ADM this is modified slightly to avoid division-by-zero problems, and it is written as a difference of logarithms to avoid numeric instability, as shown below.

In [None]:
N = binningDerived.shape[0]
binningDerived.with_columns(
    LogOdds= (pl.col("Positives %") / pl.col("Negatives %")).log(),
    ModifiedLogOdds=(
        ((Positives + 1 / N).log() - (Positives + 1).sum().log())
        - ((Negatives + 1 / N).log() - (Negatives + 1).sum().log())
    )
).drop("Responses %", "Propensity")
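
The effect of the modification is easiest to see on a bin without any positives. The following pure-Python sketch compares the plain and the modified log odds; the modified form follows the expression used in the cell above, and the counts are hypothetical.

```python
import math


def plain_log_odds(positives, negatives):
    """Log of (share of positives / share of negatives) per bin."""
    total_pos, total_neg = sum(positives), sum(negatives)
    return [
        math.log((p / total_pos) / (n / total_neg))
        for p, n in zip(positives, negatives)
    ]


def modified_log_odds(positives, negatives):
    """Smoothed variant: add 1/N to each bin count and N to each total."""
    n_bins = len(positives)
    total_pos, total_neg = sum(positives), sum(negatives)
    return [
        (math.log(p + 1 / n_bins) - math.log(total_pos + n_bins))
        - (math.log(n + 1 / n_bins) - math.log(total_neg + n_bins))
        for p, n in zip(positives, negatives)
    ]


# Hypothetical counts; the first bin has no positives at all
zero_pos, zero_neg = [0, 30, 70], [1000, 900, 1000]

try:
    plain_log_odds(zero_pos, zero_neg)
    plain_failed = False
except ValueError:  # math.log(0) is undefined
    plain_failed = True

smoothed = modified_log_odds(zero_pos, zero_neg)
print(plain_failed, [round(v, 3) for v in smoothed])
```

When every bin holds a reasonable number of responses the two variants are nearly identical, so the smoothing mostly matters for sparse bins.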

## Propensity mapping

### Log odds contribution for all the predictors

To get to a propensity, the log odds of the relevant bins of the active predictors are added up and divided by the number of active predictors + 1. The result is then used to look up the matching bin in the classifier.

Below is an example. For each active predictor of the model we pick a value (the middle bin for numerics, the first symbol for symbolics) and show the (modified) log odds. These log odds values are averaged (added up and divided by the number of active predictors + 1), and this is the “score” that is mapped to a propensity value by the classifier (which is constructed using the PAV(A) algorithm).

In [None]:
def middleBin():
    return pl.col("BinIndex") == (pl.max("BinIndex") / 2).floor().cast(pl.UInt32)


def RowWiseLogOdds(Bin, Positives, Negatives):
    Bin, N = Bin.arr.get(0) - 1, Positives.arr.lengths()
    Pos, Neg = Positives.arr.get(Bin), Negatives.arr.get(Bin)
    PosSum, NegSum = Positives.arr.sum(), Negatives.arr.sum()
    return (
        (((Pos + (1 / N)).log() - (PosSum + N).log()))
        - (((Neg + (1 / N)).log()) - (NegSum + N).log())
    ).alias("Modified Log odds")


df = (
    modelpredictors.filter(pl.col("PredictorName") != "Classifier")
    .groupby("PredictorName")
    .agg(
        Value=pl.when(pl.col("Type").first() == "numeric")
        .then(
            ((pl.col("BinLowerBound") + pl.col("BinUpperBound")) / 2).where(middleBin())
        )
        .otherwise(pl.col("BinSymbol").str.split(",").arr.first().where(middleBin())),
        Bin=pl.col("BinIndex").where(middleBin()),
        Positives=pl.col("BinPositives"),
        Negatives=pl.col("BinNegatives"),
    )
    .with_columns(
        pl.col(["Positives", "Negatives"]).arr.get(pl.col("Bin").arr.get(0) - 1),
        pl.col("Bin", "Value").arr.get(0),
        LogOdds=RowWiseLogOdds(pl.col("Bin"), pl.col("Positives"), pl.col("Negatives")),
    )
    .sort("PredictorName")
)
df.vstack(
    pl.DataFrame(dict(zip(
                df.columns,
                ["Average log odds"] + [None] * 4 + [df["LogOdds"].sum() / len(df)],
                )),
    schema=df.schema,
    )
)


## Classifier

The success rate is defined as $\frac{positives}{positives+negatives}$ per bin. 

The adjusted propensity that is returned is a small modification (Laplace smoothing) to this and calculated as $\frac{0.5+positives}{1+positives+negatives}$ so empty models return a propensity of 0.5.


In [None]:
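# The datamart columns are called BinPositives/BinNegatives; the "Positives" and
# "Negatives" columns only exist in the derived binningDerived table.
classifier = modelpredictors.filter(pl.col("EntryType") == "Classifier").with_columns(
    Propensity=(BinPositives / (BinPositives + BinNegatives)),
    AdjustedPropensity=((0.5 + BinPositives) / (1 + BinPositives + BinNegatives)),
).select(
    [
        pl.col("BinIndex").alias("Index"),
        pl.col("BinSymbol").alias("Bin"),
        BinPositives.alias("Positives"),
        BinNegatives.alias("Negatives"),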
        ((pl.cumsum("BinResponseCount") / pl.sum("BinResponseCount")) * 100).alias(
            "Cum. Total (%)"
        ),
        (pl.col("BinPropensity") * 100).alias("Propensity (%)"),
        (pl.col("AdjustedPropensity") * 100).alias("Adjusted Propensity (%)"),
        ((pl.cumsum("BinPositives") / pl.sum("BinPositives")) * 100).alias(
            "Cum Positives (%)"
        ),
        pl.col("ZRatio"),
        (pl.col("Lift") * 100).alias("Lift(%)"),
        pl.col("BinResponseCount").alias("Responses"),
    ]
)
classifier.drop("Responses")
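
The difference between the two formulas is small for well-filled bins and only becomes visible for sparse or empty ones. A minimal pure-Python sketch with hypothetical counts:

```python
def success_rate(positives, negatives):
    """Plain propensity: positives / (positives + negatives)."""
    return positives / (positives + negatives)


def adjusted_propensity(positives, negatives):
    """Smoothed propensity: (0.5 + positives) / (1 + positives + negatives)."""
    return (0.5 + positives) / (1 + positives + negatives)


# Hypothetical bins, for illustration only
empty_bin = adjusted_propensity(0, 0)  # no evidence at all
sparse_bin = adjusted_propensity(1, 3)
full_bin = adjusted_propensity(17, 983)

print(empty_bin, sparse_bin, round(full_bin, 5), success_rate(17, 983))
```

An empty bin has an undefined success rate (0/0) but an adjusted propensity of exactly 0.5, while for a bin with 1000 responses the adjustment changes the value only in the fourth decimal.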

## Final Propensity

Below is the classifier mapping. The x-axis shows the binned scores (log odds values) and the y-axis the propensity. Note that the returned propensities follow a slightly adjusted formula; see the table above. The bin that contains the calculated score is highlighted.

The score -0.11403 falls in bin 7 of the classifier, so for this set of inputs, the model returns a propensity of 1.70%.

In [None]:
from pdstools.plots.plots_plotly import ADMVisualisations

fig = ADMVisualisations.distribution_graph(
    modelpredictors.filter(pl.col("EntryType") == "Classifier"),
    "Propensity distribution",
).add_annotation(
    x="[-0.22, 0.99>",
    y=1400,
    text="Returned propensity: 1.7%",
    bgcolor="#FFFFFF",
    bordercolor="#000000",
    showarrow=False,
)
fig.data[0]["marker_color"] = ["grey"] * 6 + ["#1f77b4"] + ["grey"]
fig
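
The lookup itself can be sketched in a few lines. The example below assumes classifier bins that include their lower bound and exclude their upper bound, as the interval notation `[-0.22, 0.99>` suggests. The boundaries and counts are made up for illustration and do not correspond to the model above.

```python
from bisect import bisect_right

# Hypothetical classifier with 4 score bins:
# <-inf, -1.0>, [-1.0, -0.5>, [-0.5, 0.0>, [0.0, +inf>
boundaries = [-1.0, -0.5, 0.0]
bin_positives = [2, 10, 30, 58]
bin_negatives = [998, 990, 970, 942]


def score_to_propensity(score):
    """Return the 1-based classifier bin and its adjusted propensity."""
    # bisect_right puts a score equal to a boundary into the bin above it,
    # which matches lower-inclusive, upper-exclusive intervals
    idx = bisect_right(boundaries, score)
    pos, neg = bin_positives[idx], bin_negatives[idx]
    return idx + 1, (0.5 + pos) / (1 + pos + neg)


bin_number, propensity = score_to_propensity(-0.11403)
print(bin_number, f"{propensity:.2%}")
```

Every score inside the same bin maps to the same propensity, which is why the classifier plot is a step function of the score.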
