# ADM Explained

__Pega__

__2023-03-15__

This notebook shows exactly how all the values in an ADM model report are calculated. It also shows how the propensity is calculated for a particular customer.

We use one of the shipped datamart exports for the example. This is a model very similar to one used in some of the ADM PowerPoint/Excel deep dive examples. To load your own data, see the vignette on ADM reporting for examples.

For the example we use one particular model: AutoNew36Months over SMS. You can use your own data and select a different model.

To explain the ADM model report, we use one of the IH predictors as an example. Swap for any other predictor when using different data.



In [2]:
import sys
import polars as pl
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio

pio.renderers.default = "notebook_connected"

sys.path.append("../../../")
sys.path.append('../../python')
from pdstools import datasets
from pdstools.utils import cdh_utils

pl.Config.set_fmt_str_lengths(100)

polars.config.Config

In [17]:
adm_datamart = datasets.CDHSample(subset=False)

model = adm_datamart.combinedData.filter(
    (pl.col("Name") == "AutoNew36Months") & (pl.col("Channel") == "SMS")
)

modelpredictors = adm_datamart.combinedData.filter(
    (pl.col("ModelID") == model.collect().select("ModelID")[0, 0])
    & (pl.col("EntryType") != "Inactive")
).with_columns(pl.concat_str(["Issue", "Group"], separator="/").alias("Action")).collect()

predictorbinning = modelpredictors.filter(
    pl.col("PredictorName") == "IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount"
).sort("BinIndex")

## Model Overview

In [21]:
model_overview = modelpredictors.select(pl.col("Action").unique(),
                       pl.col("Channel").unique(),
                       pl.col("Name").unique(),
                       pl.col("PredictorName").unique().list().alias("Active Predictors"),
                       (pl.col("Performance").unique() * 100).alias("Model Performance (AUC)")
                       ).to_pandas().T

In [22]:
model_overview

Unnamed: 0,0
Action,Sales/AutoLoans
Channel,SMS
Name,AutoNew36Months
Active Predictors,"[Customer.AnnualIncome, Customer.CLV_VALUE, IH..."
Model Performance (AUC),60.5845


## Predictor binning for IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount


The ADM model report will show predictor binning similar to this, with all displayed data coming from fields in the ADM data mart. In subsequent sections we’ll show how all the data is derived from the number of positives and negatives in each of the bins.



In [24]:
predictor_binning_overview = predictorbinning.groupby("PredictorName").agg(
    pl.first("ResponseCount").alias("Responses"),
    pl.n_unique("BinIndex").alias("# Bins"),
    (pl.first("PerformanceBin") * 100).alias("Predictor Performance(AUC)"),
).rename({"PredictorName":"Name"}).transpose(include_header=True)

In [25]:
predictor_binning_overview

column,column_0
str,str
"""Name""","""IH.SMS.Outbound.Accepted.pyHistoricalOutcomeCount"""
"""Responses""","""5602"""
"""# Bins""","""5"""
"""Predictor Performance(AUC)""","""65.4153"""


In [26]:
Positives = pl.col("Positives")
Negatives = pl.col("Negatives")
BinPositives = pl.col("BinPositives")
BinNegatives = pl.col("BinNegatives")
sumPositives = pl.sum("BinPositives")
sumNegatives = pl.sum("BinNegatives")

groups = ["Name", "PredictorName", "Channel"]
N = Positives.count().over(groups)



predictorbinning_data = predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    ((BinPositives + BinNegatives) / (sumPositives + sumNegatives))
    .round(2)
    .alias("Responses (%)"),
    BinPositives.alias("Positives"),
    (BinPositives / sumPositives).round(2).alias("Positives (%)"),
    BinNegatives.alias("Negatives"),
    (BinNegatives/sumNegatives).round(2).alias("Negatives (%)"),
    (BinPositives / (BinPositives + BinNegatives))
    .round(4)
    .alias("Propensity (%)"),
    pl.col("ZRatio"),
    pl.col("Lift"),
)

In [27]:
predictorbinning_data

Range/Symbol,Responses (%),Positives,Positives (%),Negatives,Negatives (%),Propensity (%),ZRatio,Lift
str,f64,i64,f64,i64,f64,f64,f64,f64
"""MISSING""",0.0,0,0.0,5,0.0,0.0,-2.237073,0.0
"""<7.08""",0.4,7,0.19,2208,0.4,0.0032,-3.051114,0.491773
"""[7.08, 12.04>""",0.34,11,0.31,1919,0.34,0.0057,-0.509054,0.886903
"""[12.04, 18.04>""",0.2,12,0.33,1082,0.19,0.011,1.764386,1.706886
""">=18.04""",0.06,6,0.17,352,0.06,0.0168,1.662827,2.608007


## Bin Statistics

### Positive and Negative ratios

Internally, ADM only keeps track of the total counts of positive and negative responses in each bin. Everything else is derived from those numbers. The percentages and totals are trivially derived, and the propensity is just the number of positives divided by the total. The numbers calculated here match the numbers from the datamart table exactly.

In [6]:
binningDerived = predictorbinning.select(
    pl.col("BinSymbol").alias("Range/Symbol"),
    BinPositives.alias("Positives"),
    BinNegatives.alias("Negatives"),
    (((BinPositives + BinNegatives) / (sumPositives + sumNegatives)) * 100)
    .round(2)
    .alias("Responses %"),
    ((BinPositives / sumPositives) * 100).round(2).alias("Positives %"),
    ((BinNegatives/sumNegatives) * 100).round(2).alias("Negatives %"),
    (BinPositives / (BinPositives + BinNegatives))
    .round(4)
    .alias("Propensity"),
)
binningDerived

Range/Symbol,Positives,Negatives,Responses %,Positives %,Negatives %,Propensity
str,i64,i64,f64,f64,f64,f64
"""MISSING""",0,5,0.09,0.0,0.09,0.0
"""<7.08""",7,2208,39.54,19.44,39.67,0.0032
"""[7.08, 12.04>""",11,1919,34.45,30.56,34.48,0.0057
"""[12.04, 18.04>""",12,1082,19.53,33.33,19.44,0.011
""">=18.04""",6,352,6.39,16.67,6.32,0.0168


### Lift

Lift is the ratio of the propensity in a particular bin over the average propensity. So a value of 1 is the average, larger than 1 means higher propensity, smaller means lower propensity:

In [7]:
sumPositives = pl.sum("Positives")
sumNegatives = pl.sum("Negatives")
binningDerived.select(
    "Range/Symbol",
    "Positives",
    "Negatives",
    (
        (Positives / (Positives + Negatives))
        / (sumPositives / (Positives + Negatives).sum())
    ).alias("Lift"),
)


Range/Symbol,Positives,Negatives,Lift
str,i64,i64,f64
"""MISSING""",0,5,0.0
"""<7.08""",7,2208,0.491773
"""[7.08, 12.04>""",11,1919,0.886903
"""[12.04, 18.04>""",12,1082,1.706886
""">=18.04""",6,352,2.608007


### Z-Ratio

The Z-Ratio is also a measure of the how the propensity in a bin differs from the average, but takes into account the size of the bin and thus is statistically more relevant. It represents the number of standard deviations from the average, so centers around 0. The wider the spread, the better the predictor is.
$$\frac{posFraction-negFraction}{\sqrt(\frac{posFraction*(1-posFraction)}{\sum positives}+\frac{negFraction*(1-negFraction)}{\sum negatives})}$$ 


See [cdh_utils.zRatio()](https://github.com/pegasystems/pega-datascientist-tools/blob/master/python/pdstools/utils/cdh_utils.py#L751) function in pdstools for the implementation of the formula in python

In [9]:
binningDerived.select(
    "Range/Symbol", "Positives", "Negatives", "Positives %", "Negatives %"
).with_columns(cdh_utils.zRatio(Positives, Negatives))


Range/Symbol,Positives,Negatives,Positives %,Negatives %,ZRatio
str,i64,i64,f64,f64,f64
"""MISSING""",0,5,0.0,0.09,-2.237073
"""<7.08""",7,2208,19.44,39.67,-3.051114
"""[7.08, 12.04>""",11,1919,30.56,34.48,-0.509054
"""[12.04, 18.04>""",12,1082,33.33,19.44,1.764386
""">=18.04""",6,352,16.67,6.32,1.662827


## Predictor AUC


The predictor AUC is the univariate performance of this predictor against the outcome. This too can be derived from the positives and negatives and
there is  a convenient function in pdstools to calculate it directly from the positives and negatives.

See [cdh_utls.auc_from_bincounts()](https://github.com/pegasystems/pega-datascientist-tools/blob/master/python/pdstools/utils/cdh_utils.py#L448) function to check how AUC is calculated using positives and negatives.

In [39]:
auc = cdh_utils.auc_from_bincounts(
    pos=binningDerived.get_column("Positives").to_list(),
    neg=binningDerived.get_column("Negatives").to_list(),
)


pos=np.asarray(binningDerived.get_column("Positives").to_list())
neg=np.asarray(binningDerived.get_column("Negatives").to_list())

o = np.argsort((pos / (pos + neg)))

TNR = np.cumsum(neg[o]) / np.sum(neg)
TPR = np.flip(np.cumsum(pos[o]) / np.sum(pos), axis=0)

fig = px.line(
    x=TPR, y=TNR,
    labels=dict(x='Specificity', y='Sensitivity'),
    title = f"AUC = {auc.round(3)}",
    width=700, height=700,
    range_x=[1,0],
    template='none'
)
fig.add_shape(
    type='line', line=dict(dash='dash'),
    x0=1, x1=0, y0=0, y1=1
)
fig.show()

## Predictor score and log odds

The score is calculated from the log odds which are simply the ratio of the probabilities of positives and negatives. For the actual calculation in ADM this is modified slightly to avoid division-by-zero problems and is written differently to avoid numeric instability as shown below.

In [11]:
N = binningDerived.shape[0]
binningDerived.with_columns(
    LogOds= (pl.col("Positives %") / pl.col("Negatives %")).log(),
    ModifiedLogOdds=(
        ((Positives + 1 / N).log() - (Positives + 1).sum().log())
        - ((Negatives + 1 / N).log() - (Negatives + 1).sum().log())
    )
).drop("Responses %", "Propensity")

Range/Symbol,Positives,Negatives,Positives %,Negatives %,LogOds,ModifiedLogOdds
str,i64,i64,f64,f64,f64,f64
"""MISSING""",0,5,0.0,0.09,-inf,1.653661
"""<7.08""",7,2208,19.44,39.67,-0.713262,-0.814094
"""[7.08, 12.04>""",11,1919,30.56,34.48,-0.120687,-0.231992
"""[12.04, 18.04>""",12,1082,33.33,19.44,0.539125,0.426442
""">=18.04""",6,352,16.67,6.32,0.969891,0.872108


## Propensity mapping

### Log odds contribution for all the predictors

To get to a propensity, the log odds of the relevant bins of the active predictors are added up and divided by the number of active predictors +1, then used to index in the classifier.

Below an example. From all the active predictors of the model for we pick a value (in the middle for numerics, first symbol for symbolics) and show the (modified) log odds. These log odds values are averaged (added up and divided by number of active predictors + 1), and this is the “score” that is mapped to a propensity value by the classifier (which is constructed using the PAV(A) algorithm).

In [31]:
Positives = pl.col("BinPositives")
Negatives = pl.col("BinNegatives")
groups = ["Name", "PredictorName", "Channel"]
N = Positives.count().over(groups)

def Fraction(col: pl.Expr, over):
    return (col / col.sum()).over(over)


def LogOdds(col, count, over):
    return (col + 1 / count).log() - (col + 1).sum().over(over).log()

df =  modelpredictors.select(
        Name=pl.col("PredictorName"),
        Value=pl.col("BinSymbol"),
        Bin=pl.col("BinIndex"),
        Positives=BinPositives,
        Negatives=BinNegatives,
        Type=pl.col("Type"),
        BinLowerBound=pl.col("BinLowerBound"),
        BinUpperBound=pl.col("BinUpperBound"),
    )

propensity_mapping = df.filter(pl.col("Name") !="Classifier").with_columns(
    temp=pl.list("Value").over("Name"), nbins=pl.max("Bin").over("Name")
).filter(pl.col("Bin") == (pl.col("nbins") / 2).floor().over("Name")).with_columns(
    Value=(
        pl.when(pl.col("Type") == "numeric")
        .then(((pl.col("BinLowerBound") + pl.col("BinUpperBound")) / 2))
        .otherwise(pl.col("temp").arr.get(0).str.split(",").arr.get(0))
    ).over("Name")
).drop(
    "temp","BinLowerBound", "BinUpperBound", "nbins"
).sort(
    "Name"
) 

model_with_log_odds_calculations = (
    model.collect().filter(
        (pl.col("Name") == "AutoNew36Months")
        & (pl.col("Channel") == "SMS")
        & (pl.col("EntryType") != "Inactive")
    )
    .with_columns(
        posFraction=Fraction(Positives, over=groups),
        negFraction=Fraction(Negatives, over=groups),
    )
    .with_columns(
        LogOdds=pl.col("posFraction") / pl.col("negFraction").log(),
        ModifiedLogOdds=LogOdds(Positives, N, groups) - LogOdds(Negatives, N, groups),
    )
    .select(
        Name=pl.col("PredictorName"),
        Bin=pl.col("BinIndex").cast(pl.Int64),
        LogOdds=pl.col("LogOdds"),
        ModifiedLogOdds=pl.col("ModifiedLogOdds"),
    )
)

log_odds_contributions = propensity_mapping.join(
    model_with_log_odds_calculations, on=["Name", "Bin"], how="left"
)

average_log_odds = log_odds_contributions.select(pl.sum("ModifiedLogOdds") / (log_odds_contributions.shape[0]+1)).row(0)[0]
average_log_odds_df = pl.DataFrame(schema=log_odds_contributions.schema,
             data = {"Name":"Average Log Odds",
                     "Value":None,
                     "Bin":None,
                     "Positives":None,
                     "Negatives":None,
                     "Type":None,
                     "LogOdds":None,
                     "ModifiedLogOdds": average_log_odds})
propensity_mapping = pl.concat([log_odds_contributions,
          average_log_odds_df]
          ,how="vertical")

The predicate '[(col("Type")) == (Utf8(numeric))]' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the groupby operation would. This behavior is experimental and may be subject to change


In [32]:
propensity_mapping

Name,Value,Bin,Positives,Negatives,Type,LogOdds,ModifiedLogOdds
str,str,i64,i64,i64,str,f64,f64
"""Customer.Age""","""37.16""",2.0,7.0,3042.0,"""numeric""",-0.321842,-1.10308
"""Customer.AnnualIncome""","""2827.7097520000007""",1.0,7.0,3401.0,"""numeric""",-0.394725,-1.178083
"""Customer.BusinessSegment""","""middleSegmentPlus""",1.0,10.0,3369.0,"""symbolic""",-0.553275,-0.825686
"""Customer.CLV_VALUE""","""352.52""",1.0,7.0,1379.0,"""numeric""",-0.139355,-0.275516
"""Customer.CreditScore""","""432.2""",1.0,14.0,426.0,"""numeric""",-0.151319,1.568775
"""Customer.Gender""","""M""",1.0,7.0,1735.0,"""symbolic""",-0.166809,-0.505115
"""Customer.MaritalStatus""","""Single""",1.0,6.0,1454.0,"""symbolic""",-0.12416,-0.475067
"""Customer.NetWealth""","""5458.64""",1.0,7.0,2679.0,"""numeric""",-0.265913,-0.939484
"""Customer.OrganizationLabel""","""NON-MISSING""",1.0,31.0,3646.0,"""symbolic""",-2.035502,0.23567
"""Customer.Prefix""","""Dr.""",2.0,8.0,1174.0,"""symbolic""",-0.142792,-0.021901


## Classifier

The success rate is defined as $\frac{positives}{positives+negatives}$ per bin. 

The adjusted propensity that is returned is a small modification (Laplace smoothing) to this and calculated as $\frac{0.5+positives}{1+positives+negatives}$ so empty models return a propensity of 0.5.


In [33]:
classifier = modelpredictors.filter(pl.col("EntryType") == "Classifier")

classifier = classifier.with_columns(
    Propensity=(Positives / (Positives / Negatives)),
    AdjustedPropensity=((0.5 + Positives) / (1 + Positives + Negatives)),
).select(
    [
        pl.col("BinIndex").alias("Index"),
        pl.col("BinSymbol").alias("Bin"),
        Positives.alias("Positives"),
        Negatives.alias("Negatives"),
        ((pl.cumsum("BinResponseCount") / pl.sum("BinResponseCount")) * 100).alias(
            "Cum. Total (%)"
        ),
        (pl.col("BinPropensity") * 100).alias("Propensity (%)"),
        (pl.col("AdjustedPropensity") * 100).alias("Adjusted Propensity (%)"),
        ((pl.cumsum("BinPositives") / pl.sum("BinPositives")) * 100).alias(
            "Cum Positives (%)"
        ),
        pl.col("ZRatio"),
        (pl.col("Lift") * 100).alias("Lift(%)"),
        pl.col("BinResponseCount").alias("Responses"),
    ]
)
classifier.drop("Responses")

Index,Bin,Positives,Negatives,Cum. Total (%),Propensity (%),Adjusted Propensity (%),Cum Positives (%),ZRatio,Lift(%)
i64,str,i64,i64,f64,f64,f64,f64,f64,f64
1,"""<-1.95""",0,35,0.624777,0.0,1.388889,0.0,-5.934769,0.0
2,"""[-1.95, -0.63>""",7,1939,35.362371,0.359712,0.385208,19.444444,-2.322612,55.9752
3,"""[-0.63, -0.5>""",7,1331,59.246698,0.523169,0.560119,38.888889,-0.674919,81.4109
4,"""[-0.5, -0.36>""",8,1125,79.471617,0.70609,0.749559,61.111111,0.289246,109.8755
5,"""[-0.36, -0.29>""",4,446,87.504463,0.888889,0.997783,72.222222,0.590078,138.321
6,"""[-0.29, -0.22>""",3,319,93.25241,0.931677,1.083591,80.555556,0.563599,144.9793
7,"""[-0.22, 0.99>""",2,144,95.858622,1.369863,1.70068,86.111111,0.776338,213.1659
8,""">=0.99""",5,227,100.0,2.155172,2.360515,100.0,1.700289,335.3688


In [34]:
classifier

Index,Bin,Positives,Negatives,Cum. Total (%),Propensity (%),Adjusted Propensity (%),Cum Positives (%),ZRatio,Lift(%),Responses
i64,str,i64,i64,f64,f64,f64,f64,f64,f64,i64
1,"""<-1.95""",0,35,0.624777,0.0,1.388889,0.0,-5.934769,0.0,35
2,"""[-1.95, -0.63>""",7,1939,35.362371,0.359712,0.385208,19.444444,-2.322612,55.9752,1946
3,"""[-0.63, -0.5>""",7,1331,59.246698,0.523169,0.560119,38.888889,-0.674919,81.4109,1338
4,"""[-0.5, -0.36>""",8,1125,79.471617,0.70609,0.749559,61.111111,0.289246,109.8755,1133
5,"""[-0.36, -0.29>""",4,446,87.504463,0.888889,0.997783,72.222222,0.590078,138.321,450
6,"""[-0.29, -0.22>""",3,319,93.25241,0.931677,1.083591,80.555556,0.563599,144.9793,322
7,"""[-0.22, 0.99>""",2,144,95.858622,1.369863,1.70068,86.111111,0.776338,213.1659,146
8,""">=0.99""",5,227,100.0,2.155172,2.360515,100.0,1.700289,335.3688,232


## Final Propensity

Below the classifier mapping. On the x-axis the binned scores (log odds values), on the y-axis the Propensity. Note the returned propensities are following a slightly adjusted formula, see the table above. The bin that contains the calculated score is highlighted.

The score -0.1094691 falls in bin 7 of the classifier, so for this set of inputs, the model returns a propensity of 1.70%.

In [38]:
df = result.to_pandas()
colors = ["grey", "grey", "grey", "grey", "grey", "grey", "red", "grey"]
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(
    go.Bar(x=df["Bin"], y=df["Responses"], name="Responses", marker_color=colors)
)
fig.add_trace(
    go.Scatter(
        x=df["Bin"],
        y=df["Propensity (%)"],
        yaxis="y2",
        mode="lines+markers",
        hovertemplate = '<b>%{text} %</b>',
        text = [f'Returned Propensity: {round(i,2)}' for i in df["Adjusted Propensity (%)"].to_list()],
    )
)
fig.update_layout(xaxis_title="Range", yaxis_title="Responses",width=800, height=500, template='none',
                  
)
fig.update_yaxes(title_text="Propensity (%)", secondary_y=True)
fig.layout.yaxis2.zeroline = False
fig.update_yaxes(showgrid=False)

In [36]:
fig