# Example data anonymization

From Pega CDH 8.5 onward, you can record the historical data as seen by the Adaptive Models. See [this academy challenge](https://academy.pega.com/challenge/exporting-historical-data/v4) for reference. This historical data can be used to experiment with offline models, and also to fine-tune the OOTB Gradient Boosting model. However, sharing this data with Pega can be sensitive because it contains raw predictor data.

To this end, we provide a simple and transparent script to fully anonymize this dataset.

The DataAnonymization script is part of pdstools, so you can import it directly.

In [1]:
from pdstools import ADMDatamart
from pdstools import Config, DataAnonymization
import polars as pl

# Input data

To demonstrate this process, we're going to anonymize this toy example dataframe:

In [2]:
pl.read_ndjson('../../data/SampleHDS.json')

Context_Name,Customer_MaritalStatus,Customer_CLV,Customer_City,IH_Web_Inbound_Accepted_pxLastGroupID,Decision_Outcome
str,str,i64,str,str,str
"""FirstMortgage3...","""Married""",1460,"""Port Raoul""","""Account""","""Rejected"""
"""FirstMortgage3...","""Unknown""",669,"""Laurianneshire...","""AutoLoans""","""Accepted"""
"""MoneyMarketSav...","""No Resp+""",1174,"""Jacobshaven""","""Account""","""Rejected"""
"""BasicChecking""","""Unknown""",1476,"""Lindton""","""Account""","""Rejected"""
"""BasicChecking""","""Married""",1211,"""South Jimmiesh...","""DepositAccount...","""Accepted"""
"""UPlusFinPerson...","""No Resp+""",533,"""Bergeville""",,"""Rejected"""
"""BasicChecking""","""No Resp+""",555,"""Willyville""","""DepositAccount...","""Rejected"""


As you can see, this dataset consists of regular predictors, IH predictors, context keys and the outcome column. Additionally, some columns are numeric, others are strings. Let's first initialize the DataAnonymization class.

In [3]:
anon = DataAnonymization(hdr_folder='../../data')

By default, the class applies a set of anonymization techniques:
- Column names are remapped to a non-descriptive name
- Categorical values are hashed with a random seed
- Numerical values are normalized between 0 and 1
- Outcomes are mapped to a binary outcome.
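
As a rough illustration of what the value-level techniques amount to, here is a minimal standard-library sketch. It is not the pdstools implementation: the helper names (`min_max_normalize`, `hash_with_salt`, `to_binary_outcome`) and the use of salted SHA-256 are assumptions made for this example only. The input numbers are the `Customer_CLV` values from the toy dataframe above.

```python
import hashlib
import secrets


def min_max_normalize(values):
    """Rescale numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def hash_with_salt(value, salt):
    """Hash a categorical value together with a random salt; keep missing values missing."""
    if value is None:
        return None
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


def to_binary_outcome(value, positives=("Accepted", "Clicked"), negatives=("Rejected", "Impression")):
    """Map an outcome label to True/False; labels in neither list become None."""
    if value in positives:
        return True
    if value in negatives:
        return False
    return None


# Customer_CLV values from the toy dataframe
clv = [1460, 669, 1174, 1476, 1211, 533, 555]
normalized = min_max_normalize(clv)  # first value rounds to 0.983033

# A fresh random salt per run means hashes are not comparable across runs
salt = secrets.token_bytes(16)
hashed = [hash_with_salt(s, salt) for s in ["Married", "Unknown", "No Resp+", None]]

outcomes = [to_binary_outcome(o) for o in ["Rejected", "Accepted", "Pending"]]
```

Because the salt in this sketch is drawn at random, the same category hashes to the same token within one run but to a different token in the next, which matches the way the hashed values differ between the outputs shown in this notebook.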

To apply these techniques, simply call `.process()`:

In [4]:
anon.process()

PREDICTOR_0,PREDICTOR_1,PREDICTOR_2,CK_PREDICTOR_0,IH_PREDICTOR_0,OUTCOME
f64,str,str,str,str,bool
0.983033,"""98204418375119...","""13648188728621...","""13393724272505...","""99152283876968...",False
0.144221,"""40194439864190...","""34289627369783...","""13393724272505...","""14569848895294...",True
0.679745,"""74996261949779...","""82530010984907...","""58821356644788...","""99152283876968...",False
1.0,"""17508757597135...","""34289627369783...","""12710223883496...","""99152283876968...",False
0.718982,"""64355859976386...","""13648188728621...","""12710223883496...","""67832225186728...",True
0.0,"""91446674384419...","""82530010984907...","""16136027877032...",,False
0.02333,"""20772971755365...","""82530010984907...","""12710223883496...","""67832225186728...",False


To trace the columns back to their original names, the class also keeps a mapping. This mapping does not have to be provided.

In [5]:
anon.column_mapping

{'Customer_CLV': 'PREDICTOR_0',
 'Customer_City': 'PREDICTOR_1',
 'Customer_MaritalStatus': 'PREDICTOR_2',
 'Context_Name': 'CK_PREDICTOR_0',
 'IH_Web_Inbound_Accepted_pxLastGroupID': 'IH_PREDICTOR_0',
 'Decision_Outcome': 'OUTCOME'}
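
If you need to go from masked names back to the originals, inverting a dictionary shaped like the one above is enough. The sketch below copies the mapping as a literal; `restore_names` is a hypothetical helper for this example, not part of pdstools.

```python
# Mapping copied from the output above (original name -> masked name)
column_mapping = {
    "Customer_CLV": "PREDICTOR_0",
    "Customer_City": "PREDICTOR_1",
    "Customer_MaritalStatus": "PREDICTOR_2",
    "Context_Name": "CK_PREDICTOR_0",
    "IH_Web_Inbound_Accepted_pxLastGroupID": "IH_PREDICTOR_0",
    "Decision_Outcome": "OUTCOME",
}

# Invert it: masked name -> original name
reverse_mapping = {masked: original for original, masked in column_mapping.items()}


def restore_names(masked_columns, reverse):
    """Translate masked column names back; names not in the mapping pass through unchanged."""
    return [reverse.get(name, name) for name in masked_columns]


restored = restore_names(["PREDICTOR_0", "IH_PREDICTOR_0", "OUTCOME", "Context_Name"], reverse_mapping)
```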

# Configs

Each capability can be turned off individually. See below for the full list of config options, and refer to the API reference for the full descriptions.

In [6]:
dict(zip(Config.__init__.__code__.co_varnames[1:], Config.__init__.__defaults__))

{'config_file': None,
 'hdr_folder': '.',
 'use_datamart': False,
 'datamart_folder': 'datamart',
 'output_format': 'ndjson',
 'output_folder': 'output',
 'mapping_file': 'mapping.map',
 'mask_predictor_names': True,
 'mask_context_key_names': True,
 'mask_ih_names': True,
 'mask_outcome_name': True,
 'mask_predictor_values': True,
 'mask_context_key_values': True,
 'mask_ih_values': True,
 'mask_outcome_values': True,
 'context_key_label': 'Context_*',
 'ih_label': 'IH_*',
 'outcome_column': 'Decision_Outcome',
 'positive_outcomes': ['Accepted', 'Clicked'],
 'negative_outcomes': ['Rejected', 'Impression'],
 'special_predictors': ['Decision_DecisionTime', 'Decision_OutcomeTime']}

To change these parameters, pass them as keyword arguments. In the following example, we:
- Keep the IH predictor names
- Keep the outcome values
- Keep the context key values
- Keep the context key predictor names

In [7]:
anon = DataAnonymization(
    hdr_folder='../../data',
    mask_ih_names=False,
    mask_outcome_values=False,
    mask_context_key_values=False,
    mask_context_key_names=False,
)
anon.process()

PREDICTOR_0,PREDICTOR_1,PREDICTOR_2,Context_Name,IH_Web_Inbound_Accepted_pxLastGroupID,OUTCOME
f64,str,str,str,str,str
0.983033,"""13971681906009...","""15176930337713...","""FirstMortgage3...","""24782110777461...","""Rejected"""
0.144221,"""69577294999355...","""55473831953643...","""FirstMortgage3...","""30104198515712...","""Accepted"""
0.679745,"""14691701592044...","""14505132909284...","""MoneyMarketSav...","""24782110777461...","""Rejected"""
1.0,"""14082263256326...","""55473831953643...","""BasicChecking""","""24782110777461...","""Rejected"""
0.718982,"""35742398451852...","""15176930337713...","""BasicChecking""","""41946855300578...","""Accepted"""
0.0,"""39219711095332...","""14505132909284...","""UPlusFinPerson...",,"""Rejected"""
0.02333,"""14857210303222...","""14505132909284...","""BasicChecking""","""41946855300578...","""Rejected"""
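
To make the effect of the name-masking flags concrete, the following sketch shows one way such flags could drive the renaming. It is an assumption-laden illustration, not the library's code: `build_column_mapping` is a hypothetical function, the glob matching via `fnmatch` is a guess at how labels like `Context_*` and `IH_*` are interpreted, and the alphabetical numbering is inferred from the mapping shown earlier.

```python
from fnmatch import fnmatchcase


def build_column_mapping(
    columns,
    *,
    mask_predictor_names=True,
    mask_context_key_names=True,
    mask_ih_names=True,
    mask_outcome_name=True,
    context_key_label="Context_*",
    ih_label="IH_*",
    outcome_column="Decision_Outcome",
):
    """Return {original name: output name}, masking only the enabled groups."""
    counters = {"PREDICTOR": 0, "CK_PREDICTOR": 0, "IH_PREDICTOR": 0}
    mapping = {}
    # Assumption: columns are numbered in alphabetical order within each group
    for col in sorted(columns):
        if col == outcome_column:
            mapping[col] = "OUTCOME" if mask_outcome_name else col
            continue
        if fnmatchcase(col, context_key_label):
            kind, enabled = "CK_PREDICTOR", mask_context_key_names
        elif fnmatchcase(col, ih_label):
            kind, enabled = "IH_PREDICTOR", mask_ih_names
        else:
            kind, enabled = "PREDICTOR", mask_predictor_names
        if enabled:
            mapping[col] = f"{kind}_{counters[kind]}"
            counters[kind] += 1
        else:
            mapping[col] = col  # flag switched off: keep the original name
    return mapping


columns = [
    "Context_Name",
    "Customer_MaritalStatus",
    "Customer_CLV",
    "Customer_City",
    "IH_Web_Inbound_Accepted_pxLastGroupID",
    "Decision_Outcome",
]

default_mapping = build_column_mapping(columns)
partial_mapping = build_column_mapping(columns, mask_ih_names=False, mask_context_key_names=False)
```

With the two flags switched off, `partial_mapping` leaves `Context_Name` and `IH_Web_Inbound_Accepted_pxLastGroupID` untouched while the regular predictors and the outcome column are still renamed, mirroring the header of the output above.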


The configs can also be written to and read from a file:

In [8]:
anon.config.save_to_config_file('config.json')

In [9]:
anon = DataAnonymization(config=Config(config_file='config.json'))
anon.process()

PREDICTOR_0,PREDICTOR_1,PREDICTOR_2,Context_Name,IH_Web_Inbound_Accepted_pxLastGroupID,OUTCOME
f64,str,str,str,str,str
0.983033,"""13749488081303...","""26147176613051...","""FirstMortgage3...","""14921383665438...","""Rejected"""
0.144221,"""88612884249396...","""71226778007428...","""FirstMortgage3...","""98996640880559...","""Accepted"""
0.679745,"""15135653830369...","""10609058199223...","""MoneyMarketSav...","""14921383665438...","""Rejected"""
1.0,"""86084059998400...","""71226778007428...","""BasicChecking""","""14921383665438...","""Rejected"""
0.718982,"""94223062024558...","""26147176613051...","""BasicChecking""","""84282893704966...","""Accepted"""
0.0,"""16594334543608...","""10609058199223...","""UPlusFinPerson...",,"""Rejected"""
0.02333,"""12843537306871...","""10609058199223...","""BasicChecking""","""84282893704966...","""Rejected"""


# Exporting
Two functions handle exporting:
- `create_mapping_file()` writes the mapping file of the predictor names
- `write_to_output()` writes the processed dataframe to disk

`write_to_output()` accepts the following extensions: `["ndjson", "parquet", "arrow", "csv"]`

In [16]:
anon.create_mapping_file()
with open('mapping.map') as f:
    print(f.read())

Customer_CLV=PREDICTOR_0
Customer_City=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Context_Name=CK_PREDICTOR_0
IH_Web_Inbound_Accepted_pxLastGroupID=IH_PREDICTOR_0
Decision_Outcome=OUTCOME
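
The mapping file printed above is plain `original=masked` text, one pair per line, so it is straightforward to load back into a dictionary. The sketch below assumes exactly that layout; `parse_mapping` is a hypothetical helper, not a pdstools function, and it parses a string literal so that it runs without the file on disk.

```python
def parse_mapping(text):
    """Parse 'original=masked' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, in case a masked name ever contains one
        original, _, masked = line.partition("=")
        mapping[original] = masked
    return mapping


sample = """Customer_CLV=PREDICTOR_0
Customer_City=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Context_Name=CK_PREDICTOR_0
IH_Web_Inbound_Accepted_pxLastGroupID=IH_PREDICTOR_0
Decision_Outcome=OUTCOME

"""

parsed = parse_mapping(sample)
```

To use it on the real file, read the contents first, for example `parse_mapping(open('mapping.map').read())`.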



In [11]:
anon.write_to_output(ext='arrow')

In [12]:
pl.read_ipc('output/hds.arrow')

PREDICTOR_0,PREDICTOR_1,PREDICTOR_2,Context_Name,IH_Web_Inbound_Accepted_pxLastGroupID,OUTCOME
f64,str,str,str,str,str
0.983033,"""13709236032899...","""65280124093365...","""FirstMortgage3...","""24209164365759...","""Rejected"""
0.144221,"""61523008642021...","""84822406214008...","""FirstMortgage3...","""70814434193761...","""Accepted"""
0.679745,"""73828690479369...","""17057876247860...","""MoneyMarketSav...","""24209164365759...","""Rejected"""
1.0,"""14834594955202...","""84822406214008...","""BasicChecking""","""24209164365759...","""Rejected"""
0.718982,"""14416422926530...","""65280124093365...","""BasicChecking""","""10444479079610...","""Accepted"""
0.0,"""15868705575342...","""17057876247860...","""UPlusFinPerson...",,"""Rejected"""
0.02333,"""32076854049289...","""17057876247860...","""BasicChecking""","""10444479079610...","""Rejected"""
