# Data Anonymization

In Pega CDH 8.5 and later, you can record the historical data as seen by the Adaptive Models. See [this academy challenge](https://academy.pega.com/challenge/exporting-historical-data/v4) for reference. This historical data can be used to experiment with offline models and to fine-tune the OOTB Gradient Boosting model. However, sharing it with Pega can be sensitive because it contains raw predictor data.

To this end, we provide a simple and transparent script to fully anonymize this dataset.

The `DataAnonymization` script is part of pdstools and can be imported directly.

In [1]:
# These lines are only for rendering in the docs, and are hidden through Jupyter tags
# Do not run if you're running the notebook separately

import os  
import sys
import plotly.io as pio
pio.renderers.default = "notebook_connected"

sys.path.append("../../../")
sys.path.append('../../python')

In [2]:
from pdstools import ADMDatamart
from pdstools import Config, DataAnonymization
import polars as pl

## Input data

To demonstrate this process, we're going to anonymize this toy example dataframe:

In [3]:
pl.read_ndjson('../../../../data/SampleHDS.json')

Context_Name,Customer_MaritalStatus,Customer_CLV,Customer_City,IH_Web_Inbound_Accepted_pxLastGroupID,Decision_Outcome
str,str,i64,str,str,str
"""FirstMortgage3…","""Married""",1460,"""Port Raoul""","""Account""","""Rejected"""
"""FirstMortgage3…","""Unknown""",669,"""Laurianneshire…","""AutoLoans""","""Accepted"""
"""MoneyMarketSav…","""No Resp+""",1174,"""Jacobshaven""","""Account""","""Rejected"""
"""BasicChecking""","""Unknown""",1476,"""Lindton""","""Account""","""Rejected"""
"""BasicChecking""","""Married""",1211,"""South Jimmiesh…","""DepositAccount…","""Accepted"""
"""UPlusFinPerson…","""No Resp+""",533,"""Bergeville""",,"""Rejected"""
"""BasicChecking""","""No Resp+""",555,"""Willyville""","""DepositAccount…","""Rejected"""


As you can see, this dataset consists of regular predictors, IH predictors, context keys and the outcome column. Additionally, some columns are numeric, others are strings. Let's first initialize the DataAnonymization class.

In [4]:
anon = DataAnonymization(hds_folder='../../../../data/')

By default, the class applies a set of anonymization techniques:
- Column names are remapped to a non-descriptive name
- Categorical values are hashed with a random seed
- Numerical values are normalized between 0 and 1
- Outcomes are mapped to a binary outcome
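
As a rough illustration of what these four steps amount to, here is a minimal sketch in plain Python. It is not the pdstools implementation: the helper names (`mask_names`, `hash_value`, `min_max`, `to_binary`), the use of `hashlib.blake2b` as a stand-in hash, and the toy rows are all assumptions made for this example.

```python
import hashlib

# Toy rows; column names mirror the sample data, values are illustrative.
rows = [
    {"Customer_City": "Lindton", "Customer_CLV": 1476, "Decision_Outcome": "Rejected"},
    {"Customer_City": "Bergeville", "Customer_CLV": 533, "Decision_Outcome": "Accepted"},
    {"Customer_City": "Lindton", "Customer_CLV": 1211, "Decision_Outcome": "Accepted"},
]


def mask_names(columns, prefix="PREDICTOR_"):
    """Map each original column name to a non-descriptive one."""
    return {name: f"{prefix}{i}" for i, name in enumerate(columns)}


def hash_value(value, seed=0):
    """Hash a categorical value; the seed is mixed in as a salt."""
    salt = seed.to_bytes(8, "little")
    return hashlib.blake2b(value.encode(), salt=salt, digest_size=8).hexdigest()


def min_max(values):
    """Scale numeric values linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def to_binary(outcome, positives=("Accepted", "Clicked")):
    """Collapse the outcome label to True (positive) or False (anything else)."""
    return outcome in positives


mapping = mask_names(["Customer_City", "Customer_CLV"])
clv = min_max([r["Customer_CLV"] for r in rows])
anonymized = [
    {
        mapping["Customer_City"]: hash_value(r["Customer_City"], seed=42),
        mapping["Customer_CLV"]: scaled,
        "Decision_Outcome": to_binary(r["Decision_Outcome"]),
    }
    for r, scaled in zip(rows, clv)
]
```

Equal categorical values still hash to the same token, so the anonymized data keeps the structure a model needs while hiding the raw values.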

To apply these techniques, simply call `.process()`:

In [5]:
anon.process()

filename,PREDICTOR_1,PREDICTOR_2,PREDICTOR_3,Context_Name,IH_PREDICTOR_0,Decision_Outcome
str,str,str,f64,str,str,bool
"""../../../../da…","""14877547429399…","""16817702996572…",1.2927e+19,"""21446399427533…","""32390660893490…",False
"""../../../../da…","""10546666257829…","""48667164975102…",1.4856e+19,"""21446399427533…","""17331154902477…",True
"""../../../../da…","""13645294142382…","""44729548600091…",5.6458e+17,"""72724746675646…","""32390660893490…",False
"""../../../../da…","""87208414476532…","""48667164975102…",4.0723e+18,"""12437605523518…","""32390660893490…",False
"""../../../../da…","""44918746444882…","""16817702996572…",1.4677e+19,"""12437605523518…","""43489759966379…",True
"""../../../../da…","""11250597095858…","""44729548600091…",1.633e+19,"""90811699055305…",,False
"""../../../../da…","""15241890862632…","""44729548600091…",8.7675e+18,"""12437605523518…","""43489759966379…",False


To trace the columns back to their original names, the class also stores a mapping. You do not have to provide it yourself.

In [6]:
anon.column_mapping

{'filename': 'filename',
 'Customer_City': 'PREDICTOR_1',
 'Customer_MaritalStatus': 'PREDICTOR_2',
 'Customer_CLV': 'PREDICTOR_3',
 'Context_Name': 'Context_Name',
 'IH_Web_Inbound_Accepted_pxLastGroupID': 'IH_PREDICTOR_0',
 'Decision_Outcome': 'Decision_Outcome'}
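
The dictionary maps original names to masked names, so looking up an original name from a masked one means inverting it. A minimal sketch, using the mapping printed above as a literal; the variable name `original_name` is introduced here for illustration only.

```python
# The mapping as printed above: original name -> masked name.
column_mapping = {
    "filename": "filename",
    "Customer_City": "PREDICTOR_1",
    "Customer_MaritalStatus": "PREDICTOR_2",
    "Customer_CLV": "PREDICTOR_3",
    "Context_Name": "Context_Name",
    "IH_Web_Inbound_Accepted_pxLastGroupID": "IH_PREDICTOR_0",
    "Decision_Outcome": "Decision_Outcome",
}

# Invert it: masked name -> original name.
original_name = {masked: original for original, masked in column_mapping.items()}

print(original_name["PREDICTOR_3"])  # Customer_CLV
```

The inversion is only safe because every masked name is unique; duplicate values would silently overwrite each other.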

## Configs

Each capability can be turned off individually. See below for the full list of config options, and refer to the API reference for the full description.

In [7]:
dict(zip(Config.__init__.__code__.co_varnames[1:], Config.__init__.__defaults__))

{'config_file': None,
 'hds_folder': '.',
 'use_datamart': False,
 'datamart_folder': 'datamart',
 'output_format': 'ndjson',
 'output_folder': 'output',
 'mapping_file': 'mapping.map',
 'mask_predictor_names': True,
 'mask_context_key_names': False,
 'mask_ih_names': True,
 'mask_outcome_name': False,
 'mask_predictor_values': True,
 'mask_context_key_values': True,
 'mask_ih_values': True,
 'mask_outcome_values': True,
 'context_key_label': 'Context_*',
 'ih_label': 'IH_*',
 'outcome_column': 'Decision_Outcome',
 'positive_outcomes': ['Accepted', 'Clicked'],
 'negative_outcomes': ['Rejected', 'Impression'],
 'special_predictors': ['Decision_DecisionTime',
  'Decision_OutcomeTime',
  'Decision_Rank'],
 'sample_percentage_schema_inferencing': 0.01}
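
The `context_key_label` and `ih_label` defaults look like glob patterns. Assuming they are matched that way (the actual matching logic is not shown here), columns could be sorted into groups roughly as follows. `classify` is a hypothetical helper written for this sketch, not a pdstools function.

```python
from fnmatch import fnmatchcase


def classify(
    column,
    context_key_label="Context_*",
    ih_label="IH_*",
    outcome_column="Decision_Outcome",
):
    """Assign a column to a group; assumes the labels are glob patterns."""
    if column == outcome_column:
        return "outcome"
    if fnmatchcase(column, context_key_label):
        return "context_key"
    if fnmatchcase(column, ih_label):
        return "ih"
    return "predictor"


columns = [
    "Context_Name",
    "Customer_MaritalStatus",
    "Customer_CLV",
    "Customer_City",
    "IH_Web_Inbound_Accepted_pxLastGroupID",
    "Decision_Outcome",
]
groups = {name: classify(name) for name in columns}
```

Under this reading, each `mask_*_names` and `mask_*_values` option would apply to one of these groups.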

These parameters can be changed by passing them as keyword arguments. In the following example, we:
- Keep the IH predictor names
- Keep the outcome values
- Keep the context key values
- Keep the context key names

In [8]:
anon = DataAnonymization(
    hds_folder="../../../../data/",
    mask_ih_names=False,
    mask_outcome_values=False,
    mask_context_key_values=False,
    mask_context_key_names=False,
)
anon.process()


filename,PREDICTOR_1,PREDICTOR_2,PREDICTOR_3,Context_Name,IH_Web_Inbound_Accepted_pxLastGroupID,Decision_Outcome
str,str,str,f64,str,str,str
"""../../../../da…","""71174786955093…","""10515029542116…",1.2927e+19,"""FirstMortgage3…","""16990889431113…","""Rejected"""
"""../../../../da…","""13974280506589…","""87112516097274…",1.4856e+19,"""FirstMortgage3…","""34069750237914…","""Accepted"""
"""../../../../da…","""14685336043466…","""12583966424426…",5.6458e+17,"""MoneyMarketSav…","""16990889431113…","""Rejected"""
"""../../../../da…","""48927962682339…","""87112516097274…",4.0723e+18,"""BasicChecking""","""16990889431113…","""Rejected"""
"""../../../../da…","""15714648868532…","""10515029542116…",1.4677e+19,"""BasicChecking""","""41370477092487…","""Accepted"""
"""../../../../da…","""72178353295496…","""12583966424426…",1.633e+19,"""UPlusFinPerson…",,"""Rejected"""
"""../../../../da…","""10971997341023…","""12583966424426…",8.7675e+18,"""BasicChecking""","""41370477092487…","""Rejected"""


The config can also be written to and read from a file:

In [9]:
anon.config.save_to_config_file('config.json')

In [10]:
anon = DataAnonymization(config=Config(config_file='config.json'))
anon.process()

filename,PREDICTOR_1,PREDICTOR_2,PREDICTOR_3,Context_Name,IH_Web_Inbound_Accepted_pxLastGroupID,Decision_Outcome
str,str,str,f64,str,str,str
"""../../../../da…","""28436829283577…","""71151274872333…",1.2927e+19,"""FirstMortgage3…","""11100714486754…","""Rejected"""
"""../../../../da…","""30802074112887…","""11423597689488…",1.4856e+19,"""FirstMortgage3…","""15825052952957…","""Accepted"""
"""../../../../da…","""81264572100345…","""52331112554498…",5.6458e+17,"""MoneyMarketSav…","""11100714486754…","""Rejected"""
"""../../../../da…","""12967156816832…","""11423597689488…",4.0723e+18,"""BasicChecking""","""11100714486754…","""Rejected"""
"""../../../../da…","""17889780541429…","""71151274872333…",1.4677e+19,"""BasicChecking""","""50565355077180…","""Accepted"""
"""../../../../da…","""20356805143982…","""52331112554498…",1.633e+19,"""UPlusFinPerson…",,"""Rejected"""
"""../../../../da…","""12457935825842…","""52331112554498…",8.7675e+18,"""BasicChecking""","""50565355077180…","""Rejected"""


## Exporting

Two functions handle exporting:
- `create_mapping_file()` writes the mapping file of the predictor names
- `write_to_output()` writes the processed dataframe to disk

`write_to_output()` accepts the following extensions: `["ndjson", "parquet", "arrow", "csv"]`

In [11]:
anon.create_mapping_file()
with open('mapping.map') as f:
    print(f.read())

filename=filename
Customer_City=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_CLV=PREDICTOR_3
Context_Name=Context_Name
IH_Web_Inbound_Accepted_pxLastGroupID=IH_Web_Inbound_Accepted_pxLastGroupID
Decision_Outcome=Decision_Outcome
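
Judging by the output above, the mapping file holds one `original=masked` pair per line. Assuming that layout, it could be read back into a dictionary along these lines. `parse_mapping` is a hypothetical helper, and the text is inlined here rather than read from `mapping.map` so the sketch runs on its own.

```python
# Contents of the mapping file as printed above.
mapping_text = """filename=filename
Customer_City=PREDICTOR_1
Customer_MaritalStatus=PREDICTOR_2
Customer_CLV=PREDICTOR_3
Context_Name=Context_Name
IH_Web_Inbound_Accepted_pxLastGroupID=IH_Web_Inbound_Accepted_pxLastGroupID
Decision_Outcome=Decision_Outcome
"""


def parse_mapping(text):
    """Parse 'original=masked' lines into a dict, skipping blank lines."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first '=' only, in case a masked name contains one.
        original, masked = line.split("=", 1)
        mapping[original] = masked
    return mapping


mapping = parse_mapping(mapping_text)

# List only the columns whose names were actually masked.
masked_only = {k: v for k, v in mapping.items() if k != v}
```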



In [12]:
anon.write_to_output(ext='arrow')

In [13]:
pl.read_ipc('output/hds.arrow')

PREDICTOR_1,PREDICTOR_2,PREDICTOR_3,Context_Name,IH_Web_Inbound_Accepted_pxLastGroupID,Decision_Outcome
str,str,f64,str,str,str
"""23671383996091…","""17865861892748…",1.2927e+19,"""FirstMortgage3…","""12745285533878…","""Rejected"""
"""42992408375653…","""80078157106940…",1.4856e+19,"""FirstMortgage3…","""90909392678985…","""Accepted"""
"""65668986295303…","""14069977546159…",5.6458e+17,"""MoneyMarketSav…","""12745285533878…","""Rejected"""
"""17533451343986…","""80078157106940…",4.0723e+18,"""BasicChecking""","""12745285533878…","""Rejected"""
"""60082355169441…","""17865861892748…",1.4677e+19,"""BasicChecking""","""11452039262280…","""Accepted"""
"""13589627967963…","""14069977546159…",1.633e+19,"""UPlusFinPerson…",,"""Rejected"""
"""13637770714651…","""14069977546159…",8.7675e+18,"""BasicChecking""","""11452039262280…","""Rejected"""


## Advanced: Hash functions

By default, we use [the same hashing algorithm Polars](https://pola-rs.github.io/polars/py-polars/html/reference/expressions/api/polars.Expr.hash.html#polars.Expr.hash) uses: [xxhash](https://github.com/Cyan4973/xxHash), as implemented [here](https://github.com/pola-rs/polars/blob/3f287f370b3c388ed2f3f218b2c096382548136f/polars/polars-core/src/vector_hasher.rs#L266). xxhash is fast to compute, and you can check its performance in collision, dispersion and randomness [here](https://github.com/Cyan4973/xxHash/tree/dev/tests). 

xxhash accepts four distinct seeds, but by default we set the seeds to `0`. It is possible to set the `seed` argument of the `process()` function to `'random'`, which will set all four seeds to a random integer between `0` and `1000000000`. Alternatively, it is possible to supply the four seeds manually with arguments `seed`, `seed_1`, `seed_2` and `seed_3`. 
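
To see why the seed matters, consider someone who knows the hash function and the seed: they can hash a list of likely values and look the tokens up. The sketch below illustrates this with `hashlib.blake2b` from the standard library as a stand-in for xxhash. `seeded_hash` and the candidate list are assumptions made for this example, and `secret_seed` is a fixed number standing in for a randomly drawn one.

```python
import hashlib


def seeded_hash(value, seed):
    """Stand-in for a seeded hash: the seed is used as a salt."""
    salt = seed.to_bytes(8, "little")
    return hashlib.blake2b(value.encode(), salt=salt, digest_size=8).hexdigest()


# An attacker guesses likely values and precomputes their hashes with seed 0.
candidates = ["Married", "Unknown", "No Resp+"]
lookup_seed_0 = {seeded_hash(v, seed=0): v for v in candidates}

# Data hashed with the default seed can be reversed through the lookup table.
token_default = seeded_hash("Married", seed=0)
recovered = lookup_seed_0.get(token_default)

# Data hashed with an unknown seed does not appear in that table.
secret_seed = 123_456_789
token_secret = seeded_hash("Married", seed=secret_seed)
not_recovered = lookup_seed_0.get(token_secret)
```

Here `recovered` is `"Married"` while `not_recovered` is `None`, which is the practical argument for a random seed when the data leaves your environment.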

If the xxhash with (random) seed(s) is not deemed sufficiently secure, it is possible to use your own hashing algorithm.

Note that a custom algorithm runs as Python code rather than native Polars code, so it will be _significantly_ slower. Nonetheless, it is possible.

As an example, this is how one would use sha3_256:

In [14]:
from hashlib import sha3_256

anon.process(algorithm=lambda x: sha3_256(x.encode()).hexdigest())

ComputeError: AttributeError: 'int' object has no attribute 'encode'
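
The `AttributeError` above is raised because the lambda receives an integer, which has no `.encode()` method. One possible fix, sketched here independently of pdstools, is to convert each value to a string before hashing. `sha3_hash` is a name introduced for this sketch, and whether `process()` then runs end to end with such a callable has not been verified here.

```python
from hashlib import sha3_256


def sha3_hash(value):
    """Hash any value by first converting it to its string form."""
    return sha3_256(str(value).encode()).hexdigest()


# Works for strings and integers alike.
city_token = sha3_hash("Port Raoul")
clv_token = sha3_hash(1460)
```

Under that assumption, the callable could be passed as `algorithm=sha3_hash`. Note that `str(None)` is `"None"`, so missing values would be hashed as well unless they are filtered out first.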