## PII data perturbation demo

In this demo we call Presidio (through its Python interface) and then replace the detected entities with fake ones, using the same techniques as the `FakeDataGenerator` object.

The `PresidioPerturb` class is a wrapper on top of `FakeDataGenerator` that accepts a Presidio analyzer response and creates fake sentences based on the original ones.
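
The replacement step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the `PresidioPerturb` implementation: `Span`, `replace_spans`, and the `fake_values` mapping are hypothetical names, and the spans are built by hand rather than taken from a real analyzer response.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one detected entity (not a Presidio class)."""
    entity_type: str
    start: int
    end: int


def replace_spans(text, spans, fake_values):
    """Replace each detected span with a fake value for its entity type."""
    # Work from right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        fake = fake_values.get(span.entity_type)
        if fake is None:
            continue  # no fake value available: leave the original text in place
        text = text[:span.start] + fake + text[span.end:]
    return text


def make_span(text, entity_type, value):
    """Build a Span by locating `value` in `text` (illustration only)."""
    start = text.index(value)
    return Span(entity_type, start, start + len(value))


text = "Hi my name is Doug Funny and this is my website: https://www.dougf.io/"
spans = [
    make_span(text, "PERSON", "Doug Funny"),
    make_span(text, "URL", "https://www.dougf.io/"),
]
# Hand-picked fake values; the real class samples them from the fake PII data frame.
fake_values = {"PERSON": "Jane Roe", "URL": "https://example.org/"}
perturbed = replace_spans(text, spans, fake_values)
print(perturbed)
```

Replacing from the last span to the first avoids recomputing offsets when a fake value is longer or shorter than the text it replaces.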


In [None]:
# install presidio via pip if not yet installed

#!pip install presidio-analyzer
#!pip install presidio-anonymizer

In [None]:
from presidio_analyzer import AnalyzerEngine
from presidio_evaluator.data_generator.presidio_perturb import PresidioPerturb

import pandas as pd

In [None]:
# Set up the fake PII data frame

fake_pii_csv = '../presidio_evaluator/data_generator/raw_data/FakeNameGenerator.com_3000.csv'

fake_pii_df = pd.read_csv(fake_pii_csv, encoding='utf-8')
fake_pii_df.head()

In [None]:
# Instantiate Presidio Analyzer

analyzer = AnalyzerEngine()

In [None]:
presidio_perturb = PresidioPerturb(fake_pii_df=fake_pii_df)

In [None]:
original_text = "Hi my name is Doug Funny and this is my website: https://www.dougf.io/"

presidio_response = analyzer.analyze(text=original_text, language='en')
presidio_response


In [None]:
# Simple perturbation

presidio_perturb.perturb(original_text=original_text, presidio_response=presidio_response, count=5)

In [None]:
# Restrict name sets
presidio_perturb.perturb(original_text=original_text, presidio_response=presidio_response, count=5,
                         namesets=['Dutch'])


In [None]:
# Restrict name set and gender
presidio_perturb.perturb(original_text=original_text,
                         presidio_response=presidio_response,
                         count=500,
                         namesets=['American','Brazil'], genders=['female'])

In [None]:
# When Presidio fails to detect an entity, the original value is left unchanged in the fake samples!

text = "Our son asdfhlk used to work in Germany"

response = analyzer.analyze(text=text, language='en')
print(f"Presidio's response: {response}")


fake_samples = presidio_perturb.perturb(original_text=text, presidio_response=response, count=5)
print("-------------\nFake examples:\n")
print(*fake_samples, sep="\n")
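
Because undetected entities pass through unchanged, it can help to check generated samples for original values that survived. The helper below is a minimal sketch, not part of Presidio or `presidio_evaluator`: `find_leaks` is a hypothetical name, and the sample sentences are hand-written for illustration rather than real output of `perturb`.

```python
def find_leaks(samples, sensitive_values):
    """Return (sample, value) pairs where an original value survived perturbation."""
    return [
        (sample, value)
        for sample in samples
        for value in sensitive_values
        if value in sample
    ]


# Hand-written illustrative samples (not actual PresidioPerturb output).
illustrative_samples = [
    "Our son asdfhlk used to work in France",
    "Our son asdfhlk used to work in Japan",
]

# "asdfhlk" was not detected as a name, so it appears in every sample;
# "Germany" was detected and replaced, so it does not.
leaks = find_leaks(illustrative_samples, ["asdfhlk", "Germany"])
for sample, value in leaks:
    print(f"{value!r} still present in: {sample}")
```

This is a plain substring check, so it only catches values that survive verbatim.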