# Take-Home Assignment - PII Detection

Human Native's operations team currently reviews huge volumes of content for Personally Identifiable Information (PII), and the amount of review is growing quickly. We need to be able to automate this review robustly!

This notebook walks through:
- Establishing a _baseline_ - incredibly important in all ML tasks. We discuss contemporary methods that already exist (services, Presidio) to inspire our model choice.
- Training a custom transformer for token-level PII detection
- Comparison of the results

## Assumptions before beginning

- We will focus on textual data
- We will only focus on these PII entities:
    - "NAME"
    - "EMAIL"
    - "LOCATION"
    - "PHONE"
    - "URL"
- We want to classify WORDS (and therefore tokens) as containing PII, as opposed to whether a small chunk of text contains PII _somewhere_ within it.
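To make the word-level framing concrete, here is a minimal sketch of turning character spans into per-word BIO tags. The helper `word_labels`, the `(start, end, label)` span format and the whitespace word splitting are all assumptions made for illustration; the real pipeline may tokenize and label differently.

```python
import re

def word_labels(text, spans):
    """Label each whitespace-delimited word with a BIO tag.

    spans: list of (start, end, label) character offsets, end exclusive.
    This format is an assumption for illustration.
    """
    labelled = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A word belongs to an entity if the two character ranges overlap.
            if match.start() < end and match.end() > start:
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        labelled.append((match.group(), tag))
    return labelled

text = "Hi, my name is John Smith and I live in New York."
name_start = text.index("John Smith")
loc_start = text.index("New York")
spans = [
    (name_start, name_start + len("John Smith"), "NAME"),
    (loc_start, loc_start + len("New York"), "LOCATION"),
]

labels = word_labels(text, spans)
for word, tag in labels:
    print(f"{word:10} {tag}")
```

A chunk-level classifier would only output a single "has PII" flag for this sentence; the word-level view tells us which words to redact.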

In [1]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

## Establishing a Baseline

Microsoft Presidio (https://microsoft.github.io/presidio/) is an awesome SDK that provides PII detection in a really easy-to-use way!

We have written some small functions that make it super simple to use the functionality of Presidio (see `src/presidio`).

Below, we see examples of Presidio Detection:


In [3]:
from src.presidio.detector import presidio_detect

sample_text = "Hi, I'm Mac Walker. Contact: mac@humannative.com (hopefully (; ?), 07123 456789"
results = presidio_detect(sample_text)
for result in results:
    print(result)

type: EMAIL_ADDRESS, start: 29, end: 48, score: 1.0
type: PERSON, start: 8, end: 18, score: 0.85
type: URL, start: 33, end: 48, score: 0.5
type: PHONE_NUMBER, start: 67, end: 79, score: 0.4


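As we can see, Presidio takes in a string of text and returns character-level spans (start and end offsets) showing WHERE each piece of PII occurs and WHAT kind of PII it is, along with a confidence score. Note that spans can overlap: the URL hit at 33-48 is the domain part of the email address at 29-48.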
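As a sanity check on those offsets, the sketch below slices the sample text with the spans printed above and maps Presidio's entity names onto the five labels used in this notebook. The `Detection` dataclass is a hypothetical stand-in for whatever `presidio_detect` returns, and `LABEL_MAP`, `extract` and `min_score` are assumptions introduced here for illustration.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    # Hypothetical stand-in for one detection result; field names are assumed.
    entity_type: str
    start: int
    end: int
    score: float

# Assumed mapping from Presidio entity names to this notebook's labels.
LABEL_MAP = {
    "PERSON": "NAME",
    "EMAIL_ADDRESS": "EMAIL",
    "LOCATION": "LOCATION",
    "PHONE_NUMBER": "PHONE",
    "URL": "URL",
}

sample_text = "Hi, I'm Mac Walker. Contact: mac@humannative.com (hopefully (; ?), 07123 456789"

# Values copied from the output printed above.
detections = [
    Detection("EMAIL_ADDRESS", 29, 48, 1.0),
    Detection("PERSON", 8, 18, 0.85),
    Detection("URL", 33, 48, 0.5),
    Detection("PHONE_NUMBER", 67, 79, 0.4),
]

def extract(text, detections, min_score=0.0):
    """Return (label, matched_text, score) tuples in reading order."""
    out = []
    for d in sorted(detections, key=lambda d: d.start):
        label = LABEL_MAP.get(d.entity_type)
        # Skip entity types we do not care about and low-confidence hits.
        if label is None or d.score < min_score:
            continue
        out.append((label, text[d.start:d.end], d.score))
    return out

extracted = extract(sample_text, detections)
for label, value, score in extracted:
    print(f"{label:8} {value!r} (score {score})")
```

Raising `min_score` trades recall for precision: with `min_score=0.5` the phone number (score 0.4) would be dropped, which is exactly the kind of miss we want to avoid.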

We therefore need labelled data to test against. We generate dummy data (please see `src/data` for dataset generation).

Each generated entry represents a blog post and contains one of several types of PII (given earlier) inserted in different formats (embedded, raw, standalone, or none). The original dataset is saved with the following columns:

```csv
dataset_id, data_id, value, flag, pii_mode, pii_spans
```

The different formats of PII insertion are:

- Embedded: PII is integrated naturally into a sentence to simulate real-world usage.
  Example: "Hi, my name is John Smith and I live in New York."

- Raw: PII is appended to an otherwise unrelated piece of text without much context.
  Example: "The project launched last year john.smith@example.com"

- Standalone: The PII appears as a separate sentence or fragment, disconnected from any surrounding text.
  Example: "john.smith@example.com. The project was completed last year."

- None: The text contains no PII at all. This serves as a control group to test false positives.
  Example: "The project deadline was moved to next Thursday due to client feedback."




We used different insertion techniques in our data generation because of the nature of transformer models. As we are relying on `faker`-generated data, we want to embed PII in a way that closely resembles real-world usage, hence the use of actual words and sentences that make sense (e.g. "I live in New York"). However, we did not want the model to just pick up on spurious relationships (e.g. predicting PII on any tokens after the string "i live in").
The diversity in PII injection styles is intended to make the model learn to detect PII entities based on their actual semantic and syntactic characteristics, as opposed to the phrasing context produced by the data generator.
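The insertion modes can be sketched as follows. This is not the generator in `src/data`: the function `inject`, the sentence template and the `(start, end, label)` span format are illustrative assumptions, and the PII value is hard-coded rather than produced by `faker`.

```python
def inject(base_text, pii_value, pii_label, mode):
    """Return (text, spans) for one synthetic entry.

    Templates and span format are illustrative assumptions.
    """
    if mode == "none":
        # Control group: no PII, so no spans.
        return base_text, []
    if mode == "embedded":
        # PII sits inside a natural-sounding sentence.
        prefix = "You can reach me at "
        text = f"{prefix}{pii_value}. {base_text}"
        start = len(prefix)
    elif mode == "raw":
        # PII is appended to unrelated text with no context.
        text = f"{base_text} {pii_value}"
        start = len(base_text) + 1
    elif mode == "standalone":
        # PII is its own fragment ahead of the text.
        text = f"{pii_value}. {base_text}"
        start = 0
    else:
        raise ValueError(f"unknown mode: {mode}")
    return text, [(start, start + len(pii_value), pii_label)]

base = "The project was completed last year."
entries = {
    mode: inject(base, "john.smith@example.com", "EMAIL", mode)
    for mode in ("embedded", "raw", "standalone", "none")
}
for mode, (text, spans) in entries.items():
    print(f"{mode:10} {spans} {text!r}")
```

Recording the span at generation time means the gold labels are exact by construction, rather than recovered afterwards by searching the text.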

Let us, therefore, inspect Presidio's accuracy on our test dataset:

In [5]:
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
import time
from src.presidio.detector import presidio_detect

df = pd.read_csv("data/test/test_Data.csv")

presidio_flags = []

start_time = time.time()

for i, row in df.iterrows():
    text = row["value"]
    try:
        results = presidio_detect(text)
        presidio_flags.append(int(len(results) > 0))
    except Exception as e:
        print(f"Error on row {i}: {e}")
        presidio_flags.append(0)

end_time = time.time()

print(f"\n⏱️ Total prediction time: {end_time - start_time:.2f} seconds")
print(f"⏱️ Average time per sample: {(end_time - start_time) / len(df):.4f} seconds")

df["default_flag"] = presidio_flags



⏱️ Total prediction time: 27.11 seconds
⏱️ Average time per sample: 0.0271 seconds


In [6]:
print("\n Default Presidio Classification Report on Test Dataset:\n")
print(classification_report(df["flag"], df["default_flag"], target_names=["no_pii", "has_pii"]))

print("\n Confusion Matrix:")
print(confusion_matrix(df["flag"], df["default_flag"]))


 Default Presidio Classification Report on Test Dataset:

              precision    recall  f1-score   support

      no_pii       0.77      0.96      0.86       250
     has_pii       0.99      0.90      0.94       750

    accuracy                           0.92      1000
   macro avg       0.88      0.93      0.90      1000
weighted avg       0.93      0.92      0.92      1000


 Confusion Matrix:
[[241   9]
 [ 72 678]]


This gives us a solid baseline to compare against:
- 241 true negatives (TN)
- 678 true positives (TP)
- 9 false positives (FP)
- 72 false negatives (FN) - this is not good!

Our recall on `has_pii` is 90% (678 of 750) - we really want to increase this!
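For reference, the figures in the classification report can be recomputed directly from the confusion matrix. This is a small worked check using only the four counts above; the variable names are my own.

```python
# Counts read from the confusion matrix: rows are true labels, columns are predictions.
tn, fp, fn, tp = 241, 9, 72, 678

has_pii_recall = tp / (tp + fn)       # share of real PII entries we caught
has_pii_precision = tp / (tp + fp)    # share of PII flags that were correct
no_pii_recall = tn / (tn + fp)        # share of clean entries left unflagged
no_pii_precision = tn / (tn + fn)     # share of "clean" verdicts that were correct
miss_rate = fn / (tp + fn)            # share of real PII entries we missed

print(f"has_pii recall:    {has_pii_recall:.3f}")
print(f"has_pii precision: {has_pii_precision:.3f}")
print(f"no_pii recall:     {no_pii_recall:.3f}")
print(f"no_pii precision:  {no_pii_precision:.3f}")
print(f"miss rate:         {miss_rate:.3f}")
```

The low `no_pii` precision is the same problem seen from the other side: roughly one in four entries that Presidio calls clean actually contains PII.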

In our case, what we care about most is reducing false negatives. It is _much_ better to flag something as PII when it isn't than to miss actual PII:

1. Flag PII that isn't actually PII -> slightly reduce the information contained within the training dataset we provide to AI labs
2. Don't flag PII that is actually PII -> possible legal culpability!
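One way to encode this asymmetry in a single number is an F-beta score with beta > 1, which weights recall more heavily than precision. The choice of beta = 2 below is an assumption made for illustration, not something fixed by the task.

```python
def f_beta(tp, fp, fn, beta):
    """F-beta from raw counts; beta > 1 penalises false negatives more."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Default Presidio counts from the confusion matrix above.
tp, fp, fn = 678, 9, 72

f1 = f_beta(tp, fp, fn, beta=1)
f2 = f_beta(tp, fp, fn, beta=2)

print(f"F1: {f1:.3f}")
print(f"F2: {f2:.3f}")
```

Because the baseline's errors are mostly false negatives, its F2 comes out lower than its F1, which matches our intuition that the baseline is weaker than the headline F1 suggests.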

Furthermore, our classification results only check whether our model detects PII in a piece of text, not whether we EXACTLY capture where the PII is. This is something to be improved upon in next steps.
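A possible next step is to score the spans themselves. The sketch below shows two simple options: exact-match precision/recall over spans, and the fraction of gold PII characters that were covered. The function names, the `(start, end, label)` span format and the toy gold/predicted spans are assumptions for illustration; they are not read from `pii_spans`.

```python
def span_scores(gold, predicted):
    """Exact-match precision and recall over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    hits = len(gold_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 1.0
    recall = hits / len(gold_set) if gold_set else 1.0
    return precision, recall

def char_recall(gold, predicted):
    """Fraction of gold PII characters covered by any predicted span."""
    gold_chars = {i for s, e, _ in gold for i in range(s, e)}
    pred_chars = {i for s, e, _ in predicted for i in range(s, e)}
    return len(gold_chars & pred_chars) / len(gold_chars) if gold_chars else 1.0

# Toy example: the phone number is only partially detected.
gold = [(8, 18, "NAME"), (29, 48, "EMAIL"), (67, 79, "PHONE")]
predicted = [(8, 18, "NAME"), (29, 48, "EMAIL"), (33, 48, "URL"), (67, 73, "PHONE")]

precision, recall = span_scores(gold, predicted)
coverage = char_recall(gold, predicted)
print(f"exact-match precision: {precision:.2f}")
print(f"exact-match recall:    {recall:.2f}")
print(f"character recall:      {coverage:.2f}")
```

At the entry level this toy prediction would count as a plain true positive, even though half of the phone number would survive redaction; span-level metrics expose that.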

This motivates a different approach. Luckily, Presidio provides a framework for plugging in your own custom model as the 'NLP backbone' of PII detection.

I have trained a model (see `src/training`) to do exactly this! We use a DistilBERT model for NER (token classification) from Hugging Face, fine-tuned on the data we created in `data/Data.csv`.

Let us inspect its performance below:

In [7]:
from src.evaluation.token_classifier import load_bert_token_model, evaluate_on_dataframe

df = pd.read_csv("data/test/test_Data.csv")
tokenizer, model = load_bert_token_model()

start_time = time.time()

df = evaluate_on_dataframe(df, tokenizer, model)

end_time = time.time()
total_time = end_time - start_time
avg_time = total_time / len(df)

print(f"\n⏱️ Total prediction time: {total_time:.2f} seconds")
print(f"⏱️ Average time per sample: {avg_time:.4f} seconds")

print("\n📊 Token Model Classification Report:")
print(classification_report(df["flag"], df["token_model_flag"], target_names=["no_pii", "has_pii"]))

print("\n🔄 Confusion Matrix:")
print(confusion_matrix(df["flag"], df["token_model_flag"]))


Evaluating Token Classifier: 100%|██████████| 1000/1000 [02:34<00:00,  6.47it/s]


⏱️ Total prediction time: 154.54 seconds
⏱️ Average time per sample: 0.1545 seconds

📊 Token Model Classification Report:
              precision    recall  f1-score   support

      no_pii       1.00      1.00      1.00       250
     has_pii       1.00      1.00      1.00       750

    accuracy                           1.00      1000
   macro avg       1.00      1.00      1.00      1000
weighted avg       1.00      1.00      1.00      1000


🔄 Confusion Matrix:
[[250   0]
 [  0 750]]





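This model clearly performs very well on the test set: no false positives and no false negatives, at the cost of being roughly 5.7x slower per sample than default Presidio (0.1545 s vs 0.0271 s).

That being said, we should not be overly optimistic. Both the training and test data are synthetic, and the structure of the toy data probably lends itself to this task. It would be much better to test the model on real-world data that has been assessed with humans in the loop, e.g. the data from the operations team.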