# LLM Gateway for PII Detection

A common concern when adopting LLMs for enterprise use cases is data privacy.

While open-weight models are a great option and *should be trialed if possible*, sometimes we just want to demo something quickly or have good reasons for using an LLM API. In these cases, it is good practice to route requests through a gateway that scrubs **Personally Identifiable Information** (PII) to reduce the risk of it leaking.

In this example, we will look at the [`ai4privacy/pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k) dataset and send prompts to [`CohereForAI/c4ai-command-r-plus`](https://huggingface.co/CohereForAI/c4ai-command-r-plus) through a gateway that scrubs PII before the request leaves our environment.

## Setup

In [None]:
import os
from llm_gateway.providers.cohere import CohereWrapper
from datasets import load_dataset
import cohere
import types
import re

COHERE_API_KEY = os.environ["COHERE_API_KEY"]
# default database url: "postgresql://postgres:postgres@postgres:5432/llm_gateway"
DATABASE_URL = os.environ["DATABASE_URL"]

## LLM Wrapper

In [None]:
wrapper = CohereWrapper()

In [None]:
example = "Bin Liu (binliuliu@gmail.com, (+1) 111-111-1111) committed a mistake when he used PyTorch Trainer instead of HF Trainer."

In [None]:
response, db_record = wrapper.send_cohere_request(
    endpoint='generate',
    model='command-r-plus',
    max_tokens=25,
    prompt=f"{example}\n\nSummarize the above text in 1-2 sentences.",
    temperature=0.3
)

In [None]:
print(response)

In [None]:
print(db_record)

`db_record` is the database record for the request. As we can see, the prompt was scrubbed, and the `user_input` that was actually sent out is:
```
Bin Liu ([REDACTED EMAIL ADDRESS], (+1) [REDACTED PHONE NUMBER]) committed a mistake when he used PyTorch Trainer instead of HF Trainer.\n\nSummarize the above text in 1-2 sentences.
```
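
To check a scrubbed prompt programmatically rather than by eye, a small helper can report which known PII strings still appear in the text. This is a minimal sketch: `find_leaks` and the list of PII values are hypothetical names introduced for illustration, not part of `llm_gateway`.

```python
from typing import List


def find_leaks(text: str, pii_values: List[str]) -> List[str]:
    """Return the PII values that still occur in `text` (case-insensitive)."""
    lowered = text.lower()
    return [value for value in pii_values if value.lower() in lowered]


# The scrubbed prompt shown above, copied as a plain string.
sent_prompt = (
    "Bin Liu ([REDACTED EMAIL ADDRESS], (+1) [REDACTED PHONE NUMBER]) "
    "committed a mistake when he used PyTorch Trainer instead of HF Trainer."
)

# Values we know were present in the original example.
known_pii = ["Bin Liu", "binliuliu@gmail.com", "111-111-1111"]

leaks = find_leaks(sent_prompt, known_pii)
print(leaks)
```

In this example the email address and phone number are gone, but the name is still reported, since none of the default scrubbers target person names.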

## Scrubbers

The `llm_gateway` repository implements the following scrubbers:
```
ALL_SCRUBBERS = [
    scrub_phone_numbers,
    scrub_credit_card_numbers,
    scrub_email_addresses,
    scrub_postal_codes,
    scrub_sin_numbers,
]
```
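
To give a feel for what a scrubber of this kind does, here is a minimal sketch of a regex-based email scrubber and a loop that applies a list of scrubbers in order. The pattern, `sketch_scrub_email`, and `apply_scrubbers` are assumptions made for illustration; the gateway's actual implementations may differ.

```python
import re
from typing import Callable, List

# Deliberately simple, hypothetical pattern; real email matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sketch_scrub_email(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[REDACTED EMAIL ADDRESS]", text)


def apply_scrubbers(text: str, scrubbers: List[Callable[[str], str]]) -> str:
    """Run each scrubber on the output of the previous one."""
    for scrub in scrubbers:
        text = scrub(text)
    return text


scrubbed = apply_scrubbers(
    "Contact binliuliu@gmail.com for details.", [sketch_scrub_email]
)
print(scrubbed)
```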

If we need another scrubber, we can add one by patching the wrapper's request method on the instance.

In [None]:
def my_custom_scrubber(text: str) -> str:
    """Scrub name in text"""
    # `flags` must be passed by keyword: the fourth positional argument of re.sub is `count`.
    return re.sub(r"Bin Liu", "[REDACTED PERSON]", text, flags=re.IGNORECASE)

In [None]:
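# Keep a reference to the original bound method; it already carries `wrapper` as `self`.
original_method = wrapper.send_cohere_request

def modified_method(self, **kwargs):
    self._validate_cohere_endpoint(kwargs.get('endpoint', None))
    prompt = kwargs.get('prompt', None)

    # Apply the custom scrubber before handing the request to the original method.
    if prompt is not None:
        kwargs['prompt'] = my_custom_scrubber(prompt)
    # `original_method` is bound, so `self` must not be passed again.
    return original_method(**kwargs)

# assign a new method to the instance
wrapper.send_cohere_request = types.MethodType(modified_method, wrapper)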
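
The same instance-level patching pattern can be tried without the gateway or an API key. The sketch below uses a hypothetical stand-in class, `DummyWrapper`, whose `send_request` only echoes the prompt; nothing here is part of `llm_gateway`.

```python
import re
import types


class DummyWrapper:
    """Hypothetical stand-in for the gateway wrapper; it only echoes the prompt."""

    def send_request(self, **kwargs):
        return kwargs["prompt"]


def scrub_name(text: str) -> str:
    """Redact one hard-coded name, ignoring case."""
    return re.sub(r"bin liu", "[REDACTED PERSON]", text, flags=re.IGNORECASE)


dummy = DummyWrapper()
# A bound method: it already remembers `dummy`, so it is called without `self`.
original = dummy.send_request


def patched(self, **kwargs):
    if kwargs.get("prompt") is not None:
        kwargs["prompt"] = scrub_name(kwargs["prompt"])
    return original(**kwargs)


# Only this instance is patched; other DummyWrapper instances are unaffected.
dummy.send_request = types.MethodType(patched, dummy)

result = dummy.send_request(prompt="BIN LIU used PyTorch.")
print(result)
```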

In [None]:
response, db_record = wrapper.send_cohere_request(
    endpoint="generate",
    model="command-r-plus",
    max_tokens=25,
    prompt=f"{example}\n\nSummarize the above text in 1-2 sentences.",
    temperature=0.3,
)

print(response)

In [None]:
print(db_record)

The scrubbers are applied sequentially, so if our custom scrubber rewrites text that one of the default scrubbers expects to match, the default scrubbers may behave unexpectedly.
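
A minimal sketch of this kind of interference, using two hypothetical regex scrubbers (neither is taken from the gateway): a broad "long digit run" scrubber that runs first can break the pattern a phone-number scrubber relies on.

```python
import re


def sketch_scrub_phone(text: str) -> str:
    """Hypothetical phone scrubber: matches the form 111-111-1111."""
    return re.sub(r"\d{3}-\d{3}-\d{4}", "[REDACTED PHONE NUMBER]", text)


def sketch_scrub_digits(text: str) -> str:
    """Hypothetical broad scrubber: redacts any run of four or more digits."""
    return re.sub(r"\d{4,}", "[REDACTED NUMBER]", text)


text = "Call 111-111-1111 about order 987654."

# Phone scrubber first: the whole phone number is replaced, then the order id.
phone_first = sketch_scrub_digits(sketch_scrub_phone(text))

# Broad scrubber first: it eats the last four digits of the phone number,
# so the phone pattern no longer matches and the first six digits remain.
digits_first = sketch_scrub_phone(sketch_scrub_digits(text))

print(phone_first)
print(digits_first)
```

With the broad scrubber running first, part of the phone number survives, which is why it is worth checking a custom scrubber against the defaults on a few representative inputs.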

## Dataset

In [None]:
pii_ds = load_dataset("ai4privacy/pii-masking-200k")

In [None]:
pii_ds['train'][0]['source_text']

In [None]:
example = pii_ds['train'][0]['source_text']

response, db_record = wrapper.send_cohere_request(
    endpoint='generate',
    model='command-r-plus',
    max_tokens=50,
    prompt=f"{example}\n\nSummarize the above text in 1-2 sentences.",
    temperature=0.3
)

In [None]:
print(response)

In [None]:
print(db_record)

## Regular output

In [None]:
# Call the Cohere API directly, without the gateway, for comparison.
co = cohere.Client(api_key=os.environ['COHERE_API_KEY'])

response_vanilla = co.generate(
    prompt=f"{example}\n\nSummarize the above text in 1-2 sentences.",
    model="command-r-plus",
    max_tokens=50,
    temperature=0.3
)

In [None]:
print(response_vanilla)