# 🦙 LLM direct redaction using AWS Bedrock

This notebooks explores using LLM models directly to identify `Personal Identifyable Information` (PII).

> ℹ️ We use [AWS Bedrock](...) and markdown inputs as produced by something like [docLing](https://docs.vllm.ai/en/stable/)

> 💡 Might be best to use Bedrock through [langchain's bedrock integration](https://python.langchain.com/docs/integrations/chat/bedrock/)

> ⚠️ Requires `AWS_SECRET_ACCESS_KEY` and `AWS_ACCESS_KEY_ID` env vars

## ⚙️ Setup

In [1]:
# install uv
!curl -LsSf https://astral.sh/uv/install.sh | sh

!uv pip install -q --system ollama tqdm rich funcy boto3 python-dotenv

downloading uv 0.6.11 x86_64-unknown-linux-gnu
no checksums to verify
installing to /root/.local/bin
  uv
  uvx
everything's installed!


## 🦜 LLM

Let's check some random LLM chatter 🗣️


In [2]:
import os
import boto3
import json
from dotenv import load_dotenv


load_dotenv()


# ===================================== 👇 Configure as needed =================================

AWS_REGION = "us-east-1"
MISTRAL_MODEL_ID = "mistral.mixtral-8x7b-instruct-v0:1"
LLAMA_MODEL_ID = "us.meta.llama3-3-70b-instruct-v1:0"

# ===================================== 👆 Configure as needed =================================


# Initialize the Bedrock client
bedrock_client = boto3.client('bedrock-runtime', region_name=AWS_REGION)


def invoke(
    prompt:str, 
    model_id:str,
    max_gen_len:int = 2048, 
    temperature:float = 0.0, 
    top_p:float = 0.9
):
    # Prepare the payload
    payload = {
        "prompt": prompt,
        # "max_gen_len": max_gen_len,
        "temperature": temperature,
        "top_p": top_p
    }
    # Invoke the model
    response = bedrock_client.invoke_model(
        body=json.dumps(payload),
        modelId=model_id,
        accept='application/json',
        contentType='application/json'
    )

    # Parse and print the response
    return json.loads(response["body"].read())
    

def invoke_mistral(prompt:str, model_id:str = MISTRAL_MODEL_ID):
    MISTRAL_TEMPLATE = "<s>[INST] {prompt}[/INST]"
    prompt = MISTRAL_TEMPLATE.format(prompt=prompt)
    response = invoke(prompt, model_id)
    return response["outputs"][0]["text"]
    
    
def invoke_llama(
    user_prompt:str, 
    sys_prompt:str = "You are a helpful assistant", 
    model_id:str = LLAMA_MODEL_ID
):
    LLAMA_TEMPLATE = (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        "{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
        "{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    )
    prompt = LLAMA_TEMPLATE.format(user=user_prompt, system=sys_prompt)
    response = invoke(prompt, model_id)
    return response["generation"]


In [7]:
invoke_llama("Who are you?")

"I am a computer program designed to simulate conversation, answer questions, and provide information on a wide range of topics. I'm here to help with any questions or tasks you may have, and I'll do my best to provide accurate and helpful responses. I don't have a personal identity or emotions, but I'm designed to be friendly and engaging. How can I assist you today?"

## 📚 Data

In [5]:
import re
from pathlib import Path
from utils import split_markdown_by_spans

DATA_DIR = Path("/datasets/client-data-us/")

md_docs = list(DATA_DIR.rglob("**/*.md"))
print(f"Total markdown documents: {len(md_docs)}")

Total markdown documents: 60


## 🫥 Anonymisation

### 🗣️ Prompts

> ℹ️ Check out [Prompt templates for the AWS Bedrock models](https://docs.aws.amazon.com/bedrock/latest/userguide/prompt-templates-and-examples.html)

In [28]:
SYSTEM_PROMPT = """
You are an expert in identifying Personal Identifyable Information (PII) in a given text according
to the following rules:

1. Case Information and Case Participants

- Redact AAA Case Number / Docket Number 
- Redact Hearing Location 
- Redact Names, Physical Addresses, and Other Contact Information (Email, Website, Phone Number, etc.) of: 
    - Parties 
    - Party Representatives 
    - Advocates 
    - AAA Case Managers 
    - Arbitrators (but only if the arbitrator is the employee of a union or party) 

2. Names of Other Individuals and Organizations

- Redact Names, Physical Addresses, and Other Contact Information (Email, Website, Phone Number, etc.) of: 
- Non-Party Organizations 
- Witnesses 
- Non-Party Individuals (Other Than Witnesses) 

Do Not Redact Ranks or Titles of Individuals (unless the title itself can be used to identify a party—e.g., “X County Chief of Police”) 

3. Details that Identify Geographical Location of Parties/Events

In General: Redact any information about the location of the hearing or the matter in dispute that would de-anonymize any party. 
- Redact, if it can be used to identify the location, Names, Acronyms/Abbreviations, Physical Addresses, and Other Contact Information (Email, Website, Phone Number, etc.) of: 
    - Government Bodies (Agency, Council, Police or Fire Department, Bureau, etc.) 
    - Public Places (Parks, Beaches, Botanical Gardens, Zoos, etc.) 
    - Natural Features/Landmarks (Bodies of Water, Mountains, etc.) 
    - Streets and Highways 
    - Manmade Landmarks (Named Buildings, Bridges, Dams, Tunnels, etc.) 
    - Local Infrastructure (Schools, Universities, Hospitals, Police and Fire Stations, Correctional Facilities, Court Houses, etc.) 
    - Cultural Institutions (Museums, Theaters, Stadiums, etc.) 
    
    
Identify all information to be redacted and answer only with a valid formatted JSON.
"""

### 💬 Chat

In [10]:
text = md_docs[0].open().read()

response = invoke_llama(sys_prompt=SYSTEM_PROMPT, user_prompt=text)
print(response)

```json
[
  {
    "Type": "Company Name",
    "Value": "Supply Chain Consultants, Inc."
  },
  {
    "Type": "Company Name",
    "Value": "Arkieva"
  },
  {
    "Type": "Address",
    "Value": "Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808"
  },
  {
    "Type": "Phone Number",
    "Value": "302-738-9215"
  },
  {
    "Type": "Fed ID/TIN",
    "Value": "51-035 0007"
  },
  {
    "Type": "Company Name",
    "Value": "BUCKMAN LABORATORIES INTERNATIONAL, INC."
  },
  {
    "Type": "Address",
    "Value": "1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A."
  },
  {
    "Type": "Phone Number",
    "Value": "(901) 278-0330"
  },
  {
    "Type": "Name",
    "Value": "Sujit K. Singh"
  },
  {
    "Type": "Title",
    "Value": "COO, Arkieva"
  },
  {
    "Type": "Date",
    "Value": "August 17, 2012"
  },
  {
    "Type": "Company Name",
    "Value": "Buckman"
  }
]
```


## 🔗 Structured

Here we use [langchain's](https://www.langchain.com/https://www.langchain.com/) [JsonOutputParse](https://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.json.JsonOutputParser.htmlhttps://python.langchain.com/api_reference/core/output_parsers/langchain_core.output_parsers.json.JsonOutputParser.html) to parse and structure the output.

> ⚠️ The current implementation passes the `system prompt` on every single call. Ideally, we would use a chat session

In [11]:
!uv pip install --system -q langchain-aws

In [35]:
import json

from collections import defaultdict
from dateutil import parser
from pydantic import BaseModel, Field, field_validator
from langchain_core.output_parsers.json import JsonOutputParser
from langchain_core.prompts.prompt import PromptTemplate
from langchain_core.prompts import ChatPromptTemplate


class PIIItem(BaseModel):
    Type: str = Field(..., description="The type of personal idenitfyable information entity. e.g.: Name")
    Value: str = Field(..., description="The entity value. e.g: John Doe")


class PIIList(BaseModel):
    pii: list[PIIItem] = Field(default_factory=list, description="List of identified PIIs")


class PIIEntities(BaseModel):
    dates: list[str] = Field(default_factory=list, description="List of identified calendar dates.")
    person_names: list[str] = Field(default_factory=list, description="List of Person's or individual names")
    orgs: list[str] = Field(default_factory=list, description="List of identified company names.")
    telephones: list[str] = Field(default_factory=list, description="List of identified telephone numbers")
    emails: list[str] = Field(default_factory=list, description="List of identified email addresses")
    locations: list[str] = Field(default_factory=list, description="List of identified locations and addresses")
    
                                    
    @field_validator("dates")
    @classmethod
    def validate_dates(cls, dates):
        valid_dates = []
        for date_str in dates:
            # If it parses as float is not a date
            try:
                float(date_str)
                continue
            except ValueError:
                pass
            
            # Attempt to parse the date string with fuzzy parsing disabled
            try:
                parsed_date = parser.parse(date_str, fuzzy=False)
                valid_dates.append(date_str)
            except ValueError:
                # If parsing fails, skip this date string
                print(f"Invalid date format: '{date_str}'")
                
        return valid_dates
    
    
parser = JsonOutputParser(pydantic_object=PIIList)
print(parser.get_format_instructions())

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"$defs": {"PIIItem": {"properties": {"Type": {"description": "The type of personal idenitfyable information entity. e.g.: Name", "title": "Type", "type": "string"}, "Value": {"description": "The entity value. e.g: John Doe", "title": "Value", "type": "string"}}, "required": ["Type", "Value"], "title": "PIIItem", "type": "object"}}, "properties": {"pii": {"description": "List of identified PIIs", "items": {"$ref": "#/$defs/PIIItem"}, "title": "Pii", "type": "array"}}}
```


In [38]:
from langchain_aws import ChatBedrock


llm = ChatBedrock(
    model=LLAMA_MODEL_ID,
    provider="meta",
    region_name=AWS_REGION,
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    model_kwargs=dict(temperature=0),
)

messages = [
    ("system", SYSTEM_PROMPT),
    ("human", "Here is the document content:\n\n{content}.\n\n{format_instructions}"),
]
prompt = ChatPromptTemplate(
    messages=messages,
    input_variables=["content"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

chain = prompt | llm | parser
res = chain.invoke({"content": text})
res

{'pii': [{'Type': 'Company Name', 'Value': 'Supply Chain Consultants, Inc.'},
  {'Type': 'Company Name', 'Value': 'Arkieva'},
  {'Type': 'Address',
   'Value': 'Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808'},
  {'Type': 'Phone Number', 'Value': '302-738-9215'},
  {'Type': 'Fed ID/TIN', 'Value': '51-035 0007'},
  {'Type': 'Company Name',
   'Value': 'BUCKMAN LABORATORIES INTERNATIONAL, INC.'},
  {'Type': 'Address',
   'Value': '1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A.'},
  {'Type': 'Phone Number', 'Value': '(901) 278-0330'},
  {'Type': 'Name', 'Value': 'Sujit K. Singh'},
  {'Type': 'Title', 'Value': 'COO, Arkieva'},
  {'Type': 'Date', 'Value': 'August 17, 2012'}]}

## 📦 Placeholders

In [55]:
def pii_to_placeholders(pii_list):
    # map each entity to a placeholder and substitute in the original text
    placeholders = {}
    counts = defaultdict(int)
    for pii in pii_list:
        pii = PIIItem(**pii)
        counts[pii.Type] += 1
        placeholder = pii.Type.upper().replace(" ", "_")
        placeholders[pii.Value] = f"[{placeholder}_{counts[pii.Type]}_REDACTED]"
        
    return placeholders

placeholders = pii_to_placeholders(res["pii"])
rev_placeholders = {v:k for k,v in placeholders.items()}
dict(sorted(rev_placeholders.items(), key=lambda item: item[0]))

{'[ADDRESS_1_REDACTED]': 'Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808',
 '[ADDRESS_2_REDACTED]': '1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A.',
 '[COMPANY_NAME_1_REDACTED]': 'Supply Chain Consultants, Inc.',
 '[COMPANY_NAME_2_REDACTED]': 'Arkieva',
 '[COMPANY_NAME_3_REDACTED]': 'BUCKMAN LABORATORIES INTERNATIONAL, INC.',
 '[DATE_1_REDACTED]': 'August 17, 2012',
 '[FED_ID/TIN_1_REDACTED]': '51-035 0007',
 '[NAME_1_REDACTED]': 'Sujit K. Singh',
 '[PHONE_NUMBER_1_REDACTED]': '302-738-9215',
 '[PHONE_NUMBER_2_REDACTED]': '(901) 278-0330',
 '[TITLE_1_REDACTED]': 'COO, Arkieva'}

## 👹 Text masking

In [57]:
import copy
import re


def mask_text(text, placeholders):
    masked_text = copy.copy(text)
    for ent, placeholder in placeholders.items():
        print(f"{ent} --> {placeholder}")
        masked_text = re.sub(ent, f"{placeholder}" , masked_text, count=0, flags=0)

    return masked_text
    
    
doc = md_docs[0]
doc_name = doc.stem
print(mask_text(text, placeholders))

Supply Chain Consultants, Inc. --> [COMPANY_NAME_1_REDACTED]
Arkieva --> [COMPANY_NAME_2_REDACTED]
Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808 --> [ADDRESS_1_REDACTED]
302-738-9215 --> [PHONE_NUMBER_1_REDACTED]
51-035 0007 --> [FED_ID/TIN_1_REDACTED]
BUCKMAN LABORATORIES INTERNATIONAL, INC. --> [COMPANY_NAME_3_REDACTED]
1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A. --> [ADDRESS_2_REDACTED]
(901) 278-0330 --> [PHONE_NUMBER_2_REDACTED]
Sujit K. Singh --> [NAME_1_REDACTED]
COO, Arkieva --> [TITLE_1_REDACTED]
August 17, 2012 --> [DATE_1_REDACTED]
#### **MASTER LICENSE AGREEMENT**

**Between Customer and [COMPANY_NAME_1_REDACTED] d/b/a [COMPANY_NAME_2_REDACTED] [ADDRESS_1_REDACTED] Telephone: [PHONE_NUMBER_1_REDACTED] Fed ID/TIN: [FED_ID/TIN_1_REDACTED]**

Customer Name: [COMPANY_NAME_3_REDACTED]

Address: [ADDRESS_2_REDACTED] Telephone: (901) 278-0330

ATTN:

| 1.0  | DEFINITIONS<br>3             |
|------|------------------------------|
| 2.0  | LICENS