# 🦙 PII Redaction using Ollama LLMs

This notebooks explores using LLM models to identify `Personal Identifyable Information` (PII).

> ℹ️ We use [Ollama](https://ollama.com/download/linux) and markdown inputs as produced by something like [docLing](https://docs.vllm.ai/en/stable/)

> ⚠️ As we are limited to running a model that fits in about 16Gb of VRAM we can't get good results. Based on experiments with cloud LLMs, good results require models of at least 30B parameters. Ideally more.

## ⚙️ Setup

> ℹ️ After installation you'll need to run in a separate terminal:
> ```shell
> ollama serve &
> ollama pull llama3.3
> ```

In [None]:
# install uv
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh

# Install python deps
!uv pip install -q --system ollama tqdm rich funcy

## 🦜 LLM

Let's check some random LLM chatter 🗣️

> You will need to run in a separate terminal:
> ```shell
> ollama serve &
> ollama pull phi4
> ```


In [28]:
import re

from ollama import chat
from ollama import ChatResponse


# ===================================== 👇 Configure as needed =================================
MODEL_ID = "deepseek-r1:14b" # "phi4", "deepseek-r1:14b", "qwen2.5:14b"
DEFAULT_SYS_PROMPT = "You are a helpful AI assistant."
# ===================================== 👆 Configure as needed =================================


def parse_deepseek_response(response: str) -> tuple[str, list[str]]:
    THINK_TRACE_REGEX = re.compile(r"<think>(.*?)</think>", re.DOTALL)
    BASE_ANSWER_REGEX = re.compile(r"</think>(.*)", re.DOTALL)
    FINAL_ANSWER_REGEX = re.compile(r"\*\*Answer:\*\*(.*)", re.DOTALL)

    # Gather the content between the <think> tags
    think_content = THINK_TRACE_REGEX.search(response).group(1).split("\n\n")

    # Gather the content after the **Answer:**
    answer_content = BASE_ANSWER_REGEX.search(response).group(1)
    if m := FINAL_ANSWER_REGEX.search(answer_content):
        answer_content = m.group(1).strip()

    return answer_content, think_content


def stream_response(stream):
    for chunk in stream:
        print(chunk, end='', flush=True)
        

def generate_stream(prompt: str, sys_prompt:str = DEFAULT_SYS_PROMPT, **options):
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": prompt},
    ]
    stream = chat(
        model=MODEL_ID, messages=messages, stream=True, **options
    )
    for chunk in stream:
        yield chunk['message']['content']

        
def generate(prompt: str, sys_prompt:str = DEFAULT_SYS_PROMPT, **options):
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": prompt},
    ]
    response = chat(model=MODEL_ID, messages=messages, **options)
    return response.message.content


# Run inference
ans = generate("Hey there. Say hello in german.")
print(f"🗣️ Answer:")
print(ans)

🗣️ Answer:
<think>
Okay, the user wants me to say hello in German. Let's see, "hello" in German is "hallo". That should be straightforward.

I should keep it simple and friendly. Maybe just say "Hallo!" with an exclamation mark to sound welcoming.

Make sure the spelling is correct and the greeting is appropriate for a general audience.
</think>

Hallo!


## 📚 Data

In [6]:
import re
from pathlib import Path
from utils import split_markdown_by_spans


DATA_DIR = Path("/datasets/client-data-us/Banneker/Batch1")

md_docs = list(DATA_DIR.glob("*.md"))
print(f"Total markdown documents: {len(md_docs)}")

Total markdown documents: 60


## 🫥 Anonymisation

### 🗣️ Prompts

In [12]:
# Summary template
SUMMARY_TEMPLATE = """
Given the following legal text, please provide a summary of the main points.

```
{context}
```
"""

# System template for redaction
SYSTEM_REDACTION_PROMPT = """
You are an expert in identifying Personal Identifyable Information (PII) in a given text according
to the following rules:

1. Case Information and Case Participants

- Redact AAA Case Number / Docket Number 
- Redact Hearing Location 
- Redact Names, Physical Addresses, and Other Contact Information (Email, Website, Phone Number, etc.) of: 
    - Parties 
    - Party Representatives 
    - Advocates 
    - AAA Case Managers 
    - Arbitrators (but only if the arbitrator is the employee of a union or party) 

2. Names of Other Individuals and Organizations

- Redact Names, Physical Addresses, and Other Contact Information (Email, Website, Phone Number, etc.) of: 
- Non-Party Organizations 
- Witnesses 
- Non-Party Individuals (Other Than Witnesses) 

Do Not Redact Ranks or Titles of Individuals (unless the title itself can be used to identify a party—e.g., “X County Chief of Police”) 

3. Details that Identify Geographical Location of Parties/Events

In General: Redact any information about the location of the hearing or the matter in dispute that would de-anonymize any party. 
- Redact, if it can be used to identify the location, Names, Acronyms/Abbreviations, Physical Addresses, and Other Contact Information (Email, Website, Phone Number, etc.) of: 
    - Government Bodies (Agency, Council, Police or Fire Department, Bureau, etc.) 
    - Public Places (Parks, Beaches, Botanical Gardens, Zoos, etc.) 
    - Natural Features/Landmarks (Bodies of Water, Mountains, etc.) 
    - Streets and Highways 
    - Manmade Landmarks (Named Buildings, Bridges, Dams, Tunnels, etc.) 
    - Local Infrastructure (Schools, Universities, Hospitals, Police and Fire Stations, Correctional Facilities, Court Houses, etc.) 
    - Cultural Institutions (Museums, Theaters, Stadiums, etc.) 
    
    
Identify all information to be redacted and answer only with a valid formatted JSON.
"""

### 🚀 Go

In [31]:
import json

from collections import defaultdict
from dateutil import parser
from funcy import chunks
from pprint import pformat
from pydantic import BaseModel, Field, field_validator
from tqdm.auto import tqdm


class PIIItem(BaseModel):
    Type: str = Field(..., description="The type of personal idenitfyable information entity. e.g.: Name")
    Value: str = Field(..., description="The entity value. e.g: John Doe")


class PIIList(BaseModel):
    pii: list[PIIItem] = Field(default_factory=list, description="List of identified PIIs")
                                
                                    

def inspect_for_piis(text, chunk_size:int = 400):
    # Look for stuff in chunks (simple for now, but to avoid catastrophical forgetting)
    total_chunks = len(text) // chunk_size + int(len(text) % chunk_size > 0)
    
    piis = defaultdict(list)
    chunks_pbar = tqdm(chunks(chunk_size, text), total=total_chunks)
    for i, chunk in enumerate(chunks_pbar):
        chunks_pbar.set_description(f"Chunk {i}")
        res = generate(
            chunk,
            sys_prompt=SYSTEM_REDACTION_PROMPT,
            format=PIIList.model_json_schema()
        )
        print(f"\n\n===================== Chunk {i} =====================")
        for pii in json.loads(res)["pii"]:
            print(pii)
            pii = PIIItem(**pii)
            piis[pii.Type].append(pii.Value)
            
    return piis


# Read the 1st document
text = md_docs[0].open("r").read()

# Inspect for PIIs
piis = inspect_for_piis(text)

# Print the results
for etype, evalue in piis.items():
    print(etype, evalue)

  0%|          | 0/49 [00:00<?, ?it/s]



{'Type': 'Company Name', 'Value': 'Customer Name: BUCKMAN LABORATORIES INTERNATIONAL, INC.'}
{'Type': 'Address', 'Value': '1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A.'}
{'Type': 'Phone Number', 'Value': '(901) 278-0330'}
{'Type': 'Company Name', 'Value': 'Supply Chain Consultants, Inc. d/b/a Arkieva Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808 Telephone: 302-738-9215 Fed ID/TIN: 51-035 0007'}
{'Type': 'Address', 'Value': 'Wilmington, DE 19808'}
{'Type': 'Phone Number', 'Value': '302-738-9215'}


{'Type': 'Case Number', 'Value': '3'}
{'Type': 'Name', 'Value': 'LICENSE3'}


{'Type': 'Docket Number', 'Value': '6'}
{'Type': 'Confidentiality Information', 'Value': ' CONFIDENTIALITY7'}




KeyError: 'pii'

In [None]:
# map each entity to a placeholder and substitute in the original text
placeholders = {}
for doc_name, doc_entities in entities.items():
    placeholders[doc_name] = {}
    for ent_type, ent_list in doc_entities.items():
        for i, ent in enumerate(ent_list):
            placeholders[doc_name][ent] = f"{ent_type.upper()}_{i}" 

print(pformat(placeholders))

In [18]:
import copy
import re


def mask_text(text, entities):
    masked_text = copy.copy(text)
    for ent, placeholder in entities.items():
        if ent in ["Customer", "Developer", "Distributor"]:
            continue
        print(f"{ent} --> {placeholder}")
        masked_text = re.sub(ent, f"[{placeholder}]" , masked_text, count=0, flags=0)

    return masked_text
    
    
doc = md_docs[0]
doc_name = doc.stem
doc_text = doc.open().read()
print(mask_text(doc_text, placeholders[doc_name]))

Buckman Laboratories International Inc. --> ORGS_6
Supply Chain Consultants, Inc. d/b/a Arkieva --> ORGS_1
Licensors --> ORGS_2
Representatives --> ORGS_3
Distributors --> ORGS_4
Arkieva --> ORGS_5
Buckman Laboratories International Inc. (Customer) --> PEOPLE_0
Supply Chain Consultants, Inc. d/b/a Arkieva (Developer) --> PEOPLE_1
Licensor --> PEOPLE_6
Customer's employees --> PEOPLE_10
Customer's consultants --> PEOPLE_11
Customer's employees, consultants, or affiliates --> PEOPLE_12
Parties --> PEOPLE_16
Parties (mentioned twice, but referring to the same entities as Developer and Distributor) --> PEOPLE_17
Sujit K. Singh --> PEOPLE_18
#### **MASTER LICENSE AGREEMENT**

**Between Customer and [ORGS_1] Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808 Telephone: 302-738-9215 Fed ID/TIN: 51-035 0007**

Customer Name: BUCKMAN LABORATORIES INTERNATIONAL, INC.

Address: 1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A. Telephone: (901) 278-0330

ATTN:

| 1.0  | D

## 🔗 Langchain

In [3]:
!uv pip install --system -q langgraph langchain langchain-core langchain-community langchain-ollama 

In [None]:
import os
import sys

# 🤯 Classic langchain... 4 different types of importing an Ollama LLM, but...
# 👉 from langchain_community.chat_models import ChatOllama
# 👉 from langchain_ollama import OllamaLLM
# 👉 from langchain_community.llms import Ollama
from langchain_ollama import ChatOllama
from langchain_core.messages import AIMessage


# ===================================== 👇 Configure as needed =================================
sys.tracebacklimit = -2
os.environ["TRACEBACK_LIMIT"] = "2"

# Or any other model capable os using tools
MODEL_ID = "llama3.2" # "jacob-ebey/phi4-tools"

OLLAMA_BASE_URL = "http://localhost:11434"
# ===================================== 👆 Configure as needed =================================


# Initialize Ollama
# llm = OllamaLLM(model=MODEL_ID, base_url=OLLAMA_BASE_URL, temperature=0.0)
json_llm = ChatOllama(model=MODEL_ID, base_url=OLLAMA_BASE_URL, temperature=0.0, format="json")
llm = ChatOllama(model=MODEL_ID, base_url=OLLAMA_BASE_URL, temperature=0.0)


prompt = "Please solve: 43x - 9 = 120"
messages = [
    ("system", "You are a helpful math assistant."),
    ("human", prompt),
]

res = llm.invoke(messages)
print(res.content)

## 🤖 Agent

In [None]:
from langchain_core.tools import tool
from langchain_community.tools import DuckDuckGoSearchRun
from langgraph.prebuilt import create_react_agent


# ===================================== 👇 Configure as needed =================================
TEMPLATE = """
In this chunk of markdown text:

```
{context}
```

{request}

Answer with a bullet-point list if any are found. Otherwise respond 'None'
"""
# ===================================== 👆 Configure as needed =================================


search_tool = DuckDuckGoSearchRun()

# Define a simple tool
@tool
def dummy_tool(query: str) -> str:
    """Search for information about a topic."""
    ...


tools = [search_tool]
tool_names = [t.name for t in tools]


# Create the ReAct agent
agent = create_react_agent(
    model=llm,
    tools=tools,
    # prompt=PROMPT_TEMPLATE
)

In [None]:
messages = [("human", "What is an agent in the context of LLMs?")]

# Run the agent
response = agent.invoke({"messages": messages})
print("Response:")
print(response["messages"][-1].content)