# 🦙 Ollama direct redaction

This notebook explores using LLMs directly to identify `Personally Identifiable Information` (PII).

> ℹ️ We use [Ollama](https://ollama.com/download/linux) and markdown inputs as produced by something like [docLing](https://docs.vllm.ai/en/stable/)

## ⚙️ Setup

In [None]:
# install uv
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Install Ollama
!curl -fsSL https://ollama.com/install.sh | sh

!uv pip install -q --system ollama tqdm rich funcy

downloading uv 0.6.11 x86_64-unknown-linux-gnu
no checksums to verify
installing to /root/.local/bin
  uv
  uvx
everything's installed!
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle

## 🦜 LLM

Let's check some random LLM chatter 🗣️

> Run the following in a separate terminal first (pull whichever model you set as `MODEL_ID` below):
> ```shell
> ollama serve &
> ollama pull phi4
> ```
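
Before running the cells below, it can help to confirm that the server is up and the model has been pulled. The following is a minimal sketch, not part of the original notebook: it assumes Ollama's default address `http://localhost:11434` and its `GET /api/tags` endpoint, which lists the locally available models. `model_in_tags` and `is_model_available` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # assumption: Ollama's default address


def model_in_tags(payload: dict, model: str) -> bool:
    """Check whether `model` appears in a parsed /api/tags response."""
    names = {m.get("name", "") for m in payload.get("models", [])}
    # A bare name such as "phi4" is shorthand for "phi4:latest"
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in names


def is_model_available(model: str, host: str = OLLAMA_HOST, timeout: float = 2.0) -> bool:
    """Return True if the server answers and lists `model`, False otherwise."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # Server not reachable (URLError is an OSError) or the body is not JSON
        return False
    return model_in_tags(payload, model)
```

Calling `is_model_available("phi4")` before the first `chat` request turns a missing server or model into a simple `False` instead of a stack trace.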


In [4]:
from ollama import chat
from ollama import ChatResponse

# ===================================== 👇 Configure as needed =================================
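# Must be a plain string: a trailing comma here would make it a tuple, which `chat` rejects
MODEL_ID = "llama3.3:latest"  # alternatives: "phi4", "deepseek-r1", "qwen2.5:14b"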
DEFAULT_SYS_PROMPT = "You are a helpful AI assistant."
# ===================================== 👆 Configure as needed =================================


def stream_response(stream):
    for chunk in stream:
        print(chunk, end='', flush=True)
        

def generate_stream(prompt: str, sys_prompt:str = DEFAULT_SYS_PROMPT, **options):
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": prompt},
    ]
    stream = chat(
        model=MODEL_ID, messages=messages, stream=True, **options
    )
    for chunk in stream:
        yield chunk['message']['content']

        
def generate(prompt: str, sys_prompt:str = DEFAULT_SYS_PROMPT, **options):
    messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": prompt},
    ]
    response = chat(model=MODEL_ID, messages=messages, **options)
    return response.message.content


# Run inference
ans = generate("Hey there. Say hello in german.")
print(f"🗣️ Answer:")
print(ans)

ValidationError: 1 validation error for ChatRequest
model
  Input should be a valid string [type=string_type, input_value=('llama3.3:latest',), input_type=tuple]
    For further information visit https://errors.pydantic.dev/2.11/v/string_type
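
The `input_type=tuple` in the error above points at the cause: a trailing comma after a string literal turns the assignment into a one-element tuple. A small illustration in plain Python follows; `as_model_id` is a hypothetical guard, not part of the `ollama` package.

```python
# A trailing comma after a string literal builds a tuple, not a string
with_comma = "llama3.3:latest",
without_comma = "llama3.3:latest"

print(type(with_comma).__name__)     # tuple
print(type(without_comma).__name__)  # str


def as_model_id(value) -> str:
    """Hypothetical guard: fail early with a readable message."""
    if not isinstance(value, str):
        raise TypeError(f"MODEL_ID must be a str, got {type(value).__name__}: {value!r}")
    return value
```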

## 📚 Data

In [86]:
import re
from pathlib import Path
from utils import split_markdown_by_spans

DATA_DIR = Path("/datasets/client-data-us/")

md_docs = list(DATA_DIR.rglob("*.md"))  # rglob already searches recursively
print(f"Total markdown documents: {len(md_docs)}")

Total markdown documents: 60


## 🫥 Anonymisation

### 🗣️ Prompts

In [78]:
# System prompt
SYSTEM_PROMPT = """
You are specialized in extracting Personally Identifiable Information (PII).
Identifying information includes, but is not limited to:

- Organizations: Organization and company names (should not include placeholders, e.g. Developer, Customer, Client, ...)
- Persons: Names of persons or individuals (should not include Organizations, Companies nor Place names)
- Locations: (city names, countries) and addresses
- Contact information: Emails, phone numbers and PO boxes
- Dates: Should not include time periods or durations, nor single numbers. Only full dates containing at least a year and a month

Answer only with the PII, if none is found, say 'None'

<example>

Given the following text:

```
This contract is between Customer and Supply Chain Consultants, Inc. 
Arkieva Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808 
Telephone: 302-738-9215 Fed ID/TIN
```

- **company names**: Supply Chain Consultants, Inc
- **locations**: Arkieva Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808
- **phone numbers**: 302-738-9215

</example>
"""

# Summary template
SUMMARY_TEMPLATE = """
Given the following legal text, please provide a summary of the main points.

```
{context}
```
"""

### 🚀 Go

In [85]:
import json

from collections import defaultdict
from dateutil import parser
from funcy import chunks
from pprint import pformat
from pydantic import BaseModel, Field, field_validator
from tqdm.auto import tqdm


class PIIEntities(BaseModel):
    dates: list[str] = Field(default_factory=list, description="List of identified calendar dates.")
    person_names: list[str] = Field(default_factory=list, description="List of Person's or individual names")
    orgs: list[str] = Field(default_factory=list, description="List of identified company names.")
    telephones: list[str] = Field(default_factory=list, description="List of identified telephone numbers")
    emails: list[str] = Field(default_factory=list, description="List of identified email addresses")
    locations: list[str] = Field(default_factory=list, description="List of identified locations and addresses")
    
                                    
    @field_validator("dates")
    @classmethod
    def validate_dates(cls, dates):
        valid_dates = []
        for date_str in dates:
            # A bare number (anything that parses as a float) is not a date
            try:
                float(date_str)
                continue
            except ValueError:
                pass

            # Keep the string only if dateutil can parse it with fuzzy parsing disabled
            try:
                parser.parse(date_str, fuzzy=False)
                valid_dates.append(date_str)
            except (ValueError, OverflowError):
                # If parsing fails, skip this date string
                print(f"Invalid date format: '{date_str}'")
                
        return valid_dates
                                
                                    

def inspect_for_piis(text, chunk_size:int = 400):
    # Process the text in fixed-size character chunks (simple for now) so the model does not lose track of details in long inputs
    total_chunks = len(text) // chunk_size + int(len(text) % chunk_size > 0)
    
    piis = defaultdict(list)
    chunks_pbar = tqdm(chunks(chunk_size, text), total=total_chunks)
    for i, chunk in enumerate(chunks_pbar):
        chunks_pbar.set_description(f"Chunk {i}")
        res = generate(
            chunk,
            sys_prompt=SYSTEM_PROMPT,
            format=PIIEntities.model_json_schema()
        )
        # print(f"\n\n===================== Chunk {i} =====================")
        for cat, vals in json.loads(res).items():
            piis[cat] += vals
            
    return piis


# Read the 1st document
text = md_docs[0].read_text()

# Inspect for PIIs
piis = inspect_for_piis(text)
print(json.dumps(piis, indent=2))

  0%|          | 0/49 [00:00<?, ?it/s]

{
  "orgs": [
    "Supply Chain Consultants, Inc.",
    "Ardieva Linden Green Center",
    "BUCKMAN LABORATORIES INTERNATIONAL, INC.",
    "Developer",
    "Supply Chain Consultants, Inc.",
    "Arkieva",
    "Customer",
    "Developer",
    "Developer",
    "Developer",
    "Developer",
    "Customer",
    "Developer",
    "U.S. Commerce Department",
    "Supply Chain Consultants, Inc.",
    "Buckman Laboratories International Inc."
  ],
  "locations": [
    "5460 Fairmont Drive Wilmington, DE 19808",
    "1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A.",
    "N OF DISTRIBUTOR6",
    "TERMINATION7",
    "QUALITY CONTROL7",
    "ASSIGNMENT7",
    "U.S. EXPORT RESTRICTIONS7",
    "MISCELLANEOUS8",
    "Buckman Laboratories International Inc.",
    "Designated Hardware",
    "",
    "page 9 of this Agreement",
    "#page-3-0",
    "page-3-1",
    "None",
    "None",
    "None identified beyond section numbers.",
    "None",
    "none",
    "none",
    "none",
    "none"

In [24]:
import re

re.search("Continental Foods Europe BVBA", text)

<re.Match object; span=(24746, 24775), match='Continental Foods Europe BVBA'>
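
The check above can be generalised into a filter for the raw extraction output, which contains duplicates, empty strings, `None` answers and generic contract roles such as `Developer`. The following is a minimal sketch under stated assumptions: `clean_piis` and `ROLE_PLACEHOLDERS` are hypothetical names, matching is a plain case-sensitive substring test, and an entity that does not occur verbatim in the source text is treated as unreliable and dropped. A case-insensitive comparison may be preferable for documents that mix `BUCKMAN LABORATORIES` and `Buckman Laboratories`.

```python
# Assumption: generic contract roles that the system prompt asks the model to ignore
ROLE_PLACEHOLDERS = {"customer", "developer", "distributor", "client"}


def clean_piis(piis: dict[str, list[str]], text: str) -> dict[str, list[str]]:
    """Deduplicate entities and drop empty, 'none', role and unseen values."""
    cleaned = {}
    for category, values in piis.items():
        kept = []
        for value in values:
            value = value.strip()
            if not value or value.lower() in {"none", "n/a"}:
                continue  # empty answers and literal "None"
            if value.lower() in ROLE_PLACEHOLDERS:
                continue  # generic roles, not real names
            if value not in text:
                continue  # not found verbatim in the document
            if value not in kept:
                kept.append(value)  # preserve first-seen order
        cleaned[category] = kept
    return cleaned


sample_text = "Between Customer and Supply Chain Consultants, Inc. 5460 Fairmont Drive Wilmington, DE 19808"
raw = {
    "orgs": ["Supply Chain Consultants, Inc.", "Developer", "Customer", "Supply Chain Consultants, Inc."],
    "locations": ["5460 Fairmont Drive Wilmington, DE 19808", "", "None", "TERMINATION7"],
}
print(clean_piis(raw, sample_text))
```

Running such a filter between extraction and placeholder assignment keeps section titles and role names out of the mapping.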

In [11]:
# map each entity to a placeholder and substitute in the original text
placeholders = {}
for doc_name, doc_entities in entities.items():
    placeholders[doc_name] = {}
    for ent_type, ent_list in doc_entities.items():
        for i, ent in enumerate(ent_list):
            placeholders[doc_name][ent] = f"{ent_type.upper()}_{i}" 

print(pformat(placeholders))

{'Buckman Master License Agreement': {'Arkieva': 'ORGS_5',
                                      'Buckman Laboratories International Inc.': 'ORGS_6',
                                      'Buckman Laboratories International Inc. (Customer)': 'PEOPLE_0',
                                      'Customer': 'PEOPLE_15',
                                      "Customer's consultants": 'PEOPLE_11',
                                      "Customer's employees": 'PEOPLE_10',
                                      "Customer's employees, consultants, or affiliates": 'PEOPLE_12',
                                      'Developer': 'PEOPLE_13',
                                      'Distributor': 'PEOPLE_14',
                                      'Distributors': 'ORGS_4',
                                      'Licensor': 'PEOPLE_6',
                                      'Licensors': 'ORGS_2',
                                      'Parties': 'PEOPLE_16',
                                      'Parties (m
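
The mapping above contains overlapping keys (for example `Licensor` and `Licensors`) and keys with regex metacharacters such as `(` and `.`. When such a mapping is substituted into text, both escaping and replacement order matter. A minimal sketch, assuming a flat `{entity: placeholder}` dict: `mask_entities` is a hypothetical helper that escapes each entity and tries longer entities first in a single pass.

```python
import re


def mask_entities(text: str, mapping: dict[str, str]) -> str:
    """Replace every entity in `mapping` with its bracketed placeholder."""
    # Longest first, so "Licensors" wins over "Licensor" at the same position;
    # empty keys are skipped because an empty pattern would match everywhere
    ordered = sorted((ent for ent in mapping if ent), key=len, reverse=True)
    if not ordered:
        return text
    # re.escape makes ".", "(" and ")" match literally
    pattern = re.compile("|".join(re.escape(ent) for ent in ordered))
    return pattern.sub(lambda m: f"[{mapping[m.group(0)]}]", text)


mapping = {
    "Licensor": "PEOPLE_6",
    "Licensors": "ORGS_2",
    "Buckman Laboratories International Inc. (Customer)": "PEOPLE_0",
}
sample = "Buckman Laboratories International Inc. (Customer) pays the Licensors; each Licensor agrees."
print(mask_entities(sample, mapping))
```

A single compiled alternation also avoids re-scanning the document once per entity, and already-inserted placeholders are never matched again by a later entity.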

In [18]:
import re


def mask_text(text, entities):
    masked_text = text  # str is immutable, so no copy is needed
    for ent, placeholder in entities.items():
        # Leave the generic contract roles unmasked
        if ent in ["Customer", "Developer", "Distributor"]:
            continue
        print(f"{ent} --> {placeholder}")
        # Escape the entity so ".", "(" and ")" match literally instead of as regex syntax
        masked_text = re.sub(re.escape(ent), f"[{placeholder}]", masked_text)

    return masked_text
    
    
doc = md_docs[0]
doc_name = doc.stem
doc_text = doc.read_text()
print(mask_text(doc_text, placeholders[doc_name]))

Buckman Laboratories International Inc. --> ORGS_6
Supply Chain Consultants, Inc. d/b/a Arkieva --> ORGS_1
Licensors --> ORGS_2
Representatives --> ORGS_3
Distributors --> ORGS_4
Arkieva --> ORGS_5
Buckman Laboratories International Inc. (Customer) --> PEOPLE_0
Supply Chain Consultants, Inc. d/b/a Arkieva (Developer) --> PEOPLE_1
Licensor --> PEOPLE_6
Customer's employees --> PEOPLE_10
Customer's consultants --> PEOPLE_11
Customer's employees, consultants, or affiliates --> PEOPLE_12
Parties --> PEOPLE_16
Parties (mentioned twice, but referring to the same entities as Developer and Distributor) --> PEOPLE_17
Sujit K. Singh --> PEOPLE_18
#### **MASTER LICENSE AGREEMENT**

**Between Customer and [ORGS_1] Linden Green Center 5460 Fairmont Drive Wilmington, DE 19808 Telephone: 302-738-9215 Fed ID/TIN: 51-035 0007**

Customer Name: BUCKMAN LABORATORIES INTERNATIONAL, INC.

Address: 1256 North McLean Boulevard, Memphis Tennessee 38108-1241, U.S.A. Telephone: (901) 278-0330

ATTN:

| 1.0  | D