# vLLM direct redaction

> **⚠️⚠️⚠️ vLLM makes it harder to run larger models than Ollama does, which hurts performance here. ⚠️⚠️⚠️**

This notebook explores using LLMs directly to identify `Personally Identifiable Information` (PII).

> ℹ️ We use [vLLM](https://docs.vllm.ai/en/stable/) and markdown inputs as produced by something like [Docling](https://github.com/docling-project/docling).

## ⚙️ Setup

In [1]:
# install uv
!curl -LsSf https://astral.sh/uv/install.sh | sh

# Install vLLM
!uv pip install --system vllm python-dotenv

downloading uv 0.6.10 x86_64-unknown-linux-gnu
no checksums to verify
installing to /root/.local/bin
  uv
  uvx
everything's installed!
Using Python 3.11.7 environment at: /usr
Resolved 126 packages in 833ms
Prepared 83 packages in 41.13s
Uninstalled 17 packages in 1.08s
Installed 83 packages in 431ms
 + airportsdata==20250224
 + annotated-types==0.7.0
 + astor==0.8.1
 + blake3==1.0.4
 + compressed-tensors==0.9.2
 + depyf==0.18.0
 + diskcache==5.6.3
 + distro==1.9.0
 + dnspython==2.7.0
 + einops==0.8.1

## 🦜 LLM
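
The `DTYPE` setting in the configuration cell below depends on the GPU generation. As a minimal sketch (the helper names `pick_dtype` and `detect_dtype` are hypothetical, not part of vLLM), the choice could be derived from the CUDA compute capability, assuming `bfloat16` needs capability 8.0 or higher:

```python
def pick_dtype(compute_capability: tuple[int, int]) -> str:
    """Return "bfloat16" for compute capability >= 8.0, otherwise "float16"."""
    # Tuples compare element-wise, so (7, 5) < (8, 0) <= (9, 0).
    return "bfloat16" if compute_capability >= (8, 0) else "float16"


def detect_dtype() -> str:
    """Pick a dtype for GPU 0; fall back to float16 if no CUDA GPU is visible."""
    try:
        import torch  # imported lazily so pick_dtype works without torch
    except ImportError:
        return "float16"
    if not torch.cuda.is_available():
        return "float16"
    # get_device_capability returns a (major, minor) tuple for the device.
    return pick_dtype(torch.cuda.get_device_capability(0))


print(pick_dtype((7, 5)))  # e.g. a Turing card such as the T4
print(pick_dtype((8, 0)))  # e.g. an Ampere card such as the A100
```

With this, `DTYPE = detect_dtype()` could replace the hard-coded value.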

In [5]:
import os
from dotenv import load_dotenv
from vllm import LLM, SamplingParams

load_dotenv()

# ===================================== 👇 Configure as needed =================================
assert os.environ.get("HF_TOKEN"), "Please set HF_TOKEN=hf..."  # Hugging Face API token

# Or: 'meta-llama/Llama-3.2-3B-Instruct'
# Or: 'mistralai/Mixtral-8x7B-Instruct-v0.1'
MODEL_ID = "microsoft/Phi-4-mini-instruct"

MAX_MODEL_LEN = 10000    # Adjust depending on available GPU memory
DTYPE = "float16"        # Change to "bfloat16" if the GPU supports it (CUDA compute capability >= 8.0)
# ===================================== 👆 Configure as needed =================================


llm = LLM(
    model=MODEL_ID,
    max_model_len=MAX_MODEL_LEN,  
    trust_remote_code=True, 
    dtype=DTYPE,
)

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-4-mini-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


INFO 03-26 18:13:15 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 03-26 18:13:29 [config.py:585] This model supports multiple tasks: {'generate', 'score', 'reward', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 03-26 18:13:29 [llm_engine.py:241] Initializing a V0 LLM engine (v0.8.2) with config: model='microsoft/Phi-4-mini-instruct', speculative_config=None, tokenizer='microsoft/Phi-4-mini-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_end

INFO 03-26 18:13:34 [cuda.py:239] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 03-26 18:13:34 [cuda.py:288] Using XFormers backend.
INFO 03-26 18:13:35 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-26 18:13:35 [model_runner.py:1110] Starting to load model microsoft/Phi-4-mini-instruct...
INFO 03-26 18:13:35 [weight_utils.py:265] Using model weights format ['*.safetensors']


INFO 03-26 18:13:54 [weight_utils.py:281] Time spent downloading weights for microsoft/Phi-4-mini-instruct: 19.354084 seconds


INFO 03-26 18:14:00 [loader.py:447] Loading weights took 5.39 seconds
INFO 03-26 18:14:00 [model_runner.py:1146] Model loading took 7.1694 GB and 25.053579 seconds
INFO 03-26 18:14:03 [worker.py:267] Memory profiling takes 2.12 seconds
INFO 03-26 18:14:03 [worker.py:267] the current vLLM instance can use total_gpu_memory (15.73GiB) x gpu_memory_utilization (0.90) = 14.16GiB
INFO 03-26 18:14:03 [worker.py:267] model weights take 7.17GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.88GiB; the rest of the memory reserved for KV Cache is 5.06GiB.
INFO 03-26 18:14:03 [executor_base.py:111] # cuda blocks: 2591, # CPU blocks: 2048
INFO 03-26 18:14:03 [executor_base.py:116] Maximum concurrency for 10000 tokens per request: 4.15x
INFO 03-26 18:14:07 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. I

Capturing CUDA graph shapes: 100%|██████████| 35/35 [01:06<00:00,  1.91s/it]

INFO 03-26 18:15:13 [model_runner.py:1570] Graph capturing finished in 67 secs, took 0.39 GiB
INFO 03-26 18:15:14 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 73.22 seconds
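
The memory figures in the log above explain how `MAX_MODEL_LEN` trades off against concurrency. The sketch below only re-derives the logged numbers; the one value not taken from the log is the KV-cache block size of 16 tokens, which is an assumption (it is consistent with the reported `4.15x`).

```python
# Figures copied from the engine log above.
total_gpu_gib = 15.73
gpu_memory_utilization = 0.90
weights_gib = 7.17
non_torch_gib = 0.05
activation_peak_gib = 1.88
cuda_blocks = 2591
max_model_len = 10000

BLOCK_SIZE = 16  # assumption: tokens per KV-cache block, not stated in the log

# Memory vLLM is allowed to use on this GPU.
budget_gib = total_gpu_gib * gpu_memory_utilization
# Whatever is left after weights and activations goes to the KV cache.
kv_cache_gib = budget_gib - weights_gib - non_torch_gib - activation_peak_gib
# Number of tokens the KV cache can hold, and how many full-length requests fit.
kv_cache_tokens = cuda_blocks * BLOCK_SIZE
max_concurrency = kv_cache_tokens / max_model_len

print(f"budget:      {budget_gib:.2f} GiB")    # 14.16 GiB
print(f"KV cache:    {kv_cache_gib:.2f} GiB")  # 5.06 GiB
print(f"KV tokens:   {kv_cache_tokens}")       # 41456
print(f"concurrency: {max_concurrency:.2f}x")  # 4.15x
```

Lowering `MAX_MODEL_LEN` raises the number of requests that fit at once; raising it beyond the KV-cache token count would leave no room for even one full-length request.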





In [71]:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()

### 🗣️ Check chatter

In [6]:
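def generate(prompt: str, sampling_params: dict | None = None, pbar: bool = False, **chat_args) -> str:
    # Compose the messages
    messages = [
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": prompt},
    ]
    # Compose the sampling parameters; caller-supplied values override the defaults.
    # (A mutable `{}` default is avoided, and overriding `max_tokens` or
    # `temperature` no longer raises a duplicate-keyword error.)
    params = SamplingParams(
        **{"max_tokens": 1000, "temperature": 0.0, **(sampling_params or {})}
    )
    # Inference
    output = llm.chat(
        messages=messages, sampling_params=params, use_tqdm=pbar, **chat_args
    )
    # One conversation in, so take the first completion of the first request
    return output[0].outputs[0].text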


# Run inference
ans = generate("hey there. What can you do?", pbar=True)
print(f"🗣️ Answer: {ans}")
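
`generate` sends one conversation per call. As a sketch of a batched variant, assuming the installed vLLM version accepts a list of conversations in `llm.chat` and returns outputs in input order (worth checking against its API reference), several prompts could be submitted at once. `build_conversations` and `generate_batch` are hypothetical helper names.

```python
SYSTEM_PROMPT = "You are a helpful AI assistant."


def build_conversations(prompts: list[str]) -> list[list[dict[str, str]]]:
    """Wrap each prompt in its own system + user conversation."""
    return [
        [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ]
        for prompt in prompts
    ]


def generate_batch(llm, prompts: list[str], sampling_params) -> list[str]:
    """Hypothetical batched counterpart of generate().

    Assumes llm.chat accepts a list of conversations and returns one
    result per conversation, in the same order.
    """
    outputs = llm.chat(
        messages=build_conversations(prompts),
        sampling_params=sampling_params,
        use_tqdm=False,
    )
    return [out.outputs[0].text for out in outputs]
```

Batching lets the engine schedule the requests together instead of waiting for each answer before sending the next prompt.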

## 📚 Data

In [7]:
from pathlib import Path
from utils import split_markdown_by_spans  # helper from the project's own `utils` module


DATA_DIR = Path("/datasets/client-data-us/")

# rglob already searches recursively; sorting keeps md_docs[0] stable between runs
md_docs = sorted(DATA_DIR.rglob("*.md"))
print(f"Total markdown documents: {len(md_docs)}")

# Read the first document
text = md_docs[0].read_text()
print(text[:200])

# Split into chunks
spans = split_markdown_by_spans(text)
for span_id, span_text in spans.items():
    print(f"ID: {span_id}")
    print(f"Text: {span_text[:50]}...\n")
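
`split_markdown_by_spans` comes from a `utils` module that is not shown in this notebook. For readers without that module, a hypothetical stand-in is sketched below. It assumes only what the cell above relies on (a dict mapping a span id to its text) and splits at markdown headings; the real helper may split differently.

```python
import re


def split_markdown_by_headings(text: str) -> dict[str, str]:
    """Hypothetical stand-in: start a new span at every ATX heading (#, ##, ...)."""
    spans: dict[str, str] = {}
    current: list[str] = []

    def flush() -> None:
        # Store the collected lines as a span, skipping whitespace-only chunks.
        chunk = "\n".join(current).strip()
        if chunk:
            spans[f"span_{len(spans)}"] = chunk

    for line in text.splitlines():
        if re.match(r"#{1,6}\s", line) and current:
            flush()
            current = []
        current.append(line)
    flush()
    return spans


demo = "intro\n# A\nbody a\n## B\nbody b"
for span_id, span_text in split_markdown_by_headings(demo).items():
    print(span_id, repr(span_text))
```

This sketch does not treat fenced code blocks specially, so a `#` comment at the start of a line inside a fence would also start a new span.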

## 🫥 Anonymisation

In [19]:
from collections import defaultdict
from tqdm.rich import tqdm
from pprint import pformat

TEMPLATE = """
In this chunk of markdown text:

```
{context}
```

{request}

Answer with a bullet-point list if any are found. Otherwise respond 'None'
"""

requests = {
    # The parentheses only group the two string literals into one string;
    # a trailing comma inside them would turn the value into a tuple.
    "orgs": (
        "What are the companies mentioned in the text? "
        "(Do not include placeholders like: 'Developer', 'Customer', 'Distributor' or similar.)"
    ),
    "loc": "What locations or addresses are mentioned in the text?",
    "contact": "What telephone numbers or emails are mentioned in the text?",
    "people": "Which people are mentioned in the text?",
    "date": "What dates are mentioned in the text?",
}

entities = defaultdict(lambda: defaultdict(list))
doc_iter = tqdm(md_docs)
for md_doc in doc_iter:
    doc_name = md_doc.stem
    doc_iter.set_description(f"📄 {doc_name}...")
    doc_text = md_doc.read_text()
    spans = split_markdown_by_spans(doc_text)
    for span_id, span_text in spans.items():
        for k, question in requests.items():
            ans = generate(TEMPLATE.format(context=span_text, request=question))
            if "None" not in ans:
                # Split the bullet list and drop empty fragments
                entities[doc_name][k] += [a.strip() for a in ans.split("- ") if a.strip()]

    break  # Only process the first document for now


print(pformat(entities))
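
Splitting the answer on `"- "` also cuts through entities that contain a spaced hyphen, and the substring check for `None` discards any answer that merely contains that word. A more defensive parser could look like the sketch below; `parse_bullets` is a hypothetical helper, and the accepted bullet markers are an assumption about how the model formats its lists.

```python
import re

# A bullet line: optional indent, a marker (-, *, • or "1." / "1)"), then the item.
_BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+(.*\S)\s*$")


def parse_bullets(answer: str) -> list[str]:
    """Extract bullet items from an LLM answer; return [] for a bare 'None'."""
    # Treat the answer as empty only if it is exactly "None",
    # optionally wrapped in quotes or followed by a period.
    if answer.strip().strip("'\".").lower() == "none":
        return []
    items = []
    for line in answer.splitlines():
        match = _BULLET.match(line)
        if match:
            items.append(match.group(1))
    return items


print(parse_bullets("- Acme Corp\n- Nonesuch Ltd"))
print(parse_bullets("* Jan 5 - Jan 7, 2021"))
print(parse_bullets("None"))
```

Lines that are not bullets (for example an introductory sentence) are ignored rather than stored as entities.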

In [11]:
# Map each entity to a placeholder (the substitution happens in the next cell)
placeholders = {}
for doc_name, doc_entities in entities.items():
    placeholders[doc_name] = {}
    for ent_type, ent_list in doc_entities.items():
        # Deduplicate while keeping first-seen order so the numbering has no gaps
        for i, ent in enumerate(dict.fromkeys(ent_list)):
            placeholders[doc_name][ent] = f"{ent_type.upper()}_{i}"

print(pformat(placeholders))

In [18]:
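import re


def mask_text(text, entities):
    masked_text = text  # strings are immutable, so no copy is needed
    # Replace longer entities first so a short entity cannot break up a longer one
    for ent, placeholder in sorted(entities.items(), key=lambda kv: len(kv[0]), reverse=True):
        if not ent or ent in ["Customer", "Developer", "Distributor"]:
            continue
        print(f"{ent} --> {placeholder}")
        # Escape the entity: it is literal text from the LLM, not a regex pattern
        masked_text = re.sub(re.escape(ent), f"[{placeholder}]", masked_text)

    return masked_text


doc = md_docs[0]
doc_name = doc.stem
doc_text = doc.read_text()
print(mask_text(doc_text, placeholders[doc_name]))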
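
Because `mask_text` substitutes one entity at a time, a later entity can match text inside a placeholder that an earlier substitution inserted. One alternative, sketched here under the assumption that entities should be matched as literal strings, is a single pass over the text with one combined pattern. `mask_single_pass` is a hypothetical name.

```python
import re


def mask_single_pass(text: str, mapping: dict[str, str]) -> str:
    """Replace every entity in one pass so inserted placeholders are never rescanned."""
    # Longest first: the regex alternation picks the first alternative that matches.
    entities = sorted((e for e in mapping if e), key=len, reverse=True)
    if not entities:
        return text
    pattern = re.compile("|".join(re.escape(e) for e in entities))
    # The callback looks up the placeholder for whichever entity matched.
    return pattern.sub(lambda m: f"[{mapping[m.group(0)]}]", text)


demo_mapping = {"Acme": "ORGS_0", "Acme (UK) Ltd.": "ORGS_1"}
print(mask_single_pass("Acme (UK) Ltd. hired Acme in 2020.", demo_mapping))
# [ORGS_1] hired [ORGS_0] in 2020.
```

The parentheses and the period in `Acme (UK) Ltd.` are matched literally thanks to `re.escape`, and the longer entity wins over its prefix `Acme`.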