# DP-Fusion-Lib: Basic Usage

This notebook demonstrates how to use **DP-Fusion-Lib** for differentially private text generation with the Tagger API for automatic PII detection.

**Requirements:**
- GPU with CUDA support (recommended)
- API key from [console.documentprivacy.com](https://console.documentprivacy.com)

**Documentation:** [GitHub Repository](https://github.com/rushil-thareja/dp-fusion-lib)

## 1. Installation

Install the library if not already installed:

In [1]:
# Uncomment to install
# !pip install dp-fusion-lib

In [19]:
!pip install -i https://test.pypi.org/simple/ dp-fusion-lib==0.1.0

Looking in indexes: https://test.pypi.org/simple/


## 2. Configuration

Set your model and API key configuration:

In [4]:
# Model configuration
MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"

# API key - Get your free key at console.documentprivacy.com
API_KEY = "put your key here"
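
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative is to read the key from an environment variable. The variable name `DPFUSION_API_KEY` and the helper `load_api_key` are hypothetical names chosen for illustration; they are not part of DP-Fusion-Lib.

```python
import os
from typing import Optional


def load_api_key(env_var: str = "DPFUSION_API_KEY",
                 default: Optional[str] = None) -> str:
    """Return the API key stored in `env_var`, or raise if none is set.

    `DPFUSION_API_KEY` is a hypothetical variable name, not one the
    library is known to read on its own.
    """
    key = os.environ.get(env_var, default)
    if not key:
        raise RuntimeError(
            f"No API key found: set the {env_var} environment variable."
        )
    return key

# Usage (after exporting the variable in your shell or Colab secrets):
# API_KEY = load_api_key()
```

The returned string can then be assigned to `API_KEY` in place of the placeholder above.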

## 3. Import Libraries and Load Model

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dp_fusion_lib import DPFusion, Tagger, compute_epsilon_single_group

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

PyTorch version: 2.9.0+cu126
CUDA available: True
GPU: Tesla T4


In [5]:
# Load tokenizer
print(f"Loading tokenizer: {MODEL_ID}")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True
)

# Load model
print(f"Loading model: {MODEL_ID}")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
model.eval()

print("Model loaded successfully!")

Loading tokenizer: Qwen/Qwen2.5-7B-Instruct


Loading model: Qwen/Qwen2.5-7B-Instruct



Model loaded successfully!


## 4. Initialize Tagger API

The Tagger API automatically identifies sensitive phrases in your documents using Constitutional AI.

In [6]:
# Initialize Tagger
tagger = Tagger(api_key=API_KEY, verbose=True)

# List available models
print("Available tagger models:")
available_models = tagger.get_available_models()
# The response is a dict; the model names are listed under its "models" key
for m in available_models["models"]:
    print(f"  - {m}")

Available tagger models:
[Tagger] GET https://api.documentprivacy.com/models
[Tagger] Response: {'models': ['llama3.1-8b', 'llama-3.3-70b', 'qwen-3-32b', 'qwen-3-235b-a22b-instruct-2507', 'zai-glm-4.6', 'gpt-oss-120b'], 'default_model': 'llama3.1-8b'}
  - llama3.1-8b
  - llama-3.3-70b
  - qwen-3-32b
  - qwen-3-235b-a22b-instruct-2507
  - zai-glm-4.6
  - gpt-oss-120b


In [7]:
# Configure tagger
tagger.set_model("gpt-oss-120b")  # Strong extraction model
tagger.set_constitution("LEGAL")  # Options: LEGAL, HEALTH, FINANCE

print("Tagger configured!")

Tagger configured!


## 5. Initialize DPFusion

In [8]:
# Initialize DPFusion with the tagger
dpf = DPFusion(
    model=model,
    tokenizer=tokenizer,
    max_tokens=100,
    tagger=tagger
)

print("DPFusion initialized!")

DPFusion initialized!


## 6. Prepare the Text to Privatize

Define the sensitive document you want to paraphrase with privacy protection:

In [9]:
# Example private text (ECHR-style legal document)
private_text = """The applicant was born in 1973 and currently resides in Les Salles-sur-Verdon, France.
In the early 1990s, a new criminal phenomenon emerged in Denmark known as 'tax asset stripping cases' (selskabstømmersager)."""

print(f"Document ({len(private_text)} characters):")
print("-" * 60)
print(private_text)

Document (211 characters):
------------------------------------------------------------
The applicant was born in 1973 and currently resides in Les Salles-sur-Verdon, France.
In the early 1990s, a new criminal phenomenon emerged in Denmark known as 'tax asset stripping cases' (selskabstømmersager).


## 7. Build Context with Privacy Annotations

Mark which parts of the conversation are private (sensitive) vs public (instructions):

In [10]:
# Build context using message API
dpf.add_message(
    "system",
    "You are a helpful assistant that paraphrases text.",
    is_private=False  # Instructions are public
)

dpf.add_message(
    "user",
    private_text,
    is_private=True  # Document is private/sensitive
)

dpf.add_message(
    "system",
    "Now paraphrase this text for privacy",
    is_private=False
)

dpf.add_message(
    "assistant",
    "Sure, here is the paraphrase of the above text that ensures privacy:",
    is_private=False
)

print("Context built with privacy annotations!")

Context built with privacy annotations!


## 8. Run Tagger to Identify Sensitive Phrases

The tagger will automatically identify and redact sensitive information:

In [11]:
# Run tagger to extract and redact private phrases
print("Running Tagger API to extract private phrases...")
dpf.run_tagger()

Running Tagger API to extract private phrases...
[Tagger] POST https://api.documentprivacy.com/extract
[Tagger] Input document: The applicant was born in 1973 and currently resides in Les Salles-sur-Verdon, France.
In the early 1990s, a new criminal phenomenon emerged in Denmark known as 'tax asset stripping cases' (selskabstø...
[Tagger] Model: gpt-oss-120b, Constitution: LEGAL
[Tagger] Extracted phrases: ['1973', 'Les Salles-sur-Verdon', 'Les Salles-sur-Verdon, France', 'early 1990s', 'tax asset stripping cases']


In [12]:
# View the private context (original)
print("PRIVATE CONTEXT (full text):")
print("=" * 60)
print(dpf.private_context)

PRIVATE CONTEXT (full text):
<|im_start|>system
You are a helpful assistant that paraphrases text.<|im_end|>
<|im_start|>user
The applicant was born in 1973 and currently resides in Les Salles-sur-Verdon, France.
In the early 1990s, a new criminal phenomenon emerged in Denmark known as 'tax asset stripping cases' (selskabstømmersager).<|im_end|>
<|im_start|>system
Now paraphrase this text for privacy<|im_end|>
<|im_start|>assistant
Sure, here is the paraphrase of the above text that ensures privacy:<|im_end|>
<|im_start|>assistant



In [13]:
# View the public context (redacted)
print("PUBLIC CONTEXT (redacted):")
print("=" * 60)
print(dpf.public_context)

PUBLIC CONTEXT (redacted):
<|im_start|>system
You are a helpful assistant that paraphrases text.<|im_end|>
<|im_start|>user
The applicant was born in ____ and currently resides in_________.
In the_______, a new criminal phenomenon emerged in Denmark known as '____' (selskabstømmersager).<|im_end|>
<|im_start|>system
Now paraphrase this text for privacy<|im_end|>
<|im_start|>assistant
Sure, here is the paraphrase of the above text that ensures privacy:<|im_end|>
<|im_start|>assistant
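
To make the relationship between the two contexts concrete, here is a minimal sketch of how a list of extracted phrases could be turned into a redacted copy of a text. This is an illustration only: the `redact` helper and its fixed `____` placeholder are assumptions, and the library's own redaction may behave differently (the blanks in the output above vary in length, for instance).

```python
import re
from typing import Sequence


def redact(text: str, phrases: Sequence[str], placeholder: str = "____") -> str:
    """Replace every occurrence of each phrase with a fixed placeholder.

    Hypothetical helper for illustration; not DP-Fusion-Lib's implementation.
    """
    # Sort longest first so an overlapping phrase such as
    # 'Les Salles-sur-Verdon, France' wins over its own prefix.
    ordered = sorted(set(phrases), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(p) for p in ordered))
    return pattern.sub(placeholder, text)


text = ("The applicant was born in 1973 and currently resides in "
        "Les Salles-sur-Verdon, France.")
phrases = ["1973", "Les Salles-sur-Verdon", "Les Salles-sur-Verdon, France"]
print(redact(text, phrases))
```

Sorting the phrases by length matters because the tagger returned both `Les Salles-sur-Verdon` and `Les Salles-sur-Verdon, France`; matching the shorter one first would leave `, France` behind in the redacted text.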



## 9. Generate with Differential Privacy

Now generate text using DP-Fusion, which provides formal (ε, δ)-DP guarantees:

In [14]:
# Privacy parameters
ALPHA = 2.0    # Rényi divergence order
BETA = 0.01    # Per-token privacy budget (lower = more private)

print(f"Generating with α={ALPHA}, β={BETA}...")
print("-" * 60)

output = dpf.generate(
    alpha=ALPHA,
    beta=BETA,
    temperature=1.0,
    max_new_tokens=100,
    debug=True  # Print the selected lambda and divergence at each step
)

print("Generation complete!")

Generating with α=2.0, β=0.01...
------------------------------------------------------------
[DP-Fusion] Starting generation. Private groups: ['PRIVATE']
[Initial] Prefix shape for group PUBLIC: torch.Size([115])
[Initial] Prefix shape for group PRIVATE: torch.Size([115])
[Initial] Input batch shape: torch.Size([2, 115])
[Initial] Selected Lambda for group PRIVATE: 0.008893966674804688, Divergence: 0.019663169980049133
[Initial] Sampled token 'The' (ID=785)
[Step 1] Selected Lambda for group PRIVATE: 0.1636190414428711, Divergence: 0.01983731985092163
[Step 2] Selected Lambda for group PRIVATE: 0.046942710876464844, Divergence: 0.0197348203510046
[Step 3] Selected Lambda for group PRIVATE: 1.0, Divergence: 0.0004015354788862169
[Step 4] Selected Lambda for group PRIVATE: 0.9472379684448242, Divergence: 0.019990170374512672
[Step 5] Selected Lambda for group PRIVATE: 0.0008144378662109375, Divergence: 0.019963061437010765
[Step 6] Selected Lambda for group PRIVATE: 0.18547725677490234,
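
The debug log reports a lambda and a divergence for every step. A rough sketch of what such a selection could look like: mix the private and public next-token distributions with weight lambda, then bisect for the largest lambda whose Rényi divergence from the public distribution stays under a bound. The function names, the direction of the divergence, and the use of bisection are all assumptions made for illustration, and the toy distributions are invented; this is not the library's code.

```python
import math


def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q) for discrete distributions, alpha > 1.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    total = sum((pi ** alpha) * (qi ** (1.0 - alpha))
                for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1.0)


def mix(p_priv, p_pub, lam):
    """Convex combination: lam * private + (1 - lam) * public."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_priv, p_pub)]


def select_lambda(p_priv, p_pub, alpha, bound, iters=30):
    """Largest lam in [0, 1] with D_alpha(mix || p_pub) <= bound (bisection)."""
    if renyi_divergence(p_priv, p_pub, alpha) <= bound:
        return 1.0  # the private distribution already fits within the bound
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if renyi_divergence(mix(p_priv, p_pub, mid), p_pub, alpha) <= bound:
            lo = mid
        else:
            hi = mid
    return lo


# Toy three-token vocabulary (invented numbers)
p_pub = [0.5, 0.3, 0.2]
p_priv = [0.1, 0.2, 0.7]
lam = select_lambda(p_priv, p_pub, alpha=2.0, bound=0.02)
print(f"lambda={lam:.4f}, "
      f"divergence={renyi_divergence(mix(p_priv, p_pub, lam), p_pub, 2.0):.4f}")
```

Under these assumptions, a step where the two distributions nearly agree returns lambda = 1.0 with a small divergence, which resembles Step 3 in the log, while steps where they disagree return a small lambda with a divergence close to the bound.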

In [15]:
# Display generated text
print("GENERATED TEXT:")
print("=" * 60)
print(output['text'])

GENERATED TEXT:
system
You are a helpful assistant that paraphrases text.
user
The applicant was born in ____ and currently resides in_________.
In the_______, a new criminal phenomenon emerged in Denmark known as '____' (selskabstømmersager).
system
Now paraphrase this text for privacy
assistant
Sure, here is the paraphrase of the above text that ensures privacy:
assistant
The individual was born in an unspecified location and currently resides in an unspecified place. In a certain region, a new criminal phenomenon emerged in Denmark known as 'cluster incidents' (selskabstømmersager).


## 10. Analyze Generation Statistics

In [16]:
# Lambda statistics (mixing parameter)
if output['lambdas'].get('PRIVATE'):
    lambdas = output['lambdas']['PRIVATE']
    print("Lambda Statistics (mixing parameter):")
    print(f"  Mean:  {sum(lambdas)/len(lambdas):.4f}")
    print(f"  Min:   {min(lambdas):.4f}")
    print(f"  Max:   {max(lambdas):.4f}")
    print(f"  Count: {len(lambdas)} tokens")

Lambda Statistics (mixing parameter):
  Mean:  0.2215
  Min:   0.0000
  Max:   1.0000
  Count: 45 tokens


In [17]:
# Divergence statistics
if output['divergences'].get('PRIVATE'):
    divs = output['divergences']['PRIVATE']
    print("Divergence Statistics:")
    print(f"  Mean:  {sum(divs)/len(divs):.4f}")
    print(f"  Min:   {min(divs):.4f}")
    print(f"  Max:   {max(divs):.4f}")
    print(f"  Count: {len(divs)} tokens")

Divergence Statistics:
  Mean:  0.0153
  Min:   0.0000
  Max:   0.0200
  Count: 45 tokens


## 11. Compute Privacy Guarantee

Calculate the formal (ε, δ)-DP guarantee for this generation:

In [18]:
# Privacy accounting parameters
DELTA = 1e-5  # Target δ for (ε, δ)-DP

if output['divergences'].get('PRIVATE'):
    eps_result = compute_epsilon_single_group(
        divergences=output['divergences']['PRIVATE'],
        alpha=ALPHA,
        delta=DELTA,
        beta=BETA
    )

    print("=" * 60)
    print("(ε, δ)-DIFFERENTIAL PRIVACY GUARANTEE")
    print("=" * 60)
    print(f"Parameters: α={ALPHA}, β={BETA}, δ={DELTA}")
    print(f"Tokens generated: {eps_result['T']}")
    print()
    print(f"Empirical ε:   {eps_result['empirical']:.4f}")
    print(f"  (computed from actual divergences observed)")
    print()
    print(f"Theoretical ε: {eps_result['theoretical']:.4f}")
    print(f"  (worst-case bound, assuming max divergence per step)")
    print()
    print(f"This generation satisfies ({eps_result['empirical']:.2f}, {DELTA})-DP")
else:
    print("No divergences recorded.")

(ε, δ)-DIFFERENTIAL PRIVACY GUARANTEE
Parameters: α=2.0, β=0.01, δ=1e-05
Tokens generated: 45

Empirical ε:   12.8942
  (computed from actual divergences observed)

Theoretical ε: 13.3129
  (worst-case bound, assuming max divergence per step)

This generation satisfies (12.89, 1e-05)-DP
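
It can help to see where the reported ε comes from. The standard conversion from Rényi DP to (ε, δ)-DP adds a term ln(1/δ)/(α − 1) to the composed Rényi cost. The sketch below subtracts that term from the two values printed above. Whether `compute_epsilon_single_group` uses exactly this conversion is an assumption; the check only shows that the printed numbers are consistent with it.

```python
import math

# Values taken from the run above
ALPHA, BETA, DELTA, T = 2.0, 0.01, 1e-5, 45
reported_theoretical = 13.3129
reported_empirical = 12.8942

# Standard RDP -> (eps, delta)-DP conversion term
delta_term = math.log(1.0 / DELTA) / (ALPHA - 1.0)

# What remains is the cost attributable to the generated tokens
# (assuming the library applies the conversion above).
composed_theoretical = reported_theoretical - delta_term
composed_empirical = reported_empirical - delta_term

print(f"delta term:                 {delta_term:.4f}")
print(f"theoretical, tokens only:   {composed_theoretical:.4f}")
print(f"theoretical, per token:     {composed_theoretical / T:.4f}")
print(f"empirical, tokens only:     {composed_empirical:.4f}")
```

With δ = 1e-5 and α = 2, the conversion term is about 11.51, so most of the reported ε in this short run comes from δ rather than from the 45 generated tokens. The theoretical remainder works out to roughly 0.04 per token, which equals 4β (or 2αβ) at these settings; that relationship is inferred from this single run and is not a documented formula.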


## 12. Summary

You have successfully:

1. Loaded an LLM with GPU acceleration
2. Used the Tagger API to automatically identify sensitive phrases
3. Generated text with formal differential privacy guarantees
4. Computed the privacy budget (ε) for your generation

**Key Concepts:**

| Metric | Description |
|--------|-------------|
| **Empirical ε** | Actual privacy cost based on observed divergences |
| **Theoretical ε** | Worst-case upper bound for compliance reporting |
| **λ (Lambda)** | Mixing parameter between private and public distributions |
| **β (Beta)** | Per-token privacy budget (lower = more private) |

**Next Steps:**
- Try different `beta` values to adjust privacy-utility tradeoff
- Experiment with different document types and constitutions
- See the [documentation](https://github.com/rushil-thareja/dp-fusion-lib) for advanced usage