# Watermarking Methods - Complete Experiments

This notebook implements and evaluates six text watermarking methods, three from prior work and three GPW variants:

**Prior Work:**
1. **Unigram-Watermark** (Zhao et al., 2023) - Provable Robust Watermarking
2. **KGW/TGRL** (Kirchenbauer et al., 2023) - Red-Green List Watermarking
3. **SEMSTAMP** (Hou et al., 2024) - Semantic Watermark with LSH

**Our Methods:**
4. **GPW** - Gaussian Pancakes Watermarking (basic)
5. **GPW-SP** - GPW with Salted Phase
6. **GPW-SP+SR** - GPW with Salted Phase + Semantic Representation Coupling

## Table of Contents
1. [Setup & Installation](#setup)
2. [Load Models and Data](#load)
3. [Watermark Generation](#generation)
4. [Detection Experiments](#detection)
5. [Attack Robustness](#attacks)
6. [Quality Evaluation](#quality)
7. [Comparison & Analysis](#comparison)

In [None]:
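%%bash
# Unzip the uploaded workspace and copy its contents into the current directory.
# Upload watermark_experiments.zip to Colab first, then run this cell.
# (%%bash must be the first line of the cell, so the comments follow it.)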
echo "Looking for zip file..."
ZIP_FILE=$(find . -maxdepth 1 -name "*.zip" -type f | head -n 1)
if [ -n "$ZIP_FILE" ]; then
  echo "Found zip file: $ZIP_FILE"
  echo "Extracting to WATERMARK_EXPERIMENTS/..."
  unzip -q "$ZIP_FILE" -d WATERMARK_EXPERIMENTS
  echo "Moving contents from WATERMARK_EXPERIMENTS/watermark_experiments/ to current directory..."
  if [ -d "WATERMARK_EXPERIMENTS/watermark_experiments" ]; then
    cp -r WATERMARK_EXPERIMENTS/watermark_experiments/* .
  else
    cp -r WATERMARK_EXPERIMENTS/* .
  fi
  echo "Cleaning up..."
  rm -rf WATERMARK_EXPERIMENTS
  echo "Done! Workspace contents are now in the current directory."
else
  echo "No zip file found. Please upload watermark_experiments.zip first."
fi

## 1. Setup & Installation <a name="setup"></a>

In [None]:
# Install required packages (run once)
!pip install -q torch transformers accelerate
!pip install -q sentence-transformers nltk
!pip install -q bert-score mauve-text
!pip install -q datasets scikit-learn scipy
!pip install -q openai  # For GPT-based attacks (optional)

In [None]:
import os
import sys
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# Set device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# For reproducibility
torch.manual_seed(42)
np.random.seed(42)


In [None]:
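# Import our watermarking framework.
# These are direct imports, so the `watermarkers`, `attacks`, and `metrics` modules must be
# importable from the current working directory (run the unzip cell first); otherwise this
# cell raises ModuleNotFoundError.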
from watermarkers import (
    UnigramWatermark, KGWWatermark, SEMSTAMPWatermark,
    GPWWatermark, GPWConfig, SRConfig, create_gpw_variant
)
from attacks import (
    SynonymAttack, SwapAttack, TypoAttack,
    PegasusAttack, BigramAttack, CopyPasteAttack
)
from metrics.detection import compute_detection_metrics, tpr_at_fpr
from metrics.quality import compute_perplexity, compute_bertscore, compute_diversity
# Note: C4 data loading is done manually in the data loading cell below
# We don't import from data_loaders to avoid dependency conflicts

ModuleNotFoundError: No module named 'watermarkers'

## 2. Load Models and Data <a name="load"></a>

### Model Selection

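The loading cell below defaults to `facebook/opt-1.3b`. Set `MODEL_NAME` to `Qwen/Qwen2.5-14B-Instruct` for the strongest results, or to `Qwen/Qwen2.5-7B-Instruct` for a faster alternative.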

**For paper replication, you can use:**
- `facebook/opt-1.3b` (KGW, SEMSTAMP papers)
- `openai-community/gpt2-xl` (Unigram paper)
- `gpt2` (for quick testing)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Choose model (uncomment one)
# MODEL_NAME = "gpt2"  # Quick testing
MODEL_NAME = "facebook/opt-1.3b"  # Original paper model
# MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # Faster alternative
# MODEL_NAME = "Qwen/Qwen2.5-14B-Instruct"  # Best results

print(f"Loading model: {MODEL_NAME}")

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    device_map="auto" if device == "cuda" else None,
    trust_remote_code=True
)
if device == "cpu":
    model = model.to(device)

# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print(f"Model loaded successfully on {device}")

In [None]:
# Load SEMSTAMP sentence encoder (for SEMSTAMP experiments)
from sentence_transformers import SentenceTransformer

# ENCODER_NAME = "all-mpnet-base-v2"  # General use
ENCODER_NAME = "AbeHou/SemStamp-c4-sbert"  # Paper fine-tuned encoder

print(f"Loading sentence encoder: {ENCODER_NAME}")
sentence_encoder = SentenceTransformer(ENCODER_NAME, device=device)
print("Encoder loaded!")

In [None]:
# Load C4 dataset manually (without using data_loaders)
print("Loading C4 RealNewsLike dataset...")
print("This may take a few minutes for the first download...")

from datasets import load_dataset
import random

random.seed(42)

# Load C4 dataset with streaming
try:
    print("Attempting to load C4 realnewslike dataset...")
    dataset = load_dataset("c4", "realnewslike", split="validation", streaming=True)
    print("Dataset loaded successfully!")
except Exception as e:
    print(f"Error loading with 'c4' name: {e}")
    print("Trying alternative: allenai/c4...")
    try:
        dataset = load_dataset("allenai/c4", "realnewslike", split="validation", streaming=True)
        print("Dataset loaded successfully with allenai/c4!")
    except Exception as e2:
        raise RuntimeError(f"Failed to load C4 dataset. Errors: {e}, {e2}") from e2

# Collect samples
num_samples = 200
min_length = 100
max_length = 1000

c4_data = []
print(f"Collecting {num_samples} samples...")

for item in dataset:
    text = item.get("text", "")
    
    # Filter by length in characters (not tokens)
    if len(text) < min_length or len(text) > max_length:
        continue
    
    # Create prompt from first 30 words
    words = text.split()
    if len(words) < 10:
        continue
    
    prompt_words = words[:30]
    prompt = " ".join(prompt_words)
    
    c4_data.append({
        "text": text,
        "prompt": prompt,
        "source": "c4-realnewslike"
    })
    
    if len(c4_data) >= num_samples:
        break

print(f"\nLoaded {len(c4_data)} C4 samples")
if len(c4_data) > 0:
    print(f"\nExample prompt: {c4_data[0]['prompt'][:100]}...")

In [None]:
# Split data into prompts and human baselines
prompts = [d["prompt"] for d in c4_data[:100]]
human_texts = [d["text"] for d in c4_data[100:200]]

print(f"Prompts: {len(prompts)}, Human texts: {len(human_texts)}")

## 3. Watermark Generation <a name="generation"></a>

Initialize all watermarking methods (3 prior work + 3 GPW variants) and generate watermarked text.
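
As a rough illustration of how `gamma`, `delta`, and `z_threshold` interact in the red-green list methods, the sketch below partitions a toy vocabulary, adds `delta` to the green-token logits during sampling, and computes a one-proportion z-score at detection time. It is a minimal NumPy toy under stated assumptions, not the code in the `watermarkers` package: the helper names (`green_mask`, `bias_logits`, `z_score`, `sample`), the seeding rule, and the uniform stand-in logits are all hypothetical.

```python
import numpy as np

def green_mask(prev_token, vocab_size, gamma=0.5, key=42):
    """Mark a gamma fraction of the vocabulary as green, seeded by the previous token.

    The seeding rule (key * 1_000_003 + prev_token) is an illustrative assumption.
    """
    rng = np.random.default_rng(key * 1_000_003 + prev_token)
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.permutation(vocab_size)[: int(gamma * vocab_size)]] = True
    return mask

def bias_logits(logits, prev_token, gamma=0.5, delta=2.0):
    """Soft watermark: add delta to the logits of green tokens before sampling."""
    return logits + delta * green_mask(prev_token, logits.shape[0], gamma)

def z_score(tokens, vocab_size, gamma=0.5):
    """One-proportion z-test on the count of green tokens among scored positions."""
    hits = sum(
        int(green_mask(prev, vocab_size, gamma)[tok])
        for prev, tok in zip(tokens[:-1], tokens[1:])
    )
    n = len(tokens) - 1
    return (hits - gamma * n) / np.sqrt(n * gamma * (1 - gamma))

def sample(n_tokens, vocab_size, watermark, seed=0):
    """Sample tokens from uniform logits (a stand-in for a real language model)."""
    rng = np.random.default_rng(seed)
    tokens = [0]
    for _ in range(n_tokens):
        logits = np.zeros(vocab_size)
        if watermark:
            logits = bias_logits(logits, tokens[-1])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

VOCAB = 1000
wm_tokens = sample(200, VOCAB, watermark=True)
plain_tokens = sample(200, VOCAB, watermark=False)
print(f"watermarked z = {z_score(wm_tokens, VOCAB):.2f}")
print(f"plain z       = {z_score(plain_tokens, VOCAB):.2f}")
```

Seeding the mask with a constant instead of the previous token would give a single fixed green list, in the spirit of Unigram-Watermark. A text is flagged when its z-score exceeds the chosen threshold (4.0 in the cells below).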

In [None]:
# Initialize PRIOR WORK watermarkers

# Unigram-Watermark (Zhao et al., 2023)
unigram_wm = UnigramWatermark(
    model=model,
    tokenizer=tokenizer,
    gamma=0.5,
    delta=2.0,
    z_threshold=4.0,
    device=device
)

# KGW Watermark (Kirchenbauer et al., 2023)
kgw_wm = KGWWatermark(
    model=model,
    tokenizer=tokenizer,
    gamma=0.5,
    delta=2.0,
    z_threshold=4.0,
    context_width=1,
    seeding_scheme="simple_1",
    ignore_repeated_bigrams=True,
    device=device
)

# SEMSTAMP (Hou et al., 2024)
semstamp_wm = SEMSTAMPWatermark(
    model=model,
    tokenizer=tokenizer,
    embedder=sentence_encoder,
    lsh_dim=3,
    margin=0.02,
    z_threshold=4.0,
    device=device
)

print("Prior work watermarkers initialized!")
print(f"  Unigram: {unigram_wm.get_config()}")
print(f"  KGW: {kgw_wm.get_config()}")
print(f"  SEMSTAMP: {semstamp_wm.get_config()}")

In [None]:
# Initialize OUR GPW watermarkers (3 variants)

# GPW - Basic (no salted phase, no SR)
gpw_basic = create_gpw_variant(
    model=model,
    tokenizer=tokenizer,
    variant="GPW",
    alpha=1.2,
    omega=10.0,
    device=device
)

# GPW-SP - Salted Phase
gpw_sp = create_gpw_variant(
    model=model,
    tokenizer=tokenizer,
    variant="GPW-SP",
    alpha=1.2,
    omega=10.0,
    device=device
)

# GPW-SP+SR - Salted Phase + Semantic Representation Coupling
gpw_sp_sr = create_gpw_variant(
    model=model,
    tokenizer=tokenizer,
    variant="GPW-SP+SR",
    alpha=1.2,
    omega=10.0,
    device=device
)

print("\nGPW watermarkers initialized!")
print(f"  GPW: {gpw_basic.get_config()}")
print(f"  GPW-SP: {gpw_sp.get_config()}")
print(f"  GPW-SP+SR: {gpw_sp_sr.get_config()}")

In [None]:
# All watermarkers dict for iteration
ALL_WATERMARKERS = {
    # Prior work
    "Unigram": unigram_wm,
    "KGW": kgw_wm,
    "SEMSTAMP": semstamp_wm,
    # Our methods
    "GPW": gpw_basic,
    "GPW-SP": gpw_sp,
    "GPW-SP+SR": gpw_sp_sr,
}

print(f"Total watermarkers: {len(ALL_WATERMARKERS)}")
print(f"Methods: {list(ALL_WATERMARKERS.keys())}")

In [None]:
# Generate watermarked texts (this may take a while)
NUM_SAMPLES = 50  # Adjust based on available time/compute

wm_texts = {}  # Store generated texts for each method

for method_name, watermarker in ALL_WATERMARKERS.items():
    print(f"\nGenerating {method_name} watermarked texts...")
    wm_texts[method_name] = []
    for prompt in tqdm(prompts[:NUM_SAMPLES]):
        try:
            text = watermarker.generate(prompt, max_new_tokens=120, temperature=0.9, top_p=0.95)
            wm_texts[method_name].append(text)
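        except Exception as e:
            print(f"Error with {method_name}: {e}")
            # Fallback keeps the lists aligned, but the bare prompt carries no watermark,
            # so every failure here lowers the measured detection rate
            wm_texts[method_name].append(prompt)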

print(f"\n\nGeneration complete!")
for name, texts in wm_texts.items():
    print(f"  {name}: {len(texts)} texts")

In [None]:
# Display sample generated texts
print("=" * 80)
print("SAMPLE GENERATED TEXTS")
print("=" * 80)

for method_name in ALL_WATERMARKERS.keys():
    print(f"\n[{method_name}]\n{wm_texts[method_name][0][:200]}...")

print(f"\n[Human Text]\n{human_texts[0][:200]}...")

## 4. Detection Experiments <a name="detection"></a>

Score the watermarked and human texts with each method's detector, then report AUC and TPR at 1% and 5% FPR.
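
The TPR-at-fixed-FPR numbers reported in this section can be understood through the following sketch. The internals of `tpr_at_fpr` in `metrics.detection` are not shown in this notebook, so `tpr_at_fpr_sketch` is a hypothetical stand-in: it picks the threshold from the human scores so that at most `target_fpr` of them exceed it, then measures the fraction of watermarked scores above that threshold.

```python
import numpy as np

def tpr_at_fpr_sketch(wm_scores, human_scores, target_fpr=0.01):
    """Return (tpr, threshold) for a threshold calibrated on human scores.

    Hypothetical helper; the notebook's own tpr_at_fpr may differ in detail.
    """
    human = np.sort(np.asarray(human_scores, dtype=float))
    wm = np.asarray(wm_scores, dtype=float)
    # Number of human texts we may flag without exceeding the target FPR
    k = int(np.floor(target_fpr * len(human)))
    # Use the (k+1)-th largest human score; only strictly larger scores are flagged
    threshold = human[len(human) - 1 - k]
    tpr = float(np.mean(wm > threshold))
    return tpr, threshold

human_demo = [0.0, 0.5, 1.0, 1.5]
wm_demo = [1.2, 2.0, 3.0, 4.0]
print(tpr_at_fpr_sketch(wm_demo, human_demo, target_fpr=0.25))
print(tpr_at_fpr_sketch(wm_demo, human_demo, target_fpr=0.01))
```

With only 50 human texts, the empirical FPR moves in steps of 2%, so under this scheme a 1% target falls back to the largest human score as the threshold. More human samples would give a finer calibration.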

In [None]:
# Detect watermarks in all texts
wm_scores = {}  # Watermarked text scores
human_scores = {}  # Human text scores (for each detector)

for method_name, watermarker in ALL_WATERMARKERS.items():
    print(f"\nRunning {method_name} detection...")
    
    # Detect in watermarked texts
    wm_scores[method_name] = []
    for t in tqdm(wm_texts[method_name], desc="WM texts"):
        try:
            result = watermarker.detect(t)
            wm_scores[method_name].append(result.z_score)
        except Exception as e:
            print(f"Detection failed for {method_name}: {e}")
            wm_scores[method_name].append(0.0)  # Neutral fallback score
    
    # Detect in human texts (for FPR calibration)
    human_scores[method_name] = []
    for t in tqdm(human_texts[:NUM_SAMPLES], desc="Human texts"):
        try:
            result = watermarker.detect(t)
            human_scores[method_name].append(result.z_score)
        except Exception as e:
            print(f"Detection failed for {method_name}: {e}")
            human_scores[method_name].append(0.0)  # Neutral fallback score

In [None]:
# Compute detection metrics for all methods
all_metrics = {}

for method_name in ALL_WATERMARKERS.keys():
    all_metrics[method_name] = compute_detection_metrics(
        wm_scores[method_name], 
        human_scores[method_name]
    )

# Create results table
results_data = []
for method_name, metrics in all_metrics.items():
    results_data.append({
        "Method": method_name,
        "AUC": metrics["auc"],
        "TPR@FPR=1%": metrics["tpr_at_fpr_1"],
        "TPR@FPR=5%": metrics["tpr_at_fpr_5"],
        "Mean WM Z-score": metrics["mean_wm_score"],
        "Mean Human Z-score": metrics["mean_human_score"],
    })

results_df = pd.DataFrame(results_data)

print("\n" + "=" * 80)
print("DETECTION RESULTS (No Attack)")
print("=" * 80)
print(results_df.to_string(index=False))

In [None]:
# Plot Z-score distributions for all 6 methods
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

for idx, method_name in enumerate(ALL_WATERMARKERS.keys()):
    ax = axes[idx]
    ax.hist(human_scores[method_name], bins=20, alpha=0.6, label="Human", color="blue", density=True)
    ax.hist(wm_scores[method_name], bins=20, alpha=0.6, label="Watermarked", color="red", density=True)
    ax.axvline(x=4.0, color="black", linestyle="--", label="Threshold")
    ax.set_xlabel("Z-score")
    ax.set_ylabel("Density")
    ax.set_title(f"{method_name}")
    ax.legend(fontsize=8)

plt.tight_layout()
plt.savefig("z_score_distributions_all.png", dpi=150)
plt.show()

In [None]:
# Plot ROC curves for all methods
from sklearn.metrics import roc_curve, auc

fig, ax = plt.subplots(figsize=(10, 8))

colors = plt.cm.tab10(np.linspace(0, 1, len(ALL_WATERMARKERS)))

for idx, method_name in enumerate(ALL_WATERMARKERS.keys()):
    scores = np.concatenate([wm_scores[method_name], human_scores[method_name]])
    labels = np.concatenate([np.ones(len(wm_scores[method_name])), np.zeros(len(human_scores[method_name]))])
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    ax.plot(fpr, tpr, label=f"{method_name} (AUC = {roc_auc:.3f})", color=colors[idx], linewidth=2)

ax.plot([0, 1], [0, 1], "k--", label="Random", alpha=0.5)
ax.set_xlabel("False Positive Rate", fontsize=12)
ax.set_ylabel("True Positive Rate", fontsize=12)
ax.set_title("ROC Curves for All Watermark Methods", fontsize=14)
ax.legend(loc="lower right")
ax.grid(True, alpha=0.3)

plt.savefig("roc_curves_all.png", dpi=150)
plt.show()

## 5. Attack Robustness <a name="attacks"></a>

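Re-run each detector on the watermarked texts after synonym, swap, and typo attacks, reusing the human-text scores from Section 4 as the negative class.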
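
To make the `edit_rate` parameter concrete, here is a minimal sketch of a word-swap attack. The `SwapAttack` class in the `attacks` module is not shown in this notebook, so the function below is a hypothetical illustration, and its reading of `edit_rate` (the approximate fraction of words moved by adjacent swaps) is an assumption.

```python
import random

def swap_attack_sketch(text, edit_rate=0.2, seed=0):
    """Swap randomly chosen adjacent word pairs, touching roughly edit_rate of the words.

    Hypothetical helper; the notebook's SwapAttack may define edit_rate differently.
    """
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    # Each swap moves two words, so halve the budget
    n_swaps = int(edit_rate * len(words) / 2)
    for _ in range(n_swaps):
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

original = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
attacked = swap_attack_sketch(original, edit_rate=0.4)
print(attacked)
```

A swap attack keeps the vocabulary of the text intact and only perturbs word order, so it mainly stresses detectors whose score depends on the preceding context.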

In [None]:
# Initialize attacks
attacks = {
    "Synonym (30%)": SynonymAttack(edit_rate=0.3),
    "Swap (20%)": SwapAttack(edit_rate=0.2),
    "Typo (30%)": TypoAttack(edit_rate=0.3),
}

print(f"Configured attacks: {list(attacks.keys())}")

In [None]:
# Run attacks and measure detection for all methods
attack_results = {}

for attack_name, attack in attacks.items():
    print(f"\n{'='*60}")
    print(f"Applying {attack_name}...")
    print(f"{'='*60}")
    
    attack_results[attack_name] = {}
    
    for method_name, watermarker in ALL_WATERMARKERS.items():
        # Apply attack
        attacked_texts = [attack(t) for t in wm_texts[method_name]]
        
        # Detect
        attacked_scores = []
        for t in attacked_texts:
            try:
                result = watermarker.detect(t)
                attacked_scores.append(result.z_score)
            except Exception:  # A bare except would also swallow KeyboardInterrupt
                attacked_scores.append(0.0)
        
        # Compute metrics
        tpr_1, _ = tpr_at_fpr(attacked_scores, human_scores[method_name], target_fpr=0.01)
        tpr_5, _ = tpr_at_fpr(attacked_scores, human_scores[method_name], target_fpr=0.05)
        
        attack_results[attack_name][method_name] = {
            "TPR@FPR=1%": tpr_1,
            "TPR@FPR=5%": tpr_5,
            "Mean Z-score": np.mean(attacked_scores),
        }
        
        print(f"  {method_name}: TPR@1%={tpr_1:.3f}, TPR@5%={tpr_5:.3f}")

In [None]:
# Create attack robustness table
robustness_data = []

for attack_name, results in attack_results.items():
    for method_name, metrics in results.items():
        robustness_data.append({
            "Attack": attack_name,
            "Method": method_name,
            "TPR@FPR=1%": metrics["TPR@FPR=1%"],
            "TPR@FPR=5%": metrics["TPR@FPR=5%"],
        })

robustness_df = pd.DataFrame(robustness_data)

# Pivot for better visualization
pivot_df = robustness_df.pivot(index="Attack", columns="Method", values="TPR@FPR=1%")

# Reorder columns
column_order = ["Unigram", "KGW", "SEMSTAMP", "GPW", "GPW-SP", "GPW-SP+SR"]
pivot_df = pivot_df[[c for c in column_order if c in pivot_df.columns]]

print("\n" + "=" * 80)
print("ATTACK ROBUSTNESS (TPR@FPR=1%)")
print("=" * 80)
print(pivot_df.to_string())

In [None]:
# Plot attack robustness comparison
fig, ax = plt.subplots(figsize=(14, 6))

pivot_df.plot(kind="bar", ax=ax, width=0.8, edgecolor="black")
ax.set_xlabel("Attack", fontsize=12)
ax.set_ylabel("TPR @ FPR=1%", fontsize=12)
ax.set_title("Watermark Robustness to Attacks", fontsize=14)
ax.legend(title="Method", bbox_to_anchor=(1.02, 1), loc="upper left")
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
ax.set_ylim(0, 1)
ax.axhline(y=0.5, color="gray", linestyle="--", alpha=0.5)

plt.tight_layout()
plt.savefig("attack_robustness_all.png", dpi=150)
plt.show()

## 6. Quality Evaluation <a name="quality"></a>

Compare the perplexity (scored by GPT-2) and Distinct-4 diversity of the watermarked outputs against human text.
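
For reference, both quality metrics reduce to short formulas. The sketch below is a hedged illustration, not the code behind `compute_perplexity` or `compute_diversity`: the helper names are hypothetical, tokens are split on whitespace, and n-grams are pooled across all texts (the notebook's helpers may instead average per text).

```python
import math

def distinct_n_sketch(texts, n=4):
    """Fraction of unique n-grams among all n-grams, pooled over the texts."""
    total, unique = 0, set()
    for text in texts:
        words = text.split()
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

def perplexity_sketch(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

demo_div = distinct_n_sketch(["a b a b a b"], n=2)
demo_ppl = perplexity_sketch([math.log(0.5)] * 4)
print(demo_div, demo_ppl)
```

Lower perplexity means the scoring model finds the text more predictable, and a Distinct-4 value closer to 1 means fewer repeated 4-grams.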

In [None]:
# Compute perplexity for all methods
print("Computing perplexity (using GPT-2)...")

ppl_results = {}
for method_name in ALL_WATERMARKERS.keys():
    ppl_results[method_name] = compute_perplexity(wm_texts[method_name], model_name="gpt2", device=device)
    print(f"  {method_name}: {ppl_results[method_name]['mean']:.2f}")

ppl_human = compute_perplexity(human_texts[:NUM_SAMPLES], model_name="gpt2", device=device)
print(f"  Human: {ppl_human['mean']:.2f}")

In [None]:
# Compute diversity metrics
print("\nComputing diversity metrics...")

div_results = {}
for method_name in ALL_WATERMARKERS.keys():
    div_results[method_name] = compute_diversity(wm_texts[method_name])

div_human = compute_diversity(human_texts[:NUM_SAMPLES])

In [None]:
# Create quality comparison table
quality_data = [{"Method": "Human", "Perplexity": ppl_human["mean"], "Distinct-4": div_human["distinct_4"]}]

for method_name in ALL_WATERMARKERS.keys():
    quality_data.append({
        "Method": method_name,
        "Perplexity": ppl_results[method_name]["mean"],
        "Distinct-4": div_results[method_name]["distinct_4"],
    })

quality_df = pd.DataFrame(quality_data)

print("\n" + "=" * 80)
print("QUALITY RESULTS")
print("=" * 80)
print(quality_df.to_string(index=False))

## 7. Comparison & Analysis <a name="comparison"></a>

In [None]:
# Final summary comparison
print("\n" + "=" * 80)
print("FINAL COMPARISON SUMMARY")
print("=" * 80)

summary_data = []
for method_name in ALL_WATERMARKERS.keys():
    avg_attack_tpr = np.mean([attack_results[a][method_name]["TPR@FPR=1%"] for a in attack_results])
    
    summary_data.append({
        "Method": method_name,
        "Detection AUC": all_metrics[method_name]["auc"],
        "TPR@1% (Clean)": all_metrics[method_name]["tpr_at_fpr_1"],
        "Avg TPR@1% (Attacked)": avg_attack_tpr,
        "Perplexity": ppl_results[method_name]["mean"],
        "Type": "GPW" if "GPW" in method_name else "Prior Work",
    })

summary_df = pd.DataFrame(summary_data)
print(summary_df.to_string(index=False))

In [None]:
# Method characteristics and expected trade-offs (static notes, not computed from the results above)
print("\n" + "=" * 80)
print("KEY FINDINGS")
print("=" * 80)
print("""
PRIOR WORK:
- Unigram: Fixed green list, robust to random edits
- KGW: Context-dependent, better security but lower robustness
- SEMSTAMP: Sentence-level, robust to paraphrasing

OUR GPW METHODS:
- GPW: Basic cosine scoring with secret direction
- GPW-SP: Context-keyed phase for better security
- GPW-SP+SR: Hidden state coupling for semantic awareness

TRADE-OFFS:
- Higher alpha/omega = stronger watermark, potentially lower quality
- Salted phase improves security against key guessing
- SR coupling adds semantic context but increases computation
""")

In [None]:
# Save all results
summary_df.to_csv("watermark_comparison_results.csv", index=False)
robustness_df.to_csv("attack_robustness_results.csv", index=False)
quality_df.to_csv("quality_results.csv", index=False)

print("Results saved to CSV files!")