# Data Anonymization with LLMs - End-to-End Demo

This notebook demonstrates the modular framework for text anonymization.

**Available methods:**
- **EDA** (Easy Data Augmentation): Baseline techniques using WordNet
- **KNEO** (Knowledge-based Neighbor Operation): Embeddings (GloVe/fastText)
- **LLM**: Language models via Ollama (gemma2, llama3.2, mistral, etc.)

**To change configuration:**
Edit the `configs/config.yaml` file to change models, datasets, and parameters.
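
Before running the cells below, it can help to confirm that the configuration contains every entry this notebook reads directly. The following is a minimal sketch: the section and key names are collected from the cells in this notebook, while the helper `find_missing_keys` and the example dictionary are hypothetical and not part of the framework. The dataset loader may read further keys that are not listed here.

```python
# Keys this notebook reads from the loaded configuration.
# Section and key names come from the cells in this notebook; the helper
# itself is a hypothetical illustration, not part of the framework.
REQUIRED_KEYS = {
    "eda": ["alpha_sr", "alpha_ri", "alpha_rs", "alpha_rd"],
    "kneo": ["embedding_model", "k", "strategy"],
    "llm": ["model_name", "base_url", "temperature", "max_tokens"],
    "metrics": ["sbert_model", "spacy_model"],
    "output": ["save_anonymized", "save_metrics"],
}


def find_missing_keys(config):
    """Return 'section.key' strings for every required entry absent from config."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        section_values = config.get(section) or {}
        for key in keys:
            if key not in section_values:
                missing.append(f"{section}.{key}")
    return missing


# Example with a deliberately incomplete, made-up configuration.
example_config = {"eda": {"alpha_sr": 0.1}, "kneo": {"k": 5}}
print(find_missing_keys(example_config)[:3])
```

Calling `find_missing_keys(config)` right after `load_config` would surface a missing entry before any model is loaded, rather than as a `KeyError` halfway through the run.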

## 1. Setup

In [None]:
# Install dependencies (first time only)
# !pip install -r ../requirements.txt
# !python -m spacy download en_core_web_sm

In [None]:
import sys
import warnings
warnings.filterwarnings('ignore')

# Add src to path
sys.path.insert(0, '../src')

# Import framework modules
from eda_anonymizer import EDAAnonymizer
from kneo_anonymizer import KNEOAnonymizer
from llm_anonymizer import OllamaAnonymizer, PROMPT_TEMPLATES
from metrics import AnonymizationMetrics
from utils import (
    load_config, 
    load_all_datasets, 
    set_all_seeds,
    create_output_dir,
    save_anonymized_dataset,
    save_metrics,
    print_comparison_table
)

# Import visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Setup visualization
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
pd.set_option('display.max_colwidth', None)

print("Libraries imported successfully")


In [None]:
# Load configuration from YAML file
config = load_config('../configs/config.yaml')

# Display main configuration
print("\nCURRENT CONFIGURATION:")
print(f"   EDA alphas: SR={config['eda']['alpha_sr']}, RI={config['eda']['alpha_ri']}, RS={config['eda']['alpha_rs']}, RD={config['eda']['alpha_rd']}")
print(f"   KNEO: model={config['kneo']['embedding_model']}, k={config['kneo']['k']}")
print(f"   LLM: model={config['llm']['model_name']}, temp={config['llm']['temperature']}")
print(f"   Metrics: SBERT={config['metrics']['sbert_model']}")

### 1.2 Load Dataset

In [None]:
# Define sample size for quick testing (None = full dataset)
SAMPLE_SIZES = {
    'train': 100,       # Change to None to use all data
    'validation': 50,
    'test': None
}

# Load datasets
datasets = load_all_datasets(
    config, 
    base_dir='../data',
    sample_sizes=SAMPLE_SIZES
)

# Extract for convenience
if 'train' in datasets:
    train_texts, train_labels = datasets['train']
    print(f"\nTraining set: {len(train_texts)} samples")

if 'validation' in datasets:
    val_texts, val_labels = datasets['validation']
    print(f"Validation set: {len(val_texts)} samples")

In [None]:
# Display some examples
print("\nExamples from dataset:\n")
for text, label in zip(val_texts[:5], val_labels[:5]):
    # Truncate long texts to 100 characters for display
    preview = f"{text[:100]}..." if len(text) > 100 else text
    print(f"[{label}] {preview}")

---

## 2. Method 1: Easy Data Augmentation (EDA)

EDA applies four augmentation techniques:
- **SR** (Synonym Replacement): replaces words with WordNet synonyms
- **RI** (Random Insertion): inserts random synonyms
- **RS** (Random Swap): swaps the positions of words
- **RD** (Random Deletion): deletes random words
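
To make two of these operations concrete, here is a minimal sketch of random swap and random deletion using only the standard library. It is not the `EDAAnonymizer` implementation: the function names, the example sentence, the value of `alpha`, and the way `alpha` is turned into a number of swaps are all assumptions made for illustration. SR and RI are left out because they need a WordNet lookup.

```python
import random


def random_swap(words, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
tokens = "the patient was admitted to the hospital on monday".split()
alpha = 0.2  # hypothetical value: share of words affected
n = max(1, int(alpha * len(tokens)))  # assumed mapping from alpha to swap count

print(" ".join(random_swap(tokens, n, rng)))
print(" ".join(random_deletion(tokens, alpha, rng)))
```

Passing an explicit `random.Random` instance keeps the operations reproducible without touching global random state.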

In [None]:
# Initialize EDA Anonymizer
eda = EDAAnonymizer()

# Apply anonymization
print("Applying EDA...")
anonymized_eda = eda.anonymize_batch(
    val_texts,
    alpha_sr=config['eda']['alpha_sr'],
    alpha_ri=config['eda']['alpha_ri'],
    alpha_rs=config['eda']['alpha_rs'],
    alpha_rd=config['eda']['alpha_rd'],
    show_progress=True
)

print(f"\nEDA completed: {len(anonymized_eda)} sentences anonymized")

In [None]:
# Show examples
print("EDA anonymization examples:\n")
print("-" * 80)
for i in range(min(5, len(val_texts))):
    print(f"\nORIGINAL:  {val_texts[i]}")
    print(f"EDA:       {anonymized_eda[i]}")
    print("-" * 80)

---

## 3. Method 2: KNEO (Knowledge-based Neighbor Operation)

KNEO uses pre-trained embeddings to replace words with their semantic neighbors.

**Available models:**
- `glove-wiki-gigaword-50/100/200/300` - GloVe embeddings
- `fasttext-wiki-news-subwords-300` - fastText (better for noisy text)
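
The idea of neighbor-based replacement can be sketched with a toy vocabulary. Everything in the block below is an assumption for illustration: the three-dimensional vectors are made up, the function names are hypothetical, and picking a random word among the `k` nearest neighbors is only one possible strategy. The `KNEOAnonymizer` class may work differently.

```python
import math
import random

# Toy 3-dimensional vectors, made up for illustration only.
# Real runs would use pre-trained vectors such as the models listed above.
TOY_EMBEDDINGS = {
    "doctor": [0.9, 0.1, 0.0],
    "physician": [0.85, 0.15, 0.05],
    "nurse": [0.8, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
    "apple": [0.05, 0.25, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, k, embeddings):
    """Return the k vocabulary words most similar to `word`, excluding itself."""
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


def replace_with_neighbor(tokens, k, embeddings, rng):
    """Replace each in-vocabulary token with one of its k nearest neighbors."""
    out = []
    for tok in tokens:
        if tok in embeddings:
            out.append(rng.choice(nearest_neighbors(tok, k, embeddings)))
        else:
            out.append(tok)  # out-of-vocabulary tokens are left unchanged
    return out


print(nearest_neighbors("doctor", 2, TOY_EMBEDDINGS))
print(replace_with_neighbor("the doctor ate an apple".split(), 1,
                            TOY_EMBEDDINGS, random.Random(0)))
```

A larger `k` widens the pool of candidate replacements, which tends to change the text more at the cost of drifting further from the original meaning.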

In [None]:
# Initialize KNEO Anonymizer
kneo = KNEOAnonymizer(
    embedding_model=config['kneo']['embedding_model'],
    verbose=True
)

In [None]:
# Apply anonymization
print("\nApplying KNEO...")
anonymized_kneo = kneo.anonymize_batch(
    val_texts,
    k=config['kneo']['k'],
    strategy=config['kneo']['strategy'],
    show_progress=True
)

print(f"\nKNEO completed: {len(anonymized_kneo)} sentences anonymized")

# Cache statistics
cache_stats = kneo.get_cache_stats()
print(f"Cache: {cache_stats['cache_size']} unique words processed")

In [None]:
# Show examples
print("KNEO anonymization examples:\n")
print("-" * 80)
for i in range(min(5, len(val_texts))):
    print(f"\nORIGINAL:  {val_texts[i]}")
    print(f"KNEO:      {anonymized_kneo[i]}")
    print("-" * 80)

---

## 4. Method 3: LLM with Ollama

Use local LLMs via Ollama for more sophisticated anonymization.

### Ollama Setup (if not already installed)

```bash
# Run the setup script
bash scripts/setup_ollama.sh

# Or manually:
# macOS: brew install ollama
# Linux: curl -fsSL https://ollama.ai/install.sh | sh

# Start the server
ollama serve

# Download a model (in another terminal)
ollama pull gemma2:2b    # Small and fast
ollama pull llama3.2     # Meta's latest
ollama pull mistral      # Good balance
```
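
Once the server is running, a model can be queried over Ollama's HTTP API. The sketch below builds a non-streaming request for the `/api/generate` endpoint with the standard library only. The prompt text, the function names, and the example values (the default local address `http://localhost:11434`, temperature, token limit) are assumptions for illustration; the framework's `OllamaAnonymizer` and `PROMPT_TEMPLATES` may be organized differently.

```python
import json
import urllib.request

# Hypothetical prompt written for this sketch; the framework's own
# PROMPT_TEMPLATES may differ.
PROMPT = "Paraphrase the following sentence, replacing any personal details:\n{text}"


def build_generate_request(base_url, model, text, temperature, max_tokens):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": PROMPT.format(text=text),
        "stream": False,  # ask for a single JSON response instead of a stream
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def paraphrase(base_url, model, text, temperature=0.7, max_tokens=128, timeout=60):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(base_url, model, text, temperature, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()


# Building the request does not contact the server, so this runs offline.
req = build_generate_request("http://localhost:11434", "gemma2:2b",
                             "John lives in Rome.", 0.7, 128)
print(req.full_url)
```

Separating request construction from the network call makes the payload easy to inspect and test without a server; `paraphrase(...)` is the only function that needs Ollama to be reachable.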

In [None]:
# Try connecting to Ollama
try:
    llm = OllamaAnonymizer(
        model_name=config['llm']['model_name'],
        base_url=config['llm']['base_url'],
        temperature=config['llm']['temperature'],
        max_tokens=config['llm']['max_tokens'],
        prompt_style="paraphrase",
        verbose=True
    )
    
    OLLAMA_AVAILABLE = True
    print("\nOllama connected and ready!")
    
except Exception as e:
    OLLAMA_AVAILABLE = False
    print(f"\nOllama not available: {e}")
    print("   Skip this section if you don't have Ollama installed.")

In [None]:
# Apply LLM (only if available)
if OLLAMA_AVAILABLE:
    # Use a smaller subset for LLM (it's slower)
    llm_sample_size = min(20, len(val_texts))
    llm_texts = val_texts[:llm_sample_size]
    llm_labels = val_labels[:llm_sample_size]
    
    print(f"\nApplying LLM ({llm_sample_size} samples)...")
    
    anonymized_llm = llm.anonymize_batch(
        llm_texts,
        labels=llm_labels,
        show_progress=True
    )
    
    print(f"\nLLM completed: {len(anonymized_llm)} sentences anonymized")
else:
    anonymized_llm = None

In [None]:
# Show LLM examples
if anonymized_llm:
    print("LLM anonymization examples:\n")
    print("-" * 80)
    for i in range(min(5, len(anonymized_llm))):
        print(f"\nORIGINAL:  {llm_texts[i]}")
        print(f"LLM:       {anonymized_llm[i]}")
        print("-" * 80)

---

## 5. Evaluation with Metrics

Evaluate results with 4 metrics:

| Metric | Category | Goal |
|--------|----------|------|
| Levenshtein Ratio | Irreversibility | ↓ Lower = better |
| Jaccard Similarity | Irreversibility | ↓ Lower = better |
| Cosine Similarity | Semantic Utility | ↑ Higher = better |
| NER Score | Anonymization | ↑ Higher = better |
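
For intuition, three of these metrics can be sketched in plain Python. The definitions below are common textbook choices and are assumptions here: the ratio is taken as one minus the edit distance divided by the longer string's length, Jaccard is computed on lowercase whitespace tokens, and cosine similarity takes two already-computed embedding vectors. `AnonymizationMetrics` may use different normalizations, and the NER score is omitted because its definition is not given in this notebook.

```python
import math


def levenshtein_distance(a, b):
    """Character-level edit distance via the classic two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_ratio(a, b):
    """Similarity in [0, 1]: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a, b):
    """Overlap of lowercase whitespace-token sets: |A & B| / |A | B|."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return 1.0 if not union else len(set_a & set_b) / len(union)


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


original = "the cat sat"
anonymized = "the dog sat"
print(levenshtein_ratio(original, anonymized))
print(jaccard_similarity(original, anonymized))
```

The first two scores compare surface forms, so lower values indicate text that is harder to map back to the original. Cosine similarity compares meaning, so higher values indicate that more of the content survived.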

In [None]:
# Initialize metrics module
metrics = AnonymizationMetrics(
    sbert_model=config['metrics']['sbert_model'],
    spacy_model=config['metrics']['spacy_model'],
    verbose=True
)

In [None]:
# Evaluate EDA - calculate mean scores
print("\nEvaluating EDA...")
eda_results = metrics.evaluate_all(val_texts, anonymized_eda, show_progress=True)

# Get individual scores for distribution visualization
eda_scores = {
    'levenshtein': metrics.calculate_levenshtein_ratio_list(val_texts, anonymized_eda),
    'jaccard': metrics.calculate_jaccard_similarity_list(val_texts, anonymized_eda),
    'cosine': metrics.calculate_cosine_similarity_list(val_texts, anonymized_eda, show_progress=False),
    'ner': metrics.calculate_ner_score_list(val_texts, anonymized_eda)
}

In [None]:
# Evaluate KNEO - calculate mean scores
print("\nEvaluating KNEO...")
kneo_results = metrics.evaluate_all(val_texts, anonymized_kneo, show_progress=True)

# Get individual scores for distribution visualization
kneo_scores = {
    'levenshtein': metrics.calculate_levenshtein_ratio_list(val_texts, anonymized_kneo),
    'jaccard': metrics.calculate_jaccard_similarity_list(val_texts, anonymized_kneo),
    'cosine': metrics.calculate_cosine_similarity_list(val_texts, anonymized_kneo, show_progress=False),
    'ner': metrics.calculate_ner_score_list(val_texts, anonymized_kneo)
}

In [None]:
# Evaluate LLM (if available)
if anonymized_llm:
    print("\nEvaluating LLM...")
    llm_results = metrics.evaluate_all(llm_texts, anonymized_llm, show_progress=True)
    
    # Get individual scores for distribution visualization
    llm_scores = {
        'levenshtein': metrics.calculate_levenshtein_ratio_list(llm_texts, anonymized_llm),
        'jaccard': metrics.calculate_jaccard_similarity_list(llm_texts, anonymized_llm),
        'cosine': metrics.calculate_cosine_similarity_list(llm_texts, anonymized_llm, show_progress=False),
        'ner': metrics.calculate_ner_score_list(llm_texts, anonymized_llm)
    }
else:
    llm_results = None
    llm_scores = None

### 5.1 Compare Results

In [None]:
# Collect all results
all_results = {
    'EDA': eda_results,
    'KNEO': kneo_results
}

if llm_results:
    all_results['LLM'] = llm_results

# Print comparison table
print_comparison_table(all_results)

In [None]:
# Create DataFrame for analysis
df_results = pd.DataFrame(all_results).T
df_results.index.name = 'Method'
df_results = df_results.round(4)
display(df_results)

### 5.2 Visualization

In [None]:
# Bar chart for comparison
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics_names = ['levenshtein_ratio', 'jaccard_similarity', 'cosine_similarity', 'ner_score']
metric_labels = ['Levenshtein Ratio ↓', 'Jaccard Similarity ↓', 'Cosine Similarity ↑', 'NER Score ↑']
colors = ['#e74c3c', '#e74c3c', '#27ae60', '#27ae60']

methods = list(all_results.keys())

for idx, (metric, label, color) in enumerate(zip(metrics_names, metric_labels, colors)):
    ax = axes[idx // 2, idx % 2]
    values = [all_results[m][metric] for m in methods]
    
    bars = ax.bar(methods, values, color=color, alpha=0.8, edgecolor='black')
    ax.set_title(label, fontsize=12, fontweight='bold')
    ax.set_ylim(0, 1)
    
    for bar, val in zip(bars, values):
        ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
                f'{val:.3f}', ha='center', va='bottom', fontsize=10)

plt.suptitle('Anonymization Metrics Comparison', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

### 5.3 Distribution Visualizations

Visualize the distribution of scores for each metric across different methods using ridge plots.

In [None]:
# Prepare data for distribution plots
# Combine all methods' scores into dictionaries for each metric

# Levenshtein Ratio Distributions
lev_distributions = {
    'EDA': eda_scores['levenshtein'],
    'KNEO': kneo_scores['levenshtein']
}

# Jaccard Similarity Distributions
jac_distributions = {
    'EDA': eda_scores['jaccard'],
    'KNEO': kneo_scores['jaccard']
}

# Cosine Similarity Distributions
cos_distributions = {
    'EDA': eda_scores['cosine'],
    'KNEO': kneo_scores['cosine']
}

# NER Score Distributions
ner_distributions = {
    'EDA': eda_scores['ner'],
    'KNEO': kneo_scores['ner']
}

# Add LLM if available
if llm_scores:
    lev_distributions['LLM'] = llm_scores['levenshtein']
    jac_distributions['LLM'] = llm_scores['jaccard']
    cos_distributions['LLM'] = llm_scores['cosine']
    ner_distributions['LLM'] = llm_scores['ner']

print("Distribution data prepared for plotting")

In [None]:
# Plot Levenshtein Ratio Distribution (Ridge Plot)
print("\nLevenshtein Ratio Distribution (↓ lower is better)")
metrics.plot_metric_distributions(
    lev_distributions,
    metric_name="Levenshtein Ratio",
    plot_type="ridge",
    color_palette="Reds_r"
)

In [None]:
# Plot Jaccard Similarity Distribution (Ridge Plot)
print("\nJaccard Similarity Distribution (↓ lower is better)")
metrics.plot_metric_distributions(
    jac_distributions,
    metric_name="Jaccard Similarity",
    plot_type="ridge",
    color_palette="Oranges_r"
)

In [None]:
# Plot Cosine Similarity Distribution (Ridge Plot)
print("\nCosine Similarity Distribution (↑ higher is better)")
metrics.plot_metric_distributions(
    cos_distributions,
    metric_name="Cosine Similarity (Semantic Utility)",
    plot_type="ridge",
    color_palette="Greens"
)

In [None]:
# Plot NER Score Distribution (Ridge Plot)
print("\nNER Score Distribution (↑ higher is better)")
metrics.plot_metric_distributions(
    ner_distributions,
    metric_name="NER Score (Anonymization Quality)",
    plot_type="ridge",
    color_palette="Blues"
)

#### Alternative: Overlay Histograms

You can also use overlapping histograms instead of ridge plots by changing `plot_type="overlay"`.

In [None]:
# Example: Overlay plot for Cosine Similarity (uncomment to use)
# metrics.plot_metric_distributions(
#     cos_distributions,
#     metric_name="Cosine Similarity",
#     plot_type="overlay",
#     color_palette="Set2"
# )

### 5.4 Paraphrase Retrieval
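
Paraphrase retrieval can be read as a re-identification attack: each anonymized sentence is used as a query against the pool of originals, and the attack succeeds at rank k if the true original appears among the k most similar candidates. The sketch below shows one way to compute such an Accuracy@k as a percentage. It is an illustration under assumptions, not the `evaluate_paraphrase_retrieval` implementation: token overlap stands in for whatever similarity the framework uses, and the function names and example sentences are made up.

```python
def token_overlap(a, b):
    """Jaccard overlap of token sets; a stand-in for embedding similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def retrieval_accuracy(originals, anonymized, k_values, similarity=token_overlap):
    """Percentage of anonymized sentences whose true original ranks in the top k."""
    hits = {k: 0 for k in k_values}
    for idx, query in enumerate(anonymized):
        # Rank every original by similarity to the anonymized query
        ranked = sorted(range(len(originals)),
                        key=lambda j: similarity(query, originals[j]),
                        reverse=True)
        rank = ranked.index(idx)  # 0-based position of the true original
        for k in k_values:
            if rank < k:
                hits[k] += 1
    total = len(anonymized)
    return {f"Accuracy@{k}": (100.0 * hits[k] / total if total else 0.0)
            for k in k_values}


# Made-up sentences; anonymized[i] corresponds to originals[i].
originals = [
    "alice visited the clinic in rome",
    "the invoice was paid on friday",
    "bob teaches math at the school",
]
anonymized = [
    "someone visited the clinic in a city",
    "a bill was settled at the school",
    "he teaches math at the school",
]
print(retrieval_accuracy(originals, anonymized, k_values=[1, 2]))
```

In this toy example the second sentence is matched to the wrong original at rank 1, so it only counts as a hit from k = 2 onward. Lower accuracy means the attacker recovers fewer originals.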


In [None]:
# Evaluate Paraphrase Retrieval for EDA
print("\nEvaluating Paraphrase Retrieval for EDA...")
eda_retrieval = metrics.evaluate_paraphrase_retrieval(
    original_sentences=val_texts,
    paraphrased_sentences=anonymized_eda,
    k_values=[1, 5, 10],
    show_progress=True
)

In [None]:
# Evaluate Paraphrase Retrieval for KNEO
print("\nEvaluating Paraphrase Retrieval for KNEO...")
kneo_retrieval = metrics.evaluate_paraphrase_retrieval(
    original_sentences=val_texts,
    paraphrased_sentences=anonymized_kneo,
    k_values=[1, 5, 10],
    show_progress=True
)

In [None]:
# Evaluate Paraphrase Retrieval for LLM (if available)
if anonymized_llm:
    print("\nEvaluating Paraphrase Retrieval for LLM...")
    llm_retrieval = metrics.evaluate_paraphrase_retrieval(
        original_sentences=llm_texts,
        paraphrased_sentences=anonymized_llm,
        k_values=[1, 5, 10],
        show_progress=True
    )
else:
    llm_retrieval = None

In [None]:
# Compare Paraphrase Retrieval Results
retrieval_results = {
    'EDA': eda_retrieval,
    'KNEO': kneo_retrieval
}

if llm_retrieval:
    retrieval_results['LLM'] = llm_retrieval

# Create DataFrame for comparison
df_retrieval = pd.DataFrame(retrieval_results).T
df_retrieval.index.name = 'Method'
df_retrieval = df_retrieval.round(2)

print("\n" + "="*60)
print("PARAPHRASE RETRIEVAL COMPARISON")
print("="*60)
print("Lower accuracy = Better privacy protection")
print("="*60)
display(df_retrieval)

In [None]:
# Visualize Paraphrase Retrieval Results
fig, ax = plt.subplots(figsize=(10, 6))

methods = list(retrieval_results.keys())
k_values = [1, 5, 10]
x = np.arange(len(methods))
width = 0.25

for i, k in enumerate(k_values):
    values = [retrieval_results[m][f'Accuracy@{k}'] for m in methods]
    bars = ax.bar(x + i*width, values, width, label=f'Accuracy@{k}', alpha=0.8)
    
    # Add value labels on bars
    for bar, val in zip(bars, values):
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2, height + 1,
                f'{val:.1f}%', ha='center', va='bottom', fontsize=9)

ax.set_xlabel('Method', fontsize=12)
ax.set_ylabel('Retrieval Accuracy (%)', fontsize=12)
ax.set_title('Paraphrase Retrieval Attack Results\n(Lower is Better for Privacy)', 
             fontsize=14, fontweight='bold')
ax.set_xticks(x + width)
ax.set_xticklabels(methods)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("Note: Lower retrieval accuracy means better privacy protection!")

---

## 6. Save Results

In [None]:
# Create output directory
output_dir = create_output_dir(config)
print(f"Output directory: {output_dir}")

In [None]:
# Save anonymized datasets
if config['output']['save_anonymized']:
    save_anonymized_dataset(anonymized_eda, val_labels, output_dir, 'eda')
    save_anonymized_dataset(anonymized_kneo, val_labels, output_dir, 'kneo')
    
    if anonymized_llm:
        save_anonymized_dataset(anonymized_llm, llm_labels, output_dir, 'llm')

In [None]:
# Save metrics
if config['output']['save_metrics']:
    save_metrics(all_results, output_dir, 'comparison')

---

## 7. Conclusions

### Summary of methods:

| Method | Pros | Cons | When to use |
|--------|-----|--------|---------------|
| **EDA** | Fast, offline | Less sophisticated | Baseline, large datasets |
| **KNEO** | Good semantics | Requires embeddings | Noisy datasets |
| **LLM** | High quality | Slow, benefits from a GPU | Small datasets, max quality |

### Visualization features:

Beyond mean scores, this notebook includes:
- **Ridge plots**: stacked density plots showing the score distribution of each metric for every method
- **Paraphrase retrieval**: a simulated privacy attack that measures re-identification risk; lower retrieval accuracy means better privacy protection

### How to replicate experiments:

1. Edit `configs/config.yaml` to change parameters
2. Add your datasets in `data/`
3. Run this notebook from the beginning
4. Compare methods using both mean scores and distribution visualizations

In [None]:
print("\n" + "="*60)
print("DEMO COMPLETED!")
print("="*60)
print("\nNext steps:")
print("1. Edit configs/config.yaml for your experiments")
print("2. Add your datasets in data/")
print("3. Use scripts/run_baseline.py for full runs")
print("4. Use scripts/run_llm.py for LLM on large datasets")