# Text Anonymization Model Training Pipeline

This notebook demonstrates how to fine-tune a FLAN-T5 model for text anonymization using the Hugging Face ecosystem. The pipeline includes:

1. Loading and preprocessing synthetic data
2. Setting up the model and tokenizer
3. Training configuration and process
4. Model evaluation and metrics
5. Inference examples

## Setup and Dependencies

In [1]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    TrainingArguments,
    Trainer
)
import torch
import gc
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from tqdm import tqdm
import logging
import re
from collections import Counter

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
logger = logging.getLogger(__name__)

# # Set random seed for reproducibility
# torch.manual_seed(42)
# if torch.cuda.is_available():
#     torch.cuda.manual_seed_all(42)

## 1. Data Loading

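We'll use our synthetic dataset from the Hugging Face Hub. Each example pairs an original text (`context`) with its anonymized version (`anonymized_context`) and the labels used (`used_labels`), which gives us the input/target pairs a text-to-text model needs.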

In [None]:
# Load dataset
dataset = load_dataset("kurkowski/synthetic-contextual-anonymizer-dataset")

print("Dataset Statistics:")
print(f"Training examples: {len(dataset['train'])}")
print(f"Validation examples: {len(dataset['validation'])}")
print(f"Test examples: {len(dataset['test'])}")

# Display a sample
print("\nSample from training set:")
print("Original:", dataset['train'][0]['context'])
print("Anonymized:", dataset['train'][0]['anonymized_context'])
print("Used labels:", dataset['train'][0]['used_labels'], type(dataset['train'][0]['used_labels']))

## 2. Model Setup

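We'll use FLAN-T5-small as our base model. This is a good choice because:
- It's small (roughly 80M parameters) and fast to fine-tune
- It's an encoder-decoder model, so it maps input text directly to output text
- It has been instruction-tuned on a wide variety of tasks, so it responds to natural-language prompts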

In [3]:
# Initialize tokenizer and model
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

## 3. Data Preprocessing

We need to convert our text examples into a format suitable for the model. This includes:
- Adding a task-specific prompt
- Tokenizing inputs and targets
- Creating attention masks
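
To make the last two steps concrete, here is a minimal pure-Python sketch of what padding a batch produces. It assumes T5's pad token id of 0 and the -100 label padding value that `DataCollatorForSeq2Seq` uses by default, so that padded target positions are ignored by the loss. The `pad_batch` helper and the token ids are hypothetical and only serve as an illustration.

```python
PAD_TOKEN_ID = 0      # T5's pad token id
LABEL_PAD_ID = -100   # positions with this value are ignored by the loss


def pad_batch(input_ids, labels):
    """Pad variable-length token id lists to the longest sequence in the batch."""
    max_in = max(len(seq) for seq in input_ids)
    max_lab = max(len(seq) for seq in labels)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids, lab in zip(input_ids, labels):
        n_pad = max_in - len(ids)
        batch["input_ids"].append(ids + [PAD_TOKEN_ID] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        batch["labels"].append(lab + [LABEL_PAD_ID] * (max_lab - len(lab)))
    return batch


# Made-up token ids for two examples of different lengths
example = pad_batch(
    input_ids=[[37, 1712, 19, 1], [8774, 1]],
    labels=[[784, 1], [784, 567, 1]],
)
print(example)
```

Padding per batch rather than over the whole dataset keeps short examples from being stretched to the length of the longest one.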

In [None]:
def normalize_labels(text):
    """
    Removes numbering from labels (e.g. [NAME_1] -> [NAME]).
    Also handles labels with underscores in their names (e.g. POLICY_NUMBER).
    """
    if isinstance(text, list):
        return [normalize_labels(t) for t in text]
    if not isinstance(text, str):
        return text
    return re.sub(r'\[([A-Z_]+)_\d+\]', r'[\1]', text)

def create_anonymization_prompt(labels):
    """Create a prompt for text anonymization task.
    
    Args:
        labels: List of labels to use for anonymization
        
    Returns:
        String containing the formatted prompt
    """
    return f"""You are a text anonymization expert. Your task is to replace sensitive information with the following labels: { normalize_labels(labels)}.

    Instructions:
    1. Replace each piece of sensitive information with the appropriate label from the provided list
    2. For multiple occurrences of the same type, use numbered labels (e.g. [NAME_1], [NAME_2])
    3. Preserve the original text structure and meaning
    4. Follow the examples precisely

    Example:
    Input: "John Smith called Mary Johnson. John's number is 555-0123 and Mary's is 555-4567."
    Output: "[NAME_1] called [NAME_2]. [NAME_1]'s number is [PHONE_1] and [NAME_2]'s is [PHONE_2]."

    Task:
    Anonymize the following text using only these labels: { normalize_labels(labels)}
    Input: 
    """

def convert_examples_to_features(example_batch):
    """Convert text examples to model features.
    
    Args:
        example_batch: Batch of examples from dataset
        
    Returns:
        Dictionary with input_ids, attention_mask, and labels
    """
    input_texts = []
    for text, labels in zip(example_batch["context"], example_batch["used_labels"]):
        prompt = create_anonymization_prompt(labels)
        input_texts.append(prompt + text)
    
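    # Log one sample instead of printing every batch
    logger.debug("Sample model input:\n%s", input_texts[0])

    # Leave padding to DataCollatorForSeq2Seq: it pads each batch dynamically and
    # fills label padding with -100, so padded positions are ignored by the loss.
    input_encodings = tokenizer(input_texts, truncation=True)
    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager
    target_encodings = tokenizer(text_target=example_batch["anonymized_context"], truncation=True)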
    
    return {
        "input_ids": input_encodings["input_ids"],
        "attention_mask": input_encodings["attention_mask"], 
        "labels": target_encodings["input_ids"]
    }

test_prompt = create_anonymization_prompt(dataset['train'][0]['used_labels'])

# Process all splits
processed_dataset = dataset.map(
    convert_examples_to_features,
    batched=True,
    desc="Processing dataset"
)

In [None]:
def test_anonymization(text_to_anonymize, labels):
    prompt = create_anonymization_prompt(labels)
    print(prompt + text_to_anonymize)
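    inputs = tokenizer(
        prompt + text_to_anonymize,
        return_tensors="pt",  # Returning PyTorch tensors is fine here
        truncation=True
    ).to(model.device)  # keep inputs on the same device as the model

    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=512,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        temperature=0.1
    )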
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


print(test_anonymization(processed_dataset['train'][0]['context'], processed_dataset['train'][0]['used_labels']))

## 4. Training Configuration

We'll configure the training process with settings suited to a small model and limited memory. Key considerations include:
- Memory efficiency (batch size and gradient accumulation)
- Learning rate and warmup
- Evaluation strategy
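
As a rough sanity check on these settings, the sketch below estimates the effective batch size, the number of optimizer steps, and the learning rate under the linear warmup/decay schedule that `Trainer` uses by default. The default argument values mirror the training cell in this section; the dataset size of 8000 is a hypothetical placeholder, and `Trainer`'s exact step count may differ slightly depending on how it handles the last partial batch.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size=4,
                    grad_accum_steps=2, epochs=3, num_devices=1):
    """Estimate the effective batch size and the total number of optimizer steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


def linear_warmup_lr(step, total_steps, base_lr=5e-5, warmup_steps=500):
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical dataset size, for illustration only
effective_batch, total_steps = optimizer_steps(num_examples=8000)
print(effective_batch, total_steps)
print(linear_warmup_lr(250, total_steps))
```

If the estimated total number of steps is close to or below `warmup_steps`, the learning rate never reaches its peak, which is worth checking before a long run.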

In [None]:

# Pads inputs and labels dynamically for each batch
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer_args = TrainingArguments(
    output_dir="anonymizer_model",
    num_train_epochs=3,
    warmup_steps=500,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    logging_steps=2,
    eval_strategy="steps",
    eval_steps=250,
    save_steps=250,                     # kept in sync with eval_steps for load_best_model_at_end
    gradient_accumulation_steps=2,      # effective batch size: 4 x 2 = 8 per device
    learning_rate=5e-5,
    save_total_limit=2,
    load_best_model_at_end=True,
    logging_first_step=True,
    logging_dir="./logs",
    optim="adamw_torch_fused",
    gradient_checkpointing=True,        # trades extra compute for lower memory use
    torch_compile=False,
    dataloader_pin_memory=False,
    torch_empty_cache_steps=2
)

# Free cached memory before training
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
if torch.backends.mps.is_available():
    torch.mps.empty_cache()

trainer = Trainer(
    model=model,
    args=trainer_args,
    tokenizer=tokenizer,
    data_collator=seq2seq_data_collator,
    train_dataset=processed_dataset["train"],
    eval_dataset=processed_dataset["validation"]
)


train_result = trainer.train()

## 5. Model Evaluation

We'll evaluate our model using:
- Loss on test set
- Custom metrics (Precision, Recall, F1) for entity anonymization
- Visualization of training progress
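
One possible way to define the custom metrics is sketched below. It treats each text as a bag of placeholder types (numbering ignored) and compares the model output against the reference anonymization. This is only an assumption about how the metrics could be computed, not the notebook's final method: the `extract_labels` and `label_prf` helpers are hypothetical, and the approach does not check whether a placeholder replaced the right span.

```python
import re
from collections import Counter

# Matches placeholders such as [NAME], [NAME_1] or [POLICY_NUMBER_2]
LABEL_PATTERN = re.compile(r"\[([A-Z_]+?)(?:_\d+)?\]")


def extract_labels(text):
    """Count placeholder types in a text, ignoring numbering ([NAME_1] -> NAME)."""
    return Counter(LABEL_PATTERN.findall(text))


def label_prf(prediction, reference):
    """Bag-of-labels precision, recall and F1 of a prediction against a reference."""
    pred, ref = extract_labels(prediction), extract_labels(reference)
    true_pos = sum((pred & ref).values())  # per-label minimum of the two counts
    precision = true_pos / sum(pred.values()) if pred else 0.0
    recall = true_pos / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example: the prediction misses one occurrence of a name
reference = "[NAME_1] called [NAME_2]. [NAME_1]'s number is [PHONE_1]."
prediction = "[NAME_1] called [NAME_2]. John's number is [PHONE_1]."
scores = label_prf(prediction, reference)
print(scores)
```

Averaging these scores over the test split would give a single number per metric; a stricter variant could also compare the non-placeholder text to catch leaked entities.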

In [None]:
# TO DO