# FLAN-T5-Small Fine-tuning for PII Anonymization (Italian)

This notebook demonstrates how to fine-tune the FLAN-T5-small model for anonymizing Personally Identifiable Information (PII) in Italian text.

## Overview
- Model: FLAN-T5-small (80M parameters)
- Task: PII Anonymization
- Language: Italian
- Approach: Text-to-text format (input text with PII → anonymized output with placeholders)


## 1. Import Libraries


In [None]:
try:
    import numpy as np
    import random
    import dotenv
    import json

    from transformers import (
        AutoTokenizer, # tokenizer model 
        AutoModelForSeq2SeqLM, # main seq2seq model
        Seq2SeqTrainingArguments,
        Seq2SeqTrainer,
        DataCollatorForSeq2Seq # dataset collator
    )

    from libs.utility import detect_accelerator, downloadFromHuggingFace
    from libs.parameters import Properties
    from libs.dataset import CustomPIIDataset, DataPreprocessor
    from datasets import load_dataset, Dataset, DatasetDict

    # metrics
    import wandb
except ImportError as e:
    print(f"Import error: {e}")
    raise  # stop here: later cells cannot run without these imports

## 2. Setup Environment

Declare the variables and global identifiers used throughout this notebook.

In [None]:
# load dotenv
config_env: dict = dotenv.dotenv_values("./localenv")

P_FILE: str = config_env.get("PARAMETER_FILE", "parameters.yaml")
M_REPO: str = config_env.get("MODEL_REPO_ID", "google/flan-t5-small")
DATASET_PATH: str = config_env.get("DATASET_FILE", "../dataset")
DATASET_SPLIT: float = 0.8 # 80% train, 20% validation
OUTPUT_DIR: str = config_env.get("OUTPUT_DIR", "flan-finetuned-ita")

In [None]:
# Check for the presence of an accelerator
device, dtype = detect_accelerator()
print(f"Using device: {device}/{dtype} for training.")

# load parameters
params: Properties = Properties(P_FILE)
print(f"Loaded HF: Cache Dir: {params.config_parameters.huggingface.cache_dir}\nDownloading to {params.config_parameters.huggingface.local_dir}")

In [None]:
# login to Weights and Biases to save metrics
try:
    wandb.login(key=params.config_parameters.wandb.apikey)
except Exception as e:
    print(f"Wandb login failed: {e}")

In [None]:
# download model from HF repository
try:
    model_name: str = downloadFromHuggingFace(M_REPO,
                                            cache_dir=params.config_parameters.huggingface.cache_dir,
                                            local_dir=params.config_parameters.huggingface.local_dir,
                                            apitoken=params.config_parameters.huggingface.apitoken)
except Exception as e:
    print(f"Caught exception: {e}")


## 3. Create machinery to manage a synthetic Italian PII Dataset

Since there's no standard Italian PII anonymization dataset, we'll create synthetic training data with various PII types.
Example source and target expressions are loaded from a `json` file.

The format of a single datapoint is:

`{
    "source": "Example data with PII",
    "target": "Example data with PII Masked Out"
}`

A custom dataset class, `CustomPIIDataset`, manages this dataset during training.
The dataset is loaded from disk and shuffled into two randomized splits:
- `train_examples`: 80% of the data, used to train the model.
- `val_examples`: 20% of the data, used to validate the model.

Each variable holds a list of dictionaries with the following keys:
- "source": the original text containing PII.
- "target": the anonymized text, with placeholders in place of the actual PII.
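
The shuffle-and-split step described above can be sketched in plain Python. This version uses a seeded random generator so the same split is produced on every run. The `split_examples` helper, the seed value, and the sample datapoints are illustrative assumptions, not part of the notebook's `libs` package.

```python
import random


def split_examples(examples, train_fraction=0.8, seed=42):
    """Shuffle a copy of `examples` and split it into (train, validation)."""
    shuffled = list(examples)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded RNG gives a repeatable order
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical datapoints in the source/target format described above
examples = [
    {"source": f"Testo di esempio {i}", "target": f"Testo di esempio {i}"}
    for i in range(10)
]
train_examples, val_examples = split_examples(examples)
print(len(train_examples), len(val_examples))
```

Fixing the seed keeps the validation set stable across runs, which makes evaluation losses from different runs comparable.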

In [None]:
# load dataset using custom class
it_pii_dataset: CustomPIIDataset = CustomPIIDataset(DATASET_PATH)  # path configured in section 2
print(f"Dataset Loaded! -> Processed {len(it_pii_dataset)} datapoints")

# prepare randomized splits
random.shuffle(it_pii_dataset.dataset)
train_val_split = int(len(it_pii_dataset) * DATASET_SPLIT)
train_examples: list = it_pii_dataset[:train_val_split]
val_examples: list = it_pii_dataset[train_val_split:]

Prepare the dataset for training.
The examples are wrapped in a `DatasetDict` with two entries:

- `train`: Dataset table containing all training data in order
- `validation`: Dataset table containing all validation data in order

In [None]:
# Create datasets
train_data = {
    "original": [ex.get("source") for ex in train_examples],
    "anonymized": [ex.get("target") for ex in train_examples]
}

val_data = {
    "original": [ex.get("source") for ex in val_examples],
    "anonymized": [ex.get("target") for ex in val_examples]
}

# the complete rebuilt dataset. this is used for training
dataset = DatasetDict({
    "train": Dataset.from_dict(train_data),
    "validation": Dataset.from_dict(val_data)
})

# Display dataset information
print(f"Training examples: {len(dataset['train'])}")
print(f"Validation examples: {len(dataset['validation'])}")

print("\nExample from training set:")
print(f"Original: {dataset['train'][0]['original']}")
print(f"Anonymized: {dataset['train'][0]['anonymized']}")

print("\nExample from validation set:")
print(f"Original: {dataset['validation'][0]['original']}")
print(f"Anonymized: {dataset['validation'][0]['anonymized']}")


### PII Categories that the model will learn

After training, the model is expected to replace the following PII types with placeholders:

- **[NOME]**: Names of people
- **[INDIRIZZO]**: Street addresses
- **[TELEFONO]**: Phone numbers
- **[EMAIL]**: Email addresses
- **[CARTA_CREDITO]**: Credit card numbers
- **[CODICE_FISCALE]**: Italian fiscal codes (tax IDs)
- **[DATA_NASCITA]**: Dates of birth
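
To make the placeholder convention concrete, the sketch below builds one source/target pair from a template. The template, the sample values, and the `make_pair` helper are invented for illustration; the actual dataset is loaded from JSON files whose contents are not shown here.

```python
def make_pair(template, values):
    """Fill a template twice: once with sample values, once with placeholders."""
    source = template.format(**values)
    # Each field name doubles as its placeholder label, e.g. NOME -> [NOME]
    target = template.format(**{name: f"[{name}]" for name in values})
    return {"source": source, "target": target}


# Made-up values; none of them refers to a real person
pair = make_pair(
    "Mi chiamo {NOME}, la mia email è {EMAIL} e il mio numero è {TELEFONO}.",
    {
        "NOME": "Mario Rossi",
        "EMAIL": "mario.rossi@example.com",
        "TELEFONO": "+39 333 1234567",
    },
)
print(pair["source"])
print(pair["target"])
```

Generating both strings from the same template guarantees that source and target differ only where PII was replaced.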


In [None]:
# Show some examples of the task
print("PII Anonymization Examples:\n")
for i in range(min(5, len(dataset['train']))):
    example = dataset['train'][i]
    print(f"Example {i+1}:")
    print(f"  Input:  {example['original']}")
    print(f"  Output: {example['anonymized']}")
    print()


## 4. Load Model and Tokenizer

Load the model and tokenizer from the local directory populated in step 2 of this notebook.
If you have run the notebook before, the model should already be present on disk.

- `model`: FLAN-T5-small from Hugging Face

After loading, the model is moved to the detected `device` (a GPU when one is available).

In [None]:
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

print(f"Model loaded: {model_name} on {device}")
print(f"Model parameters: {model.num_parameters()}")

## 5. Preprocess Dataset

Prepare the data for the training and validation steps.

- `data_preprocess`: Tokenizes the source and target texts with the model's tokenizer

Preprocessing is applied to the whole dataset.
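
The implementation of `DataPreprocessor.data_preprocess` lives in `libs.dataset` and is not shown in this notebook. As a rough sketch only, a batched preprocessing function for this task might look like the following, assuming the `original` and `anonymized` columns built in section 3. The name `preprocess_batch`, the task prefix, and the length limits are assumptions rather than the notebook's actual values.

```python
def preprocess_batch(batch, tokenizer, prefix="Anonimizza: ",
                     max_input_length=256, max_target_length=256):
    """Tokenize a batch of source texts and their anonymized targets."""
    # Assumed task prefix; the real preprocessor may use a different one or none
    inputs = [prefix + text for text in batch["original"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # text_target tells the tokenizer to encode the strings as decoder targets
    targets = tokenizer(
        text_target=batch["anonymized"],
        max_length=max_target_length,
        truncation=True,
    )
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs


# Usage with a Hugging Face tokenizer and the DatasetDict from section 3:
# tokenized = dataset.map(
#     lambda batch: preprocess_batch(batch, tokenizer),
#     batched=True,
#     remove_columns=dataset["train"].column_names,
# )
```

No padding is applied here; padding is left to the data collator so each batch is padded only as far as it needs.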


In [None]:
# Instantiate Preprocessor
dp: DataPreprocessor = DataPreprocessor(tokenizer=tokenizer)

# Process datasets
print("Processing datasets...")
tokenized_datasets = dataset.map(
    dp.data_preprocess,
    batched=True,
    remove_columns=dataset['train'].column_names
)

# print out final dataset
print("Tokenized datasets:")
print(tokenized_datasets)
print("\nFirst tokenized example (input):")
print(tokenizer.decode(tokenized_datasets['train'][0]['input_ids']))


## 6. Setup Training Arguments

Now we set up the training arguments, which are used to configure the training process.

- `OUTPUT_DIR`: The directory where the model will be saved.
- `LEARNING_RATE`: The step size the optimizer uses when updating the model weights.
- `EPOCHS`: The number of training epochs.
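
The number of optimizer steps follows directly from the dataset size, the batch size, and the epoch count. The short calculation below assumes a single device and no gradient accumulation (the notebook does not set it); the dataset size of 400 is a made-up figure for illustration.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a single-device run without gradient accumulation."""
    batches_per_epoch = math.ceil(num_examples / batch_size)  # last batch may be partial
    return batches_per_epoch * epochs


# Hypothetical training set of 400 examples, batch size 4, 20 epochs
steps = total_training_steps(num_examples=400, batch_size=4, epochs=20)
print(steps)
```

Knowing the total helps when choosing `logging_steps`: logging every 10 steps over 2000 steps produces 200 log entries.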


In [None]:
# parameters
import uuid
EPOCHS: int = 20
LEARNING_RATE: float = 3e-4
HAS_GPU: bool = str(device).startswith("cuda")  # works for "cuda", "cuda:0" and torch.device objects
RUN_NAME: str = f"flan-t5-it-finetune_{uuid.uuid4()}"

In [None]:
# initialize weights and biases project
wandb.init(
    project=params.config_parameters.wandb.project,
    name=RUN_NAME
)

# setup training parameters
training_args = Seq2SeqTrainingArguments(
    output_dir=OUTPUT_DIR,
    eval_strategy="epoch",
    learning_rate=LEARNING_RATE,  # Higher learning rate for smaller dataset
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=EPOCHS,  # More epochs for small dataset
    weight_decay=0.01,
    save_total_limit=2,
    predict_with_generate=True,
    fp16=HAS_GPU,  # Use mixed precision if GPU available
    dataloader_pin_memory=HAS_GPU, # only on GPU equipped systems. also silences warnings on MPS devices (Apple)
    logging_steps=10,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    push_to_hub=False,
    report_to="wandb",
    run_name=RUN_NAME,
)

In [None]:
# Data collator:
# - Build data batches
# - dynamic padding
data_collator = DataCollatorForSeq2Seq(
    tokenizer=tokenizer,
    model=model,
    padding=True
)

# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    processing_class=tokenizer,
    data_collator=data_collator
)

print("Trainer initialized successfully!")
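
Dynamic padding means each batch is padded only to the length of its own longest sequence. The pure-Python sketch below illustrates the idea: input ids are padded with the tokenizer's pad token id (0 for T5 tokenizers), while labels are padded with -100 so that padded positions are ignored by the loss. The token ids are made up, and `pad_batch` is an illustrative helper, not the collator's actual code.

```python
def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Made-up token ids for a batch of two examples
input_ids = pad_batch([[8774, 1], [8774, 296, 55, 1]], pad_value=0)
labels = pad_batch([[100, 1], [100, 200, 300, 1]], pad_value=-100)
print(input_ids)
print(labels)
```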


## 7. Train the Model

Train the model on the tokenized dataset, then report the final training loss and the training time.

In [None]:
# train model
print("Starting training...")
train_result = trainer.train()

# report information
print("\nTraining completed!")
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training time: {train_result.metrics['train_runtime']:.2f} seconds")


## 8. Evaluate the Model


In [None]:
print("Evaluating model on validation set...")
eval_result = trainer.evaluate()

print("\nEvaluation Results:")
for key, value in eval_result.items():
    print(f"{key}: {value}")
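
`trainer.evaluate()` reports a token-level loss, which does not directly show whether whole sentences were anonymized correctly. One simple complementary check is the exact-match rate between generated outputs and reference targets. The sketch below uses hand-written predictions for illustration; they are not outputs of the model, and `exact_match_rate` is a hypothetical helper.

```python
def exact_match_rate(predictions, references):
    """Fraction of predictions identical to their reference, ignoring outer whitespace."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hand-written examples: the second prediction leaks an email address
predictions = ["Mi chiamo [NOME].", "Scrivimi a mario@example.com."]
references = ["Mi chiamo [NOME].", "Scrivimi a [EMAIL]."]
print(exact_match_rate(predictions, references))
```

Exact match is strict: a prediction that masks all PII but differs by one word counts as a miss, so it is best read alongside the evaluation loss.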


## 9. Save the Finetuned Model

The fine-tuned model is now ready for use. Save it to disk.

- `OUTPUT_DIR`: directory where the model will be saved


In [None]:
# Save the model and tokenizer
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

print(f"Model and tokenizer saved to {OUTPUT_DIR}")
print("\nYou can load the model later with:")
print(f"  tokenizer = AutoTokenizer.from_pretrained('{OUTPUT_DIR}')")
print(f"  model = AutoModelForSeq2SeqLM.from_pretrained('{OUTPUT_DIR}')")
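
Once reloaded, the model can be used for inference. The helper below is a minimal sketch: `anonymize` and the `max_new_tokens` value are assumptions, and the input text should be formatted the same way as during training (for example with the same task prefix, if `DataPreprocessor` adds one).

```python
def anonymize(text, tokenizer, model, max_new_tokens=128):
    """Generate the anonymized version of `text` with a fine-tuned seq2seq model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    inputs = inputs.to(model.device)  # keep inputs on the same device as the model
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage (requires the model saved above):
# from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
# tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)
# model = AutoModelForSeq2SeqLM.from_pretrained(OUTPUT_DIR)
# print(anonymize("Mi chiamo Mario Rossi.", tokenizer, model))
```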