## **Analysis of the Fine-Tuning Process in This Notebook**
This notebook fine-tunes Google's Pegasus model on the CNN/DailyMail dataset for text summarization. The fine-tuning process follows a structured workflow, including data loading, preprocessing, tokenization, training, and evaluation. 


In [1]:
import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"

The code checks whether PyTorch's MPS backend (Apple's Metal Performance Shaders, which provides GPU acceleration on Apple silicon Macs such as M1/M2/M3) is available. If it is not, the device falls back to the CPU.

In [2]:
from datasets import load_dataset

samples_fraction = 0.1


dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset = dataset.rename_columns({"article": "document", "highlights": "summary"})
train_data = dataset["train"].shuffle(seed=42).select(range(int(len(dataset["train"]) * samples_fraction)))
val_data = dataset["validation"].shuffle(seed=42).select(range(int(len(dataset["validation"]) * samples_fraction)))

print(dataset["train"])
print(dataset["validation"][0])

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 287113
})
{'document': '(CNN)Share, and your gift will be multiplied. That may sound like an esoteric adage, but when Zully Broussard selflessly decided to give one of her kidneys to a stranger, her generosity paired up with big data. It resulted in six patients receiving transplants. That surprised and wowed her. "I thought I was going to help this one person who I don\'t know, but the fact that so many people can have a life extension, that\'s pretty big," Broussard told CNN affiliate KGO. She may feel guided in her generosity by a higher power. "Thanks for all the support and prayers," a comment on a Facebook page in her name read. "I know this entire journey is much bigger than all of us. I also know I\'m just the messenger." CNN cannot verify the authenticity of the page. But the power that multiplied Broussard\'s gift was data processing of genetic profiles from donor-recipient pairs. It works on a simple swappi

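- Loads the CNN/DailyMail dataset (version 3.0.0), a widely used benchmark for abstractive text summarization.
- Two columns are renamed:
	- "article" (the news article) → "document".
	- "highlights" (the human-written summary) → "summary".
- To speed up training, only 10% (samples_fraction = 0.1) of each split is used for fine-tuning: 28,711 of the 287,113 training rows and 1,336 validation rows.
- Each split is shuffled with a fixed seed (seed=42) and the first 10% of the shuffled rows is selected, so the subset is random but reproducible.
- Note that print(dataset["train"]) shows the full training split (287,113 rows), not the 10% subset.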
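
The subset sizes follow directly from the selection expression. The sketch below reproduces that arithmetic in plain Python without downloading the dataset. `subset_size` is a hypothetical helper, and the validation split size of 13,368 rows is an assumption taken from the public dataset card (the notebook output only shows the training size).

```python
def subset_size(num_rows: int, fraction: float) -> int:
    """Mirror int(len(split) * samples_fraction) from the notebook."""
    return int(num_rows * fraction)


samples_fraction = 0.1
train_rows = 287113      # reported by print(dataset["train"])
validation_rows = 13368  # assumption: size of the 3.0.0 validation split

train_subset = subset_size(train_rows, samples_fraction)
val_subset = subset_size(validation_rows, samples_fraction)

print(train_subset, val_subset)  # 28711 1336
```

Because int() truncates rather than rounds, each subset is at most 10% of its split. Under the stated assumption, both values agree with the row counts shown in the tokenization output.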

In [3]:
from transformers import PegasusTokenizer


# tokenizer
model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)

# preprocessing 
def preprocess_function(examples):
    inputs = tokenizer(
        examples["document"], 
        padding="max_length", 
        truncation=True, 
        max_length=1024
    )
    labels = tokenizer(
        examples["summary"], 
        padding="max_length", 
        truncation=True, 
        max_length=256
    )
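    # Mask padding positions in the labels with -100 so the loss ignores them
    inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]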
    return inputs

tokenized_train = train_data.map(preprocess_function, batched=True, remove_columns=["document", "summary"])
tokenized_val = val_data.map(preprocess_function, batched=True, remove_columns=["document", "summary"])

print(tokenized_train, tokenized_val)



Map:   0%|          | 0/28711 [00:00<?, ? examples/s]

Map:   0%|          | 0/1336 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 28711
}) Dataset({
    features: ['id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1336
})


- The Pegasus tokenizer is loaded with transformers.PegasusTokenizer.
- The preprocessing function:
	1. Tokenizes the "document" (full article), truncating or padding it to exactly 1024 tokens.
	2. Tokenizes the "summary", truncating or padding it to exactly 256 tokens.
	3. Stores the tokenized summary ids under "labels", the target the model is trained to predict.
- The "document" and "summary" columns are removed during mapping, so tokenized_train and tokenized_val contain only id, input_ids, attention_mask, and labels.
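
To make the effect of padding="max_length" and truncation=True concrete, here is a minimal sketch on toy token ids, written in plain Python rather than with the real tokenizer. The helper names, the pad id of 0, and the toy lengths are assumptions for illustration. The second helper shows a common refinement: replacing padding ids in the labels with -100, the index that PyTorch's cross-entropy loss ignores by default, so that padded positions do not contribute to the loss.

```python
from typing import List

PAD_ID = 0           # assumption: toy padding id
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def pad_or_truncate(ids: List[int], max_length: int, pad_id: int = PAD_ID) -> List[int]:
    """Cut the sequence to max_length, then right-pad it to exactly max_length."""
    clipped = ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


def mask_padding(labels: List[int], pad_id: int = PAD_ID) -> List[int]:
    """Replace padding ids with IGNORE_INDEX so they are skipped by the loss."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in labels]


short_summary = [17, 42, 8, 1]             # toy ids; 1 stands in for end-of-sequence
long_summary = [5, 6, 7, 8, 9, 10, 11, 1]  # longer than the toy max_length

padded = pad_or_truncate(short_summary, max_length=6)
truncated = pad_or_truncate(long_summary, max_length=6)
masked = mask_padding(padded)

print(padded)     # [17, 42, 8, 1, 0, 0]
print(truncated)  # [5, 6, 7, 8, 9, 10]
print(masked)     # [17, 42, 8, 1, -100, -100]
```

One simplification to keep in mind: this sketch cuts the tail of a long sequence outright, whereas the real tokenizer keeps the end-of-sequence token when it truncates.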


In [None]:
from transformers import PegasusForConditionalGeneration, Trainer, TrainingArguments

model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device) 
model.gradient_checkpointing_enable()


# Define training arguments

training_args = TrainingArguments(
    output_dir="./pegasus_finetuned",
    per_device_train_batch_size=1,  
    gradient_accumulation_steps=4,  
    evaluation_strategy="epoch",  # renamed to eval_strategy in newer transformers releases
    learning_rate=5e-5,
    num_train_epochs=3,
    bf16=True, 
    save_total_limit=2,  
    logging_dir="./logs",
    logging_steps=100,
    report_to="none"
)


# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)

# Start training
trainer.train()

### Model Setup
- Loads Pegasus-large (google/pegasus-large), a transformer-based sequence-to-sequence model pretrained for abstractive text summarization.
- Moves the model to the selected device (Apple MPS if available, otherwise CPU).
- Enables gradient checkpointing, which reduces memory usage by recomputing activations during backpropagation instead of storing them, at the cost of extra computation.

### Training Arguments
- Batch Size: 1 per device (kept small to limit memory use).
- Gradient Accumulation: 4 steps, so gradients from four batches are accumulated before each optimizer update, giving an effective batch size of 4.
- Evaluation: Performed at the end of each epoch.
- Learning Rate: 5e-5, a commonly used fine-tuning rate for transformers.
- Epochs: 3 (fine-tunes for three passes over the training data).
- Mixed Precision: Requests bfloat16 (bf16=True) to reduce memory use and speed up training. bf16 support on Apple MPS depends on the PyTorch and macOS versions, so this setting may need to be disabled on some machines.
- Checkpointing: Keeps only the two most recent checkpoints (save_total_limit=2).
- Logging: Logs training metrics every 100 steps (logging_steps=100); report_to="none" disables external experiment trackers.
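
The batch-related settings combine as follows. This is a back-of-the-envelope sketch, assuming a single device and the 28,711-row training subset. The exact step count reported by the Trainer may differ slightly depending on how the installed version handles the final, partial accumulation window.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1             # assumption: one MPS or CPU device
num_train_examples = 28711  # size of the 10% training subset
num_train_epochs = 3
logging_steps = 100

# Number of examples that contribute to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Full accumulation windows per epoch (the 3 leftover examples are ignored here).
updates_per_epoch = num_train_examples // effective_batch_size
total_updates = updates_per_epoch * num_train_epochs
approx_log_events = total_updates // logging_steps

print(effective_batch_size)  # 4
print(updates_per_epoch)     # 7177
print(total_updates)         # 21531
print(approx_log_events)     # 215
```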

### How Fine-Tuning Works
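- Uses the Hugging Face Trainer API, which handles the training loop, evaluation, and checkpointing.
- The loss is token-level cross-entropy between the decoder's predicted next-token distribution and the reference summary tokens.
- During fine-tuning:
	- The document (input) is passed through the Pegasus encoder.
	- The decoder receives the reference summary shifted right by one position (teacher forcing) and predicts the next token at each position.
	- Loss is computed between these predictions and the ground-truth summary tokens.
	- Backpropagation updates the model’s weights to minimize the loss.
- The model is evaluated on the validation set after each epoch. Because no compute_metrics function is passed to the Trainer, this evaluation reports the validation loss rather than summarization metrics such as ROUGE.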
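
The loss computation described above can be sketched in plain Python on toy numbers. This illustrates teacher forcing and token-level cross-entropy; it is not the library's implementation. The helper names, the start-token id of 0, and the toy logits are assumptions, and -100 marks label positions to skip.

```python
import math
from typing import List

IGNORE_INDEX = -100


def shift_right(labels: List[int], start_id: int = 0, pad_id: int = 0) -> List[int]:
    """Build decoder inputs: prepend start_id, drop the last label, unmask ignored ids."""
    shifted = [start_id] + labels[:-1]
    return [pad_id if tok == IGNORE_INDEX else tok for tok in shifted]


def token_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX."""
    total, count = 0.0, 0
    for row, target in zip(logits, labels):
        if target == IGNORE_INDEX:
            continue  # padded position: contributes nothing to the loss
        max_logit = max(row)
        # Numerically stable log-sum-exp over the vocabulary dimension.
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_norm - row[target]
        count += 1
    return total / count if count else 0.0


labels = [3, 2, 1, IGNORE_INDEX]              # toy reference summary, last position padded
decoder_inputs = shift_right(labels)          # what the decoder sees at each step
uniform_logits = [[0.0] * 4 for _ in labels]  # a model with no preference among 4 tokens
loss = token_cross_entropy(uniform_logits, labels)

print(decoder_inputs)  # [0, 3, 2, 1]
print(loss)
```

With uniform logits over a four-token vocabulary, the loss equals ln 4 (about 1.386), the value expected from a model that has learned nothing about the target. Fine-tuning drives this number down by raising the probability assigned to each reference token.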

