In [13]:
import numpy as np 
from datasets import load_dataset
import random
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer
)
from rouge_score import rouge_scorer
import warnings
import torch
import evaluate
warnings.filterwarnings('ignore')


## Fine-Tuning Methods for Language Models

Fine-tuning a language model (LM) is a process of adapting a pre-trained model to a specific task or dataset. Here we discuss several methods for fine-tuning, each with its rationale, pros, cons, and typical use cases.

### 1. Full Model Fine-Tuning

#### Mathematical Formulation:
$$ \hat{\theta} = \arg\min_{\theta} \{\mathcal{L}(D_{\text{finetune}}, \theta)\} $$
where $\hat{\theta}$ are the fine-tuned parameters and $\mathcal{L}$ is a loss function such as cross-entropy evaluated on the fine-tuning dataset $D_{\text{finetune}}$. The optimization over $\theta$ is initialized at the pre-trained parameters rather than at random values.

#### Rationale:
Fine-tuning the entire model adjusts all parameters on a downstream task, exploiting the pre-trained model as a starting point.

#### Pros:
- No need for manual architecture changes.
- Tends to be very effective when transfer learning from related tasks.

#### Cons:
- Computationally expensive.
- Risk of overfitting on small datasets.

#### Typical Use Cases:
- Supervised learning tasks with substantial labeled data.
- Tasks closely related to the pre-training domain.
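
As a minimal sketch of the objective above, assuming a one-parameter linear model with a squared-error loss (both chosen only for illustration, as are the names `fine_tune`, `finetune_data`, and the toy values), full fine-tuning is ordinary gradient descent that starts from the pre-trained value instead of a random one:

```python
def loss(theta, data):
    """Mean squared error of the model y = theta * x on (x, y) pairs."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def grad(theta, data):
    """Analytic gradient of the mean squared error with respect to theta."""
    return sum(2 * (theta * x - y) * x for x, y in data) / len(data)


def fine_tune(theta_pretrained, data, lr=0.05, steps=200):
    """Gradient descent on the fine-tuning data, initialized at the pre-trained value."""
    theta = theta_pretrained
    for _ in range(steps):
        theta -= lr * grad(theta, data)
    return theta


# Hypothetical fine-tuning set generated by y = 3x; the "pre-trained" value is 2.5.
finetune_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
theta_hat = fine_tune(2.5, finetune_data)
print(theta_hat, loss(theta_hat, finetune_data) < loss(2.5, finetune_data))
```

In a real LM every parameter plays the role of `theta`, which is where the compute and memory cost listed under the cons comes from.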

### 2. Prompt Engineering

Prompt engineering refers to techniques that involve formulating inputs to elicit desired behaviors from a pre-trained LM without modifying its parameters.

#### Variants:

  #### a. Few-shot Learning
  Few-shot learning is a specific type of prompt engineering where the model is given a few examples to infer the task.

  ##### Rationale:
  Given a handful of worked examples in the prompt, the LM can infer the task and produce the correct output for similar inputs.

  ##### Pros:
  - Quick adaptation to tasks without parameter updates.
  - Reduces data requirements for training.

  ##### Cons:
  - Performance depends heavily on quality and relevance of examples.
  - Not all models generalize well from few examples.

  ##### Typical Use Cases:
  - Tasks where labeled data is scarce.
  - Quick adaptation to new tasks where model re-training is infeasible.
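
A few-shot prompt is just a formatted string. The sketch below assumes a `Text:` / `Summary:` template and a helper named `build_few_shot_prompt`; both are hypothetical choices for illustration, and the best template depends on the model being prompted.

```python
def build_few_shot_prompt(examples, query, instruction="Summarize the text."):
    """Format (input, output) demonstration pairs followed by the unanswered query."""
    parts = [instruction]
    for text, summary in examples:
        parts.append(f"Text: {text}\nSummary: {summary}")
    # The final block leaves the answer slot empty for the model to complete.
    parts.append(f"Text: {query}\nSummary:")
    return "\n\n".join(parts)


demos = [
    ("The council approved the new budget after a long debate.", "Council passes budget."),
    ("Heavy rain flooded several streets in the city centre.", "Rain floods city streets."),
]
prompt = build_few_shot_prompt(demos, "The team won the final in extra time.")
print(prompt)
```

The resulting string would be passed to the model's generation call as-is; no parameters are updated.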

#### Pros (prompt engineering in general):
- No need to update parameters; computationally inexpensive.
- Flexible and quick to implement for new tasks.

#### Cons (prompt engineering in general):
- Finding effective prompts may require significant trial and error.
- Less effective for tasks distant from the pre-training domain.

#### Typical Use Cases (prompt engineering in general):
- Zero-shot or few-shot tasks where training data is limited.
- Exploratory data analysis and probing tasks.

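### 3. Low-Rank Adaptation (LoRA)

#### Mathematical Formulation:
In LoRA, the pre-trained weight matrix $W \in \mathbb{R}^{d \times k}$ is kept frozen and re-parameterized as $W' = W + \Delta A \, \Delta B$, where $\Delta A \in \mathbb{R}^{d \times r}$ and $\Delta B \in \mathbb{R}^{r \times k}$ are trainable and the rank $r$ is much smaller than $\min(d, k)$.
$$ h^{l+1} = \phi(W' h^{l} + b^{l}) $$

#### Rationale:
LoRA adds low-rank modifications to the weights of the pre-trained model. Only the two small factors are trained, which reduces the number of trainable parameters and the memory needed for gradients and optimizer state.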

#### Pros:
- Efficient parameter updates.
- Can retain the general knowledge of the pre-trained model.

#### Cons:
- May not perform as well as full model fine-tuning in some cases.
- Slightly increased model complexity due to additional parameters.

#### Typical Use Cases:
- Situations with limited computational budgets and training time.
- Scenarios where retaining the original model parameters is critical.
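
The low-rank update can be sketched in plain Python. The helper names (`matmul`, `lora_weight`, `trainable_params`) and all numbers are illustrative assumptions; the zero initialization of one factor follows common LoRA practice, so that the adapted weights equal the pre-trained ones before training starts.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, delta_a, delta_b):
    """Return W' = W + delta_a @ delta_b without modifying the frozen matrix W."""
    update = matmul(delta_a, delta_b)
    return [[w[i][j] + update[i][j] for j in range(len(w[0]))] for i in range(len(w))]


def trainable_params(d, k, r):
    """Count trainable entries for a full d x k update versus a rank-r update."""
    return {"full": d * k, "low_rank": r * (d + k)}


# Toy 2 x 3 frozen weight with a rank-1 update.
w = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
delta_a = [[0.5], [1.0]]        # d x r
delta_b = [[0.0, 0.0, 0.0]]     # r x k, zero so that W' equals W before training
assert lora_weight(w, delta_a, delta_b) == w

delta_b = [[1.0, 2.0, 0.0]]     # pretend training moved delta_b away from zero
print(lora_weight(w, delta_a, delta_b))
print(trainable_params(d=512, k=512, r=8))
```

For a 512 x 512 matrix and rank 8, the low-rank factors hold 8,192 trainable values against 262,144 for a full update, which is the source of the efficiency listed under the pros.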

### 4. Transfer Learning with Adapter Layers

#### Mathematical Formulation:
Adapter layers introduce a small set of task-specific parameters that transform the intermediate representation $H$ through a bottleneck:
$$ H_{\text{adapter}} = \sigma(H W_{\text{down}})W_{\text{up}} $$
where $\sigma$ is a non-linear activation function, and $W_{\text{down}}$, $W_{\text{up}}$ are the learnable down- and up-projection matrices of the adapter layer. In most adapter designs this output is added back to $H$ through a residual connection.

#### Rationale:
Adapters allow the model to adapt to a new task by learning a small set of task-specific parameters inserted between the pre-trained layers.

#### Pros:
- Task-specific fine-tuning without altering the original model parameters.
- More parameter-efficient than full fine-tuning.

#### Cons:
- May not reach the same performance as full fine-tuning for every task.
- Introduces additional hyperparameters to tune (e.g., adapter size).

#### Typical Use Cases:
- Multi-task learning where each task requires specific adaptations.
- Scenarios that need to preserve the original model weights for multiple purposes.
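
A minimal sketch of the bottleneck transformation above, assuming ReLU as the activation and a residual connection around the adapter (a common design, but an assumption here). The function names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    """Element-wise ReLU on a matrix given as lists of rows."""
    return [[max(0.0, v) for v in row] for row in m]


def adapter(h, w_down, w_up, residual=True):
    """Bottleneck adapter: project down, apply ReLU, project up, optionally add h back."""
    out = matmul(relu(matmul(h, w_down)), w_up)
    if residual:
        out = [[o + x for o, x in zip(out_row, h_row)] for out_row, h_row in zip(out, h)]
    return out


h = [[1.0, -2.0, 0.5, 3.0]]                                   # one token, hidden size 4
w_down = [[0.1, 0.0], [0.0, 0.2], [0.3, 0.0], [0.0, 0.1]]     # 4 x 2 bottleneck
w_up = [[0.0] * 4, [0.0] * 4]                                 # 2 x 4, zero-initialized

# With a zero up-projection and a residual connection the adapter is the identity,
# so inserting it does not change the pre-trained model's behavior before training.
assert adapter(h, w_down, w_up) == h

w_up = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(adapter(h, w_down, w_up))
```

The bottleneck width (2 in this toy case) is the adapter-size hyperparameter mentioned under the cons.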

### 5. Reinforcement Learning from Human Feedback (RLHF)

#### Mathematical Formulation:
$$ \theta^* = \arg\max_{\theta} \mathbb{E}_{\pi_{\theta}(a|s)} [R(s, a)] $$
where $R(s, a)$ is a reward function typically derived from human feedback, and $\pi_{\theta}(a|s)$ is the policy (e.g., the LM's predictions) parameterized by $\theta$.

#### Rationale:
RLHF directly optimizes the LM's outputs towards behaviors that receive higher rewards according to human preferences.

#### Pros:
- Aligns fine-tuning with human values and preferences.
- Can improve specific aspects of model behavior.

#### Cons:
- Human feedback can be expensive and time-consuming to collect.
- Risk of reward hacking, where the model learns shortcuts to gain rewards.

#### Typical Use Cases:
- Tasks that require subjective quality judgments (e.g., content generation, dialogue systems).
- Scenarios where ethical or safety considerations are paramount.
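
The objective above can be illustrated on a toy problem, assuming a softmax policy over three fixed candidate responses with known rewards. This is a deliberate simplification: practical RLHF samples responses, learns the reward from human comparisons, and typically constrains the policy to stay close to the original model. All names and numbers below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """E_pi[R] for a softmax policy over a fixed set of candidates."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on E_pi[R] with respect to the logits."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # d E[R] / d logit_i = p_i * (R_i - E[R])
    return [z + lr * p * (r - baseline) for z, p, r in zip(logits, probs, rewards)]


rewards = [0.1, 0.9, 0.4]      # hypothetical reward for each candidate response
logits = [0.0, 0.0, 0.0]       # start from a uniform policy
before = expected_reward(logits, rewards)
for _ in range(100):
    logits = policy_gradient_step(logits, rewards)
after = expected_reward(logits, rewards)
print(before, after)
```

Probability mass moves toward the highest-reward candidate, which is also how reward hacking arises when the reward function itself is flawed.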

By selecting the appropriate fine-tuning method, one can tailor a pre-trained LM to a wide range of tasks and use cases with varying requirements and constraints.

## Types of Fine-Tuning Tasks for Language Models

Fine-tuning tasks are specific challenges or objectives used to adapt a pre-trained Language Model (LM) to perform better on certain types of data or problems. Below are some common fine-tuning tasks and their associated loss functions.

### 1. Text Classification

#### Loss Function:
Cross-Entropy Loss
$$ \mathcal{L}(\theta) = -\sum_{(x, y) \in D} \log p_\theta(y | x) $$
where $D$ is the dataset containing text inputs $x$ and labels $y$, and $p_\theta(y | x)$ is the probability assigned by the model with parameters $\theta$ to the correct label $y$ given input $x$.
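
As a small worked example of this loss, the sketch below computes the summed cross-entropy from raw logits. The logits and labels are made up, and `log_softmax` and `cross_entropy` are illustrative helpers rather than library functions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def cross_entropy(batch_logits, labels):
    """Sum of -log p(y | x) over a batch, matching the sum over D in the formula."""
    return -sum(log_softmax(logits)[y] for logits, y in zip(batch_logits, labels))


batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]   # two inputs, three classes
labels = [0, 2]
print(cross_entropy(batch_logits, labels))
```

The second example has uniform logits, so it contributes exactly $\log 3$ to the loss regardless of its label.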

### 2. Named Entity Recognition (NER)

#### Loss Function:
Negative log-likelihood under a Conditional Random Field (CRF) for sequence labeling
$$ \mathcal{L}(\theta) = -\sum_{(x, \mathbf{y}) \in D} \log p_\theta(\mathbf{y} | x) $$
where $D$ is the dataset of sequences $x$ with corresponding entity label sequences $\mathbf{y}$, and $p_\theta(\mathbf{y} | x)$ is the conditional probability given by the model with parameters $\theta$ to the correct label sequence $\mathbf{y}$.
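
For a linear-chain CRF, $\log p_\theta(\mathbf{y} | x)$ is the score of the labeled path minus the log of the summed scores of all paths, and the latter is computed with the forward algorithm. The sketch below assumes per-token emission scores and a label-to-label transition matrix, with no separate start or end scores; names and numbers are illustrative.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emissions, transitions, labels):
    """Unnormalized log-score: emission scores plus transition scores along the path."""
    score = sum(emissions[t][y] for t, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores over all label sequences."""
    num_labels = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][j] + logsumexp([alpha[i] + transitions[i][j] for i in range(num_labels)])
            for j in range(num_labels)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, labels):
    """Negative log-likelihood of one label sequence under the CRF."""
    return log_partition(emissions, transitions) - sequence_score(emissions, transitions, labels)


emissions = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]   # 3 tokens, 2 labels
transitions = [[0.5, -0.2], [-0.4, 0.6]]           # transitions[i][j]: label i -> label j
print(crf_nll(emissions, transitions, [0, 1, 0]))
```

The forward recursion costs time linear in the sequence length, whereas enumerating every label sequence grows exponentially.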

### 3. Language Generation

#### Loss Function:
Negative Log-Likelihood Loss
$$ \mathcal{L}(\theta) = -\sum_{(x, y) \in D} \sum_{t} \log p_\theta(y_t | y_{<t}, x) $$
where $D$ is the dataset of input-output pairs $(x, y)$, $y_t$ is the $t$-th token in the output sequence $y$, and $y_{<t}$ represents the sequence of tokens before $y_t$. The loss sums over all tokens in the output sequence.
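
The inner sum over $t$ can be checked on a toy sequence. The sketch assumes the model's per-step distributions are already available under teacher forcing (each step conditions on the reference prefix $y_{<t}$); the probabilities are made up.

```python
import math


def sequence_nll(step_probs, target_ids):
    """Sum of -log p(y_t | y_<t, x) over the target tokens of one sequence."""
    return -sum(math.log(probs[y]) for probs, y in zip(step_probs, target_ids))


# step_probs[t] is a hypothetical model distribution over a 3-token vocabulary at step t.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.25, 0.25, 0.5]]
target_ids = [0, 1, 2]
nll = sequence_nll(step_probs, target_ids)

# By the chain rule, exp(-nll) is the probability of the whole sequence: 0.7 * 0.8 * 0.5.
print(nll, math.exp(-nll))
```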

### 4. Machine Translation

#### Loss Function:
Sequence-to-sequence Loss (typically cross-entropy)
$$ \mathcal{L}(\theta) = -\sum_{(x, y) \in D} \log p_\theta(y | x) $$
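where $D$ is the dataset of source-target pairs $(x, y)$, $y$ is the translated sequence, and $p_\theta(y | x)$ is the probability assigned by the model with parameters $\theta$ to the translation $y$ given source $x$. In practice $p_\theta(y | x)$ factorizes over the target tokens, so this is the same token-level negative log-likelihood used for language generation.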

### 5. Question Answering

#### Loss Function:
Span Prediction Loss (combination of start and end token cross-entropy)
$$ \mathcal{L}(\theta) = -\sum_{(x, y_{start}, y_{end}) \in D} (\log p_\theta(y_{start} | x) + \log p_\theta(y_{end} | x)) $$
where $D$ is the dataset of contexts $x$ with corresponding answer start and end positions $y_{start}, y_{end}$, and $p_\theta(y_{start} | x)$, $p_\theta(y_{end} | x)$ are the probabilities assigned to the start and end positions of the answer span.
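
A sketch of the span loss, assuming the model emits one start logit and one end logit per context token. The `best_span` helper shows one simple way to pick an answer at inference time (highest combined score with start no later than end); it is an illustrative assumption and not part of the loss.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def span_loss(start_logits, end_logits, start_pos, end_pos):
    """Cross-entropy of the start position plus cross-entropy of the end position."""
    return -(log_softmax(start_logits)[start_pos] + log_softmax(end_logits)[end_pos])


def best_span(start_logits, end_logits):
    """Highest-scoring (start, end) pair subject to start <= end."""
    n = len(start_logits)
    candidates = ((s, e) for s in range(n) for e in range(s, n))
    return max(candidates, key=lambda span: start_logits[span[0]] + end_logits[span[1]])


start_logits = [0.1, 2.0, 0.3, 0.0, -1.0]   # made-up scores for a 5-token context
end_logits = [0.0, 0.2, 0.1, 1.8, 0.4]
print(best_span(start_logits, end_logits))
print(span_loss(start_logits, end_logits, start_pos=1, end_pos=3))
```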

### 6. Sentiment Analysis

#### Loss Function:
Cross-Entropy Loss
$$ \mathcal{L}(\theta) = -\sum_{(x, y) \in D} \log p_\theta(y | x) $$
where $D$ is the dataset containing text inputs $x$ and sentiment labels $y$, and $p_\theta(y | x)$ is the probability that the model with parameters $\theta$ assigns to the correct sentiment label $y$ given input $x$.

### 7. Summarization

#### Loss Function:
Negative Log-Likelihood Loss for Sequence Generation
$$ \mathcal{L}(\theta) = -\sum_{(x, y) \in D} \sum_{t} \log p_\theta(y_t | y_{<t}, x) $$
where $D$ is the dataset of document-summary pairs $(x, y)$, and the loss is calculated over the output summary tokens $y_{t}$ given the document $x$ and previous tokens $y_{<t}$.


## Fine-Tuning on Multiple Tasks

Fine-tuning a Language Model on multiple tasks is an important approach to build versatile systems that can handle various types of language processing jobs. Here are the most common strategies for multi-task learning and fine-tuning:

### Sequential Fine-Tuning

In sequential fine-tuning, a model is fine-tuned on one task and then on another. The approach is simple and builds each task on a common foundation, but it can lead to catastrophic forgetting, where performance on the earlier task degrades as the later one is learned.

### Multi-Task Learning

A model is fine-tuned on multiple tasks simultaneously. This often involves shared representations for all tasks, and sometimes task-specific heads or layers. This approach is designed to help the model generalize better across tasks and leverage shared knowledge.
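
One simple way to train on several tasks at once is to interleave their batches. The sketch below uses a round-robin schedule, which is only one of several possible mixing strategies; the task names and batch placeholders are hypothetical.

```python
def interleave_batches(task_batches):
    """Round-robin over tasks until every task's batches are exhausted."""
    iterators = {name: iter(batches) for name, batches in task_batches.items()}
    schedule = []
    while iterators:
        # Iterate over a snapshot of the keys so finished tasks can be removed safely.
        for name in list(iterators):
            try:
                schedule.append((name, next(iterators[name])))
            except StopIteration:
                del iterators[name]
    return schedule


task_batches = {
    "summarization": ["s0", "s1", "s2"],
    "translation": ["t0", "t1"],
}
for task, batch in interleave_batches(task_batches):
    print(task, batch)
```

In a training loop, each `(task, batch)` pair would be routed through the shared encoder and the head that belongs to `task`.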

### Continual Learning

This approach focuses on fine-tuning a model across a sequence of tasks while retaining the ability to perform well on previous tasks. Techniques include replaying data from previous tasks or using regularization strategies to protect previous knowledge.

### Adapters

Fine-tuning through adapter modules involves adding small, task-specific modules to a pre-trained model without altering its weights. Each task's learning is compartmentalized, allowing the model to tackle multiple tasks effectively.

## Load and sub-sample dataset 

In [6]:
# Load a small text summarization public dataset
dataset = load_dataset("xsum")

# Subsample to 100 examples
subsampled_dataset = dataset['train'].shuffle(seed=42).select(range(100))

model_name = 't5-small'

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenization function with padding and truncation
def tokenize_function(examples):
    # Tokenize the texts and include padding and truncation to a fixed length
    model_inputs = tokenizer(examples["document"], padding="max_length", truncation=True, max_length=512)
    # Tokenize the summaries (labels) via `text_target`, which replaces the deprecated
    # `as_target_tokenizer` context manager in recent transformers releases.
    # The summaries are truncated/padded to a smaller max length to allow for faster training
    labels = tokenizer(text_target=examples["summary"], padding="max_length", truncation=True, max_length=128)
    
    # PyTorch expects -100 for ignored indices (e.g., padding) in the labels for sequence-to-sequence models
    labels["input_ids"] = [
        [(label if label != tokenizer.pad_token_id else -100) for label in label_input]
        for label_input in labels["input_ids"]
    ]
    
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Tokenize the dataset
tokenized_datasets = subsampled_dataset.map(tokenize_function, batched=True)

# Prepare the dataset for the trainer
# Shuffle once, then take disjoint slices: 80 examples for training, 20 for evaluation
shuffled_dataset = tokenized_datasets.shuffle(seed=42)
small_train_dataset = shuffled_dataset.select(range(80))
small_eval_dataset = shuffled_dataset.select(range(80, 100))


In [7]:
small_eval_dataset

Dataset({
    features: ['document', 'summary', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 20
})
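
The `-100` replacement in `tokenize_function` matters because PyTorch's cross-entropy loss skips targets equal to its default `ignore_index` of -100 and averages over the remaining ones. The pure-Python sketch below mimics that behavior on a toy vocabulary; the pad id of 0 and the probabilities are assumptions for illustration.

```python
import math

IGNORE_INDEX = -100


def mask_padding(label_ids, pad_id):
    """Replace padding token ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def mean_token_nll(step_probs, labels):
    """Average -log p over positions whose label is not IGNORE_INDEX."""
    kept = [(probs, y) for probs, y in zip(step_probs, labels) if y != IGNORE_INDEX]
    return -sum(math.log(probs[y]) for probs, y in kept) / len(kept)


labels = mask_padding([2, 1, 0, 0], pad_id=0)   # two real tokens followed by two pads
step_probs = [[0.1, 0.1, 0.8], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.3, 0.3, 0.4]]
print(labels, mean_token_nll(step_probs, labels))
```

Without the masking, the model would also be trained to predict padding tokens, which would dominate the loss for short summaries padded to 128 tokens.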

## Define finetuning function 

In [17]:
def train_model(model):
    # Define training arguments
    training_args = Seq2SeqTrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        weight_decay=0.01,
        save_total_limit=2,
        num_train_epochs=2,
        predict_with_generate=True
    )
    
    # Instantiate the Trainer
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset,
        tokenizer=tokenizer
    )
    
    # Train the model
    trainer.train()
    
    # Return the trained model
    return trainer.model

### Finetune model using different methods 

In [18]:
#Full fine-tuning
pretrained_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
full_finetuned_model = train_model(pretrained_model)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 3.816803 |
| 2 | No log | 3.717122 |
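
The "No log" entries are consistent with how few optimizer steps this run takes. The arithmetic below assumes a single device and no gradient accumulation; it also assumes the installed transformers version logs the training loss every 500 steps by default (`logging_steps`), which should be checked against that version's documentation.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Total optimizer steps on one device without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


total_steps = optimizer_steps(num_examples=80, batch_size=4, epochs=2)
print(total_steps)  # 40
```

With only 40 steps, a logging interval of 500 is never reached, so no training loss is recorded. Passing a smaller `logging_steps` to `Seq2SeqTrainingArguments` would populate that column.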


## Define evaluation function

In [23]:
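def evaluate_model(model, tokenizer, dataset, device=None):
    # Use the GPU when one is available, otherwise fall back to the CPU
    if device is None:
        device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model.to(device)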
    model.eval()
    
    rouge_metric = evaluate.load('rouge')
    generated_texts = []
    references = []
    
    for example in dataset:
        input_text = example['document']
        reference = example['summary']
        
        inputs = tokenizer(input_text, return_tensors='pt', max_length=1024, truncation=True)
        inputs = inputs.to(device)
        
        with torch.no_grad():
            summary_ids = model.generate(inputs["input_ids"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
        
        generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
        
        generated_texts.append(generated_summary)
        references.append(reference)
    
    # Compute ROUGE scores for all generated texts
    # Note that references are expected to be lists of lists of strings for 'rouge'
    rouge_results = rouge_metric.compute(predictions=generated_texts, references=[[r] for r in references])
    
    # Process result: Extract the metrics' means
    processed_results = {}
    for key in rouge_results.keys():
        metric_scores = rouge_results[key]
        if isinstance(metric_scores, dict):
            # When metrics return a dict with 'precision', 'recall', and 'fmeasure'
            processed_results[key] = metric_scores['fmeasure'] * 100
        else:
            # When metrics return direct scores (e.g., floats)
            processed_results[key] = metric_scores * 100
    
    return processed_results
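
To make the reported numbers easier to interpret, here is a simplified ROUGE-1 F-measure based on whitespace tokens. It is a sketch only: the `rouge` metric loaded through `evaluate` applies its own tokenization and options, so its scores will not match this function exactly.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F-measure between a prediction and a single reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: each reference token can be matched at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("the cat sat on the mat", "the cat lay on a mat"))
```

Here four of six unigrams overlap on each side, so precision, recall, and the F-measure are all 2/3. ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence.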

## Evaluate full and base model 

In [24]:
# Evaluate the fine-tuned model on the 20 held-out examples
rouge_scores = evaluate_model(full_finetuned_model, tokenizer, small_eval_dataset)
print(rouge_scores)

{'rouge1': 20.43459827021852, 'rouge2': 2.743339231408865, 'rougeL': 13.54311808719664, 'rougeLsum': 13.524231259552256}


In [26]:
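# Caution: train_model fine-tunes its argument in place and returns trainer.model, which is
# the same object. `pretrained_model` therefore already holds the fine-tuned weights, and the
# scores below repeat the previous cell. For a true baseline, evaluate a freshly loaded
# AutoModelForSeq2SeqLM.from_pretrained(model_name) instead.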
rouge_scores = evaluate_model(pretrained_model, tokenizer, small_eval_dataset)
print(rouge_scores)

{'rouge1': 20.43459827021852, 'rouge2': 2.743339231408865, 'rougeL': 13.54311808719664, 'rougeLsum': 13.524231259552256}
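
The two identical score dictionaries above are what one would expect if both variable names refer to the same model object. The toy sketch below reproduces that aliasing with a stand-in class (`ToyModel` and `train_in_place` are hypothetical) and shows one way to keep an untouched baseline by training a deep copy.

```python
import copy


class ToyModel:
    """Stand-in for a model with a single trainable weight."""

    def __init__(self, weight):
        self.weight = weight


def train_in_place(model):
    """Stand-in for a trainer that updates the model it receives and returns it."""
    model.weight += 1.0
    return model


# Aliasing: the returned model is the very object that was passed in.
base = ToyModel(weight=0.0)
tuned = train_in_place(base)
print(tuned is base, base.weight)

# Training a deep copy leaves the baseline untouched.
base = ToyModel(weight=0.0)
tuned = train_in_place(copy.deepcopy(base))
print(tuned is base, base.weight, tuned.weight)
```

In the notebook, the same effect could be obtained by passing `copy.deepcopy(pretrained_model)` to `train_model`, or by loading the checkpoint a second time for the baseline evaluation.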
