### Task Description:
The task was **headline generation** for news articles, a **conditional text generation** task. Given a news article \( X \), the model generates a concise headline \( Y \).

### Mathematical Function:
The model learns to maximize the likelihood of the correct headline \( Y \) conditioned on the news article \( X \):

\[
P(Y|X) = \prod_{t=1}^{T} P(y_t | y_{<t}, X)
\]

where \( Y = (y_1, \dots, y_T) \) is the sequence of tokens in the generated headline, \( y_{<t} \) denotes the tokens generated before step \( t \), and \( X \) is the input article.
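To make the factorization concrete, the sketch below multiplies a few per-token conditional probabilities and shows that the same quantity can be written as a sum of log-probabilities, which is the form a training loss works with. The probability values are invented for illustration and are not model outputs.

```python
import math

# Hypothetical conditional probabilities P(y_t | y_<t, X) for a 4-token headline.
# These values are made up for illustration only.
token_probs = [0.5, 0.25, 0.8, 0.5]

# P(Y|X) is the product of the per-token conditionals.
sequence_prob = math.prod(token_probs)

# In log space the product becomes a sum; its negative is the
# negative log-likelihood (NLL) that training minimizes.
nll = -sum(math.log(p) for p in token_probs)

print(f"P(Y|X) = {sequence_prob:.4f}")
print(f"NLL    = {nll:.4f}")
```

Working with the sum of logs avoids the numerical underflow that a long product of small probabilities would cause.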

### Why Encoder-Decoder Architecture:
The **encoder-decoder** architecture is essential because:
1. The **encoder** captures the semantic information from the news article \( X \).
2. The **decoder** generates the output sequence (headline) based on the context from the encoder.

An encoder alone cannot generate text, and a decoder alone has no dedicated component for reading the full article, so neither provides the full capabilities needed for this sequence-to-sequence task.

# Installation

In [1]:
!pip3 install transformers datasets evaluate accelerate bitsandbytes peft trl rouge_score wandb

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Collecting trl
  Downloading trl-0.13.0-py3-none-any.whl.metadata (11 kB)
Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloadi

# Imports

In [2]:
from functools import partial

import torch
from torch.utils.data import DataLoader

import evaluate
import transformers
from datasets import load_dataset
from peft import (
    LoraConfig,
    TaskType,
    get_peft_config,
    get_peft_model,
    prepare_model_for_kbit_training,
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)
from trl import SFTConfig, SFTTrainer

# Importing data

### Loading the Datasets
- **Purpose**: Load training and validation datasets from CSV files.
- **Why Hugging Face `datasets`**:
  - Efficient handling of datasets for NLP tasks.
  - Direct compatibility with Hugging Face tokenizers and models.
- **Verification**: Printing the dataset structure ensures that files are loaded with the expected features and rows.

In [3]:
train_ds = load_dataset("csv", data_files="./data/LABELLED_TRAIN.csv")
val_ds = load_dataset("csv", data_files="./data/LABELLED_DEV.csv")
train_ds

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['ID', 'News Article', 'Caption'],
        num_rows: 3000
    })
})

In [4]:
train_ds["train"]

Dataset({
    features: ['ID', 'News Article', 'Caption'],
    num_rows: 3000
})

- **Dataset Features**:
  - `ID`: A unique identifier for each news article.
  - `News Article`: The full text of the news article, which serves as the input for generating headlines.
  - `Caption`: The corresponding headline or summary of the news article, which serves as the target output.


## What is the average number of tokens in the News Article?

### Token Length Analysis
- **Objective**: Understand the tokenization characteristics of the dataset to ensure compatibility with the model's maximum input length and optimize training.


1. **Tokenizer Selection**:
   - The `T5Tokenizer` for the `google-t5/t5-small` checkpoint is used to convert the text into token IDs compatible with the model.

2. **Tokenization**:
   - Tokenizing the `News Article` column helps calculate the number of tokens for each article.
   - Knowing the token lengths shows whether inputs will exceed the model's input limit (512 tokens for this tokenizer, as the warning in the cell output confirms).

3. **Analysis**:
   - The average number of tokens across all articles is computed to determine if truncation is required during training.
   - A sample prompt's token count is compared to estimate the space left for actual article content when prefixed.

#### Insights
- If the average token length is significantly lower than the maximum input size, more content can be included in the input.
- A long prompt (e.g., 20+ tokens) reduces the available space for the actual article text and may require trimming the input.
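The trade-off between prompt length and article space can be estimated with simple arithmetic. The sketch below uses two hypothetical helpers, `article_budget` and `share_truncated`, and made-up token counts; it ignores special tokens such as the end-of-sequence marker, so treat the numbers as approximate.

```python
def article_budget(max_input_len: int, prompt_len: int) -> int:
    """Tokens left for the article once the prompt is prepended."""
    return max(max_input_len - prompt_len, 0)


def share_truncated(token_lengths: list[int], budget: int) -> float:
    """Fraction of articles whose token count exceeds the budget."""
    if not token_lengths:
        return 0.0
    return sum(length > budget for length in token_lengths) / len(token_lengths)


# Hypothetical token counts for five articles (illustrative values only).
sample_lengths = [180, 243, 310, 495, 528]

# A 512-token limit and a 20-token prompt, as discussed in this section.
budget = article_budget(max_input_len=512, prompt_len=20)
print(budget)                                   # tokens available for the article
print(share_truncated(sample_lengths, budget))  # fraction that would be truncated
```

Running the same two functions on the real `token_lengths` list would show how many training articles are actually affected by truncation.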


In [5]:
from transformers import T5Tokenizer

# Tokenizer for the small T5 checkpoint (used here only to count tokens)
tokenizer = T5Tokenizer.from_pretrained("google-t5/t5-small")
inputs = train_ds["train"]["News Article"]

# Tokenize the inputs and calculate the number of tokens
token_lengths = [len(tokenizer.encode(input_text)) for input_text in inputs]

# Calculate the average number of tokens
average_token_length = sum(token_lengths) / len(token_lengths)

# Token count of a candidate instruction prompt
prompt = "Create an engaging, accurate headline for this news article. Be creative while maintaining context and professionalism."
print(len(tokenizer.encode(prompt)))
print(f"Average number of tokens per input: {average_token_length}")



tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Token indices sequence length is longer than the specified maximum sequence length for this model (528 > 512). Running this sequence through the model will result in indexing errors


20
Average number of tokens per input: 243.39166666666668


- **Average Token Length**: 243.39 tokens per input, which is well within the limit, but outliers like the one above need handling.

#### Actions:
1. **Truncation**: Shorten inputs exceeding 512 tokens to prevent errors.
2. **Prompt Optimization**: Reduce the length of prompts to maximize usable space for article content.
3. **Padding**: Ensure inputs shorter than the max length are padded for consistent processing.


# Preprocessing 🏹

### Tokenization and Input Preparation

1. **Instruction Prefix**:
   - Prepend `Create an engaging, accurate headline for this news article while maintaining context` to each "News Article" to give the model **task-specific guidance**.
2. **Tokenization**:
   - Inputs are tokenized with truncation enabled to avoid sequence length overflow.
   - Padding is *skipped* here so that `DataCollatorWithPadding` can apply **dynamic padding** during batching.

3. **Label Tokenization**:
   - Targets ("Caption") are tokenized with `max_length=52` and `padding='max_length'`, so every label sequence has the same length.
4. **Labels Addition**:
   - The `model_inputs` dictionary holds the tokenized inputs together with the corresponding labels, making it ready for the model.
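Dynamic padding pads each batch only to the length of its own longest sequence rather than to a global maximum. A minimal pure-Python sketch of the idea follows; the helper `pad_batch` and the token IDs are hypothetical, and the notebook itself relies on `DataCollatorWithPadding` for this step.

```python
def pad_batch(batch: list[list[int]], pad_id: int) -> list[list[int]]:
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]


# Hypothetical token IDs; 0 plays the role of the padding ID here.
batch = [[37, 1022, 5, 1], [8, 1], [21, 4, 1]]
padded = pad_batch(batch, pad_id=0)

for row in padded:
    print(row)
```

Because the longest sequence here has 4 tokens, every row is padded to length 4 instead of the model's full input limit, which saves memory and compute on short batches.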



In [8]:
device = "cuda" if torch.cuda.is_available() else "cpu"  # call the function; the bare attribute is always truthy
device

'cuda'

In [81]:
def preprocess_function(example, tokenizer, max_target_length=52):
    """Tokenize a batch of articles (with an instruction prefix) and their captions."""
    articles = example["News Article"]
    targets = example["Caption"]

    # Add the instruction prefix to every article
    inputs = [
        f"Create an engaging, accurate headline for this news article while maintaining context {text}"
        for text in articles
    ]

    # Tokenize without padding (padding is handled dynamically by the data collator)
    model_inputs = tokenizer(inputs, truncation=True, padding=False)

    # Tokenize the targets to a fixed length so the label tensors stack cleanly
    labels = tokenizer(targets, max_length=max_target_length, padding="max_length", truncation=True)

    # Add labels to the input dictionary
    model_inputs["labels"] = labels["input_ids"]

    return model_inputs

The `device` variable defined before `preprocess_function` determines the computational device (GPU or CPU) used for training and inference, so that a GPU is used whenever one is available.

## Computation of metrics

In [137]:
from evaluate import load

rouge = load("rouge")
bleu = load("bleu")


def compute_metrics(pred, tokenizer):
    """Decode predictions and labels, then score them with ROUGE and BLEU."""
    labels_ids = pred.label_ids
    # predictions is the (pred_ids, labels) tuple returned by preprocess_logits_for_metrics
    pred_ids = pred.predictions[0]

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks ignored label positions; swap in the pad token so decoding works
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(
        predictions=pred_str,
        references=label_str,
        rouge_types=["rouge1", "rouge2", "rougeL"],
    )
    bleu_output = bleu.compute(predictions=pred_str, references=label_str)

    return {
        "rouge1": round(rouge_output["rouge1"], 4),
        "rouge2": round(rouge_output["rouge2"], 4),
        "rougeL": round(rouge_output["rougeL"], 4),
        "bleu": round(bleu_output["bleu"], 4),
    }


def preprocess_logits_for_metrics(logits, labels):
    """
    Original Trainer may have a memory leak.
    This is a workaround to avoid storing too many tensors that are not needed:
    logits are reduced to predicted token IDs before they are accumulated.
    """
    # logits arrives as a tuple here; its first element holds the LM logits.
    # Note: these are teacher-forced predictions, not free-running generate() output.
    pred_ids = torch.argmax(logits[0], dim=-1)
    return pred_ids, labels
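For intuition about what the ROUGE-1 number measures, here is a simplified unigram-overlap F1 score. This is an illustrative sketch only: the helper `unigram_f1` is hypothetical, it splits on whitespace, and it does not reproduce the tokenization or other details of the `rouge_score` package used by `evaluate`.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1: clipped unigram overlap on whitespace tokens."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Made-up headline pair for illustration.
score = unigram_f1(
    "stocks rally as markets recover",
    "markets recover as stocks rally on friday",
)
print(round(score, 4))
```

In this example all five predicted words appear in the seven-word reference, so precision is 1.0 and recall is 5/7.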

# t5 ⛵

In [None]:

# peft_config = LoraConfig(
#     task_type=TaskType.CAUSAL_LM, inference_mode=False, r=32, lora_alpha=16, lora_dropout=0.1,
#     target_modules=['q','k','v'] # optional, you can target specific layers using this
# )

## Model Setup with QLoRA and PEFT for Efficient Training

**This is important when working with large models like T5 in resource-constrained environments.**

- https://huggingface.co/google-t5/t5-base

#### 1. **Model Size Consideration:**
The T5-base architecture used here has roughly **220 million** parameters. Fully fine-tuning it on a platform like Google Colab is demanding in both memory and compute, and without memory optimizations it can lead to long training times or out-of-memory errors.

#### 2. **QLoRA Configuration**:
- We use 4-bit quantization for memory efficiency (`load_in_4bit=True`). This shrinks the stored model weights while largely preserving model quality.
- `bnb_4bit_use_double_quant=True` enables double quantization, which also quantizes the quantization constants to save a little more memory.
- We specify `bnb_4bit_quant_type="nf4"` to use the NormalFloat4 (NF4) data type, a 4-bit format designed for normally distributed weights.
- The compute data type is set to `torch.float16`, so the 4-bit weights are dequantized to half precision for the matrix multiplications, keeping computation fast and the memory footprint small.
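A rough back-of-the-envelope estimate shows why 4-bit storage matters. The sketch below uses a hypothetical helper, `weight_memory_mb`, and the total parameter count that `print_trainable_parameters()` reports for this model in this notebook. It counts weight storage only and assumes every parameter is stored at the given width, which overstates the savings slightly, since quantization constants and any layers kept in higher precision are ignored, as are activations, gradients, and optimizer state.

```python
def weight_memory_mb(num_params: int, bits_per_param: float) -> float:
    """Approximate storage for the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bits_per_param / 8 / 1e6


# Total parameter count reported by print_trainable_parameters() in this notebook.
NUM_PARAMS = 223_567_104

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_mb(NUM_PARAMS, bits):8.1f} MB")
```

The float32 figure is in line with the size of the `pytorch_model.bin` download shown in the cell output (892M).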

#### 3. **LoRA Configuration**:
- **LoRA** helps to reduce the number of trainable parameters by focusing only on modifying the low-rank adaptation matrices rather than the entire model.
- `r=4` specifies the rank for the low-rank matrices, controlling how much the adaptation will modify the model.
- `lora_alpha=16` scales the low-rank update; PEFT applies it with the factor `lora_alpha / r`. A higher value means more significant adaptations in the model.
- `target_modules=['q', 'k', 'v']` indicates that LoRA will be applied to the attention layers (queries, keys, and values) which are crucial for transformer models.
- `lora_dropout=0.05` introduces dropout in the LoRA layers for regularization, reducing overfitting.
- `bias="none"` keeps all bias terms frozen, so only the LoRA matrices are trained, which makes the training even more efficient.

This setup is optimized for training with less memory while maintaining high performance, which is essential given the large model size of T5 and the constraints of Colab.

In [91]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

# QLoRA: load the base model in 4-bit NF4 with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

t5_tokenizer = T5Tokenizer.from_pretrained("Michau/t5-base-en-generate-headline")
t5_model = T5ForConditionalGeneration.from_pretrained(
    "Michau/t5-base-en-generate-headline", quantization_config=bnb_config
)

# Prepare the quantized model for training
t5_model = prepare_model_for_kbit_training(t5_model)

config = LoraConfig(
    r=4,
    lora_alpha=16,
    target_modules=['q', 'k', 'v'],
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",  # T5 is an encoder-decoder model, so not "CAUSAL_LM"
)

t5_model = get_peft_model(t5_model, config)
t5_model.print_trainable_parameters()

tokenizer_config.json:   0%|          | 0.00/43.0 [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.79k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

`low_cpu_mem_usage` was None, now default to True since model is quantized.


pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

trainable params: 663,552 || all params: 223,567,104 || trainable%: 0.2968


**Model Parameters with QLoRA**

After applying QLoRA, the model has:

- **Trainable Parameters**: 663,552 (0.2968% of total parameters)
- **Total Parameters**: 223,567,104

QLoRA reduces memory use in two ways: the frozen base weights are stored in 4-bit precision, and only the small LoRA adapter matrices receive gradient updates. Training this small fraction of parameters makes the process more efficient and suitable for resource-limited environments like Colab, allowing us to work with a large model without overwhelming system memory.
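The trainable-parameter count can be reproduced by hand. The sketch below assumes the standard T5-base layout (12 encoder and 12 decoder blocks, 768-dimensional `q`/`k`/`v` projections, with cross-attention in every decoder block); the helper `lora_param_count` is illustrative and not part of PEFT.

```python
def lora_param_count(r: int, d_in: int, d_out: int, num_modules: int) -> int:
    """Each adapted linear layer adds A (r x d_in) and B (d_out x r)."""
    return num_modules * r * (d_in + d_out)


# Assumed T5-base shapes: 12 encoder + 12 decoder blocks, 768-dim q/k/v projections.
encoder_modules = 12 * 3        # self-attention q, k, v
decoder_modules = 12 * 3 * 2    # self-attention and cross-attention q, k, v

trainable = lora_param_count(
    r=4, d_in=768, d_out=768, num_modules=encoder_modules + decoder_modules
)
total = 223_567_104  # total reported by print_trainable_parameters() above

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions the result matches the figures printed by `print_trainable_parameters()`.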

### Tokenization and Preprocessing

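We read the tokenizer's maximum input length into `t5_max_input_length`. Because `preprocess_function` tokenizes with `truncation=True` and no explicit `max_length`, the tokenizer falls back to its own `model_max_length` (when one is defined) to avoid exceeding the sequence limit. The `preprocess_function` is applied to tokenize and process the datasets, removing the original columns. Finally, we use `DataCollatorWithPadding` to dynamically pad batches, optimizing memory usage and training efficiency.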

In [92]:
t5_max_input_length = t5_tokenizer.model_max_length

# Apply the preprocessing function to the datasets
preprocess_with_params = partial(preprocess_function, tokenizer=t5_tokenizer)
tokenized_train_ds = train_ds.map(preprocess_with_params, batched=True, remove_columns=["ID", "News Article", "Caption"])
tokenized_val_ds = val_ds.map(preprocess_with_params, batched=True, remove_columns=["ID", "News Article", "Caption"])

# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=t5_tokenizer)


Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [93]:
import torch

# Check if a GPU is available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the model to the chosen device
t5_model.to(device)

# Print the device to confirm
print(f"Model is placed on: {device}")

Model is placed on: cuda


### Trainer Configuration

The `TrainingArguments` are set to control various aspects of the training process. This includes saving results, evaluating after each epoch, setting the learning rate, batch sizes for training and evaluation, and controlling the frequency of logging and saving checkpoints. Additionally, we specify the `Trainer` to use the preloaded model, tokenized datasets, and necessary components like the tokenizer and metrics.

### Hyperparameter Justification

1. **Evaluation Strategy (`"epoch"`)**: Evaluates the model after each epoch to track progress and avoid overfitting.
2. **Learning Rate (`5e-5`)**: A common choice for fine-tuning T5 models, balancing convergence and stability.
3. **Batch Size (`4` for training, `2` for evaluation)**: Chosen to fit within GPU memory limits in Colab, allowing frequent updates.
4. **Epochs (`1`)**: A quick starting point for experimentation, adjusting if necessary.
5. **Weight Decay (`0.01`)**: Regularization to prevent overfitting by penalizing large weights.
6. **Logging & Saving (`10` steps)**: Logs and saves every 10 steps to monitor progress and store recent checkpoints.
7. **Eval Accumulation Steps (`4`)**: Moves accumulated prediction tensors from the GPU to the CPU every 4 evaluation steps, which limits GPU memory use during evaluation.

These choices balance performance and computational efficiency for quick experimentation.

In [101]:
training_args = TrainingArguments(
    output_dir='./results',  # where to save the results
    evaluation_strategy="epoch",  # evaluate after each epoch
    learning_rate=5e-5,  # learning rate
    per_device_train_batch_size=4,  # batch size for training
    per_device_eval_batch_size=2,  # batch size for evaluation
    num_train_epochs=1,  # number of training epochs
    report_to="none",  # disable external experiment trackers
    weight_decay=0.01,  # weight decay for regularization
    logging_dir='./logs',  # where to store logs
    logging_steps=10,  # log every 10 steps
    save_steps=10,  # save model every 10 steps
    save_total_limit=2,  # keep only the 2 most recent checkpoints
    eval_accumulation_steps=4,  # move predictions to CPU every 4 eval steps
)
trainer = Trainer(
    model=t5_model,  # your preloaded model
    args=training_args,
    train_dataset=tokenized_train_ds["train"],  # load_dataset returns a DatasetDict
    eval_dataset=tokenized_val_ds["train"],
    tokenizer=t5_tokenizer,
    data_collator=data_collator,
    compute_metrics=partial(compute_metrics, tokenizer=t5_tokenizer),  # bind the tokenizer argument
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,  # compute_metrics expects token IDs
)





> **Note on Trainer Choice** 🔴 🔴 🔴 🔴 🔴
While we initially set up the `Trainer` for model fine-tuning, we opted for **Supervised Fine-Tuning (SFT)** with TRL's `SFTTrainer` and `SFTConfig` instead. `SFTTrainer` extends `Trainer`, so the arguments, callbacks, and metric hooks used here carry over, and it is designed to work with PEFT setups such as QLoRA.





## Supervised finetuning

In [102]:
tokenized_train_ds.set_format(type='torch')
tokenized_val_ds.set_format(type='torch')

In [103]:
import torch
torch.cuda.empty_cache()

### SFT Configuration and Training Setup

We have set up the **Supervised Fine-Tuning (SFT)** configuration using `SFTConfig` with the following parameters:

- **Learning Rate**: `5e-5` — A balanced learning rate to ensure gradual model fine-tuning.
- **Batch Size**: `4` for training and `2` for evaluation — Set based on memory constraints and dataset size.
- **Number of Epochs**: `20` — Sufficient epochs for effective fine-tuning on the dataset.
- **Save and Log Configurations**: Checkpoints are saved at the end of every epoch (`save_strategy="epoch"`, so `save_steps` has no effect) and metrics are logged every 10 steps for better tracking and model recovery.
- **Best Model Selection**: `load_best_model_at_end` is enabled to load the best performing model based on evaluation metrics.

We also use **early stopping** with a patience of `3`, meaning training stops if the selection metric does not improve for 3 consecutive evaluations (here, 3 epochs). The metric for best model selection is the **BLEU score** (`eval_bleu`), a standard n-gram overlap metric for text generation tasks like headline generation.

This setup, using **SFTTrainer**, ensures a well-rounded fine-tuning process with model monitoring and early stopping for efficiency.

In [151]:
training_args = SFTConfig(
    output_dir='./results',
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    num_train_epochs=20,
    save_total_limit=2,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_strategy="epoch",
    save_steps=10,
    eval_accumulation_steps=4,
    load_best_model_at_end=True,
    metric_for_best_model="eval_bleu",
)

trainer = SFTTrainer(
    model=t5_model,  # preloaded model
    args=training_args,
    train_dataset=tokenized_train_ds["train"],
    eval_dataset=tokenized_val_ds["train"],
    tokenizer=t5_tokenizer,
    data_collator=data_collator,
    compute_metrics=partial(compute_metrics, tokenizer=t5_tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early stopping
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)



In [105]:
trainer.train()



Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,0.6313,0.677785,0.3891,0.1512,0.3811,0.0924
2,0.6706,0.668465,0.3944,0.1528,0.3864,0.0916




KeyboardInterrupt: 

### Training Duration and Results

Due to **time and GPU constraints**, we could only run the model for **14 epochs**, which took approximately **2 hours** for training. Below are the results:

| Epoch | Training Loss | Validation Loss | RL  | BLEU |
|-------|---------------|-----------------|-----|------|
| 1     | 0.622900      | 0.668581        | 0.3883 | 0.0944 |
| 2     | 0.658800      | 0.660973        | 0.3922 | 0.0972 |
| 3     | 0.621300      | 0.655607        | 0.3955 | 0.0989 |
| 4     | 0.573300      | 0.649013        | 0.3972 | 0.0970 |
| 5     | 0.567900      | 0.648848        | 0.3989 | 0.1002 |
| 6     | 0.599700      | 0.643754        | 0.4018 | 0.1022 |
| 7     | 0.536600      | 0.642526        | 0.4045 | 0.1051 |
| 8     | 0.571300      | 0.643984        | 0.4029 | 0.1044 |
| 9     | 0.576200      | 0.641517        | 0.4052 | 0.1062 |
| 10    | 0.544300      | 0.638496        | 0.4072 | 0.1093 |
| 11    | 0.545100      | 0.638023        | 0.4087 | 0.1094 |
| 12    | 0.666700      | 0.638624        | 0.4086 | 0.1110 |
| 13    | 0.518900      | 0.637369        | 0.4071 | 0.1139 |
| 14    | 0.593200      | 0.637359        | 0.4089 | 0.1134 |

- **Training Loss**: The logged values fluctuate between 0.5189 and 0.6667 but drift downward overall (0.6229 at epoch 1, 0.5932 at epoch 14), indicating that the model was learning.
- **Validation Loss**: Decreases almost monotonically from 0.6686 to 0.6374 with no sustained upturn, suggesting that the model is not overfitting during the limited training duration.
- **RL and BLEU Scores**: Show gradual improvement. RL rises from 0.3883 to 0.4089, and BLEU rises from 0.0944 to **0.1134** at the 14th epoch, after peaking at 0.1139 in epoch 13.

These results are promising but could improve with further training and tuning.
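The early-stopping rule can be checked against the BLEU column of the table above. The function below is a simplified sketch of a patience rule (stop after `patience` consecutive evaluations that fail to beat the best score so far); `early_stop_epoch` is a hypothetical helper and not the `EarlyStoppingCallback` implementation, which also supports an improvement threshold.

```python
from typing import Optional


def early_stop_epoch(scores: list[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None if it never does.

    Simplified rule: stop once `patience` consecutive evaluations fail to beat
    the best score seen so far (higher is better).
    """
    best = float("-inf")
    stale = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None


# BLEU column from the results table above (epochs 1 to 14).
bleu_by_epoch = [0.0944, 0.0972, 0.0989, 0.0970, 0.1002, 0.1022, 0.1051,
                 0.1044, 0.1062, 0.1093, 0.1094, 0.1110, 0.1139, 0.1134]

print(early_stop_epoch(bleu_by_epoch, patience=3))
```

With these scores the rule never fires: BLEU dips for at most one evaluation at a time before setting a new best, so under this simplified rule a patience of 3 would not have ended training within the 14 epochs.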

In [106]:
trainer.evaluate()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,0.6313,0.677785,0.3891,0.1512,0.3811,0.0924
2,0.6656,0.66389,0.3953,0.1546,0.3871,0.0939


{'eval_loss': 0.6638903617858887,
 'eval_rouge1': 0.3953,
 'eval_rouge2': 0.1546,
 'eval_rougeL': 0.3871,
 'eval_bleu': 0.0939}

### Model Checkpoint and Saving

After training for 14 epochs, we loaded the model from the **latest checkpoint**, `./results/checkpoint-10500`, so that the most recently trained weights are kept. Note that the latest checkpoint is not necessarily the best-scoring one. The model and tokenizer were then saved for future use and deployment to the following directories:

- **Model**: `./t5_ch_model`
- **Tokenizer**: `./t5_ch_tokenizer`

This ensures that we can reuse the trained model for inference without retraining it from scratch.
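The checkpoint number follows from the dataset size and batch size. A small sketch of the arithmetic, assuming a single GPU and no gradient accumulation (the helper `steps_per_epoch` is illustrative):

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps in one epoch on a single device without gradient accumulation."""
    return math.ceil(num_examples / batch_size)


# 3000 training rows and a per-device train batch size of 4, as in this notebook.
per_epoch = steps_per_epoch(num_examples=3000, batch_size=4)

print(per_epoch)        # optimizer steps per epoch
print(per_epoch * 14)   # global step after 14 epochs
```

With 750 steps per epoch, 14 epochs end at step 10500, which is where the `checkpoint-10500` directory name comes from.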

In [107]:
import os
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Find the latest checkpoint (most recently modified checkpoint-* directory)
checkpoint_dir = "./results"
checkpoints = [
    os.path.join(checkpoint_dir, d)
    for d in os.listdir(checkpoint_dir)
    if d.startswith("checkpoint-")
]
latest_checkpoint = max(checkpoints, key=os.path.getmtime)

# Load the model and tokenizer from the latest checkpoint
t5_ch_model = T5ForConditionalGeneration.from_pretrained(latest_checkpoint)
t5_ch_tokenizer = T5Tokenizer.from_pretrained(latest_checkpoint)

# Save the model and tokenizer to a new directory
t5_ch_model.save_pretrained("./t5_ch_model")
t5_ch_tokenizer.save_pretrained("./t5_ch_tokenizer")

('./t5_ch_tokenizer/tokenizer_config.json',
 './t5_ch_tokenizer/special_tokens_map.json',
 './t5_ch_tokenizer/spiece.model',
 './t5_ch_tokenizer/added_tokens.json')

## Results for T5

[Click here to view the wandb dashboard for t5_base](https://api.wandb.ai/links/kabirj2505-none/7o0csfm0)

In [None]:
from IPython.display import IFrame

# Provide the URL of your WandB report
wandb_report_url = "https://wandb.ai/kabirj2505-none/huggingface/reports/t5_logicloom--VmlldzoxMDgxNTI1NA?accessToken=p6q5pczkqusn4s7n41sdwybhpaqly68w5m2i3tc7q5fhw1wisssadexsi2dvfrlc"

# Use IFrame to embed the report in the notebook
IFrame(wandb_report_url, width=2000, height=800)

# Bart ⚾

In [111]:
def get_device_map() -> str:
    return 'cuda' if torch.cuda.is_available() else 'cpu'
device = get_device_map()
device

'cuda'



## 1. Model Choice (2):
- **Model**: `facebook/bart-large-cnn`
- **Reasoning**: We selected `facebook/bart-large-cnn`, a BART-large checkpoint fine-tuned for summarization on the CNN/DailyMail dataset. Summarization is closely related to headline generation, which makes this model a natural starting point.

## 2. LoRA Configuration:
- **`r = 8`**:
  - **Reasoning**: The rank of the low-rank matrices used in LoRA. A smaller rank reduces the number of trainable parameters, making the model more efficient during fine-tuning. `r = 8` was chosen to balance parameter reduction and the model’s ability to fine-tune effectively.

- **`lora_alpha = 4`**:
  - **Reasoning**: `lora_alpha` is the scaling factor for LoRA updates, which PEFT applies as `lora_alpha / r`. A value of `4` was selected as a trade-off between model expressiveness and overfitting.

- **`lora_dropout = 0.1`**:
  - **Reasoning**: Dropout helps prevent overfitting by randomly zeroing a fraction of the activations fed into the LoRA layers during training. A value of `0.1` was chosen to maintain model capacity and stability during training.

- **`target_modules = ['q_proj', 'k_proj']`**:
  - **Reasoning**: LoRA is applied only to the query (`q_proj`) and key (`k_proj`) projections. The value projection (`v_proj`) was left out to keep the number of trainable parameters low while maintaining performance.
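The effective strength of a LoRA update depends on both `lora_alpha` and `r`. The sketch below compares the two configurations used in this notebook, assuming PEFT's default scaling of `lora_alpha / r`; the helper `lora_scaling` is illustrative and not a PEFT function.

```python
def lora_scaling(lora_alpha: float, r: int) -> float:
    """Default LoRA scaling applied to the low-rank update B @ A."""
    return lora_alpha / r


# Values taken from the two LoraConfig objects in this notebook.
configs = {
    "t5":   {"r": 4, "lora_alpha": 16},
    "bart": {"r": 8, "lora_alpha": 4},
}

for name, cfg in configs.items():
    print(name, lora_scaling(cfg["lora_alpha"], cfg["r"]))
```

The BART setup therefore scales its adapter updates by 0.5, compared with 4.0 for the T5 setup, so its adapters start out with a noticeably gentler influence on the pre-trained weights.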

## 3. Model Size:
- **BART Parameter Count**:
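  - **Reasoning**: `facebook/bart-large-cnn` has around **406 million parameters**, which is nearly double the roughly 223 million parameters of the T5 model. This makes BART computationally more expensive but could potentially offer better results at the cost of requiring more resources. (The `dbart_` prefix in the variable names is only a naming choice; the checkpoint is not DistilBART.)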

In [138]:
from transformers import BartForConditionalGeneration, BartTokenizer

dbart_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
dbart_tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

# LoRA config for fine-tuning; BART is an encoder-decoder model, hence SEQ_2_SEQ_LM
peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    inference_mode=False,
    r=8,
    lora_alpha=4,
    lora_dropout=0.1,
    target_modules=['q_proj', 'k_proj'],
)

# Wrap the model so that only the LoRA adapters are trainable
dbart_model = get_peft_model(dbart_model, peft_config)

dbart_model.print_trainable_parameters()


trainable params: 1,179,648 || all params: 407,470,080 || trainable%: 0.2895


### Tokenization and preprocessing for BART

In [139]:

preprocess_with_params = partial(preprocess_function, tokenizer=dbart_tokenizer)
tokenized_train_ds = train_ds.map(preprocess_with_params, batched=True, remove_columns=["ID", "News Article", "Caption"])
tokenized_val_ds = val_ds.map(preprocess_with_params, batched=True, remove_columns=["ID", "News Article", "Caption"])

# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=dbart_tokenizer)


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [140]:
tokenized_train_ds.set_format(type='torch')
tokenized_val_ds.set_format(type='torch')

## Hyperparameter Choices and Reasoning

- **Output Directory** (`output_dir='./dbart_results'`): Saves model results like checkpoints and logs.
- **Evaluation Strategy** (`evaluation_strategy="epoch"`): Evaluates model performance after each epoch for regular feedback.
- **Learning Rate** (`learning_rate=5e-5`): Standard value to balance stable learning and effective training.
- **Batch Size** (`per_device_train_batch_size=4`, `per_device_eval_batch_size=2`): Small batch size to manage memory on large models.
- **Number of Epochs** (`num_train_epochs=10`): Chosen to allow sufficient learning without overfitting.
- **Save Strategy** (`save_strategy="epoch"`, `save_steps=10`): Saves a checkpoint at the end of every epoch to avoid losing progress; with an epoch strategy, `save_steps` is not used.
- **Weight Decay** (`weight_decay=0.01`): Prevents overfitting by penalizing large weights.
- **Logging** (`logging_steps=10`, `logging_dir='./dbart_logs'`): Regular logging for monitoring progress.
- **Evaluation Accumulation Steps** (`eval_accumulation_steps=4`): Moves accumulated predictions to the CPU every 4 evaluation steps to limit GPU memory use during evaluation.
- **Load Best Model at End** (`load_best_model_at_end=True`): Loads the best model based on evaluation metric.
- **Metric for Best Model** (`metric_for_best_model="eval_rougeL"`): Uses ROUGE-L to measure text generation quality.

In [142]:
training_args = SFTConfig(
    output_dir='./dbart_results',
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=2,
    num_train_epochs=10,
    save_total_limit=2,
    weight_decay=0.01,
    logging_dir='./dbart_logs',
    logging_steps=10,
    save_strategy="epoch",
    save_steps=10,
    eval_accumulation_steps=4,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rougeL",
)

trainer = SFTTrainer(
    model=dbart_model,  # preloaded model
    args=training_args,
    train_dataset=tokenized_train_ds["train"],
    eval_dataset=tokenized_val_ds["train"],
    tokenizer=dbart_tokenizer,
    data_collator=data_collator,
    compute_metrics=partial(compute_metrics, tokenizer=dbart_tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early stopping
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
)



In [143]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Bleu
1,5.3827,4.64535,0.1968,0.0739,0.1928,0.0179
2,4.0344,3.257308,0.4084,0.1712,0.4005,0.088
3,3.6219,2.793726,0.4269,0.1838,0.4186,0.0978
4,3.2392,2.557085,0.4292,0.1861,0.4215,0.1027
5,3.1373,2.428076,0.435,0.1918,0.4271,0.1098
6,3.0953,2.346195,0.4376,0.1944,0.4301,0.1112
7,3.0017,2.292832,0.4416,0.1994,0.4346,0.1146
8,3.0005,2.254009,0.444,0.2009,0.4368,0.1172
9,2.9164,2.234267,0.444,0.2015,0.4372,0.119
10,2.9435,2.227767,0.4443,0.2014,0.4374,0.1189


TrainOutput(global_step=7500, training_loss=3.7538461471557616, metrics={'train_runtime': 3677.8872, 'train_samples_per_second': 8.157, 'train_steps_per_second': 2.039, 'total_flos': 2.043294969672499e+16, 'train_loss': 3.7538461471557616, 'epoch': 10.0})

In [144]:

import os

# Find the latest checkpoint (most recently modified checkpoint-* directory)
checkpoint_dir = "./dbart_results"
checkpoints = [
    os.path.join(checkpoint_dir, d)
    for d in os.listdir(checkpoint_dir)
    if d.startswith("checkpoint-")
]
latest_checkpoint = max(checkpoints, key=os.path.getmtime)

# Load the model and tokenizer from the latest checkpoint
db_ch_model = BartForConditionalGeneration.from_pretrained(latest_checkpoint)
db_ch_tokenizer = BartTokenizer.from_pretrained(latest_checkpoint)

# Save the model and tokenizer to a new directory
db_ch_model.save_pretrained("./db_ch_model")
db_ch_tokenizer.save_pretrained("./db_ch_tokenizer")

('./db_ch_tokenizer/tokenizer_config.json',
 './db_ch_tokenizer/special_tokens_map.json',
 './db_ch_tokenizer/vocab.json',
 './db_ch_tokenizer/merges.txt',
 './db_ch_tokenizer/added_tokens.json')
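
The cell above selects the newest entry in `./dbart_results` by modification time. Assuming the Trainer's usual `checkpoint-<step>` directory naming, a slightly more robust sketch is to parse the step number instead, which ignores stray files and log folders. `latest_checkpoint_by_step` is a hypothetical helper, demonstrated here on a temporary directory rather than the real results folder.

```python
import os
import re
import tempfile

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint_by_step(checkpoint_dir):
    """Return the `checkpoint-<step>` subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(checkpoint_dir, name)
        if match and os.path.isdir(path):
            step = int(match.group(1))
            if step > best_step:
                best_step, best_path = step, path
    return best_path


# Demo on a throwaway directory with made-up checkpoint names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-90", "checkpoint-7500", "checkpoint-100", "runs"):
        os.makedirs(os.path.join(tmp, name))
    picked = os.path.basename(latest_checkpoint_by_step(tmp))

print(picked)  # checkpoint-7500
```

Note that with `load_best_model_at_end=True`, the latest checkpoint is not necessarily the best one. The Trainer records the best checkpoint path in `trainer.state.best_model_checkpoint`, which may be the better choice when the trainer object is still in memory.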

https://wandb.ai/kabirj2505-none/huggingface/reports/Bart-logicloom--VmlldzoxMDgxODU0OA?accessToken=lvhbe1svzojdypxk1kvhbfz0q2dq3zbf5xi7m5utp8dh796faoct4ipywh09ynep

In [157]:
from IPython.display import IFrame

# Provide the URL of your WandB report
wandb_report_url = "https://wandb.ai/kabirj2505-none/huggingface/reports/Bart-logicloom--VmlldzoxMDgxODU0OA?accessToken=lvhbe1svzojdypxk1kvhbfz0q2dq3zbf5xi7m5utp8dh796faoct4ipywh09ynep"

# Use IFrame to embed the report in the notebook
IFrame(wandb_report_url, width=2000, height=800)

# Test

### Generating Headlines Using the Model

Once the fine-tuned model is loaded and moved to the available device (GPU or CPU), we generate headlines for the articles in the **unlabelled test dataset**. The function `generate_headline()`:

- Takes an article, prepends the `headline: ` prefix, and tokenizes it (truncated to 512 tokens) as input to the fine-tuned BART model.
- Uses **beam search** with `num_beams=5`, which usually yields better headlines than greedy decoding.
- Decodes the generated token sequence into a human-readable headline.

The headlines for all articles in **UNLABELLED_TEST.csv** are then collected in the `predictions` list.

#### Key Decisions:
- **Max Length**: Set to 80 tokens (`max_length=80`) to keep the generated headlines concise and relevant.
- **Beam Search**: Improves output quality by keeping several candidate sequences at each step instead of committing to the single most likely token.
- **Device**: The model and data are moved to the appropriate device (GPU or CPU) for efficient computation.
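
To make the beam search decision more concrete, here is a toy sketch that is independent of `transformers`. A small hand-written next-token table stands in for the model, and the search keeps the `num_beams` highest-scoring partial sequences at each step. The table and the `beam_search` function are hypothetical; the real `generate()` implementation adds details such as length penalties that are omitted here.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
    "w": {"</s>": 1.0},
}


def beam_search(num_beams, max_length=5):
    """Return (tokens, log_prob) of the best finished sequence found."""
    beams = [(["<s>"], 0.0)]
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            for token, prob in NEXT_TOKEN_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Keep only the num_beams best candidates at this step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Every path in this toy table ends within three steps, so `finished` is non-empty.
    return max(finished, key=lambda c: c[1])


greedy_tokens, greedy_score = beam_search(num_beams=1)
beam_tokens, beam_score = beam_search(num_beams=2)
print(greedy_tokens, round(math.exp(greedy_score), 2))
print(beam_tokens, round(math.exp(beam_score), 2))
```

With `num_beams=1` (greedy decoding) the search commits to `a` because it is the likeliest first token, and ends with total probability 0.3. With `num_beams=2` it also keeps `b` alive and finds the sequence `b z` with probability 0.36.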

In [145]:
db_ch_model.to(device)

test_ds = load_dataset("csv", data_files="./data/UNLABELLED_TEST.csv")


In [152]:
def generate_headline(article, model, tokenizer):
    # Preprocess the input for the model passed in (here, the fine-tuned BART checkpoint)
    input_text = f"headline: {article}"
    inputs = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True).to(device)

    # Generate predictions
    outputs = model.generate(
        inputs,
        max_length=80,  # Adjust this value for desired headline length
        num_beams=5,    # Beam search for better results
        early_stopping=True
    )

    # Decode the output
    predicted_headline = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return predicted_headline


In [153]:
predictions = []
for article in test_ds["train"]["News Article"]:
    headline = generate_headline(article,db_ch_model,db_ch_tokenizer)
    predictions.append(headline)


In [155]:
import pandas as pd
output_df = pd.DataFrame({
    "ID": test_ds["train"]["ID"],
    "Prediction": predictions
})

In [156]:
output_df.to_csv("predicted_headlines.csv", index=False)
print("Headlines generated and saved to predicted_headlines.csv.")

Headlines generated and saved to predicted_headlines.csv.


## Comparison Between T5 and BART-CNN for Text Generation

### T5 (Text-to-Text Transfer Transformer)
- **Model**: `t5-base`
- **Pretrained on**: C4 dataset
- **Number of Parameters**: 220 million
- **Architecture**: Encoder-decoder architecture.
- **Use Case**: T5 is a general-purpose model that treats all NLP tasks as text-to-text transformations, where both input and output are always text.

#### Hyperparameters:
- **Learning Rate**: 5e-5
- **Batch Size**: 4 (train), 2 (eval)
- **Epochs**: 10
- **Evaluation Metric**: ROUGE-L
- **Save Strategy**: Save model after each epoch

---

### BART-CNN (BART fine-tuned on CNN/Daily Mail for summarization)
- **Model**: `facebook/bart-large-cnn`
- **Pretrained on**: A large general text corpus with a denoising objective, then fine-tuned on the CNN/Daily Mail summarization dataset. The "CNN" in the name refers to this dataset, not to convolutional networks.
- **Number of Parameters**: 406 million
- **Architecture**: Encoder-decoder architecture, pretrained as a denoising autoencoder.
- **Use Case**: BART is a sequence-to-sequence model that excels in text generation and summarization tasks. It is pretrained to reconstruct corrupted text, which transfers well to generation tasks.

#### Hyperparameters:
- **Learning Rate**: 5e-5
- **Batch Size**: 4 (train), 2 (eval)
- **Epochs**: 10
- **Evaluation Metric**: ROUGE-L
- **Save Strategy**: Save model after each epoch

---

### Reasons Why BART-CNN May Perform Better Than T5-Base

1. **Larger Number of Parameters**:
   - **T5-Base** has 220 million parameters, whereas **BART-CNN** has 406 million. The larger number of parameters in BART-CNN allows it to capture more intricate relationships in data and provides greater expressiveness, making it more capable of handling complex text generation tasks.

2. **Specialized Pretraining on Summarization Tasks**:
   - BART-CNN is BART fine-tuned on the CNN/Daily Mail summarization dataset, which aligns closely with headline generation. This gives it a head start when generating concise summaries or headlines, whereas T5-Base is a general-purpose checkpoint.

3. **Denoising Autoencoder Approach**:
   - BART is pretrained as a denoising autoencoder: the input text is corrupted (for example, by masking spans of tokens) and the model is trained to reconstruct the original text. This objective trains the decoder to produce fluent text from an incomplete view of the input, which is loosely analogous to writing a headline that keeps only the key details of a longer article.

4. **Higher Capacity for Learning Complex Representations**:
   - With BART-CNN’s larger capacity (406M params vs. 220M in T5-Base), it is able to learn more complex representations of the input text. This helps it produce more nuanced and accurate outputs, particularly for tasks requiring high-level abstraction, such as summarizing news articles into concise headlines.

5. **Superior Performance on Sequence-to-Sequence Tasks**:
   - BART-CNN is a strong baseline for abstractive summarization, which is usually attributed to its denoising pretraining and its size. These advantages are expected to carry over to headline generation, where the model must condense a long article into a succinct form.

6. **Greater Fine-tuning Potential**:
   - With more parameters and an initialization already tuned for summarization, BART-CNN may benefit more from fine-tuning on a task-specific dataset. With proper fine-tuning, it can be expected to outperform T5-Base on headline generation and similar tasks where condensing information is crucial.
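
The denoising objective from point 3 can be illustrated with a toy corruption function. This sketch shows a simplified form of text infilling: one contiguous span of tokens is replaced by a single `<mask>` token, and the training target is the original, uncorrupted text. BART's actual pretraining corrupts many spans of random length and also shuffles sentences; the fixed span, the `infill_span` helper, and the example sentence are assumptions made for illustration.

```python
def infill_span(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start + length] with a single mask token."""
    if start < 0 or length < 0 or start + length > len(tokens):
        raise ValueError("span must lie inside the token list")
    return tokens[:start] + [mask_token] + tokens[start + length:]


# Hypothetical sentence, split on whitespace for simplicity
original = "the central bank raised interest rates on tuesday".split()
corrupted = infill_span(original, start=3, length=3)

print(" ".join(corrupted))  # encoder input: the central bank <mask> on tuesday
print(" ".join(original))   # decoder target: the full original sentence
```

Because a single mask can hide several tokens, the model must infer both how much text is missing and what it says, which exercises the same generate-from-context skill that summarization relies on.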

### Conclusion
Given its larger number of parameters, its fine-tuning on a summarization dataset, and its denoising pretraining objective, BART-CNN is expected to outperform T5-Base on headline generation. Its ability to generate more precise and coherent summaries makes it the better candidate for this task.