## **Objective**:

To fine-tune a pre-trained Transformer-based text summarization model on custom data using an NVIDIA CUDA-enabled GPU.

Here's what it means in simpler terms:

  >Fine-tuning: We take a pre-trained model (here, Pegasus) that already has a general understanding of language and train it further on a specific dataset ("samsum"). This makes it more accurate and effective for this use case.

  Transformer-based: The model is built on the Transformer, a neural network architecture known for its strong performance in natural language processing tasks.

  >Text summarization: The project aims to train a model that can automatically generate concise summaries of text.

  Custom data: We don't rely only on the original training data of the pre-trained model; we adapt it to a new dataset to improve its performance at summarizing dialogues.

  >NVIDIA CUDA-enabled GPU: We use an NVIDIA GPU to accelerate training. The computations run in parallel on the GPU through the CUDA platform.

  Pretrained model : https://huggingface.co/google/pegasus-cnn_dailymail

Dataset : https://huggingface.co/datasets/Samsung/samsum

In [None]:
!nvidia-smi
# gpu info

Mon Dec  2 09:45:30 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [None]:
# install transformer libraries
!pip install transformers[sentencepiece] datasets rouge_score py7zr -q



In [None]:
# install / upgrade accelerate
!pip install accelerate -U -q

!pip install transformers accelerate -q

In [None]:
# Hugging Face pipeline API:
from transformers import pipeline, set_seed
import matplotlib.pyplot as plt
import pandas as pd
from datasets import load_dataset
import nltk
from nltk.tokenize import sent_tokenize
nltk.download("punkt")


from tqdm import tqdm
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


In [None]:
# check if cuda is available
cud_av = "cuda" if torch.cuda.is_available() else "cpu"
cud_av

'cuda'

In [None]:
model = "google/pegasus-cnn_dailymail"
# load the tokenizer and move the model to the GPU (CUDA) if one is available
tokenizer = AutoTokenizer.from_pretrained(model)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model).to(cud_av)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model_pegasus

### loading the data set :


In [None]:
# loading the data set :
dataset_samsum = load_dataset("samsum")
dataset_samsum

The repository for samsum contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/samsum.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


corpus.7z:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

load_from_disk("samsum_dataset") can be used instead when the dataset has already been downloaded and saved to local disk; it loads the saved copy rather than fetching it again.

In [None]:
print(dataset_samsum["train"]["dialogue"][1])

Olivia: Who are you voting for in this election? 
Oliver: Liberals as always.
Olivia: Me too!!
Oliver: Great


In [None]:
print(dataset_samsum["train"]["summary"][1])

Olivia and Olivier are voting for liberals in this election. 


In [None]:
# shape of the data  :
dataset_samsum["train"].shape

(14732, 3)

In [None]:
print(dataset_samsum["test"].shape , dataset_samsum["validation"].shape)

(819, 3) (818, 3)


>Most transformer models have their own dedicated tokenizers.

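Sequence-to-sequence transformers like Pegasus don't operate on raw text directly. They work on numerical representations, which the tokenizer produces in three steps: tokenization (splitting the text into subword tokens), conversion of the tokens to IDs, and creation of attention masks that mark which positions hold real tokens rather than padding.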
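As a rough illustration of those three steps, the sketch below uses a tiny whitespace tokenizer with a made-up vocabulary. The vocabulary, the `PAD_ID` and `UNK_ID` values, and the `encode` helper are all hypothetical; the real Pegasus tokenizer uses a SentencePiece subword model rather than whitespace splitting.

```python
# Toy illustration only: a whitespace "tokenizer" with a hypothetical vocabulary.
PAD_ID, UNK_ID = 0, 1
VOCAB = {"<pad>": PAD_ID, "<unk>": UNK_ID,
         "who": 2, "are": 3, "you": 4, "voting": 5, "for": 6}

def encode(text, max_length):
    """Tokenize, map tokens to IDs, truncate/pad to max_length, build the mask."""
    tokens = text.lower().split()                      # 1. tokenization
    ids = [VOCAB.get(tok, UNK_ID) for tok in tokens]   # 2. conversion to IDs
    ids = ids[:max_length]                             # truncation
    mask = [1] * len(ids)                              # 3. attention mask: 1 = real token
    padding = max_length - len(ids)
    return {"input_ids": ids + [PAD_ID] * padding,
            "attention_mask": mask + [0] * padding}

encoded = encode("Who are you voting for", max_length=8)
print(encoded["input_ids"])       # [2, 3, 4, 5, 6, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The model only ever sees `input_ids` and `attention_mask`; the mask tells it to ignore the padded positions.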

In [None]:
# convert the data for seq2seq models:

def data_to_features(example_batch):
    # Tokenize the dialogues (model inputs), truncating to at most 1024 tokens
    input_encodings = tokenizer(example_batch["dialogue"], max_length=1024, truncation=True)

    # Tokenize the summaries (targets). `text_target` replaces the deprecated
    # `as_target_tokenizer()` context manager in recent transformers releases.
    target_encodings = tokenizer(text_target=example_batch["summary"], max_length=128, truncation=True)

    return {
        "input_ids": input_encodings["input_ids"],
        "attention_mask": input_encodings["attention_mask"],
        "labels": target_encodings["input_ids"],
    }

In [None]:
# applying the func to data
dataset_samsum_tok = dataset_samsum.map(data_to_features , batched=True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

In [None]:
dataset_samsum_tok["train"]

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [None]:
# dataset_samsum_tok["test"]

DataCollatorForSeq2Seq is a specialized data collator provided by the Hugging Face transformers library. It is designed for sequence-to-sequence (seq2seq) models such as Pegasus.

Uses :

>Dynamic Padding: Seq2seq models work with variable-length input and output sequences. DataCollatorForSeq2Seq pads the sequences in each batch to the length of the longest one in that batch, so they can be stacked into tensors. It pads both the inputs (dialogue) and the targets (summary); label padding uses -100 so that padded positions are ignored by the loss.

>Decoder Input Shifting: In seq2seq training, the decoder predicts the next token of the target sequence from the previous tokens. To achieve this, the target sequence is shifted one position to the right to form the decoder inputs. When the model is passed to DataCollatorForSeq2Seq (as it is here), the collator builds these shifted decoder inputs from the labels automatically.


>Creating Batches: It combines individual data samples into batches suitable for training or evaluation.
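The three points above can be sketched in plain Python. This is a simplified stand-in, not the library's implementation: the token IDs are made up, and `PAD_ID` and `DECODER_START_ID` are assumed values chosen for the example.

```python
# Toy sketch of what a seq2seq collator does; all token IDs are made up.
PAD_ID = 0             # assumed padding token ID
LABEL_PAD_ID = -100    # label value ignored by PyTorch's cross-entropy loss
DECODER_START_ID = 0   # assumed decoder start token ID

def pad(seq, length, value):
    return seq + [value] * (length - len(seq))

def shift_right(labels):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [DECODER_START_ID] + labels[:-1]
    # -100 is not a valid token ID, so swap it back to the padding token
    return [PAD_ID if tok == LABEL_PAD_ID else tok for tok in shifted]

def collate(features):
    max_in = max(len(f["input_ids"]) for f in features)
    max_out = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": [], "decoder_input_ids": []}
    for f in features:
        n = len(f["input_ids"])
        labels = pad(f["labels"], max_out, LABEL_PAD_ID)
        batch["input_ids"].append(pad(f["input_ids"], max_in, PAD_ID))
        batch["attention_mask"].append(pad([1] * n, max_in, 0))
        batch["labels"].append(labels)
        batch["decoder_input_ids"].append(shift_right(labels))
    return batch

batch = collate([
    {"input_ids": [11, 12, 13, 14], "labels": [21, 22, 1]},
    {"input_ids": [15, 16], "labels": [23, 1]},
])
print(batch["input_ids"])          # [[11, 12, 13, 14], [15, 16, 0, 0]]
print(batch["labels"])             # [[21, 22, 1], [23, 1, -100]]
print(batch["decoder_input_ids"])  # [[0, 21, 22], [0, 23, 1]]
```

Padding to the longest sequence in each batch, rather than to a fixed maximum, keeps short batches small and saves GPU memory.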


In [None]:
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model_pegasus)

>TrainingArguments :

This class is used to define all the hyperparameters and configurations related to the training process.

>Trainer:

 The Trainer class provides a high-level API for training PyTorch models, particularly those from the Transformers library. It simplifies the training loop and handles common tasks like gradient accumulation, logging, and evaluation.

In [None]:
from transformers import TrainingArguments , Trainer

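training_args = TrainingArguments(
      output_dir="./results",          # Directory to save model checkpoints
      per_device_train_batch_size=4,   # Training batch size per GPU/CPU
      per_device_eval_batch_size=4,    # Evaluation batch size per GPU/CPU
      num_train_epochs=3,              # Number of training epochs
      learning_rate=2e-5,              # Learning rate
      weight_decay=0.01,
      logging_steps=10,
      evaluation_strategy='steps',
      eval_steps=500,                  # Evaluate every 500 optimizer steps (a shorter run never triggers it)
      save_steps=1_000_000,            # Large value, so no intermediate checkpoints are written
      gradient_accumulation_steps=16,  # Accumulate gradients over 16 batches per optimizer step
      gradient_checkpointing=True,     # Recompute activations in the backward pass to save memory
      fp16=True,                       # Enable mixed precision training to reduce memory usage
      report_to="none"                 # No external logger; "wandb" or "tensorboard" can be used to track metrics
  )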



> Training on all 14732 examples, and then on 5000, failed with "CUDA out of memory", so the training set was cut to 1000 examples and gradient accumulation was reduced to 16 steps.
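To see what these settings imply for the length of the run, here is a back-of-the-envelope calculation. It assumes a single GPU and one optimizer step per full group of accumulated batches; the variable names are only for illustration.

```python
import math

num_examples = 1000          # size of the training subset
per_device_batch_size = 4
grad_accum_steps = 16
num_epochs = 3

effective_batch = per_device_batch_size * grad_accum_steps   # examples per optimizer step
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
# Assumption: one optimizer step per full group of accumulated batches
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs

print(effective_batch, batches_per_epoch, steps_per_epoch, total_steps)  # 64 250 15 45
```

Gradient accumulation keeps per-step memory at the level of a batch of 4 while the optimizer effectively sees 64 examples per update. The estimate of 45 steps is consistent with the `global_step=45` reported by the training run in this notebook.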

In [None]:
dataset_samsum_tok["train"] = dataset_samsum_tok["train"].shuffle(seed=42).select(range(1000))

In [None]:
trainer = Trainer(
      model=model_pegasus,            #  Pegasus model
      args=training_args,            # Training arguments you defined
      data_collator=data_collator,    # Data collator for your dataset
      train_dataset=dataset_samsum_tok["train"],  # Training dataset
      eval_dataset=dataset_samsum_tok["validation"], # Evaluation dataset
  )


PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.

This environment variable is a way to fine-tune how PyTorch manages memory allocation on your NVIDIA GPU. It's specifically designed to address a common issue called memory fragmentation.

What is memory fragmentation?

Imagine your GPU memory as a parking lot. When you first start, it's empty and you can park large vehicles (tensors) easily. However, as you allocate and deallocate memory (cars entering and leaving), the free space becomes fragmented into smaller, non-contiguous chunks. This makes it difficult to find a single, large space for a new, big tensor, even if there's enough total free space available. This leads to the "CUDA out of memory" error.
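The parking-lot picture can be made concrete with a toy model in which memory is a list of slots. This is only an illustration of fragmentation and says nothing about how the PyTorch allocator is actually implemented.

```python
# Toy model of a memory pool: each slot is free (None) or owned by a tensor name.
def largest_free_run(pool):
    """Length of the longest run of contiguous free slots."""
    best = current = 0
    for slot in pool:
        current = current + 1 if slot is None else 0
        best = max(best, current)
    return best

pool = [None] * 8                      # empty "parking lot" with 8 slots
for i, name in enumerate("ABCDEFGH"):  # eight small tensors fill every slot
    pool[i] = name
for i in (1, 3, 5, 7):                 # free every other tensor
    pool[i] = None

total_free = pool.count(None)
print(total_free)              # 4
print(largest_free_run(pool))  # 1
# A request for 4 contiguous slots fails even though 4 slots are free in total.
```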

How does expandable_segments:True help?

This setting aims to mitigate fragmentation by enabling a feature called "expandable segments" within the PyTorch CUDA memory allocator. Instead of reserving many separate fixed-size segments, the allocator creates segments that can later be grown when a larger allocation request comes in, so fewer unusable gaps are left between blocks. This reduces fragmentation and allows PyTorch to use the available GPU memory more efficiently.

In simpler terms:

By setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, you're telling PyTorch to organize GPU memory in a way that limits fragmentation, which helps avoid "out of memory" errors in cases where enough free memory technically exists.
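One practical caveat: the variable only has an effect if it is visible to the Python process that runs PyTorch, and it should be set before PyTorch first touches the GPU. In a notebook, a `!command` runs in a temporary subshell, so a shell `export` there does not reach the kernel. The sketch below demonstrates this with a hypothetical variable name, `DEMO_ALLOC_CONF`, and assumes a POSIX shell.

```python
import os
import subprocess

VAR = "DEMO_ALLOC_CONF"  # hypothetical variable name, used only for this demonstration
os.environ.pop(VAR, None)

# A shell `export` runs in a child process; the change is lost when the child exits.
subprocess.run(f"export {VAR}=expandable_segments:True", shell=True, check=True)
after_export = os.environ.get(VAR)
print(after_export)  # None

# Assigning to os.environ changes the environment of the current Python process.
os.environ[VAR] = "expandable_segments:True"
after_assignment = os.environ.get(VAR)
print(after_assignment)  # expandable_segments:True
```

In a notebook, either `os.environ[...]` or the `%env` magic sets the variable inside the kernel process.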

In [None]:
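# `!export` would only set the variable in a temporary subshell, so set it in
# the Python process itself (ideally before the first CUDA allocation)
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"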

In [None]:
trainer.train()  # Start the training

TrainOutput(global_step=45, training_loss=2.5765409893459745, metrics={'train_runtime': 343.146, 'train_samples_per_second': 8.743, 'train_steps_per_second': 0.131, 'total_flos': 2004024162779136.0, 'train_loss': 2.5765409893459745, 'epoch': 2.896})


## Model - Evaluation :


In [None]:
!pip install --upgrade datasets

In [None]:
!pip install evaluate

from evaluate import load
# common rouge score :
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load('rouge')
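For intuition about what ROUGE-1 measures, here is a simplified unigram-overlap F1 written by hand. It is a sketch, not the `rouge_score` implementation: it skips stemming and uses plain whitespace tokenization, and the two example sentences are made up.

```python
from collections import Counter

def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 on lowercased whitespace tokens (simplified ROUGE-1)."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((pred_counts & ref_counts).values())  # clipped matching unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# 3 shared unigrams: precision 3/5, recall 3/4, F1 of about 0.667
example_score = rouge1_f1("jack will bring the notebook", "jack has the notebook")
print(round(example_score, 3))
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L is based on the longest common subsequence instead of fixed-size n-grams.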

In [None]:
def giveme_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i:i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device="cuda", # changed to use string literal "cuda"
                               column_text="article",
                               column_summary="highlights"):
    article_batches = list(giveme_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(giveme_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        inputs = tokenizer(article_batch, max_length=512,  # Reduced max_length
                        truncation=True,
                        padding="max_length", return_tensors="pt")

        # A length_penalty below the default of 1.0 rewards long sequences less
        # during beam search, which discourages overly long summaries.
        summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                         attention_mask=inputs["attention_mask"].to(device),
                         length_penalty=0.8, num_beams=4,  # Reduced num_beams
                         max_length=128)

        # Finally, we decode the generated texts, replace the <n> token,
        # and add the decoded texts with the references to the metric.
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                clean_up_tokenization_spaces=True)
               for s in summaries]

        # Pegasus marks line breaks with "<n>". Replace that token, not the
        # empty string, which would insert a space between every character.
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]

        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    # compute the score over all accumulated batches
    score = metric.compute()

    return score
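The `<n>` replacement step deserves a quick check, because the target string matters a great deal. The snippet below uses a made-up summary string to show the difference between replacing the `<n>` marker and replacing the empty string.

```python
summary = "Jack has the notebook .<n>He will bring it tomorrow ."  # made-up example string

# Replacing the "<n>" marker turns the line breaks into spaces.
cleaned = summary.replace("<n>", " ")
print(cleaned)  # Jack has the notebook . He will bring it tomorrow .

# Replacing the empty string matches between every character, so it
# inserts a space everywhere and leaves no intact words.
broken = "abc".replace("", " ")
print(repr(broken))  # ' a b c '
```

Since ROUGE compares word n-grams, a summary whose words have been split into single characters shares almost nothing with the reference.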

In [None]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:15], rouge_metric, trainer.model, tokenizer,
    batch_size=1, column_text='dialogue', column_summary='summary'
)

# the evaluate ROUGE metric returns one float per metric name
rouge_dict = dict((rn, score[rn]) for rn in rouge_names)

pd.DataFrame(rouge_dict, index=['pegasus'])

100%|██████████| 15/15 [00:27<00:00,  1.82s/it]


           rouge1  rouge2    rougeL  rougeLsum
pegasus  0.020708     0.0  0.020552   0.020765


In [None]:
# save the model  :
model_pegasus.save_pretrained("finetued_pegasus_model")
# saving tokenizer :
tokenizer.save_pretrained("tokenizer")


('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [None]:
# loading the tokenizer :

tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

In [None]:

# loading the tokenizer :
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

# test prediction :
gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 90}

sample_text = dataset_samsum["test"][45]["dialogue"]
reference = dataset_samsum["test"][45]["summary"]


# device=-1 runs inference on the CPU
pipe = pipeline("summarization", model="finetued_pegasus_model", tokenizer=tokenizer, device=-1)

print("Dialogue:")
print(sample_text)

print("\nReference Summary:")
print(reference)

print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Your max_length is set to 90, but your input_length is only 80. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=40)


Dialogue:
Josh: Stephen, I think you've accidentaly taken my notebook home
Stephen: wait lemme check
Stephen: nope, I don't see it anywhere
Jack: oh shit, I've got it xDDD I don't even know why
Josh: xDDD ok, no problem, cool I know where it is
Jack: I'll bring it tomorow

Reference Summary:
Josh thinks Stephen accidentally took his notebook. Jack has it and will bring it tomorrow.

Model Summary:
Stephen accidentally takes Josh's notebook home .<n>Stephen doesn't see it anywhere, Jack knows where it is .<n>Josh will bring it tomorow .

