In [1]:
#!nvidia-smi

In [2]:
#!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q

In [3]:
#!pip install --upgrade accelerate
#!pip uninstall -y transformers accelerate
#!pip install transformers accelerate

In [4]:
#pipeline:	High-level API for using NLP models
#set_seed:	Controls randomness for reproducibility, Without set_seed(), running the same code multiple times could give different outputs. With it, you get consistent results every time
from transformers import pipeline, set_seed
#load_dataset: Load public datasets (e.g., CNN/DailyMail, WMT),
#load_from_disk: Load a previously saved dataset from your local storage
#load_metric: Used to load standard evaluation metrics (e.g., BLEU, ROUGE, accuracy)
from datasets import load_dataset, load_from_disk, load_metric
import matplotlib.pyplot as plt
import pandas as pd

#AutoModelForSeq2SeqLM: Loads models for sequence-to-sequence tasks (e.g., translation, summarization).
#AutoTokenizer: Loads the correct tokenizer for the model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

import nltk
from nltk.tokenize import sent_tokenize

#Adds a progress bar to loops (helpful for tracking long processes)
from tqdm import tqdm
import torch

#punkt is a pre-trained model that splits text into sentences.
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

The pipeline function is a high-level API that makes it extremely easy to use pre-trained models for common NLP tasks without manually loading models or tokenizers.

How it works:
Under the hood, it automatically:

1. Selects the correct model and tokenizer.

2. Downloads the model if not already cached.

3. Runs preprocessing, inference, and postprocessing.

In [5]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

1. model_ckpt = "google/pegasus-cnn_dailymail"
- This sets a variable model_ckpt (short for "model checkpoint") to the name of a pre-trained model hosted on the Hugging Face Model Hub.
- "google/pegasus-cnn_dailymail" refers to a specific variant of the PEGASUS model trained by Google on the CNN/DailyMail dataset for abstractive summarization.
  - PEGASUS is a state-of-the-art text summarization model built on the Transformer architecture (similar to BERT/GPT).
  - The CNN/DailyMail dataset contains news articles paired with human-written summaries, commonly used for training summarization models.

2. tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
- This line loads the tokenizer associated with the model checkpoint.
- AutoTokenizer is a Hugging Face class that automatically selects the correct tokenizer class for the specified model.
- from_pretrained(model_ckpt) downloads the tokenizer configuration (like vocabulary, special tokens, etc.) for "google/pegasus-cnn_dailymail".
- The tokenizer:
  - Converts raw text into token IDs (input for the model).
  - Adds necessary special tokens like <pad>, <s>, </s>, etc.
  - Handles batching, padding, and truncation.

3. model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)
- This line loads the pre-trained PEGASUS model itself.
- AutoModelForSeq2SeqLM is a class for sequence-to-sequence models (like encoder-decoder transformers) used for tasks like summarization, translation, and text generation.
- from_pretrained(model_ckpt) downloads the pre-trained weights and configuration for "google/pegasus-cnn_dailymail".
- .to(device) moves the model to the appropriate hardware:
  - device can be 'cuda' (GPU) or 'cpu', depending on your setup.
  - This allows the model to run efficiently on available hardware.


In [6]:
model_ckpt = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/88.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.12k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/1.91M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/65.0 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.28G [00:00<?, ?B/s]

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


generation_config.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

In [7]:
#download and unzip data
#wget is a command-line utility used to download files from the web.
!wget https://github.com/entbappy/Branching-tutorial/raw/master/summarizer-data.zip
!unzip summarizer-data.zip

--2025-06-07 20:37:04--  https://github.com/entbappy/Branching-tutorial/raw/master/summarizer-data.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/summarizer-data.zip [following]
--2025-06-07 20:37:05--  https://raw.githubusercontent.com/entbappy/Branching-tutorial/master/summarizer-data.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7903594 (7.5M) [application/zip]
Saving to: ‘summarizer-data.zip’


2025-06-07 20:37:07 (5.45 MB/s) - ‘summarizer-data.zip’ saved [7903594/7903594]

Archive:  summarizer-data.zip
  inflating: samsum-test.csv         
  inf

In [8]:
import os
print(os.listdir('.'))

['.config', 'summarizer-data.zip', 'samsum-train.csv', 'samsum-test.csv', 'samsum_dataset', 'samsum-validation.csv', 'sample_data']


In [9]:
dataset_samsum = load_from_disk('file://samsum_dataset')
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In the DatasetDict, the keys and values are structured like a Python dictionary
  - Key: 'train'
  - Value: a Dataset object with 14,732 rows and the specified features

In [10]:
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]
print(f"Split lengths: {split_lengths}")
print(f"Features: {dataset_samsum['train'].column_names}")

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']


In [11]:
print("Dialogue")
print(dataset_samsum['test'][0]['dialogue'])
print("Summary")
print(dataset_samsum['test'][0]['summary'])

Dialogue
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye
Summary
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.


In [12]:
def convert_examples_to_features(example_batch):
  input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)

  with tokenizer.as_target_tokenizer():
    target_encodings = tokenizer(example_batch['summary'], max_length=128, truncation=True)

  return{
      'input_ids' : input_encodings['input_ids'],
      'atttention_masks' : input_encodings['attention_mask'],
      'labels' : target_encodings['input_ids']
  }

In [13]:
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True)

Map:   0%|          | 0/14732 [00:00<?, ? examples/s]



Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

In [14]:
dataset_samsum_pt['train']

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'atttention_masks', 'labels'],
    num_rows: 14732
})

### Training

In [15]:
#DataCollatorForSeq2Seq is a utility from the Hugging Face transformers library
#Dynamically Pad all inputs (input_ids, attention_mask) to the same length. Pad the labels (labels) as well. By default, it pads labels with -100 (not 0) because in PyTorch, the cross-entropy loss ignores the label -100
from transformers import DataCollatorForSeq2Seq
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

The TrainingArguments class from Hugging Face's transformers library is used to configure training options for the Trainer API


In [16]:
from transformers import TrainingArguments, Trainer

trainer_args = TrainingArguments(
    output_dir='pegasus-samsum', num_train_epochs=1, warmup_steps=500,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    weight_decay=0.01, logging_steps=10,
    eval_strategy='steps', eval_steps=500, save_steps=1e6,
    gradient_accumulation_steps=16
)

In [17]:
trainer = Trainer(model=model_pegasus, args=trainer_args,
                  tokenizer=tokenizer, data_collator=seq2seq_data_collator,
                  train_dataset=dataset_samsum_pt["train"],
                  eval_dataset=dataset_samsum_pt["validation"])

  trainer = Trainer(model=model_pegasus, args=trainer_args,


In [18]:
import os
os.environ["WANDB_MODE"] = "disabled"

In [19]:
trainer.train()



Step,Training Loss,Validation Loss
500,1.6533,1.487004




TrainOutput(global_step=921, training_loss=1.8247759614006316, metrics={'train_runtime': 2758.207, 'train_samples_per_second': 5.341, 'train_steps_per_second': 0.334, 'total_flos': 5531718781673472.0, 'train_loss': 1.8247759614006316, 'epoch': 1.0})

###Evalution

In [20]:
def generate_batch_sized_chunks(list_of_elements, batch_size):
  """split the dataset into smaller batches that we can process simultaneously
  Yield successive batch-sized chunks from list_of_elements."""
  for i in range(0, len(list_of_elements), batch_size):
    yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer, batch_size=16,
                                device=device, column_text='article',
                                column_summary="highlights"):
  article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
  target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

  for article_batch, target_batch in tqdm(
      zip(article_batches, target_batches), total=len(article_batches)):

      inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                         padding="max_length", return_tensors="pt")

      summaries = model.generate(input_ids=inputs["input_ids"].to(device),
                                 attention_mask=inputs["attention_mask"].to(device),
                                 length_penalty=0.8, num_beams=8, max_length=128)

      #Finally, we decode the generated texts
      #replace the token, and add the decoded texts with the references to the metric.
      decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                            clean_up_tokenization_spaces=True) for s in summaries]
      decoded_summaries = [d.replace("", " ") for d in decoded_summaries]

      metric.add_batch(predictions=decoded_summaries, references=target_batch)
  #finally compute and run the rouge score
  score = metric.compute()
  return score

In [21]:
pip install rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=d8129ab21f9ef1a972622b3b3c036f98374926d9b8fbb64d29c2e3fc682c41fe
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [22]:
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = load_metric('rouge')

  rouge_metric = load_metric('rouge')


Downloading builder script:   0%|          | 0.00/2.17k [00:00<?, ?B/s]

In [23]:
score = calculate_metric_on_test_ds(
    dataset_samsum['test'], rouge_metric, trainer.model, tokenizer, batch_size=2, column_text='dialogue', column_summary='summary'
)

rouge_dict = dict((rn, score[rn].mid.fmeasure) for rn in rouge_names)

pd.DataFrame(rouge_dict, index=[f'pegasus'])

  0%|          | 0/410 [00:00<?, ?it/s]Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.58.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
100%|██████████| 410/410 [18:10<00:00,  2.66s/it]


Unnamed: 0,rouge1,rouge2,rougeL,rougeLsum
pegasus,0.018665,0.000267,0.018542,0.018592


####Save Model

In [24]:
model_pegasus.save_pretrained("pegasus-samsum-model")

###Save Tokenizer

In [25]:
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

###Load

In [26]:
tokenizer = AutoTokenizer.from_pretrained("/content/tokenizer")

###Prediction

In [27]:
gen_kwargs = {"length_penalty":0.8, "num_beams":8, "max_length":128}

sample_text = dataset_samsum["test"][0]["dialogue"]
reference = dataset_samsum["test"][0]["summary"]

pipe = pipeline("summarization", model="pegasus-samsum-model", tokenizer=tokenizer)

print("Dialogue:")
print(sample_text)

print("\nReference Summary:")
print(reference)

print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])

Device set to use cuda:0
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda can't find Betty's number. Larry called Betty's last time they were at the park together. Hannah wants Amanda to text Larry.
