
-----

# **Text Summarization Using HuggingFace**

- Install these libraries.

```python
!pip install transformers[sentencepiece] datasets sacrebleu evaluate rouge_score py7zr -q
```
```python
!pip install --upgrade accelerate
!pip uninstall -y transformers accelerate
!pip install transformers accelerate
```

Here's a brief overview of each package and flag:

1. **transformers**: Hugging Face's library of pre-trained models for natural language processing (NLP) tasks such as text classification, translation, and summarization. It supports architectures like BERT, GPT-2, T5, and Pegasus (the model used in this notebook). The `[sentencepiece]` extra installs the SentencePiece tokenizer backend that the Pegasus tokenizer relies on.

2. **datasets**: Also from Hugging Face, this library gives easy access to a wide range of NLP datasets and simplifies loading, preprocessing, and splitting them for training and evaluation.

3. **sacrebleu**: Computes BLEU scores, a metric for evaluating machine-translated text, in a standardized and reproducible way. It is installed here but not used later in this notebook.

4. **evaluate**: Hugging Face's library for loading and computing evaluation metrics. This notebook uses it through `evaluate.load('rouge')`.

5. **rouge_score**: Computes ROUGE scores, the standard metrics for automatic summarization. ROUGE measures the n-gram overlap between generated and reference text and reports precision, recall, and F1. The `rouge` metric in `evaluate` depends on this package.

6. **py7zr**: A Python implementation for reading and writing 7z (7-Zip) archives. The SAMSum loader needs it to unpack the dataset archive.

The `-q` flag is not a library: it tells `pip` to run quietly and suppress most of its output.

These libraries are commonly used in NLP projects, particularly for model training and evaluation.
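To make the ROUGE description concrete, here is a minimal sketch of a ROUGE-1 style score computed by hand. It is a simplification for illustration only: the function name `rouge1` is hypothetical, tokens are split on whitespace, and no stemming is applied, so the numbers will not match the `rouge_score` package exactly.

```python
from collections import Counter


def rouge1(prediction: str, reference: str) -> dict:
    """Toy ROUGE-1: unigram-overlap precision, recall and F1 (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # Clipped overlap: a token counts at most as often as it appears in both texts
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())

    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)  # 5 of 6 unigrams overlap, so precision = recall = f1 = 5/6
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L replaces the n-gram overlap with the longest common subsequence.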


-----

### **1. Import Required Libraries**

In [None]:
# Import tqdm for creating progress bars in loops
from tqdm import tqdm

# Import PyTorch for tensor computations and model handling
import torch

# Import NLTK library for natural language processing tasks
import nltk
nltk.download("punkt")

# Import sentence tokenizer from NLTK
from nltk.tokenize import sent_tokenize

# Import pandas for data manipulation and analysis
import pandas as pd

# Import matplotlib for data visualization
import matplotlib.pyplot as plt

# Import evaluate for metrics (evaluate.load replaces the deprecated datasets.load_metric)
import evaluate

# Import Hugging Face Transformers library for using pre-trained models
from transformers import pipeline, set_seed

# Import model and tokenizer classes for sequence-to-sequence learning
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Import datasets library for loading and manipulating datasets
from datasets import load_dataset, load_from_disk

# Import DataCollatorForSeq2Seq from the Transformers library
# This class is used for dynamically padding sequences to the maximum length in a batch,
# making it suitable for sequence-to-sequence tasks during training or evaluation
from transformers import DataCollatorForSeq2Seq

# Import TrainingArguments and Trainer from the Transformers library
# TrainingArguments is a class that holds various parameters for training (like learning rate, batch size, etc.)
# Trainer is a high-level class that simplifies the training and evaluation of models
from transformers import TrainingArguments, Trainer

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### **Check if a GPU is available**

In [None]:
# Check if a GPU (CUDA) is available for computation
# If a GPU is available, set the device to "cuda"; otherwise, use "cpu"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Output the chosen device (either "cuda" or "cpu")
device

'cuda'

### **2. Load the Tokenizer**

In [None]:
# Specify the model checkpoint from Hugging Face's model hub
model_ckpt = "google/pegasus-cnn_dailymail"

# Load the tokenizer associated with the specified model checkpoint
# The tokenizer is responsible for converting text into tokens that the model can process
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

### **3. Load the pre-trained sequence-to-sequence model from the specified checkpoint**

In [None]:
# Load the model weights and move the model to the selected device (GPU or CPU)
model_pegasus = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-cnn_dailymail and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### **4. Load the SAMSum dataset, which is used for dialogue summarization tasks**

In [None]:
dataset_samsum = load_dataset("samsum")

In [None]:
dataset_samsum

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [None]:
# Inspect the second dialogue (index 1) in the training split
dataset_samsum["train"]["dialogue"][1]

'Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great'

### **5. Create a list that contains the lengths of each split in the SAMSum dataset**

In [None]:
# Iterate through the splits (train, test, validation, in that order) and record the number of rows in each
split_lengths = [len(dataset_samsum[split]) for split in dataset_samsum]

# Print the lengths of each dataset split
print(f"Split lengths: {split_lengths}")

# Print the feature names (column names) of the training set
print(f"Features: {dataset_samsum['train'].column_names}")

# Print a header for the dialogue section
print("\nDialogue:")

# Print the dialogue from the second entry (index 1) in the test set
print(dataset_samsum["test"][1]["dialogue"])

# Print a header for the summary section
print("\nSummary:")

# Print the summary corresponding to the second entry (index 1) in the test set
print(dataset_samsum["test"][1]["summary"])

Split lengths: [14732, 819, 818]
Features: ['id', 'dialogue', 'summary']

Dialogue:
Eric: MACHINE!
Rob: That's so gr8!
Eric: I know! And shows how Americans see Russian ;)
Rob: And it's really funny!
Eric: I know! I especially like the train part!
Rob: Hahaha! No one talks to the machine like that!
Eric: Is this his only stand-up?
Rob: Idk. I'll check.
Eric: Sure.
Rob: Turns out no! There are some of his stand-ups on youtube.
Eric: Gr8! I'll watch them now!
Rob: Me too!
Eric: MACHINE!
Rob: MACHINE!
Eric: TTYL?
Rob: Sure :)

Summary:
Eric and Rob are going to watch a stand-up on youtube.



### **6. Pre-Process the Dataset (Convert Example to Features)**

- This function processes a batch of examples by tokenizing dialogues and summaries, preparing them for model input by creating appropriate input IDs, attention masks, and labels.


In [None]:
def convert_examples_to_features(example_batch):
    # Encode the dialogues from the input batch using the tokenizer
    # Inputs longer than 1024 tokens are truncated
    input_encodings = tokenizer(example_batch['dialogue'], max_length=1024, truncation=True)

    # Encode the summaries as targets; the `text_target` argument replaces the
    # deprecated `tokenizer.as_target_tokenizer()` context manager
    # Summaries longer than 128 tokens are truncated
    target_encodings = tokenizer(text_target=example_batch['summary'], max_length=128, truncation=True)

    # Return a dictionary containing:
    # - 'input_ids': token IDs for the input dialogues
    # - 'attention_mask': mask marking real tokens (1) vs padding (0)
    # - 'labels': token IDs for the target summaries (used for training)
    return {
        'input_ids': input_encodings['input_ids'],
        'attention_mask': input_encodings['attention_mask'],
        'labels': target_encodings['input_ids']
    }
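
The shape of what this function returns can be sketched without downloading a model. The block below is a toy stand-in, not the Pegasus tokenizer: `toy_tokenize` and `toy_convert` are hypothetical helpers that split on whitespace, assign made-up integer IDs, truncate to a maximum length, and append an end-of-sequence ID of 1 (the value that closes the real token lists shown further down).

```python
def toy_tokenize(text, vocab, max_length):
    """Map whitespace tokens to integer IDs, truncate, and append an EOS ID."""
    eos_id = 1  # assumption for illustration: 1 marks end-of-sequence
    # Assign a fresh ID (starting at 2) to every token not seen before
    ids = [vocab.setdefault(tok, len(vocab) + 2) for tok in text.split()]
    ids = ids[: max_length - 1] + [eos_id]  # leave room for the EOS ID
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}


def toy_convert(batch, max_input=8, max_target=4):
    """Mirror the structure of convert_examples_to_features with toy limits."""
    vocab = {}
    inputs = [toy_tokenize(d, vocab, max_input) for d in batch["dialogue"]]
    targets = [toy_tokenize(s, vocab, max_target) for s in batch["summary"]]
    return {
        "input_ids": [e["input_ids"] for e in inputs],
        "attention_mask": [e["attention_mask"] for e in inputs],
        "labels": [t["input_ids"] for t in targets],
    }


batch = {
    "dialogue": ["A: hi there B: hello how are you doing today my friend"],
    "summary": ["A greets B warmly today"],
}
features = toy_convert(batch)
print(features["input_ids"][0])       # 12 tokens truncated to 7, plus EOS
print(features["attention_mask"][0])  # all ones: no padding has been added yet
print(features["labels"][0])          # 5 tokens truncated to 3, plus EOS
```

Note that no padding happens at this stage. Sequences keep their own lengths, and padding is deferred to the data collator at batch time.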

In [None]:

# Apply the 'convert_examples_to_features' function to every split of the SAMSum dataset
# batched=True processes the examples in batches, which is faster than one at a time
# The result is stored in 'dataset_samsum_pt'
dataset_samsum_pt = dataset_samsum.map(convert_examples_to_features, batched=True)




In [None]:
# Access the training split of the processed SAMSum dataset
# It now also contains the features generated by 'convert_examples_to_features'
dataset_samsum_pt["train"]

Dataset({
    features: ['id', 'dialogue', 'summary', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 14732
})

In [None]:
# Access the input IDs of the second example (index 1) in the training split of the processed SAMSum dataset
# This retrieves the tokenized representation of the dialogue for that specific example
dataset_samsum_pt["train"]["input_ids"][1]

[18038, 151, 2632, 127, 119, 6228, 118, 115, 136, 2974, 152, 10463, 151, 35884, 130, 329, 107, 18038, 151, 2587, 314, 1242, 10463, 151, 1509, 1]

In [None]:
dataset_samsum_pt["train"]["labels"][1]

[18038, 111, 34296, 127, 6228, 118, 33195, 115, 136, 2974, 107, 1]

In [None]:
dataset_samsum_pt["train"]["attention_mask"][1]

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

### **7. Train the model**

- Create a data collator for sequence-to-sequence tasks using the specified tokenizer and model

In [None]:
# This collator will pad inputs and labels dynamically when creating batches for training
seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model_pegasus)

# Define training arguments for the Trainer
trainer_args = TrainingArguments(
    # Directory where the model and checkpoints will be saved
    output_dir='pegasus-samsum',

    # Number of training epochs
    num_train_epochs=1,

    # Number of warmup steps for learning rate scheduling
    warmup_steps=500,

    # Batch size for training on each device (e.g., GPU)
    per_device_train_batch_size=1,

    # Batch size for evaluation on each device
    per_device_eval_batch_size=1,

    # Weight decay for regularization
    weight_decay=0.01,

    # How often to log training progress (in steps)
    logging_steps=10,

    # Strategy for evaluating the model during training
    # (recent transformers releases name this argument `eval_strategy`)
    evaluation_strategy='steps',

    # Steps between evaluations
    eval_steps=500,

    # Steps between checkpoint saves; set far beyond the length of this run
    # so that no intermediate checkpoints are written
    save_steps=1_000_000,

    # Number of gradient accumulation steps (to effectively increase batch size)
    gradient_accumulation_steps=16
)
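
The dynamic padding performed by the collator can be sketched in plain Python. This is a simplified illustration rather than the `DataCollatorForSeq2Seq` implementation: `pad_batch` is a hypothetical helper, the input pad ID of 0 is an assumption, and the real collator also converts the lists to tensors. The label pad value of -100 matches the collator's default, and PyTorch's cross-entropy loss ignores that index.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad every sequence to the longest one in this batch (not a global maximum)."""
    max_in = max(len(f["input_ids"]) for f in features)
    max_lab = max(len(f["labels"]) for f in features)

    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_in = max_in - len(f["input_ids"])
        n_lab = max_lab - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_in)
        # Padded positions get attention 0 so the model does not attend to them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_in)
        # Padded label positions get -100 so they do not contribute to the loss
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_lab)
    return batch


features = [
    {"input_ids": [5, 6, 7, 1], "attention_mask": [1, 1, 1, 1], "labels": [8, 1]},
    {"input_ids": [9, 1], "attention_mask": [1, 1], "labels": [10, 11, 12, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"])       # [[5, 6, 7, 1], [9, 1, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
print(batch["labels"])          # [[8, 1, -100, -100], [10, 11, 12, 1]]
```

Padding per batch instead of to a fixed 1024 tokens keeps short dialogues cheap to process.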



In [None]:
# Initialize the Trainer class with the specified parameters for training the model
trainer = Trainer(
    # The model to be trained (Pegasus model)
    model=model_pegasus,

    # Training arguments defined earlier
    args=trainer_args,

    # The tokenizer used for processing input and output sequences
    tokenizer=tokenizer,

    # The data collator that handles dynamic padding of sequences
    data_collator=seq2seq_data_collator,

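    # The dataset used for training. This passes the small test split to keep the run short,
    # which means the test-set scores in section 8 are measured on data the model has already seen.
    # Use dataset_samsum_pt["train"] for a real training run.
    train_dataset=dataset_samsum_pt["test"],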

    # The dataset to be used for evaluation during training (validation split)
    eval_dataset=dataset_samsum_pt["validation"]
)

In [None]:
# Start the training process for the model using the specified training parameters and datasets
trainer.train()

Step,Training Loss,Validation Loss




TrainOutput(global_step=51, training_loss=3.0044142264945832, metrics={'train_runtime': 257.2971, 'train_samples_per_second': 3.183, 'train_steps_per_second': 0.198, 'total_flos': 313450454089728.0, 'train_loss': 3.0044142264945832, 'epoch': 0.9963369963369964})
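
The logged step count and epoch fraction can be reproduced from the training arguments with a little arithmetic. The sketch below assumes a single device and a linear warmup towards the default `learning_rate` of 5e-5 (this notebook does not set it explicitly); how the trailing partial accumulation window is counted can differ between `transformers` versions, but the floor division matches the log above.

```python
num_examples = 819      # rows in the split passed as train_dataset
per_device_batch = 1    # per_device_train_batch_size
grad_accum = 16         # gradient_accumulation_steps
warmup_steps = 500
eval_steps = 500
peak_lr = 5e-5          # assumption: TrainingArguments default learning_rate

# One optimizer step consumes per_device_batch * grad_accum examples
effective_batch = per_device_batch * grad_accum
optimizer_steps = num_examples // effective_batch
epoch_fraction = optimizer_steps * effective_batch / num_examples

# With linear warmup, the learning rate after the last step is still far below the peak
lr_at_end = peak_lr * min(1.0, optimizer_steps / warmup_steps)

# Evaluation runs every eval_steps optimizer steps
evaluations_run = optimizer_steps // eval_steps

print(effective_batch)   # 16
print(optimizer_steps)   # 51, matching global_step=51
print(epoch_fraction)    # about 0.9963, matching the logged 'epoch'
print(lr_at_end)         # about 5.1e-06, roughly a tenth of the peak
print(evaluations_run)   # 0, which is why the loss table above has no rows
```

Two practical consequences follow: with only 51 steps the 500-step warmup never completes, and `eval_steps=500` is never reached, so no validation loss is reported. Lowering both values would be worth considering for a run this short.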

## **8. Evaluate the Model Performance**

#### **Calculate the ROUGE scores on a subset of the test dataset (first 10 examples)**

In [None]:
import torch

def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Split list_of_elements into smaller batches that can be processed together.
    Yields successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]

def calculate_metric_on_test_ds(dataset, metric, model, tokenizer,
                               batch_size=16, device='cpu',
                               column_text="article",
                               column_summary="highlights"):
    # Move model to the specified device
    model.to(device)

    article_batches = list(generate_batch_sized_chunks(dataset[column_text], batch_size))
    target_batches = list(generate_batch_sized_chunks(dataset[column_summary], batch_size))

    for article_batch, target_batch in tqdm(
        zip(article_batches, target_batches), total=len(article_batches)):

        # Tokenize input articles
        inputs = tokenizer(article_batch, max_length=1024, truncation=True,
                        padding="max_length", return_tensors="pt")

        # Move inputs to the same device as the model
        inputs = {key: value.to(device) for key, value in inputs.items()}

        # Generate summaries with the model
        summaries = model.generate(input_ids=inputs["input_ids"],
                                   attention_mask=inputs["attention_mask"],
                                   length_penalty=0.8, num_beams=8, max_length=128)

        # Decode the generated summaries
        decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True,
                                              clean_up_tokenization_spaces=True)
                             for s in summaries]

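        # Pegasus emits "<n>" as a newline marker; replace it with a space.
        # (Replacing the empty string "" would insert a space between every
        # character and collapse the ROUGE scores.)
        decoded_summaries = [d.replace("<n>", " ") for d in decoded_summaries]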

        # Add the generated summaries and targets to the metric
        metric.add_batch(predictions=decoded_summaries, references=target_batch)

    # Compute and return the ROUGE scores
    score = metric.compute()
    return score

rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_metric = evaluate.load('rouge')

# Check if CUDA is available and use the GPU if possible, otherwise use CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Evaluate the model's performance using the ROUGE metric
score = calculate_metric_on_test_ds(
    dataset_samsum['test'][0:10],  # Selecting the first 10 examples from the test dataset
    rouge_metric,                  # The ROUGE metric to be used for evaluation
    trainer.model,                 # The trained model to evaluate
    tokenizer,                     # The tokenizer used for processing the input
    batch_size=2,                  # Number of examples to process in each batch
    column_text='dialogue',        # The column name containing the text input (dialogue)
    column_summary='summary',      # The column name containing the reference summaries
    device=device                  # The device (GPU or CPU) to run the evaluation on
)

# Create a dictionary to store the F1 scores for each ROUGE metric
rouge_dict = {}
for rn in rouge_names:
    if isinstance(score[rn], dict) and 'mid' in score[rn]:
        rouge_dict[rn] = score[rn]['mid'].fmeasure  # Handling case with 'mid' attribute
    else:
        rouge_dict[rn] = score[rn].fmeasure if hasattr(score[rn], 'fmeasure') else score[rn]  # Handling case where the score is a float

# Convert the dictionary of ROUGE scores into a pandas DataFrame for easier viewing
pd.DataFrame(rouge_dict, index=['pegasus'])


100%|██████████| 5/5 [00:20<00:00,  4.09s/it]


| | rouge1 | rouge2 | rougeL | rougeLsum |
|---|---|---|---|---|
| pegasus | 0.022811 | 0.0 | 0.022521 | 0.022623 |
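
The batching helper used in the evaluation loop is easy to check in isolation. The sketch below repeats `generate_batch_sized_chunks` and runs it on placeholder strings (the `dialogues` list is made up for illustration), showing why 10 examples with `batch_size=2` produce the 5 iterations seen in the progress bar.

```python
def generate_batch_sized_chunks(list_of_elements, batch_size):
    """Yield successive batch-sized chunks from list_of_elements."""
    for i in range(0, len(list_of_elements), batch_size):
        yield list_of_elements[i : i + batch_size]


# Ten placeholder inputs split into batches of two
dialogues = [f"dialogue {i}" for i in range(10)]
chunks = list(generate_batch_sized_chunks(dialogues, 2))
print(len(chunks))  # 5
print(chunks[0])    # ['dialogue 0', 'dialogue 1']

# When the length is not a multiple of batch_size, the last chunk is shorter
ragged = list(generate_batch_sized_chunks(list(range(7)), 3))
print(ragged)       # [[0, 1, 2], [3, 4, 5], [6]]
```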


## **9. Save model**

In [None]:
model_pegasus.save_pretrained("pegasus-samsum-model")

In [None]:
# Save the tokenizer
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/spiece.model',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

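## **10. Load the Saved Tokenizer**

In [None]:
# Reload the tokenizer from the directory it was saved to above
# (on Colab this relative path resolves to /content/tokenizer)
tokenizer = AutoTokenizer.from_pretrained("tokenizer")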

## **11. Prediction**

In [None]:
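# Generation settings: beam search with 8 beams, a length penalty of 0.8, and at most 128 output tokens
gen_kwargs = {"length_penalty": 0.8, "num_beams": 8, "max_length": 128}

# Take the first test example and its reference summary
sample_text = dataset_samsum["test"][0]["dialogue"]

reference = dataset_samsum["test"][0]["summary"]

# Build a summarization pipeline from the saved model and the reloaded tokenizer
pipe = pipeline("summarization", model="pegasus-samsum-model", tokenizer=tokenizer)
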
print("Dialogue:")
print(sample_text)


print("\nReference Summary:")
print(reference)


print("\nModel Summary:")
print(pipe(sample_text, **gen_kwargs)[0]["summary_text"])


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Your max_length is set to 128, but your input_length is only 122. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=61)


Dialogue:
Hannah: Hey, do you have Betty's number?
Amanda: Lemme check
Hannah: <file_gif>
Amanda: Sorry, can't find it.
Amanda: Ask Larry
Amanda: He called her last time we were at the park together
Hannah: I don't know him well
Hannah: <file_gif>
Amanda: Don't be shy, he's very nice
Hannah: If you say so..
Hannah: I'd rather you texted him
Amanda: Just text him 🙂
Hannah: Urgh.. Alright
Hannah: Bye
Amanda: Bye bye

Reference Summary:
Hannah needs Betty's number but Amanda doesn't have it. She needs to contact Larry.

Model Summary:
Amanda: Ask Larry Amanda: He called her last time we were at the park together .<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him .
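
The model summary above still contains Pegasus's `<n>` newline markers and a space before each full stop. A small post-processing step can tidy this up. The helper below is a hypothetical sketch (`clean_pegasus_summary` is not part of any library) applied to the exact string printed above.

```python
def clean_pegasus_summary(text: str) -> str:
    """Replace '<n>' newline markers with spaces and tidy spacing around full stops."""
    text = text.replace("<n>", " ")
    text = text.replace(" .", ".")
    # Collapse any repeated whitespace left behind by the replacements
    return " ".join(text.split())


raw = (
    "Amanda: Ask Larry Amanda: He called her last time we were at the park together ."
    "<n>Hannah: I'd rather you texted him .<n>Amanda: Just text him ."
)
cleaned = clean_pegasus_summary(raw)
print(cleaned)

# Pitfall: replacing the empty string inserts a space between every character
print(repr("ab".replace("", " ")))  # ' a b '
```

The pitfall in the last line matters for evaluation: a summary split into single characters shares almost no words with its reference, which drives ROUGE scores towards zero.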
