# TEXT SUMMARIZATION USING BART TRANSFORMER MODEL

BART stands for Bidirectional and Auto-Regressive Transformers. This notebook uses it for summarization in two ways:

- 1. WITHOUT FINE-TUNING
- 2. WITH FINE-TUNING

#### LOADING THE DATASET

In [None]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading fsspec-2024.9.0-py3-none-any.whl (

In [None]:
from datasets import load_dataset

df = load_dataset("ccdv/pubmed-summarization")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
df

DatasetDict({
    train: Dataset({
        features: ['article', 'abstract'],
        num_rows: 119924
    })
    validation: Dataset({
        features: ['article', 'abstract'],
        num_rows: 6633
    })
    test: Dataset({
        features: ['article', 'abstract'],
        num_rows: 6658
    })
})

In [None]:
df['train'][1]['dialogue']

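KeyError: 'dialogue'

The PubMed dataset loaded above exposes only the 'article' and 'abstract' columns, so 'dialogue' is not a valid key. The 'dialogue' and 'summary' columns used in the remaining cells, and the outputs shown for them, appear to come from a run against a dialogue-summarization dataset that has those two columns.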

In [None]:
df['train'][1]['summary']

'Mrs Parker takes Ricky for his vaccines. Dr. Peters checks the record and then gives Ricky a vaccine.'

## 1. USING THE MODEL WITHOUT FINE-TUNING

#### LOADING THE BART MODEL

In [None]:
from transformers import pipeline

text_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [None]:
article_1 = df['train'][1]['dialogue']

text_summarizer(article_1, max_length=20, min_length=10, do_sample=False)

[{'summary_text': 'Ricky has received his Polio, Tetanus and Hepatitis B shots.'}]

## 2. FINE-TUNING THE MODEL

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import TrainingArguments, Trainer

In [None]:
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

In [None]:
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

'input_ids' is the tokenized form of the input text. Each token (a word or part of a word) is mapped to a unique integer ID from the model's vocabulary.

'attention_mask' indicates which tokens the model should attend to and which it should ignore (usually padding tokens). It is a binary mask where:

- 1 indicates that the token should be attended to.
- 0 indicates that the token is padding and should be ignored.
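
As a rough illustration of how these two fields relate, the sketch below pads a small batch of already tokenized sequences to a common length and derives the matching attention mask. The token IDs, the `pad_batch` helper, and the pad ID of 1 are assumptions made for illustration (in practice, read the pad ID from `tokenizer.pad_token_id`); a real tokenizer does all of this for you when called with `padding` and `truncation`.

```python
PAD_ID = 1  # assumed pad token ID; use tokenizer.pad_token_id in practice


def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Truncate or right-pad each sequence to max_length and build its mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                    # truncation
        n_pad = max_length - len(seq)             # how many pad tokens to add
        input_ids.append(seq + [pad_id] * n_pad)  # right-padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token ID sequences of different lengths
batch = pad_batch([[0, 713, 16, 2], [0, 100, 2]], max_length=5)
print(batch["input_ids"])       # [[0, 713, 16, 2, 1], [0, 100, 2, 1, 1]]
print(batch["attention_mask"])  # [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]]
```

Every 0 in the mask lines up with a pad ID in the same position, which is what lets the model skip those positions.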

In sequence-to-sequence models, such as text summarization models, each example has two token sequences:

- Input IDs: token IDs of the source text (e.g., the dialogue).
- Target IDs: token IDs of the target text (e.g., the summary).

During training, the model computes the loss between the predicted sequence and the target sequence. So that padding tokens do not contribute to this loss, padding token IDs in the labels are replaced with -100, the index that the cross-entropy loss in PyTorch ignores by default.
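
To make the effect of -100 concrete, here is a minimal pure-Python sketch of a cross-entropy loss that skips ignored positions. The helper names, the four-token vocabulary, and the logits are all hypothetical; the actual loss is computed inside the model, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math

IGNORE_INDEX = -100  # label value that the loss skips


def mask_padding(label_ids, pad_id):
    """Replace pad token IDs in the labels with the ignore index."""
    return [tok if tok != pad_id else IGNORE_INDEX for tok in label_ids]


def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions that are not ignored."""
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padding position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count


# Toy target of two real tokens followed by two pad tokens (pad ID 1, assumed)
labels = mask_padding([2, 3, 1, 1], pad_id=1)
logits = [
    [0.1, 0.2, 2.0, 0.3],
    [0.0, 0.1, 0.2, 1.5],
    [5.0, 0.0, 0.0, 0.0],  # prediction at a padded position
    [0.0, 0.0, 0.0, 9.0],  # prediction at a padded position
]
loss = cross_entropy(logits, labels)
loss_without_padding = cross_entropy(logits[:2], [2, 3])
print(labels, loss == loss_without_padding)
```

Whatever the model predicts at the padded positions, the loss equals the loss over the two real tokens alone.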



In [None]:
# Tokenization

def preprocess_function(batch):
    source = batch['dialogue']
    target = batch["summary"]
    # Truncate or pad both source and target to a fixed length of 128 tokens
    source_ids = tokenizer(source, truncation=True, padding="max_length", max_length=128)
    target_ids = tokenizer(target, truncation=True, padding="max_length", max_length=128)

    # Replace pad token id with -100 for labels to ignore padding in loss computation
    labels = target_ids["input_ids"]
    labels = [[(label if label != tokenizer.pad_token_id else -100) for label in labels_example] for labels_example in labels]

    return {
        "input_ids": source_ids["input_ids"],
        "attention_mask": source_ids["attention_mask"],
        "labels": labels
    }

df_source = df.map(preprocess_function, batched=True)

Map:   0%|          | 0/12460 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]

In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="/content",  # Replace with your output directory
    per_device_train_batch_size=8,
    num_train_epochs=2,  # Adjust number of epochs as needed
    remove_unused_columns=False
)

In [None]:
# Create Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=df_source["train"],
    eval_dataset=df_source["test"]
)

trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 1.1874 |
| 1000 | 1.1765 |
| 1500 | 1.219  |
| 2000 | 1.0479 |
| 2500 | 1.0982 |
| 3000 | 1.2005 |

Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


TrainOutput(global_step=3116, training_loss=1.161297138297359, metrics={'train_runtime': 1175.9641, 'train_samples_per_second': 21.191, 'train_steps_per_second': 2.65, 'total_flos': 1899329067417600.0, 'train_loss': 1.161297138297359, 'epoch': 2.0})
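
The reported step count and throughput can be cross-checked from numbers already shown in this notebook. The sketch below assumes a single device and no gradient accumulation (neither is set in the training arguments), and takes the training-set size of 12460 from the first `Map` progress line.

```python
import math

train_examples = 12460  # from the Map progress output for the train split
batch_size = 8          # per_device_train_batch_size
epochs = 2              # num_train_epochs
runtime_s = 1175.9641   # train_runtime reported in TrainOutput

# The last batch of each epoch is partial, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_examples * epochs / runtime_s
steps_per_second = total_steps / runtime_s

print(total_steps)                   # 3116
print(round(samples_per_second, 3))  # 21.191
print(round(steps_per_second, 2))    # 2.65
```

These values agree with `global_step`, `train_samples_per_second`, and `train_steps_per_second` in the output above.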

In [None]:
# Evaluate the model
eval_results = trainer.evaluate()

# Print evaluation results
print(eval_results)

{'eval_loss': 1.7255662679672241, 'eval_runtime': 18.272, 'eval_samples_per_second': 82.093, 'eval_steps_per_second': 10.289, 'epoch': 2.0}
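
One way to read these losses is to convert them to perplexity by exponentiating. This is a rough sketch that assumes the reported loss is an average per-token cross-entropy in nats; the variable names are mine, and the two loss values are copied from the outputs above.

```python
import math

eval_loss = 1.7255662679672241  # from trainer.evaluate()
train_loss = 1.161297138297359  # from TrainOutput

# Perplexity is exp(average cross-entropy); lower is better
eval_perplexity = math.exp(eval_loss)
train_perplexity = math.exp(train_loss)

print(f"train perplexity: {train_perplexity:.2f}")
print(f"eval perplexity:  {eval_perplexity:.2f}")
```

The evaluation perplexity is noticeably higher than the training perplexity, which reflects the gap between the training and evaluation losses.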


#### SAVING THE MODEL

In [None]:
# Save the model and tokenizer after training
model.save_pretrained("/content/your_model_directory")
tokenizer.save_pretrained("/content/your_model_directory")

Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


('/content/your_model_directory/tokenizer_config.json',
 '/content/your_model_directory/special_tokens_map.json',
 '/content/your_model_directory/vocab.json',
 '/content/your_model_directory/merges.txt',
 '/content/your_model_directory/added_tokens.json',
 '/content/your_model_directory/tokenizer.json')

#### SUMMARIZING CUSTOM DATA USING THE SAVED MODEL AND TOKENIZER

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the trained model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("/content/your_model_directory")
model = AutoModelForSeq2SeqLM.from_pretrained("/content/your_model_directory")

# Function to summarize a blog post
def summarize(blog_post):
    # Tokenize the input blog post
    inputs = tokenizer(blog_post, max_length=1024, truncation=True, return_tensors="pt")

    # Generate the summary with beam search, passing the attention mask explicitly
    summary_ids = model.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Example blog post
blog_post = """
As Yogi Berra famously said, it’s tough to make predictions, especially about the future. But had the baseball legend spent any time observing the UN climate negotiations, he could have safely predicted that climate finance will prove to be a key sticking point at COP29 in Baku at the end of this year.

‘Who will pay and how much?’ are perennial questions at the climate talks, but this year, the discussions about climate finance will be especially prominent. At COP29, Parties to the Paris Agreement must negotiate a new climate finance goal, to replace the existing commitment from 2009 for developed countries to provide US$100 billion climate finance annually from 2020 to 2025 - a commitment that only in 2022 was starting to be fulfilled, according to a recent OECD report.

It is vital that the forthcoming Bonn Climate Change Conference sends the right political signals, and lays the procedural and technical groundwork for an ambitious climate finance deal in Baku.

A pressing need

With global warming already destabilising the climate and devastating people’s lives and livelihoods, the need for finance to reduce greenhouse gas emissions and to adapt to a warming world has never been more pressing.

The sums involved are large. The Paris Agreement’s Global Stocktake process estimates that US$5.8-5.9 trillion is required to implement Nationally Determined Contributions (NDCs) in developing countries up to 2030. They will require US$215-387 billion annually over this period for adaptation. Investments of US$1.5 trillion in renewable energy are required worldwide every year up until 2030, according to IRENA.

But these sums are also affordable and beneficial for developed countries. They should be seen in the context of ongoing investments in energy and other infrastructure: around US$2.3 trillion was invested in energy infrastructure in 2023, of which US$1.74 trillion was in clean energy. These investments will generate strong returns for their investors and reduce the costs for energy consumers.

And, crucially, they should also be seen in the context of the alternative. The latest research estimates that the world economy is already set to face a 19% income reduction within the next 26 years based on the levels of warming we have already locked in. The more we delay and the more the planet heats, the greater the economic costs will be.

Laying the foundations for a new finance goal

While financial resources are beginning to flow, they are not flowing fast enough, and certainly not flowing to those developing countries where need is greatest and access to finance is most challenging.

The UN climate framework provides mechanisms that can enable those flows of climate finance. Back in 2015, parties at the climate talks agreed to establish a “new collective quantified goal” (NCQG) for climate finance. They agreed that the NCQG would be set prior to 2025.

The  ultimate size of the NCQG will be a product of the negotiations, but Parties have agreed it must be a significant increase from the floor of US$100 billion annually. For WWF, it must be needs-based and sufficiently ambitious to meet the scale of the challenge we face, and immediately accessible to help countries that are already facing the chaos of a destabilised climate system.

While developed countries are expected to provide financial and technical support, developing countries also have a role to play. Parties are due to submit revised NDCs in 2025, presenting how they plan to reduce emissions and adapt to climate change. Developing countries have the opportunity to use their NDCs to set out how international climate finance can support them and increase their ambition. To do this, they need to know the finance will be forthcoming.
"""

# Get the summary
summary = summarize(blog_post)
print("Summary:", summary)


Summary: At the UN climate negotiations, the key sticking point is the new climate finance goal. At the forthcoming Bonn Climate Change Conference, Parties are expected to present revised NDCs to replace the existing commitment from 2009 for developed countries to provide US$100 billion climate finance.
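
The `length_penalty=2.0` argument in the `generate` call affects which beam is chosen. As I understand it, beam search ranks finished candidates by their summed log-probability divided by length raised to the length penalty; the sketch below assumes that formula, and the two candidates and their scores are invented for illustration.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score (assumed formula): higher is better."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical finished beams: a short one and a longer, less likely one
candidates = {
    "short": {"sum_logprobs": -4.0, "length": 10},
    "long": {"sum_logprobs": -10.0, "length": 20},
}


def best(length_penalty):
    """Return the name of the candidate with the highest normalized score."""
    return max(
        candidates,
        key=lambda name: beam_score(length_penalty=length_penalty, **candidates[name]),
    )


print(best(1.0))  # short: -0.4 beats -0.5
print(best(2.0))  # long: -0.025 beats -0.04
```

Because log-probabilities are negative, a larger exponent shrinks the score of long sequences toward zero faster, so values above 1.0 tilt the choice toward longer summaries.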
