For this task, you will be provided with a dataset containing news articles. Your goal is to develop an abstractive text summarization model that generates concise summaries of these articles.

# Installing the Required Dependency

In [None]:
!pip install transformers[sentencepiece] datasets sacrebleu rouge_score py7zr -q
!pip uninstall -y transformers accelerate
!pip install transformers accelerate
!pip install evaluate

In [40]:
!nvidia-smi

Fri May 19 07:35:08 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P0    29W /  70W |  10655MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Importing Necessary Libaries

In [41]:
from transformers import pipeline,AutoModelForSeq2SeqLM, AutoTokenizer,DataCollatorForSeq2Seq
import pandas as pd
from datasets import load_dataset,load_metric
import torch

Checking for Cuda Dependency

In [42]:
device =  "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

We are using Bart model due to its following reasons
1. Encoder - Decoder Architecture
2. Bidirectionality
3. Pretrained on Large Data
4. Fine Tuning Support

In [43]:
# Enter the model name for using some other model
model_checkpoint = "facebook/bart-base"

In [44]:
# Importing Pretrained Tokenizer from the model itself
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [46]:
# importing the model
# AS this a text generation problem we need to sequence to sequence model
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint).to(device)

CNN Daily Mail dataset is already in the Libaries of Hugging Face 

In [45]:
dataset = load_dataset("cnn_dailymail",version = "3.0.0")
dataset



  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [47]:
dataset['train'].column_names

['article', 'highlights', 'id']

# Creating the pipeline for future test outputs.

In [48]:
pipe = pipeline('summarization',model = model_checkpoint)

Example for checking that model needs to be train or not

In [49]:

test_sample = dataset['test'][1]['article']
test_sample

'(CNN)Never mind cats having nine lives. A stray pooch in Washington State has used up at least three of her own after being hit by a car, apparently whacked on the head with a hammer in a misguided mercy killing and then buried in a field -- only to survive. That\'s according to Washington State University, where the dog -- a friendly white-and-black bully breed mix now named Theia -- has been receiving care at the Veterinary Teaching Hospital. Four days after her apparent death, the dog managed to stagger to a nearby farm, dirt-covered and emaciated, where she was found by a worker who took her to a vet for help. She was taken in by Moses Lake, Washington, resident Sara Mellado. "Considering everything that she\'s been through, she\'s incredibly gentle and loving," Mellado said, according to WSU News. "She\'s a true miracle dog and she deserves a good life." Theia is only one year old but the dog\'s brush with death did not leave her unscathed. She suffered a dislocated jaw, leg inju

In [50]:
output = pipe(test_sample)


KeyboardInterrupt



In [20]:
print(output[0]['summary_text'].replace("<n>","\n"))

(CNN)Never mind cats having nine lives. A stray pooch in Washington State has used up at least three of her own after being hit by a car, apparently whacked on the head with a hammer in a misguided mercy killing and then buried in a field -- only to survive. That's according to Washington State University, where the dog -- a friendly white-and-black bully breed mix now named Theia -- has been receiving care at the Veterinary Teaching Hospital. Four days after her apparent death, the dog managed to stagger to a nearby farm, dirt-covered and emaciated, where she was found by


In [51]:
'''This function basically divided the data into batches for easier calculation and training'''

def generate_batch_sized_chunks(list_of_elements,batch_size):
    ''' returns batches of elements in list '''
    for i in range(0,len(list_of_elements),batch_size):
        yield list_of_elements[i: i + batch_size]  
        

In [52]:
''' This Function is the generalised function for calculating the any metric for the model '''
''' It gives us an quantitative way to check our model functioning '''
''' inputs -> dataset , metric : eg rogue score, model: eg : bart, tokenizer, batch_size,
              device,column_input,column_output '''
def calculate_metric(dataset,metric,model,tokenizer,batch_size = 16,device = device,
                     column_text = 'article',column_summary = "highlights"):
    ''' Returns score of the model'''
    # Converting it to batches 
    text_batches = list(generate_batch_sized_chunks(dataset[column_text],batch_size))
    summary_batches = list(generate_batch_sized_chunks(dataset[column_summary],batch_size))
    
    # For each loop we are checking the metric
    for text_batch,summary_batch in tqdm(zip(text_batches,summary_batches),total = len(text_batches)):
        
        inputs = tokenizer(text_batch,max_length = 1024,truncation = True,padding = "max_length",return_tensors = "pt")
        summaries = model.generate(input_ids  = inputs["input_ids"].to(device),
                                   length_penalty = 0.8,num_beams = 8,max_length  = 128)
        decoded_summaries = [tokenizer.decode(summary,skip_special_tokens = True,
                                                 clean_up_tokenization_spaces = True)
                            for summary in summaries]
        decoded_summaries = [d.replace(""," ") for d in decoded_summaries]
        
        metric.add_batch(predictions = decoded_summaries,references = summary_batch)
        
    score = metric.compute()
    return score 
    
        

# Usage Rouge Metric:
ROUGE provides a standardized and quantitative way to measure the effectiveness of text summarization systems. It allows researchers and practitioners to compare different models and techniques based on their ability to produce accurate and informative summaries

In [53]:
import evaluate
from tqdm import tqdm
rouge_metric = evaluate.load("rouge")

In [54]:
# Calculating Before training for checking the requirment of training
score = calculate_metric(dataset['test'][:4],rouge_metric,model,tokenizer)

100%|██████████| 1/1 [00:03<00:00,  3.55s/it]


In [55]:
score

{'rouge1': 0.008986035944143775,
 'rouge2': 0.001002004008016032,
 'rougeL': 0.008986035944143775,
 'rougeLsum': 0.008986035944143775}

In [57]:
''' This function encodes the text to features for model training'''
''' Return encoded data '''
def convert_to_features(example):
  input_encoded = tokenizer(example['article'],max_length = 1024,truncation = True)
  with tokenizer.as_target_tokenizer():
    target_encodings = tokenizer(example['highlights'],max_length = 128,truncation = True)
  return {
      'input_ids': input_encoded['input_ids'],
      'labels' : target_encodings['input_ids']
  }

''' By Using map we can apply it for all examples '''
tokenized_dataset = dataset.map(convert_to_features,batched = True)

Map:   0%|          | 0/287113 [00:00<?, ? examples/s]



Map:   0%|          | 0/13368 [00:00<?, ? examples/s]

Map:   0%|          | 0/11490 [00:00<?, ? examples/s]

# Fine Tuning / Training the model 
As transformer has huge amount of parameters we cannot train it using current resources instead of that we will pretrained parameters and fine tuning it for our dataset

In [58]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model = model)

In [60]:
from transformers import TrainingArguments, Trainer

# Setting Training Arguments 
trainer_args = TrainingArguments(
    output_dir = "saved_model",
    num_train_epochs = 3,
    warmup_steps = 500,
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1,
    weight_decay = 0.01,
    logging_steps = 10,
    evaluation_strategy = 'steps',
    eval_steps = 500
)

In [61]:
# Setting up the trainer 
trainer = Trainer(model = model,
                  args = trainer_args,
                  tokenizer = tokenizer,
                  data_collator = data_collator,
                  train_dataset = tokenized_dataset['train'],
                  eval_dataset = tokenized_dataset['validation'])

In [62]:
trainer.train()

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss


Step,Training Loss,Validation Loss
500,2.4793,2.173915


In [None]:
# Checking the score after fine tuning 
score = calculate_metric(dataset['test'][:10],rouge_metric,model,tokenizer)

In [None]:
score

In [None]:
# Saving the data
model.save_pretrained("bard_dailycnn")
tokenizer.save_pretrained("tokenizer")