**Step 1: Import Libraries and Load Dataset**

First, install the required packages, log in to the Hugging Face Hub, and load the CNN/DailyMail dataset.

In [None]:
!pip install -U transformers datasets torch accelerate evaluate rouge_score nltk sentencepiece



In [None]:
from huggingface_hub import login
login()


In [None]:
from datasets import load_dataset

# Load the CNN/DailyMail dataset
dataset = load_dataset("ccdv/cnn_dailymail", '3.0.0')

# Print dataset structure
print(dataset)

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})


**Step 2: Preprocess the Dataset**

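Next, we preprocess the dataset by tokenizing the articles and their reference summaries. Each article is prefixed with `summarize: ` and truncated or padded to 512 tokens; each summary is truncated or padded to 150 tokens.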

In [None]:
from transformers import T5Tokenizer

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('t5-small')

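# Define the preprocessing function
def preprocess_function(examples):
    # T5 expects a task prefix on every input
    inputs = ["summarize: " + doc for doc in examples["article"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")

    # Tokenize summaries as targets (text_target replaces the deprecated as_target_tokenizer)
    labels = tokenizer(text_target=examples["highlights"], max_length=150, truncation=True, padding="max_length")

    # Replace padding ids with -100 so the loss ignores padded label positions
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs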

# Apply preprocessing to the dataset
tokenized_datasets = dataset.map(preprocess_function, batched=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



**Step 3: Load the Model**

Load the T5 model for conditional generation.

In [None]:
from transformers import T5ForConditionalGeneration

# Load the model
model = T5ForConditionalGeneration.from_pretrained('t5-small')

**Step 4: Define Summarization Function**

Create a function that prepends the `summarize: ` task prefix, truncates the input to 512 tokens, and generates a summary with beam search.



In [None]:
def summarize_text(text, model, tokenizer, max_length=150, min_length=40, num_beams=4):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
    summary_ids = model.generate(inputs, max_length=max_length, min_length=min_length, length_penalty=2.0, num_beams=num_beams, early_stopping=True)
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

**Step 5: Generate Summaries for Sample Articles**

Test the summarization function on a few articles.

In [None]:
import pandas as pd

# Slice rows first so only five examples are loaded, not the whole column
sample = dataset['train'][:5]
articles = sample['article']
highlights = sample['highlights']

# Create lists to store the results
data = {
    'Original Article': [],
    'Original Summary': [],
    'Generated Summary': []
}

# Iterate over articles and highlights
for article, highlight in zip(articles, highlights):
    data['Original Article'].append(article)
    data['Original Summary'].append(highlight)
    summary = summarize_text(article, model, tokenizer)
    data['Generated Summary'].append(summary)

# Create a DataFrame from the data
df_results = pd.DataFrame(data)

# Print the DataFrame
display(df_results)

Unnamed: 0,Original Article,Original Summary,Generated Summary
0,It's official: U.S. President Barack Obama wan...,Syrian official: Obama climbed to the top of t...,president obama wants congress to weigh in on ...
1,(CNN) -- Usain Bolt rounded off the world cham...,Usain Bolt wins third gold of world championsh...,usain Bolt wins men's 4x100m relay gold in Mos...
2,"Kansas City, Missouri (CNN) -- The General Ser...",The employee in agency's Kansas City office is...,the general services administration allowed an...
3,Los Angeles (CNN) -- A medical doctor in Vanco...,NEW: A Canadian doctor says she was part of a ...,"a medical doctor in Vancouver, British Columbi..."
4,(CNN) -- Police arrested another teen Thursday...,Another arrest made in gang rape outside Calif...,another teen arrested on charges of felony rap...


**Step 6: Evaluate the Model**

Evaluate the model using the ROUGE metric. The `rouge` metric from `evaluate` used below returns the F1 score for each variant by default.

1. **ROUGE-1 (R1)**: Measures unigram (single-word) overlap between the generated summary and the reference (gold-standard) summary, expressed as precision, recall, and F1.

2. **ROUGE-2 (R2)**: Measures bigram (two-word sequence) overlap between the generated summary and the reference summary, again expressed as precision, recall, and F1.

3. **ROUGE-L (RL)**: Based on the longest common subsequence (LCS) between the generated summary and the reference summary, with each summary treated as a single sequence. Precision, recall, and F1 are derived from the length of the LCS.

4. **ROUGE-Lsum**: A summary-level variant of ROUGE-L that splits each summary into sentences on newlines and computes the LCS sentence by sentence. It is labelled "RW" in the results tables in this notebook, but it is not the same as ROUGE-W, which is a weighted LCS that favors consecutive matches.
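To make the definitions above concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L. It assumes lowercased whitespace tokenization, which is simpler than what the `rouge_score` package does, so its numbers will not match `evaluate.load('rouge')` exactly. The helper names (`ngrams`, `f1`, `rouge_n`, `lcs_length`, `rouge_l`) are illustrative and not part of any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Turn overlap counts into (precision, recall, F1)."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return precision, recall, 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between prediction and reference."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # counts clipped to the minimum
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L: precision, recall, and F1 based on the LCS length."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print("ROUGE-1:", rouge_n(prediction, reference, 1))
print("ROUGE-2:", rouge_n(prediction, reference, 2))
print("ROUGE-L:", rouge_l(prediction, reference))
```

In this toy pair, five of six unigrams and three of five bigrams match, and the LCS is "the cat on the mat", so ROUGE-1 and ROUGE-L agree while ROUGE-2 is lower.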

In [None]:
import pandas as pd
import evaluate

# Load the ROUGE metric
rouge = evaluate.load('rouge')

def evaluate_model(model, tokenizer, dataset, num_samples=100):
    # Slice rows first so only num_samples examples are loaded
    sample = dataset['validation'][:num_samples]
    articles = sample['article']
    references = sample['highlights']
    summaries = []

    for article in articles:
        summary = summarize_text(article, model, tokenizer)
        summaries.append(summary)

    results = rouge.compute(predictions=summaries, references=references)
    return results

# Evaluate the model
results = evaluate_model(model, tokenizer, dataset)

# Define a function to convert the evaluation results to a DataFrame
def format_results(results):
    formatted_results = {
        'Metric': ['ROUGE-1 (R1)', 'ROUGE-2 (R2)', 'ROUGE-L (RL)', 'ROUGE-Lsum (RW)'],
        'Score': [results['rouge1'], results['rouge2'], results['rougeL'], results['rougeLsum']]
    }
    df_results = pd.DataFrame(formatted_results)
    return df_results

# Format the results as a DataFrame
df_results = format_results(results)

# Print the DataFrame
print(df_results)

            Metric     Score
0     ROUGE-1 (R1)  0.277180
1     ROUGE-2 (R2)  0.100567
2     ROUGE-L (RL)  0.203479
3  ROUGE-Lsum (RW)  0.233721


**Step 7: Fine-Tuning the Model**

Set up fine-tuning using the Trainer class. The cell below is commented out and was not executed in this notebook.

In [None]:
# import torch
# from transformers import Trainer, TrainingArguments, DataCollatorForSeq2Seq
# import pandas as pd
# import evaluate

# # Load the ROUGE metric
# rouge = evaluate.load('rouge')

# # Create a custom dataset class
# class SummarizationDataset(torch.utils.data.Dataset):
#     def __init__(self, articles, summaries, tokenizer, max_length=512):
#         self.articles = articles
#         self.summaries = summaries
#         self.tokenizer = tokenizer
#         self.max_length = max_length

#     def __len__(self):
#         return len(self.articles)

#     def __getitem__(self, idx):
#         article = self.articles[idx]
#         summary = self.summaries[idx]
#         inputs = self.tokenizer.encode_plus(
#             "summarize: " + article,
#             max_length=self.max_length,
#             truncation=True,
#             padding='max_length',
#             return_tensors='pt'
#         )
#         targets = self.tokenizer.encode_plus(
#             summary,
#             max_length=150,  # summaries are short; matches Step 2
#             truncation=True,
#             padding='max_length',
#             return_tensors='pt'
#         )
#         labels = targets['input_ids'].flatten()
#         # Ignore padded label positions in the loss
#         labels[labels == self.tokenizer.pad_token_id] = -100
#         return {
#             'input_ids': inputs['input_ids'].flatten(),
#             'attention_mask': inputs['attention_mask'].flatten(),
#             'labels': labels
#         }

# # Prepare the dataset
# train_dataset = SummarizationDataset(
#     dataset['train']['article'],
#     dataset['train']['highlights'],
#     tokenizer
# )

# validation_dataset = SummarizationDataset(
#     dataset['validation']['article'],
#     dataset['validation']['highlights'],
#     tokenizer
# )

# # Data collator
# data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# # Define training arguments
# training_args = TrainingArguments(
#     output_dir='./results',
#     evaluation_strategy="epoch",
#     learning_rate=2e-5,
#     per_device_train_batch_size=4,
#     per_device_eval_batch_size=4,
#     weight_decay=0.01,
#     save_total_limit=3,
#     num_train_epochs=1,
#     logging_dir='./logs',
# )

# # Initialize the Trainer
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=validation_dataset,
#     data_collator=data_collator,
#     tokenizer=tokenizer,
# )

# # Train the model
# trainer.train()

**Data Augmentation - Synonym Replacement**

In [None]:
import random
from datasets import load_dataset
from transformers import pipeline
import pandas as pd
import evaluate
import nltk
from nltk.corpus import wordnet

# Download the wordnet data
nltk.download('wordnet')

# Load the pre-trained summarization pipeline
summarizer = pipeline("summarization")

# Load the ROUGE metric
rouge = evaluate.load('rouge')

[nltk_data] Downloading package wordnet to /root/nltk_data...
No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [None]:
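def synonym_replacement(text, n=5):
    words = text.split()
    new_words = words.copy()
    # Candidate words are those with at least one WordNet synset
    random_word_list = list(set(word for word in words if wordnet.synsets(word)))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        # Collect lemma names that differ from the word itself;
        # WordNet joins multi-word lemmas with underscores
        synonyms = [
            lemma.name().replace('_', ' ')
            for synset in wordnet.synsets(random_word)
            for lemma in synset.lemmas()
            if lemma.name().lower() != random_word.lower()
        ]
        if synonyms:
            synonym = synonyms[0]
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
        if num_replaced >= n:  # Only replace up to n words
            break
    return ' '.join(new_words)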
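To see how the replacement budget `n` behaves without downloading WordNet, the sketch below mirrors the function above using a small hand-written synonym table. `TOY_SYNONYMS`, `toy_synonym_replacement`, and the `seed` parameter are hypothetical names introduced only for illustration; they are not part of NLTK or of this notebook's pipeline.

```python
import random

# Hypothetical stand-in for WordNet lookups, for illustration only
TOY_SYNONYMS = {
    "big": "large",
    "quick": "fast",
    "happy": "glad",
    "car": "automobile",
}


def toy_synonym_replacement(text, n=2, seed=0):
    """Replace up to n distinct words that have an entry in TOY_SYNONYMS."""
    rng = random.Random(seed)  # local RNG keeps runs reproducible
    words = text.split()
    # Sort before shuffling so the order does not depend on set iteration
    candidates = sorted({w for w in words if w in TOY_SYNONYMS})
    rng.shuffle(candidates)
    chosen = set(candidates[:n])
    # Every occurrence of a chosen word is replaced, as in the function above
    return " ".join(TOY_SYNONYMS[w] if w in chosen else w for w in words)


original = "the big car is quick and the big dog is happy"
augmented = toy_synonym_replacement(original, n=2)
print(original)
print(augmented)
```

Because matching is done on whitespace-split tokens, a word with attached punctuation (for example `car,`) would not match its table entry; the same limitation applies to any `text.split()` based approach.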


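**Step 8: Evaluate the Model Again**

Evaluate again using the ROUGE metric. This step scores the default summarization pipeline loaded above (`sshleifer/distilbart-cnn-12-6`), not the `t5-small` model from Step 6, on validation articles after synonym replacement. The `summarize_text` function below slices each article to its first 300 characters before summarizing, which is why the pipeline warns that the input length is well under `max_length`.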

In [None]:
# Function to summarize text
def summarize_text(text, summarizer, max_length=300, min_length=40):
    # Keep only the first max_length characters (not tokens) of the input
    truncated_text = text[:max_length]
    summary = summarizer(truncated_text, max_length=max_length, min_length=min_length, do_sample=False, truncation=True)[0]['summary_text']
    return summary

# Function to evaluate the model
def evaluate_model(summarizer, dataset, num_samples=100):
    # Slice rows first so only num_samples examples are loaded
    sample = dataset['validation'][:num_samples]
    articles = sample['article']
    references = sample['highlights']
    summaries = []

    for article in articles:
        # Apply data augmentation
        augmented_article = synonym_replacement(article)

        # Summarize augmented article
        summary = summarize_text(augmented_article, summarizer)
        summaries.append(summary)

    results = rouge.compute(predictions=summaries, references=references)
    return results

# Evaluate the model
results = evaluate_model(summarizer, dataset)
print(results)

# Format the results as a DataFrame
def format_results(results):
    formatted_results = {
        'Metric': ['ROUGE-1 (R1)', 'ROUGE-2 (R2)', 'ROUGE-L (RL)', 'ROUGE-Lsum (RWs)'],
        'Score': [
            results['rouge1'],
            results['rouge2'],
            results['rougeL'],
            results['rougeLsum']
        ]
    }
    return pd.DataFrame(formatted_results)

# Format and print the results
df_results = format_results(results)
print(df_results)

Your max_length is set to 300, but your input_length is only 69. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=34)

{'rouge1': 0.29117279687858255, 'rouge2': 0.10066697882530648, 'rougeL': 0.20906519070469215, 'rougeLsum': 0.24082391477287768}
             Metric     Score
0      ROUGE-1 (R1)  0.291173
1      ROUGE-2 (R2)  0.100667
2      ROUGE-L (RL)  0.209065
3  ROUGE-Lsum (RWs)  0.240824


