In [1]:
# Install transformers library if not already installed
!pip install transformers --quiet

In [2]:
import pandas as pd
import os
from datasets import Dataset
import torch
from tqdm import tqdm

In [3]:

test_path = "/kaggle/input/newspaper-text-summarization-cnn-dailymail/cnn_dailymail/test.csv"

In [4]:
test_df = pd.read_csv(test_path)

In [5]:
test_dataset = Dataset.from_pandas(test_df)

In [6]:
# Verify structure
print("\ncolumns:", test_dataset.column_names)
print("Sample article:", test_dataset[0]["article"][:1250])
print("Sample summary:", test_dataset[0]["highlights"])


columns: ['id', 'article', 'highlights']
Sample article: Ever noticed how plane seats appear to be getting smaller and smaller? With increasing numbers of people taking to the skies, some experts are questioning if having such packed out planes is putting passengers at risk. They say that the shrinking space on aeroplanes is not only uncomfortable - it's putting our health and safety in danger. More than squabbling over the arm rest, shrinking space on planes putting our health and safety in danger? This week, a U.S consumer advisory group set up by the Department of Transportation said at a public hearing that while the government is happy to set standards for animals flying on planes, it doesn't stipulate a minimum amount of space for humans. 'In a world where animals have more rights to space and food than humans,' said Charlie Leocha, consumer representative on the committee. 'It is time that the DOT and FAA take a stand for humane treatment of passengers.' But could crowding on p

In [10]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "sshleifer/distilbart-cnn-12-6"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

summarizer = pipeline(
    "summarization",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

# Test on sample article
sample_article = test_dataset[10]["article"].strip()[:1024]
print("\nOriginal Article:", sample_article[:1250] + "...")

summary = summarizer(
    sample_article,
    max_length=128,
    min_length=30,
    num_beams=4,
    early_stopping=True
)

print("\nGenerated Summary:", summary[0]["summary_text"])

Device set to use cuda:0



Original Article: Biting his nails nervously, these are the first pictures of the migrant boat captain accused of killing 900 men, women and children in one of the worst maritime disasters since World War Two. Tunisian skipper Mohammed Ali Malek, 27, was arrested when he stepped onto Sicilian soil last night, some 24 hours after his  boat capsized in the Mediterranean. Before leaving the Italian coastguard vessel, however, he was forced to watch the bodies of 24 victims of the tragedy being carried off the ship for burial on the island of Malta. He was later charged with multiple manslaughter, causing a shipwreck and aiding illegal immigration. Prosecutors claim he contributed to the disaster by mistakenly ramming the overcrowded fishing boat into a merchant ship that had come to its rescue. As a result of the collision, the migrants shifted position on the boat, which was already off balance, causing it to overturn. Scroll down for videos . Nervous: Tunisian boat captain Mohammed Ali

In [11]:
# Parameters
batch_size = 64  # Adjust based on GPU memory
summarized_data = []

# Process in batches
for i in tqdm(range(0, len(test_dataset), batch_size)):
    batch_articles = [
        test_dataset[j]["article"].strip()[:1024] 
        for j in range(i, min(i + batch_size, len(test_dataset)))
    ]

    try:
        summaries = summarizer(
            batch_articles,
            max_length=96,
            min_length=30,
            num_beams=4,
            early_stopping=True
        )
        for original, summary in zip(batch_articles, summaries):
            summarized_data.append({
                "original_text": original,
                "summary": summary["summary_text"]
            })
    except Exception as e:
        for original in batch_articles:
            summarized_data.append({
                "original_text": original,
                "summary": f"Error: {str(e)}"
            })

# Save to CSV
df = pd.DataFrame(summarized_data)
df.to_csv("summarized_articles.csv", index=False)

print("✅ Done! Summaries saved.")


  5%|▌         | 9/180 [04:56<1:33:16, 32.73s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
 14%|█▍        | 25/180 [13:39<1:24:02, 32.53s/it]Your max_length is set to 96, but your input_length is only 89. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=44)
 45%|████▌     | 81/180 [43:48<52:58, 32.11s/it]  Your max_length is set to 96, but your input_length is only 66. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=33)
 58%|█████▊    | 104/180 [56:04<41:03, 32.41s/it]Your max_length is set to 96, but your input_length is only 94. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_leng

✅ Done! Summaries saved.
