## GPT -- Text Summarization

In [15]:
# Import modules
from transformers import pipeline, GPT2Tokenizer, GPT2LMHeadModel, AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
import torch
import pandas as pd
from rouge_score import rouge_scorer
import random

In [3]:
# Import dataset 
dataset = load_dataset("cnn_dailymail", "3.0.0")
dataset

DatasetDict({
    train: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 287113
    })
    validation: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 13368
    })
    test: Dataset({
        features: ['article', 'highlights', 'id'],
        num_rows: 11490
    })
})

In [5]:
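# Draw 100 random training articles; sampling indices avoids copying the whole split into a list
total_articles = dataset['train']
articles = [total_articles[i] for i in random.sample(range(len(total_articles)), 100)]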

# Access the first article in the training dataset
first_article = dataset['train'][0]

# Print the original text and human summary
print("Original Text:", first_article['article'])
print("Human Summary:", first_article['highlights'])

Original Text: LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday, but he insists the money won't cast a spell on him. Daniel Radcliffe as Harry Potter in "Harry Potter and the Order of the Phoenix" To the disappointment of gossip columnists around the world, the young actor says he has no plans to fritter his cash away on fast cars, drink and celebrity parties. "I don't plan to be one of those people who, as soon as they turn 18, suddenly buy themselves a massive sports car collection or something similar," he told an Australian interviewer earlier this month. "I don't think I'll be particularly extravagant. "The things I like buying are things that cost about 10 pounds -- books and CDs and DVDs." At 18, Radcliffe will be able to gamble in a casino, buy a drink in a pub or see the horror film "Hostel: Part II," currently six places below his number one movie on the UK box office chart.

In [5]:
# Access the dataset splits
train_data = dataset['train']
validation_data = dataset['validation']
test_data = dataset['test']

# View a sample
sample = train_data[0]
print("Original Text:", sample['article'])
print("Human Summary:", sample['highlights'])


In [8]:
# Load the summarization pipeline with a pretrained model
summarization_pipeline = pipeline("summarization", model="facebook/bart-large-cnn")

# Perform abstractive summarization on a single sample
article = articles[0]['article'][:1024]  # First 1,024 characters (not tokens) of the sampled article
abstractive_summary = summarization_pipeline(article, max_length=130, min_length=30, do_sample=False)

print("Abstractive Summary:", abstractive_summary[0]['summary_text'])

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Abstractive Summary: Caterham and Marussia have entered administration. Lotus, Force India and Sauber have threatened to boycott the United States Grand Prix as a protest at what they perceive as a lack of action to stop the folly.


In [9]:
# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Add padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token

### Abstractive

In [11]:
def generate_abstractive_summary(article, max_new_tokens=150):
    # Create a summarization prompt
    prompt = f"Summarize the following article:\n\n{article}\n\nSummary:"

    # Tokenize the prompt, leaving room for the generated tokens:
    # GPT-2's context window is 1024 tokens, prompt and output combined
    inputs = tokenizer.encode(prompt, return_tensors="pt", max_length=1024 - max_new_tokens, truncation=True)

    # Generate summary. Decoding is greedy: generate() only samples when
    # do_sample=True, so temperature, top_k and top_p are left out here.
    # To sample, pass do_sample=True, temperature=0.7, top_k=50, top_p=0.95.
    outputs = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,  # Number of tokens to generate
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode the full sequence and keep only the text after the last "Summary:" cue
    summary = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return summary.split("Summary:")[-1].strip()

# Test the function with a sample article
abstractive_summary = generate_abstractive_summary(articles[0]['article'][:1024])
print("Abstractive Summary:", abstractive_summary)


Abstractive Summary: Eccliestone is not alone in his belief that the current situation is unsustainable.
. The F2 world championship is in danger of collapsing. It is a crisis which is being exacerbated by the fact that F3 is now in the midst of its worst year in history. F4 is on the brink of collapse. And the only way to avoid it is for the teams involved to make sacrifices. This is the problem with Eccly's comments. He is saying that Formula E is 'in crisis' and that it needs to be 'fixed'. He has no idea what he's talking about. His words are not only wrong but also dangerous. If he were to say that he would not be willing to take on any of these teams, he
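
The `temperature`, `top_k` and `top_p` settings only take effect when `generate()` is called with `do_sample=True`. As a rough illustration of how they reshape a next-token distribution, here is a minimal pure-Python sketch. The toy logits and the helper name `filter_distribution` are made up for illustration, and this is not the `transformers` implementation.

```python
import math

def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return the renormalized sampling distribution over token ids (toy sketch)."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token ids from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass_k = sum(probs[i] for i in ranked)
    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= top_p:
            break
    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
dist = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
print(dist)
```

With `top_k=1` the distribution collapses onto the single most likely token, which is what greedy decoding picks at every step.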


### Extractive

In [59]:
def generate_extractive_summary(article, num_sentences=3):

    # Parse the input text
    parser = PlaintextParser.from_string(article, Tokenizer("english"))

    # Initialize the LexRank summarizer
    summarizer = LexRankSummarizer()

    # Select the num_sentences highest-ranked sentences
    summary = summarizer(parser.document, num_sentences)

    # Extract text from Sentence objects
    summary_text = " ".join([str(sentence) for sentence in summary])
    return summary_text

# Test the function with a sample article
extractive_summary = generate_extractive_summary(sample['article'])
print("Extractive Summary:", extractive_summary)

Extractive Summary: "As the Rome Statute today enters into force for the State of Palestine, Palestine acquires all the rights as well as responsibilities that come with being a State Party to the Statute. "What's objectionable is the attempts to undermine international justice, not Palestine's decision to join a treaty to which over 100 countries around the world are members." "As we have said repeatedly, we do not believe that Palestine is a state and therefore we do not believe that it is eligible to join the ICC," the State Department said in a statement.
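
To make the ranking step less of a black box, the sketch below shows the general idea behind LexRank: build a sentence-similarity graph, score sentences with a PageRank-style power iteration, and return the top sentences in document order. It is a simplified stand-in, not `sumy`'s implementation: it uses raw term-frequency cosine similarity and omits the IDF weighting used by LexRank proper, and the name `toy_lexrank` and the sample text are hypothetical.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def toy_lexrank(text, num_sentences=3, damping=0.85, iterations=50):
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    vectors = [Counter(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    if n == 0:
        return ""
    # Similarity graph: edge weight = cosine similarity between two sentences.
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    # Row-normalize so each row is a probability distribution over neighbours.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([v / total for v in row] if total else [1.0 / n] * n)
    # Power iteration for a PageRank-style centrality score.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n + damping * sum(scores[j] * rows[j][i] for j in range(n))
                  for i in range(n)]
    # Keep the top-scoring sentences, restored to document order.
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences])
    return " ".join(sentences[i] for i in top)

toy_text = ("The cat sat on the mat. The cat chased the mouse. "
            "Stock prices fell sharply today. The mouse hid under the mat.")
summary = toy_lexrank(toy_text, num_sentences=2)
print(summary)
```

The sentence about stock prices shares no words with the others, so it receives the lowest centrality and is left out.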


In [17]:
def generate_extractive_summary(article, num_sentences=3):
    # NOTE: this redefinition replaces the LexRank version above. LED is a
    # sequence-to-sequence model that generates text rather than selecting
    # sentences, so num_sentences is unused here.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Load the pretrained LED (Longformer Encoder-Decoder) checkpoint from the Hugging Face Hub.
    # The model is reloaded on every call, which is slow when looping over many articles.
    tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
    model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384").to(device)

    # Tokenize and prepare the article for the model
    inputs = tokenizer.encode(article, return_tensors="pt", max_length=512, truncation=True).to(device)

    # Perform summarization with beam search
    summary_ids = model.generate(
        inputs,
        max_length=100,  # Maximum length of the summary, in tokens
        min_length=30,   # Minimum length of the summary, in tokens
        num_beams=4,
        length_penalty=2.0,
        early_stopping=True
    )

    # Decode the summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return summary

# Example usage with a sample article
extractive_summary = generate_extractive_summary(articles[0]['article'][:1024], num_sentences=3)
print("Extractive Summary:", extractive_summary)

Input ids are automatically padded from 235 to 1024 to be a multiple of `config.attention_window`: 1024




In [19]:
# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Evaluate the abstractive summary
rouge_scores = scorer.score(articles[0]['highlights'], abstractive_summary)
print("ROUGE Scores (Abstractive):", rouge_scores)

# Evaluate the extractive summary
rouge_scores_extractive = scorer.score(articles[0]['highlights'], extractive_summary)
print("ROUGE Scores (Extractive):", rouge_scores_extractive)


ROUGE Scores (Abstractive): {'rouge1': Score(precision=0.07751937984496124, recall=0.23809523809523808, fmeasure=0.11695906432748539), 'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0), 'rougeL': Score(precision=0.05426356589147287, recall=0.16666666666666666, fmeasure=0.08187134502923976)}
ROUGE Scores (Extractive): {'rouge1': Score(precision=0.375, recall=0.7142857142857143, fmeasure=0.4918032786885245), 'rouge2': Score(precision=0.27848101265822783, recall=0.5365853658536586, fmeasure=0.36666666666666664), 'rougeL': Score(precision=0.3125, recall=0.5952380952380952, fmeasure=0.4098360655737705)}
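
To see where these numbers come from, ROUGE-1 can be computed by hand as clipped unigram overlap. For instance, a precision of about 0.0775 with a recall of about 0.238 would be consistent with 10 overlapping unigrams against a 129-token prediction and a 42-token reference. The sketch below is a simplification of what `rouge_score` does: it splits on whitespace and skips stemming, and the function name `rouge1` and the toy sentences are illustrative only.

```python
from collections import Counter

def rouge1(reference, prediction):
    """Unigram-overlap precision, recall and F1 (no stemming, whitespace tokens)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

reference = "the cat sat on the mat"
prediction = "the cat lay on a mat today"
precision, recall, f1 = rouge1(reference, prediction)
print(precision, recall, f1)
```

Here four unigrams overlap, so precision is 4/7 and recall is 4/6. A long, rambling prediction lowers precision even when it happens to mention some of the reference words.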


In [20]:
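def is_valid_input(article, max_tokens=1024):
    """
    Check whether the prompt built from the first 800 characters of the
    article fits within max_tokens.
    """
    prompt = f"Summarize the following article:\n\n{article[:800]}\n\nSummary:"
    # Encode without truncation: truncation=True would cap the length at the
    # tokenizer's maximum (1024 for GPT-2), so the default check could never fail
    inputs = tokenizer.encode(prompt, truncation=False)
    return len(inputs) <= max_tokens

def truncate_article(article, max_tokens=1024):
    """
    Shorten the article so the full prompt stays within max_tokens.
    """
    prompt_prefix = "Summarize the following article:\n\n"
    prompt_suffix = "\n\nSummary:"

    # Tokens left for the article after the fixed prompt text
    prompt_length = len(tokenizer.encode(prompt_prefix + prompt_suffix))
    available_space = max_tokens - prompt_length

    # Truncate article. The slice counts characters, not tokens; a GPT-2 token
    # of English text usually spans several characters, so this cuts well below the budget
    return article[:available_space]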
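
If the goal is to use the whole context window, the cut can be made on tokens instead of characters. The sketch below is one way to do that under stated assumptions: `truncate_to_token_budget` is a hypothetical helper, it takes the tokenizer's encode and decode functions as arguments, and it also reserves room for the tokens that will be generated. A whitespace tokenizer stands in for GPT-2 so the example runs without downloading a model.

```python
def truncate_to_token_budget(article, encode, decode, max_tokens=1024,
                             max_new_tokens=150,
                             prefix="Summarize the following article:\n\n",
                             suffix="\n\nSummary:"):
    """Trim `article` so prefix + article + suffix + generated tokens fit in max_tokens."""
    # Tokens used by the fixed prompt text (counted separately, so approximate).
    overhead = len(encode(prefix)) + len(encode(suffix))
    budget = max_tokens - max_new_tokens - overhead
    if budget <= 0:
        raise ValueError("no room left for the article")
    # Cut on token boundaries, then turn the kept tokens back into text.
    token_ids = encode(article)
    return decode(token_ids[:budget])

# Stand-in tokenizer for demonstration: one token per whitespace-separated word.
def toy_encode(text):
    return text.split()

def toy_decode(tokens):
    return " ".join(tokens)

long_article = "word " * 50
short_article = truncate_to_token_budget(long_article, toy_encode, toy_decode,
                                         max_tokens=20, max_new_tokens=5)
print(len(short_article.split()))
```

With the GPT-2 tokenizer loaded earlier, the same helper could be called with `encode=tokenizer.encode` and `decode=tokenizer.decode`. Decoding and re-encoding may shift the count by a token or two, so leaving a small safety margin is reasonable.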

In [21]:
articles

  'highlights': "Ferrari, Red Bull, McLaren and Mercedes receive lion's share of the £900million handed out .\nThe FIA powerless to act despite costs spiralling out of control .\nCash-strapped Lotus, Force India and Sauber have threatened to boycott the United States Grand Prix .",
  'id': 'c8d89e0b215a36f9713e560491a51d50e1b02c9e'},
 {'article': "The world's oldest primate - dating back 55 million years and small enough to fit in the palm of a hand - has been discovered by scientists. Named Archicebus achilles and smaller . than the pygmy mouse lemur, the tiniest living primate, it was unearthed . near the Yangtze River in central China. Even though it weighed under an ounce, the minute animal provides a vital missing link in human evolution. Its almost complete skeleton marks a pivotal moment on the tree of life when the branch leading to modern monkeys, apes and humans split from that of the nocturnal lemur-like tarsiers of south east Asia. Scroll down for video . Named Archicebus a

In [22]:
results = []

for i, article_data in enumerate(articles):  # Iterate through articles
    print(f"Processing article {i + 1}...")
    article = article_data['article']
    human_summary = article_data['highlights']
    
    # Validate and truncate article
    if not is_valid_input(article):
        print(f"Skipping sample {i + 1}: Input too long.")
        continue
    
    truncated_article = truncate_article(article)
    
    try:
        # Generate summaries
        abstractive_summary = generate_abstractive_summary(truncated_article)
        extractive_summary = generate_extractive_summary(truncated_article, num_sentences=3)
    except IndexError as e:
        # Skip samples whose generation fails, so no placeholder rows reach the results
        print(f"Error processing sample {i + 1}: {e}")
        continue

    # Evaluate summaries
    rouge_abstractive = scorer.score(human_summary, abstractive_summary)
    rouge_extractive = scorer.score(human_summary, extractive_summary)
    
    # Flatten ROUGE scores
    rouge_abstractive_flat = {f"abstractive_{k}": v.fmeasure for k, v in rouge_abstractive.items()}
    rouge_extractive_flat = {f"extractive_{k}": v.fmeasure for k, v in rouge_extractive.items()}
    
    # Store results
    results.append({
        "article": article,
        "human_summary": human_summary,
        "abstractive_summary": abstractive_summary,
        "extractive_summary": extractive_summary,
        **rouge_abstractive_flat,
        **rouge_extractive_flat
    })


df = pd.DataFrame(results)
# df.to_csv("gpt2_summarization_results.csv", index=False)
# print("Results saved to 'gpt2_summarization_results.csv'.")

Processing article 1...

Input ids are automatically padded from 231 to 1024 to be a multiple of `config.attention_window`: 1024

Processing article 2...
Processing article 3...
Processing article 4...
Processing article 5...
Processing article 6...
Processing article 7...
Processing article 8...
Processing article 9...
Processing article 10...
Processing article 11...
Processing article 12...
Processing article 13...
Processing article 14...
Processing article 15...
Processing article 16...
Processing article 17...
Processing article 18...
Processing article 19...
Processing article 20...
Processing article 21...
Processing article 22...
Processing article 23...
Processing article 24...
Processing article 25...
Processing article 26...
Processing article 27...
Processing article 28...
Processing article 29...
Processing article 30...
Processing article 31...
Processing article 32...
Processing article 33...
Processing article 34...
Processing article 35...
Processing article 36...
Processing article 37...
Processing article 38...
Processing article 39...
Processing article 40...
Processing article 41...
Processing article 42...
Processing article 43...
Processing article 44...
Processing article 45...
Processing article 46...
Processing article 47...
Processing article 48...
Processing article 49...
Processing article 50...
Processing article 51...
Processing article 52...
Processing article 53...
Processing article 54...
Processing article 55...
Processing article 56...
Processing article 57...
Processing article 58...
Processing article 59...
Processing article 60...
Processing article 61...
Processing article 62...
Processing article 63...
Processing article 64...
Processing article 65...
Processing article 66...
Processing article 67...
Processing article 68...
Processing article 69...
Processing article 70...
Processing article 71...
Processing article 72...
Processing article 73...
Processing article 74...
Processing article 75...
Processing article 76...
Processing article 77...
Processing article 78...
Processing article 79...
Processing article 80...
Processing article 81...
Processing article 82...
Processing article 83...
Processing article 84...
Processing article 85...
Processing article 86...
Processing article 87...
Processing article 88...
Processing article 89...
Processing article 90...
Processing article 91...
Processing article 92...
Processing article 93...
Processing article 94...
Processing article 95...
Processing article 96...
Processing article 97...
Processing article 98...
Processing article 99...
Processing article 100...


In [23]:
df

|  | article | human_summary | abstractive_summary | extractive_summary | abstractive_rouge1 | abstractive_rouge2 | abstractive_rougeL | extractive_rouge1 | extractive_rouge2 | extractive_rougeL |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bernie Ecclestone has urged Formula One's team... | Ferrari, Red Bull, McLaren and Mercedes receiv... | Eccliode has been in the news a lot lately for... | Bernie Ecclestone has urged Formula One's team... | 0.098160 | 0.000000 | 0.073620 | 0.491803 | 0.366667 | 0.409836 |
| 1 | The world's oldest primate - dating back 55 mi... | Archicebus achilles was found near the Yangtze... | The new prune is the oldest living tree in Nor... | The world's oldest primate - dating back 55 mi... | 0.157143 | 0.014493 | 0.114286 | 0.342342 | 0.128440 | 0.216216 |
| 2 | By . Daily Mail Reporter . Updated: . 01:46 ES... | Pat Taccetta says ‘justice will be done’ over ... | The killer, who is believed to be from the sam... | By. Daily Mail Reporter. Updated:. 01:46 EST, ... | 0.163934 | 0.000000 | 0.087432 | 0.253521 | 0.042857 | 0.084507 |
| 3 | By . John Greechan . Amateur champion Bradley ... | Bradley Neil achieved one of his Open ambition... | The following is a list of the top 10 Open pla... | By. John Greechan. Amateur champion Bradley Ne... | 0.088496 | 0.000000 | 0.088496 | 0.474227 | 0.273684 | 0.412371 |
| 4 | By . David Martosko, U.s. Political Editor . P... | Trump could face a crowded Republican primary ... | Donald J. Trump has been a major player in the... | By. David Martosko, U.s. Political Editor. PUB... | 0.162437 | 0.010256 | 0.131980 | 0.189781 | 0.029630 | 0.116788 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Beijing, China (CNN) -- The Chinese public, es... | Goldkorn: Web uproar focused on whether Li wil... | Li is charged in connection with the killing o... | Beijing, China (CNN) -- The Chinese public, es... | 0.112676 | 0.000000 | 0.084507 | 0.188034 | 0.069565 | 0.119658 |
| 96 | Andre Schurrle impressed on his debut for Wolf... | Andre Schurrle grabbed two assists on his Wolf... | Schürrles is a player who has been a key part ... | Andre Schurrle impressed on his debut for Wolf... | 0.175000 | 0.000000 | 0.125000 | 0.438596 | 0.160714 | 0.298246 |
| 97 | Philip Levine, a Pulitzer Prize-winning poet w... | One of the country's most decorated poets, Lev... | The following is a list of the most important ... | Philip Levine, a Pulitzer Prize-winning poet w... | 0.100719 | 0.014599 | 0.086331 | 0.476190 | 0.289655 | 0.326531 |
| 98 | (CNN) -- When it comes to deciding where the m... | The Lotus Formula One team says it prioritizes... | The Lotus F3 team has been in talks with Ferra... | (CNN) -- When it comes to deciding where the m... | 0.267857 | 0.036364 | 0.196429 | 0.507246 | 0.220588 | 0.289855 |
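
Per-row scores are easier to compare once they are averaged per metric. As a minimal sketch, the helper below averages every ROUGE column over a list of row dictionaries shaped like the entries appended to `results`. The name `mean_scores` is hypothetical, and the two example rows copy only the ROUGE-1 values of rows 0 and 98 from the table above, so the printed means are not the full-sample averages.

```python
from statistics import mean

def mean_scores(rows):
    """Average every ROUGE column across rows (a list of dicts)."""
    metric_keys = [k for k in rows[0]
                   if k.startswith(("abstractive_rouge", "extractive_rouge"))]
    return {k: mean(row[k] for row in rows) for k in metric_keys}

# Two rows taken from the table above (ROUGE-1 columns only), for illustration.
example_rows = [
    {"abstractive_summary": "...", "abstractive_rouge1": 0.098160, "extractive_rouge1": 0.491803},
    {"abstractive_summary": "...", "abstractive_rouge1": 0.267857, "extractive_rouge1": 0.507246},
]
averages = mean_scores(example_rows)
print(averages)
```

With the DataFrame itself, `df.filter(like="rouge").mean()` should give the same kind of per-column averages.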


In [25]:
# Use one variable for the path so the message always matches the file written
output_path = "summarization_model_results.csv"
df.to_csv(output_path, index=False)
print(f"Results saved to '{output_path}'.")

Results saved to 'summarization_model_results.csv'.
