# AML Project - Abstractive Text Summarization

Problem Statement: Given a news article, generate a two-to-three-sentence summary and a headline for the article.

The summary should be abstractive rather than extractive. In abstractive summarization, the model generates new sentences, so the sentences in the summary need not appear verbatim in the news article.
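To make the contrast concrete, here is a minimal sketch of an extractive baseline. The `lead_n_summary` helper and the sample article are hypothetical and not part of this project. The helper only copies the leading sentences of the article, so everything it returns appears verbatim in the input, which is exactly the constraint an abstractive model does not have.

```python
import re

def lead_n_summary(article: str, n: int = 1) -> str:
    """Naive extractive baseline: return the first n sentences verbatim."""
    # Split after sentence-ending punctuation followed by whitespace (a rough heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:n])

# Illustrative text only, not taken from the dataset.
article = (
    "Google has extended the warranty of its Pixel 2 phones to two years. "
    "This comes after users reported screen burn-in."
)
summary = lead_n_summary(article, n=1)
print(summary)
```

An abstractive model, by comparison, is free to paraphrase or compress the article into wording that never occurs in it.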

### Importing necessary packages

In [18]:
import numpy as np
import pandas as pd

# splitting the dataset
from sklearn.model_selection import train_test_split

# evaluation metric
from ignite.metrics import Rouge, RougeN, RougeL

# visualisation
# import plotly.express as px
# import plotly.graph_objects as go
# import plotly.offline as pyo
 
# pyo.init_notebook_mode() 

### Importing the Indian News Summary Dataset

In [19]:
df_headline = pd.read_csv('dataset/news_headline.csv', header=0)
df_headline

| | text | summary |
|---|---|---|
| 0 | Saurav Kant, an alumnus of upGrad and IIIT-B's... | upGrad learner switches to career in ML & Al w... |
| 1 | Kunal Shah's credit card bill payment platform... | Delhi techie wins free food from Swiggy for on... |
| 2 | New Zealand defeated India by 8 wickets in the... | New Zealand end Rohit Sharma-led India's 12-ma... |
| 3 | With Aegon Life iTerm Insurance plan, customer... | Aegon life iTerm insurance plan helps customer... |
| 4 | Speaking about the sexual harassment allegatio... | Have known Hirani for yrs, what if MeToo claim... |
| ... | ... | ... |
| 98396 | A CRPF jawan was on Tuesday axed to death with... | CRPF jawan axed to death by Maoists in Chhatti... |
| 98397 | 'Uff Yeh', the first song from the Sonakshi Si... | First song from Sonakshi Sinha's 'Noor' titled... |
| 98398 | According to reports, a new version of the 199... | 'The Matrix' film to get a reboot: Reports |
| 98399 | A new music video shows rapper Snoop Dogg aimi... | Snoop Dogg aims gun at clown dressed as Trump ... |


In [20]:
print(df_headline.shape)

(98401, 2)


### Splitting the dataset into training and test sets (80-20 split)

In [21]:
x_train, x_test, y_train, y_test = train_test_split(
    df_headline['text'],
    df_headline['summary'],
    test_size=0.2,
    random_state=25,
    shuffle=True,
)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

x_train_list, y_train_list = x_train.tolist(), y_train.tolist()
x_test_list, y_test_list = x_test.tolist(), y_test.tolist()

print(len(x_train_list), len(y_train_list))
print(len(x_test_list), len(y_test_list))

(78720,) (78720,)
(19681,) (19681,)
78720 78720
19681 19681
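The sizes above are consistent with how a fractional `test_size` is rounded. As a rough check, assuming the test set size is rounded up and the training set takes the remaining rows (which is how scikit-learn behaves when only `test_size` is given as a float), the arithmetic can be reproduced in plain Python:

```python
import math

n_samples = 98401   # rows in df_headline, from the shape printed earlier
test_size = 0.2

# Assumption: test size is ceil(test_size * n_samples); train gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Because 98401 is not divisible by 5, the split is not exactly 80-20: the test set receives the one extra row.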


### Importing the Pegasus model, tokenizer, trainer, and training arguments for fine-tuning

In [22]:
from transformers import PegasusForConditionalGeneration, PegasusTokenizerFast, Trainer, TrainingArguments
import torch
print(torch.__version__)
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(torch_device)

1.11.0
cpu


In [8]:
# %%time
# tokenizer_cnn = PegasusTokenizerFast.from_pretrained("google/pegasus-cnn_dailymail")
# model_cnn = PegasusForConditionalGeneration.from_pretrained("google/pegasus-cnn_dailymail").to(torch_device)

In [23]:
%%time
tokenizer_large = PegasusTokenizerFast.from_pretrained("google/pegasus-large")
model_large = PegasusForConditionalGeneration.from_pretrained("trained-model").to(torch_device)

CPU times: total: 14.5 s
Wall time: 24.1 s


In [24]:
# configuration of the tokenizer
tokenizer_large

PreTrainedTokenizerFast(name_or_path='google/pegasus-large', vocab_size=96103, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>', 'mask_token': '<mask_2>', 'additional_special_tokens': ['<mask_1>', '<unk_2>', '<unk_3>', '<unk_4>', '<unk_5>', '<unk_6>', '<unk_7>', '<unk_8>', '<unk_9>', '<unk_10>', '<unk_11>', '<unk_12>', '<unk_13>', '<unk_14>', '<unk_15>', '<unk_16>', '<unk_17>', '<unk_18>', '<unk_19>', '<unk_20>', '<unk_21>', '<unk_22>', '<unk_23>', '<unk_24>', '<unk_25>', '<unk_26>', '<unk_27>', '<unk_28>', '<unk_29>', '<unk_30>', '<unk_31>', '<unk_32>', '<unk_33>', '<unk_34>', '<unk_35>', '<unk_36>', '<unk_37>', '<unk_38>', '<unk_39>', '<unk_40>', '<unk_41>', '<unk_42>', '<unk_43>', '<unk_44>', '<unk_45>', '<unk_46>', '<unk_47>', '<unk_48>', '<unk_49>', '<unk_50>', '<unk_51>', '<unk_52>', '<unk_53>', '<unk_54>', '<unk_55>', '<unk_56>', '<unk_57>', '<unk_58>', '<unk_59>', '<u

In [25]:
# parameters of the model
# model_large

In [26]:
# number of trainable parameters
model_large_params = sum(p.numel() for p in model_large.parameters() if p.requires_grad)
print(model_large_params)

568699904
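To put this count in perspective, the sketch below estimates how much memory the weights alone occupy. It assumes every parameter is stored as a 32-bit float (PyTorch's default dtype). The notebook does not print the dtype, so treat this as an assumption. The estimate excludes activations, gradients, and optimizer state.

```python
n_params = 568_699_904      # trainable parameters reported above
bytes_per_param = 4         # assumption: float32 weights

weight_bytes = n_params * bytes_per_param
weight_gib = weight_bytes / 2**30

print(f"{weight_bytes:,} bytes = {weight_gib:.2f} GiB")
```

Fine-tuning typically needs several times this amount, since gradients and optimizer state are kept per parameter as well.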


### Testing the trained Pegasus model on our test data

In [27]:
# function to get the summary of a text or a list of texts
def get_summary(tokenizer, model, x):
    # truncate to the tokenizer's maximum length and pad to the longest input in the batch
    x_tokenized = tokenizer(x, truncation=True, padding=True, return_tensors="pt").to(torch_device)
    print("Input X tokenized. Generating Summary ...")
    # generate() returns token ids on the model's device, so no extra .to() is needed
    y_pred_tokenized = model.generate(**x_tokenized)
    print("Summary Generated. Decoding Summary ...")
    y_pred = tokenizer.batch_decode(y_pred_tokenized, skip_special_tokens=True)
    print("Summary Decoded.")
    return y_pred

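# function to calculate the ROUGE score
def calculate_rouge(m, y_pred, y):
    # ignite's ROUGE metrics expect one token list per candidate and,
    # for each candidate, a list of reference token lists (here a single reference)
    candidate = [i.split() for i in y_pred]
    reference = [[i.split()] for i in y]
    # print(candidate, reference)

    m.update((candidate, reference))

    return m.compute()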
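Tokenizing and generating for the whole test set in a single call pads every input to the longest article and may not fit in memory. A minimal sketch of one way around this is to process the texts in fixed-size chunks. The `summarize_in_batches` helper, its `batch_size` default, and the stand-in `fake_summarize` function are all hypothetical and only illustrate the chunking logic.

```python
from typing import Callable, List

def summarize_in_batches(
    summarize: Callable[[List[str]], List[str]],
    texts: List[str],
    batch_size: int = 8,
) -> List[str]:
    """Apply `summarize` to `texts` in chunks of `batch_size`, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    summaries: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        summaries.extend(summarize(batch))
    return summaries

# Stand-in summarizer for illustration only: keeps the first three words.
def fake_summarize(batch: List[str]) -> List[str]:
    return [" ".join(text.split()[:3]) for text in batch]

demo = summarize_in_batches(fake_summarize, ["a b c d", "e f g h", "i j"], batch_size=2)
print(demo)
```

With the objects defined in this notebook, the callable could be something like `lambda batch: get_summary(tokenizer_large, model_large, batch)`, applied to `x_test_list`.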

In [28]:
x_test_list[10]

"Technology giant Google has extended the warranty of its Pixel 2 and Pixel 2 XL smartphones to two years, its VP of Product Management Mario Queiroz announced. This comes after users reported Pixel 2 XL's screens turning grey, suggesting burn-in. Google will also add a mode for saturated colours via a software update on both devices to tackle the issue."

In [30]:
%%time
y_test_pred_pre = get_summary(tokenizer_large, model_large, x_test_list[10])
print(y_test_pred_pre)

Input X tokenized. Generating Summary ...
Summary Generated. Decoding Summary ...
Summary Decoded.
['Google extends warranty of Pixel 2 and Pixel 2 XL to 2 years']
CPU times: total: 10.5 s
Wall time: 5.54 s


In [13]:
y_test_list[10]

'Google ups Pixel 2, XL warranty amid screen burn-in reports'
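To see how the generated headline compares with the reference, here is a simplified, self-contained ROUGE-1 sketch based on whitespace tokens and clipped unigram counts. The `rouge1` function is a hypothetical illustration, not the `ignite.metrics` implementation imported above, and its scores may differ from ignite's.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Unigram-overlap precision, recall, and F1 on whitespace tokens."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"overlap": overlap, "precision": precision, "recall": recall, "f1": f1}

prediction = "Google extends warranty of Pixel 2 and Pixel 2 XL to 2 years"
reference = "Google ups Pixel 2, XL warranty amid screen burn-in reports"

scores = rouge1(prediction, reference)
print(scores)
```

Note that plain whitespace splitting treats `2,` and `2` as different tokens, so punctuation handling affects the score. The two headlines also convey overlapping information with different wording, which unigram overlap only partly captures.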