# Importing libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt

from summarizer import Summarizer, TransformerSummarizer
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

# Import the data

In [2]:
# Load only the two columns needed for the comparison
df = pd.read_csv('C:/Users/Admin/OneDrive/Desktop/Project/news_summary.csv', encoding='latin-1', usecols=['headlines', 'text'])

# Work on a small slice: `count` rows starting at row `from_i`
from_i = 10
count = 5
headlines = df['headlines']
headlines = headlines[from_i:from_i+count].to_list()
df = df['text']
df = pd.DataFrame(df[from_i:from_i+count])
df.reset_index(inplace=True, drop=True)
df

                                                text
0  India's food regulator Food Safety and Standar...
1  The mother of Harshit Sharma, the class 12 Cha...
2  Municipal Corporation of Gurugram on Wednesday...
3  Scientists, for the first time, successfully f...
4  A Union Minister of State for Home Affairs inf...


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5 non-null      object
dtypes: object(1)
memory usage: 168.0+ bytes


In [4]:
df['text'].head(10)

0    India's food regulator Food Safety and Standar...
1    The mother of Harshit Sharma, the class 12 Cha...
2    Municipal Corporation of Gurugram on Wednesday...
3    Scientists, for the first time, successfully f...
4    A Union Minister of State for Home Affairs inf...
Name: text, dtype: object

In [5]:
df['text'].str.len().max()

388

In [6]:
# Merge consecutive texts into longer blocks
max_length = 400  # keep appending texts until a block has at least this many characters
i = 0
bodies = []
while i < len(df):
    body = ""
    body_empty = True
    while (len(body) < max_length) and (i < len(df)):
        if body_empty:
            body = df.loc[i,'text']
            body_empty = False
        else:
            body += " " + df.loc[i,'text']
        i += 1
    bodies.append(body)
    print("Length of blocks =", len(body))
print(f"\nNumber of text blocks = {len(bodies)}\n")
print("Text blocks:\n", bodies)

Length of blocks = 704
Length of blocks = 743
Length of blocks = 388

Number of text blocks = 3

Text blocks:
 ['India\'s food regulator Food Safety and Standards Authority of India (FSSAI) is planning to create a network to collect leftover food and provide it to the needy. It is looking to connect with organisations which can collect, store and distribute leftover food from weddings and large parties. It further added that all food must meet the safety and hygiene standards. The mother of Harshit Sharma, the class 12 Chandigarh boy who got a hoax job offer call from Google, said that the incident "devastated" his life. He got a call, after which he shared the information with the school principal, who sent out a press release. Harshit is hospitalised since Google denied giving him a job, his mother added.', 'Municipal Corporation of Gurugram on Wednesday said that 19 out of 45 commercial building owners have decided to pay property tax instead of providing free parking to the public.
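
The merging loop above can also be written as a small standalone function, which is easier to try out on toy input without pandas. This is only a sketch under that assumption, not the code used in this notebook; the name `merge_into_blocks`, its `min_chars` parameter, and the sample strings are hypothetical.

```python
def merge_into_blocks(texts, min_chars=400):
    """Greedily join consecutive texts until each block has at least min_chars.

    The last block may be shorter if the input runs out first.
    """
    blocks = []
    current = []
    current_len = 0
    for text in texts:
        current.append(text)
        # Count the single space that will join this text to the previous one
        current_len += len(text) + (1 if len(current) > 1 else 0)
        if current_len >= min_chars:
            blocks.append(" ".join(current))
            current = []
            current_len = 0
    if current:
        # Leftover texts that never reached min_chars form the final block
        blocks.append(" ".join(current))
    return blocks


sample = ["a" * 6, "b" * 6, "c" * 12, "d" * 3]
blocks = merge_into_blocks(sample, min_chars=10)
print([len(b) for b in blocks])
```

As in the notebook output, the final block can fall below the threshold when there are no more texts left to append.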

# Text Summarizing 

In [7]:
min_length_text = 40  # passed to each summarizer as min_length (minimum sentence length in characters)
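
To give an intuition for what a minimum sentence length does, the sketch below filters candidate sentences by character count. It assumes that `min_length` acts as a length filter on candidate sentences, which is how the `summarizer` package appears to use it; the function `candidate_sentences`, the regex-based sentence split, and the sample text are hypothetical stand-ins, not the package's own implementation.

```python
import re


def candidate_sentences(text, min_length=40, max_length=600):
    # Naive split after ., ! or ? followed by whitespace.
    # Assumption: a real pipeline would use a proper sentence segmenter.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Keep only sentences whose character count lies strictly between the bounds
    return [s for s in sentences if min_length < len(s) < max_length]


text = "Short one. This sentence is comfortably longer than forty characters in total. Ok!"
print(candidate_sentences(text))
```

With a threshold of 40, very short sentences never become summary candidates, which matters for news texts that contain brief quotes or fragments.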

1. BERT Summarizing

BERT is a pre-trained model that is bidirectional by design. It can easily be fine-tuned to perform specific NLP tasks, summarization in our case.
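
The `summarizer` package used here is extractive: according to its documentation, it embeds each sentence, clusters the embeddings, and keeps the sentences closest to the cluster centroids. The sketch below illustrates only that selection idea, using a single centroid, cosine similarity, and plain word counts as a stand-in for BERT embeddings. All function names and the sample sentences are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    # Toy stand-in for a BERT sentence embedding: lowercase word counts
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest_to_centroid(sentences):
    vectors = [sentence_vector(s) for s in sentences]
    centroid = Counter()
    for vec in vectors:
        # The sum points in the same direction as the mean, which is all cosine needs
        centroid.update(vec)
    scores = [cosine(vec, centroid) for vec in vectors]
    # Return the sentence most similar to the centroid
    return sentences[scores.index(max(scores))]


sentences = [
    "FSSAI is planning to create a network to collect leftover food.",
    "It is looking to connect with organisations which can distribute leftover food.",
    "All food must meet the safety and hygiene standards.",
]
print(closest_to_centroid(sentences))
```

A real extractive summarizer would use several clusters and contextual embeddings, so the sentence chosen here is not expected to match the BERT result below.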


In [8]:
%%time
bert_summary = []
for i in range(len(bodies)):
    # The model is re-created on every iteration, so the timing below includes model loading
    bert_model = Summarizer()
    bert_summary.append(''.join(bert_model(bodies[i], min_length=min_length_text)))

Wall time: 1min 25s


2. GPT-2 Summarizing

GPT-2 was trained with a causal language modeling (CLM) objective and is thus capable of predicting the next token in a sequence. Using this capability, GPT-2 can generate syntactically coherent text. GPT-2 is capable of several tasks, including summarization, generation, and translation.
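
In symbols, causal language modeling factorizes the probability of a token sequence from left to right. The expression below is the standard autoregressive factorization, written with generic symbols ($x_t$ for the token at position $t$, $T$ for the sequence length) that are not taken from this notebook:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor is the "predict the next token" step: the model only sees the tokens to the left of position $t$.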

In [9]:
%%time
gpt_summary = []
for i in range(len(bodies)):
    # The model is re-created on every iteration, so the timing below includes model loading
    GPT2_model = TransformerSummarizer(transformer_type="GPT2", transformer_model_key="gpt2-medium")
    gpt_summary.append(''.join(GPT2_model(bodies[i], min_length=min_length_text)))

Some weights of GPT2Model were not initialized from the model checkpoint at gpt2-medium and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'h.12.attn.masked_bias', 'h.13.attn.masked_bias', 'h.14.attn.masked_bias', 'h.15.attn.masked_bias', 'h.16.attn.masked_bias', 'h.17.attn.masked_bias', 'h.18.attn.masked_bias', 'h.19.attn.masked_bias', 'h.20.attn.masked_bias', 'h.21.attn.masked_bias', 'h.22.attn.masked_bias', 'h.23.attn.masked_bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Wall time: 2min 36s


3. XLNet Summarizing 

XLNet is a generalized autoregressive (AR) language model that learns unsupervised representations of text sequences. It incorporates modelling techniques from autoencoding (AE) models such as BERT into AR models while avoiding the limitations of AE models.
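
The way XLNet combines the two approaches can be summarized by its permutation language modeling objective, as formulated in the XLNet paper. Here $\mathcal{Z}_T$ denotes the set of all permutations of the positions $1, \dots, T$, and $z_t$ is the $t$-th position in a sampled permutation $z$; the notation is reproduced as a reading aid, not as something this notebook computes:

```latex
\max_{\theta} \; \mathbb{E}_{z \sim \mathcal{Z}_T}
\left[ \sum_{t=1}^{T} \log p_{\theta}\!\left(x_{z_t} \mid x_{z_{<t}}\right) \right]
```

Because the factorization order is sampled, each token is predicted from context on both sides over the course of training, while every individual prediction remains autoregressive.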

In [10]:
%%time
xlnet_summary = []
for i in range(len(bodies)):
    # The model is re-created on every iteration, so the timing below includes model loading
    model = TransformerSummarizer(transformer_type="XLNet", transformer_model_key="xlnet-base-cased")
    xlnet_summary.append(''.join(model(bodies[i], min_length=min_length_text)))

Wall time: 34 s


# Results 

In [11]:
%%time
print("All Summarizing Results:\n")
for i in range(len(bodies)):
    print("ORIGINAL TEXT:")
    print(bodies[i])
    print("\nBERT Summarizing Result:")
    print(bert_summary[i])
    print("\nGPT-2 Summarizing Result:")
    print(gpt_summary[i])
    print("\nXLNet Summarizing Result:")
    print(xlnet_summary[i])
    print("\nOriginal headline:")
    print(headlines[i])
    print("\n\n")

All Summarizing Results:

ORIGINAL TEXT:
India's food regulator Food Safety and Standards Authority of India (FSSAI) is planning to create a network to collect leftover food and provide it to the needy. It is looking to connect with organisations which can collect, store and distribute leftover food from weddings and large parties. It further added that all food must meet the safety and hygiene standards. The mother of Harshit Sharma, the class 12 Chandigarh boy who got a hoax job offer call from Google, said that the incident "devastated" his life. He got a call, after which he shared the information with the school principal, who sent out a press release. Harshit is hospitalised since Google denied giving him a job, his mother added.

BERT Summarizing Result:
India's food regulator Food Safety and Standards Authority of India (FSSAI) is planning to create a network to collect leftover food and provide it to the needy.

GPT-2 Summarizing Result:
India's food regulator Food Safety and 