# Text Summarization

### The Goal
Text summarization aims to save time by quickly providing the essential information from longer articles, enabling individuals to read multiple texts more efficiently, ultimately leading to time savings in the long run.

### Approches
1. Extractive
   
    With this method, we rate each sentence in the document against all others, based on how well each line explains.

2. Abstractive

    This method constructs a one-of-a-kind summary by learning the most significant points from the original text.

In [12]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import datasets
import re
import contractions
from heapq import nlargest

import nltk
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

import transformers
from transformers import BartTokenizer, BartForConditionalGeneration
import torch
import rouge

from pprint import pprint

In [13]:
train_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="train[0:1%]")
validation_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="validation[0:1%]")
test_data = datasets.load_dataset("cnn_dailymail", "3.0.0", split="test[0:1%]")

In [14]:
train_df = pd.DataFrame(train_data).drop(['id'], axis=1)
train_df.head()

Unnamed: 0,article,highlights
0,"LONDON, England (Reuters) -- Harry Potter star...",Harry Potter star Daniel Radcliffe gets £20M f...
1,Editor's note: In our Behind the Scenes series...,Mentally ill inmates in Miami are housed on th...
2,"MINNEAPOLIS, Minnesota (CNN) -- Drivers who we...","NEW: ""I thought I was going to die,"" driver sa..."
3,WASHINGTON (CNN) -- Doctors removed five small...,"Five small polyps found during procedure; ""non..."
4,(CNN) -- The National Football League has ind...,"NEW: NFL chief, Atlanta Falcons owner critical..."


In [17]:
train_df['article_len'] = train_df['article'].map(len)
train_df['highlights_len'] = train_df['highlights'].map(len)

In [18]:
train_df.head()

Unnamed: 0,article,highlights,article_len,highlights_len
0,"LONDON, England (Reuters) -- Harry Potter star...",Harry Potter star Daniel Radcliffe gets £20M f...,2527,217
1,Editor's note: In our Behind the Scenes series...,Mentally ill inmates in Miami are housed on th...,4051,281
2,"MINNEAPOLIS, Minnesota (CNN) -- Drivers who we...","NEW: ""I thought I was going to die,"" driver sa...",3940,224
3,WASHINGTON (CNN) -- Doctors removed five small...,"Five small polyps found during procedure; ""non...",2620,185
4,(CNN) -- The National Football League has ind...,"NEW: NFL chief, Atlanta Falcons owner critical...",5764,273


In [19]:
mean_article_len = train_df['article_len'].mean()
mean_highlights_len = train_df['highlights_len'].mean()

In [20]:
print(f'mean_article_len = {mean_article_len}')
print(f'mean_highlights_len = {mean_highlights_len}')

mean_article_len = 3614.7586206896553
mean_highlights_len = 257.5071403692093


In [21]:
cnn_sample_id = 10
cnn_sample = test_data[cnn_sample_id]
ARTICLE_TO_SUMMARIZE = cnn_sample['article']
EXPECTED_SUMMARY = cnn_sample['highlights']
print("-" * 10)
print("ARTICLE_TO_SUMMARIZE")
print("-" * 10)
print(ARTICLE_TO_SUMMARIZE)
print("-" * 10)
print("EXPECTED_SUMMARY")
print("-" * 10)
print(EXPECTED_SUMMARY)

----------
ARTICLE_TO_SUMMARIZE
----------
London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said. Yahya Rashid, a UK national from northwest London, was detained at Luton airport on Tuesday after he arrived on a flight from Istanbul, police said. He's been charged with engaging in conduct in preparation of acts of terrorism, and with engaging in conduct with the intention of assisting others to commit acts of terrorism. Both charges relate to the period between November 1 and March 31. Rashid is due to appear in Westminster Magistrates' Court on Wednesday, police said. CNN's Lindsay Isaac contributed to this report.
----------
EXPECTED_SUMMARY
----------
London's Metropolitan Police say the man was arrested at Luton airport after landing on a flight from Istanbul .
He's been charged with terror offenses allegedly committed since the start of November .


In [22]:
text = cnn_sample['article']
expected_summary = cnn_sample['highlights']

## Approch# 1. Extractive

In [23]:
nlp = spacy.load("en_core_web_sm")

In [24]:
doc=nlp(text)
tokens=[token.text for token in doc]

In [25]:
word_freq={}
stop_words= list(STOP_WORDS)
punctuation =  punctuation + '\n'

Count the frequency of the words not present ing stop_words and punctuation

In [26]:
for word in doc:
   if word.text.lower() not in stop_words:
     if word.text.lower() not in punctuation:
       if word.text not in word_freq.keys():
         word_freq[word.text]= 1
       else:
         word_freq[word.text]+= 1 
pprint(word_freq)

{'1': 1,
 '19': 1,
 '31': 1,
 'Britain': 1,
 'CNN': 1,
 'CNN)A': 1,
 'Court': 1,
 'Isaac': 1,
 'Istanbul': 1,
 'Lindsay': 1,
 'London': 3,
 'Luton': 1,
 'Magistrates': 1,
 'March': 1,
 'Metropolitan': 1,
 'November': 1,
 'Police': 1,
 'Rashid': 2,
 'Tuesday': 1,
 'Turkey': 1,
 'UK': 1,
 'Wednesday': 2,
 'Westminster': 1,
 'Yahya': 1,
 'acts': 2,
 'airport': 1,
 'appear': 1,
 'arrested': 1,
 'arrived': 1,
 'assisting': 1,
 'charged': 2,
 'charges': 1,
 'commit': 1,
 'conduct': 2,
 'contributed': 1,
 'detained': 1,
 'engaging': 2,
 'flight': 1,
 'intention': 1,
 'man': 1,
 'national': 1,
 'northwest': 1,
 'offenses': 1,
 'old': 1,
 'period': 1,
 'police': 2,
 'preparation': 1,
 'relate': 1,
 'report': 1,
 'returned': 1,
 'said': 3,
 'terror': 1,
 'terrorism': 2,
 'year': 1}


In [27]:
max_freq = max(word_freq.values())
print(max_freq)

3


Normalize the word_req dict

In [28]:
for word in word_freq.keys():
  word_freq[word] = word_freq[word] / max_freq
pprint(word_freq)

{'1': 0.3333333333333333,
 '19': 0.3333333333333333,
 '31': 0.3333333333333333,
 'Britain': 0.3333333333333333,
 'CNN': 0.3333333333333333,
 'CNN)A': 0.3333333333333333,
 'Court': 0.3333333333333333,
 'Isaac': 0.3333333333333333,
 'Istanbul': 0.3333333333333333,
 'Lindsay': 0.3333333333333333,
 'London': 1.0,
 'Luton': 0.3333333333333333,
 'Magistrates': 0.3333333333333333,
 'March': 0.3333333333333333,
 'Metropolitan': 0.3333333333333333,
 'November': 0.3333333333333333,
 'Police': 0.3333333333333333,
 'Rashid': 0.6666666666666666,
 'Tuesday': 0.3333333333333333,
 'Turkey': 0.3333333333333333,
 'UK': 0.3333333333333333,
 'Wednesday': 0.6666666666666666,
 'Westminster': 0.3333333333333333,
 'Yahya': 0.3333333333333333,
 'acts': 0.6666666666666666,
 'airport': 0.3333333333333333,
 'appear': 0.3333333333333333,
 'arrested': 0.3333333333333333,
 'arrived': 0.3333333333333333,
 'assisting': 0.3333333333333333,
 'charged': 0.6666666666666666,
 'charges': 0.3333333333333333,
 'commit': 0.333

In [29]:
sent_tokens = [sent for sent in doc.sents]
pprint(sent_tokens)

[London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said.,
 Yahya Rashid, a UK national from northwest London, was detained at Luton airport on Tuesday after he arrived on a flight from Istanbul, police said.,
 He's been charged with engaging in conduct in preparation of acts of terrorism, and with engaging in conduct with the intention of assisting others to commit acts of terrorism.,
 Both charges relate to the period between November 1 and March 31.,
 Rashid is due to appear in Westminster Magistrates' Court on Wednesday, police said.,
 CNN's Lindsay Isaac contributed to this report.]


Score each sentence based on the score of each word appearing in the sentence

In [30]:
sent_score = {}
for sent in sent_tokens:
  for word in sent:
    if word.text.lower() in word_freq.keys():
      if sent not in sent_score.keys():
        sent_score[sent] = word_freq[word.text.lower()]
      else:
        sent_score[sent] += word_freq[word.text.lower()]  
pprint(sent_score)

{London (CNN)A 19-year-old man was charged Wednesday with terror offenses after he was arrested as he returned to Britain from Turkey, London's Metropolitan Police said.: 5.000000000000001,
 Yahya Rashid, a UK national from northwest London, was detained at Luton airport on Tuesday after he arrived on a flight from Istanbul, police said.: 3.6666666666666665,
 He's been charged with engaging in conduct in preparation of acts of terrorism, and with engaging in conduct with the intention of assisting others to commit acts of terrorism.: 7.333333333333333,
 Both charges relate to the period between November 1 and March 31.: 1.6666666666666665,
 Rashid is due to appear in Westminster Magistrates' Court on Wednesday, police said.: 2.0,
 CNN's Lindsay Isaac contributed to this report.: 0.6666666666666666}


Set the expected summary length from minimum of 
1. 30% of the length of the article
1. mean lenght of the summarized text from the dataset

In [34]:
summary_factor = 0.3
max_len = int(min(len(sent_score) * summary_factor, mean_highlights_len))

print(max_len)
summary = nlargest(n = max_len, iterable= sent_score, key = sent_score.get)
summary = [word.text for word in summary]
summary = " ".join(summary)
print(summary)

1
He's been charged with engaging in conduct in preparation of acts of terrorism, and with engaging in conduct with the intention of assisting others to commit acts of terrorism.


In [32]:
def evaluate_summary(expected_summary, generated_summary):
    rouge_score = rouge.Rouge()
    scores = rouge_score.get_scores(expected_summary, generated_summary, avg=True)
    #pprint(scores)
    score_1 = round(scores['rouge-1']['f'], 2)
    score_2 = round(scores['rouge-2']['f'], 2)
    score_L = round(scores['rouge-l']['f'], 2)
    print("rouge1:", score_1, "| rouge2:", score_2, "| rougeL:", score_2, 
          "--> avg rouge:", round(np.mean([score_1,score_2,score_L]), 2))

In [33]:
evaluate_summary(expected_summary, summary)

rouge1: 0.24 | rouge2: 0.11 | rougeL: 0.11 --> avg rouge: 0.2


## Approch# 2. Abstractive (Language model)

In this method we use a pre-trained Transformer model `facebook/bart-large-cnn` which is fine-tuned on the same dataset.

In [36]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

In [39]:
tokenized_input = tokenizer([ARTICLE_TO_SUMMARIZE], max_length=mean_article_len, truncation=True, return_tensors='pt')
summary_ids = model.generate(tokenized_input['input_ids'], max_length=mean_highlights_len, early_stopping=True)
summary = "".join([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids])
print(summary)

Yahya Rashid, a UK national from northwest London, was detained at Luton airport on Tuesday. He's been charged with engaging in conduct in preparation of acts of terrorism, police say. Rashid is due to appear in Westminster Magistrates' Court on Wednesday.


In [40]:
evaluate_summary(expected_summary, summary)

rouge1: 0.39 | rouge2: 0.14 | rougeL: 0.14 --> avg rouge: 0.29


# References
1. https://www.analyticsvidhya.com/blog/2021/10/text-summarization-using-the-conventional-hugging-face-transformer-and-cosine-similarity/
1. https://pub.towardsai.net/a-full-introduction-on-text-summarization-using-deep-learning-with-sample-code-ft-huggingface-d21e0336f50c
1. https://pub.towardsai.net/how-to-train-a-seq2seq-text-summarization-model-with-sample-code-ft-huggingface-pytorch-8ba97492f885