<a href="https://colab.research.google.com/github/jhakaran1/TextSummarizer/blob/main/TextSummarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Summarization with BART
This notebook uses pre-trained BART models from Hugging Face ([Hugging Face community](https://huggingface.co/docs/transformers/glossary)): first the base model for mask filling, then a checkpoint fine-tuned for summarization.

First, we download and install the requirements.


In [None]:
# Clone the transformers repository and install it from source
!git clone https://github.com/huggingface/transformers
!pip install -q ./transformers

Cloning into 'transformers'...
remote: Enumerating objects: 159050, done.
remote: Counting objects: 100% (1501/1501), done.
remote: Compressing objects: 100% (822/822), done.
remote: Total 159050 (delta 882), reused 1089 (delta 614), pack-reused 157549
Receiving objects: 100% (159050/159050), 160.11 MiB | 20.45 MiB/s, done.
Resolving deltas: 100% (119088/119088), done.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Building wheel for transformers (pyproject.toml) ... done

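Before moving on, it can help to confirm that the packages this notebook relies on are importable. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the package list is an assumption based on the imports used later in this notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Packages imported later in this notebook (assumed list).
required = ["transformers", "torch", "requests", "bs4"]
print("Missing packages:", missing_packages(required) or "none")
```

If anything is reported as missing, installing it with `pip` before running the remaining cells should avoid import errors further down.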

In [None]:
from transformers import BartForConditionalGeneration, BartTokenizer

# facebook/bart-base is a pre-trained BART model for mask filling
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")


First, let's see how BART handles mask filling.
We define an input sentence with one word masked for the model to predict.

In [None]:
masked_input_text = "I <mask> black coffee and white houses."

What are the top predictions by the model?

In [None]:
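# Tokenize the masked sentence into a batch of one.
input_ids = tokenizer([masked_input_text], return_tensors="pt")["input_ids"]
# Logits are unnormalized scores over the vocabulary for every input position.
logits = model(input_ids).logits
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(6)
probs = values.detach().tolist()
# Decode each id on its own; decoding them together and splitting on whitespace
# would fuse tokens that carry no leading space (for example "like" and "'m").
tokens = [tokenizer.decode([token_id]).strip() for token_id in predictions.tolist()]
preds = list(zip(tokens, probs))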

In [None]:
print('Token: Probability')
print('----------------------')
for token,probability in preds:
  print(f'{token} : {probability}')

Token: Probability
----------------------
love : 0.12353142350912094
like'm : 0.11104217171669006
grew : 0.09545877575874329
have : 0.0477190800011158
am : 0.04364927485585213
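To make the softmax and top-k steps used above more concrete, here is a minimal pure-Python sketch on made-up numbers. The vocabulary, the logits, and the helper names `softmax` and `top_k` are all illustrative assumptions; the real model scores its full vocabulary using PyTorch tensors.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k):
    """Return the k most probable (token, probability) pairs, best first."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy vocabulary and logits, made up for illustration only.
toy_vocab = ["love", "like", "hate", "am", "the"]
toy_logits = [2.0, 1.5, 0.5, 0.2, -1.0]

for token, prob in top_k(softmax(toy_logits), toy_vocab, k=3):
    print(f"{token} : {prob:.4f}")
```

Softmax preserves the ordering of the logits, so the highest-scoring token is also the most probable one.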


Now, let's complete the sentence.

In [None]:
batch = tokenizer([masked_input_text], return_tensors="pt")
generated_ids = model.generate(batch["input_ids"], max_length=100)
tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

'I love black coffee and white houses.'

Now, let's use BART to summarize some text.

In [None]:
torch_device = 'cpu'
# facebook/bart-large-cnn is a pre-trained BART model fine-tuned for text summarization
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
# Keep the model on the same device the inputs are moved to later
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn').to(torch_device)


In [None]:
def bart_summarize(text, num_beams, max_length, min_length, no_repeat_ngram_size):
  # Join lines with a space so words on either side of a newline do not fuse.
  text = text.replace('\n', ' ')
  # Truncate explicitly so the encoded input never exceeds 1024 tokens.
  text_input_ids = tokenizer([text], return_tensors='pt', max_length=1024, truncation=True)['input_ids'].to(torch_device)
  summary_ids = model.generate(text_input_ids, num_beams=int(num_beams), max_length=int(max_length), min_length=int(min_length), no_repeat_ngram_size=int(no_repeat_ngram_size))
  summary_txt = tokenizer.decode(summary_ids.squeeze(), skip_special_tokens=True)
  return summary_txt
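To make the `max_length=1024` argument above more concrete, here is a rough, library-free sketch of what truncating an encoded input involves. The helper `truncate_ids` is hypothetical, the token ids are made up, and the start and end ids (0 and 2) are assumptions for illustration; the real tokenizer handles all of this internally and may differ in detail.

```python
def truncate_ids(token_ids, max_length, bos_id=0, eos_id=2):
    """Keep at most `max_length` ids, counting the start and end markers."""
    body_budget = max_length - 2  # reserve room for the two special tokens
    return [bos_id] + token_ids[:body_budget] + [eos_id]

toy_ids = list(range(100, 110))  # ten made-up token ids
print(truncate_ids(toy_ids, max_length=6))  # prints [0, 100, 101, 102, 103, 2]
```

Anything beyond the budget is simply dropped, so for a very long article only the beginning reaches the model.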

In [None]:
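# Generation parameters
num_beams = 4             # beam search width
no_repeat_ngram_size = 3  # do not repeat any 3-gram within the summary
max_length = 1000         # upper bound on summary length, in tokens
min_length = 100          # lower bound on summary length, in tokens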
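The `no_repeat_ngram_size` setting stops the generated text from repeating any n-gram of that size. As a rough illustration of the idea (a sketch, not the library's actual implementation), the hypothetical helper `banned_next_tokens` below lists which tokens would complete an n-gram that already appears in the text generated so far. The example sentence is made up.

```python
def banned_next_tokens(generated, ngram_size):
    """Tokens that would complete an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size - 1:
        return set()
    # The last (n - 1) tokens form the prefix of the next possible n-gram.
    prefix = tuple(generated[len(generated) - (ngram_size - 1):])
    banned = set()
    for start in range(len(generated) - ngram_size + 1):
        ngram = tuple(generated[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned

tokens = ["the", "cat", "sat", "on", "the", "cat"]
print(banned_next_tokens(tokens, 3))  # prints {'sat'}
```

Here the text ends in "the cat", and "the cat sat" has already occurred, so "sat" is ruled out as the next token.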

Let's try to summarize a news article from NBC News.

In [None]:
import requests
from bs4 import BeautifulSoup

def fetch_article_text(headline_index=0):
  # Fetch the NBC News business page and collect its headline elements.
  nbc_business = "https://www.nbcnews.com/business"
  res = requests.get(nbc_business)
  res.raise_for_status()
  soup = BeautifulSoup(res.content, 'html.parser')

  # Each headline <span> sits directly inside the <a> tag that links to the article.
  headlines = soup.find_all('span', {'class': 'tease-card__headline'})
  url = headlines[headline_index].parent['href']
  res = requests.get(url)
  res.raise_for_status()
  soup = BeautifulSoup(res.content, 'html.parser')

  # The article body lives in a single <div>; return its text and the source URL.
  article = soup.find('div', {'class': 'article-body__content'})
  text = article.text
  return text, url
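The headline lookup above can be tried offline on a small, made-up HTML snippet. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages or network access. The class `HeadlineLinkParser`, the sample markup, and the example.com URLs are all hypothetical. Note that it accepts any enclosing `<a>` tag, which is slightly looser than the direct `.parent` lookup used above.

```python
from html.parser import HTMLParser

class HeadlineLinkParser(HTMLParser):
    """Collect the href of every <a> that wraps a headline <span>."""

    def __init__(self, headline_class):
        super().__init__()
        self.headline_class = headline_class
        self.open_links = []  # hrefs of the <a> tags currently open
        self.links = []       # hrefs found so far, in page order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.open_links.append(attrs.get("href"))
        elif tag == "span" and self.open_links:
            classes = (attrs.get("class") or "").split()
            if self.headline_class in classes:
                self.links.append(self.open_links[-1])

    def handle_endtag(self, tag):
        if tag == "a" and self.open_links:
            self.open_links.pop()

# Made-up markup that mimics the structure the scraper expects.
sample_html = """
<div>
  <a href="https://example.com/story-1"><span class="tease-card__headline">First story</span></a>
  <a href="https://example.com/story-2"><span class="tease-card__headline">Second story</span></a>
  <a href="https://example.com/other"><span class="byline">Not a headline</span></a>
</div>
"""

parser = HeadlineLinkParser("tease-card__headline")
parser.feed(sample_html)
print(parser.links)
```

Scrapers like this depend on the site's current markup, so the class names may need updating if the page layout changes.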

And now, let's generate a summary.

In [None]:
text,url = fetch_article_text()

In [None]:
bart_summarize(text, num_beams, max_length, min_length, no_repeat_ngram_size)


'The United Auto Workers are negotiating with the "Big Three" U.S. automakers. The current UAW contracts expire at 11:59 p.m. next Thursday. A strike at one, two or all three automakers could happen at any time from next Friday onward. The UAW is seeking a 40% wage hike over four years (amounting to 46% compounded), along with cost-of-living increases; beefed-up retirement benefits, including pensions on par with what autoworkers previously received.'