<a href="https://colab.research.google.com/github/mattignal/article-summary-details/blob/main/Article_Summary_Details.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>




# Summarization Exercise with Two Articles

Can we quickly produce useful abstracts and key details?



In [1]:
# Requirements (comment these out once they are installed)

!pip install newspaper3k
!pip install transformers # > 2.2.0
!pip install bert-extractive-summarizer
!pip install spacy
!pip install neuralcoref
!python -m spacy download en_core_web_md
!pip install sentencepiece


In [2]:
# Import Statements
import math
import numpy as np
import re
from newspaper import Article
from textwrap import TextWrapper
from summarizer import Summarizer
from spacy.lang.en import English
from transformers import pipeline, BartTokenizer, BartForConditionalGeneration, BartConfig
import torch
import nltk
nltk.download('punkt')

wrapper = TextWrapper(width=80)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## Get Articles
We'll use the newspaper3k library to quickly pull in article details.

In [3]:
def get_article(url):
  """Download and parse an article; return it with its title, authors, and date"""
  article = Article(url)
  article.download()
  article.parse()
  title = article.title
  authors = article.authors
  # publish_date is None when newspaper3k cannot find a date
  date = str(article.publish_date.date()) if article.publish_date else "Unknown"
  print("Title:", wrapper.fill(title))
  print("Authors:", str(authors))
  print("Date Published:", date)
  print("Number of characters:", len(article.text))
  return article, title, authors, date

In [21]:
# Crawl URLs with `newspaper3k`
short_article_url = "https://apnews.com/article/joe-biden-us-news-afghanistan-taliban-28143e059b2f07fed62f9404b58a982c" #@param {type: 'string'}
medium_article_url = "https://monthlyreview.org/2009/04/01/the-credit-crisis-is-the-international-role-of-the-dollar-at-stake/" #@param {type: 'string'}

In [22]:
article, title, authors, date = get_article(short_article_url)

Title: Biden seems ready to extend US troop presence in Afghanistan
Authors: ['Robert Burns']
Date Published: 2021-04-08
Number of characters: 6464


In [23]:
# alternatively, you can input your own text
# (create_paragraphs reads article.text, so `article` needs a .text attribute)

# article = ""
# title = ""
# authors = ""
# date = ""

## Splitting Articles into Paragraphs
We'll need to do some additional cleaning and prepare the data so we can summarize key details. Splitting the text into paragraphs will help accomplish this task.

In [24]:
def create_paragraphs(article):
  """Buckets into paragraphs for analysis"""
  paragraphs = article.text.split('\n\n')
  paragraphs = [x for x in paragraphs if len(x) > 100] # keep paragraphs over 100 characters (shorter ones are assumed to be headings or irrelevant)
  print("Paragraphs:")
  for index, paragraph in enumerate(paragraphs): # print numbered paragraphs
    print("{}: {}".format(index + 1, paragraph))

  print("\nYou can remove any paragraphs you deem unfit by running \n    "
  "paragraphs = drop_paragraphs(paragraphs, list_to_drop)"
  "\nwhere list_to_drop is a list of the above numbers.")

  return paragraphs

def drop_paragraphs(paragraphs, list_to_drop):
  """Removes the paragraphs whose 1-based numbers are in list_to_drop (modifies the list in place)"""
  for i in sorted(list_to_drop, reverse=True):
    del paragraphs[i - 1]
  return paragraphs

In [25]:
paragraphs = create_paragraphs(article)

Paragraphs:
1: FILE - In this Nov. 28, 2019, file photo armed soldiers stand guard in the motorcade for President Donald Trump speaks during a surprise Thanksgiving Day visit to the troops at Bagram Air Field, Afghanistan. Without coming right out and saying it, President Joe Biden seems ready to let lapse a May 1 deadline for completing a withdrawal of U.S. troops from Afghanistan. Orderly withdrawals take time, and Biden is running out of it. (AP Photo/Alex Brandon, File)
2: FILE - In this Nov. 28, 2019, file photo armed soldiers stand guard in the motorcade for President Donald Trump speaks during a surprise Thanksgiving Day visit to the troops at Bagram Air Field, Afghanistan. Without coming right out and saying it, President Joe Biden seems ready to let lapse a May 1 deadline for completing a withdrawal of U.S. troops from Afghanistan. Orderly withdrawals take time, and Biden is running out of it. (AP Photo/Alex Brandon, File)
3: WASHINGTON (AP) — Without coming right out and sayi

In [26]:
paragraphs = drop_paragraphs(paragraphs, [1, 2, 19]) # photo details and correction
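
As a rough illustration of the two helpers above, the sketch below applies the same steps to a made-up string: split on blank lines, keep blocks longer than a threshold, then delete by 1-based number. The sample text, the `min_chars` parameter, and the function names are assumptions for this sketch; the notebook hard-codes a 100-character threshold.

```python
def split_paragraphs(text, min_chars=100):
    """Split on blank lines and keep only blocks longer than min_chars."""
    return [p for p in text.split("\n\n") if len(p) > min_chars]

def drop_by_number(paragraphs, numbers_to_drop):
    """Delete paragraphs by 1-based number, highest first, so earlier
    deletions do not shift the positions of later ones."""
    for number in sorted(numbers_to_drop, reverse=True):
        del paragraphs[number - 1]
    return paragraphs

# Hypothetical input: a caption, a heading, and two body paragraphs
sample_text = "Photo caption\n\nHeading\n\nFirst body paragraph.\n\nSecond body paragraph."
kept = split_paragraphs(sample_text, min_chars=10)  # small threshold for the toy text
kept = drop_by_number(kept, [1])                    # remove the caption
print(kept)
```

Deleting from the highest number down is what lets the list of numbers refer to the printed positions without adjustment.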

## Models and Tokenizers
We'll use BART (Bidirectional and Auto-Regressive Transformers) for this task because it performs well on summarization. According to the docs:

> BART uses a standard seq2seq/machine translation architecture with a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT). 

> The pretraining task involves randomly shuffling the order of the original sentences and a novel in-filling scheme, where spans of text are replaced with a single mask token.

> BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa with comparable training resources on GLUE and SQuAD, achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 6 ROUGE.

Here we will use BART-CNN, which has been fine-tuned on the CNN/Daily Mail summarization dataset.
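
The two pretraining corruptions mentioned in the quoted docs (sentence shuffling and span in-filling) can be pictured with a toy example. This is only an illustration on word lists, not BART's actual pretraining code: the real procedure samples span positions and lengths at random, whereas here they are fixed, and `permute_sentences`, `infill_span`, and the sample text are made up for the sketch.

```python
import random

MASK = "<mask>"

def permute_sentences(sentences, seed=0):
    """Return the sentences in a shuffled order (same items, new order)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def infill_span(tokens, start, length):
    """Replace tokens[start:start+length] with a single mask token."""
    return tokens[:start] + [MASK] + tokens[start + length:]

sentences = ["Rates rose.", "Markets fell.", "The dollar gained."]
tokens = "the dollar gained ground after the crisis".split()

noisy_order = permute_sentences(sentences)
noisy_tokens = infill_span(tokens, start=2, length=3)
print(noisy_order)
print(noisy_tokens)  # ['the', 'dollar', '<mask>', 'the', 'crisis']
```

Because a whole span collapses into one mask token, the model has to infer how many tokens are missing as well as what they are.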

In [10]:
# initialize BART-CNN
cnn_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
cnn_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

We'll need to create an abstract:

In [61]:
def create_abstract(paragraphs, title, authors, date):
  article_cleaned = " ".join(paragraphs)
  inputs = cnn_tokenizer([article_cleaned], max_length=1024, truncation=True, # limited to first 1024 tokens
                         return_tensors='pt')
  summary_ids = cnn_model.generate(inputs['input_ids'], num_return_sequences=1,
                                  early_stopping=True, num_beams=3,
                                  min_length=80, max_length=120, 
                                  do_sample=False)
  abstract = cnn_tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
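  # the model sometimes gets the author's name wrong; substitute the first listed author
  # (the final period is escaped so the space before the next sentence is kept)
  abstract = re.sub(r", writes ([^\s]+) ([^\s.]+)\.", ", writes {}.".format(authors[0]), abstract)
  abstract = re.sub(r", says ([^\s]+) ([^\s.]+)", "", abstract) # drop ", says <name>" attributions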
  print("Title:", wrapper.fill(title))
  print("Authors:", wrapper.fill(str(authors)))
  print("Date:", wrapper.fill(date))
  print("\nAbstract:", wrapper.fill(abstract))

In [28]:
create_abstract(paragraphs, title, authors, date)

Title: Biden seems ready to extend US troop presence in Afghanistan
Authors: ['Robert Burns']
Date: 2021-04-08

Abstract: President Joe Biden seems ready to let lapse a May 1 deadline for completing a
withdrawal of U.S. troops from Afghanistan. Removing all of the troops and their
equipment in the next three weeks would be difficult logistically. If the troops
stay, Afghanistan will become Biden’s war. His decisions, now and in coming
months, could determine the legacy of a 2001 U.N. invasion.
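
The byline clean-up in `create_abstract` can be tried in isolation. The sketch below uses a variant of the notebook's pattern in which the final period is escaped and excluded from the second name word, so the space before the next sentence survives; that variant, the `fix_byline` helper, and the wrongly attributed name "John Smith" are assumptions made for illustration.

```python
import re

# Matches ", writes <First> <Last>." and stops at the sentence-ending period
WRITES = re.compile(r", writes ([^\s]+) ([^\s.]+)\.")

def fix_byline(text, author):
    """Replace ', writes <First> <Last>.' with the known author name."""
    return WRITES.sub(", writes {}.".format(author), text)

generated = "Dollar hegemony is under threat, writes John Smith. The crisis deepened."
fixed = fix_byline(generated, "Ramaa Vasudevan")
print(fixed)
```

The pattern only handles two-word names; a byline with a middle initial would need a looser expression.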


In [66]:
def chunk_paragraphs(paragraphs, granularity=2):
  """Chunks paragraphs into start, end, and then a series of middle paragraphs
  param granularity: controls level of detail, more granularity may mean more paragraphs to process
  """
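  if len(paragraphs) >= 6:
    block_off = 2
  elif len(paragraphs) >= 4:
    block_off = 1
  else:
    # too short to split into start, middle, and end: treat the whole text as one chunk
    return [" ".join(paragraphs)]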

  middle = paragraphs[block_off:-block_off]

  lengths = []
  chunks = []
  paragraphs_to_chunk = []
  present_length = 0
  for paragraph in middle:
    inputs = cnn_tokenizer([paragraph], return_tensors='pt', truncation=True)
    length = len(inputs['input_ids'][0])
    lengths.append(length)
    avg_length = np.mean(lengths)
    present_length += length
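    if present_length > 1024:
      # this paragraph would overflow the 1024-token window:
      # close the current chunk and start a new one with this paragraph
      if paragraphs_to_chunk:
        chunks.append(paragraphs_to_chunk)
      paragraphs_to_chunk = [paragraph]
      present_length = length
    elif present_length >= 1024 - avg_length*granularity:
      paragraphs_to_chunk.append(paragraph)
      chunks.append(paragraphs_to_chunk)
      paragraphs_to_chunk = []
      present_length = 0
    else:
      paragraphs_to_chunk.append(paragraph)

  # keep any paragraphs still pending after the loop as a final middle chunk
  if paragraphs_to_chunk or len(chunks) == 0:
    chunks.append(paragraphs_to_chunk)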

  start_chunks = " ".join(paragraphs[:block_off])

  last_chunk = " ".join(chunks[-1])
  end_chunks = " ".join(paragraphs[-block_off:])
  inputs = cnn_tokenizer([last_chunk], return_tensors='pt', truncation=True)
  lc_length = len(inputs['input_ids'][0])
  inputs = cnn_tokenizer([end_chunks], return_tensors='pt', truncation=True)
  ec_length = len(inputs['input_ids'][0])
  if lc_length + ec_length <= 1024 and ec_length <= 1024 - avg_length*granularity:
    # print("Adding final 'middle chunk' to the end of article chunk.")
    end_chunks = last_chunk + " " + end_chunks
    chunks = chunks[:-1]

  if len(chunks) > 0:
    chunks = [" ".join(x) for x in chunks] # paragraphs already end in punctuation, so join with a space
    paragraph_chunks = [start_chunks] + chunks + [end_chunks]
  else:
    paragraph_chunks = [start_chunks] + [end_chunks]

  return paragraph_chunks

In [30]:
paragraph_chunks = chunk_paragraphs(paragraphs, granularity=10)
print("Number of chunks:", len(paragraph_chunks)) # I want three chunks to summarize, granularity 10 seems to work

Adding final 'middle chunk' to the end of article chunk.
Number of chunks: 3
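
The core idea behind `chunk_paragraphs` is to pack consecutive paragraphs into groups that fit the model's 1024-token window. The sketch below is a simplified version of that idea, not the function above: it ignores the separate start and end blocks and the `1024 - avg_length*granularity` threshold that closes chunks early, and it counts whitespace-separated words instead of BART tokens. `count_tokens`, `greedy_chunks`, and the tiny budget are assumptions for the example.

```python
def count_tokens(text):
    """Whitespace word count, standing in for the BART tokenizer (assumption)."""
    return len(text.split())

def greedy_chunks(paragraphs, budget=1024):
    """Group consecutive paragraphs so each group stays within the token budget."""
    chunks, current, used = [], [], 0
    for paragraph in paragraphs:
        length = count_tokens(paragraph)
        if current and used + length > budget:
            chunks.append(" ".join(current))  # close the full chunk
            current, used = [], 0
        current.append(paragraph)
        used += length
    if current:
        chunks.append(" ".join(current))      # keep the remainder
    return chunks

toy_paragraphs = ["a b c", "d e", "f g h i", "j"]
toy_chunks = greedy_chunks(toy_paragraphs, budget=5)
print(toy_chunks)  # ['a b c d e', 'f g h i j']
```

In the notebook, a larger `granularity` lowers the closing threshold, so chunks are smaller and more numerous.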


And get key details:

In [79]:
def key_details(paragraph_chunks):
  """Gets key ideas and details from each part of article"""
  details_list = []
  for chunk in paragraph_chunks:
    inputs = cnn_tokenizer([chunk], max_length=1024, return_tensors='pt', truncation=True)
    summary_ids = cnn_model.generate(inputs['input_ids'], num_return_sequences=1, output_scores=False, 
                                  early_stopping=True, num_beams=3, length_penalty=0.2)
    key_idea = [cnn_tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=True) for g in summary_ids]
    key_idea = re.sub(r'(\"[^\"]*)\" [A-Z]', r'\1." ', key_idea[0]) # second quotation(") fix
    key_idea = key_idea.replace('. “', '.“') # additional quotation fix
    key_idea = re.sub(r", writes ([^\s]+) ([^\s.]+)\.", ".", key_idea) # drop ", writes <name>" attributions
    key_idea = re.sub(r", says ([^\s]+) ([^\s.]+)\.", ".", key_idea) # drop ", says <name>" attributions
    key_idea = key_idea.replace(", he says", "")
    key_idea = re.sub(r"\. \b[A-Z].*?\b\: ", ". ", key_idea) # drop "Speaker: " prefixes at sentence starts
    nlp = English() # spaCy 2.x API; on spaCy 3.x use nlp.add_pipe('sentencizer')
    nlp.add_pipe(nlp.create_pipe('sentencizer'))
    key_idea = [sent.text.strip() for sent in nlp(key_idea).sents][0] # keep only the first sentence
    print(100 * '-')
    print("Key Idea:", wrapper.fill(key_idea))
    summary_ids = cnn_model.generate(inputs['input_ids'], num_return_sequences=1,
                                  early_stopping=True, num_beams=2, 
                                  min_length=80, 
                                  max_length=160, 
                                  do_sample=False)
    details = [cnn_tokenizer.decode(g, skip_special_tokens=True, 
                                    clean_up_tokenization_spaces=True) for g in summary_ids][0]

    # if the key idea is present in the details, let's first look for an alternative generation
    if key_idea in details:
      summary_ids = cnn_model.generate(inputs['input_ids'], num_return_sequences=1,
                                  early_stopping=True, num_beams=2, 
                                  min_length=80, 
                                  max_length=160, 
                                  top_p = 0.8,
                                  do_sample=True)
      details_alt = [cnn_tokenizer.decode(g, skip_special_tokens=True, 
                                           clean_up_tokenization_spaces=True) for g in summary_ids]
      
      # if the key idea is present in the alternative, just use the original and remove the key idea
      if key_idea in details_alt[0]:
        details = details.replace(key_idea, "")
      else:
        details = details_alt[0]
    
    details = re.sub(r", writes ([^\s]+) ([^\s.]+)\.", ".", details) # drop ", writes <name>" attributions
    details = re.sub(r", says ([^\s]+) ([^\s.]+)\.", ".", details) # drop ", says <name>" attributions
    details = details.replace(", he says", "")
    details = re.sub(r"\. \b[A-Z].*?\b\: ", ". ", details) # drop "Speaker: " prefixes at sentence starts
    print("\nDetails:", wrapper.fill(details))
    details_list.append(details)
  return details_list
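
The de-duplication step in `key_details` (avoid repeating the key idea inside the details) reduces to a three-way choice. The sketch below isolates that choice as a pure function. Unlike the notebook, which only generates the sampled alternative when it is needed, this version takes both candidate texts as arguments; `choose_details` and the sample strings are made up for illustration.

```python
def choose_details(key_idea, details, details_alt):
    """Pick a details text that does not repeat the key idea verbatim."""
    if key_idea not in details:
        return details                    # no overlap, keep the first generation
    if key_idea not in details_alt:
        return details_alt                # the sampled alternative avoids the repeat
    return details.replace(key_idea, "")  # both repeat it: strip it from the original

key = "Rates rose."
print(choose_details(key, "Rates rose. Markets fell.", "Markets fell after the hike."))
print(choose_details(key, "Rates rose. Markets fell.", "Rates rose. Stocks slid."))
```

The check is a plain substring test, so a lightly paraphrased key idea would not be caught.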

In [40]:
details_list = key_details(paragraph_chunks)

----------------------------------------------------------------------------------------------------
Key Idea: Biden has inched so close to the deadline that his indecision amounts almost to
a decision to put off a pullout of the remaining 2,500 troops.

Details: Biden has inched so close to the deadline that his indecision amounts almost to
a decision to put off a pullout. Removing all of the troops and their equipment
in the next three weeks would be difficult logistically. Biden himself suggested
in late March that the U.S. should hold off on pulling out all of its 2,500
troops until after May 1. The U.N. Security Council has set a May 1 deadline for
completing a withdrawal.
----------------------------------------------------------------------------------------------------
Key Idea: “It’s going to be hard to meet the May 1 deadline,” retired Navy admiral James
Stavridis says.

Details: Biden is under pressure to extend the deadline to get out of Afghanistan. A
retired Navy admiral 

This seems to work well! Let's try it on our longer, more complex article.

In [56]:
article, title, authors, date = get_article(medium_article_url)
paragraphs = create_paragraphs(article)

Title: The Credit Crisis: Is the International Role of the Dollar at Stake?
Authors: ['Ramaa Vasudevan', 'The Editors', 'Hannah Holleman', 'Inger L. Stole', 'John Bellamy Foster', 'Robert W. Mcchesney', 'Simten Cosar', 'Metin Yegenoglu', 'Martin Hart-Landsberg']
Date Published: 2009-04-01
Number of characters: 21030
Paragraphs:
1: Ramaa Vasudevan teaches economics at Colorado State University. She is a member of the Union for Radical Political Economics and the Dollars and Sense collective.
2: As the first tremors of the looming financial crisis ripped through Wall Street, with the meltdown of the subprime mortgage market in the summer of 2007, the dollar plunged sharply. Perversely however, even as some financial pundits were foretelling its collapse, the deepening of the crisis following the bankruptcy of Lehman Brothers in September 2008 actually saw the dollar gain ground sharply (for the first time since the steady decline that began in 2002; see chart 1).
3: Chart 1. Nominal majo

In [57]:
paragraphs = drop_paragraphs(paragraphs, [1, 3, 9, 14]) # author + charts

In [58]:
create_abstract(paragraphs, title, authors, date)
paragraph_chunks = chunk_paragraphs(paragraphs, granularity=2)
# print("Number of chunks:", len(paragraph_chunks))
details_list = key_details(paragraph_chunks)

Title: The Credit Crisis: Is the International Role of the Dollar at Stake?
Authors: ['Ramaa Vasudevan', 'The Editors', 'Hannah Holleman', 'Inger L. Stole', 'John
Bellamy Foster', 'Robert W. Mcchesney', 'Simten Cosar', 'Metin Yegenoglu',
'Martin Hart-Landsberg']
Date: 2009-04-01

Abstract: U.S. dollar plunged sharply after the subprime mortgage meltdown in 2007. But
the deepening of the crisis following the bankruptcy of Lehman Brothers in
September 2008 saw the dollar gain ground sharply. The privileged role of the
dollar as international money has been critical to U.S imperialist hegemony. The
implosion of the financial system has threatened the foundation of dollar
hegemony, writes Ramaa Vasudevan.The current crisis is thus also potentially a
crisis ofdollar hegemony.
----------------------------------------------------------------------------------------------------
Key Idea: Dollar plunged in 2007 after the meltdown of the subprime mortgage market.

Details: The dollar plunged sha

This looks good as well. Can we generate a summary of the important information?

In [77]:
def generate_summary(details_list, authors, granularity=2):
  """Re-chunks the details and summarizes each chunk into one paragraph of the final summary"""
  paragraph_chunks = chunk_paragraphs(details_list, granularity=granularity)
  # print("Number of chunks:", len(paragraph_chunks))
  print("Summary:\n")
  for chunk in paragraph_chunks:
    inputs = cnn_tokenizer([chunk], max_length=1024, truncation=True, # limited to first 1024 tokens
                          return_tensors='pt')
    summary_ids = cnn_model.generate(inputs['input_ids'], num_return_sequences=1,
                                    early_stopping=True, num_beams=3,
                                    min_length=80, max_length=120, 
                                    do_sample=False)
    summary = cnn_tokenizer.decode(summary_ids[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)
    # the model sometimes gets the author's name wrong; substitute the first listed author
    summary = re.sub(r", writes ([^\s]+) ([^\s.]+)\.", ", writes {}.".format(authors[0]), summary)
    summary = re.sub(r", says ([^\s]+) ([^\s.]+)", "", summary) # drop ", says <name>" attributions
    summary = re.sub(r"\. \b[A-Z].*?\b\: ", ". ", summary) # drop "Speaker: " prefixes at sentence starts
    print(wrapper.fill(summary), "\n")



In [78]:
generate_summary(details_list, authors)

Summary:

The dollar plunged sharply in 2007 after the meltdown of the subprime mortgage
market. But the deepening of the crisis following the bankruptcy of Lehman
Brothers in September 2008 saw the dollar gain ground. Bretton Woods
negotiations at the end of the Second World War paved the way for establishing
the dominance of the dollar as international money. The implosion of the
financial system has threatened the foundation of dollar hegemony. 

The privileged role of the dollar provided the U.S. with an international line
of credit. In these circumstances the strategy that the Fed has adopted to
arrest the downward spiral of asset prices is to foment inflation.  China has in
a sense been locked into dollar holdings because selling off would precipitate a
crash. 
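
The overall flow of the notebook is a two-stage summary: summarize each chunk into details, then summarize groups of those details. The sketch below shows that shape with a stand-in summarizer so it runs without a model. `first_sentence` is a placeholder for the BART call, and grouping by a fixed `group_size` is an assumption; the notebook regroups the details by token budget with `chunk_paragraphs`.

```python
def first_sentence(text):
    """Stand-in summarizer (assumption): keep everything up to the first period."""
    head, _, _ = text.partition(". ")
    return head if head.endswith(".") else head + "."

def two_stage_summary(chunks, summarize, group_size=2):
    """Summarize each chunk, then summarize groups of those summaries."""
    details = [summarize(chunk) for chunk in chunks]
    groups = [" ".join(details[i:i + group_size])
              for i in range(0, len(details), group_size)]
    return [summarize(group) for group in groups]

toy_chunks = ["A rose. It fell later.", "B held. It moved little.", "C dropped. It recovered."]
summary_paragraphs = two_stage_summary(toy_chunks, first_sentence)
print(summary_paragraphs)  # ['A rose.', 'C dropped.']
```

Passing the summarizer in as a callable makes it straightforward to swap the placeholder for a real model call.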

