<a href="https://colab.research.google.com/github/katrina906/CS6120-Summarization-Project/blob/main/abstractive_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#https://www.thepythoncode.com/article/text-summarization-using-huggingface-transformers-python
#https://huggingface.co/blog/how-to-generate

# Use a much smaller sample size, then re-evaluate the extractive models on the same sample so the stats are comparable? Or re-evaluate just the _best_ extractive model?
  # Discuss that the results are less robust, and list parallelization as an extension.

# Abstractive Summarization using Encoder-Decoder T5 Model 
TODO: describe model. Using pre-trained model. 
              
__Pros__ over extractive summarization:
- Generates new text rather than repeating sentences from the article. This makes the summary more engaging to read and may combine ideas more concisely.
- Can use the input text as-is, with no text-cleaning decisions, so the model can take features such as punctuation and capitalization into account.

__Cons__ over extractive summarization:
- Harder to evaluate: can generate sentences that mean the same thing as the given summary sentences, but using different words, which will not be understood by the ROUGE metrics. 
- Decoding is more computationally intensive and we were unable to generate summaries for as large a sample as we did for extractive summarization. Thus our results are less robust. 
- Sometimes generates the `<unk>` token.
- Cannot guarantee full sentences: the output is cut off mid-sentence if the model has not generated an end-of-sequence token before reaching the specified max length.
- Encoding only takes the first 1,017 tokens of the text (`max_length` counts tokens, not characters). Topics that first appear later in the article are missed entirely and cannot be included in the summary.
  - TODO: include stat on the average number of tokens in our articles and how many go over the threshold
  - In general, news articles tend to include highlights of the most important information in the first few sentences followed by details, so this should not impact performance as drastically as in other contexts.
  - BCG material follows the pyramid principle, which summarizes the key points first, so it has a similar structure to news articles.
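
The evaluation concern in the list above can be illustrated with a simplified unigram-overlap score. This is a minimal sketch of a ROUGE-1-style F1 (lowercased whitespace tokens, no stemming), not the scorer used in this project; the function name `rouge1_f1` and the example sentences are made up for illustration.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# Made-up sentences (illustration only).
reference = "the man was taken to hospital after the attack"
copied = "the man was taken to hospital after the attack"
paraphrase = "he was rushed to a clinic following the incident"

print(rouge1_f1(reference, copied))      # identical wording
print(rouge1_f1(reference, paraphrase))  # similar meaning, different words
```

The paraphrase scores far lower than the copied sentence even though a human reader would likely accept both, which is the sense in which abstractive output is harder to evaluate with overlap metrics.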

__Potential extensions__:
- Parallelize decoding to allow for better run time
- Grid search through parameters such as length penalty, number of beams, number of ngrams to not repeat

In [None]:
# TODO create loop through configurations similar to extractive 
# TODO set up evaluations 

In [2]:
%%capture
!pip install transformers
!pip install import-ipynb
!pip install sentencepiece

In [3]:
from transformers import pipeline
from transformers import T5ForConditionalGeneration, T5Tokenizer
import pandas as pd
import torch
import import_ipynb
import numpy as np
import tensorflow as tf

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# load in functions from the extractive_summarization notebook
%cd "drive/MyDrive/Colab Notebooks"
from extractive_summarization import *
%cd ..

/content/drive/MyDrive/Colab Notebooks
importing Jupyter notebook from extractive_summarization.ipynb
/content/drive/My Drive


In [6]:
df = data_setup(n = 20) 

In [8]:
def load_model():
  # initialize the model architecture and weights
  model = T5ForConditionalGeneration.from_pretrained("t5-small")
  # initialize the model tokenizer
  tokenizer = T5Tokenizer.from_pretrained("t5-small")

  return model, tokenizer

In [10]:
model, tokenizer = load_model()


## Encode text for inference
- Encode the text into numerical token IDs using the model's tokenizer
- Pieces the tokenizer cannot represent are automatically converted to `<unk>`
- Use the unprocessed original sentences, because the model takes features like capitalization and punctuation into account. Stopwords also matter for generating grammatically correct sentences.

In [11]:
def encode_input(df):
  df['encoded'] = df.sentences.map(lambda row: tokenizer.encode("summarize: " + ' '.join(row), return_tensors="pt", max_length=1017, truncation=True))
  return df

In [12]:
df = encode_input(df)
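
Because `max_length=1017` with `truncation=True` silently drops everything past the limit, it may help to measure how often that happens. The sketch below is an assumption-laden illustration: `truncation_stats` is a hypothetical helper, and the token counts are made up. In the notebook, the counts would have to come from encoding each article without truncation, for example via the length of `tokenizer.encode(text)`.

```python
def truncation_stats(token_counts, limit=1017):
    """Return (mean token count, share of inputs longer than `limit`)."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    mean_len = sum(token_counts) / len(token_counts)
    share_over = sum(1 for n in token_counts if n > limit) / len(token_counts)
    return mean_len, share_over

# Made-up token counts for four articles (illustration only).
example_counts = [400, 900, 1500, 2200]
mean_len, share_over = truncation_stats(example_counts)
print(f"mean tokens: {mean_len:.0f}, share truncated: {share_over:.0%}")
```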

## Decoding Methods
- Greedy: at each step, select the word with the highest probability given all prior context: P(w<sub>t</sub> | w<sub>1:t-1</sub>)
  - Con: misses high-probability words that follow a lower-probability word, because that path is never explored
- Beam Search: keeps the num_beams most probable partial sequences at each step and returns the most probable complete one.
  - Con: higher computation time
- Sampling methods (_not using_): used to introduce randomness to the text and make it sound more human-like, especially in contexts like story generation. However, in this case, we do not want randomness but rather want the summaries to closely follow the content in the article. 
  - Ex: article about a man attacked by a tiger says that he was conscious and talking in the ambulance. Sampling decoding creates a sentence that claims he was "conscious and talking" with the animal
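
The difference between greedy decoding and beam search can be seen on a toy example. The sketch below uses a hypothetical hand-written probability table (`TOY_MODEL`) instead of T5, and the two functions are simplified illustrations of the ideas, not the `transformers` implementation.

```python
import math

# Hypothetical toy model: next-token probabilities given the prefix so far.
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"dog": 0.4, "cat": 0.3, "man": 0.3},
    ("a",): {"tiger": 0.9, "dog": 0.1},
}

def greedy(model, steps):
    """Pick the single most probable next token at every step."""
    seq, logp = (), 0.0
    for _ in range(steps):
        token, p = max(model[seq].items(), key=lambda kv: kv[1])
        seq, logp = seq + (token,), logp + math.log(p)
    return seq, math.exp(logp)

def beam_search(model, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (token,), logp + math.log(p))
            for seq, logp in beams
            for token, p in model[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)

greedy_seq, greedy_prob = greedy(TOY_MODEL, steps=2)
beam_seq, beam_prob = beam_search(TOY_MODEL, steps=2, num_beams=2)
print(greedy_seq, greedy_prob)
print(beam_seq, beam_prob)
```

Greedy commits to "the" because it is the most probable first word, and so never reaches the more probable two-word sequence that starts with "a". Beam search keeps both first words alive and finds it, at the cost of scoring more candidates per step.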

## Length of Predicted Summary
- Max length:
  - Cannot set based on number of sentences; number of words only
  - Heuristic: average 20 text words per summary word
  - Two configs:
    - If strict, cannot go over heuristic. 
    - If more lenient, can go over by 1 unit of the heuristic (20 words)
- Min length: we want the summary to land right around the heuristic rather than well short of it, so that the content curator has enough information to use. The summary may therefore end up to 1 unit of the heuristic (20 words) below the max length if the model predicts an end-of-sequence token.
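
The length rules above reduce to a few lines of arithmetic. The sketch below mirrors the calculation in the notebook's `decode` function; the helper name `length_bounds` and its `lenient`, `ratio`, and `unit` parameters are hypothetical. Note that the resulting values are passed to `generate` as `max_length` and `min_length`, which count tokens rather than words, so the word-based heuristic is only approximate.

```python
def length_bounds(n_article_words, lenient=False, ratio=20, unit=20):
    """Return (min_length, max_length) for one article under the heuristic."""
    max_len = n_article_words // ratio    # 1 summary word per `ratio` text words
    if lenient:
        max_len += unit                   # may exceed the heuristic by one unit
    min_len = max(0, max_len - unit)      # may stop one unit below the max
    return min_len, max_len

print(length_bounds(800))                 # strict config
print(length_bounds(800, lenient=True))   # lenient config
print(length_bounds(100))                 # short article: min is clamped at 0
```

Because the minimum is derived from the maximum after the lenient adjustment, the lenient config raises both bounds by one unit.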

## Other Parameters
- No repeat ngram = 4: these methods tend to generate repetitive sequences of words. This parameter prevents any n-gram of length 4 from repeating.
  - Bigrams and trigrams can repeat so entity names that are central to the article can appear multiple times. But do not allow entire phrases to repeat. 
  - Ex: without parameter get sequences like "the heat index will make it feel like 113. the heat index will make it feel like 113"
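
The idea behind n-gram blocking can be sketched in plain Python: before each generation step, find every token that would complete an n-gram already present in the output, and forbid it. This is a simplified illustration on whitespace-split words, not the library's implementation; `banned_next_tokens` is a hypothetical helper.

```python
def banned_next_tokens(tokens, n=4):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n < 2:
        raise ValueError("n must be at least 2")
    if len(tokens) < n:
        return set()
    prefix = tuple(tokens[-(n - 1):])  # the last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        # If an earlier (n-1)-gram matches the current prefix, the token
        # that followed it would repeat a full n-gram.
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

generated = "the heat index will make it feel like 113 . the heat index".split()
print(banned_next_tokens(generated, n=4))
```

Here the output has started to repeat "the heat index", so the word that followed that trigram earlier is blocked and the model has to continue differently. Bigrams and trigrams such as "heat index" are still free to reappear.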

In [13]:
def decode(df, config):

  # heuristic: average 20 text words per summary word
  # join sentences with a space so words at sentence boundaries are not merged
  df['max_words'] = df.sentences.map(lambda row: int(np.floor(len(' '.join(row).split(' ')) / 20)))
  if 'max_words_plus' in config:
    # lenient config: allow 1 extra unit of the heuristic (20 words)
    df.max_words = df.max_words + 20

  if 'greedy' in config:
    df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 20),
                                         no_repeat_ngram_size = 4), 
                             axis = 1) 
  elif 'beam' in config:
    df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 20),
                                         num_beams = 5,
                                         no_repeat_ngram_size = 4, # same repetition constraint as greedy
                                         early_stopping = True),
                             axis = 1) 
    
  # decode the predicted token IDs back into text
  df['predicted_summary'] = df.outputs.map(lambda row: tokenizer.decode(row[0], skip_special_tokens = True))

  return df

In [15]:
df['max_words'] = df.sentences.map(lambda row: int(np.floor(len(' '.join(row).split(' ')) / 20))) # average 20 text words per summary word; join with a space so boundary words are not merged

In [16]:
%%time
df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 20),
                                         num_beams = 5,
                                         early_stopping = True),
                             axis = 1) 
    

CPU times: user 1min 45s, sys: 4.95 s, total: 1min 50s
Wall time: 1min 51s


In [17]:
%%time
df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 20),
                                         num_beams = 5,
                                         use_cache = True,
                                         early_stopping = True),
                             axis = 1) 
    

CPU times: user 1min 46s, sys: 5.7 s, total: 1min 52s
Wall time: 1min 52s


In [None]:
(10000 / 10 * 90) / 60 / 60 # 25 hours for 10,000 inputs at 90 seconds per 10 inputs
# 1 min 30 seconds for 10 inputs
# more than double for 20 inputs = 3 min 30 seconds
# greedy - half the time 

25.0

In [None]:
CONFIGURATIONS = [['greedy', 'beam'],
                  ['max_words_strict', 'max_words_plus'],
                  ]    
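
One way to loop through every combination of these options is `itertools.product`. This is a minimal sketch, assuming each configuration is passed to `decode` as a tuple of option names; the commented-out call is hypothetical, since the evaluation step is still a TODO in this notebook.

```python
from itertools import product

CONFIGURATIONS = [['greedy', 'beam'],
                  ['max_words_strict', 'max_words_plus']]

# One tuple per combination, e.g. ('greedy', 'max_words_strict').
all_configs = list(product(*CONFIGURATIONS))
for config in all_configs:
    print(config)
    # df = decode(df, config)  # hypothetical call; evaluation is still a TODO
```

Membership tests such as `'greedy' in config` work on these tuples the same way they do on lists.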