
In [None]:
# TODO look for examples where the summary contains:
  # info that isn't in the actual article. try iloc 6 about Jack Johnson
  # repetitive info

# Abstractive Summarization using Encoder-Decoder T5 Model
__T5__ (Text-to-Text Transfer Transformer)
- Pretrained encoder-decoder with bidirectional, multi-head self-attention layers and masking during training.
- Similar to BERT, with advances in training and the addition of a decoder.
- Fine-tuned for specific tasks such as language translation and summarization.
- Trained on the C4 dataset (Colossal Clean Crawled Corpus): 750GB of scraped webpage data.
  - Training our own model is unrealistic: these popular pretrained models generate English well only because of the large amount of data and number of iterations used to train them.
- Teacher forcing in training: the decoder is fed the reference text so far and minimizes the cross-entropy loss of predicting the next word.
- A potential extension is custom fine-tuning on a specific corpus. This is not necessary for news articles because they use fairly typical English construction, grammar, etc.
  - It might be necessary for the BCG corpus because PowerPoint English is somewhat different from regular English sentences.
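
To make the teacher-forcing objective concrete, here is a minimal sketch in plain Python. The function names and the toy scores are illustrative assumptions, not part of T5 or the `transformers` API. In the real model, the scores at each step come from the decoder after it has been fed the reference tokens before that step.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one step's vocabulary scores."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]

def teacher_forced_loss(step_logits, reference_ids):
    """Mean cross-entropy of the reference token at every decoding step.

    step_logits[t] is assumed to be the score vector at step t, produced
    after feeding the reference tokens before t (teacher forcing).
    """
    total = 0.0
    for logits, ref_id in zip(step_logits, reference_ids):
        total -= log_softmax(logits)[ref_id]
    return total / len(reference_ids)

# Toy vocabulary of 4 tokens and 2 decoding steps (made-up scores).
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
print(teacher_forced_loss(uniform, [0, 2]))    # log(4): a uniform guess over 4 tokens
print(teacher_forced_loss(confident, [0, 2]))  # smaller: the reference tokens score highest
```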
           
                  
Paper from Google: https://arxiv.org/abs/1910.10683
              
__Pros__ over extractive summarization:
- Generates new text rather than repeating what is in the article. This makes the summary more engaging to read and may combine ideas better, so summaries are more to the point.
- Can use the input text as-is, without text cleaning decisions. Can take more features into account, like punctuation and capitalization.

__Cons__ compared with extractive summarization:
- Decoding is more computationally intensive.
- Sometimes generates the `<UNK>` token.
- Cannot guarantee full sentences. The summary cuts off mid-sentence if the model has not generated an end-of-sequence token before reaching the specified max length.
- Encoding only takes the first 1,017 tokens of the text, largely because of the self-attention layers: attention considers the entire sequence, so it requires n^2 calculations for n tokens. Topics that first appear later in the article are missed completely and not included in the summary.
  - Only 15% of articles have more than 1,017 tokens, and not by much: the max is 1,819 tokens and the mean is 653 tokens.
  - In general, news articles tend to put the most important information in the first few sentences, followed by details, so truncation should not hurt performance as drastically as in other contexts.
  - BCG material follows the pyramid principle, where the key points are summarized first, so it has a similar structure to news articles.
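
As a rough illustration of the n^2 growth mentioned above, the sketch below counts the query-key pairs that full self-attention scores for the article lengths quoted in this section. The helper name is an assumption for illustration, and the count is per attention head and per layer.

```python
def full_attention_pairs(n_tokens):
    """Number of query-key pairs scored by full self-attention."""
    return n_tokens * n_tokens

for label, n in [("mean article", 653),
                 ("truncation limit", 1017),
                 ("longest article", 1819)]:
    print(f"{label}: {n} tokens -> {full_attention_pairs(n):,} pairs")
```

Encoding the longest article in full would cost roughly 3.2 times as many attention scores as encoding it truncated to 1,017 tokens.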

__Potential extensions__:
- Parallelize decoding to allow for better run time
- Grid search through parameters such as length penalty, number of beams, number of ngrams to not repeat
- Longformer Encoder-Decoder uses an encoder model that can handle up to 16,000 tokens by limiting self-attention to a window rather than the entire text (higher layers use a larger window size to combine the local features learned at lower layers). Paper: https://arxiv.org/abs/2004.05150
  - However, most of our articles do not suffer from this issue
  - Much longer training time due to larger encodings
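
The saving from windowed attention can be sketched by counting attended token pairs. This is a simplification under stated assumptions: the function name is hypothetical, the window of 256 tokens per side is an illustrative value, and the sketch ignores both global attention and the larger windows used in higher layers.

```python
def windowed_attention_pairs(n_tokens, one_sided_window):
    """Count query-key pairs when each token only attends to neighbours
    within `one_sided_window` positions on either side (plus itself)."""
    total = 0
    for i in range(n_tokens):
        left = max(0, i - one_sided_window)
        right = min(n_tokens - 1, i + one_sided_window)
        total += right - left + 1
    return total

n = 16000
print("full attention pairs:    ", n * n)
print("windowed attention pairs:", windowed_attention_pairs(n, 256))
```

The windowed count grows linearly with the number of tokens for a fixed window, while the full count grows quadratically.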

In [None]:
# Resources: 
# https://www.thepythoncode.com/article/text-summarization-using-huggingface-transformers-python
# https://huggingface.co/blog/how-to-generate

In [None]:
#model = EncoderDecoderModel.from_pretrained("patrickvonplaten/longformer2roberta-cnn_dailymail-fp16")
#tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096") 
# https://huggingface.co/patrickvonplaten/longformer2roberta-cnn_dailymail-fp16
# Longformer encoder + RoBERTa decoder
# fine-tuned on this exact dataset, so it does seem to do better than T5 (even on articles below the truncation limit), but worried about overfitting? don't know exactly what they trained on

In [1]:
%%capture
!pip install transformers
!pip install import-ipynb
!pip install sentencepiece

In [2]:
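import itertools  # used to build the configuration grid
import os         # used to check for saved checkpoints
import pickle     # used to save and load partial results

from transformers import pipeline
from transformers import T5ForConditionalGeneration, T5Tokenizer, EncoderDecoderModel, LongformerTokenizer
import pandas as pd
import torch
import import_ipynb
import numpy as np
import tensorflow as tf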

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# load in functions from extractive_summarization notebook
%cd "drive/MyDrive/Colab Notebooks"
from extractive_summarization import *
%cd ..

/content/drive/MyDrive/Colab Notebooks
importing Jupyter notebook from extractive_summarization.ipynb
/content/drive/My Drive


In [None]:
def load_model():
  # initialize the model architecture and weights
  model = T5ForConditionalGeneration.from_pretrained("t5-small")
  # initialize the model tokenizer
  tokenizer = T5Tokenizer.from_pretrained("t5-small")

  return model, tokenizer

## Encode text for inference
- Encode text into numerical token IDs using the model's tokenizer.
- The tokenizer automatically converts unknown words into `<unk>`.
- Use the unprocessed original sentences because the model takes into account features like capitalization and punctuation. Stopwords are also important for generating grammatically correct sentences.

In [None]:
def encode_input(df, tokenizer):
  df['encoded'] = df.sentences.map(lambda row: tokenizer.encode("summarize: " + ' '.join(row), return_tensors="pt", max_length=1017, truncation=True))
  return df

## Decoding Methods
- Greedy: at each step, select the word with the highest probability given all prior context: P(w | w<sub>1:t-1</sub>)
  - Con: misses high-probability words that occur after a lower-probability word because that path is never explored.
- Beam Search: keeps the num_beams most probable partial sequences at each step and returns the sequence with the highest overall probability.
  - Con: higher computation time.
- Sampling methods (_not using_): used to introduce randomness to the text and make it sound more human-like, especially in contexts like story generation. In this case we do not want randomness; we want the summaries to closely follow the content of the article.
  - Ex: an article about a man attacked by a tiger says that he was conscious and talking in the ambulance. Sampling decoding creates a sentence that claims he was "conscious and talking" with the animal.
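
The difference between greedy decoding and beam search can be sketched on a toy problem. Everything here is an illustrative assumption: the `NEXT` table of made-up next-word probabilities stands in for the model, and the function names are hypothetical rather than the `transformers` implementation.

```python
import math

# Made-up next-word distributions keyed by the previous word.
NEXT = {
    "<s>": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.5, "drives": 0.3, "turns": 0.2},
}

def greedy(start, steps):
    """Pick the single most probable next word at every step."""
    seq, logp = [start], 0.0
    for _ in range(steps):
        word, p = max(NEXT[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(word)
        logp += math.log(p)
    return seq[1:], math.exp(logp)

def beam_search(start, steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = [(seq + [w], logp + math.log(p))
                      for seq, logp in beams
                      for w, p in NEXT[seq[-1]].items()]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq[1:], math.exp(best_logp)

print(greedy("<s>", steps=2))
print(beam_search("<s>", steps=2, num_beams=2))
```

Greedy commits to "nice" because it is the most probable first word, while beam search also keeps "dog" alive and finds the more probable two-word sequence "dog has".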

## Length of Predicted Summary
- Max length:
  - Cannot be set as a number of sentences, only as a number of words (strictly, the limit counts generated tokens).
  - Heuristic: on average there are 12 article words per summary word, so the target length is the article word count divided by 12.
  - Two configs:
    - If strict, the summary cannot go over the heuristic.
    - If more lenient, the summary can go over by 1 unit of the heuristic (12 words).
- Min length: we want the summary to land right around the heuristic, not well below it, because the content curator needs enough information to work with. Thus the summary may go under the max length by at most 1 unit of the heuristic (12 words), and only if the model predicts an end-of-sequence token.
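
The length rules above can be sketched as a small helper. The function name and arguments are hypothetical; the arithmetic follows the heuristic described in this section (12 article words per summary word, one unit of slack).

```python
def length_bounds(sentences, words_per_summary_word=12, lenient=False):
    """Return (min_words, max_words) for a summary of the given article.

    `sentences` is a list of sentence strings, assumed to be joined with
    single spaces as in the encoding step.
    """
    n_words = len(" ".join(sentences).split())
    max_words = n_words // words_per_summary_word
    if lenient:
        # lenient config: allow one extra unit of the heuristic
        max_words += words_per_summary_word
    # allow the summary to end up to one unit below the max
    min_words = max(0, max_words - words_per_summary_word)
    return min_words, max_words

# A dummy 600-word article split into two "sentences".
article = [" ".join(["word"] * 300), " ".join(["word"] * 300)]
print(length_bounds(article))                # strict config
print(length_bounds(article, lenient=True))  # lenient config
```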

## Other Parameters
- No repeat ngram = 4: these decoding methods tend to generate repetitive sequences of words. This parameter prevents any ngram of length 4 from appearing more than once.
  - Bigrams and trigrams can repeat, so entity names that are central to the article can appear multiple times, but entire phrases cannot repeat.
  - Ex: without the parameter we get sequences like "the heat index will make it feel like 113. the heat index will make it feel like 113"
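
The idea behind the no-repeat constraint can be sketched as follows: before each step, find the words that would complete an ngram already present in the output and forbid them. This is a conceptual sketch with a hypothetical function name working on words, not the `transformers` implementation, which operates on token IDs.

```python
def banned_next_tokens(tokens, n):
    """Words that would repeat an existing n-gram if generated next."""
    if len(tokens) < n - 1:
        return set()
    # the last n-1 words form the start of the n-gram about to be completed
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

generated = "the heat index will make it feel like 113 . the heat index".split()
print(banned_next_tokens(generated, n=4))  # "will" would repeat "the heat index will"
```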

In [None]:
def decode(df, model, tokenizer, config):

  # join sentences with spaces (as in encode_input) so words at sentence boundaries are not merged
  df['max_words'] = df.sentences.map(lambda row: len(' '.join(row).split()) // 12) # average 12 text words per summary word
  if 'max_words_plus' in config:
    df.max_words = df.max_words + 12
  if 'greedy' in config:
    df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 12),
                                         no_repeat_ngram_size = 4), 
                             axis = 1) 
  if 'beam' in config:
    df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 12),
                                         num_beams = 5,
                                         no_repeat_ngram_size = 4,
                                         early_stopping = True),
                             axis = 1) 
    
  # decode predicted summary of numbers back into text
  df['predicted_summary'] = df.outputs.map(lambda row: tokenizer.decode(row[0], skip_special_tokens = True))

  return df

### Training Loop
Loop through the list of configurations, decode the input data using the T5 model with each configuration, and evaluate.  
Evaluation functions come from the extractive_summarization notebook.  

Decoding is much slower than the extractive approach, so the data can be decoded in chunks and saved after each chunk, which lets us resume from a partial run.
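
The chunk-and-save pattern can be sketched independently of the model. The function name, its arguments, and the temporary checkpoint path are assumptions for illustration; `process` stands in for the slow decoding step.

```python
import os
import pickle
import tempfile

def process_in_chunks(items, process, checkpoint_path, n_chunks):
    """Process items chunk by chunk, saving progress after every chunk
    so an interrupted run can resume from the last saved chunk."""
    increment = -(-len(items) // n_chunks)  # ceiling division
    if os.path.exists(checkpoint_path):
        # resume: load results so far and the index to continue from
        with open(checkpoint_path, "rb") as f:
            done, start = pickle.load(f)
    else:
        done, start = [], 0
    while start < len(items):
        end = min(start + increment, len(items))
        chunk_results = [process(x) for x in items[start:end]]
        done.extend(chunk_results)
        with open(checkpoint_path, "wb") as f:
            pickle.dump((done, end), f)
        start = end
    return done

checkpoint = os.path.join(tempfile.mkdtemp(), "decode_checkpoint.pkl")
print(process_in_chunks(list(range(10)), lambda x: x * 2, checkpoint, n_chunks=5))
```

If the session crashes partway through, calling the function again with the same checkpoint path skips the chunks that were already saved.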

In [None]:
def train_config_loop_abstractive(df, model, tokenizer, config_list, eval_only = True, continue_iterations = False,
                                 filename = '', split_fraction = None, split_fraction_continue = False):

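  # only resume if a saved results file actually exists (e.g. not on the first run)
  if continue_iterations and os.path.exists('/content/drive/MyDrive/data/' + filename + '.pkl'):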
    with open('/content/drive/MyDrive/data/' + filename + '.pkl', 'rb') as f:
      results_so_far = pickle.load(f) 
      eval_results = results_so_far[1]
      model_results = results_so_far[2]
      config_list = config_list[config_list.index(results_so_far[0])+1:] # continue from last config
  else:
    eval_results = {}
    model_results = {}

  for config in config_list:
    print(config)
    # allow decoding the dataframe one chunk at a time. Otherwise too slow and the session crashes before it finishes
    if split_fraction is not None:
      increment = int(np.ceil(len(df) / split_fraction))
      # allow resuming from partial decode of dataframe 
      if split_fraction_continue and os.path.exists('/content/drive/MyDrive/data/' + 'abstractive_best_model_' + str(config) + '.pkl'):
        with open('/content/drive/MyDrive/data/' + 'abstractive_best_model_' + str(config) + '.pkl', 'rb') as f:
          decode_so_far = pickle.load(f) 
          start_index = decode_so_far[1] 
          end_index = start_index + increment
          df_decoded = decode_so_far[0]
      else:
        start_index = 0
        end_index = increment
        df_decoded = pd.DataFrame()

      # decode in chunks size len(df) / split_fraction at a time 
      while start_index < len(df):
        if end_index > len(df):
          end_index = len(df)
        df_partial = decode(df.iloc[start_index:end_index], model, tokenizer, config)
        df_decoded = pd.concat([df_decoded, df_partial], sort = False)
        # save after every increment
        with open('/content/drive/MyDrive/data/' + 'abstractive_best_model_' + str(config) + '.pkl', 'wb') as f:
          pickle.dump([df_decoded, end_index], f)
          print('saving!', end_index)
        # next increment 
        start_index += increment
        end_index += increment
    # if training configs on small samples to find best, can run all at once
    else:
      df_decoded = decode(df, model, tokenizer, config)
    eval_dict = evaluate(df_decoded)
    eval_results[(str(config))] = metrics_distribution(df_decoded)
    if not eval_only:
      model_results[str(config)] = df_decoded[['sentences', 'summary', 'rouge', 'predicted_summary']]

    # save every completed config after finishes
    if filename != '':
      with open('/content/drive/MyDrive/data/' + filename + '.pkl', 'wb') as f:
          pickle.dump([config, eval_results, model_results], f)
          print('saving!')

  return eval_results, model_results

## Main Function
1. Loop through possible configurations for decoding and train on subset of data
2. Select best configuration for each evaluation metric (F-Measure, Precision, Recall with and without averaging between unigram and bigram metrics)
3. Train best configurations on full data
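
Step 1 relies on a grid of configurations, and results are stored under the string form of each configuration tuple. The sketch below shows how the grid is built and one way to turn a string key back into a tuple. Using `ast.literal_eval` for the round trip is a swapped-in alternative for illustration, not the parsing used in this notebook.

```python
import ast
import itertools

CONFIGURATIONS = [["greedy", "beam"],
                  ["max_words_strict", "max_words_plus"]]

# cross product of all options: one tuple per configuration
model_configurations = list(itertools.product(*CONFIGURATIONS))

for config in model_configurations:
    key = str(config)                 # string key, as used for result dictionaries
    restored = ast.literal_eval(key)  # safely parse the key back into a tuple
    print(key, restored == config)
```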

In [None]:
CONFIGURATIONS = [['greedy', 'beam'],
                  ['max_words_strict', 'max_words_plus'],
                  ]    
# cross products of all possible combinations of configurations
model_configurations = list(itertools.product(*CONFIGURATIONS)) 

In [None]:
def main():

  # load T5 model and tokenizer
  model, tokenizer = load_model()

  # load data and encode input
  df = data_setup(n = 10000) 
  df = encode_input(df, tokenizer)

  # train each configuration on a subset of the data and get evaluation metrics 
  eval_results, _ = train_config_loop_abstractive(df.head(1000), model, tokenizer, model_configurations, eval_only = True,
                                                 filename = 'train_config_loop_abstractive', continue_iterations = True)
  # find best config for each evaluation metric
  best_configs = find_best_configs(eval_results)

  # train full model on best configurations for each metric
  eval_results_dict = {} # for each eval metric, distribution of evaluation metrics 
  model_results_dict = {} # for each eval metric, data with predicted summaries
  seen_configs = {}  # keep track of which configs we have trained so far
  seen_metrics = []

  # resume if already trained some best configs on full data 
  if os.path.exists('/content/drive/MyDrive/data/trained_model_abstractive.pkl'):
    with open('/content/drive/MyDrive/data/trained_model_abstractive.pkl', 'rb') as f:
      load = pickle.load(f)
      seen_metrics = load[0]
      eval_results_dict = load[1]
      model_results_dict = load[2]
      seen_configs = load[3]
  for metric in best_configs.keys():
    if metric in seen_metrics: # already done in prior partial run
      continue
    config = tuple(best_configs[metric].strip('(').strip(')').replace("'", "").split(', '))
    if config not in seen_configs.keys():
      eval_results, model_results = train_config_loop_abstractive(df, model, tokenizer, [config], eval_only = False,
                                                                 split_fraction = 10, split_fraction_continue = True)
      eval_results_dict[metric] = eval_results[str(config)][metric]
      model_results_dict[metric] = model_results[str(config)]
      seen_configs[config] = metric
    # prevent duplicative retraining: use existing results if best config for prior metric
    else: 
      eval_results_dict[metric] = eval_results_dict[seen_configs[config]]
      model_results_dict[metric] = model_results_dict[seen_configs[config]]
    seen_metrics.append(metric)

    # save best models
    # save every iteration overwriting
    # if need to restart, load in dictionaries, go through best_configs.keys() but not in seen_metrics, continue adding to dictionaries
    with open('/content/drive/MyDrive/data/trained_model_abstractive.pkl', 'wb') as f: 
        pickle.dump([seen_metrics, eval_results_dict, model_results_dict, seen_configs, best_configs], f)

In [None]:
# load saved results from a prior run
with open('/content/drive/MyDrive/data/trained_model_abstractive.pkl', 'rb') as f:
  load = pickle.load(f)
  seen_metrics = load[0]
  eval_results_dict = load[1]
  model_results_dict = load[2]
  seen_configs = load[3]

In [None]:
main()
