<a href="https://colab.research.google.com/github/katrina906/CS6120-Summarization-Project/blob/main/abstractive_summarization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#https://www.thepythoncode.com/article/text-summarization-using-huggingface-transformers-python
#https://huggingface.co/blog/how-to-generate

# use much smaller sample size. and then re-evaluate extractive models to match so have comparative stats? or just _best_ extractive model!
  # discuss that results are less robust and list as extension to parallelize. 

# Abstractive Summarization using Encoder-Decoder T5 Model 
TODO: describe model. Using pre-trained model. 
              
__Pros__ over extractive summarization:
- Generates new text rather than copying sentences from the article. This can make the summary more engaging to read and can combine ideas so the summary is more to the point.
- Can use the input text as-is, without text-cleaning decisions, so the model can use features such as punctuation and capitalization.

__Cons__ over extractive summarization:
- Harder to evaluate: the model can generate sentences that mean the same thing as the reference summary sentences but use different words, and ROUGE, which scores word overlap, gives no credit for such paraphrases.
- Decoding is more computationally intensive, so we were unable to generate summaries for as large a sample as we did for extractive summarization. Our results are therefore less robust.
- Sometimes generates the `<unk>` token.
- Cannot guarantee that it will generate full sentences. The output is cut off mid-sentence if the model has not generated an end-of-sequence token before reaching the specified max length.
- Encoding can only take the first 1,017 tokens of the text (the `max_length` passed to the tokenizer). If topics appear later in the article for the first time, they will be completely missed and not included in the summary.
  - TODO: include stat on average number of tokens in our articles and how many go over the threshold
  - In general, news articles tend to include highlights of the most important information in the first few sentences followed by details, so this should not impact performance as drastically as in other contexts.
  - BCG material follows the pyramid principle where you summarize the key points first, so similar structure to news articles. 
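
A minimal sketch of how the truncation statistic mentioned in the TODO above could be computed. The names `truncation_stats` and `count_tokens` are hypothetical, and whitespace splitting stands in for the real tokenizer so that the example runs on its own.

```python
from statistics import mean

def truncation_stats(texts, count_tokens, limit=1017):
    """Return the mean length and the share of texts longer than `limit`.

    `count_tokens` is any callable mapping a string to a length
    (hypothetical helper; swap in the real tokenizer as needed).
    """
    lengths = [count_tokens(text) for text in texts]
    over = sum(length > limit for length in lengths)
    return {
        "mean_length": mean(lengths),
        "share_over_limit": over / len(lengths),
    }

# Toy demonstration: whitespace tokens and a tiny limit
articles = ["a b c d e f", "a b", "a b c d"]
stats = truncation_stats(articles, lambda t: len(t.split()), limit=3)
print(stats)
```

With the tokenizer loaded later in this notebook, `count_tokens` could be something like `lambda t: len(tokenizer.encode(t))`.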

__Potential extensions__:
- Parallelize decoding to allow for better run time
- Grid search through parameters such as length penalty, number of beams, number of ngrams to not repeat
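
As a rough sketch of the grid-search extension, the parameter combinations could be enumerated as below. The value lists are placeholders rather than tuned settings, and `expand_grid` is a hypothetical helper.

```python
import itertools

# Hypothetical search space; the values are placeholders, not tuned settings
search_space = {
    "length_penalty": [0.8, 1.0, 1.2],
    "num_beams": [3, 5],
    "no_repeat_ngram_size": [3, 4],
}

def expand_grid(space):
    """Yield one keyword-argument dict per combination of parameter values."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))

grid = list(expand_grid(search_space))
print(len(grid))  # 12 combinations (3 * 2 * 2)
print(grid[0])
```

Each dictionary could then be unpacked into a generation call, for example `model.generate(input_ids, **params)`, and scored with the same evaluation metrics.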

In [1]:
%%capture
!pip install transformers
!pip install import-ipynb
!pip install sentencepiece

In [2]:
from transformers import pipeline
from transformers import T5ForConditionalGeneration, T5Tokenizer
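# os, pickle and itertools are used below; import them explicitly
# rather than relying on the star import from extractive_summarization
import os
import pickle
import itertools
import pandas as pd
import torch
import import_ipynb
import numpy as np
import tensorflow as tf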

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
# load in functions from extractive_summarization notebook
%cd "drive/MyDrive/Colab Notebooks"
from extractive_summarization import *
%cd ..

/content/drive/MyDrive/Colab Notebooks
importing Jupyter notebook from extractive_summarization.ipynb
/content/drive/MyDrive


In [5]:
def load_model():
  # initialize the model architecture and weights
  model = T5ForConditionalGeneration.from_pretrained("t5-small")
  # initialize the model tokenizer
  tokenizer = T5Tokenizer.from_pretrained("t5-small")

  return model, tokenizer

## Encode text for inference
- Encode words into numerical vectors using model's tokenizer
- Will automatically convert tokens that are not in the tokenizer's vocabulary into `<unk>`
- Use un-processed original sentences because model takes into account features like capitalization and punctuation. Also, features like stopwords are important for generating grammatically correct sentences. 
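
The following toy sketch illustrates the three points above: the task prefix, the fallback to an unknown id, and truncation. It is a word-level stand-in, not the subword tokenizer the model actually uses; `toy_encode`, the vocabulary, and the id values are all made up for illustration.

```python
def toy_encode(text, vocab, max_length, unk_id=2, eos_id=1):
    """Map whitespace tokens to ids, truncate, and append an end-of-sequence id."""
    ids = [vocab.get(token, unk_id) for token in text.split()]
    # Reserve one slot for the end-of-sequence id, then truncate
    return ids[: max_length - 1] + [eos_id]

# Placeholder vocabulary: "zookeeper" and "ran." are deliberately missing
vocab = {"summarize:": 3, "The": 4, "tiger": 5, "escaped.": 6}
sentences = ["The tiger escaped.", "The zookeeper ran."]

# Original capitalization and punctuation are kept, as in encode_input
ids = toy_encode("summarize: " + " ".join(sentences), vocab, max_length=7)
print(ids)  # [3, 4, 5, 6, 4, 2, 1]
```

The word "zookeeper" falls back to the unknown id, and "ran." is dropped because the sequence has already reached the length limit.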

In [6]:
def encode_input(df, tokenizer):
  # prepend the T5 task prefix, join the sentences, and truncate to 1,017 tokens
  df['encoded'] = df.sentences.map(lambda row: tokenizer.encode(
      "summarize: " + ' '.join(row), return_tensors="pt", max_length=1017, truncation=True))
  return df

## Decoding Methods
- Greedy: at each step, select the word with the highest probability given all prior context: P(w<sub>t</sub> | w<sub>1:t-1</sub>)
  - Con: misses high-probability words that follow a lower-probability word, because that path is never explored
- Beam search: keeps the `num_beams` most probable partial sequences at each step and returns the most probable complete sequence.
  - Con: higher computation time
- Sampling methods (_not used_): introduce randomness to the text to make it sound more human-like, especially in contexts like story generation. In this case, however, we do not want randomness; we want the summaries to closely follow the content of the article.
  - Ex: an article about a man attacked by a tiger says that he was conscious and talking in the ambulance. Sampling decoding creates a sentence that claims he was "conscious and talking" with the animal.
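
The difference between greedy decoding and beam search can be seen on a toy example. The sketch below uses a made-up table of next-token probabilities (not output from T5), and `greedy` and `beam_search` are simplified illustrations rather than the library's implementation.

```python
import math

# Toy next-token distributions keyed by the previous token (made-up numbers)
NEXT = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"C": 0.55, "D": 0.45},
    "B": {"E": 0.9, "F": 0.1},
}

def greedy(steps):
    """Pick the single most probable next token at every step."""
    seq, logp = ["<s>"], 0.0
    for _ in range(steps):
        token, p = max(NEXT[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq[1:], math.exp(logp)

def beam_search(steps, num_beams):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [(["<s>"], 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + [token], logp + math.log(p))
            for seq, logp in beams
            for token, p in NEXT[seq[-1]].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq[1:], math.exp(best_logp)

seq_g, p_g = greedy(2)
seq_b, p_b = beam_search(2, num_beams=2)
print(seq_g, round(p_g, 2))  # ['A', 'C'] 0.33
print(seq_b, round(p_b, 2))  # ['B', 'E'] 0.36
```

Greedy commits to "A" because it is the most probable first token and never sees that "B" leads to a more probable sequence overall; beam search keeps both prefixes alive and finds it.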

## Length of Predicted Summary
- Max length:
  - Cannot be set as a number of sentences, only as a number of tokens, which we treat as roughly one word each
  - Heuristic: on average, 20 article words per summary word
  - Two configs:
    - Strict: the summary cannot go over the heuristic.
    - Lenient: the summary can go over by 1 unit of the heuristic (20 words).
- Min length: we want the summary to land right around the heuristic, not well below it, so that the content curator has enough information to use. We therefore allow the summary to go under by 1 unit of the heuristic, and only if the model predicts an end-of-sequence token.
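
The length rules can be written out as a small helper. This is a sketch that mirrors the arithmetic used in the `decode` function in this notebook, where the minimum is derived from the (possibly raised) maximum; the name `length_bounds` and the two constants are hypothetical.

```python
import math

WORDS_PER_SUMMARY_WORD = 20  # heuristic: 20 article words per summary word
SLACK = 20                   # one unit of slack, in words

def length_bounds(n_article_words, lenient=False):
    """Return (min_length, max_length) for a summary under the heuristic."""
    max_length = math.floor(n_article_words / WORDS_PER_SUMMARY_WORD)
    if lenient:
        max_length += SLACK
    min_length = max(0, max_length - SLACK)
    return min_length, max_length

print(length_bounds(1000))                # (30, 50)
print(length_bounds(1000, lenient=True))  # (50, 70)
print(length_bounds(300))                 # (0, 15)
```

For short articles the minimum bottoms out at zero, so the model may stop as soon as it predicts an end-of-sequence token.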

## Other Parameters
- No repeat ngram = 4: these decoding methods tend to generate repetitive sequences of words. This parameter prevents any ngram of length 4 from appearing more than once in the output.
  - Bigrams and trigrams can still repeat, so entity names that are central to the article can appear multiple times, but entire phrases cannot repeat.
  - Ex: without this parameter we get sequences like "the heat index will make it feel like 113. the heat index will make it feel like 113".
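
To make the mechanism concrete, the sketch below computes which next tokens would have to be blocked to avoid repeating an n-gram. It is a simplified word-level illustration, not the library's implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(tokens, n):
    """Return tokens that would complete an n-gram already present in `tokens`."""
    if len(tokens) < n:
        return set()
    # The last n-1 tokens form the start of the next potential n-gram
    prefix = tuple(tokens[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tokens[i : i + n]
        if tuple(ngram[:-1]) == prefix:
            banned.add(ngram[-1])
    return banned

generated = "the heat index will make it feel like 113 . the heat index".split()
print(banned_next_tokens(generated, n=4))  # {'will'}
print(banned_next_tokens(generated, n=5))  # set()
```

With n = 4 the trigram "the heat index" is allowed to appear twice, but following it with "will" again is blocked, which stops the repeated phrase from the example above.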

In [7]:
def decode(df, model, tokenizer, config):

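  # heuristic: on average 20 article words per summary word
  # join sentences with a space (as in encode_input) so words at sentence boundaries are not merged
  df['max_words'] = df.sentences.map(lambda row: len(' '.join(row).split()) // 20)
  if 'max_words_plus' in config:
    df['max_words'] = df.max_words + 20  # lenient config: allow one extra unit (20 words)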

  if 'greedy' in config:
    df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 20),
                                         no_repeat_ngram_size = 4), 
                             axis = 1) 
  # note: unlike the greedy branch, beam search is run here without no_repeat_ngram_size
  if 'beam' in config:
    df['outputs'] = df.apply(lambda row: model.generate( 
                                         row.encoded, 
                                         max_length=row.max_words, 
                                         min_length=max(0, row.max_words - 20),
                                         num_beams = 5,
                                         early_stopping = True),
                             axis = 1) 
    
  # decode predicted summary of numbers back into text
  df['predicted_summary'] = df.outputs.map(lambda row: tokenizer.decode(row[0], skip_special_tokens = True))

  return df

In [49]:
def train_config_loop_extractive(df, model, tokenizer, config_list, eval_only = True, continue_iterations = False,
                                 filename = '', split_fraction = None, split_fraction_continue = False):

  if continue_iterations:
    with open('/content/drive/MyDrive/data/' + filename + '.pkl', 'rb') as f:
      results_so_far = pickle.load(f) 
      eval_results = results_so_far[1]
      model_results = results_so_far[2]
      config_list = config_list[config_list.index(results_so_far[0])+1:] # continue from last config
  else:
    eval_results = {}
    model_results = {}

  for config in config_list:
    print(config)
    # allow decoding the dataframe in chunks: decoding it all at once is too slow and the session crashes before it finishes
    if split_fraction is not None:
      increment = int(np.ceil(len(df) / split_fraction))  # rows per chunk (split_fraction = number of chunks)
      # allow resuming from partial decode of dataframe 
      if split_fraction_continue and os.path.exists('/content/drive/MyDrive/data/' + 'abstractive_best_model_' + str(config) + '.pkl'):
        with open('/content/drive/MyDrive/data/' + 'abstractive_best_model_' + str(config) + '.pkl', 'rb') as f:
          decode_so_far = pickle.load(f) 
          start_index = decode_so_far[1] 
          end_index = start_index + increment
          df_decoded = decode_so_far[0]
      else:
        start_index = 0
        end_index = increment
        df_decoded = pd.DataFrame()

      # decode in chunks size len(df) / split_fraction at a time 
      while start_index < len(df):
        if end_index > len(df):
          end_index = len(df)
        df_partial = decode(df.iloc[start_index:end_index], model, tokenizer, config)
        df_decoded = pd.concat([df_decoded, df_partial], sort = False)
        # save after every increment
        with open('/content/drive/MyDrive/data/' + 'abstractive_best_model_' + str(config) + '.pkl', 'wb') as f:
          pickle.dump([df_decoded, end_index], f)
          print('saving!', end_index)
        # next increment 
        start_index += increment
        end_index += increment
    # if training configs on small samples to find best, can run all at once
    else:
      df_decoded = decode(df, model, tokenizer, config)
    eval_dict = evaluate(df_decoded)
    eval_results[(str(config))] = metrics_distribution(df_decoded)
    if not eval_only:
      model_results[str(config)] = df_decoded[['sentences', 'summary', 'rouge', 'predicted_summary']]

    # save every completed config after finishes
    if filename != '':
      with open('/content/drive/MyDrive/data/' + filename + '.pkl', 'wb') as f:
          pickle.dump([config, eval_results, model_results], f)
          print('saving!')

  return eval_results, model_results

In [14]:
CONFIGURATIONS = [['greedy', 'beam'],
                  ['max_words_strict', 'max_words_plus'],
                  ]    
# cross products of all possible combinations of configurations
model_configurations = list(itertools.product(*CONFIGURATIONS)) 

In [64]:
def main():

  # load T5 model and tokenizer
  model, tokenizer = load_model()

  # load data and encode input
  df = data_setup(n = 50) # TODO
  df = encode_input(df, tokenizer)

  # train each configuration on a subset of the data and get evaluation metrics 
  eval_results, _ = train_config_loop_extractive(df.head(1000), model, tokenizer, model_configurations, eval_only = True,
                                                 filename = 'train_config_loop_abstractive', continue_iterations = True)
  # find best config for each evaluation metric
  best_configs = find_best_configs(eval_results)

  # train full model on best configurations for each metric
  eval_results_dict = {} # for each eval metric, distribution of evaluation metrics 
  model_results_dict = {} # for each eval metric, data with predicted summaries
  seen_configs = {}  # keep track of which configs we have trained so far
  seen_metrics = []

  # resume if already trained some best configs on full data 
  if os.path.exists('/content/drive/MyDrive/data/trained_model_abstractive.pkl'):
    with open('/content/drive/MyDrive/data/trained_model_abstractive.pkl', 'rb') as f:
      load = pickle.load(f)
      seen_metrics = load[0]
      eval_results_dict = load[1]
      model_results_dict = load[2]
      seen_configs = load[3]
  for metric in best_configs.keys():
    if metric in seen_metrics: # already done in prior partial run
      continue
    config = tuple(best_configs[metric].strip('(').strip(')').replace("'", "").split(', '))
    if config not in seen_configs.keys():
      eval_results, model_results = train_config_loop_extractive(df, model, tokenizer, [config], eval_only = False,
                                                                 split_fraction = 10, split_fraction_continue = True)
      eval_results_dict[metric] = eval_results[str(config)][metric]
      model_results_dict[metric] = model_results[str(config)]
      seen_configs[config] = metric
    # prevent duplicative retraining: use existing results if best config for prior metric
    else:
      eval_results_dict[metric] = eval_results_dict[seen_configs[config]]
      model_results_dict[metric] = model_results_dict[seen_configs[config]]
    seen_metrics.append(metric)

    # save best models
    # save every iteration overwriting
    # if need to restart, load in dictionaries, go through best_configs.keys() but not in seen_metrics, continue adding to dictionaries
    with open('/content/drive/MyDrive/data/trained_model_abstractive.pkl', 'wb') as f: 
        pickle.dump([seen_metrics, eval_results_dict, model_results_dict, seen_configs, best_configs], f)

In [65]:
main()

('beam', 'max_words_strict')


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


saving! 5
saving! 10
saving! 15
saving! 20
saving! 25
saving! 30
saving! 35
saving! 40
saving! 45
saving! 50
