<a href="https://colab.research.google.com/github/katrina906/CS6120-Summarization-Project/blob/main/compare_models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Compare Models Qualitatively & Quantitatively
- Extractive models: Calculate p-values for the differences in evaluation metrics with a paired bootstrap test
  - Do not calculate p-values between the extractive models and the abstractive model. The gaps are fairly large, or at least larger than the gaps between extractive models, and the training time is so long that this p-value calculation is computationally prohibitive. 
- Compare metrics (f-measure, precision, recall) for all best models
- Compare relative lengths of generated summaries between models
- Qualitative comparisons of generated summaries vs gold standard label and vs generated summaries from other models. 

__Conclusions:__     
- Extractive is better than abstractive for our specific task
  - If we wanted a system that required little or no human intervention, then abstractive summarization would be better. Its summaries are more to the point (it can drop extraneous details from a sentence and keep only the important part) and its sentences are in a logical order. 
  - But if a human is doing post-processing, then extractive is less likely to make mistakes and is easier to explain to business stakeholders and users. 
- The length of the generated summary largely determines whether a model achieves good recall or good precision 
  - We were unable to decode the abstractive model using the number-of-sentences heuristic, which generally leads to longer predicted summaries, and this at least partly explains why it was unable to achieve good recall. An extension would be to experiment further with the length restrictions. 
- Recall is the most important metric: a human content curator will take all of the generated summary information and revise it into a final summary. We want to give them more information to work with that captures most of the main points in the article. 
  - We do not want to give them significant amounts of "junk", so we still ensure precision is reasonable. Limiting the length of summaries helps here. 

--> Best extractive recall model is TextRank

## Extractive and Abstractive Model Pros and Cons

__Extractive__
- Pros
  - All information is guaranteed to be correct and in the article
  - TextRank recall model has the best overall recall of any model, which means it puts the most information in front of the human evaluator (while still having reasonable precision: it does not generate significant garbage)
  - Summary sentences are guaranteed to be grammatically correct
  - Easier to explain to a non-technical audience how we are selecting sentences vs. trusting an algorithm to generate completely new data like in abstractive summarization
- Cons
  - Sentences can appear in illogical orders and be taken out of context, e.g. one sentence refers to Bob and the next refers to 'he', but that 'he' was not originally Bob
    - Can occur in abstractive too
  - Extracted sentences can have an important point but also contain unnecessary supporting details.
  - Needs to be long to capture all of the information in a human-generated summary, which combines concepts from multiple article sentences into one summary sentence. 

__Abstractive__
- Pros
  - Human-generated summaries are qualitatively better because they combine multiple concepts from multiple text sentences into one summary sentence. Abstractive methods are better able to achieve this than extractive methods.
    - Can take main points of sentences and drop the supporting details
    - Sentences are more engaging to read
    - Predicted summary sentences tend to be shorter than the extractive predicted sentences 
  - Can use input text without having to make text cleaning decisions. Can consider more features like punctuation and capitalization 
  - Ordering of sentences is logical. 
- Cons
  - Occasionally create information that is not in the original article (surprisingly rare!)
  - Occasionally repetitive. Tuning the n-gram repeat hyperparameters can help prevent this
  - Not all capitalization in sentences is decoded correctly, especially proper names
  - Sentences are often cut off partway through because we can only specify the max number of characters. The output is cut off mid-sentence if the model did not generate an end of sequence character before reaching the specified max length. 
  - Generated summaries are sometimes too short and miss some important content that is included in the gold standard summary. The model produces a valid summary, but it is too condensed. 
  - Decoding is more computationally intensive
  - Encoding can only take the first 1,017 tokens of the text, largely because of the self-attention layer: attention considers the entire sequence, so it requires $n^2$ calculations for $n$ tokens. If topics appear for the first time later in the article, they will be completely missed and not included in the summary
    - Only 15% of articles have more than 1,017 tokens, and those are not far over: the max is 1,819 tokens and the mean is 653 tokens. 
    - In general, news articles tend to include highlights of the most important information in the first few sentences followed by details, so this should not impact performance as drastically as in other contexts.
    - BCG material follows the pyramid principle where you summarize the key points first, so similar structure to news articles. 
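
The input limit and the quadratic attention cost described above can be illustrated with a small sketch. The function names and the toy article are hypothetical, and plain list items stand in for the model's real subword tokens, so this only shows the shape of the problem, not the actual encoder.

```python
def attention_pairs(n_tokens):
    """Number of token-to-token attention scores for a sequence of n_tokens."""
    return n_tokens * n_tokens

def truncate_tokens(tokens, max_tokens=1017):
    """Keep only the first max_tokens tokens, mimicking the encoder input limit."""
    return tokens[:max_tokens]

# hypothetical article of 1,819 tokens whose key topic first appears at position 1,500
article = ["w%d" % i for i in range(1819)]
article[1500] = "late_topic"
kept = truncate_tokens(article)

print(len(kept), "late_topic" in kept)
print(attention_pairs(1819) / attention_pairs(1017))
```

Under these assumptions, attending over the longest article (1,819 tokens) would need roughly 3.2 times as many pairwise scores as the 1,017-token limit, and any topic introduced after the cutoff never reaches the model.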


In [1]:
%%capture 
!pip install rouge-score
!pip install import-ipynb
!pip install fasttext
!pip install compress-fasttext
!pip install gensim==3.8.3

In [2]:
import os
import pandas as pd
import numpy as np
import pickle
import string
import re
import sys
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer
from sklearn.feature_extraction import DictVectorizer
from collections import Counter, OrderedDict
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances
import networkx as nx
from rouge_score import rouge_scorer
import gensim
import fasttext
from gensim.models import FastText
import compress_fasttext
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords  
from nltk import tokenize
import import_ipynb
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))  
import bokeh
from bokeh.layouts import gridplot, column, row
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
from bokeh.models import Div
from bokeh.models import Span
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.transform import dodge
from math import pi

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [3]:
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# load in functions from extract_summarization notebook
%cd "drive/MyDrive/Colab Notebooks"
import extractive_summarization
%cd ..

/content/drive/MyDrive/Colab Notebooks
importing Jupyter notebook from extractive_summarization.ipynb
/content/drive/My Drive


### Load best models (one per algorithm, per metric)

In [14]:
model_dict = {}
eval_dict = {}
config_dict = {}
for model in ['lsa', 'textrank', 'baseline', 'abstractive']:
  with open('/content/drive/MyDrive/data/trained_model_' + model + '.pkl', 'rb') as f:
    load = pickle.load(f)
    eval_dict[model] = load[1]
    model_dict[model] = load[2]
    config_dict[model] = load[-1]


__Best Configurations for Each Evaluation Metric__
- The False and True versions of each metric are almost always the same, with perhaps one configuration setting different. The rest of the analysis uses only False (unigram only) for simplicity
- Observations
  - Embeddings are never used for vector representations
  - Recall always uses the extraction heuristic that gives the longest summaries, precision the shortest summaries, and F-Measure in between
    - Logical given definitions of precision and recall
  - Bag of Words with binary indicators is almost always used as the vector representation for LSA and TextRank --> counts don't give extra information over binary indicators; word appearing an additional time doesn't have the same importance as a word appearing vs not appearing.
  - Generally no normalization of the vector representation is used, except for TextRank recall
  - Greedy search never used for abstractive decoding; beam is always better
  - Abstractive Fmeasure and Recall models are the same
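
To make the binary-versus-counts observation concrete, here is a minimal sketch of the two bag-of-words representations. The helper `bow_vector` and the example sentences are hypothetical and use whitespace tokenization; the notebook's own vectorization lives in `extractive_summarization` and may differ in detail.

```python
from collections import Counter

def bow_vector(sentence, vocabulary, binary=False):
    """Bag-of-words vector over a fixed vocabulary, as counts or 0/1 indicators."""
    counts = Counter(sentence.lower().split())
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

sentences = ["the pitch needs water water water", "the club saves water"]
vocabulary = sorted({word for s in sentences for word in s.lower().split()})

count_vec = bow_vector(sentences[0], vocabulary)
binary_vec = bow_vector(sentences[0], vocabulary, binary=True)
print(vocabulary)
print(count_vec)
print(binary_vec)
```

The two vectors differ only in the entry for the repeated word, which is the extra signal that the best configurations appear not to need.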

In [16]:
config_dict

{'abstractive': {('fmeasure', False): "('beam', 'max_words_plus')",
  ('fmeasure', True): "('beam', 'max_words_plus')",
  ('precision', False): "('beam', 'max_words_strict')",
  ('precision', True): "('beam', 'max_words_strict')",
  ('recall', False): "('beam', 'max_words_plus')",
  ('recall', True): "('beam', 'max_words_plus')"},
 'baseline': {('fmeasure', False): "('baseline', 'num_words_gt')",
  ('fmeasure', True): "('baseline', 'num_words_gt')",
  ('precision', False): "('baseline', 'num_words_lt')",
  ('precision', True): "('baseline', 'num_words_lt')",
  ('recall', False): "('baseline', 'num_sentences')",
  ('recall', True): "('baseline', 'num_sentences')"},
 'lsa': {('fmeasure',
   False): "('lsa', 'stopwords', 'lemma', 'bow', 'counts', 'no_normalization', 'trigram', 'num_words_gt')",
  ('fmeasure',
   True): "('lsa', 'stopwords', 'stem', 'bow', 'binary', 'no_normalization', 'all', 'num_words_gt')",
  ('precision',
   False): "('lsa', 'stopwords', 'stem', 'bow', 'counts', 'no_no

## Extractive Comparison: Calculate P-Value of Metric Difference between Models with Paired Bootstrap Test

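For the best configuration for each evaluation metric, compare the 3 model types: which model is the best and what is the p-value?
1. Calculate the difference in performance on the metric (recall etc.) between the two models on the original data
2. Generate N bootstrapped samples of the data 
3. Train on the bootstrapped data
4. Calculate the difference in performance on each bootstrapped sample
5. p-value = percent of replicate diffs that are >= 2 * original diff  
  - The null hypothesis is that there is no true difference between the models, and the original diff arose only because the data happens to be biased towards one model. Because the bootstrap samples are drawn from that same data, their diffs are centered on the original diff rather than on 0, so a replicate is at least as extreme as the original diff under the null when it is >= 2 * original diff
  - If we see a lot of replicate diffs >= 2 * original diff, then we cannot reject the null: the data gives no evidence of a real difference between the models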


Result: all are significant
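
The steps above can be sketched on per-article scores. This is a simplified illustration, not the notebook's procedure: it resamples precomputed scores instead of retraining on each bootstrap sample, the function name and toy scores are hypothetical, and it assumes the first system is the better one (positive observed diff).

```python
import random

def paired_bootstrap_pvalue(scores_a, scores_b, n_samples=1000, seed=0):
    """Paired bootstrap test; scores_a[i] and scores_b[i] are for the same article.

    Returns (observed_diff, p_value), assuming system A is the better system.
    """
    assert len(scores_a) == len(scores_b)
    n = len(scores_a)
    rng = random.Random(seed)
    observed = sum(scores_a) / n - sum(scores_b) / n

    exceed = 0
    for _ in range(n_samples):
        # draw article indices once and reuse them for both systems (the pairing)
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx) / n
        # replicate diffs are centered on `observed`, so the threshold is 2 * observed
        if diff >= 2 * observed:
            exceed += 1
    return observed, exceed / n_samples

# toy per-article recall scores for two hypothetical systems
a = [0.40, 0.45, 0.50, 0.55, 0.60, 0.42, 0.48, 0.52]
b = [0.30, 0.35, 0.41, 0.44, 0.52, 0.33, 0.37, 0.40]
observed, p_value = paired_bootstrap_pvalue(a, b, n_samples=200)
print(observed, p_value)
```

Here system A beats system B on every article by a similar margin, so no replicate diff can reach twice the observed diff and the estimated p-value is 0. With a finite number of replicates, a p-value of 0 only means "smaller than 1 / n_samples".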


In [None]:
def paired_bootstrap(evals, models, configs, pvalue_dict, model1, model2, metric, save_every_cnt = 10, filename = '', restart = True):
  embeddings = extractive_summarization.load_embeddings()

  # which model is better and by how much
  if evals[model1][metric]['mean'] > evals[model2][metric]['mean']:
    better_model = model1
    other_model = model2
  else:
    better_model = model2
    other_model = model1
  diff =  evals[better_model][metric]['mean'] - evals[other_model][metric]['mean']
  print(better_model, diff)

  # allow start partway through 50 bootstrap samples 
  if not restart and os.path.exists('/content/drive/MyDrive/data/' + filename + '_' + model1 + '_' + model2 + '_' + str(metric) + '.pkl'):
    with open('/content/drive/MyDrive/data/' + filename + '_' + model1 + '_' + model2 + '_' + str(metric) + '.pkl','rb') as f:
      results_so_far = pickle.load(f) 
    gt_diff = results_so_far[0]
    lt_diff = results_so_far[1]
    start = results_so_far[2]
  else:
    gt_diff = 0
    lt_diff = 0
    start = 0

  for i in range(start+1, 51):
    print('BS', i)
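    # generate bootstrap samples; the shared seed draws the same rows for both models so the
    # comparison stays paired (assumes both frames hold the same articles in the same order)
    bs_sample = {}
    bs_sample[model1] = models[model1][metric].sample(n = len(models[model1][metric]), replace = True, random_state = i)
    bs_sample[model2] = models[model2][metric].sample(n = len(models[model2][metric]), replace = True, random_state = i)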

    # retrain both models on bootstrap samples with the current config
    bs_results = {}
    for m in [model1, model2]:
      # use the configs argument rather than the global config_dict
      config = tuple(configs[m][metric].strip('(').strip(')').replace("'", "").split(', '))
      if 'baseline' in config:
        tfidf, feature_array = extractive_summarization.corpus_tfidf(bs_sample[m])
      else:
        tfidf = ''
        feature_array = ''
      eval_results, _ = extractive_summarization.train_config_loop(bs_sample[m], tfidf, feature_array, embeddings, stop_words, 
                                                                    [config], eval_only = True)
      bs_results[m] = eval_results[str(config)][metric]['mean']
      
    # find difference in relevant stat
    diff_bs = bs_results[better_model] - bs_results[other_model]
    print(diff_bs)
    if diff_bs >= 2*diff:
      gt_diff += 1
    else:
      lt_diff += 1

    # save bootstrap samples every save_every_cnt in case of connection issue, timeout etc.
    if ((i % save_every_cnt) == 0 or (i == 50)) and filename != '':
      with open('/content/drive/MyDrive/data/' + filename + '_' + model1 + '_' + model2 + '_' + str(metric) + '.pkl', 'wb') as f:
        pickle.dump([gt_diff, lt_diff, i], f)
        print('saving!', i)
       
  # calculate p value
  pvalue = gt_diff  / (gt_diff + lt_diff)
  pvalue_dict[metric] = (better_model, pvalue)

  return pvalue_dict

In [None]:
for models in [ ('lsa', 'textrank')]: 
  if os.path.exists('/content/drive/MyDrive/data/pvalue_' + models[0] + '_' + models[1] + '.pkl'):
    # allow loading from pvalue dict with only some of the metrics
    with open('/content/drive/MyDrive/data/pvalue_' + models[0] + '_' + models[1] + '.pkl', 'rb') as f:
      pvalue_dict = pickle.load(f)
  else:
    pvalue_dict = {}
  for metric in [('fmeasure', False), ('precision', False), ('recall', False)]:
    # pvalue_dict is keyed by the metric tuple itself, so compare against the tuple, not its string form
    if metric not in pvalue_dict:
      pvalue_dict = paired_bootstrap(eval_dict, model_dict, config_dict, pvalue_dict, models[0], models[1], metric,
                                     filename = 'bootstrap_loop', restart = False)
    with open('/content/drive/MyDrive/data/pvalue_' + models[0] + '_' + models[1] + '.pkl', 'wb') as f:
      pickle.dump(pvalue_dict, f)

In [None]:
# open completed pvalue dicts
for models in [('baseline', 'lsa'), ('baseline', 'textrank'), ('lsa', 'textrank')]: 
  with open('/content/drive/MyDrive/data/pvalue_' + models[0] + '_' + models[1] + '.pkl', 'rb') as f:
    pvalue_dict = pickle.load(f)
    print(models, '\n', pvalue_dict)

('baseline', 'lsa') 
 {('fmeasure', False): ('baseline', 0.0), ('precision', False): ('baseline', 0.0), ('recall', False): ('lsa', 0.0)}
('baseline', 'textrank') 
 {('fmeasure', False): ('textrank', 0.0), ('precision', False): ('textrank', 0.0), ('recall', False): ('textrank', 0.0)}
('lsa', 'textrank') 
 {('fmeasure', False): ('textrank', 0.0), ('precision', False): ('textrank', 0.0), ('recall', False): ('textrank', 0.0)}


### Compare evaluation metrics

__Best Configs for Respective Best Metric__   
For each best configuration, compare the metric that it is the best at. Ex: compare the average precision of the configurations of LSA, TextRank, TF-IDF, and abstractive with the best precision.  Plotting 12 separate models.

- LSA is consistently the worst performer
- TextRank is consistently the best extractive model 
- Abstractive does better on precision and F-measure, but significantly worse on recall than any of the extractive models. 
  - Because abstractive is generating new words, it may be using words that mean the same thing as words in the gold standard summary without matching them exactly. However, the gold standard summaries are not taken directly from the text, so the same thing can happen with the extractive models. 
  - Worse in recall because it generates shorter summaries: it could only use the word-based length heuristics, not the sentence-based one (the sentence heuristic results in longer summaries than the word heuristics)


In [17]:
%cd figures

/content/drive/My Drive/figures


In [18]:
output_notebook()

In [27]:
# calculate mean of best metric for each model 
mean_lst = {'lsa':[], 'textrank':[], 'baseline':[], 'abstractive':[]}
for metric in [('fmeasure', False), ('precision', False), ('recall', False)]:
  for model in ['lsa', 'textrank', 'baseline', 'abstractive']:
    mean_lst[model].append(eval_dict[model][metric]['mean'])

yvalues = [[i for i in mean_lst[k]] for (k,v) in mean_lst.items()]
yvalues = [item for sublist in yvalues for item in sublist]

xvalues = [[k for i in range(len(mean_lst[k]))] for (k,v) in mean_lst.items()]
xvalues = [item for sublist in xvalues for item in sublist]

mean_lst['metrics'] = ['F-Measure', 'Precision', 'Recall']
source = ColumnDataSource(data=mean_lst)

In [35]:
# side by side metric values for each metric across models (bar plot)
p = figure(x_range=['F-Measure', 'Precision', 'Recall'], y_range=(0, 0.5), plot_height=400, title="Relative Performance of Best Configurations for Target Metric",
           toolbar_location=None, tools="",  background_fill_color="#fafafa")

p.vbar(x=dodge('metrics', -0.3, range=p.x_range), top='baseline', width=0.2, source=source, legend_label="TF-IDF")
p.vbar(x=dodge('metrics',  -0.1,  range=p.x_range), top='lsa', width=0.2, source=source, color = 'darkorange', legend_label="LSA")
p.vbar(x=dodge('metrics',  0.1, range=p.x_range), top='textrank', width=0.2, source=source, color = 'forestgreen', legend_label="TextRank")
p.vbar(x=dodge('metrics',  0.3, range=p.x_range), top='abstractive', width=0.2, source=source, color = 'firebrick', legend_label="Abstractive T5")

# formatting
p.grid.grid_line_color="white"
p.xaxis.axis_label_text_font_size = '12pt'
p.yaxis.axis_label_text_font_size = '12pt'
p.title.text_font_size = '14pt'
p.xaxis.major_label_text_font_size = '12pt'
p.yaxis.major_label_text_font_size = '12pt'
p.legend.location = "top_left"
p.yaxis.minor_tick_line_color = None

output_file("eval_best_compare.html")
show(p)

__Each of Best 12 models for all 3 metrics__    
For the same models as above, plot all 3 metrics, regardless of whether it is that configuration's best/target metric. If we ultimately want to pick one model, we need to ensure that its great performance in one metric is worth its worse performance in other metrics. 

- The TextRank recall model is significantly better in recall than all other models and comparable to the other extractive models in the other metrics
- Smaller differences between the metrics performances for the 3 abstractive models than for other models
  - All 3 do significantly better in precision than any other model 

In [36]:
# calculate mean of all metrics for each model 
for model in ['textrank', 'lsa', 'baseline', 'abstractive']:
  for metric in [('fmeasure', False), ('precision', False), ('recall', False)]:
    model_dict[model][metric]['precision'] = model_dict[model][metric].rouge.map(lambda row: row['rouge1'].precision)
    model_dict[model][metric]['recall'] = model_dict[model][metric].rouge.map(lambda row: row['rouge1'].recall)
    model_dict[model][metric]['fmeasure'] = model_dict[model][metric].rouge.map(lambda row: row['rouge1'].fmeasure)

mean_lst = {}
for model in ['lsa', 'textrank', 'baseline', 'abstractive']:
  for stat in ['fmeasure', 'precision', 'recall']:
    mean_lst[model + '-' + stat] = []
for model in ['lsa', 'textrank', 'baseline', 'abstractive']:
  for metric in [('fmeasure', False), ('precision', False), ('recall', False)]:
    for stat in ['fmeasure', 'precision', 'recall']:
      mean_lst[model + '-' + metric[0]].append(model_dict[model][metric][stat].mean())

xvalues = [[k for i in range(len(mean_lst[k]))] for (k,v) in mean_lst.items()]
xvalues = [item for sublist in xvalues for item in sublist]

yvalues = [[i for i in mean_lst[k]] for (k,v) in mean_lst.items()]
yvalues = [item for sublist in yvalues for item in sublist]

mean_lst['metrics'] = ['F-Measure', 'Precision', 'Recall']
source = ColumnDataSource(data=mean_lst)

In [47]:
# side by side metric values for each metric across models (bar plot)
p = figure(x_range=['F-Measure', 'Precision', 'Recall'], y_range=(0, 0.5), plot_height=400, plot_width = 1000, title="Relative Performance for Best Configurations for All Metrics",
           toolbar_location=None, tools="",  background_fill_color="#fafafa")

d = 0.07
w = 0.07
# keep the same left-to-right order (F-Measure, Precision, Recall) as the other model groups
p.vbar(x=dodge('metrics', -6*d, range=p.x_range), top='baseline-fmeasure', width=w, source=source, legend_label="TF-IDF: F-Measure", color = 'deepskyblue')
p.vbar(x=dodge('metrics',  -5*d,  range=p.x_range), top='baseline-precision', width=w, source=source, legend_label="TF-IDF: Precision")
p.vbar(x=dodge('metrics',  -4*d, range=p.x_range), top='baseline-recall', width=w, source=source, legend_label="TF-IDF: Recall", color = 'mediumblue')

p.vbar(x=dodge('metrics', -3*d+0.015, range=p.x_range), top='lsa-fmeasure', width=w, source=source, legend_label="LSA: F-Measure", color = 'sandybrown')
p.vbar(x=dodge('metrics',  -2*d,  range=p.x_range), top='lsa-precision', width=w, source=source, legend_label="LSA: Precision", color = 'darkorange')
p.vbar(x=dodge('metrics',  -d, range=p.x_range), top='lsa-recall', width=w, source=source, legend_label="LSA: Recall", color = 'orangered')

p.vbar(x=dodge('metrics',  0.015, range=p.x_range), top='textrank-fmeasure', width=w, source=source, legend_label="TextRank: F-Measure", color = 'limegreen')
p.vbar(x=dodge('metrics',  d,  range=p.x_range), top='textrank-precision', width=w, source=source, legend_label="TextRank: Precision", color = 'forestgreen')
p.vbar(x=dodge('metrics',  2*d, range=p.x_range), top='textrank-recall', width=w, source=source, legend_label="TextRank: Recall", color = 'green')

p.vbar(x=dodge('metrics', 3*d+0.015, range=p.x_range), top='abstractive-fmeasure', width=w, source=source, legend_label="Abstractive T5: F-Measure", color = 'salmon')
p.vbar(x=dodge('metrics',  4*d,  range=p.x_range), top='abstractive-precision', width=w, source=source, legend_label="Abstractive T5: Precision", color = 'firebrick')
p.vbar(x=dodge('metrics',  5*d, range=p.x_range), top='abstractive-recall', width=w, source=source, legend_label="Abstractive T5: Recall", color = 'darkred')

# formatting
p.grid.grid_line_color="white"
p.xaxis.axis_label_text_font_size = '12pt'
p.yaxis.axis_label_text_font_size = '12pt'
p.title.text_font_size = '14pt'
p.xaxis.major_label_text_font_size = '12pt'
p.yaxis.major_label_text_font_size = '12pt'
p.add_layout(p.legend[0], 'right')
p.yaxis.minor_tick_line_color = None

output_file("eval_all_compare.html")
show(p)


## Compare Predicted Summaries across Algorithms and Metrics


__Length of Predicted Summaries__
- Precision favors shorter summaries because it measures the share of the predicted summary that overlaps with the actual summary, so any non-relevant information in the predicted summary hurts the statistic. 
- Recall favors longer summaries because it measures the amount of information in the actual summary that is captured in the predicted summary, so the more text in the prediction, the more chances of getting a match with the actual. 
  - TextRank recall generates the longest summaries in terms of number of words. This is a good thing in that it gives human content curators more information to work with (as long as precision is still reasonable)
- Abstractive summaries are shorter across all configurations because we were unable to use the sentences heuristic, and also because decoding can stop before the heuristic is reached if an end of sequence token is generated.
  - Shorter summaries can be a good thing if they are more to the point and include fewer extraneous details. 
- Because all of the best F-measure (and best precision and best recall) models are using the same heuristic, they generate similarly long predicted summaries. 
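
The length trade-off described above can be seen with a small unigram-overlap sketch in the spirit of ROUGE-1. The function `unigram_pr` and the example summaries are hypothetical, and it uses plain whitespace tokenization, so its numbers will not exactly match the `rouge_score` package used in this notebook.

```python
from collections import Counter

def unigram_pr(predicted, gold):
    """ROUGE-1 style precision and recall from clipped unigram overlap."""
    pred_counts = Counter(predicted.lower().split())
    gold_counts = Counter(gold.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & gold_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(gold_counts.values()), 1)
    return precision, recall

gold = "the club installed solar panels to cut energy use"
short_summary = "the club installed solar panels"
long_summary = ("the club installed solar panels and a rainwater cistern "
                "to cut water and energy use this season")

print(unigram_pr(short_summary, gold))
print(unigram_pr(long_summary, gold))
```

The short prediction contains only gold-summary words, so its precision is perfect but it recovers just 5 of the 9 gold words. The long prediction recovers all 9 gold words, but only 9 of its 17 words are relevant, which lowers precision.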


In [61]:
# length of summary by metric within model
summary_len_metric = {'lsa':{}, 'textrank':{}, 'baseline':{}, 'abstractive':{}}
for model in ['lsa', 'textrank', 'baseline', 'abstractive']:
  for metric in [('fmeasure', False), ('precision', False), ('recall', False)]:
    df = model_dict[model][metric]
    if model == 'abstractive': # abstractive predicted summary not in list of sentences, instead continuous text
      df['summary_num_words'] = df.predicted_summary.map(lambda row: len(row.split(' ')))
      df['summary_num_sentences'] = df.predicted_summary.map(lambda row: len(tokenize.sent_tokenize(row)))
    else: # extractive predicted summary is a list of sentences; join with a space so words at sentence boundaries are not merged
      df['summary_num_words'] = df.predicted_summary.map(lambda row: len(' '.join(row).split(' ')))
      df['summary_num_sentences'] = df.predicted_summary.map(lambda row: len(row))

    summary_len_metric[model][metric] = [df.summary_num_sentences.mean(), df.summary_num_words.mean()]

In [62]:
def barplot(top, color, yrange, ylabel, title, noy = False):
  metrics = ['F-Measure', 'Precision', 'Recall']
  
  p = figure(x_range=metrics, y_range = yrange, plot_height=350, plot_width = 300, title=title, toolbar_location=None, tools="")
  p.vbar(x=metrics, top=top, width=0.9, color = color)

  # formatting
  p.grid.grid_line_color="white"
  p.xaxis.axis_label_text_font_size = '12pt'
  p.yaxis.axis_label_text_font_size = '12pt'
  p.title.text_font_size = '13pt'
  p.xaxis.major_label_text_font_size = '12pt'
  p.yaxis.major_label_text_font_size = '12pt'
  p.yaxis.axis_label = ylabel
  p.yaxis.minor_tick_line_color = None
  p.xaxis.major_label_orientation = pi/4

  if noy:
    p.yaxis.major_label_text_font_size = '0pt'

  return p

In [71]:
p1 = barplot([i[0] for i in list(summary_len_metric['baseline'].values())], '#1F77B4', (0,5.2), ylabel = '# Summary Sentences', title = 'TF-IDF')
p2 = barplot([i[0] for i in list(summary_len_metric['lsa'].values())], 'darkorange', (0,5.2), ylabel = '', title = 'LSA', noy = True)
p3 = barplot([i[0] for i in list(summary_len_metric['textrank'].values())], 'forestgreen', (0,5.2), ylabel = '', title = 'TextRank', noy = True)
p4 = barplot([i[0] for i in list(summary_len_metric['abstractive'].values())], 'firebrick', (0,5.2), ylabel = '', title = 'Abstractive T5', noy = True)

suptitle = Div(text = """
<html>
<head>
<style>
h2
</style>
</head>
<body>
<h2>Precision Favors Short Summaries; Recall Favors Long</h2>
</body>
</html>
""")

output_file("eval_numsentences.html")
show(column(suptitle,gridplot([[p1, p2, p3, p4]])))

In [111]:
p1 = barplot([i[1] for i in list(summary_len_metric['baseline'].values())], '#1F77B4', (0,125), ylabel = '# Summary Words', title = 'TF-IDF')
p2 = barplot([i[1] for i in list(summary_len_metric['lsa'].values())], 'darkorange', (0,125), ylabel = '', title = 'LSA', noy = True)
p3 = barplot([i[1] for i in list(summary_len_metric['textrank'].values())], 'forestgreen', (0,125), ylabel = '', title = 'TextRank', noy = True)
p4 = barplot([i[1] for i in list(summary_len_metric['abstractive'].values())], 'firebrick', (0,125), ylabel = '', title = 'Abstractive T5', noy = True)

suptitle = Div(text = """
<html>
<head>
<style>
h2
</style>
</head>
<body>
<h2>Precision Favors Short Summaries; Recall Favors Long</h2>
</body>
</html>
""")

output_file("eval_numwords.html")
show(column(suptitle,gridplot([[p1, p2, p3, p4]])))

__Relative Length Predicted vs Label Summary__        
Compare number of words in predicted summary vs gold standard summary

- Producing summaries that are longer than the gold standard is OK because we are using article sentences, which we know are longer and will have some unnecessary information in them
  - Human labeler will narrow down from too much text to the correct summary
- Abstractive summaries are often shorter than the gold standard because (1) they use word-based heuristics and (2) the model can generate an end of sequence character and end the summary before the heuristic is reached. 
  - Open question: are abstractive summaries too short, and do they thus miss some important information?

In [75]:
# relative length of actual vs predicted summary per model 
relative_len_metric = {'lsa':{}, 'textrank':{}, 'baseline':{}, 'abstractive': {}}
for model in ['lsa', 'textrank', 'baseline', 'abstractive']:
  for metric in [('fmeasure', False), ('precision', False), ('recall', False)]:
    df = model_dict[model][metric]
    df['article_num_words'] = df.summary.map(lambda row: len(row.split(' ')))
    df['diff'] = df.summary_num_words - df.article_num_words 
    relative_len_metric[model][metric] = df['diff'].mean()

In [110]:
p1 = barplot([i for i in list(relative_len_metric['baseline'].values())], '#1F77B4', (-25,75), ylabel = '# Summary Words', title = 'TF-IDF')
p2 = barplot([i for i in list(relative_len_metric['lsa'].values())], 'darkorange', (-25,75), ylabel = '', title = 'LSA', noy = True)
p3 = barplot([i for i in list(relative_len_metric['textrank'].values())], 'forestgreen', (-25,75), ylabel = '', title = 'TextRank', noy = True)
p4 = barplot([i for i in list(relative_len_metric['abstractive'].values())], 'firebrick', (-25,75), ylabel = '# Summary Words', title = 'Abstractive T5')

suptitle = Div(text = """
<html>
<head>
<style>
h3
</style>
</head>
<body>
<h3>Recall Summaries Longer; Precision Summaries Shorter than Gold Standard Summary</h3>
</body>
</html>
""")

suptitle2 = Div(text = """
<html>
<head>
<style>
h3
</style>
</head>
<body>
<h3>Abstractive Summaries Shorter than Gold Standard Summary</h3>
</body>
</html>
""")

output_file("eval_relativelength.html")
show(column(suptitle,gridplot([[p1, p2, p3]]), suptitle2, gridplot([[p4]])))

__Length of Predicted Extracted Sentences__
- Extractive models produce sentences of similar average length, which makes sense since they are all sourcing from original article sentences
- Abstractive produces shorter sentences
  - This is usually preferable in that the generated summary sentences are to the point and don't include extra information, while the extracted text sentences often include a key point plus some supporting information. 

In [94]:
# length of sentences: words per sentence for each model
sentence_len_metric = {'lsa':{}, 'textrank':{}, 'baseline':{}, 'abstractive':{}}
for model in ['lsa', 'textrank', 'baseline', 'abstractive']:
  for metric in [('fmeasure', False), ('precision', False), ('recall', False)]:
    df = model_dict[model][metric]
    df['article_num_words'] = df.predicted_summary.map(lambda row: np.mean([len(i.split(' ')) for i in row]))     
    if model == 'abstractive': # abstractive predicted summary not in list of sentences, instead continuous text
      df['article_num_words'] = df.predicted_summary.map(lambda row: np.mean([len(i.split(' ')) for i in tokenize.sent_tokenize(row)]))         
    sentence_len_metric[model][metric] = df['article_num_words'].mean()

In [96]:
p1 = barplot([i for i in list(sentence_len_metric['baseline'].values())], '#1F77B4', (0,30), ylabel = '# Summary Words', title = 'TF-IDF')
p2 = barplot([i for i in list(sentence_len_metric['lsa'].values())], 'darkorange', (0,30), ylabel = '', title = 'LSA', noy = True)
p3 = barplot([i for i in list(sentence_len_metric['textrank'].values())], 'forestgreen', (0,30), ylabel = '', title = 'TextRank', noy = True)
p4 = barplot([i for i in list(sentence_len_metric['abstractive'].values())], 'firebrick', (0,30), ylabel = '', title = 'Abstractive T5', noy = True)

suptitle = Div(text = """
<html>
<head>
<style>
h3
</style>
</head>
<body>
<h3>Extracted Sentences of Uniform Length across Models; Abstractive Sentences Shorter</h3>
</body>
</html>
""")

output_file("eval_sentencelength.html")
show(column(suptitle,gridplot([[p1, p2, p3, p4]])))

### Qualitative Comparison
- Compare TextRank recall vs precision. Get an example where precision is too vague and recall is too rambly
- Compare TextRank vs LSA recall. Is LSA qualitatively worse? Why?
- Evaluate the abstractive model by itself: examples of shortcomings of abstractive models
- Compare abstractive recall vs TextRank recall

__Extractive: TextRank Recall vs Precision__
  - Zimmerman case and parent's fear: precision is more to the point; the recall model is too rambly and repeats itself
    - Recall: 'If, during this 16-month ordeal, that thought never crossed your mind, then you have no idea what it is like to be the parent of a young, black male in America.', "Opinion: Zimmerman case echoes issues of race, guns  But this is what it's like to be the parent of a young, black male in this country.", 'After it was determined I was not the black male he was looking for, he let me go.', 'To be the parent of a young black man in this country is to be torn between wanting your son to see the world with no boundaries and warning him of the boundaries that are out there.', "That's when he pulled over, got out of his car, drew his weapon and yelled he was going to shoot me if I didn't stop running."
    - Precision: Opinion: Zimmerman case echoes issues of race, guns  But this is what it's like to be the parent of a young, black male in this country.
  - Carbon footprint of Football (soccer) and efforts to be more environmentally friendly: precision too brief, recall includes more details of the article
    - Recall: "The cistern stores rainwater that can be used to irrigate the pitch, but compared to most clubs' water consumption, it is just a drop in the ocean.", '"Ethical Consumer" estimates that it takes an astonishing 20,000 liters of water per day to maintain a football pitch in the English Premier League, and at Camp Nou, home of Champions League winners Barcelona, up to 54,000 liters of water are needed to irrigate the pitch on a hot day.', 'The stadium is covered with a "living roof" of plants that provide a natural air filtration system, gray water is supplied from two huge ponds near the stadium and solar panels are used to heat water for the toilets.', 'It is not a huge investment, but it is enough to make a difference, and the club say the scheme will pay for itself by reducing energy expenditure.'
    - Precision: The cistern stores rainwater that can be used to irrigate the pitch, but compared to most clubs' water consumption, it is just a drop in the ocean.
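
The trade-off in these examples follows from how overlap metrics are defined: precision divides the overlap by the length of the generated summary, while recall divides it by the length of the gold standard. The sketch below illustrates this with a simplified unigram-overlap score in the style of ROUGE-1. It is an assumption for illustration, not necessarily the exact metric used elsewhere in this notebook, and the function name `unigram_scores` and the toy strings are hypothetical.

```python
from collections import Counter

def unigram_scores(candidate: str, reference: str):
    """Return (precision, recall) from clipped unigram overlap.

    Simplified ROUGE-1-style score: whitespace tokenization, lowercasing,
    no stemming or stopword handling (all assumptions for this sketch).
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall

# Toy strings, not taken from the dataset
reference = "cheney released from hospital after heart transplant"
short_summary = "cheney released from hospital"
long_summary = ("cheney released from hospital after heart transplant "
                "his office said tuesday in a statement")

print(unigram_scores(short_summary, reference))  # high precision, lower recall
print(unigram_scores(long_summary, reference))   # lower precision, full recall
```

A short summary that only contains gold-standard words scores perfectly on precision but leaves recall on the table; a longer one covers the whole reference at the cost of precision.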

__TextRank recall vs LSA recall__
  - Release of Mac OS X Lion: LSA misses some of the main points and fixates on reviewer details instead
    - LSA: 'Brian X. Chen of Wired wrote that "some of Lion\'s iOS-like features scale up very well, while others behave very poorly in a desktop environment."', "He found the software's iPad-like scrolling feature distracting and dizzying (he eventually disabled it) but said he enjoyed the system's app-opening full-screen mode and praised new sharing and auto-save functions.", 'The new system, however, does not run iPhone or iPad apps (at least, not yet) and it does run Adobe Flash -- something Steve Jobs and company have summarily banished from their mobile devices.'
    - TextRank: "Lion, the latest version of Apple's operating software for its Mac computers, was released to the public on Wednesday.", 'Mac OS X Lion is available as a $29.99 upgrade for people with the latest version of the Snow Leopard operating system.', 'Lion is the latest in a long line of cat-named operating systems rolled out by Apple for its computers.'
    - Gold standard: "Apple's Mac OS X Lion released on Wednesday    The new operating system for Macs adopts features from mobile devices    New system has 250 new features, Apple says"
  - Dick Cheney heart transplant: LSA picks up a secondary detail, TextRank captures the main point
    - LSA: Cheney had been on the cardiac transplant list for more than 20 months.
    - TextRank: 'Former Vice President Dick Cheney has been released from a Virginia hospital 10 days after undergoing a heart transplant, his office said Tuesday.'
    - Gold standard: Dick Cheney has been released from a Virginia hospital    He was recovering from a heart transplant    Cheney, 71, suffered from at least five heart attacks since 1978    The former Wyoming congressman served as a vice president under President George W. Bush


__Abstractive__
- Precision vs recall: both use beam search, so the only difference is that the recall summaries are longer, which we have already argued is preferable. The rest of the analysis uses the recall model.
- Examples of abstractive creating information that isn't in the article
  - "john avlon: if Zimmerman was black, he wouldn't have been the parent of a young black man. he says he's a parent of young black males in the middle-class, predominantly white neighborhood.." -- The article often used the phrase "If Zimmerman was black..." and also discussed the fear of parents of Black children. But this summary combines those two ideas in a nonsensical way.
    
- Repetitive (and incorrect) information
  - Part of generated summary from story about children being kidnapped and brought to Cuba: "he says the family is receiving "exceptional cooperation" from the united states and the united states."
    - In reality, the article discussed receiving cooperation from the Cuban government
- Not all capitalization is recovered correctly
  - "jeb Bush is the clear Republican presidential frontrunner"
- Sentences are cut off partway through because we can only specify a maximum number of characters
  - "the club says it has reduced landfill by 85 percent, moved to electric vehicles at the ground, and used eco-friendly paper for match-day programs. but some clubs are starting'"
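
One possible post-processing step for the cut-off sentences, which this notebook does not apply, is to drop the trailing fragment after the last sentence-ending punctuation mark. The sketch below is a minimal illustration under that assumption; `trim_to_last_sentence` is a hypothetical helper, and the simple regular expression will also treat abbreviations such as "X." as sentence ends.

```python
import re

def trim_to_last_sentence(text: str) -> str:
    """Drop any trailing fragment after the last '.', '!' or '?'.

    Punctuation only counts when followed by whitespace or the end of the
    string, so decimals such as 29.99 are not treated as sentence ends.
    Returns the text unchanged if no sentence end is found.
    """
    matches = list(re.finditer(r"[.!?](?=\s|$)", text))
    if not matches:
        return text
    return text[:matches[-1].end()]

# Shortened version of the truncated example above
truncated = ("the club says it has reduced landfill by 85 percent. "
             "but some clubs are starting")
print(trim_to_last_sentence(truncated))
```

This trades a little content for readability: the dangling clause is lost, but the human editor receives only complete sentences.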

In general, the abstractive model pastes together phrases from the articles in grammatically correct ways. Many chunks of words are taken directly from the article.    

__Abstractive vs Extractive__
- Zimmerman case: abstractive misses information
  - Abstractive misses a lot of information, including the main point. Extractive is long, but hits all of the important points
- Carbon footprint of Football: abstractive more to the point without extraneous details
  - 'a 2008 survey by "Ethical Consumer" looked at the eco-credentials of clubs in the english premier league. the club says it has reduced landfill by 85 percent, moved to electric vehicles at the ground, and used eco-friendly paper for match-day programs. it's possible for small clubs to do their bit, but their use of solar'
- Dick Cheney heart transplant: abstractive misses information
  - "former vice president was on the cardiac transplant list for more than 20 months. he had a left ventricular assis" (cut off)
  - Doesn't get the main point: the heart transplant was successful and Cheney was released from the hospital. The extractive summary did capture this.
- Storm in Sardinia: abstractive sentences in a better order
  - Extractive: 'But Vargiu said a lot of ministry staff and emergency workers had come to aid the town, and there were many volunteers helping its inhabitants.', 'Vargiu said a lot of money would be needed for Sardinia to recover from the storm, with many families and businesses affected.', 'There are rivers of water in the town.', 'Olbia\'s councilor for tourism, Marco Vargiu, told CNN that the storm had been "a disaster" for the town of about 70,000, with authorities saying 13 people had died in the broader area.', 'Vargiu said the houses were filled with a mixture of water, sand and rubbish.'
    - Order of sentences doesn't make sense
  - Abstractive: "a storm has killed at least 16 people on the italian island of Sardinia. the government has allocated 20 million euros in immediate aid to the island. the money will be used to help save lives, assist the displaced and repair roads. the island has received six months' worth of rainfall in 12 hours."

Extractive summarization more consistently hits all of the important points in an article, even if the ordering of sentences is confusing and the summary can be too long. Given that humans are editing these summaries, capturing all of the information is more important. 