# Summarization Evaluation
This notebook explains the metrics commonly used to evaluate text summarization results and how to use the evaluation utilities provided in the repo. 

## ROUGE
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics for evaluating automatic text summarization and machine translation results. The metrics compare machine-generated summaries or translations against one or more reference summaries or translations created by humans.  
Commonly used ROUGE metrics are ROUGE-1, ROUGE-2, and ROUGE-L:
* ROUGE-1: Overlap of unigrams (single words) between machine-generated and reference summaries. 
* ROUGE-2: Overlap of bigrams (two adjacent words) between machine-generated and reference summaries.
* ROUGE-L: Longest Common Subsequence (LCS) between machine-generated and reference summaries. Matches must appear in the same order but need not be consecutive, so the score reflects sentence-level structural similarity.  

For each metric, recall, precision, and F1 score are computed. 
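
To make the definitions above concrete, here is a minimal sketch of ROUGE-N and ROUGE-L for a single candidate/reference pair. It is not the implementation used by pyrouge or py-rouge: the helper names (`tokenize`, `ngrams`, `prf`, `rouge_n`, `lcs_length`, `rouge_l`) and the simple whitespace tokenization are assumptions made for illustration, and the real packages add preprocessing and options that are omitted here, so their numbers may differ.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation (a simplification)."""
    tokens = (word.strip(".,!?;:\"'()") for word in text.lower().split())
    return [token for token in tokens if token]


def ngrams(tokens, n):
    """Count every n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Turn a match count into precision, recall, and F1."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"p": precision, "r": recall, "f": f1}


def rouge_n(candidate, reference, n):
    cand, ref = ngrams(tokenize(candidate), n), ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())  # each n-gram counts at most as often as in both texts
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, keeping only one previous DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    cand, ref = tokenize(candidate), tokenize(reference)
    return prf(lcs_length(cand, ref), len(cand), len(ref))


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
rouge_1 = rouge_n(candidate, reference, 1)  # 5 of 6 unigrams match
rouge_2 = rouge_n(candidate, reference, 2)  # 3 of 5 bigrams match
rouge_l_scores = rouge_l(candidate, reference)  # LCS is "the cat on the mat"
print(rouge_1, rouge_2, rouge_l_scores)
```

Precision divides the match count by the candidate length and recall divides it by the reference length, which is why a short candidate that copies part of the reference scores high on precision but low on recall.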

**Utilities for computing ROUGE**
* `compute_rouge_perl`: The [pyrouge](https://github.com/bheinzerling/pyrouge/tree/master/pyrouge) package, which is based on the ROUGE package written in Perl, is the most popular package for computing ROUGE scores. We provide the `compute_rouge_perl` function based on pyrouge. 
* `compute_rouge_python`: The [py-rouge](https://pypi.org/project/py-rouge/) package is a Python implementation of the ROUGE metric that produces almost the same results as the Perl implementation. Since it's easier to install than pyrouge and can be extended to other languages, we provide the `compute_rouge_python` function based on py-rouge. Currently, only English is supported. Support for other languages will be added on an as-needed basis. 

In [1]:
import os
import sys

nlp_path = os.path.abspath('../../')
if nlp_path not in sys.path:
    sys.path.insert(0, nlp_path)
    
from utils_nlp.eval.compute_rouge import compute_rouge_perl, compute_rouge_python

### Sample inputs
Both `compute_rouge_perl` and `compute_rouge_python` take lists of candidate summaries and reference summaries as inputs. Alternatively, you can provide paths to files containing the candidates and references and set the `input_files` argument to `True`. 
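
For the file-based variant, a minimal sketch might look like the following. It assumes each file holds one summary per line, which is not stated above, and it takes the `input_files` keyword name from the description above, so check the function signature in your checkout before relying on it. The helper `write_lines` and the names `example_candidates`, `example_references`, `cand_path`, and `ref_path` are hypothetical.

```python
import os
import tempfile

example_candidates = ["The stock market is doing well this year.", "The movie is very popular."]
example_references = ["The stock market is doing really well in 2019.", "The movie is very popular among millennials."]


def write_lines(path, lines):
    """Write one summary per line (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")


tmp_dir = tempfile.mkdtemp()
cand_path = os.path.join(tmp_dir, "candidates.txt")
ref_path = os.path.join(tmp_dir, "references.txt")
write_lines(cand_path, example_candidates)
write_lines(ref_path, example_references)

try:
    from utils_nlp.eval.compute_rouge import compute_rouge_python
except ImportError:
    compute_rouge_python = None  # repo utilities (or their dependencies) are not available

if compute_rouge_python is not None:
    # Pass file paths instead of lists; `input_files=True` is the flag named in the text.
    file_scores = compute_rouge_python(cand=cand_path, ref=ref_path, input_files=True)
    print(file_scores)
```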

In [2]:
summary_candidates = ["The stock market is doing well this year.", "The movie is very popular."]
summary_references = ["The stock market is doing really well in 2019.", "The movie is very popular among millennials."]

### compute_rouge_python
To use `compute_rouge_python`, you only need to install the Python packages `py-rouge` and `nltk`.

In [3]:
python_rouge_scores = compute_rouge_python(cand=summary_candidates, ref=summary_references)

Number of candidates: 2
Number of references: 2


In [4]:
print("ROUGE-1: {}".format(python_rouge_scores["rouge-1"]))
print("ROUGE-2: {}".format(python_rouge_scores["rouge-2"]))
print("ROUGE-L: {}".format(python_rouge_scores["rouge-l"]))

ROUGE-1: {'f': 0.7696078431372548, 'p': 0.875, 'r': 0.6904761904761905}
ROUGE-2: {'f': 0.6666666666666667, 'p': 0.7857142857142857, 'r': 0.5833333333333333}
ROUGE-L: {'f': 0.8044834406175039, 'p': 0.8934181487831181, 'r': 0.7343809193130839}
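
The ROUGE-1 numbers above are consistent with scoring each candidate/reference pair separately and then averaging precision, recall, and F1 over the pairs. The sketch below checks this by hand under a simplified tokenization (lowercase, whitespace split, periods removed), which is an assumption and not necessarily what py-rouge does internally. The names `unigram_counts`, `per_pair`, `macro`, and `f_of_means` are illustrative.

```python
from collections import Counter

candidates = ["The stock market is doing well this year.", "The movie is very popular."]
references = ["The stock market is doing really well in 2019.", "The movie is very popular among millennials."]


def unigram_counts(text):
    # Simplified tokenization (assumption): lowercase, drop periods, split on whitespace.
    return Counter(text.lower().replace(".", "").split())


per_pair = []
for cand_text, ref_text in zip(candidates, references):
    cand, ref = unigram_counts(cand_text), unigram_counts(ref_text)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    per_pair.append({"p": p, "r": r, "f": 2 * p * r / (p + r)})

# Average each score over the pairs.
macro = {k: sum(s[k] for s in per_pair) / len(per_pair) for k in ("p", "r", "f")}

# For contrast: F1 recomputed from the averaged precision and recall.
f_of_means = 2 * macro["p"] * macro["r"] / (macro["p"] + macro["r"])

print(macro)
print(f_of_means)
```

The averaged `f` agrees with the reported ROUGE-1 `f` of 0.7696, while `f_of_means` comes out slightly higher, so the reported F1 should not be expected to equal the harmonic mean of the reported precision and recall.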


### compute_rouge_perl
To use `compute_rouge_perl`, in addition to installing the Python package `pyrouge`, you also need to go through the following setup steps on a Linux machine.  
**NOTE**: Before running the cell, set `PYROUGE_PATH` to the root directory of the cloned `pyrouge` repo and `PYTHON_PATH` to the root directory of the conda environment where you installed `pyrouge`.

In [None]:
%%bash
git clone https://github.com/andersjo/pyrouge.git
# PYROUGE_PATH=<root directory of cloned pyrouge repo>  # e.g. /home/hlu/notebooks/summarization/pyrouge
# PYTHON_PATH=<root directory of conda environment>  # e.g. /data/anaconda/envs/nlp_gpu
PYROUGE_PATH=/home/hlu/notebooks/summarization/pyrouge
PYTHON_PATH=/data/anaconda/envs/nlp_gpu
$PYTHON_PATH/bin/pyrouge_set_rouge_path $PYROUGE_PATH/tools/ROUGE-1.5.5

# install XML::DOM plugin, instructions https://web.archive.org/web/20171107220839/www.summarizerman.com/post/42675198985/figuring-out-rouge
sudo cpan App::cpanminus
sudo cpanm XML::DOM

# install XML::Parser and its dependencies
sudo apt-get update
sudo apt-get install libexpat1-dev -y
sudo cpanm  XML::Parser

# Fix WordNet issue
# Instructions https://web.archive.org/web/20180812011301/http://kavita-ganesan.com/rouge-howto/#IamHavingWordNetExceptions
cd  $PYROUGE_PATH/tools/ROUGE-1.5.5/data/
rm WordNet-2.0.exc.db

cd WordNet-2.0-Exceptions/
./buildExeptionDB.pl . exc WordNet-2.0.exc.db
cd ..
ln -s WordNet-2.0-Exceptions/WordNet-2.0.exc.db WordNet-2.0.exc.db

In [5]:
perl_rouge_scores = compute_rouge_perl(cand=summary_candidates, ref=summary_references)

2019-12-03 19:43:25,977 [MainThread  ] [INFO ]  Writing summaries.
2019-12-03 19:43:25,978 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpm29_bwie/system and model files to /tmp/tmpm29_bwie/model.
2019-12-03 19:43:25,979 [MainThread  ] [INFO ]  Processing files in /tmp/tmpf5p8odh5/rouge-tmp-2019-12-03-19-43-25/candidate/.
2019-12-03 19:43:25,980 [MainThread  ] [INFO ]  Processing cand.1.txt.
2019-12-03 19:43:25,981 [MainThread  ] [INFO ]  Processing cand.0.txt.
2019-12-03 19:43:25,982 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpm29_bwie/system.
2019-12-03 19:43:25,982 [MainThread  ] [INFO ]  Processing files in /tmp/tmpf5p8odh5/rouge-tmp-2019-12-03-19-43-25/reference/.
2019-12-03 19:43:25,983 [MainThread  ] [INFO ]  Processing ref.0.txt.
2019-12-03 19:43:25,984 [MainThread  ] [INFO ]  Processing ref.1.txt.
2019-12-03 19:43:25,985 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpm29_bwie/model.
2019-12-03 19:43:25,986 [MainThread  ] [IN

Number of candidates: 2
Number of references: 2
---------------------------------------------
1 ROUGE-1 Average_R: 0.69048 (95%-conf.int. 0.66667 - 0.71429)
1 ROUGE-1 Average_P: 0.87500 (95%-conf.int. 0.75000 - 1.00000)
1 ROUGE-1 Average_F: 0.76961 (95%-conf.int. 0.70588 - 0.83334)
---------------------------------------------
1 ROUGE-2 Average_R: 0.58333 (95%-conf.int. 0.50000 - 0.66667)
1 ROUGE-2 Average_P: 0.78571 (95%-conf.int. 0.57143 - 1.00000)
1 ROUGE-2 Average_F: 0.66666 (95%-conf.int. 0.53333 - 0.80000)
---------------------------------------------
1 ROUGE-3 Average_R: 0.51428 (95%-conf.int. 0.42857 - 0.60000)
1 ROUGE-3 Average_P: 0.75000 (95%-conf.int. 0.50000 - 1.00000)
1 ROUGE-3 Average_F: 0.60577 (95%-conf.int. 0.46154 - 0.75000)
---------------------------------------------
1 ROUGE-4 Average_R: 0.41666 (95%-conf.int. 0.33333 - 0.50000)
1 ROUGE-4 Average_P: 0.70000 (95%-conf.int. 0.40000 - 1.00000)
1 ROUGE-4 Average_F: 0.51515 (95%-conf.int. 0.36363 - 0.66667)
------------

In [44]:
perl_rouge_scores

{'rouge_1_recall': 0.69048,
 'rouge_1_recall_cb': 0.66667,
 'rouge_1_recall_ce': 0.71429,
 'rouge_1_precision': 0.875,
 'rouge_1_precision_cb': 0.75,
 'rouge_1_precision_ce': 1.0,
 'rouge_1_f_score': 0.76961,
 'rouge_1_f_score_cb': 0.70588,
 'rouge_1_f_score_ce': 0.83334,
 'rouge_2_recall': 0.58333,
 'rouge_2_recall_cb': 0.5,
 'rouge_2_recall_ce': 0.66667,
 'rouge_2_precision': 0.78571,
 'rouge_2_precision_cb': 0.57143,
 'rouge_2_precision_ce': 1.0,
 'rouge_2_f_score': 0.66666,
 'rouge_2_f_score_cb': 0.53333,
 'rouge_2_f_score_ce': 0.8,
 'rouge_3_recall': 0.51428,
 'rouge_3_recall_cb': 0.42857,
 'rouge_3_recall_ce': 0.6,
 'rouge_3_precision': 0.75,
 'rouge_3_precision_cb': 0.5,
 'rouge_3_precision_ce': 1.0,
 'rouge_3_f_score': 0.60577,
 'rouge_3_f_score_cb': 0.46154,
 'rouge_3_f_score_ce': 0.75,
 'rouge_4_recall': 0.41666,
 'rouge_4_recall_cb': 0.33333,
 'rouge_4_recall_ce': 0.5,
 'rouge_4_precision': 0.7,
 'rouge_4_precision_cb': 0.4,
 'rouge_4_precision_ce': 1.0,
 'rouge_4_f_score': 

For each score, the 95% confidence interval is also computed: the "\_cb" and "\_ce" suffixes mark the beginning and end of the confidence interval, respectively.  
In addition to ROUGE-1, ROUGE-2, and ROUGE-L, the Perl script computes a few other ROUGE scores. See details of all scores [here](https://en.wikipedia.org/wiki/ROUGE_%28metric%29).  
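
Because the dictionary is flat, it can be convenient to group each score with its confidence bounds. The sketch below does that for a few entries copied from the output above. The function `group_confidence_intervals` and the field names `value`, `ci_low`, and `ci_high` are hypothetical and not part of the repo utilities.

```python
# A few entries copied from the dictionary printed above.
sample_scores = {
    "rouge_1_recall": 0.69048,
    "rouge_1_recall_cb": 0.66667,
    "rouge_1_recall_ce": 0.71429,
    "rouge_1_precision": 0.875,
    "rouge_1_precision_cb": 0.75,
    "rouge_1_precision_ce": 1.0,
}


def group_confidence_intervals(flat_scores):
    """Nest each score with the bounds stored under its `_cb` / `_ce` keys."""
    grouped = {}
    for key, value in flat_scores.items():
        if key.endswith("_cb"):
            name, field = key[:-3], "ci_low"
        elif key.endswith("_ce"):
            name, field = key[:-3], "ci_high"
        else:
            name, field = key, "value"
        grouped.setdefault(name, {})[field] = value
    return grouped


grouped_scores = group_confidence_intervals(sample_scores)
for name, stats in grouped_scores.items():
    print(f"{name}: {stats['value']:.5f} [{stats['ci_low']:.5f}, {stats['ci_high']:.5f}]")
```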

In [3]:
# Some test cases (temporary)
import pandas as pd
c = ["this is really good"]
r = ["this is really great"]
rouge_perl = compute_rouge_perl(c, r)
rouge_python = compute_rouge_python(c, r)

print(pd.DataFrame(rouge_python))

2019-12-03 20:56:06,551 [MainThread  ] [INFO ]  Writing summaries.
2019-12-03 20:56:06,552 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpp4q5yfq5/system and model files to /tmp/tmpp4q5yfq5/model.
2019-12-03 20:56:06,552 [MainThread  ] [INFO ]  Processing files in /tmp/tmpdztrmed9/rouge-tmp-2019-12-03-20-56-06/candidate/.
2019-12-03 20:56:06,553 [MainThread  ] [INFO ]  Processing cand.0.txt.
2019-12-03 20:56:06,554 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpp4q5yfq5/system.
2019-12-03 20:56:06,555 [MainThread  ] [INFO ]  Processing files in /tmp/tmpdztrmed9/rouge-tmp-2019-12-03-20-56-06/reference/.
2019-12-03 20:56:06,556 [MainThread  ] [INFO ]  Processing ref.0.txt.
2019-12-03 20:56:06,558 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpp4q5yfq5/model.
2019-12-03 20:56:06,559 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmpldfqkkda/rouge_conf.xml
2019-12-03 20:56:06,560 [MainThread  ] [INFO ]  Running ROUGE with comma

Number of candidates: 1
Number of references: 1
---------------------------------------------
1 ROUGE-1 Average_R: 0.75000 (95%-conf.int. 0.75000 - 0.75000)
1 ROUGE-1 Average_P: 0.75000 (95%-conf.int. 0.75000 - 0.75000)
1 ROUGE-1 Average_F: 0.75000 (95%-conf.int. 0.75000 - 0.75000)
---------------------------------------------
1 ROUGE-2 Average_R: 0.66667 (95%-conf.int. 0.66667 - 0.66667)
1 ROUGE-2 Average_P: 0.66667 (95%-conf.int. 0.66667 - 0.66667)
1 ROUGE-2 Average_F: 0.66667 (95%-conf.int. 0.66667 - 0.66667)
---------------------------------------------
1 ROUGE-3 Average_R: 0.50000 (95%-conf.int. 0.50000 - 0.50000)
1 ROUGE-3 Average_P: 0.50000 (95%-conf.int. 0.50000 - 0.50000)
1 ROUGE-3 Average_F: 0.50000 (95%-conf.int. 0.50000 - 0.50000)
---------------------------------------------
1 ROUGE-4 Average_R: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
1 ROUGE-4 Average_P: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
1 ROUGE-4 Average_F: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
------------

In [4]:
c = ["this is very good"]
r = ["this is really great"]
rouge_perl = compute_rouge_perl(c, r)
rouge_python = compute_rouge_python(c, r)
print(pd.DataFrame(rouge_python))

2019-12-03 20:56:09,105 [MainThread  ] [INFO ]  Writing summaries.
2019-12-03 20:56:09,106 [MainThread  ] [INFO ]  Processing summaries. Saving system files to /tmp/tmpdx08vejw/system and model files to /tmp/tmpdx08vejw/model.
2019-12-03 20:56:09,106 [MainThread  ] [INFO ]  Processing files in /tmp/tmpgee0mb66/rouge-tmp-2019-12-03-20-56-09/candidate/.
2019-12-03 20:56:09,107 [MainThread  ] [INFO ]  Processing cand.0.txt.
2019-12-03 20:56:09,108 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpdx08vejw/system.
2019-12-03 20:56:09,108 [MainThread  ] [INFO ]  Processing files in /tmp/tmpgee0mb66/rouge-tmp-2019-12-03-20-56-09/reference/.
2019-12-03 20:56:09,109 [MainThread  ] [INFO ]  Processing ref.0.txt.
2019-12-03 20:56:09,110 [MainThread  ] [INFO ]  Saved processed files to /tmp/tmpdx08vejw/model.
2019-12-03 20:56:09,111 [MainThread  ] [INFO ]  Written ROUGE configuration to /tmp/tmp2nhy7o3v/rouge_conf.xml
2019-12-03 20:56:09,112 [MainThread  ] [INFO ]  Running ROUGE with comma

Number of candidates: 1
Number of references: 1
---------------------------------------------
1 ROUGE-1 Average_R: 0.50000 (95%-conf.int. 0.50000 - 0.50000)
1 ROUGE-1 Average_P: 0.50000 (95%-conf.int. 0.50000 - 0.50000)
1 ROUGE-1 Average_F: 0.50000 (95%-conf.int. 0.50000 - 0.50000)
---------------------------------------------
1 ROUGE-2 Average_R: 0.33333 (95%-conf.int. 0.33333 - 0.33333)
1 ROUGE-2 Average_P: 0.33333 (95%-conf.int. 0.33333 - 0.33333)
1 ROUGE-2 Average_F: 0.33333 (95%-conf.int. 0.33333 - 0.33333)
---------------------------------------------
1 ROUGE-3 Average_R: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
1 ROUGE-3 Average_P: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
1 ROUGE-3 Average_F: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
---------------------------------------------
1 ROUGE-4 Average_R: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
1 ROUGE-4 Average_P: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
1 ROUGE-4 Average_F: 0.00000 (95%-conf.int. 0.00000 - 0.00000)
------------
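
Comparisons like the test cases above can also be checked programmatically. The sketch below compares the ROUGE-1 and ROUGE-2 values printed earlier for the two-sentence sample, mapping the py-rouge key layout (`"rouge-1"` with `"p"`, `"r"`, `"f"`) to the flat pyrouge keys (`"rouge_1_precision"` and so on). The values are copied from those outputs; the function `max_abs_difference`, the `PERL_SUFFIX` mapping, and the tolerance of `1e-4` are assumptions chosen for illustration.

```python
# Values copied from the outputs shown earlier in this notebook.
python_scores = {
    "rouge-1": {"f": 0.7696078431372548, "p": 0.875, "r": 0.6904761904761905},
    "rouge-2": {"f": 0.6666666666666667, "p": 0.7857142857142857, "r": 0.5833333333333333},
}
perl_scores = {
    "rouge_1_recall": 0.69048,
    "rouge_1_precision": 0.875,
    "rouge_1_f_score": 0.76961,
    "rouge_2_recall": 0.58333,
    "rouge_2_precision": 0.78571,
    "rouge_2_f_score": 0.66666,
}

# Short py-rouge field names mapped to the suffixes used in the pyrouge output.
PERL_SUFFIX = {"r": "recall", "p": "precision", "f": "f_score"}


def max_abs_difference(python_scores, perl_scores):
    """Largest absolute gap between matching scores of the two implementations."""
    diffs = []
    for metric, stats in python_scores.items():
        prefix = metric.replace("-", "_")  # "rouge-1" -> "rouge_1"
        for short_name, value in stats.items():
            perl_key = f"{prefix}_{PERL_SUFFIX[short_name]}"
            diffs.append(abs(value - perl_scores[perl_key]))
    return max(diffs)


max_diff = max_abs_difference(python_scores, perl_scores)
print(max_diff)
assert max_diff < 1e-4  # the Perl output is rounded to five decimals
```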