# Evaluation for text summarization

In this notebook, we compute several text summarization metrics (such as ROUGE and BLEU) on a set of generated summaries. 
To do this, we load a file containing the original summaries and their corresponding generated summaries. 
(We do not load any trained model in this notebook.) 



## Select the dataset
Our previous notebooks allow you to train a model on different datasets. 
Select the dataset whose predictions you want to load and score:

In [None]:
datasetInfo = [['cnn_dailymail', '3.0.0', 'article', 'highlights'],
               ['gigaword', '1.2.0', 'document', 'summary'],
               ['xsum', '1.1.0', 'document', 'summary'],
               ['reddit','1.0.0', 'content', 'summary'],

               ['biomrc', 'biomrc_large_A', 'abstract','answer'],
               ['biomrc', 'biomrc_large_B', 'abstract','title'],
               ['emotion', '0.0.0','text','label']]


# Select the dataset by its 0-based position in datasetInfo (2 selects xsum)
numDataset = 2

nameDataset=datasetInfo[numDataset][0]
versionDataset=datasetInfo[numDataset][1]
text_field=datasetInfo[numDataset][2]
summary_field=datasetInfo[numDataset][3]

    

print("Evaluating ", nameDataset)

Evaluating  xsum
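
Selecting by position is easy to get wrong when the list is reordered or extended. The following is a minimal sketch of a name-based lookup over the same kind of rows; `select_dataset` and the snake_case variable names are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: pick a dataset row by name (and optionally by version)
# instead of by its position in the list.
dataset_info = [
    ['cnn_dailymail', '3.0.0', 'article', 'highlights'],
    ['gigaword', '1.2.0', 'document', 'summary'],
    ['xsum', '1.1.0', 'document', 'summary'],
    ['biomrc', 'biomrc_large_A', 'abstract', 'answer'],
    ['biomrc', 'biomrc_large_B', 'abstract', 'title'],
]

def select_dataset(rows, name, version=None):
    """Return the first row matching name (and version, if given)."""
    for row in rows:
        if row[0] == name and (version is None or row[1] == version):
            return row
    raise ValueError('Unknown dataset: {} {}'.format(name, version or ''))

name_dataset, version_dataset, text_field, summary_field = select_dataset(dataset_info, 'xsum')
print('Evaluating', name_dataset)
```

Passing a version is only needed when a name appears more than once, as with the two `biomrc` configurations.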


## Load the file containing the predictions
To calculate the metrics, we need to load the file with the predictions created by the model. 

If you are running on Google Colab, you need to mount your Google Drive:


In [None]:
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    root_colab='drive/My Drive/Colab Notebooks/'
    root=root_colab+'NLPwithDL/TextSummarization/'
else:
    print('Not running on CoLab')
    root='./'



Mounted at /content/drive


## Loading the file with the predictions

We assume that the predictions were saved into a file named after the dataset used to train the model:

In [None]:
import os
import pandas as pd

path_predictions = root + 'outputs/' + nameDataset + '.csv'
if not os.path.exists(path_predictions):
    print('{} does not exist!!!'.format(path_predictions))

# Load the csv file, keeping only the columns we need
df = pd.read_csv(path_predictions, usecols=["input_text", "gold_summary", "predicted_summary"])

predicted_summaries = df['predicted_summary'].tolist()  # list with the generated summaries
gold_summaries = df['gold_summary'].tolist()  # list with the original summaries

print('{} summaries were loaded'.format(len(gold_summaries)))

ParserError: ignored
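
Empty CSV cells are read by pandas as `NaN` (a float), and metric libraries generally expect non-empty strings. The following is a minimal sketch of filtering out unusable pairs before scoring; `is_missing` and `drop_incomplete_pairs` are hypothetical helpers, and the toy lists stand in for the two lists loaded above.

```python
import math

def is_missing(value):
    """True for None, NaN, or a string with no visible characters."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ''

def drop_incomplete_pairs(predicted, gold):
    """Keep only (prediction, reference) pairs where both sides are usable."""
    kept = [(p, g) for p, g in zip(predicted, gold)
            if not is_missing(p) and not is_missing(g)]
    predicted_clean = [p for p, _ in kept]
    gold_clean = [g for _, g in kept]
    return predicted_clean, gold_clean

# Toy data standing in for the lists loaded from the csv file.
predicted = ['a cat sat', float('nan'), '   ', 'dogs bark']
gold = ['the cat sat', 'birds sing', 'fish swim', 'dogs bark loudly']
predicted_clean, gold_clean = drop_incomplete_pairs(predicted, gold)
print('{} of {} pairs kept'.format(len(predicted_clean), len(predicted)))
```

Filtering both lists together keeps each prediction aligned with its reference.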

### Rouge

We use **rouge-1**, **rouge-2** and **rouge-l** to evaluate the generated summaries in the test dataframe against the original ones. To do this, we use the **rouge** library (https://pypi.org/project/rouge/). 
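
To make the metric concrete, here is a minimal sketch of ROUGE-N as n-gram overlap between a prediction and a reference. It assumes lowercased whitespace tokens and clipped counts; `ngrams` and `rouge_n` are illustrative names, and library implementations differ in tokenization and counting details, so their numbers may not match this sketch exactly.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(prediction, reference, n=1):
    """Precision, recall and F1 of the n-gram overlap (whitespace tokens)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # matches, clipped to the smaller count
    p = overlap / sum(pred.values()) if pred else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {'p': p, 'r': r, 'f': f}

print(rouge_n('the cat sat on the mat', 'the cat is on the mat', n=1))
print(rouge_n('the cat sat on the mat', 'the cat is on the mat', n=2))
```

Precision divides the overlap by the number of n-grams in the prediction, and recall divides it by the number in the reference.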


In [None]:
!pip install rouge

Collecting rouge
  Downloading https://files.pythonhosted.org/packages/43/cc/e18e33be20971ff73a056ebdb023476b5a545e744e3fc22acd8c758f1e0d/rouge-1.0.0-py3-none-any.whl
Installing collected packages: rouge
Successfully installed rouge-1.0.0


In [None]:
from rouge import Rouge 

print('Evaluation for ', nameDataset)
print()
rouge = Rouge()

scores = rouge.get_scores(predicted_summaries, gold_summaries, avg=True)
print(scores)

Evaluation for  xsum

{'rouge-1': {'f': 0.32203383333407487, 'p': 0.29007891717047407, 'r': 0.3781294572932764}, 'rouge-2': {'f': 0.09704224982912574, 'p': 0.08693358470797063, 'r': 0.11500725709799853}, 'rouge-l': {'f': 0.2711185391484092, 'p': 0.24859366034231772, 'r': 0.3093694673727616}}
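
The `rouge-l` entry is based on the longest common subsequence (LCS) between the two texts. The following is a minimal sketch under the same assumptions as before (lowercased whitespace tokens, plain F1 rather than a weighted F-measure); `lcs_length` and `rouge_l` are illustrative names, not the library's internals.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(prediction, reference):
    """Precision, recall and F1 based on the LCS of the two token lists."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(pred, ref)
    p = lcs / len(pred) if pred else 0.0
    r = lcs / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {'p': p, 'r': r, 'f': f}

print(rouge_l('the cat sat on the mat', 'the cat is on the mat'))
```

Unlike rouge-2, the matched words only need to appear in the same order; they do not need to be adjacent.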


We also save the scores into a file (with the same name as the dataset):

In [None]:
path_scores = root + 'scores/' + nameDataset + '.txt'
# 'w' overwrites any previous scores file for this dataset
with open(path_scores, 'w') as f:
    f.write("Scores for {}\n".format(nameDataset))
    f.write('____________________________________\n')
    for metric in scores:  # metric will be rouge-1, rouge-2 or rouge-l
        f.write("{}\n".format(metric))
        f.write("\n{}\n".format(scores[metric]))
    f.write('\n\n')

print('scores were saved into {}'.format(path_scores))


scores were saved into drive/My Drive/Colab Notebooks/NLPwithDL/TextSummarization/scores/xsum.txt
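
If the scores need to be read back by another script, a structured format can be more convenient than free text. The following is a minimal sketch that writes the nested score dictionary as JSON and creates the target folder when it is missing; `save_scores_json`, `load_scores_json` and the temporary path are hypothetical and not part of the original notebook.

```python
import json
import os
import tempfile

def save_scores_json(scores, path):
    """Write a nested dict of scores to a JSON file, creating the folder if needed."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(path, 'w') as f:
        json.dump(scores, f, indent=2, sort_keys=True)

def load_scores_json(path):
    """Read the scores back as a nested dict."""
    with open(path) as f:
        return json.load(f)

# Toy scores in the same shape as the averaged ROUGE output shown above.
example_scores = {'rouge-1': {'f': 0.32, 'p': 0.29, 'r': 0.38}}
example_path = os.path.join(tempfile.mkdtemp(), 'scores', 'example.json')
save_scores_json(example_scores, example_path)
print(load_scores_json(example_path))
```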


#### rouge1, rouge2, rouge3, rougeL, rougeW, rougeS, rougeSU

We also want to obtain other ROUGE metrics such as rougeW, rougeS and rougeSU. To do this, we use the **rouge-metric** library (https://pypi.org/project/rouge-metric/).
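
As an illustration of what rougeS measures, here is a minimal sketch of skip-bigram overlap: ordered word pairs that may have other words between them. It assumes that the gap limit means "at most `max_gap` words between the pair" and uses clipped counts with plain F1; `skip_bigrams` and `rouge_s` are illustrative names, and the Perl implementation may differ in details.

```python
from collections import Counter

def skip_bigrams(tokens, max_gap):
    """Count ordered word pairs with at most max_gap words between them."""
    pairs = Counter()
    for i, first in enumerate(tokens):
        last = min(i + max_gap + 2, len(tokens))  # exclusive upper bound
        for j in range(i + 1, last):
            pairs[(first, tokens[j])] += 1
    return pairs

def rouge_s(prediction, reference, max_gap=4):
    """Precision, recall and F1 of the skip-bigram overlap."""
    pred = skip_bigrams(prediction.lower().split(), max_gap)
    ref = skip_bigrams(reference.lower().split(), max_gap)
    overlap = sum((pred & ref).values())
    p = overlap / sum(pred.values()) if pred else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return {'p': p, 'r': r, 'f': f}

print(rouge_s('police kill the gunman', 'police killed the gunman', max_gap=4))
```

With `max_gap=0` only adjacent pairs are counted, which reduces to ordinary bigram overlap.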

In [None]:
!pip install rouge-metric

Collecting rouge-metric
  Downloading https://files.pythonhosted.org/packages/bb/34/18ddbc94f65e8b45220b373b2ad2db6bef7549f4b00b4baaaaa47204be1a/rouge_metric-1.0.1-py3-none-any.whl (151kB)

In [None]:
from rouge_metric import PerlRouge

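# The parameters indicate which ROUGE metrics we want to obtain:
rouge = PerlRouge(rouge_n_max=3, rouge_l=True, rouge_w=True,
    rouge_w_weight=1.2, rouge_s=True, rouge_su=True, skip_gap=4)

# Then we evaluate the predictions against the gold summaries.
# Assumption based on the library's documented usage: evaluate() expects a
# list of references per prediction, so each gold summary is wrapped in a list.
references = [[gold] for gold in gold_summaries]
scores = rouge.evaluate(predicted_summaries, references)
print(scores)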


CalledProcessError: ignored

We also save them into the file with the scores:

In [None]:
# 'a' appends to the scores file created earlier
with open(path_scores, 'a') as f:
    f.write("Extended ROUGE Scores for {}\n".format(nameDataset))
    f.write('____________________________________\n')
    for metric in scores:  # one entry per ROUGE variant requested above
        f.write("{}\n".format(metric))
        f.write("\n{}\n".format(scores[metric]))
    f.write('\n\n')

print('scores were saved into {}'.format(path_scores))

### BLEU

In [None]:
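from nltk.translate.bleu_score import sentence_bleu

# sentence_bleu expects lists of tokens; passing raw strings would compare
# characters instead of words, so we split on whitespace first.
total = 0
for i in range(len(gold_summaries)):
    reference = gold_summaries[i].split()
    hypothesis = predicted_summaries[i].split()
    # weights=(1.0, 0, 0, 0) uses unigrams only (BLEU-1)
    total += sentence_bleu([reference], hypothesis, weights=(1.0, 0, 0, 0))

# Average of the sentence-level scores
BLEUscore = total / len(gold_summaries)
print('BLEU:', BLEUscore)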

Corpus/Sentence contains 0 counts of 4-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().
Corpus/Sentence contains 0 counts of 3-gram overlaps.
BLEU scores might be undesirable; use SmoothingFunction().


BLEU: 0.6650918560787469
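
With unigram-only weights, BLEU reduces to clipped unigram precision multiplied by a brevity penalty. The following is a minimal sketch of that computation on whitespace tokens; `bleu1` and `mean_bleu1` are illustrative names, and the sketch is not meant to reproduce NLTK's output exactly.

```python
import math
from collections import Counter

def bleu1(prediction, reference):
    """Unigram BLEU: clipped unigram precision times the brevity penalty."""
    pred = prediction.split()
    ref = reference.split()
    if not pred:
        return 0.0
    # Each predicted word counts at most as often as it occurs in the reference.
    clipped = sum((Counter(pred) & Counter(ref)).values())
    precision = clipped / len(pred)
    # Penalise predictions that are shorter than the reference.
    bp = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return bp * precision

def mean_bleu1(predictions, references):
    """Average of the sentence-level scores over all pairs."""
    scores = [bleu1(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0

print(bleu1('the cat sat', 'the cat sat on the mat'))
print(bleu1('the the the', 'the cat'))
```

The first call shows the brevity penalty at work: every predicted word matches, but the prediction is half the length of the reference. The second shows clipping: repeating a matching word does not raise the score.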


We also save the BLEU score into the file with the rest of the scores:

In [None]:
with open(path_scores, 'a') as f:
    f.write("BLEU score for {}:{}\n".format(nameDataset, BLEUscore))

print('scores were saved into {}'.format(path_scores))

scores were saved into drive/My Drive/Colab Notebooks/NLPwithDL/TextSummarization/scores/xsum.txt
