# Miscellaneous notes

## Data source and idea

From the HANNA GitHub repository (https://github.com/dig-team/hanna-benchmark-asg),
we can retrieve the file hanna_stories_annotation.csv. For each of 96 prompts, it contains one story written by a human and one story generated by each of 10 ASG systems, so 96 × 11 = 1056 stories in total.

The idea is to try to reproduce, to some extent, the results reported in "Of human criteria and automatic metrics: a benchmark of the evaluation of story generation".

## Various metrics

* From https://github.com/PierreColombo/nlg_eval_via_simi_measures: DepthScore, BaryScore, InfoLM
* From https://github.com/neural-dialogue-metrics/BLEU: BLEU
* From NLTK: BLEU
* From https://github.com/neural-dialogue-metrics/rouge: ROUGE
* From https://github.com/pltrdy/rouge: ROUGE (alternative implementation)
* From https://github.com/bheinzerling/pyrouge: ROUGE (pyrouge, Rouge155)
* From NLTK: METEOR





# Dependencies


## Installations

In [None]:
#!pip install -r ../requirements.txt

In [None]:
!pip install transformers
!pip install torch

In [None]:
!pip install nltk
!pip install py-rouge
#https://github.com/Diego999/py-rouge

In [None]:
!pip install git+https://github.com/PierreColombo/nlg_eval_via_simi_measures.git

In [None]:
#!pip install git+https://github.com/neural-dialogue-metrics/BLEU.git

#those two have the same package name : rouge
#!pip install git+https://github.com/neural-dialogue-metrics/rouge.git
#!pip install git+https://github.com/pltrdy/rouge

## Imports

In [None]:
import numpy as np 
import sklearn
import torch 
import transformers
from tqdm import tqdm

In [None]:
from nlg_eval_via_simi_measures.bary_score import BaryScoreMetric
from nlg_eval_via_simi_measures.depth_score import DepthScoreMetric
from nlg_eval_via_simi_measures.infolm import InfoLM

In [None]:
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import corpus_bleu

# Testing metrics

In [None]:
ref = ['I like my cakes very much','I hate these cakes!']
hypothesis = ['I like my cakes very much','I like my cakes very much']

# Baryscore

def compute_bary(ref, hypothesis):
  metric_call = BaryScoreMetric()
  metric_call.prepare_idfs(ref, hypothesis)
  final_preds = metric_call.evaluate_batch(ref, hypothesis)
  print(final_preds)
  print("="*25)

compute_bary(ref, hypothesis)

ref = ['I like my cakes very much']
hypothesis = ['I like my cakes very much']

compute_bary(ref, hypothesis)

ref = ['I hate these cakes!']
hypothesis = ['I like my cakes very much']

compute_bary(ref, hypothesis)


*   W stands for Wasserstein distance.
*   SD stands for Sinkhorn Divergence, which is something else and in principle not what we are interested in (it is reported for several values of a parameter).

So, in principle, we only keep "baryscore_W".

* The idf weights have to be computed at the corpus level, but from what I read in the paper, this is done separately for each reference/candidate pair (at least, the opposite is not clearly stated).

So, in my opinion, even though there is an "evaluate_batch" function, the computation should be done individually for each pair rather than on everything concatenated. The idf computation, however, must be done over the complete **texts** being compared.

The closer the metric is to 0, the better.
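
The per-pair evaluation described above can be sketched as follows. This is only an illustration: `score_each_pair` and `toy_metric` are hypothetical names, and `toy_metric` is a stand-in word-overlap distance, not BaryScore. It only mimics the convention that 0 means the two texts are identical.

```python
from typing import Callable, Dict, List

# A pair-level metric takes a one-element reference batch and a one-element
# candidate batch, and returns a dictionary of scores (hypothetical interface).
PairMetric = Callable[[List[str], List[str]], Dict[str, float]]


def score_each_pair(refs: List[str], hyps: List[str],
                    metric_for_pair: PairMetric) -> List[Dict[str, float]]:
    """Score every (reference, candidate) pair on its own, never as one batch."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    return [metric_for_pair([ref], [hyp]) for ref, hyp in zip(refs, hyps)]


def toy_metric(ref_batch: List[str], hyp_batch: List[str]) -> Dict[str, float]:
    """Stand-in metric: 1 - Jaccard overlap of the word sets (0 = identical)."""
    ref_words = set(ref_batch[0].lower().split())
    hyp_words = set(hyp_batch[0].lower().split())
    overlap = len(ref_words & hyp_words) / len(ref_words | hyp_words)
    return {"toy_distance": 1.0 - overlap}


refs = ['I like my cakes very much', 'I hate these cakes!']
hyps = ['I like my cakes very much', 'I like my cakes very much']
scores = score_each_pair(refs, hyps, toy_metric)
print(scores)
```

With the real library, `metric_for_pair` would presumably wrap the `prepare_idfs` and `evaluate_batch` calls on the single pair, as `compute_bary` does in the cell above.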



In [None]:
# DepthScore

ref = ['I like my cakes very much','I hate these cakes!']
hypothesis = ['I like my cakes very much','I like my cakes very much']

metric_call = DepthScoreMetric()
metric_call.prepare_idfs(ref, hypothesis)
final_preds = metric_call.evaluate_batch(hypothesis, ref)
print("DepthScore")
print(final_preds)


DepthScore computes an embedding for the candidate and for the reference using a single layer of BERT (like BERTScore, apparently), then computes the divergence introduced in the "a pseudo metric" paper. The closer the score is to 0, the closer the two sentences are.

In [None]:
# For now, this only works on CPU (a .to(device) call is probably missing somewhere)

# InfoLM 

metric = InfoLM()

ref = ['I like my cakes very much','I hate these cakes!']
hypothesis = ['I like my cakes very much','I like my cakes very much']

metric.prepare_idfs(ref, hypothesis)

final_preds = metric.evaluate_batch(hypothesis, ref)

# prepare_idfs stores the idf dictionaries on the metric
# (idf_dict_hyp, idf_dict_ref), so they are not passed explicitly here.
print(final_preds)

Once again, it seems to me that the idf weights are computed at the level of a single candidate (C) / reference (R) pair.
First, separately for C and R, the PMLM computes a discrete probability distribution for each token by masking it. These distributions are then aggregated using the idf weights, and finally a divergence measure estimates the distance between the two resulting probability distributions.
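
As a rough illustration of the aggregation step, the sketch below combines per-token distributions over a toy 3-word vocabulary into a single distribution using idf weights. The function name, the toy numbers, and the choice to normalise the weights so that they sum to 1 are assumptions made for the example, not the InfoLM implementation.

```python
from typing import List


def aggregate_with_idf(token_dists: List[List[float]],
                       idf_weights: List[float]) -> List[float]:
    """Idf-weighted average of per-token distributions over the vocabulary."""
    if len(token_dists) != len(idf_weights):
        raise ValueError("one idf weight is needed per token distribution")
    total = sum(idf_weights)
    vocab_size = len(token_dists[0])
    return [
        sum(w * dist[k] for w, dist in zip(idf_weights, token_dists)) / total
        for k in range(vocab_size)
    ]


# One distribution per masked token (toy values), each summing to 1.
token_dists = [
    [0.8, 0.1, 0.1],
    [0.2, 0.5, 0.3],
]
# Hypothetical idf weights: the second token is rarer, so it counts more.
idf_weights = [1.0, 3.0]

aggregated = aggregate_with_idf(token_dists, idf_weights)
print(aggregated)
```

Because the weights are normalised, the aggregated vector is still a probability distribution, which is what a divergence between C and R needs as input.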

By default, the Fisher-Rao distance is used. According to the paper it performs well, and it has the advantage of having no parameter to tune.

Three results are returned because the algorithm supports divergences, which are not symmetric: the first is Div(A,B), the second is Div(B,A), and the third is the mean of the two. Since we use the Fisher-Rao distance, which is a true distance and therefore symmetric, the first two values should coincide and the distinction does not matter here.
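
The symmetry argument can be checked numerically. The sketch below compares the Kullback-Leibler divergence (asymmetric) with a Fisher-Rao distance on two toy distributions, returning the same three values as described above. The formula used, 2·arccos(Σ sqrt(p_i q_i)), is the usual geodesic form. InfoLM's implementation may differ by a constant factor, and all function names here are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Distribution = List[float]


def kl_divergence(p: Distribution, q: Distribution) -> float:
    """Kullback-Leibler divergence KL(p || q); not symmetric in general."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def fisher_rao(p: Distribution, q: Distribution) -> float:
    """Fisher-Rao geodesic distance between two discrete distributions."""
    bhattacharyya = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Clamp to 1.0 to guard against rounding slightly above the acos domain.
    return 2.0 * math.acos(min(1.0, bhattacharyya))


def three_scores(div: Callable[[Distribution, Distribution], float],
                 a: Distribution, b: Distribution) -> Tuple[float, float, float]:
    """Return Div(A,B), Div(B,A) and their mean."""
    forward = div(a, b)
    backward = div(b, a)
    return forward, backward, (forward + backward) / 2


p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

kl_scores = three_scores(kl_divergence, p, q)
fr_scores = three_scores(fisher_rao, p, q)
print("KL:", kl_scores)          # forward and backward values differ
print("Fisher-Rao:", fr_scores)  # forward and backward values are equal
```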

In [None]:
# Bleu by NLTK (sentence level)
lambda_split_function=lambda x:  x.split()
ref = ['I like my cakes very much','I hate these cakes!']
hypothesis = ['I like my cakes very much','I like my cakes very much']

ref_processed = list(map(lambda_split_function,ref))
hyp_processed = list(map(lambda_split_function,hypothesis))

for i in range(len(ref_processed)):
    print(ref_processed[i],hyp_processed[i])
    print(sentence_bleu([ref_processed[i]],hyp_processed[i]))

In [None]:
# Bleu by NLTK (corpus level)
lambda_split_function=lambda x:  x.split()
ref = ['I like my cakes very much','I love these cakes!']
hypothesis = ['I like my cakes very much','I love my cakes very much']

ref_processed = list(map(lambda_split_function,ref))
hyp_processed = list(map(lambda_split_function,hypothesis))

for i in range(len(hyp_processed)):
    print(ref_processed,hyp_processed[i])
    print(corpus_bleu([ref_processed],[hyp_processed[i]]))

In [None]:
# py-rouge

import rouge
from nltk import download
download('punkt')

def prepare_results(m, p, r, f):
    return '\t{}:\t{}: {:5.2f}\t{}: {:5.2f}\t{}: {:5.2f}'.format(m, 'P', 100.0 * p, 'R', 100.0 * r, 'F1', 100.0 * f)



all_hypothesis = ['I like my cakes very much','I love my cakes very much']
all_references = ['I like my cakes very much','I love these cakes!']

for aggregator in ['Avg', 'Best', 'Individual']:
    print('Evaluation with {}'.format(aggregator))
    apply_avg = aggregator == 'Avg'
    apply_best = aggregator == 'Best'

    evaluator = rouge.Rouge(metrics=['rouge-n', 'rouge-l', 'rouge-w'],
                           max_n=4,
                           limit_length=True,
                           length_limit=100,
                           length_limit_type='words',
                           apply_avg=apply_avg,
                           apply_best=apply_best,
                           alpha=0.5, # Default F1_score
                           weight_factor=1.2,
                           stemming=True)



    scores = evaluator.get_scores(all_hypothesis, all_references)

    for metric, results in sorted(scores.items(), key=lambda x: x[0]):
        if not apply_avg and not apply_best: # value is a type of list as we evaluate each summary vs each reference
            for hypothesis_id, results_per_ref in enumerate(results):
                nb_references = len(results_per_ref['p'])
                for reference_id in range(nb_references):
                    print('\tHypothesis #{} & Reference #{}: '.format(hypothesis_id, reference_id))
                    print('\t' + prepare_results(metric,results_per_ref['p'][reference_id], results_per_ref['r'][reference_id], results_per_ref['f'][reference_id]))
            print()
        else:
            print(prepare_results(metric, results['p'], results['r'], results['f']))
    print()


In [None]:
!pip install git+https://github.com/neural-dialogue-metrics/rouge.git

In [None]:
from rouge import rouge_n_sentence_level


lambda_split_function=lambda x:  x.split()
ref = ['I like my cakes very much','I hate these cakes!']
hypothesis = ['I like my cakes very much','I like my cakes very much']

ref_processed = list(map(lambda_split_function,ref))
hyp_processed = list(map(lambda_split_function,hypothesis))

for i in range(len(ref_processed)):
    reference_sentence=ref_processed[i]
    summary_sentence=hyp_processed[i]

    # Calculate ROUGE-2.
    recall, precision, rouge = rouge_n_sentence_level(summary_sentence, reference_sentence, 2)
    print('ROUGE-2-R', recall)
    print('ROUGE-2-P', precision)
    print('ROUGE-2-F', rouge)

    # If you just want the F-measure you can do this:
    *_, rouge = rouge_n_sentence_level(summary_sentence, reference_sentence, 2)  # *_ unpacking requires Python 3.
    print('ROUGE-2-F', rouge)



In [None]:
!pip install git+https://github.com/pltrdy/rouge

In [None]:
# pltrdy/rouge

from rouge import Rouge 

ref = ['I like my cakes very much','I hate these cakes!']
hypothesis = ['I like my cakes very much','I like my cakes very much']

rouge = Rouge()
scores = rouge.get_scores(hypothesis, ref)
scores

In [None]:
# METEOR score

from nltk.translate import meteor
from nltk import word_tokenize
from nltk import download
download('punkt')
download('wordnet')

ref = ['I like my cakes very much','I hate these cakes!']
hypothesis = ['I like my cakes very much','I like my cakes very much']

for i in range(len(ref)):
    wref=word_tokenize(ref[i])
    whyp=word_tokenize(hypothesis[i])
    print(wref,whyp)
    print(round(meteor([wref],whyp), 4))