# Hallucination Metrics

> DO NOT RUN ALL MODELS AT ONCE! Run FactSumm, then HuggingFace, then BLEURT separately: each of them loads fairly large models, and loading them together can easily exhaust your RAM.

Several hallucination metrics exist, each with its own pros and cons. A good metric should ideally be:
- Bounded
- Reproducible
- Interpretable
- Correlated with human judgement

> Note: In the report we should maybe briefly discuss whether the metrics we pick meet these criteria 

## Fixing FactSumm

> The Open IE method (the part where you need Java) still didn't work for me out-of-the-box and I don't care to fix it as we have good alternatives I think.

One of FactSumm's dependencies has changed its interface over the years and FactSumm is not really a maintained project (however, it has a few ready-to-go implementations of metrics which make our lives easier). There is an easy fix, though.

If you simply install FactSumm with
```bash
pip install factsumm
```
and try to run the example with Lionel Messi, you will likely run into a problem `KeyError: 'entities'`. All you have to do (as per [this](https://github.com/Huffon/factsumm/issues/36)) to fix this is 
1. navigate to your virtual / conda environment and find the installed 'factsumm' package folder
2. open up `factsumm/utils/level_entity.py`
3. Change `line_result = sentence.to_dict(tag_type="ner")` to `line_result = sentence.get_spans('ner')`
4. Replace

```python
for entity in line_result["entities"]:
    if entity["text"] not in cache:
        dedup.append({
            "word": entity["text"],
            "entity": entity["labels"][0].value,
            "start": entity["start_pos"],
            "end": entity["end_pos"],
        })
        cache[entity["text"]] = None
result.append(dedup)
```

with 
```python
for entity in line_result: 
    if entity.text not in cache:
        dedup.append({
            "word": entity.text, 
            "entity": entity.tag, 
            "start": entity.start_position, 
            "end": entity.end_position, 
        }) 
        cache[entity.text] = None
result.append(dedup)
```

5. Done. Re-activate your virtual/conda environment for the changes to take effect and re-run your code.

In [None]:
from factsumm import FactSumm
factsumm = FactSumm()
article = "Lionel Andrés Messi (born 24 June 1987) is an Argentine professional footballer who plays as a forward and captains both Spanish club Barcelona and the Argentina national team. Often considered as the best player in the world and widely regarded as one of the greatest players of all time, Messi has won a record six Ballon d'Or awards, a record six European Golden Shoes, and in 2020 was named to the Ballon d'Or Dream Team."
summary = "Lionel Andrés Messi (born 24 Aug 1997) is an Spanish professional footballer who plays as a forward and captains both Spanish club Barcelona and the Spanish national team."

## Introduction

Hallucinations are words generated by the model that are not supported by the source input. Deep learning based generation is prone to hallucinating unintended text. These hallucinations degrade system performance and fail to meet user expectations in many real-world scenarios. By applying entity matching, we can mitigate this problem for the downstream task of summary generation.

In theory, all entities in the summary (such as dates, locations and so on) should also be present in the article. Thus we can extract all entities from the summary and compare them to the entities of the original article, spotting potential hallucinations. The more unmatched entities we find, the lower the factualness score of the summary.

From https://huggingface.co/spaces/ml6team/post-processing-summarization
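
The entity-matching idea above can be sketched in a few lines. This is only an illustration under simple assumptions: the function name `entity_factualness` and the hand-written entity lists are hypothetical, matching is exact and case-insensitive, and the score is taken to be the fraction of summary entities found in the article. A real pipeline would obtain the entities from an NER model and would probably need fuzzier matching.

```python
def entity_factualness(article_entities, summary_entities):
    """Return the fraction of summary entities that also occur in the article.

    Matching is exact and case-insensitive (an assumption of this sketch).
    """
    article_set = {entity.lower() for entity in article_entities}
    summary_list = [entity.lower() for entity in summary_entities]
    if not summary_list:
        # Assumption: a summary without entities has nothing to contradict.
        return 1.0
    matched = sum(1 for entity in summary_list if entity in article_set)
    return matched / len(summary_list)


# Hand-written entity lists, loosely based on the Messi example (not NER output).
article_entities = ["Lionel Andrés Messi", "24 June 1987", "Barcelona", "Argentina"]
summary_entities = ["Lionel Andrés Messi", "24 Aug 1997", "Barcelona", "Spain"]

score = entity_factualness(article_entities, summary_entities)
print(score)  # 2 of the 4 summary entities are matched
```

A score of 1.0 means every summary entity was found in the article; lower values point at potential hallucinations.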

## Classical NLP scores

### FactSumm Triple-based fact score

The most intuitive way to evaluate factual consistency is to count the fact overlap between the generated summary and the source document (Figure 3 of the survey). Facts are usually represented by relation triples (subject, relation, object), where the subject has a relation to the object. Because triples extracted with an open schema can express the same fact in different wordings and are therefore hard to match, Goodrich et al. [2019] instead use relation extraction tools with a fixed schema. For example, fixed-schema extraction yields the triple (Hawaii, is the birthplace of, Obama) whether it is applied to the source document or to the summary, which makes the extracted triples easier to compare.

From https://arxiv.org/abs/2104.14839
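
As a rough illustration of the triple-overlap idea, the sketch below computes the share of summary triples that also appear among the source triples. The function name `triple_fact_score` and the example triples are hypothetical and hand-written; this is not FactSumm's implementation, which first has to extract the triples with a relation extraction model.

```python
def triple_fact_score(source_triples, summary_triples):
    """Fraction of summary triples that are also present in the source.

    Triples are (subject, relation, object) tuples, compared case-insensitively.
    """
    source = {tuple(part.lower() for part in triple) for triple in source_triples}
    summary = [tuple(part.lower() for part in triple) for triple in summary_triples]
    if not summary:
        # Assumption: no extracted facts means no evidence of consistency.
        return 0.0
    supported = sum(1 for triple in summary if triple in source)
    return supported / len(summary)


# Hand-written triples in a fixed schema (illustrative only).
source_triples = [
    ("Hawaii", "is the birthplace of", "Obama"),
    ("Obama", "was president of", "United States"),
]
summary_triples = [
    ("Hawaii", "is the birthplace of", "Obama"),
    ("Kenya", "is the birthplace of", "Obama"),  # not supported by the source
]

fact_score = triple_fact_score(source_triples, summary_triples)
print(fact_score)
```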

In [None]:
# Triple-based Module (closed-scheme). Fact Score
factsumm.extract_facts(article, summary, verbose=True)

### ML6 HuggingFace NER + Dependency Parsing scores

In [None]:
import itertools
import numpy as np

import spacy

from flair.nn import Classifier
from flair.data import Sentence

from sentence_transformers import SentenceTransformer

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

def get_transformer_pipeline():
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
    model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
    return pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

sentence_embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
ner_model = get_transformer_pipeline()
nlp = spacy.load("en_core_web_sm")
flair_tagger = Classifier.load('ner')

Some weights of the model checkpoint at xlm-roberta-large-finetuned-conll03-english were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2023-09-23 15:46:34,218 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>


In [None]:
def get_all_entities_per_sentence(text):
    doc = nlp(text)

    sentences = list(doc.sents)

    entities_all_sentences = []
    for sentence in sentences:
        entities_this_sentence = []

        # SPACY ENTITIES
        for entity in sentence.ents:
            entities_this_sentence.append(str(entity))

        # FLAIR ENTITIES
        sentence_entities = Sentence(str(sentence))
        flair_tagger.predict(sentence_entities)
        for entity in sentence_entities.get_spans('ner'):
            entities_this_sentence.append(entity.text)

        # XLM ENTITIES
        entities_xlm = [entity["word"] for entity in ner_model(str(sentence))]
        for entity in entities_xlm:
            entities_this_sentence.append(str(entity))

        entities_all_sentences.append(entities_this_sentence)

    return entities_all_sentences


In [None]:
article = "Lionel Andrés Messi (born 20 June 1987) is an Argentine professional footballer who plays as a forward and captains both Spanish club Barcelona and the Argentina national team. Often considered as the best player in the world and widely regarded as one of the greatest players of all time, Messi has won a record six Ballon d'Or awards, a record six European Golden Shoes, and in 2020 was named to the Ballon d'Or Dream Team."
summary = "Lionel Andrés Messi (born 24 Aug 1997) is an Spanish professional footballer who plays as a forward and captains both Spanish club Barcelona and the Spanish national team."

In [None]:
matched, unmatched = get_and_compare_entities(article, summary)
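
The helper `get_and_compare_entities` called above is not defined anywhere in this notebook. The sketch below is a hypothetical reconstruction of the matching step only, not the original code: the names `compare_entities` and `flatten`, the optional `similarity` callback and the 0.9 threshold are all assumptions. A summary entity counts as matched if it is a case-insensitive substring of any article entity, or if the optional similarity function rates it above the threshold against some article entity.

```python
import itertools


def flatten(entities_per_sentence):
    """Flatten a list of per-sentence entity lists into one list."""
    return list(itertools.chain.from_iterable(entities_per_sentence))


def compare_entities(article_entities, summary_entities, similarity=None, threshold=0.9):
    """Split summary entities into those matched in the article and those not."""
    matched, unmatched = [], []
    for entity in summary_entities:
        # 1) Cheap check: case-insensitive substring match.
        if any(entity.lower() in art.lower() for art in article_entities):
            matched.append(entity)
        # 2) Optional fallback: similarity score above the (assumed) threshold.
        elif similarity is not None and any(
            similarity(entity, art) > threshold for art in article_entities
        ):
            matched.append(entity)
        else:
            unmatched.append(entity)
    return matched, unmatched


# Hand-written entity lists for illustration (not actual NER output).
article_entities = flatten([["Lionel Andrés Messi", "Argentine"], ["Barcelona", "Spanish"]])
summary_entities = ["Lionel Andrés Messi", "Spanish", "Barcelona", "1997"]

demo_matched, demo_unmatched = compare_entities(article_entities, summary_entities)
print(demo_matched, demo_unmatched)
```

With the models loaded earlier, the inputs could come from `flatten(get_all_entities_per_sentence(text))`, and `similarity` could plausibly be a cosine similarity between `sentence_embedding_model.encode(...)` vectors, but that wiring is a guess.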

In [None]:
matched

['Spanish', 'Barcelona', 'Lionel Andrés Messi']

In [None]:
unmatched

['24', '1997']

In [None]:
def check_dependency(text):
    """Collect 'amod' and 'pobj' (headed by "in") dependencies that involve an entity."""
    all_entities = get_all_entities_per_sentence(text)
    doc = nlp(text)
    tok_l = doc.to_json()['tokens']
    test_list_dict_output = []

    sentences = list(doc.sents)
    for i, sentence in enumerate(sentences):
        start_id = sentence.start
        end_id = sentence.end  # exclusive: index of the first token after the sentence
        for t in tok_l:
            # Skip tokens that lie outside the current sentence
            if t["id"] < start_id or t["id"] >= end_id:
                continue
            if t['dep'] not in ('amod', 'pobj'):
                continue
            head = tok_l[t['head']]
            object_here = text[t['start']:t['end']]
            object_target = text[head['start']:head['end']]
            # Only keep prepositional objects attached to "in"
            if t['dep'] == "pobj" and object_target.lower() != "in":
                continue
            # At least one side of the dependency needs to be an entity
            if object_here in all_entities[i] or object_target in all_entities[i]:
                identifier = object_here + t['dep'] + object_target
                test_list_dict_output.append({"dep": t['dep'], "cur_word_index": (t['id'] - sentence.start),
                                              "target_word_index": (t['head'] - sentence.start),
                                              "identifier": identifier, "sentence": str(sentence)})
    return test_list_dict_output


In [None]:
source_deps = check_dependency(article)
summary_deps = check_dependency(summary)
total_unmatched_deps = []
for summ_dep in summary_deps:
    if not any(summ_dep['identifier'] in art_dep['identifier'] for art_dep in source_deps):
        total_unmatched_deps.append(summ_dep)

print("Found in summary but not in source:\n")
for unmatched_dep in total_unmatched_deps:
    print(unmatched_dep["identifier"].split(unmatched_dep["dep"]))

Found in summary but not in source:

['Spanish', 'footballer']
['Spanish', 'team']


The ML6 authors found that **specific dependencies are often an indication of a wrongly constructed sentence** when they have no match in the article. The code above (currently) uses 2 common dependencies, `amod` and `pobj` (the latter only when attached to "in"), which, when present in the summary but not in the article, are highly indicative of factualness errors. Furthermore, only dependencies between an existing **entity** and its direct connections are checked. The output above lists all unmatched dependencies that satisfy these constraints for the Messi example.

Taken from https://huggingface.co/spaces/ml6team/post-processing-summarization

### Question-Answer Question-Generation Score

Inspired by other question answering (QA) based automatic metrics in text summarization, Wang et al. [2020] and Durmus et al. [2020] propose the QA-based factual consistency metrics QAGS and FEQA, respectively. Both metrics rest on the intuition that if we ask the same questions of a summary and of its source document, we will receive similar answers when the summary is factually consistent with the source document. As illustrated in Figure 4 of the survey, both consist of three steps: (1) Given a generated summary, a question generation (QG) model generates a set of questions about the summary, whose reference answers are named entities and key phrases in the summary. (2) A question answering (QA) model then answers these questions given the source document. (3) A factual consistency score is computed from the similarity of the corresponding answers. Because they evaluate factual consistency at the entity level, these methods are more interpretable than textual-entailment-based methods. The reading comprehension ability of the QG and QA models gives these methods promising performance on this task. However, these approaches are computationally expensive.

From https://arxiv.org/abs/2104.14839
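
Step (3) can be illustrated without any models. The sketch below assumes the answers have already been produced by a QA model and compares them with a token-level F1, which is one common choice of answer similarity. The names `token_f1` and `qa_consistency` and the example answers are hypothetical; this is not FactSumm's `extract_qas` implementation.

```python
from collections import Counter


def token_f1(answer_a, answer_b):
    """Token-level F1 between two answer strings (whitespace tokens, lowercased)."""
    tokens_a = answer_a.lower().split()
    tokens_b = answer_b.lower().split()
    if not tokens_a or not tokens_b:
        # Two empty answers agree; one empty answer does not.
        return float(tokens_a == tokens_b)
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def qa_consistency(answer_pairs):
    """Average answer similarity over (summary_answer, source_answer) pairs."""
    if not answer_pairs:
        return 0.0
    return sum(token_f1(s, d) for s, d in answer_pairs) / len(answer_pairs)


# Hand-written answers to the same questions, asked of the summary and of the source.
answer_pairs = [
    ("24 Aug 1997", "24 June 1987"),  # e.g. "When was Messi born?"
    ("Barcelona", "Barcelona"),       # e.g. "Which club does he captain?"
]

qa_score = qa_consistency(answer_pairs)
print(round(qa_score, 3))
```

The first pair shares only one of three tokens, so it pulls the average down even though the second pair agrees exactly.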

In [1]:
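# QA-based Module. QAGS Score
# Requires the `factsumm = FactSumm()` cell from the "Fixing FactSumm" section
# to have been run in the current kernel; otherwise a NameError is raised.
factsumm.extract_qas(article, summary, verbose=True)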

NameError: name 'factsumm' is not defined

### ROUGE score

*Relevant for ROUGE*

Besides the specially designed methods above, there are also several simple but effective methods for evaluating factual consistency, which are usually used as baselines. Durmus et al. [2020] propose that a straightforward metric for factual consistency is the word overlap or semantic similarity between the summary sentence and the source document. The word overlap-based metrics compute ROUGE [Lin, 2004] or BLEU [Papineni et al., 2002] between the output summary sentence and each source sentence, and then take the average or maximum score across all source sentences.

*Relevant for BERT score / BLEURT*

The semantic similarity-based metric is similar to the word overlap-based methods. Instead of using ROUGE or BLEU, this method uses BERTScore [Zhang et al., 2020a]. Both types of methods show a baseline level of effectiveness. Experiments in Durmus et al. [2020] show that word overlap-based methods work better on less abstractive summarization datasets like CNN/DM [Hermann et al., 2015], while semantic similarity-based methods work better on highly abstractive summarization datasets like XSum [Narayan et al., 2018]. The abstractiveness of a summarization dataset is the extent to which the reference summaries abstract away from the source documents. At the extreme, a dataset is least abstractive if all of its reference summaries are extracted directly from the source documents.

From https://arxiv.org/abs/2104.14839
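
The word-overlap baseline can be sketched as follows: score the summary sentence against each source sentence, then report the maximum and the average. The sketch uses a plain unigram F1 as a stand-in for ROUGE-1 (no stemming or other preprocessing), and the names `unigram_f1` and `overlap_consistency` as well as the example sentences are hypothetical. It is not how `factsumm.calculate_rouge` is implemented.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two sentences (a simplified ROUGE-1 stand-in)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def overlap_consistency(summary_sentence, source_sentences):
    """Score a summary sentence against every source sentence, then aggregate."""
    if not source_sentences:
        raise ValueError("need at least one source sentence")
    scores = [unigram_f1(summary_sentence, s) for s in source_sentences]
    return {"max": max(scores), "avg": sum(scores) / len(scores)}


# Hand-written toy sentences (illustrative only).
source_sentences = ["Messi plays for Barcelona", "He has won six awards"]
result = overlap_consistency("Messi plays for Madrid", source_sentences)
print(result)
```

Note that the toy summary sentence still gets a high maximum score despite the wrong club, which shows why overlap-based scores are only a baseline for factual consistency.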

In [None]:
# ROUGE-based Module. Avg. ROUGE-1, Avg. ROUGE-2, Avg. ROUGE-L
factsumm.calculate_rouge(article, summary)

### BERT score

Through meta-evaluation, Koto et al. [2020] find that the semantic similarity-based method can reach state-of-the-art performance for factual consistency evaluation on the highly abstractive summarization dataset XSum [Narayan et al., 2018] by searching for optimal model parameters (i.e. which layer of the pre-trained language model is used in BERTScore). Even so, the correlation with human evaluation is not more than 0.5. Therefore, factual consistency evaluation is still an open problem.

From https://arxiv.org/abs/2104.14839

In [None]:
# BERTScore Module. BERTScore Score
factsumm.calculate_bert_score(article, summary)

### BLEURT

BLEURT is an evaluation metric for Natural Language Generation. It takes a pair of sentences as input, a reference and a candidate, and it returns a score that indicates to what extent the candidate is fluent and conveys the meaning of the reference. It is comparable to sentence-BLEU, BERTScore, and COMET.

BLEURT is a trained metric, that is, it is a regression model trained on ratings data. The model is based on BERT and RemBERT.

An overview of BLEURT can be found in the authors' [blog post](https://ai.googleblog.com/2020/05/evaluating-natural-language-generation.html). Further details are provided in the ACL paper [BLEURT: Learning Robust Metrics for Text Generation](https://arxiv.org/abs/2004.04696) and the authors' EMNLP paper.

#### Setup

I followed the installation instructions from [here](https://github.com/google-research/bleurt/tree/master). 

```bash
git clone https://github.com/google-research/bleurt.git
cd bleurt
pip install .
```

and then, in the `bleurt` folder:

```bash
# Smaller version of BLEURT
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20-D3.zip
unzip BLEURT-20-D3.zip

# If your PC can handle it, use the full-size checkpoint
wget https://storage.googleapis.com/bleurt-oss-21/BLEURT-20.zip
unzip BLEURT-20.zip
```

To read more about the available checkpoints, see https://github.com/google-research/bleurt/blob/master/checkpoints.md

from bleurt import score

checkpoint = "../bleurt/BLEURT-20-D3"
references = ["Bud Powell was a legendary pianist.", "Bud Powell was a legendary pianist.", "Bud Powell was a legendary pianist."]
candidates = ["Bud Powell was a legendary pianist.", "Bud Powell was a good keys player", "Bud Powell was a New Yorker"]

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=references, candidates=candidates)


In [4]:
print(scores)

[0.9509367346763611, 0.4507386088371277, 0.3503122925758362]


### SUMMAC score

SUMMAC (Summary Consistency; Laban et al., 2021) focuses on evaluating factual consistency in summarization. It uses NLI to detect inconsistencies by splitting the document and summary into sentences and computing the entailment probabilities for all document/summary sentence pairs, where the premise is a document sentence and the hypothesis is a summary sentence. The NLI scores for all pairs are aggregated either by taking the maximum score per summary sentence and averaging (SC_ZS) or by training a convolutional neural network to aggregate the scores (SC_Conv). We use the publicly available implementation.
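
The zero-shot aggregation described above (maximum per summary sentence, then average) can be sketched on a small, made-up matrix of entailment probabilities. The function name `summac_zs_aggregate` and the numbers are hypothetical; the actual `summac` package computes the probabilities with an NLI model and may combine NLI labels differently.

```python
def summac_zs_aggregate(entailment):
    """Aggregate an NLI matrix into one score.

    entailment[i][j] is the probability that document sentence i entails
    summary sentence j. For each summary sentence take the best-supporting
    document sentence (max over i), then average over summary sentences.
    """
    if not entailment or not entailment[0]:
        raise ValueError("need at least one document and one summary sentence")
    n_summary = len(entailment[0])
    per_summary_max = [max(row[j] for row in entailment) for j in range(n_summary)]
    return sum(per_summary_max) / n_summary


# Made-up probabilities: 2 document sentences (rows) x 2 summary sentences (columns).
entailment = [
    [0.90, 0.10],
    [0.20, 0.30],
]

zs_score = summac_zs_aggregate(entailment)
print(zs_score)
```

Here the first summary sentence is well supported by one document sentence, while the second has no strong support, so it lowers the overall score.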

In [5]:
from summac.model_summac import SummaCZS, SummaCConv
# import nltk
# nltk.download('punkt')

article = "Lionel Andrés Messi (born 24 June 1987) is an Argentine professional footballer who plays as a forward and captains both Spanish club Barcelona and the Argentina national team. Often considered as the best player in the world and widely regarded as one of the greatest players of all time, Messi has won a record six Ballon d'Or awards, a record six European Golden Shoes, and in 2020 was named to the Ballon d'Or Dream Team."
summary = "Lionel Andrés Messi (born 24 Aug 1997) is an Spanish professional footballer who plays as a forward and captains both Spanish club Barcelona and the Spanish national team."

model_zs = SummaCZS(granularity="sentence", model_name="vitc", device="cpu") # If you have a GPU: switch to: device="cuda"
model_conv = SummaCConv(models=["vitc"], bins='percentile', granularity="sentence", nli_labels="e", device="cpu", start_file="default", agg="mean")

score_zs = model_zs.score([article], [summary])
score_conv = model_conv.score([article], [summary])

print(score_conv)

<All keys matched successfully>
{'scores': [0.20296768844127655]}


SUMMAC produces a score that reflects how consistent the summary is with the source document. You can set a threshold below which a summary is treated as likely to contain hallucinated information.

Note that the usefulness of SUMMAC, or of any other metric, for hallucination detection depends on the quality of the ground-truth labels and on the design of your evaluation dataset. SUMMAC also does not pinpoint individual hallucinations; it scores how well the summary sentences are entailed by the source. Hallucinated information tends to lower the SUMMAC score, but a comprehensive assessment may require additional metrics or human evaluation.
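
A minimal sketch of the thresholding idea, assuming one consistency score per summary. The function name `flag_low_consistency`, the cut-off of 0.5 and the second score in the example are hypothetical; a usable threshold would have to be tuned on labeled validation data.

```python
def flag_low_consistency(scores, threshold=0.5):
    """Return the indices of summaries whose score falls below the threshold."""
    return [index for index, score in enumerate(scores) if score < threshold]


# First score: the SummaCConv output for the Messi example above.
# Second score: a made-up value for a hypothetical, more faithful summary.
scores = [0.20296768844127655, 0.81]

flagged = flag_low_consistency(scores)
print(flagged)  # indices of summaries to inspect for hallucinations
```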





