# Hallucination Metrics

As always, there are several metrics out there, each with its own pros and cons. A few desirable criteria for a good metric:
- Bounded
- Reproducible
- Interpretable
- Correlated with human judgement

> Note: In the report we should maybe briefly discuss whether the metrics we pick meet these criteria

## Fixing FactSumm

One of FactSumm's dependencies has changed its interface over the years, and FactSumm is not really a maintained project. It does, however, ship a few ready-to-go metric implementations that make our lives easier, and there is an easy fix.

If you simply install FactSumm with
```bash
pip install factsumm
```
and try to run the Lionel Messi example, you will likely hit `KeyError: 'entities'`. To fix this (as per [this issue](https://github.com/Huffon/factsumm/issues/36)):
1. Navigate to your virtual/conda environment and find the installed `factsumm` package folder.
2. Open `factsumm/utils/level_entity.py`.
3. Change `line_result = sentence.to_dict(tag_type="ner")` to `line_result = sentence.get_spans('ner')`.
4. Replace

```python
for entity in line_result["entities"]:
    if entity["text"] not in cache:
        dedup.append({
            "word": entity["text"],
            "entity": entity["labels"][0].value,
            "start": entity["start_pos"],
            "end": entity["end_pos"],
        })
        cache[entity["text"]] = None
result.append(dedup)
```

with 
```python
for entity in line_result: 
    if entity.text not in cache:
        dedup.append({
            "word": entity.text, 
            "entity": entity.tag, 
            "start": entity.start_position, 
            "end": entity.end_position, 
        }) 
        cache[entity.text] = None
result.append(dedup)
```

5. Done. Re-activate your virtual/conda environment so the changes take effect, then re-run your code.
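If hunting for the package folder by hand is tedious, the file from step 2 can be located from Python. This is only a convenience sketch: `locate_package_file` is a hypothetical helper written for these notes, not part of FactSumm, and it assumes the file still lives at `utils/level_entity.py` inside the installed package.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Optional


def locate_package_file(package: str, relative_path: str) -> Optional[Path]:
    """Return the path of a file inside an installed package, or None if absent.

    find_spec on a top-level name does not import the package, so this works
    even while the package itself is still broken.
    """
    try:
        spec = find_spec(package)
    except (ImportError, ValueError):
        spec = None
    if spec is None or not spec.submodule_search_locations:
        return None
    for location in spec.submodule_search_locations:
        candidate = Path(location) / relative_path
        if candidate.is_file():
            return candidate
    return None


# Assumption: the file to patch is utils/level_entity.py, as in step 2 above.
path = locate_package_file("factsumm", "utils/level_entity.py")
print(path if path else "level_entity.py not found (is factsumm installed here?)")
```

Run it with the same interpreter (virtual/conda environment) that you use for the notebook, otherwise it may point at a different installation or find nothing.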

### Not-working metrics

The Open IE method (the part that needs Java) still didn't work for me out of the box, and I don't plan to fix it, since I think we have good alternatives.

```python
from factsumm import FactSumm

factsumm = FactSumm()
article = "Lionel Andrés Messi (born 24 June 1987) is an Argentine professional footballer who plays as a forward and captains both Spanish club Barcelona and the Argentina national team. Often considered as the best player in the world and widely regarded as one of the greatest players of all time, Messi has won a record six Ballon d'Or awards, a record six European Golden Shoes, and in 2020 was named to the Ballon d'Or Dream Team."
summary = "Lionel Andrés Messi (born 24 Aug 1997) is an Spanish professional footballer who plays as a forward and captains both Spanish club Barcelona and the Spanish national team."
```

### Triple-based fact score

The most intuitive way to evaluate factual consistency is to count the fact overlap between the generated summary and the source document, as shown in Figure 3. Facts are usually represented by relation triples (subject, relation, object), where the subject has a relation to the object. Triples extracted with an open schema can phrase the same fact in different ways, which makes them hard to compare. To resolve this problem, Goodrich et al. [2019] use relation extraction tools with a fixed schema. Considering the two sentences in Example 1, fixed-schema extraction yields the triple (Hawaii, is the birthplace of, Obama) whether it is extracted from the source document or from the summary. This makes the extracted triples easier to compare.

From https://arxiv.org/abs/2104.14839

```python
# Triple-based Module (closed-scheme). Fact Score
factsumm.extract_facts(article, summary, verbose=True)
```
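To make the overlap idea concrete, here is a minimal sketch that scores a summary by the fraction of its triples that also appear in the source. It is not FactSumm's implementation: the precision-style formula, the `fact_score` name, the relation labels, and the hand-written triples are all assumptions for illustration, and a real pipeline would get the triples from a relation extraction model.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def normalize(triple: Triple) -> Triple:
    """Lowercase and strip each part so trivially different spellings match."""
    return tuple(part.strip().lower() for part in triple)


def fact_score(source_triples: Iterable[Triple], summary_triples: Iterable[Triple]) -> float:
    """Fraction of summary triples that are also found in the source."""
    source = {normalize(t) for t in source_triples}
    summary = {normalize(t) for t in summary_triples}
    if not summary:
        # Assumed convention: a summary that states no facts contradicts nothing.
        return 1.0
    return len(summary & source) / len(summary)


# Hand-written, hypothetical triples for the Messi example (not extractor output).
source_facts = [
    ("Lionel Messi", "date of birth", "24 June 1987"),
    ("Lionel Messi", "country of citizenship", "Argentina"),
    ("Lionel Messi", "member of sports team", "Barcelona"),
]
summary_facts = [
    ("Lionel Messi", "date of birth", "24 Aug 1997"),
    ("Lionel Messi", "country of citizenship", "Spain"),
    ("Lionel Messi", "member of sports team", "Barcelona"),
]

score = fact_score(source_facts, summary_facts)
print(f"fact score: {score:.3f}")
```

Only the Barcelona triple survives the comparison, so one of three summary facts is supported. Exact matching is strict on purpose here; it is the fixed schema that makes such a comparison meaningful at all.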

### Question-Answer Question-Generation Score

Inspired by other question answering (QA) based automatic metrics in text summarization, Wang et al. [2020] and Durmus et al. [2020] propose the QA-based factual consistency metrics QAGS and FEQA, respectively. Both metrics are based on the intuition that if we ask questions about a summary and its source document, we will receive similar answers if the summary is factually consistent with the source document. As illustrated in Figure 4, both consist of three steps: (1) Given a generated summary, a question generation (QG) model generates a set of questions about the summary, whose reference answers are named entities and key phrases in the summary. (2) A question answering (QA) model then answers these questions given the source document. (3) A factual consistency score is computed based on the similarity of corresponding answers. Because they evaluate factual consistency at the entity level, these methods are more interpretable than textual-entailment-based methods. The reading comprehension ability of the QG and QA models gives these methods promising performance on this task. However, these approaches are computationally expensive.

From https://arxiv.org/abs/2104.14839

```python
# QA-based Module. QAGS Score
factsumm.extract_qas(article, summary, verbose=True)
```
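Step (3) is the only part that needs no model, so it is easy to sketch on its own. The snippet below compares answer pairs with a SQuAD-style token-level F1 and averages the result. It is an illustration under assumptions, not FactSumm's code: the function names are made up for these notes, and the answer pairs are hand-written stand-ins for what the QG and QA models in steps (1) and (2) might return.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters."""
    return re.findall(r"\w+", text.lower())


def token_f1(answer_a: str, answer_b: str) -> float:
    """Token-level F1 between two answer strings."""
    tokens_a, tokens_b = tokenize(answer_a), tokenize(answer_b)
    if not tokens_a or not tokens_b:
        # Two empty answers agree; one empty answer does not.
        return float(tokens_a == tokens_b)
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def qa_consistency(answer_pairs: Sequence[Tuple[str, str]]) -> float:
    """Average answer similarity over all generated questions."""
    if not answer_pairs:
        return 0.0
    return sum(token_f1(a, b) for a, b in answer_pairs) / len(answer_pairs)


# Hypothetical (answer from summary, answer from source) pairs.
answer_pairs = [
    ("24 Aug 1997", "24 June 1987"),  # e.g. "When was Messi born?"
    ("Spanish", "Argentine"),         # e.g. "What nationality is Messi?"
    ("Barcelona", "Barcelona"),       # e.g. "Which club does he captain?"
]

score = qa_consistency(answer_pairs)
print(f"QA-based consistency: {score:.3f}")
```

The expensive parts are the two model calls that produce the pairs; the scoring itself is cheap, which is why the per-question similarities are also handy for inspecting where a summary goes wrong.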

### ROUGE score

Besides the specially designed methods above, there are also several simple but effective methods for evaluating factual consistency, which are usually used as baselines. Durmus et al. [2020] propose that a straightforward metric for factual consistency is the word overlap or semantic similarity between the summary sentence and the source document. The word-overlap-based metrics compute ROUGE [Lin, 2004] or BLEU [Papineni et al., 2002] between the output summary sentence and each source sentence, and then take the average or maximum score across all source sentences.

From https://arxiv.org/abs/2104.14839

```python
# ROUGE-based Module. Avg. ROUGE-1, Avg. ROUGE-2, Avg. ROUGE-L
factsumm.calculate_rouge(article, summary)
```
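The "score against each source sentence, then average or take the maximum" procedure can be sketched with a simplified ROUGE-1. This is an assumption-laden illustration rather than what `calculate_rouge` does internally: it uses plain unigram F1 without stemming, and `rouge1_f1` and `overlap_consistency` are names invented for these notes.

```python
import re
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on word characters (no stemming)."""
    return re.findall(r"\w+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram counts."""
    cand, ref = Counter(tokenize(candidate)), Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def overlap_consistency(summary_sentence: str, source_sentences: Sequence[str]) -> Tuple[float, float]:
    """Return (average, maximum) ROUGE-1 F1 over all source sentences."""
    scores = [rouge1_f1(summary_sentence, s) for s in source_sentences]
    if not scores:
        return 0.0, 0.0
    return sum(scores) / len(scores), max(scores)


# Toy sentences, chosen only to keep the arithmetic easy to follow.
avg_score, max_score = overlap_consistency(
    "the cat sat",
    ["the cat sat on the mat", "dogs bark loudly"],
)
print(f"avg ROUGE-1: {avg_score:.3f}, max ROUGE-1: {max_score:.3f}")
```

The maximum rewards a summary sentence that closely matches at least one source sentence, while the average penalizes it for every unrelated sentence in the source, so the two can rank summaries quite differently.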

```python
# BERTScore Module. BERTScore Score
factsumm.calculate_bert_score(article, summary)
```