# Factual hallucination mitigation

## Experiment 1: Increasing context specificity

Evaluate the fine-tuned model(s) while increasing the specificity of the prompt. Does this make a difference for LMs fine-tuned on XSum? Does it make a difference for LMs trained to follow instructions?

In [1]:
from metric.metrics import compute_metrics_pipeline

from transformers import pipeline
from datasets import load_dataset

article = "Lionel Andrés Messi (born 24 June 1987) is an Argentine professional footballer who plays as a forward and captains both Spanish club Barcelona and the Argentina national team. Often considered as the best player in the world and widely regarded as one of the greatest players of all time, Messi has won a record six Ballon d'Or awards, a record six European Golden Shoes, and in 2020 was named to the Ballon d'Or Dream Team."

# Important! For logging results
MODEL_NAME = "t5-small"
# TODO: Change the summarizer model to our own!
summarizer = pipeline("summarization", model=MODEL_NAME)
summarizer(article)



[{'summary_text': "lionel Andrés Messi is an argentine professional footballer . he has won a record six ballon d'Or awards, six european golden shoes . in 2020 he was named to the ballon dream team."}]

In [2]:
prompt1 = "Summarize the following text: "
prompt2 = "Using the exact wording of the text, summarize the following text: "
prompt3 = "Summarize the following text by including direct quotes where they are essential to convey the author's message accurately: "
prompts = [prompt1, prompt2, prompt3]
dataset = load_dataset("xsum", split="test")

with open("exp1-results.csv", "w") as f:
    f.write("xsum_id,prompt_id,qags,rouge,triples,bleurt,summac,ensemble\n")
    for idx, d in enumerate(dataset):
        if idx == 10:
            break
        article = d["document"]
        for p_idx, prompt in enumerate(prompts):
            # Prepend the instruction to the article
            full_prompt = prompt + article
            # Summarize the article
            pred_summary = summarizer(full_prompt)[0]['summary_text']
            print(f"Prompt:\n{full_prompt}")
            print(f"Summary:\n{pred_summary}")
            metric_res = compute_metrics_pipeline([article], [pred_summary])
            # Build the CSV row once, then log it and write it to the file
            row = f"{d['id']},{p_idx},{metric_res['qags'][0]},{metric_res['rouge'][0]},{metric_res['triples'][0]},{metric_res['bleurt'][0]},{metric_res['summac'][0]},{metric_res['ensemble'][0]}"
            print(row)
            f.write(f"{row}\n")
            print()

KeyboardInterrupt: 

In [None]:
import pandas as pd
exp1_df = pd.read_csv("exp1-results-test.csv")
# Get all the results for separate prompts
p1 = exp1_df[exp1_df["prompt_id"] == 0]
p2 = exp1_df[exp1_df["prompt_id"] == 1]
p3 = exp1_df[exp1_df["prompt_id"] == 2]
p1_mean = p1.mean()
p2_mean = p2.mean()
p3_mean = p3.mean()
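
The per-prompt comparison above can also be computed in one step with `groupby`. The following is a minimal sketch, assuming the results CSV has the columns written in the header earlier (`prompt_id` plus the six metric columns). The helper name `mean_scores_per_prompt` is hypothetical, and the rows in `toy_df` are made-up placeholder values used only to make the example runnable; they are not experimental results.

```python
import pandas as pd

METRIC_COLUMNS = ["qags", "rouge", "triples", "bleurt", "summac", "ensemble"]


def mean_scores_per_prompt(results_df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per prompt_id holding the mean of every metric column."""
    return results_df.groupby("prompt_id")[METRIC_COLUMNS].mean()


# Placeholder rows (NOT real results): two articles, each scored for prompt ids 0 and 1.
toy_data = {"xsum_id": [1, 1, 2, 2], "prompt_id": [0, 1, 0, 1]}
for column in METRIC_COLUMNS:
    toy_data[column] = [0.25, 0.5, 0.75, 1.0]
toy_df = pd.DataFrame(toy_data)

toy_means = mean_scores_per_prompt(toy_df)
print(toy_means)
```

Restricting the aggregation to the metric columns keeps identifier columns such as `xsum_id` out of the averages.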

## Experiment 2: Chain-of-verification

1. Ask LLM $M_S$ to summarize a text, receive response $resp$
2. Generate questions from $resp$ using a question-generating model $M_{QG}$
3. Ask LLM $M_S$ to answer the generated questions
4. Create a new prompt composed of the questions generated by $M_{QG}$, the answers given by $M_S$, and the original summarization prompt
5. Send the new prompt to $M_S$ and receive a verified response $resp_v$
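
The steps above can be sketched independently of any particular model. This is a minimal sketch, assuming the summarizer and the question generator are plain callables that map strings to strings (or to a list of strings). The names `chain_of_verification_sketch`, `summarize`, `generate_questions`, `fake_summarizer`, and `fake_question_generator` are hypothetical and only serve the illustration.

```python
from typing import Callable, List, Tuple


def chain_of_verification_sketch(
    prompt: str,
    summarize: Callable[[str], str],
    generate_questions: Callable[[str], List[str]],
) -> Tuple[str, str]:
    """Run the five steps with injected (hypothetical) model callables."""
    # Step 1: baseline response from M_S
    resp = summarize(prompt)
    # Step 2: questions generated from the baseline response by M_QG
    questions = generate_questions(resp)
    # Step 3: M_S answers each generated question
    answers = [summarize(question) for question in questions]
    # Step 4: new prompt = question/answer pairs followed by the original prompt
    qa_block = "".join(f"{q}\n{a}\n" for q, a in zip(questions, answers))
    verified_prompt = f"{qa_block}\n{prompt}"
    # Step 5: verified response from M_S
    resp_v = summarize(verified_prompt)
    return verified_prompt, resp_v


# Stand-in "models" so the sketch runs without loading any real model.
def fake_summarizer(text: str) -> str:
    return f"S({text})"


def fake_question_generator(summary: str) -> List[str]:
    return ["Q1?", "Q2?"]


verified_prompt, resp_v = chain_of_verification_sketch(
    "P", fake_summarizer, fake_question_generator
)
print(verified_prompt)
print(resp_v)
```

Passing the models in as callables makes it possible to swap the summarizer or the question generator without changing the control flow.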

> Do we need to fine-tune the model on question answering as well?

### Before running - modify factsumm!

For this experiment to work, we need to generate questions. FactSumm already has a way to do this, so it is easiest to reuse its implementation. However, you first need to add the following method to the `FactSumm` class in the copy of the package installed in your virtual environment:

```
def extract_questions(
        self,
        summary: str,
        summary_ents: List = None,
        verbose: bool = False,
        device: str = "cpu",
    ) -> List[str]:
        """
        Extract Questions from Question Generation module

            See also https://arxiv.org/abs/2004.04228

        Args:
            summary (str): generated summary
            summary_ents (List, optional): named entities extracted from the summary. Defaults to None.
            verbose (bool, optional): print verbose option. Defaults to False.
            device (str): device info

        Returns:
            List[str]: questions generated from the summary

        """
        if isinstance(self.qg, str):
            self.qg = load_qg(self.qg, device)

        if isinstance(self.ner, str):
            self.ner = load_ner(self.ner, device)

        summary_lines = self._segment(summary)
        if summary_ents is None:
            summary_ents = self.ner(summary_lines)

        # If no entities (answers) are found, no questions can be generated
        if len(summary_ents) == 0:
            return []

        summary_qas = self.qg(summary_lines, summary_ents)
        questions = [qa_pair["question"] for qa_pair in summary_qas]

        return questions
```
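
As an alternative to editing the installed package, a function can be attached to a class at runtime. The sketch below only illustrates that pattern on a stand-in class. `FactSummStandIn`, its naive `_segment`, and the placeholder body of `extract_questions` are hypothetical, and applying the same assignment to the real `FactSumm` class is an assumption that has not been verified here.

```python
from typing import List


class FactSummStandIn:
    """Hypothetical stand-in for the real FactSumm class."""

    def _segment(self, text: str) -> List[str]:
        # Naive sentence split, used only for this illustration
        return [part.strip() for part in text.split(".") if part.strip()]


def extract_questions(self, summary: str) -> List[str]:
    """Placeholder body: one dummy question per segmented sentence."""
    return [f"What does this sentence state: '{line}'?" for line in self._segment(summary)]


# Attach the function as a method without touching the class source
FactSummStandIn.extract_questions = extract_questions

questions = FactSummStandIn().extract_questions(
    "Messi is a footballer. He plays as a forward."
)
print(questions)
```

The advantage of this pattern is that the change lives in the notebook and survives reinstalling the package.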

In [1]:
import sys

import torch
from transformers import pipeline
from datasets import load_dataset

from metric.metrics import compute_metrics_pipeline
import metric.metrics

# Important! For logging results
MODEL_NAME = "t5-small"
# TODO: Change to our trained models!!
summarizer = pipeline("summarization", model=MODEL_NAME)

article = "Lionel Andrés Messi (born 24 June 1987) is an Argentine professional footballer who plays as a forward and captains both Spanish club Barcelona and the Argentina national team. Often considered as the best player in the world and widely regarded as one of the greatest players of all time, Messi has won a record six Ballon d'Or awards, a record six European Golden Shoes, and in 2020 was named to the Ballon d'Or Dream Team."

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"

summarizer(article)



[{'summary_text': "lionel Andrés Messi is an argentine professional footballer . he has won a record six ballon d'Or awards, six european golden shoes . in 2020 he was named to the ballon dream team."}]

In [1]:
dataset = load_dataset("xsum", split="test")
BASE_PROMPT = "Summarize the following text: "


def chain_of_verification(baseline_summary: str, prompt: str, summ_model):
    # Generate questions based on baseline response
    questions = metric.metrics.factsumm.extract_questions(baseline_summary, verbose=False, device=device)

    # Build a modified prompt from the generated questions
    # and the model's answers to them
    verified_prompt = ""
    for question in questions:
        # TODO: is this correct? Or should we have fine-tuned our summarization model on question-answering?
        resp_q = summ_model(question)[0]['summary_text']
        verified_prompt += f"{question}\n{resp_q}\n"
    verified_prompt += f"\n{prompt}"

    # Get verified response
    resp_v = summ_model(verified_prompt)[0]['summary_text']
    return verified_prompt, resp_v


with open(f"exp2-results-{MODEL_NAME}.csv", "w") as f:
    f.write("xsum_id,qags,rouge,triples,bleurt,summac,ensemble\n")
    for idx, d in enumerate(dataset):
        # Early stopping for testing
        # if idx == 3:
        #     break
        article = d["document"]

        # Add the actual article to the prompt
        prompt = f"{BASE_PROMPT}{article}"
        # Summarize the article
        pred_summary = summarizer(prompt)[0]["summary_text"]

        verified_prompt, verified_pred_summary = chain_of_verification(
            pred_summary, prompt, summarizer
        )

        print(f"Prompt:\n{verified_prompt}")
        print(f"Summary:\n{verified_pred_summary}")

        metric_res = compute_metrics_pipeline([article], [verified_pred_summary])
        print(
            f"{d['id']},"
            f"{metric_res['qags'][0]},"
            f"{metric_res['rouge'][0]},"
            f"{metric_res['triples'][0]},"
            f"{metric_res['bleurt'][0]},"
            f"{metric_res['summac'][0]},"
            f"{metric_res['ensemble'][0]}\n"
        )
        # Write metrics to csv file
        f.write(
            f"{d['id']},"
            f"{metric_res['qags'][0]},"
            f"{metric_res['rouge'][0]},"
            f"{metric_res['triples'][0]},"
            f"{metric_res['bleurt'][0]},"
            f"{metric_res['summac'][0]},"
            f"{metric_res['ensemble'][0]}\n"
        )
        print()


NameError: name 'load_dataset' is not defined