UniFact: A Comprehensive Evaluation Framework for Hallucination Detection and Fact Verification

Welcome to the official repository for the UniFact dataset. This project provides standardized benchmarks for evaluating and unifying two distinct paradigms: Hallucination Detection (HD) and Fact Verification (FV). The dataset includes diverse factual questions, ground-truth answers, and reference documents, specifically designed to support dynamic evaluation in which model responses are generated and verified on the fly. This setup enables a fair, head-to-head comparison of model-centric and text-centric factuality assessment methods. More details can be found in our paper:

Towards Unification of Hallucination Detection and Fact Verification for Large Language Models

🚀 Quick Start Tutorial

Installation

# Create and activate environment
conda create -n UniFact python=3.10
conda activate UniFact

# Install dependencies
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
python -m spacy download en_core_web_sm

Download the dataset

Follow the instructions in the Dataset Setup to download the dataset you want to evaluate on.

Step 1: Run a Simple Baseline (LNPP)

Let's start with the simplest method - LNPP:

# Generate detection scores using LNPP method
python -m scripts.hallucination_detection \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detect_methods lnpp

This will create detection results in: results/<model_name>/<dataset_name>/<dataset_type>/<method_name>.json

Step 2: Evaluate the Results

# Generate AUROC evaluation report
python -m scripts.evaluation \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --judge_model_name_or_path Qwen/Qwen2.5-32B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total"

You will get the results in results/<model_name>/<dataset_name>/<dataset_type>/evaluation_summary.json

{
  "lnpp": {
    "method": "lnpp",
    "total_items": 500,
    "valid_scores": 500,
    "auroc": 0.747575,
    "positive_cases": 200,
    "negative_cases": 300
  },
}

🔧 How to Evaluate Your Own Detection Method

For Hallucination Detection Methods

Step 1: Generate Your Detection Scores

Your method should produce a JSON file with the following format(similar to the example above):

[
  {
    "qid": "triviaqa_tc_2",
    "question": "Who was the man behind The Chipmunks?",
    "main_answer": "Ross Bagdasarian Sr., also known as David Seville, was the man behind The Chipmunks. He was an American singer, songwriter,",
    "detection_score": 0.09336186038951079
  },
  {
    "qid": "triviaqa_tc_13", 
    "question": "What star sign is Jamie Lee Curtis?",
    "main_answer": "Jamie Lee Curtis was born on November 22, 1958. Her star sign is Sagittarius.",
    "detection_score": 0.08400685215989749
  }
]

Important Notes:

detection_score: Higher values should indicate higher likelihood of hallucination
qid: Must match the question IDs from our datasets
main_answer: Should match the LLM's responses

Don't know how to generate the required data? Click to expand

Using Our Data Loader

from src.data_loader import load_structured_qa_dataset

# Load any supported dataset
data_map = load_structured_qa_dataset("triviaqa", "data")
questions = data_map["total"]

# Each question has the format:
# {
#   "qid": "triviaqa_tc_2", 
#   "question": "Who was the man behind The Chipmunks?",
#   "golden_answer": "David Seville",
#   "golden_passages": [],
#   "type": "total"
# }

Using Our Answer Generator

from src.answer_generator import AnswerGenerator

# Initialize generator for your target LLM
generator = AnswerGenerator("meta-llama/Llama-3.1-8B-Instruct")

# Generate answers for each question
for question_data in questions:
    result = generator.generate(
        question_data,
        max_new_tokens=30,
        main_temperature=0.8
    )
    
    # result contains:
    # {
    #   "main_answer": "Ross Bagdasarian Sr., also known as David Seville, ...",
    #   "sample_answers": [...],  # if needed
    #   "pp_pe_metrics": {...}    # includes lnpp, lnpe scores
    # }

Step 2: Place Your Results File

Save your results as: results/<model_name>/<dataset_name>/<dataset_type>/<your_method_name>.json

Step 3: Run Evaluation

python -m scripts.evaluation \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --judge_model_name_or_path Qwen/Qwen2.5-32B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detection_methods <your_method_name>

For Fact Verification Methods

Step 1: Get Basic Results for Input

First, generate the basic question-answer pairs:

python -m scripts.hallucination_detection \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detect_methods lnpp

This creates basic_results.json with the format:

[
  {
    "qid": "triviaqa_tc_2",
    "question": "Who was the man behind The Chipmunks?",
    "main_answer": "Ross Bagdasarian Sr., also known as David Seville, was the man behind The Chipmunks. He was an American singer, songwriter,"
  },
  {
    "qid": "triviaqa_tc_13",
    "question": "What star sign is Jamie Lee Curtis?", 
    "main_answer": "Jamie Lee Curtis was born on November 22, 1958. Her star sign is Sagittarius."
  }
]

Step 2: Generate Your Fact Verification Scores

Process the basic_results.json with your method and produce:

[
  {
    "qid": "triviaqa_tc_2",
    "question": "Who was the man behind The Chipmunks?",
    "main_answer": "Ross Bagdasarian Sr., also known as David Seville, was the man behind The Chipmunks. He was an American singer, songwriter,",
    "detection_score": 0.15
  },
  {
    "qid": "triviaqa_tc_13",
    "question": "What star sign is Jamie Lee Curtis?",
    "main_answer": "Jamie Lee Curtis was born on November 22, 1958. Her star sign is Sagittarius.",
    "detection_score": 0.23
  }
]

Step 3: Evaluate Your Method

Save as fv_<your_method_name>.json and run evaluation:

python -m scripts.evaluation \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --judge_model_name_or_path Qwen/Qwen2.5-32B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detection_methods your_method_name

🧱 Built-in Baseline Methods

The framework includes some baseline methods for comparison. Here's how to use them:

Training-Free Methods (Ready to Use)

# Run multiple training-free methods at once
python -m scripts.hallucination_detection \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detect_methods lnpp lnpe ptrue semantic_entropy seu sindex

Available Methods:

lnpp
lnpe
SelfCheckGPT
- mqag
- bertscore
- ngram
- nli
ptrue
semantic_entropy
seu
sindex

Training-Required Methods

Click to expand setup instructions

`EUBHD` Method

# 1. Generate token frequency statistics
python -m scripts.eubhd.count --tokenizer meta-llama/Llama-3.1-8B-Instruct

# 2. Run evaluation
python -m scripts.hallucination_detection \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detect_methods eubhd \
    --eubhd_idf_path eubhd_idf/token_idf_Llama-3.1-8B-Instruct.pkl

`SAPLMA` Method

# 1. Extract features
python -m scripts.saplma.extract_features \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --input_dir_path training_data/SAPLMA \
    --output_dir_path saplma/Llama-3.1-8B-Instruct_-1/data

# 2. Train probe
python -m scripts.saplma.train_probe \
    --embedding_dir_path saplma/Llama-3.1-8B-Instruct_-1/data \
    --output_probe_path saplma/Llama-3.1-8B-Instruct_-1/probe.pt

# 3. Run evaluation
python -m scripts.hallucination_detection \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detect_methods saplma \
    --saplma_probe_path saplma/Llama-3.1-8B-Instruct_-1/probe.pt

`MIND` Method

# 1. Generate training data
python -m scripts.mind.generate_data \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --wiki_data_dir training_data/MIND \
    --output_dir mind/Llama-3.1-8B-Instruct/text_data

# 2. Extract features  
python -m scripts.mind.extract_features \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --generated_data_dir mind/Llama-3.1-8B-Instruct/text_data \
    --output_feature_dir mind/Llama-3.1-8B-Instruct/feature_data

# 3. Train classifier
python -m scripts.mind.train_mind \
    --feature_dir mind/Llama-3.1-8B-Instruct/feature_data \
    --output_classifier_dir mind/Llama-3.1-8B-Instruct/classifier

# 4. Run evaluation
python -m scripts.hallucination_detection \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name "triviaqa" \
    --dataset_type "total" \
    --detect_methods mind \
    --mind_classifier_path mind/Llama-3.1-8B-Instruct/classifier/mind_classifier_best.pt

Built-in Fact Verification Methods

Click to expand setup instructions

Setup Requirements

# Download Wikipedia dump for retrieval
mkdir -p data/dpr
wget -O data/dpr/psgs_w100.tsv.gz https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
cd data/dpr && gzip -d psgs_w100.tsv.gz && cd ../..

# Setup Elasticsearch
cd data
wget -O elasticsearch-8.15.0.tar.gz https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.15.0-linux-x86_64.tar.gz
tar zxvf elasticsearch-8.15.0.tar.gz && rm elasticsearch-8.15.0.tar.gz
cd elasticsearch-8.15.0 && nohup bin/elasticsearch & && cd ../..
python prep_elastic.py --data_path data/dpr/psgs_w100.tsv --index_name wiki

Train BERT Classifier

Generate training data:

python -m scripts.bert.generate_data \
    --model_name_or_path meta-llama/Llama-3.1-8B-Instruct \
    --judge_model_name_or_path Qwen/Qwen2.5-32B-Instruct \
    --data_path bert/training_data.json

Train the classifier:

python -m scripts.bert.train_bert \
    --output_dir "bert_classifier" \
    --retrieval_type "question_only"

python -m scripts.bert.train_bert \
    --output_dir "bert_classifier" \
    --retrieval_type "question_answer"

Running Fact Verification

python -m scripts.fact_verification \
    --target_llm_name meta-llama/Llama-3.1-8B-Instruct \
    --dataset_name "popqa" \
    --dataset_type "total" \
    --fv_llm_model_name Qwen/Qwen2.5-32B-Instruct \
    --bert_fv_q_model_dir bert_classifier/fv_model_question_only \
    --bert_fv_qa_model_dir bert_classifier/fv_model_question_answer

📊 Understanding Evaluation Results

AUROC Score Interpretation

The framework uses AUROC (Area Under ROC Curve) as the primary metric:

Score Range: 0.0 to 1.0
Higher is Better: Higher AUROC indicates better hallucination detection
Benchmark: 0.5 = random performance, 1.0 = perfect detection

Sample Results Comparison

{
  "lnpp": {
    "method": "lnpp",
    "total_items": 100,
    "valid_scores": 100,
    "auroc": 0.7234,
    "positive_cases": 45,
    "negative_cases": 55
  },
  "my_new_method": {
    "method": "my_new_method", 
    "total_items": 100,
    "valid_scores": 100,
    "auroc": 0.8156,
    "positive_cases": 45,
    "negative_cases": 55
  }
}

In this example, my_new_method (AUROC: 0.8156) outperforms the baseline lnpp method (AUROC: 0.7234).

📋 Dataset Setup

UniFact supports five major QA datasets:

2WikiMultihopQA

Download the dataset from the official repository
Extract and move the folder to data/2wikimultihopqa

HotpotQA

mkdir -p data/hotpotqa
wget -P data/hotpotqa/ http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_distractor_v1.json

PopQA

mkdir -p data/popqa
wget -P data/popqa https://raw.githubusercontent.com/AlexTMallen/adaptive-retrieval/main/data/popQA.tsv

TriviaQA

mkdir -p data/triviaqa
wget -P data/triviaqa https://nlp.cs.washington.edu/triviaqa/data/triviaqa-unfiltered.tar.gz
tar -xvzf data/triviaqa/triviaqa-unfiltered.tar.gz -C data/triviaqa

Natural Questions (NQ)

Download the NQ-open.efficientqa.dev.1.1.jsonl file from the Google Research repository and place it in data/nq/.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
scripts		scripts
src		src
training_data		training_data
README.md		README.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

UniFact: A Comprehensive Evaluation Framework for Hallucination Detection and Fact Verification

🚀 Quick Start Tutorial

Installation

Download the dataset

Step 1: Run a Simple Baseline (LNPP)

Step 2: Evaluate the Results

🔧 How to Evaluate Your Own Detection Method

For Hallucination Detection Methods

Step 1: Generate Your Detection Scores

Using Our Data Loader

Using Our Answer Generator

Step 2: Place Your Results File

Step 3: Run Evaluation

For Fact Verification Methods

Step 1: Get Basic Results for Input

Step 2: Generate Your Fact Verification Scores

Step 3: Evaluate Your Method

🧱 Built-in Baseline Methods

Training-Free Methods (Ready to Use)

Training-Required Methods

EUBHD Method

SAPLMA Method

MIND Method

Built-in Fact Verification Methods

Setup Requirements

Train BERT Classifier

Running Fact Verification

📊 Understanding Evaluation Results

AUROC Score Interpretation

Sample Results Comparison

📋 Dataset Setup

2WikiMultihopQA

HotpotQA

PopQA

TriviaQA

Natural Questions (NQ)

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`EUBHD` Method

`SAPLMA` Method

`MIND` Method

Packages