Failing Forward: Understanding Query Failure in Retrieval, Judgment, and Generation

This repo contains the code for the paper "Failing Forward: Understanding Query Failure in Retrieval, Judgment, and Generation".

Main Results

[Figures: Hard to Judge results based on Qwen3:8b (left) and LLaMa3.2:3b (right)]

Repository Structure

paper_repo/
├── README.md
├── h2r_hard_queries.py         # Hard to Retrieve queries script
├── h2g_generate_passage.py     # Hard to Generate queries script
├── h2j_judgement_binary.py     # Hard to Judge (Binary) script
├── h2j_umbrela_like_llm_judge.py # Hard to Judge (Umbrela-like) script
├── pairwise_judge.py           # Pairwise judge evaluation
├── analyze_human_judgements.py # Analyze human judgements
├── bertscore.py                # BERT score evaluation
├── find_reasons.py             # Analysis of failure reasons
├── bert_score_outputs/         # Outputs from BERT score evaluations
├── binary_judge/              # Binary judge evaluation outputs
├── conf_matrix/               # Confusion matrix results
├── human_judgements/           # Human judgements
├── datasets/                  # Query datasets and evaluation data
├── generated_queries/         # Generated query outputs
├── hard_to_retrieve/          # Hard to retrieve query analysis
│   ├── results/               # Results of hard to retrieve analysis
│   └── runs/                  # Run files for retrieval evaluation
├── modified_qrels/            # Modified relevance judgments
├── pairwise_llm_as_judege/    # Pairwise LLM judge outputs
├── reasoning_by_llm/          # LLM reasoning analysis
└── umbrela/                   # Umbrela framework implementation
    ├── prompts/               # Prompts for LLM evaluation
    └── utils/                 # Utility functions for Umbrela

Quantile Threshold

A statistical measure used to identify hard queries by setting a performance threshold:

  • We use the 0.25 quantile (25th percentile) as our threshold
  • Queries with performance below this threshold are considered "hard"
  • This approach allows us to objectively identify challenging queries across different datasets and models
  • The lower band in our tables represents this 0.25 quantile threshold
  • For each task, the corresponding script computes the quantile threshold (a minimal sketch is shown below)
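
As a minimal sketch (not the exact scripts, and with hypothetical per-query scores), the threshold computation comes down to taking the 0.25 quantile of the per-query effectiveness scores and flagging everything below it:

```python
import numpy as np

# Hypothetical per-query effectiveness scores (e.g., NDCG@10 keyed by query ID).
per_query_scores = {"q1": 0.72, "q2": 0.10, "q3": 0.45, "q4": 0.05}

# 0.25 quantile (25th percentile) of the score distribution.
threshold = np.quantile(list(per_query_scores.values()), 0.25)

# Queries that fall below the threshold are labelled "hard".
hard_queries = [qid for qid, s in per_query_scores.items() if s < threshold]
print(f"threshold={threshold:.3f}, hard queries: {hard_queries}")
```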

The table below summarizes the key reasons for query failures across the three dimensions we studied:

| Category | Failure Reasons |
|----------|-----------------|
| Hard to Judge (H2J) | • Quantitative data needed<br>• Specific and niche topic<br>• Query unrelated to provided criteria<br>• No relevant passage context |
| Hard to Generate (H2G) | • Up-to-date information needed<br>• Technical accuracy required<br>• Risk of misinformation |
| Hard to Retrieve (H2R) | • Numerical data extraction<br>• Specific product term<br>• Ambiguous subject reference |
You can find the results here: reasoning_by_llm/

Example Queries and Their Hardness Aspects

The table below shows examples of queries and which aspects make them hard:

| Query ID | Query | Hard to Retrieve (H2R) | Hard to Generate (H2G) | Hard to Judge (H2J) |
|----------|-------|------------------------|------------------------|---------------------|
| 390360 | ia suffix meaning | DistilBERT | LLaMa + Qwen | Graded |
| 673670 | what is a aim | BM25 | LLaMa | Graded |
| 555530 | what are best foods to lower cholesterol | BM25 | Qwen | Graded |
| 443396 | lps laws definition | all (hard in binary) | LLaMa + Qwen | Binary |
| 121171 | define etruscans | DistilBERT | LLaMa + Qwen | Binary |
| 1108651 | what the best way to get clothes white | BM25 | LLaMa | Binary |
| 1129560 | accounting definition of building improvements | BM25 | Qwen | Binary |

The following sections describe how to reproduce the results.

Setup

pip install -r requirements.txt

Hard to Retrieve (H2R)

Queries where retrieval systems struggle to find relevant documents (NDCG@10 below the 0.25 quantile threshold).

This script finds the hard-to-retrieve queries for a given run file. It reports the quantile threshold for each dataset and saves the hard queries per dataset (results below).

python h2r_hard_queries.py \
--run_file_path hard_to_retrieve/runs/run.msmarco-v2-passage.bm25-default.dl21.txt \
--qrel_file_path datasets/qrels.dl21-passage.txt \
--quantile 0.25 \
--output hard_to_retrieve/results/h2r_hard_to_retrieve_dl21.csv
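
For reference, the core of this step is per-query scoring followed by the quantile threshold described above; the simplified sketch below uses pytrec_eval and is not the exact implementation of h2r_hard_queries.py.

```python
import pytrec_eval

# Simplified sketch of the per-query scoring behind h2r_hard_queries.py.
with open("datasets/qrels.dl21-passage.txt") as f:
    qrels = pytrec_eval.parse_qrel(f)
with open("hard_to_retrieve/runs/run.msmarco-v2-passage.bm25-default.dl21.txt") as f:
    run = pytrec_eval.parse_run(f)

# Evaluate NDCG at the standard cutoffs and keep the @10 value per query.
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query_ndcg10 = {qid: m["ndcg_cut_10"] for qid, m in evaluator.evaluate(run).items()}

# Hard-to-retrieve queries are those below the 0.25 quantile of these scores
# (see the quantile sketch above).
```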

The table below shows the retrieval performance across different datasets. The lower band represents the 0.25 quantile threshold, which we use to identify hard-to-retrieve queries.

| Dataset | Model | Mean NDCG | Lower Band (0.25 Quantile) |
|---------|-------|-----------|----------------------------|
| DL19 | BM25 | 0.217 | 0.381 |
| DL19 | DistilBERT-TAS-B | 0.457 | 0.631 |
| DL20 | BM25 | 0.196 | 0.298 |
| DL20 | DistilBERT-TAS-B | 0.498 | 0.591 |
| DL21 | BM25 | 0.180 | 0.317 |
| DL21 | DistilBERT-TAS-B | 0.104 | 0.208 |
| DL22 | BM25 | 0.093 | 0.161 |
| DL22 | DistilBERT-TAS-B | 0.127 | 0.202 |

Hard to Generate (H2G)

Queries where generative models struggle to produce accurate or relevant responses (BERTScore against the annotated passages in MS MARCO [v1, v2] below the 0.25 quantile threshold).

This script finds the hard-to-generate queries for a given query file.

python h2g_generate_passage.py \
--queries datasets/test2021-queries-filterd.tsv \
--output h2g_generate_passage_outputs/h2g_generate_passage_dl21.json \
--model qwen3:8b
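
Under the hood the script prompts a locally served model; a rough, hand-written equivalent for a single query is shown below, assuming an Ollama server exposing an OpenAI-compatible API at localhost:11434 (the actual prompt used by h2g_generate_passage.py may differ).

```python
from openai import OpenAI

# Sketch only: generate an answer passage for one query with a local model.
# The prompt here is illustrative, not the one used in h2g_generate_passage.py.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

query = "what are best foods to lower cholesterol"
response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{
        "role": "user",
        "content": f"Write a short passage that answers the following query.\n\nQuery: {query}",
    }],
)
print(response.choices[0].message.content)
```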

To compute BERTScore and produce the results below, run:

python bertscore.py \
    -q datasets/qrels.dl21-passage.txt \
    -g h2g_generate_passage_outputs/h2g_generate_passage_dl21.json \
    -o bert_score_outputs/bert_score_dl21.json \
    -p datasets/qrels_text.dl21
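
The underlying metric is the standard BERTScore F1 between each generated passage and the annotated relevant passage(s) for the same query; a minimal example with made-up texts follows (the model/configuration used by bertscore.py may differ).

```python
from bert_score import score

# Toy example: one generated passage vs. one annotated reference passage.
generated = ["Oats, nuts, and fatty fish can help lower cholesterol levels."]
reference = ["Foods rich in soluble fibre and omega-3, such as oats and fish, reduce LDL cholesterol."]

# Returns precision, recall, and F1 tensors; the 0.25 quantile threshold is applied to F1.
P, R, F1 = score(generated, reference, lang="en")
print(f"BERTScore F1: {F1.item():.3f}")
```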

The table below shows the 0.25 quantile threshold of BERTScore F1 for the generated content across datasets and models:

| Dataset | Model | BERTScore F1 (0.25 Quantile) |
|---------|-------|------------------------------|
| DL19 | Qwen3:8B | 0.823 |
| DL20 | Qwen3:8B | 0.824 |
| DL21 | Qwen3:8B | 0.824 |
| DL22 | Qwen3:8B | 0.816 |
| DL19 | LLaMa3.2:3b | 0.832 |
| DL20 | LLaMa3.2:3b | 0.827 |
| DL21 | LLaMa3.2:3b | 0.836 |
| DL22 | LLaMa3.2:3b | 0.826 |

You can find the results here: bert_score_outputs/

Hard to Judge (H2J)

Queries where determining graded or binary relevance between a query and a document is challenging for automated systems, evaluated against human annotations.

The following scripts find the hard-to-judge queries for a given qrels file.

Binary Hard to Judge

python h2j_judgement_binary.py \
--dataset 22 \
--model_name llama3.2:3b

Output directory: binary_judge/

Umbrela-Like Hard to Judge

python h2j_umbrela_like_llm_judge.py \
--qrel datasets/qrels.dl21-passage.txt \
--model_name qwen3:8b \
--prompt_type bing \
--base_url http://localhost:11434/v1

Output directory: modified_qrels/
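
A stripped-down version of a single UMBRELA-style judgement call looks roughly like the following; the real prompts live in umbrela/prompts/ and the grading instructions there may differ from this simplified one.

```python
from openai import OpenAI

# Sketch of one graded (0-3) relevance judgement via the local OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

query = "define etruscans"
passage = "The Etruscans were an ancient civilization of central Italy ..."
prompt = (
    "Given a query and a passage, reply with a single relevance grade from "
    "0 (not relevant) to 3 (perfectly relevant).\n\n"
    f"Query: {query}\nPassage: {passage}\nGrade:"
)
reply = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

grade = int(reply.strip()[0])  # naive parse; assumes the model leads with the digit
print(grade)
```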

To find the hard-to-judge queries, run:

python h2j_finder.py
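
h2j_finder.py is not reproduced here; as a purely hypothetical illustration, one way to surface hard-to-judge queries is to measure per-query agreement between the LLM labels and the human qrels and flag the bottom quartile.

```python
import numpy as np

# Hypothetical toy data: binary labels per (query, document) from humans and from the LLM.
human = {"q1": {"d1": 1, "d2": 0, "d3": 1}, "q2": {"d1": 0, "d2": 0, "d3": 1}}
llm   = {"q1": {"d1": 1, "d2": 1, "d3": 1}, "q2": {"d1": 1, "d2": 1, "d3": 0}}

# Fraction of documents per query on which the LLM agrees with the human label.
agreement = {
    qid: float(np.mean([llm[qid][doc] == rel for doc, rel in docs.items()]))
    for qid, docs in human.items()
}

# Re-using the 0.25 quantile idea: the lowest-agreement queries are "hard to judge".
threshold = np.quantile(list(agreement.values()), 0.25)
hard_to_judge = [qid for qid, a in agreement.items() if a <= threshold]
print(agreement, hard_to_judge)
```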

LLaMa3.2 confusion matrices are available in conf_matrix/.

To evaluate the reasons tagged by the LLMs, we use human annotations.

Human Judgement Analysis

python analyze_human_judgements.py 

Output directory: human_judgements/
