This repo contains the code for the paper "Failing Forward: Understanding Query Failure in Retrieval, Judgment, and Generation".
| Qwen3:8b | LLaMa3.2:3b |
|---|---|
| ![]() | ![]() |
```
paper_repo/
├── README.md
├── h2r_hard_queries.py              # Hard to Retrieve queries script
├── h2g_generate_passage.py          # Hard to Generate queries script
├── h2j_judgement_binary.py          # Hard to Judge (Binary) script
├── h2j_umbrela_like_llm_judge.py    # Hard to Judge (Umbrela-like) script
├── pairwise_judge.py                # Pairwise judge evaluation
├── analyze_human_judgements.py      # Analyze human judgements
├── bertscore.py                     # BERTScore evaluation
├── find_reasons.py                  # Analysis of failure reasons
├── bert_score_outputs/              # Outputs from BERTScore evaluations
├── binary_judge/                    # Binary judge evaluation outputs
├── conf_matrix/                     # Confusion matrix results
├── human_judgements/                # Human judgements
├── datasets/                        # Query datasets and evaluation data
├── generated_queries/               # Generated query outputs
├── hard_to_retrieve/                # Hard to retrieve query analysis
│   ├── results/                     # Results of hard to retrieve analysis
│   └── runs/                        # Run files for retrieval evaluation
├── modified_qrels/                  # Modified relevance judgments
├── pairwise_llm_as_judege/          # Pairwise LLM judge outputs
├── reasoning_by_llm/                # LLM reasoning analysis
└── umbrela/                         # Umbrela framework implementation
    ├── prompts/                     # Prompts for LLM evaluation
    └── utils/                       # Utility functions for Umbrela
```
A statistical measure used to identify hard queries by setting a performance threshold:
- We use the 0.25 quantile (25th percentile) as our threshold
- Queries with performance below this threshold are considered "hard"
- This approach allows us to objectively identify challenging queries across different datasets and models
- The lower band in our tables represents this 0.25 quantile threshold
- For each task, you can find the script that calculates the quantile (a sketch of the thresholding step follows this list)
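The thresholding step itself is simple; here is a minimal sketch, where the file name and column names are illustrative assumptions rather than the repo's actual format:

```python
# Minimal sketch of quantile-based hard-query selection.
# "per_query_scores.csv" and its columns are hypothetical placeholders.
import pandas as pd

scores = pd.read_csv("per_query_scores.csv")      # one row per query: qid, score
threshold = scores["score"].quantile(0.25)        # the 0.25 quantile = lower band
hard = scores[scores["score"] < threshold]        # queries below the threshold are "hard"

print(f"threshold = {threshold:.3f}, hard queries = {len(hard)}")
hard.to_csv("hard_queries.csv", index=False)
```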
The table below summarizes the key reasons for query failures across the three dimensions we studied:
| Category | Failure Reasons |
|---|---|
| Hard to Judge (H2J) | • Quantitative data needed • Specific and niche topic • Query unrelated to provided criteria • No relevant passage context |
| Hard to Generate (H2G) | • Up-to-date information needed • Technical accuracy required • Risk of misinformation |
| Hard to Retrieve (H2R) | • Numerical data extraction • Specific product term • Ambiguous subject reference |
You can find the results here: reasoning_by_llm/
The table below shows examples of queries and which aspects make them hard:
| Query ID | Query | Hard to Retrieve (H2R) | Hard to Generate (H2G) | Hard to Judge (H2J) |
|---|---|---|---|---|
| 390360 | ia suffix meaning | DistilBERT | LLaMa + Qwen | Graded |
| 673670 | what is a aim | BM25 | LLaMa | Graded |
| 555530 | what are best foods to lower cholesterol | BM25 | Qwen | Graded |
| 443396 | lps laws definition | hard in binary all | LLaMa + Qwen | Binary |
| 121171 | define etruscans | DistilBERT | LLaMa + Qwen | Binary |
| 1108651 | what the best way to get clothes white | BM25 | LLaMa | Binary |
| 1129560 | accounting definition of building improvements | BM25 | Qwen | Binary |
```bash
pip install -r requirements.txt
```

Queries where retrieval systems struggle to find relevant documents (NDCG@10 below the 0.25 quantile threshold).
With this script we can find the hard-to-retrieve queries for a given run file. It returns the quantile threshold for each dataset and saves the hard queries for each dataset (see results below).

```bash
python h2r_hard_queries.py \
  --run_file_path hard_to_retrieve/runs/run.msmarco-v2-passage.bm25-default.dl21.txt \
  --qrel_file_path datasets/qrels.dl21-passage.txt \
  --quantile 0.25 \
  --output hard_to_retrieve/results/h2r_hard_to_retrieve_dl21.csv
```

The table below shows the retrieval performance across different datasets. The lower band represents the 0.25 quantile threshold, which we use to identify hard-to-retrieve queries.
| Dataset | Model | Mean NDCG | Lower Band (0.25 Quantile) |
|---|---|---|---|
| DL19 | BM25 | 0.217 | 0.381 |
| DL19 | DistilBERT-TAS-B | 0.457 | 0.631 |
| DL20 | BM25 | 0.196 | 0.298 |
| DL20 | DistilBERT-TAS-B | 0.498 | 0.591 |
| DL21 | BM25 | 0.180 | 0.317 |
| DL21 | DistilBERT-TAS-B | 0.104 | 0.208 |
| DL22 | BM25 | 0.093 | 0.161 |
| DL22 | DistilBERT-TAS-B | 0.127 | 0.202 |
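For reference, per-query NDCG@10 for a run/qrels pair can be computed with `pytrec_eval` before applying the same 0.25 quantile rule. This is a hedged sketch; `h2r_hard_queries.py` may implement the evaluation differently.

```python
# Hedged sketch: per-query NDCG@10 from a TREC run and qrels file via pytrec_eval,
# followed by the 0.25 quantile cut-off. Paths are the DL21 examples from this repo;
# the actual script may differ.
import numpy as np
import pytrec_eval

with open("datasets/qrels.dl21-passage.txt") as f:
    qrels = pytrec_eval.parse_qrel(f)
with open("hard_to_retrieve/runs/run.msmarco-v2-passage.bm25-default.dl21.txt") as f:
    run = pytrec_eval.parse_run(f)

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
per_query = {qid: m["ndcg_cut_10"] for qid, m in evaluator.evaluate(run).items()}

threshold = float(np.quantile(list(per_query.values()), 0.25))
hard = sorted(qid for qid, s in per_query.items() if s < threshold)
print(f"0.25-quantile threshold: {threshold:.3f}; {len(hard)} hard queries")
```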
Queries where generative models struggle to produce accurate or relevant responses (BERTScore against the annotated passage in MS MARCO (v1, v2) below the 0.25 quantile threshold).
With this script we can find the hard-to-generate queries for a given query file.

```bash
python h2g_generate_passage.py \
  --queries datasets/test2021-queries-filterd.tsv \
  --output h2g_generate_passage_outputs/h2g_generate_passage_dl21.json \
  --model qwen3:8b
```

To calculate the BERTScore and produce the results below, you can run this code:
```bash
python bertscore.py \
  -q datasets/qrels.dl21-passage.txt \
  -g h2g_generate_passage_outputs/h2g_generate_passage_dl21.json \
  -o bert_score_outputs/bert_score_dl21.json \
  -p datasets/qrels_text.dl21
```

The table below shows the 0.25 quantile threshold of BERTScore F1 for generated content across different datasets and models:
| Dataset | Model | BERTScore F1 (0.25 Quantile) |
|---|---|---|
| DL19 | Qwen3:8b | 0.823 |
| DL20 | Qwen3:8b | 0.824 |
| DL21 | Qwen3:8b | 0.824 |
| DL22 | Qwen3:8b | 0.816 |
| DL19 | LLaMa3.2:3b | 0.832 |
| DL20 | LLaMa3.2:3b | 0.827 |
| DL21 | LLaMa3.2:3b | 0.836 |
| DL22 | LLaMa3.2:3b | 0.826 |
You can find the results here: bert_score_outputs/
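For orientation, here is a hedged sketch of scoring one generated passage against its annotated reference passage with the `bert-score` package; the texts are placeholders, and `bertscore.py` may use a different underlying model or aggregation.

```python
# Hedged sketch: BERTScore F1 between a generated passage and the annotated
# MS MARCO passage. The texts below are placeholders; bertscore.py may differ.
from bert_score import score

candidates = ["<generated passage for the query>"]
references = ["<annotated relevant passage from the qrels>"]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```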
Queries where determining graded or binary relevance between a query and a document is challenging for automated systems, relative to human annotations.
With this script we can find the hard-to-judge queries for a given qrels file.

```bash
python h2j_judgement_binary.py \
  --dataset 22 \
  --model_name llama3.2:3b
```

Output directory: binary_judge/
```bash
python h2j_umbrela_like_llm_judge.py \
  --qrel datasets/qrels.dl21-passage.txt \
  --model_name qwen3:8b \
  --prompt_type bing \
  --base_url http://localhost:11434/v1
```

Output directory: modified_qrels/
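As a hedged illustration of the judging call pattern: the scripts point at an OpenAI-compatible Ollama endpoint, so a single judgment request can be sketched as below. The prompt text here is invented for illustration; the real prompts live under umbrela/prompts/ and the scripts may structure the request differently.

```python
# Hedged sketch of one relevance-judgment request to a local Ollama server
# through its OpenAI-compatible API. The prompt below is illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored by Ollama

prompt = (
    "Query: what is a aim\n"
    "Passage: <candidate passage text>\n"
    "Answer with a single label: 0 (not relevant) or 1 (relevant)."
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```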
To find the hard-to-judge queries, you can run this code:

```bash
python h2j_finder.py
```

LLaMa3.2 confusion matrices are available in the conf_matrix/ directory.
To evaluate the reasons tagged by the LLMs, we use human annotation.

```bash
python analyze_human_judgements.py
```

Output directory: human_judgements/
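One way such an analysis can be framed is as agreement between the LLM tags and the human labels. Below is a minimal sketch with scikit-learn, where the file name and column names are assumptions rather than the repo's actual data layout.

```python
# Hedged sketch: agreement between LLM-assigned and human-assigned labels.
# "annotations.csv" and its columns are hypothetical; analyze_human_judgements.py
# may report different statistics.
import pandas as pd
from sklearn.metrics import cohen_kappa_score, confusion_matrix

df = pd.read_csv("human_judgements/annotations.csv")
print("Cohen's kappa:", cohen_kappa_score(df["human_label"], df["llm_label"]))
print(confusion_matrix(df["human_label"], df["llm_label"]))
```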

