This repo contains the code for the paper "Failing Forward: Understanding Query Failure in Retrieval, Judgment, and Generation".
| Qwen3:8b | LLaMa3.2:3b |
|---|---|
| ![]() | ![]() |
```
paper_repo/
├── README.md
├── h2r_hard_queries.py              # Hard to Retrieve queries script
├── h2g_generate_passage.py          # Hard to Generate queries script
├── h2j_judgement_binary.py          # Hard to Judge (Binary) script
├── h2j_umbrela_like_llm_judge.py    # Hard to Judge (Umbrela-like) script
├── pairwise_judge.py                # Pairwise judge evaluation
├── analyze_human_judgements.py      # Analyze human judgements
├── bertscore.py                     # BERTScore evaluation
├── find_reasons.py                  # Analysis of failure reasons
├── bert_score_outputs/              # Outputs from BERTScore evaluations
├── binary_judge/                    # Binary judge evaluation outputs
├── conf_matrix/                     # Confusion matrix results
├── human_judgements/                # Human judgements
├── datasets/                        # Query datasets and evaluation data
├── generated_queries/               # Generated query outputs
├── hard_to_retrieve/                # Hard to retrieve query analysis
│   ├── results/                     # Results of hard to retrieve analysis
│   └── runs/                        # Run files for retrieval evaluation
├── modified_qrels/                  # Modified relevance judgments
├── pairwise_llm_as_judege/          # Pairwise LLM judge outputs
├── reasoning_by_llm/                # LLM reasoning analysis
└── umbrela/                         # Umbrela framework implementation
    ├── prompts/                     # Prompts for LLM evaluation
    └── utils/                       # Utility functions for Umbrela
```
A statistical measure used to identify hard queries by setting a performance threshold:
- We use the 0.25 quantile (25th percentile) as our threshold
- Queries with performance below this threshold are considered "hard"
- This approach allows us to objectively identify challenging queries across different datasets and models
- The lower band in our tables represents this 0.25 quantile threshold
- For each task, you can find the script that calculates the quantile (a sketch of the thresholding step follows this list)
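The thresholding step itself is simple; here is a minimal sketch, where the file name and column names are illustrative assumptions rather than the repo's actual format:

```python
# Minimal sketch of quantile-based hard-query selection.
# "per_query_scores.csv" and its columns are hypothetical placeholders.
import pandas as pd

scores = pd.read_csv("per_query_scores.csv")      # one row per query: qid, score
threshold = scores["score"].quantile(0.25)        # the 0.25 quantile = lower band
hard = scores[scores["score"] < threshold]        # queries below the threshold are "hard"

print(f"threshold = {threshold:.3f}, hard queries = {len(hard)}")
hard.to_csv("hard_queries.csv", index=False)
```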
The table below summarizes the key reasons for query failures across the three dimensions we studied:
| Category | Failure Reasons |
|---|---|
| Hard to Judge (H2J) | • Quantitative data needed • Specific and niche topic • Query unrelated to provided criteria • No relevant passage context |
| Hard to Generate (H2G) | • Up-to-date information needed • Technical accuracy required • Risk of misinformation |
| Hard to Retrieve (H2R) | • Numerical data extraction • Specific product term • Ambiguous subject reference |
You can find the results here: reasoning_by_llm/
The table below shows examples of queries and which aspects make them hard:
| Query ID | Query | Hard to Retrieve (H2R) | Hard to Generate (H2G) | Hard to Judge (H2J) |
|---|---|---|---|---|
| 390360 | ia suffix meaning | DistilBERT | LLaMa + Qwen | Graded |
| 673670 | what is a aim | BM25 | LLaMa | Graded |
| 555530 | what are best foods to lower cholesterol | BM25 | Qwen | Graded |
| 443396 | lps laws definition | hard in binary all | LLaMa + Qwen | Binary |
| 121171 | define etruscans | DistilBERT | LLaMa + Qwen | Binary |
| 1108651 | what the best way to get clothes white | BM25 | LLaMa | Binary |
| 1129560 | accounting definition of building improvements | BM25 | Qwen | Binary |
```bash
pip install -r requirements.txt
```

Queries where retrieval systems struggle to find relevant documents (NDCG@10 below the 0.25 quantile threshold).
With this script we can find the hard-to-retrieve queries for a given run file. It returns the quantile threshold for each dataset and saves the hard queries for each dataset (see results below).

```bash
python h2r_hard_queries.py \
  --run_file_path hard_to_retrieve/runs/run.msmarco-v2-passage.bm25-default.dl21.txt \
  --qrel_file_path datasets/qrels.dl21-passage.txt \
  --quantile 0.25 \
  --output hard_to_retrieve/results/h2r_hard_to_retrieve_dl21.csv
```

The table below shows the retrieval performance across different datasets. The lower band represents the 0.25 quantile threshold, which we use to identify hard-to-retrieve queries.
| Dataset | Model | Mean NDCG | Lower Band (0.25 Quantile) |
|---|---|---|---|
| DL19 | BM25 | 0.217 | 0.381 |
| DL19 | DistilBERT-TAS-B | 0.457 | 0.631 |
| DL20 | BM25 | 0.196 | 0.298 |
| DL20 | DistilBERT-TAS-B | 0.498 | 0.591 |
| DL21 | BM25 | 0.180 | 0.317 |
| DL21 | DistilBERT-TAS-B | 0.104 | 0.208 |
| DL22 | BM25 | 0.093 | 0.161 |
| DL22 | DistilBERT-TAS-B | 0.127 | 0.202 |
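For reference, per-query NDCG@10 for a run/qrels pair can be computed with `pytrec_eval` before applying the same 0.25 quantile rule. This is a hedged sketch; `h2r_hard_queries.py` may implement the evaluation differently.

```python
# Hedged sketch: per-query NDCG@10 from a TREC run and qrels file via pytrec_eval,
# followed by the 0.25 quantile cut-off. Paths are the DL21 examples from this repo;
# the actual script may differ.
import numpy as np
import pytrec_eval

with open("datasets/qrels.dl21-passage.txt") as f:
    qrels = pytrec_eval.parse_qrel(f)
with open("hard_to_retrieve/runs/run.msmarco-v2-passage.bm25-default.dl21.txt") as f:
    run = pytrec_eval.parse_run(f)

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
per_query = {qid: m["ndcg_cut_10"] for qid, m in evaluator.evaluate(run).items()}

threshold = float(np.quantile(list(per_query.values()), 0.25))
hard = sorted(qid for qid, s in per_query.items() if s < threshold)
print(f"0.25-quantile threshold: {threshold:.3f}; {len(hard)} hard queries")
```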
Queries where generative models struggle to produce accurate or relevant responses (BERTScore against the annotated passage in MS MARCO (v1, v2) below the 0.25 quantile threshold).
With this script we can find the hard-to-generate queries for a given query file.

```bash
python h2g_generate_passage.py \
  --queries datasets/test2021-queries-filterd.tsv \
  --output h2g_generate_passage_outputs/h2g_generate_passage_dl21.json \
  --model qwen3:8b
```

To calculate the BERTScore and produce the results below, you can run this code:
```bash
python bertscore.py \
  -q datasets/qrels.dl21-passage.txt \
  -g h2g_generate_passage_outputs/h2g_generate_passage_dl21.json \
  -o bert_score_outputs/bert_score_dl21.json \
  -p datasets/qrels_text.dl21
```

The table below shows the 0.25 quantile threshold of BERTScore F1 for generated content across different datasets and models:
| Dataset | Model | BERTScore F1 (0.25 Quantile) |
|---|---|---|
| DL19 | Qwen3:8b | 0.823 |
| DL20 | Qwen3:8b | 0.824 |
| DL21 | Qwen3:8b | 0.824 |
| DL22 | Qwen3:8b | 0.816 |
| DL19 | LLaMa3.2:3b | 0.832 |
| DL20 | LLaMa3.2:3b | 0.827 |
| DL21 | LLaMa3.2:3b | 0.836 |
| DL22 | LLaMa3.2:3b | 0.826 |
You can find the results here: bert_score_outputs/
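For orientation, here is a hedged sketch of scoring one generated passage against its annotated reference passage with the `bert-score` package; the texts are placeholders, and `bertscore.py` may use a different underlying model or aggregation.

```python
# Hedged sketch: BERTScore F1 between a generated passage and the annotated
# MS MARCO passage. The texts below are placeholders; bertscore.py may differ.
from bert_score import score

candidates = ["<generated passage for the query>"]
references = ["<annotated relevant passage from the qrels>"]

P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```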
Queries where determining graded or binary relevance between a query and a document is challenging for automated systems, relative to human annotations.
With this script we can find the hard-to-judge queries for a given qrels file.

```bash
python h2j_judgement_binary.py \
  --dataset 22 \
  --model_name llama3.2:3b
```

Output directory: binary_judge/
```bash
python h2j_umbrela_like_llm_judge.py \
  --qrel datasets/qrels.dl21-passage.txt \
  --model_name qwen3:8b \
  --prompt_type bing \
  --base_url http://localhost:11434/v1
```

Output directory: modified_qrels/
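As a hedged illustration of the judging call pattern: the scripts point at an OpenAI-compatible Ollama endpoint, so a single judgment request can be sketched as below. The prompt text here is invented for illustration; the real prompts live under umbrela/prompts/ and the scripts may structure the request differently.

```python
# Hedged sketch of one relevance-judgment request to a local Ollama server
# through its OpenAI-compatible API. The prompt below is illustrative only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored by Ollama

prompt = (
    "Query: what is a aim\n"
    "Passage: <candidate passage text>\n"
    "Answer with a single label: 0 (not relevant) or 1 (relevant)."
)

response = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
print(response.choices[0].message.content)
```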
To find the hard-to-judge queries, you can run this code:

```bash
python h2j_finder.py
```

LLaMa3.2 confusion matrices are available in the conf_matrix/ directory.
To evaluate the reasons tagged by the LLMs, we use human annotation.

```bash
python analyze_human_judgements.py
```

Output directory: human_judgements/
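One way such an analysis can be framed is as agreement between the LLM tags and the human labels. Below is a minimal sketch with scikit-learn, where the file name and column names are assumptions rather than the repo's actual data layout.

```python
# Hedged sketch: agreement between LLM-assigned and human-assigned labels.
# "annotations.csv" and its columns are hypothetical; analyze_human_judgements.py
# may report different statistics.
import pandas as pd
from sklearn.metrics import cohen_kappa_score, confusion_matrix

df = pd.read_csv("human_judgements/annotations.csv")
print("Cohen's kappa:", cohen_kappa_score(df["human_label"], df["llm_label"]))
print(confusion_matrix(df["human_label"], df["llm_label"]))
```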

