Code for TrustMargin, a training-free source arbitration method for retrieval-augmented question answering.
TrustMargin asks a deliberately simple inference-time question:
Should this query trust the model's parametric memory, or the retrieved evidence?
For each input question, TrustMargin generates two candidate answers with the same language model:
- a closed-book Direct answer, using the question only;
- a BM25-RAG answer, using the same fixed top-20 retrieved passages.
It then chooses between the two candidates with a two-term margin:
M = M_prior + lambda_bind * M_bind
select RAG if M > tau
select Direct otherwise
The default setting used in our experiments is:
lambda_bind = 0.5
tau = -1.5
topk = 20
seed = 42
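As a minimal sketch, the decision rule and its default values fit in a single Python function. The name `select_source` and the returned strings are illustrative and are not taken from `src/trustmargin.py`; only the formula and the defaults come from the description above.

```python
def select_source(m_prior: float, m_bind: float,
                  lambda_bind: float = 0.5, tau: float = -1.5) -> str:
    """Return "rag" if the combined margin exceeds tau, otherwise "direct".

    Hypothetical helper: it mirrors M = M_prior + lambda_bind * M_bind
    with the default lambda_bind and tau quoted in this README.
    """
    margin = m_prior + lambda_bind * m_bind
    return "rag" if margin > tau else "direct"
```

Because the default tau is negative, this rule keeps the RAG answer unless the combined margin falls to -1.5 or below, so Direct is chosen only when the evidence against the RAG candidate is fairly strong.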
This repository contains the minimal implementation needed to run TrustMargin:
- Direct and BM25-RAG inference wrappers.
- TrustMargin source arbitration.
- BM25 retrieval into data_aug/.
- EM/F1 evaluation utilities.
- 2WikiMultihopQA and ComplexWebQuestions dev splits used by the project.
The repository intentionally keeps only TrustMargin and the source wrappers needed to evaluate it. Historical comparison baselines are not included here.
TrustMargin decomposes source selection into two complementary signals.
M_prior checks whether the closed-book model itself prefers the RAG answer or
the Direct answer:
M_prior = log p_D(y_R | q) - log p_D(y_D | q)
where:
- q is the question;
- y_D is the Direct answer;
- y_R is the BM25-RAG answer;
- p_D is the closed-book likelihood under the Direct prompt.
If M_prior is positive, the model's parametric memory favors the RAG
candidate. If it is negative, the model favors the Direct candidate.
M_bind checks whether an answer is supported by the interaction between the
question and the retrieved passages, rather than by a passage-only prior:
M_bind =
[log p_R(y_R | q, P) - log p_C(y_R | P)]
- [log p_R(y_D | q, P) - log p_C(y_D | P)]
where:
- P is the retrieved top-k passage pool;
- p_R is the evidence-conditioned likelihood under the RAG prompt;
- p_C is the context-only likelihood with the question removed.
This term rewards answers that are bound to question-conditioned evidence and penalizes answers that merely look plausible from retrieved text alone.
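To make the two margins concrete, the sketch below computes them from six log-likelihoods that are assumed to have been scored already. The `CandidateScores` container, the function names, and the numbers are all hypothetical; how the repository actually scores sequences (for example, summed versus length-normalized log-probabilities) is not specified here.

```python
from dataclasses import dataclass


@dataclass
class CandidateScores:
    """Log-likelihoods of one candidate answer under the three prompts."""
    direct: float   # log p_D(y | q), closed-book Direct prompt
    rag: float      # log p_R(y | q, P), RAG prompt with question and passages
    context: float  # log p_C(y | P), passages only, question removed


def m_prior(y_r: CandidateScores, y_d: CandidateScores) -> float:
    # Does the closed-book model prefer the RAG answer over the Direct answer?
    return y_r.direct - y_d.direct


def m_bind(y_r: CandidateScores, y_d: CandidateScores) -> float:
    # Gain from adding the question to the passages, RAG answer minus Direct answer.
    return (y_r.rag - y_r.context) - (y_d.rag - y_d.context)


# Illustrative numbers only, not measured values.
y_r = CandidateScores(direct=-6.0, rag=-1.0, context=-4.0)
y_d = CandidateScores(direct=-3.0, rag=-5.0, context=-5.5)

prior = m_prior(y_r, y_d)    # -6.0 - (-3.0) = -3.0
bind = m_bind(y_r, y_d)      # (3.0) - (0.5) = 2.5
margin = prior + 0.5 * bind  # -3.0 + 1.25 = -1.75
```

With these made-up values the margin is -1.75, which is not greater than tau = -1.5, so the Direct answer would be kept even though the binding term favors the RAG candidate.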
TrustMargin/
|-- data/
|   |-- 2wikimultihopqa/dev.json
|   `-- complexwebquestions/dev.json
|-- scripts/
|   |-- inf/
|   |   `-- trustmargin.sh
|   `-- retrieve/
|       |-- index_dpr_wiki.py
|       `-- bm25_retrieve.sh
|-- src/
|   |-- basic.py
|   |-- data.py
|   |-- evaluate.py
|   |-- ICL.py
|   |-- inference.py
|   |-- retrieve.py
|   `-- trustmargin.py
|-- requirements.txt
`-- README.md
The standard workflow is:
- install the Python environment;
- prepare raw QA data;
- retrieve BM25 top-20 passages;
- run Direct, BM25-RAG, and TrustMargin.
Install the Python dependencies:
conda create -n trustmargin python=3.10
conda activate trustmargin
pip install -r requirements.txt
or with uv:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
The repository uses HuggingFace causal language models. Install a PyTorch build
that matches your CUDA version if the default pip package is not suitable for
your machine.
Raw datasets are stored under data/{dataset}/dev.json. Each example should
contain:
{
  "test_id": "...",
  "question": "...",
  "answer": "..."
}
This repository includes:
data/2wikimultihopqa/dev.json
data/complexwebquestions/dev.json
BM25-RAG and TrustMargin require retrieved passages. Augmented files should be
stored under data_aug/{dataset}/dev.json with the same fields plus:
{
  "passages": ["passage 1", "passage 2", "passage 3"]
}
The supported dataset names are:
2wikimultihopqa
complexwebquestions
Place augmented BM25 retrieval outputs for these datasets under data_aug/ in
the same format before running BM25-RAG or TrustMargin.
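Before running inference, it can help to confirm that a file has the expected fields. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and `check_file` assumes that each dev.json holds a top-level JSON list of examples, which the format description above does not state explicitly.

```python
import json
from pathlib import Path

RAW_FIELDS = ("test_id", "question", "answer")


def missing_fields(example: dict, augmented: bool = False) -> list:
    """Return the required field names that are absent from one example."""
    required = RAW_FIELDS + (("passages",) if augmented else ())
    return [name for name in required if name not in example]


def check_file(path: str, augmented: bool = False) -> int:
    """Count examples with missing fields (assumes a top-level JSON list)."""
    examples = json.loads(Path(path).read_text(encoding="utf-8"))
    return sum(1 for ex in examples if missing_fields(ex, augmented))


# Example usage, assuming the paths described above exist:
# check_file("data/2wikimultihopqa/dev.json")
# check_file("data_aug/2wikimultihopqa/dev.json", augmented=True)
```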
TrustMargin uses the same BM25 retrieval corpus setup as many open-domain QA
RAG pipelines. Following PRAG and DPR-style retrieval, we use the DPR
Wikipedia split corpus psgs_w100.tsv.
mkdir -p data/dpr
wget -O data/dpr/psgs_w100.tsv.gz \
https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gzip -dk data/dpr/psgs_w100.tsv.gz
After decompression, the corpus should be:
data/dpr/psgs_w100.tsv
Each row contains a passage id, passage text, and title. The indexing script
stores these as Elasticsearch fields named text and title.
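The row layout can be illustrated with a short reader. This is a sketch under stated assumptions rather than the code in `scripts/retrieve/index_dpr_wiki.py`: the function name `iter_passages` is hypothetical, and the file is assumed to be tab-separated with an `id`, `text`, `title` header row and optionally double-quoted text.

```python
import csv
import io
from typing import Iterator, TextIO


def iter_passages(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per passage row, skipping the header and short rows."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if len(row) < 3 or row[0] == "id":
            continue
        yield {"id": row[0], "text": row[1], "title": row[2]}


# A two-line stand-in for psgs_w100.tsv, used only for illustration.
sample = 'id\ttext\ttitle\n1\t"Aaron is a prophet."\tAaron\n'
rows = list(iter_passages(io.StringIO(sample)))
```

For the real corpus, the same generator could be fed `open("data/dpr/psgs_w100.tsv", newline="", encoding="utf-8")` and its output passed to a bulk indexer in batches.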
If you already have an Elasticsearch server and a wiki index, skip to
retrieval. Otherwise, one simple local setup is:
mkdir -p data
wget -O data/elasticsearch-8.15.0.tar.gz \
https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.15.0-linux-x86_64.tar.gz
tar -xzf data/elasticsearch-8.15.0.tar.gz -C data
rm data/elasticsearch-8.15.0.tar.gz
For a local research machine, configure a single-node, unauthenticated server:
cat >> data/elasticsearch-8.15.0/config/elasticsearch.yml <<'EOF'
discovery.type: single-node
xpack.security.enabled: false
EOF
Start Elasticsearch:
data/elasticsearch-8.15.0/bin/elasticsearch -d
Check that it is running:
curl http://localhost:9200
Index the DPR Wikipedia corpus as an Elasticsearch index named wiki:
python scripts/retrieve/index_dpr_wiki.py \
--data_path data/dpr/psgs_w100.tsv \
--index_name wiki \
--elastic_url http://localhost:9200 \
--reset
Useful checks:
curl "http://localhost:9200/_cat/indices?v"
curl "http://localhost:9200/wiki/_count?pretty"
The retrieval code automatically detects common text fields including text,
contents, body, passage, paragraph, content, and txt, so it can also
work with an existing index if the corpus has already been indexed elsewhere.
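One plausible way to implement such field detection is to try each candidate name in order on a hit's `_source` document. The sketch below is an assumption about the general approach, not the logic in `src/retrieve.py`; the function name and the priority order of the field names are hypothetical.

```python
from typing import Optional

# Candidate field names listed in this README; the order is an assumption.
TEXT_FIELDS = ("text", "contents", "body", "passage", "paragraph", "content", "txt")


def extract_text(source: dict) -> Optional[str]:
    """Return the first non-empty string found under a known text field."""
    for name in TEXT_FIELDS:
        value = source.get(name)
        if isinstance(value, str) and value.strip():
            return value
    return None
```

For example, `extract_text({"title": "Aaron", "contents": "Aaron is a prophet."})` returns the `contents` string, while a document with none of the listed fields yields `None`.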
If data_aug/{dataset}/dev.json already exists, skip this step.
The retrieval script assumes that Elasticsearch is running and that the wiki
corpus has already been indexed. By default, it queries an index named wiki.
bash scripts/retrieve/bm25_retrieve.sh
Environment variables:
| Variable | Default | Description |
|---|---|---|
| K | 20 | Number of passages retrieved per question. |
| NUM_THREADS | 32 | Retrieval worker threads. |
| INDEX_NAME | wiki | Elasticsearch index name. |
| ELASTIC_URL | http://localhost:9200 | Elasticsearch endpoint. |
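For orientation, the sketch below shows how these settings could map onto a single Elasticsearch search request. It is illustrative only and does not reproduce the repository's retrieval code: the function name `build_bm25_request` is hypothetical, reading the variables from Python is an assumption, and the query targets the `text` field created by the indexing step. Elasticsearch scores `match` queries on text fields with BM25 by default.

```python
import os
from typing import Mapping, Tuple


def build_bm25_request(question: str,
                       env: Mapping[str, str] = os.environ) -> Tuple[str, dict]:
    """Build the _search URL and JSON body for one question (no network call)."""
    k = int(env.get("K", "20"))
    index_name = env.get("INDEX_NAME", "wiki")
    base_url = env.get("ELASTIC_URL", "http://localhost:9200").rstrip("/")
    url = f"{base_url}/{index_name}/_search"
    body = {"size": k, "query": {"match": {"text": question}}}
    return url, body
```

The returned body can be sent as the JSON payload of a POST request to the returned URL, and the top `K` hits come back under `hits.hits` in the response.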
Equivalent Python command:
python src/retrieve.py \
--dataset all \
--split dev \
--data_root data \
--output_root data_aug \
--elastic_url http://localhost:9200 \
--index_name wiki \
--k 20 \
--num_threads 32
All inference methods use src/inference.py.
| Method | Context | Description |
|---|---|---|
| direct | none | Closed-book answer generation. |
| bm25-rag | BM25 top-k passages | Standard retrieval-augmented generation. |
| trustmargin | BM25 top-k passages | Selects between Direct and BM25-RAG answers. |
python src/inference.py \
--method direct \
--model_path /path/to/model \
--dataset all \
--data_root data \
--prediction_file outputs/1b/direct.json
python src/inference.py \
--method bm25-rag \
--model_path /path/to/model \
--dataset all \
--data_aug_root data_aug \
--prediction_file outputs/1b/rag_at_20.json \
--topk 20
The convenience script runs TrustMargin with the default hyperparameters:
bash scripts/inf/trustmargin.sh
You can also call the Python entry point directly:
python src/inference.py \
--method trustmargin \
--model_path /path/to/model \
--dataset all \
--data_aug_root data_aug \
--prediction_file outputs/1b/trustmargin.json \
--topk 20 \
--max_context_len 2048 \
--max_new_tokens 32 \
--trustmargin_lambda_bind 0.5 \
--trustmargin_tau -1.5
Main arguments:
| Argument | Default | Description |
|---|---|---|
| --method | required | direct, bm25-rag, or trustmargin. |
| --model_path | required | HuggingFace model path or name. |
| --dataset | all | Dataset name or all. |
| --datasets | None | Optional explicit dataset list. |
| --data_root | data | Root for raw QA files. |
| --data_aug_root | data_aug | Root for retrieved-passage files. |
| --prediction_file | auto | Output JSON path. |
| --num_samples_for_eval | -1 | Use a subset for evaluation; -1 means all. |
| --topk | 20 | Number of passages used by RAG/TrustMargin. |
| --max_context_len | 2048 | Maximum context tokens before generation. |
| --max_new_tokens | 20 | Maximum generation length. |
| --seed | 42 | Random seed. |
| --device_map | auto | HuggingFace device map; use none for manual CUDA placement. |
| --torch_dtype | float16 | auto, float16, bfloat16, or float32. |
| --trustmargin_lambda_bind | 0.5 | Weight on the evidence-binding margin. |
| --trustmargin_tau | -1.5 | Source-selection threshold. |
Prediction files are JSON objects with one entry per dataset:
{
  "method": "trustmargin",
  "datasets": {
    "2wikimultihopqa": {
      "records": [
        {
          "test_id": "...",
          "question": "...",
          "answer": "...",
          "prediction": "...",
          "raw_output": "...",
          "score": {
            "em": 0,
            "f1": 0.0
          },
          "passages": ["..."],
          "method_debug": {
            "method": "TrustMargin",
            "direct_answer": "...",
            "rag_answer": "...",
            "selected_source": "direct",
            "margin": 0.0,
            "margins": {
              "M_prior": 0.0,
              "M_bind": 0.0
            }
          }
        }
      ]
    }
  }
}
Gold answers are used only for evaluation scores. They are not used by TrustMargin when selecting the source.
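A prediction file in this layout can be summarized with a few lines of Python. The sketch below is a hypothetical helper, not a repository utility: it relies only on the keys shown in the example above and reports, per dataset, the mean EM, mean F1, and the share of records whose `selected_source` is `direct`.

```python
import json
from pathlib import Path


def summarize(predictions: dict) -> dict:
    """Aggregate per-dataset scores from a loaded prediction file."""
    summary = {}
    for name, block in predictions["datasets"].items():
        records = block["records"]
        n = len(records)
        if n == 0:
            continue
        n_direct = sum(
            1 for r in records
            if r.get("method_debug", {}).get("selected_source") == "direct"
        )
        summary[name] = {
            "n": n,
            "em": sum(r["score"]["em"] for r in records) / n,
            "f1": sum(r["score"]["f1"] for r in records) / n,
            "direct_share": n_direct / n,
        }
    return summary


# Example usage, assuming the file from the command above exists:
# preds = json.loads(Path("outputs/1b/trustmargin.json").read_text(encoding="utf-8"))
# print(summarize(preds))
```

For `direct` and `bm25-rag` outputs, which may not carry a `method_debug` block, `direct_share` simply evaluates to 0.0 in this sketch.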
- TrustMargin is training-free: it does not update model weights and does not use a separate judge model.
- TrustMargin uses the same top-k passage pool as BM25-RAG. It does not access additional retrieval results during source selection.
- The default comparison setting is Direct vs BM25-RAG@20.
- For reproducibility, keep the model path, top-k, decoding settings,
lambda_bind, tau, and random seed fixed across methods.
If you use this repository, please cite:
@misc{xu2026trustmargin,
title = {TrustMargin: Training-Free Arbitration between Parametric Memory and Retrieved Evidence in Large Language Models},
author = {Jingyan Xu and Hong Shi and Yi Shan and Penghui Liu and Yunhao Bai and Ningyuan Li and Xueyang Liu},
year = {2026},
eprint = {2606.08397},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
doi = {10.48550/arXiv.2606.08397},
url = {https://arxiv.org/abs/2606.08397}
}