TL;DR:
We propose a bridging method for RAG that rewrites retrieved documents to maximize their utility for answer generation, using LLM-guided process supervision and scalable distillation. Our method outperforms existing baselines across multiple QA benchmarks.
Installation:
```bash
# Create conda environment
conda create -n rtou python=3.9
conda activate rtou

# Install requirements
pip install -r requirements.txt
```

Data preprocessing:
Use the code in notebook/{dataset_name}.ipynb to preprocess each dataset into our standardized JSON format (a conversion sketch follows the dataset list). In this work, we use the following datasets:
- Multi-hop QA: HotpotQA, 2WikiMultihopQA, MuSiQue
- Disambiguation QA: AmbigQA
- Web corpus:
  - Single-hop QA: MS MARCO
  - Comprehensive QA: CRAG
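If you preprocess a new dataset yourself instead of using the notebooks, the target schema is the same standardized format shown under "Custom tasks" below. A minimal conversion sketch, assuming the raw records expose `question` and `answers` fields (these names and the file paths are placeholders, not any specific dataset's schema):

```python
import json

def to_standard_format(raw_example: dict) -> dict:
    """Map one raw QA record to the standardized JSON format.

    The input field names ("question", "answers") are assumptions;
    adapt them to your source dataset's actual schema.
    """
    return {
        "Question": raw_example["question"],
        "answer": list(raw_example.get("answers", [])),
    }

# Convert a whole split and write it out as one JSON file.
with open("raw_dataset.json") as f:
    raw_examples = json.load(f)

standardized = [to_standard_format(ex) for ex in raw_examples]

with open("my_dataset.json", "w") as f:
    json.dump(standardized, f, ensure_ascii=False, indent=2)
```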
Custom tasks:
For other generation tasks (e.g., QA, math, code), format your data as follows:
```json
{
  "Question": "your question here",
  "answer": ["answer1", "answer2"]
}
```

Also modify these scripts to support your task:
- scripts/evaluate.py: for task-specific evaluation (see the sketch after this list)
- scripts/prompts.py: to customize prompts for your task
- scripts/run_xxx_xxx.py: to define the end-to-end pipeline
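A common choice for QA tasks is normalized exact match against the list of gold answers. A minimal sketch of what a task-specific metric in scripts/evaluate.py could look like (the function names and SQuAD-style normalization are assumptions, not this repository's actual evaluation code):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the prediction matches any acceptable gold answer."""
    return normalize(prediction) in {normalize(a) for a in gold_answers}

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # True
```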
Running:
Please make sure the required file paths are correct before running each script.
- RAG: please give the correct `search_cache_name` first.
  ```bash
  bash runs/run_naive_rag.sh
  ```
- Generating the bridging document distribution (a conceptual sketch of this step follows the list):
  ```bash
  bash runs/run_rewrite_docs.sh
  ```
- Training the student model:
  ```bash
  # Convert the cached rewrites into training data
  bash runs/convert_cache_to_train.sh
  cd train
  conda activate test  # the separate training environment
  bash train/bash/run_train_rewriting.sh
  cd ..
  conda activate rtou
  # Convert the training outputs back into the cache format
  bash runs/convert_train_to_cache.sh
  ```
- Preference learning:
  ```bash
  cd dpo-train
  bash run.sh
  cd ..
  conda activate rtou
  bash runs/convert_train_to_cache.sh
  ```
- Using the trained model to rewrite docs:
  ```bash
  bash runs/run_rewrite_docs_fromtrain.sh
  ```
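For intuition, the bridging step takes a question plus its retrieved documents and has an LLM rewrite them into a single bridging document, which the answer generator then reads in place of the raw retrieval results. A minimal conceptual sketch, not the actual implementation (the prompt wording and the `generate` callable are assumptions; see scripts/prompts.py for the real prompts):

```python
REWRITE_PROMPT = """You are given a question and several retrieved documents.
Rewrite the documents into one concise bridging document that contains
exactly the evidence needed to answer the question.

Question: {question}

Documents:
{documents}

Bridging document:"""

def rewrite_docs(question: str, docs: list[str], generate) -> str:
    """Produce a bridging document from retrieved documents.

    `generate` is any callable mapping a prompt string to a model
    completion (e.g., a thin wrapper around an LLM API); it is a
    placeholder, not part of this repository's API.
    """
    prompt = REWRITE_PROMPT.format(
        question=question,
        documents="\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(docs)),
    )
    return generate(prompt)
```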
Baselines:
Please check the files in reranker/ and runs/baselines/.
Acknowledgment:
We acknowledge that this repository is based on Search-o1.