This repository contains a small tutorial-style codebase for several RAG fusion strategies. The current focus is not completeness, but getting each method into a minimal runnable form that can be extended later.
Fusion modes:
- `none`: no retrieval augmentation, plain base model evaluation.
- `query`: BM25 retrieval + prompt template insertion via `{context}`.
- `logits`: BM25 retrieval + neighbor next-token aggregation + `lambda * p + (1 - lambda) * q`.
- `latent`: FAISS retrieval + embedding-weighted latent vector + trainable projection injected into QKV modules.
- `parametric`: document-level LoRA adapters trained from retrieved neighbor documents and fused at inference time.
Repository layout:
- `src/main.py`: unified inference entry for `none`/`query`/`logits`/`latent`/`parametric`.
- `src/train_latent.py`: training script for latent fusion adapters.
- `src/train_parametric.py`: training script for document-level parametric LoRA adapters.
- `src/fusion/`: fusion implementations.
- `src/retriever/`: BM25 and FAISS retrievers.
- `scripts/`: ready-to-run shell scripts for training and inference.
The code is expected to run inside your existing ragdemo environment.
conda activate ragdemo

Plain baseline:
bash scripts/run.sh

Query fusion:
bash scripts/run_query.sh

Logits fusion:
bash scripts/run_logits.sh

Latent fusion inference:
bash scripts/run_latent.sh

Parametric fusion inference:
bash scripts/run_parametric.sh

Train the latent fusion adapter:
bash scripts/train_latent.sh

Train the parametric document-level LoRA bank:
bash scripts/train_parametric.sh

The unified entrypoint is:
python -m src.main \
--dataset hotpotqa/hotpot_qa \
--config distractor \
--split validation \
--model-name Qwen/Qwen2.5-1.5B \
--fusion none

Important arguments:
- `--fusion`: one of `none`, `query`, `logits`, `latent`, `parametric`
- `--retriever`: `bm25` or `faiss`
- `--encoder-model-name`: required for `faiss`
- `--user-prompt`: prompt template; for query-style text insertion, use `{context}`
- `--top-k`: number of retrieved neighbors
- `--max-samples`: evaluate only a small subset for smoke tests
- `--logits-lambda`: blend weight used by logits fusion
- `--latent-checkpoint`: trained latent adapter checkpoint
- `--parametric-checkpoint`: trained parametric adapter bank
- `--lora-rank`, `--lora-alpha`: LoRA configuration for parametric fusion
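For orientation, here is a rough sketch of how a few of these flags could be declared with `argparse`. It is not taken from `src/main.py`: the default values, the helper names `build_parser` and `parse_args`, and the exact error message are all assumptions made for illustration. Only the flag names and the rule that `--encoder-model-name` is required for `faiss` come from the list above.

```python
import argparse

FUSION_MODES = ["none", "query", "logits", "latent", "parametric"]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; defaults below are placeholders, not the repo's values.
    parser = argparse.ArgumentParser(description="RAG fusion inference (sketch)")
    parser.add_argument("--fusion", choices=FUSION_MODES, default="none")
    parser.add_argument("--retriever", choices=["bm25", "faiss"], default="bm25")
    parser.add_argument("--encoder-model-name", default=None)
    parser.add_argument("--user-prompt", default="Question: {question}\n{context}\nAnswer:")
    parser.add_argument("--top-k", type=int, default=4)
    parser.add_argument("--max-samples", type=int, default=None)
    parser.add_argument("--logits-lambda", type=float, default=0.5)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # The FAISS retriever needs an encoder to embed queries and documents.
    if args.retriever == "faiss" and not args.encoder_model_name:
        parser.error("--encoder-model-name is required when --retriever is faiss")
    return args


if __name__ == "__main__":
    print(parse_args(["--fusion", "query", "--top-k", "3"]))
```

Validating the `faiss` requirement at parse time means a misconfigured run fails before any model or index is loaded.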
The `query` path currently implements the simplest form: retrieve with BM25, insert the retrieved text into the reserved `{context}` slot in the user prompt, then run generation.
Example prompt template:
Use the retrieved context to answer the question.
Question: {question}
{context}
Answer:
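As a minimal sketch, the template above could be filled as shown below. The function name `build_query_prompt`, the `[1]`, `[2]` passage numbering, and the sample passage are assumptions for illustration; the repository may join retrieved text differently.

```python
TEMPLATE = (
    "Use the retrieved context to answer the question.\n"
    "Question: {question}\n"
    "{context}\n"
    "Answer:"
)


def build_query_prompt(template: str, question: str, passages: list[str]) -> str:
    # Join retrieved passages into one block; the numbering is an assumption.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # Only the template is parsed for placeholders, so braces inside the
    # retrieved passages are inserted verbatim.
    return template.format(question=question, context=context)


if __name__ == "__main__":
    passages = ["Hamlet is a tragedy written by William Shakespeare."]
    print(build_query_prompt(TEMPLATE, "Who wrote Hamlet?", passages))
```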
The `logits` path is a minimal version. For each query, the pipeline:
- retrieves neighbors with BM25
- builds one augmented prompt per neighbor
- reads the next-token distribution from each neighbor prompt
- weights neighbor targets by retrieval score
- blends the neighbor distribution with the base model distribution
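The last two steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `src/fusion/`: it assumes `p` is the base model's next-token distribution and `q` is the neighbor aggregate, that retrieval scores are turned into weights with a softmax, and that the weighting is applied to each neighbor's next-token distribution. The function names are hypothetical.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_next_token(base_probs, neighbor_probs, scores, lam):
    """Return lam * p + (1 - lam) * q over the vocabulary.

    base_probs: p, the base model's next-token distribution (assumed role).
    neighbor_probs: one next-token distribution per neighbor prompt.
    scores: retrieval scores, one per neighbor.
    """
    weights = softmax(scores)  # assumption: softmax-normalized retrieval scores
    vocab = len(base_probs)
    # q is the score-weighted mixture of the neighbor distributions.
    q = [
        sum(w * dist[t] for w, dist in zip(weights, neighbor_probs))
        for t in range(vocab)
    ]
    return [lam * p_t + (1.0 - lam) * q_t for p_t, q_t in zip(base_probs, q)]


if __name__ == "__main__":
    base = [0.7, 0.2, 0.1]
    neighbors = [[0.1, 0.8, 0.1], [0.2, 0.6, 0.2]]
    print(fuse_next_token(base, neighbors, scores=[2.0, 1.0], lam=0.5))
```

Because both `p` and `q` sum to one and the blend is convex for `lam` in [0, 1], the fused vector is again a valid distribution.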
The `latent` path currently assumes:
- retrieval uses FAISS with sentence-transformer embeddings
- the base model is frozen
- only the latent projection layers are trained
- the weighted retrieval embedding is injected into QKV projection outputs
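The injection step can be pictured with a small framework-free sketch. It assumes the injection is additive, that retrieval scores are softmax-normalized, and that one projection matrix maps the retrieval embedding to the width of a Q, K, or V output; none of these details are confirmed by the list above, and all function names are hypothetical. Plain lists stand in for tensors so the arithmetic stays visible.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_embedding(embeddings, scores):
    # Score-weighted average of the retrieved document embeddings.
    weights = softmax(scores)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def inject(proj_output, latent, projection):
    # projection is the trainable part; proj_output comes from the frozen
    # Q, K, or V layer for one token position (assumption: additive injection).
    delta = matvec(projection, latent)
    return [h + d for h, d in zip(proj_output, delta)]


if __name__ == "__main__":
    latent = weighted_embedding([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
    projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hidden size 3, embedding size 2
    print(inject([0.0, 0.0, 0.0], latent, projection))
```

In a real model the same offset would presumably be added at every token position, with gradients flowing only into `projection`.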
The `parametric` path currently assumes:
- each document owns one LoRA adapter
- training uses retrieved neighbor documents to help reconstruct the target document
- inference retrieves relevant documents, loads their adapters, and computes a weighted average adapter before generation
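The weighted-average step at inference could look roughly like the sketch below. It assumes each adapter is a mapping from parameter name to a 2D matrix, that weights come from a softmax over retrieval scores, and that averaging is done entry-wise on the LoRA matrices themselves. The names `average_adapters`, `lora_A`, and `lora_B` are illustrative, not the repository's checkpoint format.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def average_adapters(adapters, scores):
    """Entry-wise weighted average of per-document adapters.

    adapters: list of dicts, parameter name -> matrix (list of rows).
    scores: retrieval scores, one per adapter.
    """
    weights = softmax(scores)  # assumption: softmax-normalized retrieval scores
    merged = {}
    for name, reference in adapters[0].items():
        rows, cols = len(reference), len(reference[0])
        merged[name] = [
            [
                sum(w * a[name][r][c] for w, a in zip(weights, adapters))
                for c in range(cols)
            ]
            for r in range(rows)
        ]
    return merged


if __name__ == "__main__":
    doc_a = {"lora_A": [[1.0, 0.0]], "lora_B": [[2.0], [0.0]]}
    doc_b = {"lora_A": [[0.0, 1.0]], "lora_B": [[0.0], [2.0]]}
    print(average_adapters([doc_a, doc_b], scores=[1.0, 1.0]))
```

Note that averaging `lora_A` and `lora_B` separately is not the same as averaging the products `lora_B @ lora_A`; which of the two a given implementation uses is a design choice worth checking.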
This repository is still tutorial code. The implementations are intentionally simple and are meant to be iterated on method by method.