A Confidence-Gated RAG Framework for Robust Mathematical Reasoning under Resource Constraints
Features | Architecture | Results | Quick Start | Citation
MathRAG-Gate is a Retrieval-Augmented Generation (RAG) framework specifically designed for mathematical reasoning tasks, addressing the critical Relevance-Quality Gap in traditional RAG systems when applied to complex reasoning domains.
1. Confidence-Gated Architecture
An adaptive quality assessment mechanism that dynamically selects between rule-based and LLM-based reranking strategies by computing the Pearson correlation (ρ) between structural heuristics and semantic judgments.
2. Staged Batching Engineering
A time-space multiplexing execution pipeline that decouples retrieval/scoring from generation phases, achieving OOM-free inference on consumer-grade GPUs (12GB VRAM) by reducing model switching overhead from O(N) to O(1).
- ✅ Performance Recovery: Restored and improved accuracy from 81.0% (Hybrid RRF) to 86.8%
- ✅ Noise Filtration: Successfully identified and eliminated "high semantic relevance, low logical quality" documents
- ✅ Resource Efficiency: Achieved ~4.5s/sample inference on RTX 4070 Ti (12GB VRAM)
- ✅ Robustness: Accuracy remains stable across repeated runs (σ = 0.011)
Traditional RAG systems optimize for semantic similarity, but in mathematical reasoning tasks requiring strict logical rigor, this leads to retrieval of "semantically relevant but logically deficient" noise documents.
Example: a document containing the correct final answer but lacking derivation steps may have high vector similarity, yet it offers no help to the LLM's Chain-of-Thought (CoT) reasoning and can even interfere with it.
Our experiments revealed that naively combining dense and sparse retrieval (Hybrid RRF) achieves only 81.0% accuracy, below either single-retriever baseline (Dense: 86.0%, BM25: 85.0%). Blindly pursuing high recall introduces destructive noise.
Figure: MathRAG-Gate effectively filters fusion noise, recovering and improving accuracy from 81% to 86.8%
```
┌─────────────┐
│ User Query │
└──────┬──────┘
│
├─────► Phase 1: Batch Retrieval & Scoring (0.5B Model)
│ ├─ Dense Retrieval (FAISS + BGE)
│ ├─ Sparse Retrieval (BM25)
│ ├─ Hybrid Fusion (RRF)
│ └─ Confidence Gate ──┐
│ │
│ ┌──────────────────┤
│ │ │
│ ▼ ▼
│ Rule-RQP LLM-RQP (Qwen 0.5B)
│ (Fast) (Robust)
│ │ │
│ └────► ρ < 0.45? ──┘
│ │
│ Quality-Aware Reranking
│ │
│ ┌──────────┴──────────┐
│ │ Top-K High-Quality │
│ │ Reasoning Templates │
│ └─────────────────────┘
│
├─────► Phase 2: Batch Generation (7B Model)
│ └─ Qwen-7B Generator (Chain-of-Thought)
│
▼
┌──────────────┐
│ Final Answer │
└──────────────┘
```
- Dense Retrieval: `BAAI/bge-small-en-v1.5` embeddings with FAISS indexing
- Sparse Retrieval: BM25 keyword matching
- Hybrid Fusion: Reciprocal Rank Fusion (RRF) algorithm for Top-10 candidates
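For concreteness, here is a minimal sketch of the dense-retrieval step, assuming `sentence-transformers` and `faiss`; it illustrates the BGE + FAISS combination rather than reproducing the project's `dense_retriever.py`.

```python
# Illustrative dense-retrieval sketch (not the project's dense_retriever.py).
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")

def build_index(corpus: list[str]) -> faiss.IndexFlatIP:
    # With normalized embeddings, inner product equals cosine similarity.
    embeddings = encoder.encode(corpus, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index

def dense_search(index: faiss.IndexFlatIP, query: str, top_k: int = 5):
    query_emb = encoder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(query_emb, top_k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```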
Rule-RQP (Baseline Anchor)
A fast heuristic that scores a document's structural formality:
Score = w_logic × N_logic + w_struct × N_struct + w_math × N_math + w_box × I_box
Features:
- Logic connectors ("Therefore", "\implies")
- Structural markers ("Step 1", "\begin{align}")
- Mathematical density (equations, LaTeX formulas)
- Final answer markers (`\boxed{}`)
LLM-RQP (Robust Judge)
Qwen2.5-0.5B model with Few-Shot prompting:
- Contrastive Anchors: Low-quality (Score 1) vs High-quality (Score 5) examples
- Deterministic Inference: Temperature=0.1, Token Limit=10
- Robust Parsing: Regex extraction of the `[[score]]` format
Mechanism:

```
ρ = Pearson(S_rule, S_llm)

Φ = Φ_Rule  (low cost)          if ρ > τ
    Φ_LLM   (high robustness)   if ρ ≤ τ

where τ = 0.45 (threshold)
```
Scientific Discovery: Structure-Logic Orthogonality
Figure: Correlation analysis between Rule-RQP and LLM-RQP scores (ρ=0.139 < 0.45), demonstrating significant decoupling between structural formality and logical validity in the MATH dataset
Interpretation:
- In simple tasks: Rule scores correlate with logic quality → Gate selects fast Rule mode
- In complex tasks (MATH dataset): Correlation collapses (ρ < 0.45) → Gate triggers LLM Fallback
Challenge: 12GB of VRAM cannot host the 7B generator and the 0.5B judge concurrently.
Solution: Time-Space Multiplexing
- Phase 1: Load only Embedding + 0.5B model, complete retrieval/scoring for all samples
- Phase 2: Unload above models, load 7B Generator, batch generate answers using cached Top-K docs
Impact:
- Reduced model switching overhead from O(N) to O(1)
- Stabilized average latency at ~4.5s/sample
- Zero memory overflow (OOM-free)
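As a sketch, the schedule looks like the following; all function names are illustrative placeholders (the actual orchestration lives in `run_experiments.py`):

```python
# Two-phase, time-space multiplexed schedule (illustrative sketch;
# hybrid_retrieve / quality_rerank / unload_scoring_models /
# generate_cot_answer are hypothetical placeholder names).
def run_pipeline(queries, top_k=5):
    # Phase 1: only the embedding model and the 0.5B judge are resident.
    cached_topk = {}
    for q in queries:
        candidates = hybrid_retrieve(q)        # dense + sparse + RRF fusion
        reranked = quality_rerank(candidates)  # gated Rule-RQP / LLM-RQP
        cached_topk[q] = reranked[:top_k]

    unload_scoring_models()  # one model switch for the whole batch: O(1)

    # Phase 2: only the 7B generator is resident; it consumes cached docs.
    return {q: generate_cot_answer(q, docs) for q, docs in cached_topk.items()}
```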
| Method | Accuracy (%) | Std Dev (±) | Recall@5 (%) | Latency (s) |
|---|---|---|---|---|
| Dense Retrieval | 86.00 | 0.0000 | 100.00 | 4.49 |
| Sparse Retrieval (BM25) | 85.00 | 0.0000 | 99.00 | 4.46 |
| Hybrid Retrieval (RRF) | 81.00 ❌ | 0.0000 | 100.00 | 4.59 |
| MathRAG-Gate (Ours) | 86.80 ✅ | 0.0110 | 100.00 | 4.53 |
Aggregated from 5 repeated experiments (Mean ± Std)
1. Evidence of Fusion Noise
Hybrid RRF's noise primarily manifests as increased Logic Errors:
- Hybrid RRF: 19 logic errors
- MathRAG-Gate: 14 logic errors
- 26% reduction in reasoning errors
2. Hyperparameter Philosophy
Figure: Hyperparameter sensitivity heatmap - Optimal config: W_RRF=0.1, W_Quality=0.9
Analysis:
- This extreme weight allocation shows that the system must rely heavily on its internal quality judgment while almost ignoring the external retrieval ranking
- Validates the project's core hypothesis: "Quality over Relevance"
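In score form, the tuned reranker is a weighted sum; a minimal sketch, assuming both components are normalized to [0, 1] (the project's exact normalization may differ):

```python
# Quality-aware reranking score implied by the tuned weights
# (sketch; input normalization is an assumption).
RRF_WEIGHT, QUALITY_WEIGHT = 0.1, 0.9

def rerank_score(rrf_norm: float, quality_norm: float) -> float:
    # Quality dominates: a logically weak document cannot ride a high
    # retrieval rank into the Top-K.
    return RRF_WEIGHT * rrf_norm + QUALITY_WEIGHT * quality_norm
```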
3. Top-K Quality Density Curve
MathRAG-Gate's reranking successfully promotes high-quality "reasoning templates" to Top-1:
- Hybrid RRF (Initial): Volatile quality scores (0.75 → 0.45 → 0.65...)
- MathRAG-Gate (Reranked): Monotonically decreasing quality density (0.92 → 0.85 → 0.70...)
- Python: 3.10+
- VRAM: Minimum 12GB (16GB+ recommended)
- Ollama: Pre-installed Qwen2.5 models
```bash
# Clone repository
git clone <your-repo-url>
cd mathrag_gate_project

# Install dependencies
pip install -r requirements.txt

# Install generator model (7B)
ollama pull qwen2.5:7b

# Install judge model (0.5B)
ollama pull qwen2.5:0.5b
```

Create a `.env` file:

```bash
# Hugging Face mirror (for China)
HF_ENDPOINT=https://hf-mirror.com

# API key (if using online models)
DASHSCOPE_API_KEY=your_api_key_here
```

```bash
# Single experiment (100 samples)
python run_experiments.py

# Hyperparameter optimization (50 samples × 5 configs)
python run_optimization.py
```

Results are saved in timestamped folders:

```
results/
├── exp_20231124_153000/
│   ├── benchmark_results.png        # Visualization
│   ├── final_aggregated_report.csv  # Statistics
│   └── run_*_detailed.csv           # Detailed logs
```

Key configuration (`src/config.py`):

```python
class Settings(BaseSettings):
    # --- LLM Configuration ---
    LLM_MODEL_NAME: str = "qwen2.5:7b"       # Generator model
    QJUDGE_MODEL_NAME: str = "qwen2.5:0.5b"  # Judge model

    # --- Retrieval Configuration ---
    EMBEDDING_MODEL_NAME: str = "BAAI/bge-small-en-v1.5"
    DENSE_TOP_K: int = 5
    SPARSE_TOP_K: int = 5
    HYBRID_TOP_K: int = 10

    # --- Reranking Weights (Tuned) ---
    RRF_WEIGHT: float = 0.1      # Retrieval rank weight
    QUALITY_WEIGHT: float = 0.9  # Quality score weight

    # --- Confidence Gate ---
    GATE_THRESHOLD: float = 0.45  # Correlation threshold
    GATE_SAMPLE_SIZE: int = 200   # Sampling size
```

Tuning recommendations:
- Resource-constrained: reduce `GATE_SAMPLE_SIZE` to 100 for faster initialization
- Higher precision needs: increase `QUALITY_WEIGHT` to 0.95
- Different domains: adjust weights based on `run_optimization.py` results
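Because `Settings` extends pydantic's `BaseSettings`, these values can typically be overridden per run without editing the file; a minimal sketch, assuming standard pydantic behavior and an import path inferred from the project layout below:

```python
# Override tuned defaults at construction time (assumes pydantic BaseSettings;
# the import path is inferred from the project layout, not confirmed).
from src.config import Settings

settings = Settings(
    GATE_SAMPLE_SIZE=100,  # faster gate initialization
    QUALITY_WEIGHT=0.95,   # lean even harder on quality judgments
)
```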
```
mathrag_gate_project/
├── src/
│ ├── config.py # Global configuration
│ ├── utils/
│ │ └── data_loader.py # MATH dataset loader
│ ├── retriever/
│ │ ├── dense_retriever.py # Dense retrieval (BGE + FAISS)
│ │ ├── sparse_retriever.py # Sparse retrieval (BM25)
│ │ ├── hybrid_retriever.py # Hybrid retrieval (RRF fusion)
│ │ ├── main_retriever.py # Main retriever (w/ Gate)
│ │ ├── rqar_rule.py # Rule-RQP scorer
│ │ ├── rqar_llm.py # LLM-RQP scorer
│ │ └── rqar_explainer.py # Explainability module
│ ├── migration/
│ │ └── confidence_gate.py # Confidence gate logic
│ ├── eval/
│ │ └── evaluate.py # Evaluation framework
│ └── monitoring/
│ └── metrics.py # System monitoring
├── run_experiments.py # Experiment orchestration
├── run_optimization.py # Hyperparameter grid search
├── requirements.txt # Dependencies
├── paper.pdf # Research paper
├── README.md # This file (English)
└── README.zh-CN.md # Chinese version
```
Rule-RQP scoring function (`rqar_rule.py`):

```
Score = w_logic  × N_logic   # Logic connectors (30%)
      + w_struct × N_struct  # Structural markers (30%)
      + w_math   × N_math    # Mathematical density (25%)
      + w_box    × I_box     # Final answer marker (15%)
```

Feature Examples:
- `N_logic`: "Therefore", "\implies", "\because"
- `N_struct`: "Step 1", "\begin{align}"
- `N_math`: equation count, LaTeX formulas (`\frac`, `\sqrt`)
- `I_box`: detects the `\boxed{}` marker
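A minimal sketch of such a scorer, assuming simple substring and regex counting (the feature extraction in `rqar_rule.py` may differ):

```python
import re

# Illustrative Rule-RQP scorer; marker lists and counting are simplified.
LOGIC_MARKERS = ["Therefore", "\\implies", "\\because"]
STRUCT_MARKERS = ["Step 1", "\\begin{align}"]
W_LOGIC, W_STRUCT, W_MATH, W_BOX = 0.30, 0.30, 0.25, 0.15

def rule_rqp(doc: str) -> float:
    n_logic = sum(doc.count(m) for m in LOGIC_MARKERS)
    n_struct = sum(doc.count(m) for m in STRUCT_MARKERS)
    n_math = len(re.findall(r"\\frac|\\sqrt|=", doc))  # crude math density
    i_box = 1.0 if "\\boxed{" in doc else 0.0
    return W_LOGIC * n_logic + W_STRUCT * n_struct + W_MATH * n_math + W_BOX * i_box
```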
LLM-RQP few-shot prompt (`rqar_llm.py`):

```
You are a math grader. Rate the reasoning quality from 1 to 5.
### Example 1 (Low Quality - Score 1)
Answer: "The answer is 5."
Reason: No steps, no logic, just a number.
Rating: [[1]]
### Example 2 (High Quality - Score 5)
Answer: "First, let x be the width. Since the area is 20,
we have x * (x+1) = 20. Solving for x, we get x=4..."
Reason: Clear variables, logical steps, and derivation.
Rating: [[5]]
### Target Answer to Grade
{truncated_text}
Rate from 1 to 5. Output ONLY: [[score]]
```
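A sketch of the judging call and the robust score parsing, assuming the `ollama` Python client (the exact call in `rqar_llm.py` may differ):

```python
import re

import ollama  # assumes the ollama Python client and a running Ollama server

def llm_rqp_score(prompt: str) -> int:
    resp = ollama.generate(
        model="qwen2.5:0.5b",
        prompt=prompt,
        options={"temperature": 0.1, "num_predict": 10},  # near-deterministic, short
    )
    # Robust parsing: extract [[score]]; fall back to the midpoint (3)
    # if the model's output is malformed.
    match = re.search(r"\[\[([1-5])\]\]", resp["response"])
    return int(match.group(1)) if match else 3
```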
RRF fusion (`hybrid_retriever.py`):

```python
def rrf_score(rank_dense, rank_sparse, k=60):
    # Reciprocal Rank Fusion: a document ranked high in either list scores high.
    score = 1.0 / (k + rank_dense + 1) + 1.0 / (k + rank_sparse + 1)
    return score
```

Confidence gate (`confidence_gate.py`):

```python
import random

from scipy.stats import pearsonr

def check_consistency(self, documents):
    # Sample up to GATE_SAMPLE_SIZE (200) documents
    sample_docs = random.sample(documents, min(len(documents), 200))
    # Compute dual scores
    rule_scores = [rule_rqp(doc) for doc in sample_docs]
    llm_scores = [llm_rqp(doc) for doc in sample_docs]
    # Pearson correlation between structural and semantic judgments
    rho, _ = pearsonr(rule_scores, llm_scores)
    # Decision against the gate threshold τ = 0.45
    if rho >= 0.45:
        return "USE_RULE"  # High consistency → fast mode
    else:
        return "USE_LLM"   # Low consistency → robust mode
```

```bash
python run_experiments.py
```

Output:
- `results/exp_{timestamp}/final_aggregated_report.csv`
- `results/exp_{timestamp}/benchmark_results.png`
```bash
python run_optimization.py
```

Search Space:

```python
search_space = [
    {"rrf": 0.9, "quality": 0.1},  # Rely on retrieval ranking
    {"rrf": 0.7, "quality": 0.3},  # Traditional setting
    {"rrf": 0.5, "quality": 0.5},  # Balanced mode
    {"rrf": 0.3, "quality": 0.7},  # Quality-oriented
    {"rrf": 0.1, "quality": 0.9},  # Trust quality ✅ Optimal
]
```

Output:
- `results/optimization_logs/optimization_report.csv`
- `results/optimization_logs/tuning_curve.png`
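The optimization itself reduces to a loop over these configurations; a minimal sketch, where `evaluate_config` is a hypothetical helper returning mean accuracy over the evaluation split:

```python
# Grid search over reranking weights (sketch; evaluate_config is hypothetical).
def grid_search(search_space):
    results = []
    for cfg in search_space:
        acc = evaluate_config(rrf_weight=cfg["rrf"], quality_weight=cfg["quality"])
        results.append({**cfg, "accuracy": acc})
    return max(results, key=lambda r: r["accuracy"])  # best configuration
```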
If you use MathRAG-Gate in your research, please cite:
```bibtex
@article{mathrag_gate_2024,
  title={MathRAG-Gate: A Confidence-Gated RAG Framework for Robust Mathematical Reasoning under Resource Constraints},
  author={Chen, Xiaoteng and Ye, Ruijia and Liang, Yuhan},
  journal={arXiv preprint},
  year={2024}
}
```

We welcome Issues and Pull Requests!
Improvement Directions:
- Neuro-Symbolic Verifier: Integrate Python/SymPy code execution as "hard logic" validator
- Difficulty-Aware Routing: Dynamic inference based on query complexity
- Multilingual Support: Extend to Chinese mathematical reasoning tasks
This project is licensed under the MIT License. See LICENSE file for details.
- Author: [Xiaoteng CHEN, Ruijia YE, Yuhan LIANG]
- Email: [xchen400@connect.hkust-gz.edu.cn]
- Repository: GitHub Repository
- Dataset: MATH Dataset by Hendrycks et al.
- Framework: LlamaIndex
- Models: Qwen2.5 by Alibaba Cloud
- Embeddings: BGE by BAAI