sigridjineth commented Dec 4, 2025

Summary

This PR integrates XProvence (naver/xprovence-reranker-bgem3-v1), a zero-cost context pruning model for RAG. The model scores sentences by query relevance and removes irrelevant ones, returning both reranking scores and pruned_text (the pruned context).

Motivation

In RAG pipelines, retrieved documents often include distracting content that confuses LLMs and wastes tokens. XProvence mitigates this by:

  • Providing sentence-level relevance scoring
  • Pruning irrelevant sentences while preserving key content
  • Reducing token usage without sacrificing answer quality

Changes

Python Backend (backends/python/)

  • Add XProvenceModel class with process() for sentence-level pruning
  • Add pruned_text field to Score type
  • Make flash_attn imports optional for environments without flash attention
  • Handle bfloat16 → float32 conversion (XProvence process() requires float32)
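
The optional import and the extended Score type could look roughly like the sketch below. Only `flash_attn` and `pruned_text` are named in this PR; the `HAS_FLASH_ATTN` flag and the rest of the `Score` layout are assumptions made for illustration, not the actual definitions in the Python backend.

```python
from dataclasses import dataclass
from typing import Optional

# Optional dependency: keep the module importable when flash_attn is absent.
try:
    import flash_attn  # noqa: F401
    HAS_FLASH_ATTN = True  # hypothetical flag name
except ImportError:
    HAS_FLASH_ATTN = False


@dataclass
class Score:
    # Hypothetical layout: only `pruned_text` is named in the PR.
    score: float
    # None means no pruned context is available for this result.
    pruned_text: Optional[str] = None
```

Defaulting `pruned_text` to `None` keeps existing rerankers, which produce no pruned context, compatible with the extended type.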

Core (core/)

  • Pass raw_query and raw_text through the tokenization pipeline for pruning
  • Include pruned_text in inference results

Router (router/)

  • Detect XProvence architecture
  • Include pruned_text in HTTP rerank response

gRPC (backends/grpc-client/, backends/proto/)

  • Add pruned_text field to protobuf definitions
  • Update gRPC client to handle pruned text

Files Changed

  • backends/python/.../xprovence_model.py: New XProvence model implementation
  • backends/python/.../models/__init__.py: Model detection and optional flash_attn import
  • backends/python/.../models/types.py: Add pruned_text to Score
  • backends/proto/embed.proto: Add pruned_text to protobuf
  • core/src/tokenization.rs: Pass raw text for pruning
  • core/src/infer.rs: Handle pruned_text in results
  • core/src/queue.rs: Store raw text in queue entries
  • router/src/http/types.rs: Add pruned_text to response type
  • router/src/http/server.rs: Include pruned_text in rerank response

Configuration

  • XPROVENCE_THRESHOLD: Pruning threshold 0.0–1.0 (default: 0.3)
    • Lower = more conservative (keeps more sentences)
    • Higher = more aggressive (removes more sentences)
  • XPROVENCE_ALWAYS_SELECT_TITLE: Always keep the first sentence, treating it as the title (default: true)
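
To make the threshold semantics concrete, here is a minimal sketch of how the two settings could be read and applied. It is an illustration only: the real pruning happens inside the model's process() method, and the `read_config` and `prune` helpers, the accepted boolean spellings, and the per-sentence scores are all assumptions.

```python
import os


def read_config(env=os.environ):
    """Read the two settings with their documented defaults."""
    threshold = float(env.get("XPROVENCE_THRESHOLD", "0.3"))
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("XPROVENCE_THRESHOLD must be within 0.0-1.0")
    # Assumption: which spellings count as true is not stated in the PR.
    raw = env.get("XPROVENCE_ALWAYS_SELECT_TITLE", "true")
    keep_title = raw.strip().lower() in ("1", "true", "yes")
    return threshold, keep_title


def prune(sentences, scores, threshold, keep_title):
    """Keep sentences scoring at or above the threshold (hypothetical rule)."""
    kept = []
    for i, (sentence, score) in enumerate(zip(sentences, scores)):
        # The first sentence survives unconditionally when keep_title is set.
        if (keep_title and i == 0) or score >= threshold:
            kept.append(sentence)
    return " ".join(kept)


sentences = [
    "Deep learning uses neural networks.",
    "The weather is nice.",
    "I like pizza.",
]
scores = [0.95, 0.02, 0.01]  # made-up relevance scores for illustration
print(prune(sentences, scores, threshold=0.3, keep_title=True))
# prints "Deep learning uses neural networks."
```

Raising the threshold can only shrink the kept set, which matches the "higher = more aggressive" description above.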

Usage

XPROVENCE_THRESHOLD=0.3 \
XPROVENCE_ALWAYS_SELECT_TITLE=true \
text-embeddings-router --model-id naver/xprovence-reranker-bgem3-v1 --port 8080

API Example

Request

curl http://localhost:8080/rerank -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "What is deep learning?",
    "texts": [
      "Deep learning uses neural networks. The weather is nice. I like pizza."
    ],
    "return_text": true
  }'

Response

[
  {
    "index": 0,
    "text": "Deep learning uses neural networks. The weather is nice. I like pizza.",
    "score": 0.9997,
    "pruned_text": "Deep learning uses neural networks."
  }
]
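
A client might consume this response as in the sketch below, which uses only the Python standard library. The `rerank` and `build_context` helper names are hypothetical, and the fallback to `text` when `pruned_text` is missing or null is an assumption about how a caller may want to handle models that do not prune.

```python
import json
import urllib.request


def rerank(base_url, query, texts):
    """POST to /rerank with return_text enabled and decode the JSON reply."""
    body = json.dumps(
        {"query": query, "texts": texts, "return_text": True}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/rerank",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def build_context(results):
    """Join passages by descending score, preferring the pruned text."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    parts = []
    for r in ranked:
        pruned = r.get("pruned_text")
        # Assumption: fall back to the full text if no pruned text came back.
        chosen = pruned if pruned is not None else r.get("text", "")
        if chosen:  # skip passages that were pruned down to nothing
            parts.append(chosen)
    return "\n\n".join(parts)
```

With a server running as shown under Usage, `build_context(rerank("http://localhost:8080", query, texts))` would produce the context string to hand to the LLM.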

Test Plan

  • Server starts successfully with the XProvence model
  • Rerank endpoint returns correct scores
  • pruned_text contains only relevant sentences
  • Irrelevant sentences are removed
  • Works with Korean/multilingual text
  • Graceful fallback when pruning fails
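
The last item can be illustrated with a small wrapper. This is a sketch under a stated assumption: the PR does not say what the fallback returns, so here a failed pruning step yields the score with `pruned_text` left unset. The `score_with_fallback` name and the `prune_fn` callable are hypothetical.

```python
import logging

logger = logging.getLogger("xprovence_sketch")  # hypothetical logger name


def score_with_fallback(prune_fn, query, text, score):
    """Return (score, pruned_text); pruned_text is None if pruning raised."""
    try:
        pruned = prune_fn(query, text)
    except Exception:
        # Assumption: a pruning failure should not fail the rerank request.
        logger.warning("pruning failed; returning score only", exc_info=True)
        pruned = None
    return score, pruned
```

The reranking score is computed separately from the pruned text, so the request can still succeed when only pruning fails.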

References

sigridjineth force-pushed the provenance branch 3 times, most recently from 5631b2e to 89441fe on December 5, 2025 10:22
Add XProvence model integration for zero-cost context pruning in reranking.
XProvence removes irrelevant sentences from passages based on query relevance,
returning both reranking scores and pruned context.

Changes:
- Add XProvenceModel class with process() method for sentence-level pruning
- Add pruned_text field to Score type and HTTP response
- Pass raw_query/raw_text through tokenization pipeline for pruning
- Make flash_attn imports optional for XProvence compatibility
- Add XProvence architecture detection in router and Python backend
- Handle bfloat16 to float32 conversion for XProvence process() method

Configuration:
- XPROVENCE_THRESHOLD: Pruning threshold 0.0-1.0 (default: 0.3)
- XPROVENCE_ALWAYS_SELECT_TITLE: Keep first sentence as title (default: true)

Usage:
  XPROVENCE_THRESHOLD=0.3 text-embeddings-router \
    --model-id naver/xprovence-reranker-bgem3-v1 --port 8080
sigridjineth changed the title from "feat: xprovenance" to "feat: Add XProvence Context Pruning Support" on Dec 5, 2025
sigridjineth changed the title from "feat: Add XProvence Context Pruning Support" to "Add Support for XProvence Sentence-Level Context Pruning (naver/xprovence-reranker-bgem3-v1)" on Dec 5, 2025
sigridjineth marked this pull request as ready for review on December 5, 2025 10:32