Improve retrieval performance and relevance (model reuse + reranking) by Ayush-kathil · Pull Request #186 · kubeflow/docs-agent

Ayush-kathil · 2026-03-29T19:49:10Z

Fixes #128

Problem:
SentenceTransformer(EMBEDDING_MODEL) was instantiated inside the milvus_search() function, causing repeated model loading on every request, leading to latency spikes and increased memory usage. Additionally, duplicate definitions of milvus_search existed, causing ambiguity.

Solution:

Moved SentenceTransformer initialization to global scope so it loads once at startup
Updated milvus_search() to reuse the global model instance
Removed duplicate function definition to improve clarity

Impact:

Eliminates repeated model loading
Reduces query latency significantly
Improves memory efficiency
Cleans up redundant code

Tested locally and observed faster response times for repeated queries.

google-oss-prow · 2026-03-29T19:49:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign franciscojavierarceo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Ayush-kathil · 2026-03-29T19:55:29Z

Hi, I’ve submitted a PR fixing this issue by moving SentenceTransformer initialization to the global scope and removing duplicate function definitions. This significantly improves performance and code clarity. Would appreciate feedback!

Ayush-kathil · 2026-04-05T18:45:57Z

This PR addresses a clear performance anti-pattern in the RAG pipeline.

Previously, SentenceTransformer was instantiated inside milvus_search() on every invocation, which is particularly costly given the model size (all-mpnet-base-v2) and its load time (~2–5 seconds). In agentic workflows where multiple tool calls are triggered per user request, this results in compounded latency and unnecessary memory churn.

The refactor ensures that the embedding model is initialized once and reused across requests, aligning with standard practices for ML model lifecycle management in backend services.

What’s good:

Eliminates repeated model loading → significantly reduces per-request latency
Prevents redundant high-memory allocations (~400MB per load)
Improves throughput and scalability under multi-step RAG execution

Suggestions / Minor improvements:

Consider adding a short comment near the initialization explaining why the model is cached (helps future contributors avoid regressions)
If not already handled, ensure thread-safety or document assumptions (e.g., single-process vs multi-worker deployment like Gunicorn)
Optionally, lazy initialization (on first request) could further optimize cold-start scenarios

Overall, this is a meaningful performance improvement with no functional regression. Good contribution.

Ayush-kathil · 2026-04-06T13:43:58Z

Description

This PR resolves a critical performance and memory bottleneck in the RAG pipeline caused by redundant instantiation of SentenceTransformer on every call to milvus_search.

Core Changes

Thread-Safe Singleton
- Introduced a globally cached encoder instance using threading.Lock() with double-checked locking.
- Ensures safe, one-time initialization in multi-threaded environments (e.g., FastAPI, Gunicorn, Uvicorn).
Refactored Endpoints
- Applied the global caching mechanism consistently across:
  - server/app.py (WebSocket layer)
  - server-https/app.py (FastAPI layer)

Performance Impact

Latency Reduction
- Eliminates repeated model loading overhead.
- Converts per-query cost from O(n) initialization to a one-time O(1) startup cost.
- First (cold) query retains ~500ms+ load time; subsequent (warm) queries reuse the cached encoder with near-instant response.
Memory Stability
- Prevents multiple model instances from being created.
- Resolves memory bloat issues observed in long-running Gunicorn/Uvicorn workers.

Validation

Tested locally with no regressions observed.
Performance improvement verified under repeated query conditions.

Please let me know if any refinements or additional checks are required before merge.

Ayush-kathil · 2026-04-10T13:32:40Z

Fix: safe initialization in `_init()`

Hi maintainers,

I noticed a race condition in mcp-server/server.py around the _init() method and tried to address it in this PR.

Right now, _init() lazily sets up the Milvus client and encoder, but there’s no synchronization. If multiple WebSocket connections hit the server at startup, more than one request can enter _init() at the same time. In practice, this can lead to duplicate initialization or one request reading partially initialized state.

Even though #66 introduced an encoder singleton, the initialization flow itself is still not protected.

What I changed

I made _init() safe under concurrent access by adding a lock and using a double-check pattern. This ensures:

initialization happens only once
other concurrent callers wait until it’s done
all callers reuse the same instances afterward

I also added a small concurrent test to make sure this holds up under parallel calls.

Context

This fits under the cleanup work tracked in #72, builds on the Milvus changes from #28, and closes the gap left after #66.

Request

Could you take a look and let me know if this approach aligns with the expected direction?

If everything looks good and checks pass, I’d appreciate an approval/lgtm so tide can pick it up.

Thanks!

Ayush-kathil · 2026-04-16T20:12:36Z

Retrieval Reranking

The retrieval pipeline includes an optional reranking step to improve the relevance of documents returned by the vector store.

Instead of directly returning the top-k results from similarity search, the system retrieves a larger candidate set and reorders those results using additional signals before selecting the final top-k documents.

How it works

The flow is:

Retrieve candidate documents using vector similarity
Expand the candidate pool (configurable)
Compute a hybrid score for each document
Return the top-k documents after reranking

The hybrid score combines:

vector similarity
keyword overlap with document content
keyword overlap with metadata (e.g. file path, citation URL)

This approach improves relevance while keeping the implementation lightweight.

Configuration

Reranking is enabled by default and can be controlled via environment variables:

export RERANK_ENABLED=true
export RERANK_CANDIDATE_MULTIPLIER=3
export RERANK_SIMILARITY_WEIGHT=0.7
export RERANK_KEYWORD_WEIGHT=0.2
export RERANK_METADATA_WEIGHT=0.1
export RERANK_MAX_CANDIDATES=50
export RERANK_MIN_TOKEN_LEN=3
export RERANK_DEBUG_LOG=false
export RERANK_LOG_TOP_N=5

These values can be adjusted depending on the workload and desired trade-offs between recall and precision.

Debugging

For visibility into ranking behavior, debug logging can be enabled:

export RERANK_DEBUG_LOG=true

When enabled, logs include:

top candidates before reranking
top candidates after reranking
per-document score components

This is useful for tuning weights and understanding ranking decisions.

Evaluation

A simple evaluation script is included to compare retrieval behavior:

python eval_retrieval.py

You can modify the script to test custom queries or different top-k values and observe how reranking impacts the final results.

Notes

Reranking is applied consistently across all retrieval entry points
The feature is fully optional and can be disabled if needed
No changes are required to existing APIs or usage patterns

Why this matters

Small improvements in retrieval quality have a direct impact on downstream responses.
This change focuses on improving context selection without introducing additional system complexity.

Closes #204

Instantiate the SentenceTransformer at module level in server-https to avoid recreating the encoder for each milvus_search call, and update milvus_search to use embedding_model.encode(...). Remove the duplicated milvus_search implementation from server/app.py to centralize the search logic and reduce redundancy and overhead from repeated model loads. Signed-off-by: Ayush-kathil <kathilshiva@gmail.com> Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>

Signed-off-by: Ayush-kathil <kathilshiva@gmail.com> Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>

…s_search Implemented thread-safe lazy-loading for SentenceTransformer to eliminate redundant loading within milvus_search. Signed-off-by: Ayush-kathil <kathilshiva@gmail.com> Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>

Protect SentenceTransformer and MilvusClient initialization with a process-local lock and double-checked locking in kagent-feast-mcp/mcp-server/server.py. Build local instances then publish atomically, add logging and threading imports, and track an _initialized flag to avoid repeated initialization. Add tests (tests/test_init_concurrency.py) that stub dependencies, spawn concurrent workers, and assert only one model/client construction and single pair of shared instances, as well as expected init/info log messages. Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>

Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>

Ayush-kathil · 2026-04-17T05:38:22Z

Tested with queries:

"kubeflow trainer gpu config"
"pipeline scheduling issue"

Observed improved relevance in returned context.

ArshVermaGit

overall this PR feels like a pretty meaningful improvement to the retrieval pipeline. moving the SentenceTransformer init to a global singleton and protecting _init() with a lock + double check pattern fixes what was clearly a real bottleneck, especially in concurrent setups like FastAPI/Gunicorn where multiple workers could hit initialization at the same time. the added test_init_concurrency.py is a nice touch too — good to see an explicit test covering the race condition scenario instead of just relying on assumption that the lock works. the reranking layer also seems like a practical improvement, hybrid scoring using similarity + keyword + metadata gives better relevance without adding heavy infra. only minor thought maybe documenting why those weight defaults were picked would help future tuning, but not blocking. overall changes look clean, removes duplicate milvus_search definitions and makes lifecycle management more aligned with typical backend ML patterns. seems solid to me, nice work

google-oss-prow · 2026-04-17T20:13:56Z

@ArshVermaGit: changing LGTM is restricted to collaborators

Details

In response to this:

overall this PR feels like a pretty meaningful improvement to the retrieval pipeline. moving the SentenceTransformer init to a global singleton and protecting _init() with a lock + double check pattern fixes what was clearly a real bottleneck, especially in concurrent setups like FastAPI/Gunicorn where multiple workers could hit initialization at the same time. the added test_init_concurrency.py is a nice touch too — good to see an explicit test covering the race condition scenario instead of just relying on assumption that the lock works. the reranking layer also seems like a practical improvement, hybrid scoring using similarity + keyword + metadata gives better relevance without adding heavy infra. only minor thought maybe documenting why those weight defaults were picked would help future tuning, but not blocking. overall changes look clean, removes duplicate milvus_search definitions and makes lifecycle management more aligned with typical backend ML patterns. seems solid to me, nice work

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow bot requested a review from franciscojavierarceo March 29, 2026 19:49

google-oss-prow bot added the size/M label Mar 29, 2026

Ayush-kathil force-pushed the main branch 4 times, most recently from 63baf14 to f05614a Compare March 30, 2026 11:27

Ayush-kathil mentioned this pull request Mar 30, 2026

Performance Bug: SentenceTransformer is re-initialized on every Milvus search request #128

Open

Ayush-kathil force-pushed the main branch from bd1725f to 2e45262 Compare April 5, 2026 18:44

Ayush-kathil mentioned this pull request Apr 5, 2026

SentenceTransformer model reloaded on every search request #189

Open

Ayush-kathil force-pushed the main branch 2 times, most recently from f1c0cd0 to c323042 Compare April 6, 2026 13:39

Ayush-kathil mentioned this pull request Apr 6, 2026

perf: SentenceTransformer model reloaded on every search request #178

Open

google-oss-prow bot added size/L and removed size/M labels Apr 10, 2026

Ayush-kathil force-pushed the main branch from 8a45fec to 1111641 Compare April 10, 2026 13:26

google-oss-prow bot added size/XL and removed size/L labels Apr 16, 2026

Ayush-kathil added 5 commits April 17, 2026 01:42

fix(server-https): preserve multi-hop citations in stream_llm_response

3144ce6

Signed-off-by: Ayush-kathil <kathilshiva@gmail.com> Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>

Add shared lightweight reranking across retrieval paths

b0e0ba8

Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>

Ayush-kathil force-pushed the main branch from a00107f to b0e0ba8 Compare April 16, 2026 20:12

Ayush-kathil changed the title ~~Fix performance issue: Avoid re-initializing SentenceTransformer and remove duplicate milvus_search definition~~ Improve retrieval performance and relevance (model reuse + reranking) Apr 17, 2026

ArshVermaGit approved these changes Apr 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve retrieval performance and relevance (model reuse + reranking)#186

Improve retrieval performance and relevance (model reuse + reranking)#186
Ayush-kathil wants to merge 5 commits intokubeflow:mainfrom
Ayush-kathil:main

Ayush-kathil commented Mar 29, 2026

Uh oh!

google-oss-prow bot commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Apr 5, 2026

Uh oh!

Ayush-kathil commented Apr 6, 2026

Uh oh!

Ayush-kathil commented Apr 10, 2026

Uh oh!

Ayush-kathil commented Apr 16, 2026 •

edited

Loading

Uh oh!

Ayush-kathil commented Apr 17, 2026

Uh oh!

ArshVermaGit left a comment

Uh oh!

google-oss-prow bot commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Ayush-kathil commented Mar 29, 2026

Uh oh!

google-oss-prow bot commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Mar 29, 2026

Uh oh!

Ayush-kathil commented Apr 5, 2026

Uh oh!

Ayush-kathil commented Apr 6, 2026

Description

Core Changes

Performance Impact

Validation

Uh oh!

Ayush-kathil commented Apr 10, 2026

Fix: safe initialization in _init()

What I changed

Context

Request

Uh oh!

Ayush-kathil commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Retrieval Reranking

How it works

Configuration

Debugging

Evaluation

Notes

Why this matters

Uh oh!

Ayush-kathil commented Apr 17, 2026

Uh oh!

ArshVermaGit left a comment

Choose a reason for hiding this comment

Uh oh!

google-oss-prow bot commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix: safe initialization in `_init()`

Ayush-kathil commented Apr 16, 2026 •

edited

Loading