Improve retrieval performance and relevance (model reuse + reranking)#186
Ayush-kathil wants to merge 5 commits into kubeflow:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: (no approvers yet). The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Hi, I’ve submitted a PR fixing this issue by moving SentenceTransformer initialization to the global scope and removing duplicate function definitions. This significantly improves performance and code clarity. Would appreciate feedback!
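The refactor described here is a small pattern change; the sketch below illustrates it with a stand-in class so it runs without downloading model weights (in the real code, `EmbeddingModel` would be `SentenceTransformer(EMBEDDING_MODEL)`, and the Milvus client call is omitted):

```python
class EmbeddingModel:
    """Stand-in for SentenceTransformer(EMBEDDING_MODEL) so the sketch runs offline."""
    def encode(self, text):
        # The real model returns a dense embedding vector; a length-based stub here.
        return [float(len(text))]

# Before the fix, the model was constructed inside milvus_search(), so every
# request paid the full model-load cost. Constructing it once at module import
# means all requests share a single instance.
embedding_model = EmbeddingModel()

def milvus_search(query, top_k=5):
    # Reuse the module-level model instead of re-instantiating it per call.
    query_vector = embedding_model.encode(query)
    # ...in the real service, query_vector is passed to the Milvus client...
    return query_vector
```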
Force-pushed from 63baf14 to f05614a.
This PR addresses a clear performance anti-pattern in the RAG pipeline. Previously, the SentenceTransformer model was instantiated inside milvus_search() on every request. The refactor ensures that the embedding model is initialized once and reused across requests, aligning with standard practices for ML model lifecycle management in backend services.

What’s good:

Suggestions / Minor improvements:

Overall, this is a meaningful performance improvement with no functional regression. Good contribution.
Force-pushed from f1c0cd0 to c323042.
### Description

This PR resolves a critical performance and memory bottleneck in the RAG pipeline caused by redundant instantiation of SentenceTransformer inside the search path.

### Core Changes
### Performance Impact
### Validation
Please let me know if any refinements or additional checks are required before merge.
Fix: safe initialization in
## Retrieval Reranking

The retrieval pipeline includes an optional reranking step to improve the relevance of documents returned by the vector store. Instead of directly returning the top-k results from similarity search, the system retrieves a larger candidate set and reorders those results using additional signals before selecting the final top-k documents.

### How it works

The flow is:

1. Run the vector similarity search for a larger candidate set (`top_k × RERANK_CANDIDATE_MULTIPLIER`, capped at `RERANK_MAX_CANDIDATES`).
2. Score each candidate with a hybrid score.
3. Return the final top-k documents ordered by hybrid score.
The hybrid score combines:

- vector similarity from the original search (`RERANK_SIMILARITY_WEIGHT`)
- keyword overlap between the query and the document text (`RERANK_KEYWORD_WEIGHT`)
- metadata signals (`RERANK_METADATA_WEIGHT`)
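A minimal sketch of such a hybrid scorer, using the default weights from the configuration below. The candidate field names (`text`, `similarity`, `metadata_score`) and the token-overlap heuristic are illustrative assumptions, not the exact implementation:

```python
def rerank(query, candidates, top_k,
           w_sim=0.7, w_kw=0.2, w_meta=0.1, min_token_len=3):
    """Reorder vector-store candidates by a weighted hybrid score.

    candidates: dicts with 'text', 'similarity' (0..1 from the vector
    store), and an optional 'metadata_score'. Weights mirror the
    RERANK_* defaults; field names are illustrative.
    """
    q_tokens = {t.lower() for t in query.split() if len(t) >= min_token_len}
    scored = []
    for c in candidates:
        d_tokens = {t.lower() for t in c["text"].split() if len(t) >= min_token_len}
        # Fraction of query tokens that appear in the document.
        kw = len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0
        score = (w_sim * c["similarity"]
                 + w_kw * kw
                 + w_meta * c.get("metadata_score", 0.0))
        scored.append((score, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]
```

Only the final `top_k` documents survive, so a larger candidate multiplier gives the keyword and metadata signals more material to work with.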
This approach improves relevance while keeping the implementation lightweight.

### Configuration

Reranking is enabled by default and can be controlled via environment variables:

```shell
export RERANK_ENABLED=true
export RERANK_CANDIDATE_MULTIPLIER=3
export RERANK_SIMILARITY_WEIGHT=0.7
export RERANK_KEYWORD_WEIGHT=0.2
export RERANK_METADATA_WEIGHT=0.1
export RERANK_MAX_CANDIDATES=50
export RERANK_MIN_TOKEN_LEN=3
export RERANK_DEBUG_LOG=false
export RERANK_LOG_TOP_N=5
```

These values can be adjusted depending on the workload and desired trade-offs between recall and precision.

### Debugging

For visibility into ranking behavior, debug logging can be enabled:

```shell
export RERANK_DEBUG_LOG=true
```

When enabled, logs include:
This is useful for tuning weights and understanding ranking decisions.

### Evaluation

A simple evaluation script is included to compare retrieval behavior:

```shell
python eval_retrieval.py
```

You can modify the script to test custom queries or different configurations.

### Notes
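The `RERANK_*` settings above are plain environment variables, so a service can load them with the documented defaults at startup. A hedged sketch (the `_env_float` helper is illustrative, not from the repository):

```python
import os

def _env_float(name, default):
    """Illustrative helper: environment values always arrive as strings."""
    return float(os.getenv(name, str(default)))

# Defaults mirror the documented configuration.
RERANK_ENABLED = os.getenv("RERANK_ENABLED", "true").lower() == "true"
RERANK_CANDIDATE_MULTIPLIER = int(os.getenv("RERANK_CANDIDATE_MULTIPLIER", "3"))
RERANK_SIMILARITY_WEIGHT = _env_float("RERANK_SIMILARITY_WEIGHT", 0.7)
RERANK_KEYWORD_WEIGHT = _env_float("RERANK_KEYWORD_WEIGHT", 0.2)
RERANK_METADATA_WEIGHT = _env_float("RERANK_METADATA_WEIGHT", 0.1)
RERANK_MAX_CANDIDATES = int(os.getenv("RERANK_MAX_CANDIDATES", "50"))
RERANK_MIN_TOKEN_LEN = int(os.getenv("RERANK_MIN_TOKEN_LEN", "3"))
RERANK_DEBUG_LOG = os.getenv("RERANK_DEBUG_LOG", "false").lower() == "true"
RERANK_LOG_TOP_N = int(os.getenv("RERANK_LOG_TOP_N", "5"))
```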
### Why this matters

Small improvements in retrieval quality have a direct impact on downstream responses.

Closes #204
Instantiate the SentenceTransformer at module level in server-https to avoid recreating the encoder for each milvus_search call, and update milvus_search to use embedding_model.encode(...). Remove the duplicated milvus_search implementation from server/app.py to centralize the search logic and reduce redundancy and overhead from repeated model loads.

Signed-off-by: Ayush-kathil <kathilshiva@gmail.com>
Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>
Implemented thread-safe lazy-loading for SentenceTransformer to eliminate redundant loading within milvus_search.

Signed-off-by: Ayush-kathil <kathilshiva@gmail.com>
Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>
Protect SentenceTransformer and MilvusClient initialization with a process-local lock and double-checked locking in kagent-feast-mcp/mcp-server/server.py. Build local instances, then publish them atomically; add logging and threading imports, and track an _initialized flag to avoid repeated initialization. Add tests (tests/test_init_concurrency.py) that stub dependencies, spawn concurrent workers, and assert that only one model/client construction occurs and a single pair of shared instances is produced, along with the expected init/info log messages.

Signed-off-by: Ayush Kathil <kathilshiva@gmail.com>
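The locking scheme this commit describes can be sketched as follows. Factory callables are injected here so the pattern runs without the real SentenceTransformer/MilvusClient dependencies; in server.py the factories would construct those objects directly:

```python
import threading

_lock = threading.Lock()
_initialized = False
_model = None
_client = None

def init_once(model_factory, client_factory):
    """One-time, thread-safe initialization via double-checked locking."""
    global _initialized, _model, _client
    if _initialized:          # fast path: skip the lock after first init
        return _model, _client
    with _lock:
        if not _initialized:  # re-check: another thread may have won the race
            # Build local instances first, then publish atomically.
            model = model_factory()
            client = client_factory()
            _model, _client = model, client
            _initialized = True
    return _model, _client
```

The fast path avoids lock contention on every request once initialization has happened, while the re-check inside the lock guarantees only one thread ever runs the expensive construction.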
Tested with queries:
Observed improved relevance in returned context.
ArshVermaGit
left a comment
Overall, this PR feels like a pretty meaningful improvement to the retrieval pipeline. Moving the SentenceTransformer init to a global singleton and protecting _init() with a lock plus double-check pattern fixes what was clearly a real bottleneck, especially in concurrent setups like FastAPI/Gunicorn where multiple workers could hit initialization at the same time. The added test_init_concurrency.py is a nice touch too: good to see an explicit test covering the race-condition scenario instead of just relying on the assumption that the lock works. The reranking layer also seems like a practical improvement; hybrid scoring using similarity + keyword + metadata gives better relevance without adding heavy infra. One minor thought: documenting why those weight defaults were picked would help future tuning, but that's not blocking. Overall the changes look clean, remove the duplicate milvus_search definitions, and make lifecycle management more aligned with typical backend ML patterns. Seems solid to me, nice work.
@ArshVermaGit: changing LGTM is restricted to collaborators.

In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Fixes #128
Problem:
SentenceTransformer(EMBEDDING_MODEL) was instantiated inside the milvus_search() function, causing repeated model loading on every request, leading to latency spikes and increased memory usage. Additionally, duplicate definitions of milvus_search existed, causing ambiguity.
Solution:

- Instantiate the SentenceTransformer once at module level, with thread-safe, double-checked lazy initialization, and reuse it across milvus_search calls.
- Remove the duplicate milvus_search definition to centralize the search logic.
Impact:

- The embedding model is loaded once per process, eliminating per-request load latency and redundant memory usage.
Tested locally and observed faster response times for repeated queries.
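The speedup from loading once is easy to reproduce locally without the real model; in this sketch a short sleep stands in for loading model weights, and `lru_cache` plays the role of the module-level singleton (absolute timings will vary by machine):

```python
import time
from functools import lru_cache

class DummyModel:
    """Stand-in for the embedding model; construction is deliberately slow."""
    def __init__(self):
        time.sleep(0.05)  # simulate loading model weights
    def encode(self, text):
        return [float(len(text))]

@lru_cache(maxsize=1)
def get_model():
    # Constructed once per process, like the module-level singleton in the PR.
    return DummyModel()

def search_per_call(query):
    return DummyModel().encode(query)  # old behavior: load on every request

def search_cached(query):
    return get_model().encode(query)   # new behavior: shared instance

start = time.perf_counter()
for q in ["alpha", "beta", "gamma"]:
    search_per_call(q)
per_call_time = time.perf_counter() - start

start = time.perf_counter()
for q in ["alpha", "beta", "gamma"]:
    search_cached(q)
cached_time = time.perf_counter() - start

print(f"per-call: {per_call_time:.2f}s  cached: {cached_time:.2f}s")
```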