[Proposal] Stream-Aware RAG Plugin: Dynamic retrieval with hybrid fallback #37320
Replies: 4 comments 12 replies
Please check the
@dosu Could you check my proposal's viability and offer some further suggestions? Thank you.
@dosu Thank you for replying! Your comment is truly inspiring, though I'm not sure whether your modified plan is still worth pursuing and would add value to Dify. stream-aware-rag.md If you feel my plan is not worth pursuing, please read the following file and suggest some new projects:
The hybrid fallback chain is a solid architecture. A few thoughts on making each tier more effective, based on production experience with similar patterns.
On the retrieval strategy itself: rather than treating vector-first with BM25 as a fallback chain (try vector, if insufficient fall back to BM25, then keyword, then pure LLM), the real power comes from fusion, that is, running both retrieval paths simultaneously and combining results with weighted scoring. A typical effective split is 0.3 BM25 plus 0.7 vector similarity, followed by cross-encoder reranking on the merged result set. This catches cases where vector search misses exact terminology (product names, error codes, specific numbers) that BM25 handles well, and cases where BM25 misses semantically equivalent but differently worded content that vector search handles well. Running them in parallel rather than as a fallback chain means you get the benefit of both on every query, not just when the primary fails.
The stream-aware aspect of the proposal (dynamically retrieving as the conversation evolves) is where things get interesting and where memory comes in. As the conversation progresses, the retrieval context should shift to reflect accumulated context, not just the latest user message. We handle this by maintaining a working memory of the evolving conversation context and re-ranking retrieved results against this accumulated context rather than just the initial query. Concretely: after each turn, update a running summary of what the conversation is about (topics, entities mentioned, constraints established), and use that summary as an additional signal when scoring retrieval results. A document that was irrelevant to the first query might become highly relevant by turn three once the user has narrowed the topic.
One practical consideration for the confidence-based retrieval trigger (your V2 target): perplexity/entropy thresholds need calibration per domain and per model. A model generating text about a familiar topic will have low perplexity regardless of whether it is hallucinating about specifics. In practice, we found that tracking retrieval hit rate (did the retrieved content actually get used in the response) is a better feedback signal than generation-time perplexity for deciding when retrieval adds value.
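As a rough illustration of the weighted fusion described above, here is a minimal sketch. The 0.3/0.7 weights come from the comment; the min-max normalization step and the function names `min_max` and `fuse` are assumptions added for illustration, since raw BM25 scores and cosine similarities are not on the same scale. Cross-encoder reranking would run on the output and is not shown.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1] so BM25 and vector scores are comparable (assumed step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie; treat them as equally strong.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(
    bm25: Dict[str, float],
    vector: Dict[str, float],
    w_bm25: float = 0.3,
    w_vector: float = 0.7,
) -> List[Tuple[str, float]]:
    """Combine both result sets; a document missing from one path scores 0 there."""
    b, v = min_max(bm25), min_max(vector)
    fused = {
        doc_id: w_bm25 * b.get(doc_id, 0.0) + w_vector * v.get(doc_id, 0.0)
        for doc_id in set(b) | set(v)
    }
    # Highest fused score first; this list is what a reranker would receive.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


ranked = fuse({"a": 10.0, "b": 2.0}, {"b": 0.9, "c": 0.1})
```

In this toy input, document `a` is found only by BM25 and `c` only by vector search, yet both still appear in the merged list, which is the point of running the two paths in parallel.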
Content
I'm building a plugin to bring Self-RAG concepts into Dify in a practical, stable way.
The Problem
Standard RAG fails on multi-step reasoning when initial context isn't enough.
Agent RAG relies on the LLM's "self-awareness" to trigger retrieval; when the model fails to recognize a knowledge gap, it skips retrieval and hallucinates.
API Limits: Standard LLM APIs don't allow mid-stream prompt injection. We need a robust fallback strategy, not a fragile single path.
The Solution: Hybrid Triggers & Fallbacks
Instead of one method, the plugin dynamically checks model confidence with a built-in fallback chain:
Target (V2): Chunk-level Confidence Check. Evaluate Perplexity/Entropy every N tokens via an external GPU service.
Fallback 1 (MVP): Prompt-Guided. LLM outputs a special tag (e.g., [RETRIEVE: query]). Plugin intercepts, fetches data, and continues.
Fallback 2: Keyword Catch. Triggers on words like "specifically" or "explain".
Fallback 3: Standard RAG. One-shot retrieval before generation.
Fallback 4: Pure LLM. If the vector DB is down, LLM answers from memory to prevent errors.
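To make the trigger tiers more concrete, here is a minimal sketch of how Fallbacks 1 to 3 might be distinguished. The `[RETRIEVE: query]` tag format and the example trigger words come from the list above; the function names, the regular expression, and the choice to run the keyword check on the user message are assumptions for illustration, not part of the Dify Plugin SDK.

```python
import re
from typing import Optional, Tuple

# Matches the special tag from Fallback 1, e.g. "[RETRIEVE: rate limits]".
RETRIEVE_TAG = re.compile(r"\[RETRIEVE:\s*(?P<query>[^\]]+)\]")
# Example trigger words from Fallback 2; a real list would be configurable.
TRIGGER_WORDS = ("specifically", "explain")


def find_retrieve_tag(model_output: str) -> Optional[str]:
    """Return the query inside the first [RETRIEVE: ...] tag, if any."""
    match = RETRIEVE_TAG.search(model_output)
    return match.group("query").strip() if match else None


def keyword_trigger(text: str) -> bool:
    """True if any trigger word appears as a whole word (case-insensitive)."""
    lowered = text.lower()
    return any(re.search(rf"\b{re.escape(w)}\b", lowered) for w in TRIGGER_WORDS)


def choose_trigger(model_output: str, user_message: str) -> Tuple[str, str]:
    """Pick the first tier that fires: tag, then keyword, then standard RAG."""
    query = find_retrieve_tag(model_output)
    if query:
        return ("tag", query)
    if keyword_trigger(user_message):
        return ("keyword", user_message)
    return ("standard", user_message)
```

For example, `choose_trigger("The limit is [RETRIEVE: plan X rate limit]", "hi")` selects the tag tier with the query `plan X rate limit`, while a plain answer to "Explain the error" falls through to the keyword tier.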
Tech Stack & Architecture
Zero-Intrusion: 100% Dify Plugin System. No core code modifications.
Async & Fast: Uses asyncio + httpx. No Celery (keeps streaming latency low).
State: Redis for caching retrieval results and tracking external GPU health.
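The async path with graceful degradation could look roughly like the sketch below. It uses only `asyncio` so it stays self-contained: the `retriever` coroutine stands in for an `httpx` call to the vector DB, and a plain dict stands in for Redis. The function name `retrieve_or_degrade` and the 0.5 second timeout are hypothetical choices, not values from the proposal.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# Stand-in type for an async call to the vector DB (assumed interface).
Retriever = Callable[[str], Awaitable[List[str]]]


async def retrieve_or_degrade(
    query: str,
    retriever: Retriever,
    cache: Dict[str, List[str]],
    timeout_s: float = 0.5,
) -> Optional[List[str]]:
    """Return cached or fresh documents, or None to signal the pure-LLM fallback."""
    if query in cache:
        return cache[query]
    try:
        # Bound retrieval time so a slow backend cannot stall the stream.
        docs = await asyncio.wait_for(retriever(query), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Fallback 4: the caller answers from model memory instead of erroring.
        return None
    cache[query] = docs
    return docs
```

Returning `None` rather than raising keeps the decision to fall back in one place: the caller checks the result once and either injects the documents or proceeds without them.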
Roadmap
Phase 1 (MVP): Build Fallback 1-4. Works perfectly with the current Plugin SDK via multi-turn orchestration.
Phase 2: Integrate the V2 Chunk-level Confidence Check using an external GPU service.
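For the Phase 2 confidence check, a minimal sketch of the per-chunk computation is shown here, assuming the external GPU service returns one natural-log probability per generated token. Perplexity is the exponential of the negative mean log-probability; the function names, the chunk size, and the threshold are hypothetical and would need calibration.

```python
import math
from typing import List


def chunk_perplexity(logprobs: List[float]) -> float:
    """Perplexity of a chunk: exp of the negative mean token log-probability."""
    if not logprobs:
        raise ValueError("chunk must contain at least one token")
    return math.exp(-sum(logprobs) / len(logprobs))


def low_confidence_chunks(
    logprobs: List[float], chunk_size: int, threshold: float
) -> List[int]:
    """Return indices of chunks whose perplexity exceeds the (assumed) threshold."""
    flagged = []
    for start in range(0, len(logprobs), chunk_size):
        chunk = logprobs[start : start + chunk_size]
        if chunk_perplexity(chunk) > threshold:
            flagged.append(start // chunk_size)
    return flagged


# Four confident tokens (p = 0.5) followed by four uncertain ones (p = 0.1).
sample = [math.log(0.5)] * 4 + [math.log(0.1)] * 4
flagged = low_confidence_chunks(sample, chunk_size=4, threshold=5.0)
```

Here the first chunk has perplexity 2 and the second has perplexity 10, so only the second chunk would trigger retrieval at a threshold of 5.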
I've also attached my skills.md files here. They are in Chinese to better suit the community's requirements:
stream-aware-rag.md
dify-innovation.md