Replies: 1 comment
Hi @sharpdima, this is a well-documented pain point. Here's what I found:
Short answer: There is no built-in "Merge Datasets" feature or API in Dify. There's no endpoint to move or copy documents between datasets without re-uploading and re-embedding [1].
Your 502 timeout issue is a confirmed performance problem. When you connect 10+ datasets to a Knowledge Retrieval node, the system spawns one thread per dataset with no hard limit, so 50 datasets means 50 simultaneous retrieval threads, each triggering independent HTTP calls and DB queries. Profiling has identified missing cross-request caching, no HTTP connection pooling, and redundant configuration rebuilds as root causes [2].
Immediate mitigations for your current setup:
Options for consolidating your 50 datasets:
For the database approach, at minimum you'd need to update
Several PRs have improved multi-dataset retrieval performance (e.g., Redis caching for model schemas, 15–625x faster DB queries for dataset retrieval), but the core issue of unbounded concurrency with many datasets remains [2].
Your feature request for a native "Merge Datasets" option is a valid one; this gap has come up multiple times in the community.
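To make the unbounded-concurrency point concrete, here is a minimal Python sketch of the difference a worker cap makes. This is not Dify's actual code: `retrieve_all`, `PeakTracker`, the `book-N` IDs, and the `max_workers=8` value are all hypothetical names and numbers chosen for illustration. It only shows that a bounded thread pool still queries all 50 datasets while keeping the number of simultaneous calls below a fixed ceiling.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def retrieve_all(dataset_ids, retrieve_one, max_workers=8):
    """Query every dataset, but never run more than max_workers at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises any worker exception.
        return list(pool.map(retrieve_one, dataset_ids))


class PeakTracker:
    """Hypothetical stand-in retriever that records peak concurrency."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def __call__(self, dataset_id):
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # simulate an HTTP call or DB query
        with self._lock:
            self._active -= 1
        return f"chunks-from-{dataset_id}"


tracker = PeakTracker()
results = retrieve_all([f"book-{i}" for i in range(50)], tracker, max_workers=8)
print(len(results), "datasets queried; peak concurrent calls:", tracker.peak)
```

With one thread per dataset the peak would be able to reach 50; with the cap it cannot exceed `max_workers`. The trade-off is higher total latency per request, so the right ceiling depends on your backend's capacity.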
Content
Hi Dify Team,
I am currently developing a large-scale RAG assistant, and we have encountered a structural bottleneck. Our data entry team has processed and uploaded around 50 books into Dify, but unfortunately, they created a separate Dataset (Knowledge Base) for each individual book.
Now, when we connect all 50 datasets to a single Knowledge Retrieval node in our Chatflow (using Multi-path retrieval, Hybrid Search, and a Rerank model), the system struggles. Sending parallel requests to 50 different datasets simultaneously causes severe bottlenecks, often resulting in a 502 Bad Gateway or timeout error from the API.
To optimize the architecture and retrieval speed, we need to consolidate these 50 datasets into just two Master Datasets (e.g., one for Persian resources and one for English).
My questions are:
Is there currently any built-in feature, API method, or database script to merge existing datasets into a single one?
We want to avoid asking our operator to manually re-upload and re-embed hundreds of documents from scratch. Is there any workaround for this?
If this feature does not exist yet, please consider this a strong feature request. Having a "Merge Datasets" option would be incredibly helpful for scaling projects and fixing architectural mistakes without wasting embedding tokens or manual labor.
Thank you for your amazing work on Dify.