Feature Request / Help: Merging Multiple Datasets to Avoid 502 Timeout Errors #36905

sharpdima · 2026-06-01T16:40:29Z


Self Checks

I have searched for existing issues, including closed ones.
I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
[FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
Please do not modify this template :) and fill in all the required fields.

Content

Hi Dify Team,

I am currently developing a large-scale RAG assistant, and we have encountered a structural bottleneck. Our data entry team has processed and uploaded around 50 books into Dify, but unfortunately, they created a separate Dataset (Knowledge Base) for each individual book.

Now, when we connect all 50 datasets to a single Knowledge Retrieval node in our Chatflow (using Multi-path retrieval, Hybrid Search, and a Rerank model), the system struggles. Sending parallel requests to 50 different datasets simultaneously causes severe bottlenecks, often resulting in a 502 Bad Gateway or timeout error from the API.

To optimize the architecture and retrieval speed, we need to consolidate these 50 datasets into just two Master Datasets (e.g., one for Persian resources and one for English).

My questions are:

Is there currently any built-in feature, API method, or database script to merge existing datasets into a single one?

We want to avoid asking our operator to manually re-upload and re-embed hundreds of documents from scratch. Is there any workaround for this?

If this feature does not exist yet, please consider this a strong feature request. Having a "Merge Datasets" option would be incredibly helpful for scaling projects and fixing architectural mistakes without wasting embedding tokens or manual labor.

Thank you for your amazing work on Dify.

@sharpdima · 2026-06-01T16:48:46Z

dosubot[bot]
Bot Jun 1, 2026

Hi @sharpdima, this is a well-documented pain point. Here's what I found:

Short answer: There is no built-in "Merge Datasets" feature or API in Dify. There's no endpoint to move or copy documents between datasets without re-uploading and re-embedding [1].

Your 502 timeout issue is a confirmed performance problem. When you connect 10+ datasets to a Knowledge Retrieval node, the system spawns one thread per dataset with no hard limit — so 50 datasets means 50 simultaneous retrieval threads, each triggering independent HTTP calls and DB queries. Profiling has identified missing cross-request caching, no HTTP connection pooling, and redundant configuration rebuilds as root causes [2].
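To illustrate why the number of worker threads matters, here is a minimal Python sketch of a capped fan-out. It is not Dify's retrieval code: `retrieve_all` and `ConcurrencyProbe` are hypothetical names, and the probe only simulates a retrieval call so the peak concurrency can be observed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def retrieve_all(dataset_ids, retrieve_one, max_workers):
    """Call retrieve_one(dataset_id) for every dataset using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order, so results line up with dataset_ids.
        return list(pool.map(retrieve_one, dataset_ids))


class ConcurrencyProbe:
    """Hypothetical stand-in for a retrieval call; records how many calls overlap."""

    def __init__(self, delay=0.01):
        self.delay = delay
        self.active = 0
        self.peak = 0
        self._lock = threading.Lock()

    def __call__(self, dataset_id):
        with self._lock:
            self.active += 1
            self.peak = max(self.peak, self.active)
        time.sleep(self.delay)  # simulate the HTTP call and DB query
        with self._lock:
            self.active -= 1
        return f"hits-for-{dataset_id}"


if __name__ == "__main__":
    probe = ConcurrencyProbe()
    retrieve_all([f"ds-{i}" for i in range(50)], probe, max_workers=8)
    print("peak concurrent retrievals:", probe.peak)
```

With a cap in place, the 50 datasets are still all queried, but the extra requests wait in a queue instead of hitting the database and embedding services at the same moment.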

Immediate mitigations for your current setup:

Reduce RETRIEVAL_SERVICE_EXECUTORS env var to 4–8 to cap thread spawning
Increase the DB connection pool: SQLALCHEMY_POOL_SIZE=50, SQLALCHEMY_MAX_OVERFLOW=20
Keep datasets per node under 10 if possible [2]

Options for consolidating your 50 datasets:

Database-level migration (unsupported but possible): The Document model has a dataset_id column, and so do DocumentSegment and ChildChunk records. You could theoretically update these fields directly in PostgreSQL to move documents to a target dataset. However, this bypasses application logic and is not officially supported — back up your database first and ensure the target dataset uses the same embedding model.
API-based scripting: Use the Dify API to download documents from old datasets and re-upload them to a consolidated dataset. This unfortunately requires re-embedding but avoids manual work.
Split retrieval across multiple nodes: Instead of one node with 50 datasets, use multiple Knowledge Retrieval nodes (each with ≤10 datasets) and merge results with a Code node.
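For the split-retrieval option, the merge step in a Code node could look roughly like the sketch below. It assumes each Knowledge Retrieval node outputs a list of objects with a `content` field and a `metadata.score` field, and that the Code node receives two such lists as `results_a` and `results_b`; those input names, the `top_k` value, and the helper `merge_results` are illustrative, so check them against your own workflow's variables.

```python
def merge_results(result_lists, top_k=10):
    """Merge chunk lists from several retrieval nodes into one ranked list."""
    best = {}
    for chunks in result_lists:
        for chunk in chunks or []:
            content = chunk.get("content", "")
            score = (chunk.get("metadata") or {}).get("score") or 0.0
            # Keep only the highest-scoring copy of a duplicated chunk.
            if content not in best or score > best[content][0]:
                best[content] = (score, chunk)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]


def main(results_a: list, results_b: list) -> dict:
    # Assumed Code node entry point: one argument per upstream retrieval node.
    return {"result": merge_results([results_a, results_b], top_k=10)}
```

Scores produced by separate retrieval or rerank runs may not be directly comparable, so sorting by score across nodes is a heuristic rather than a true global ranking.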

For the database approach, at minimum you'd need to update dataset_id in these tables for each document:

documents
document_segments
child_chunks (if using parent-child chunking)
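A rough sketch of what those updates might look like, expressed as a Python helper that only builds parameterized SQL and does not connect to anything. The table names and the `dataset_id` column come from the list above; the document key columns (`id` on `documents`, `document_id` on the other two) and the function name `build_move_statements` are assumptions to verify against your actual schema. The sketch covers only these three tables; whether other stored data also needs to change is not established here.

```python
# (table, column that identifies the document) -- the key columns are assumptions.
TABLES = (
    ("documents", "id"),
    ("document_segments", "document_id"),
    ("child_chunks", "document_id"),
)


def build_move_statements(document_id, source_dataset_id, target_dataset_id):
    """Return (sql, params) pairs that repoint one document to the target dataset."""
    params = {
        "target": target_dataset_id,
        "source": source_dataset_id,
        "doc": document_id,
    }
    statements = []
    for table, doc_column in TABLES:
        # Named pyformat placeholders; the extra dataset_id filter guards
        # against touching rows that were already moved.
        sql = (
            f"UPDATE {table} SET dataset_id = %(target)s "
            f"WHERE {doc_column} = %(doc)s AND dataset_id = %(source)s"
        )
        statements.append((sql, dict(params)))
    return statements


if __name__ == "__main__":
    for sql, params in build_move_statements("doc-1", "old-ds", "new-ds"):
        print(sql, params)
```

If you try this, run all statements for a document inside a single transaction on a backed-up database, and only between datasets that use the same embedding model.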

Several PRs have improved multi-dataset retrieval performance (e.g., Redis caching for model schemas, 15–625x faster DB queries for dataset retrieval), but the core issue of unbounded concurrency with many datasets remains [2].

Your feature request for a native "Merge Datasets" option is a valid one — this gap has come up multiple times in the community.

