fix(rag): use doc_id dedup key for any provider, not only dify #35759

Merged
fatelei merged 4 commits into langgenius:main from ki3nd:fix/rag-dedup-ignore-provider-for-doc-id
May 5, 2026

Conversation

@ki3nd
Contributor

@ki3nd ki3nd commented May 1, 2026

Previously _deduplicate_documents forced doc_id=None for non-dify providers, so two chunks from different source documents with identical text were silently merged into one result via content-based dedup. Any downstream citation pointing at the second document was lost.

Fix: remove the is_dify guard and read doc_id from metadata regardless of provider. When doc_id is present the dedup key becomes (provider, doc_id), which correctly distinguishes chunks that share content but originate from different documents. The content-based fallback is preserved for documents that carry no doc_id.

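Also update the _deduplicate_documents docstring to reflect the new provider-agnostic rule, and add three targeted unit tests:

  • regression test for the exact bug scenario (issue #35707)
  • non-dify provider with colliding doc_ids keeps highest-score doc
  • dify provider without doc_id still falls back to content dedup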

Fixes #35707
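
The keying rule described above can be illustrated with a minimal sketch. This is not the code from the PR: `Chunk`, `dedup_key`, and `deduplicate` are hypothetical names, the `Chunk` class only stands in for the real document type, and the exact shape of the content-based fallback key is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved chunk (not Dify's Document class)."""
    page_content: str
    provider: str
    metadata: dict[str, Any] = field(default_factory=dict)


def dedup_key(chunk: Chunk) -> tuple[str, str, str]:
    """Key on (provider, doc_id) when doc_id exists, otherwise on content."""
    doc_id = chunk.metadata.get("doc_id")
    if doc_id:
        # No provider check here: any provider with a doc_id uses this key.
        return ("doc_id", chunk.provider, str(doc_id))
    # Fallback for chunks without doc_id; the exact fallback key is an assumption.
    return ("content", chunk.provider, chunk.page_content)


def deduplicate(chunks: list[Chunk]) -> list[Chunk]:
    """Drop duplicates, keeping first-seen order and the higher-scored duplicate."""
    best: dict[tuple[str, str, str], Chunk] = {}
    for chunk in chunks:
        key = dedup_key(chunk)
        kept = best.get(key)
        if kept is None:
            best[key] = chunk
            continue
        new_score = chunk.metadata.get("score")
        old_score = kept.metadata.get("score")
        # Replace only when the later duplicate has a usable, higher score.
        if isinstance(new_score, (int, float)) and (
            not isinstance(old_score, (int, float)) or new_score > old_score
        ):
            best[key] = chunk  # reuses the existing dict slot, so order is kept
    return list(best.values())
```

With this keying, two chunks from a non-dify provider that share identical text but carry different doc_id values produce different keys, so both survive and each citation stays attached to its own source document.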

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods

Previously _deduplicate_documents forced doc_id=None for non-dify
providers, so two chunks from different source documents with identical
text were silently merged into one result via content-based dedup. Any
downstream citation pointing at the second document was lost.

Fix: remove the is_dify guard and read doc_id from metadata regardless
of provider. When doc_id is present the dedup key becomes (provider,
doc_id), which correctly distinguishes chunks that share content but
originate from different documents. The content-based fallback is
preserved for documents that carry no doc_id.

Also update the _deduplicate_documents docstring to reflect the new
provider-agnostic rule, and add three targeted unit tests:
- regression test for the exact bug scenario (issue langgenius#35707)
- non-dify provider with colliding doc_ids keeps highest-score doc
- dify provider without doc_id still falls back to content dedup

Fixes langgenius#35707
@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label May 1, 2026
@github-actions
Contributor

github-actions Bot commented May 1, 2026

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-05-01 13:02:03.360813355 +0000
+++ /tmp/pyrefly_pr.txt	2026-05-01 13:01:54.127712107 +0000
@@ -4482,39 +4482,39 @@
 ERROR Object of class `FunctionType` has no attribute `call_count` [missing-attribute]
     --> tests/unit_tests/core/rag/rerank/test_reranker.py:1630:16
 ERROR Argument `list[float] | None` is not assignable to parameter `obj` with type `Sized` in function `len` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:1949:20
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2038:20
 ERROR Could not find name `metadata_name` [unknown-name]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2768:29
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2857:29
 ERROR Could not find name `metadata_name` [unknown-name]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2769:29
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2858:29
 ERROR Argument `Iterator[Any | Unknown] | Iterator[Any]` is not assignable to parameter `invoke_result` with type `Generator[Unknown]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval._handle_invoke_result` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3781:64
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3870:64
 ERROR Argument `Iterator[Any | Unknown] | Iterator[Any]` is not assignable to parameter `invoke_result` with type `Generator[Unknown]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval._handle_invoke_result` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3785:67
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3874:67
 ERROR `None` is not subscriptable [unsupported-operation]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4024:16
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4113:16
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4546:40
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4635:40
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4598:40
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4687:40
 ERROR Argument `SimpleNamespace` is not assignable to parameter `metadata_condition` with type `MetadataFilteringCondition | None` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4603:40
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4692:40
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4619:36
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4708:36
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4649:36
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4738:36
 ERROR Argument `SimpleNamespace` is not assignable to parameter `metadata_condition` with type `MetadataFilteringCondition | None` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4654:36
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4743:36
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4662:36
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4751:36
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4698:36
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4787:36
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4726:36
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4815:36
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4784:40
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4873:40
 ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
-    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4828:44
+    --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4917:44
 ERROR Argument `Iterator[Any | Unknown] | Iterator[Any]` is not assignable to parameter `invoke_result` with type `Generator[Unknown]` in function `core.rag.retrieval.router.multi_dataset_react_route.ReactMultiDatasetRouter._handle_invoke_result` [bad-argument-type]
    --> tests/unit_tests/core/rag/retrieval/test_multi_dataset_react_route.py:198:52
 ERROR Argument `None` is not assignable to parameter `text` with type `str` in function `core.rag.splitter.text_splitter.RecursiveCharacterTextSplitter.split_text` [bad-argument-type]

@autofix-ci autofix-ci Bot requested review from Yeuoly and crazywoola as code owners May 2, 2026 14:41

@Qodo-Free-For-OSS

Hi, _deduplicate_documents now routes any provider with metadata["doc_id"] through the score-comparison branch, where a duplicate with metadata["score"] set to None (or non-numeric) will raise during float conversion and crash hybrid retrieval.

Severity: action required | Category: reliability

How to fix: Guard float conversion for score

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

RetrievalService._deduplicate_documents now applies doc_id-based dedup to all providers. In the duplicate-replacement branch it converts metadata["score"] to float(...). If an external/non-dify document has "score": None (or a non-numeric value) this raises and can crash hybrid retrieval.

Issue Context

External retrieval paths assign document.metadata["score"] = external_document.get("score") without validating the type.

Fix Focus Areas

  • api/core/rag/datasource/retrieval_service.py[234-248]
  • api/core/rag/retrieval/dataset_retrieval.py[653-660]

Implementation notes

  • Treat missing/None/non-numeric scores as “no score”: skip replacement (per docstring: “If a later duplicate has no score, ignore it.”)
  • Use safe parsing:
    • score = doc.metadata.get("score")
    • if score is None: skip
    • else: try: new_score=float(score) except (TypeError, ValueError): skip
  • Consider applying the same safe parsing to old_score as well.
  • Add/extend a unit test for non-dify provider duplicates with doc_id and score=None to ensure no crash and deterministic selection behavior.
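
The safe-parsing notes above could look roughly like the following sketch. The helper names `parse_score` and `should_replace` are hypothetical and do not come from the Dify codebase; the sketch assumes scores live under `metadata["score"]` as the review describes, and treats a missing, None, or non-numeric score as "no score".

```python
from typing import Any, Optional


def parse_score(metadata: dict[str, Any]) -> Optional[float]:
    """Return the score as a float, or None when it is missing or unusable."""
    score = metadata.get("score")
    if score is None:
        return None
    try:
        return float(score)
    except (TypeError, ValueError):
        # Non-numeric values (e.g. "abc" or a list) count as "no score".
        return None


def should_replace(existing: dict[str, Any], candidate: dict[str, Any]) -> bool:
    """Decide whether a later duplicate should replace the one already kept."""
    new_score = parse_score(candidate)
    if new_score is None:
        # A later duplicate with no usable score is ignored.
        return False
    old_score = parse_score(existing)
    # The same safe parsing is applied to the kept document's score.
    return old_score is None or new_score > old_score
```

Note that `float()` also accepts numeric strings such as "0.5", so this sketch would treat them as valid scores; whether that is desirable for external providers is a design choice the review leaves open.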

Found by Qodo code review

@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label May 5, 2026
@fatelei fatelei added this pull request to the merge queue May 5, 2026
Merged via the queue into langgenius:main with commit 1f29565 May 5, 2026
27 checks passed

Development

Successfully merging this pull request may close these issues.

Knowledge Base: Duplicate content chunks from different documents are collapsed during retrieval (content-only deduplication ignores metadata)
