fix(rag): use doc_id dedup key for any provider, not only dify #35759
Previously _deduplicate_documents forced doc_id=None for non-dify providers, so two chunks from different source documents with identical text were silently merged into one result via content-based dedup. Any downstream citation pointing at the second document was lost.

Fix: remove the is_dify guard and read doc_id from metadata regardless of provider. When doc_id is present the dedup key becomes (provider, doc_id), which correctly distinguishes chunks that share content but originate from different documents. The content-based fallback is preserved for documents that carry no doc_id.

Also update the _deduplicate_documents docstring to reflect the new provider-agnostic rule, and add three targeted unit tests:

- regression test for the exact bug scenario (issue langgenius#35707)
- non-dify provider with colliding doc_ids keeps highest-score doc
- dify provider without doc_id still falls back to content dedup

Fixes langgenius#35707
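The rule described above can be illustrated with a minimal sketch. This is not the PR's actual implementation: the `Doc` class, `dedup_key`, and `deduplicate` are hypothetical names, the shape of the content-fallback key is an assumption, and scores are assumed to be plain floats.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved chunk (not the project's Document class)."""
    page_content: str
    provider: str = "dify"
    metadata: dict[str, Any] = field(default_factory=dict)


def dedup_key(doc: Doc) -> tuple:
    """Provider-agnostic key: (provider, doc_id) when doc_id exists, else the content."""
    doc_id = doc.metadata.get("doc_id")
    if doc_id:
        return ("doc_id", doc.provider, doc_id)
    # Assumed fallback shape: documents without a doc_id are compared by text.
    return ("content", doc.provider, doc.page_content)


def deduplicate(docs: list[Doc]) -> list[Doc]:
    """Keep one doc per key, preferring the higher score; first-seen order is kept."""
    best: dict[tuple, Doc] = {}
    for doc in docs:
        key = dedup_key(doc)
        kept = best.get(key)
        # Scores are assumed numeric here; a missing score counts as 0.0.
        if kept is None or doc.metadata.get("score", 0.0) > kept.metadata.get("score", 0.0):
            best[key] = doc
    return list(best.values())


docs = [
    Doc("same text", provider="external", metadata={"doc_id": "a", "score": 0.9}),
    Doc("same text", provider="external", metadata={"doc_id": "b", "score": 0.8}),
    Doc("same text", provider="external", metadata={"doc_id": "a", "score": 0.95}),
]
unique = deduplicate(docs)
print([(d.metadata["doc_id"], d.metadata["score"]) for d in unique])
# prints [('a', 0.95), ('b', 0.8)]
```

Documents "a" and "b" share identical text but survive as separate results because their doc_ids differ, while the two "a" entries collapse to the higher-scoring one.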
Pyrefly Diff
base → PR
--- /tmp/pyrefly_base.txt 2026-05-01 13:02:03.360813355 +0000
+++ /tmp/pyrefly_pr.txt 2026-05-01 13:01:54.127712107 +0000
@@ -4482,39 +4482,39 @@
ERROR Object of class `FunctionType` has no attribute `call_count` [missing-attribute]
--> tests/unit_tests/core/rag/rerank/test_reranker.py:1630:16
ERROR Argument `list[float] | None` is not assignable to parameter `obj` with type `Sized` in function `len` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:1949:20
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2038:20
ERROR Could not find name `metadata_name` [unknown-name]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2768:29
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2857:29
ERROR Could not find name `metadata_name` [unknown-name]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2769:29
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:2858:29
ERROR Argument `Iterator[Any | Unknown] | Iterator[Any]` is not assignable to parameter `invoke_result` with type `Generator[Unknown]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval._handle_invoke_result` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3781:64
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3870:64
ERROR Argument `Iterator[Any | Unknown] | Iterator[Any]` is not assignable to parameter `invoke_result` with type `Generator[Unknown]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval._handle_invoke_result` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3785:67
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:3874:67
ERROR `None` is not subscriptable [unsupported-operation]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4024:16
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4113:16
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4546:40
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4635:40
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4598:40
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4687:40
ERROR Argument `SimpleNamespace` is not assignable to parameter `metadata_condition` with type `MetadataFilteringCondition | None` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4603:40
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4692:40
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4619:36
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4708:36
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4649:36
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4738:36
ERROR Argument `SimpleNamespace` is not assignable to parameter `metadata_condition` with type `MetadataFilteringCondition | None` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4654:36
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4743:36
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.single_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4662:36
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4751:36
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4698:36
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4787:36
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4726:36
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4815:36
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4784:40
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4873:40
ERROR Argument `list[SimpleNamespace]` is not assignable to parameter `available_datasets` with type `list[Dataset]` in function `core.rag.retrieval.dataset_retrieval.DatasetRetrieval.multiple_retrieve` [bad-argument-type]
- --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4828:44
+ --> tests/unit_tests/core/rag/retrieval/test_dataset_retrieval.py:4917:44
ERROR Argument `Iterator[Any | Unknown] | Iterator[Any]` is not assignable to parameter `invoke_result` with type `Generator[Unknown]` in function `core.rag.retrieval.router.multi_dataset_react_route.ReactMultiDatasetRouter._handle_invoke_result` [bad-argument-type]
--> tests/unit_tests/core/rag/retrieval/test_multi_dataset_react_route.py:198:52
ERROR Argument `None` is not assignable to parameter `text` with type `str` in function `core.rag.splitter.text_splitter.RecursiveCharacterTextSplitter.split_text` [bad-argument-type]
Hi, _deduplicate_documents now routes any provider with metadata["doc_id"] through the score-comparison branch, where a duplicate with metadata["score"] set to None (or non-numeric) will raise during float conversion and crash hybrid retrieval.

Severity: action required | Category: reliability

How to fix: Guard float conversion for score

Found by Qodo code review
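One way to guard the conversion the review describes is sketched below. `safe_score` is a hypothetical helper, not a function from the codebase, and falling back to 0.0 for unusable scores is an assumption; the project may prefer a different default or to skip such documents.

```python
import math
from typing import Any


def safe_score(metadata: dict[str, Any], default: float = 0.0) -> float:
    """Return metadata["score"] as a float, or `default` when it is
    missing, None, non-numeric, or NaN."""
    raw = metadata.get("score")
    if raw is None:
        return default
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # TypeError: unsupported type (e.g. a list); ValueError: unparsable string.
        return default
    # NaN compares False against everything, which would break "keep the highest score".
    return default if math.isnan(value) else value
```

Comparing `safe_score(a.metadata) > safe_score(b.metadata)` in the duplicate branch would then never raise, at the cost of treating malformed scores as the lowest-ranked.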