refactor(api): tighten core rag typing batch 1 by tmimmanuel · Pull Request #35210 · langgenius/dify

tmimmanuel · 2026-04-14T19:15:04Z

Summary

Fix an unused as_completed() loop binding in retrieval service
Tighten special-token typing in text splitters to avoid broad collection inference
Make the fixed text splitter’s unused token encoder explicit to the type checker
Add a stable null/initialization guard for upload_file in extract processor
Narrow hit-testing retrieval model handling with a local typed dict
Adjust round-robin invocation typing so the full type-check stack passes
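
To illustrate the first item, here is a minimal sketch of one common way to resolve an unused `as_completed()` loop binding: bind the completed future and actually read its result. The helper names `square` and `run_all` are hypothetical and not taken from the PR. The actual change in `retrieval_service.py` may differ (for example, it may simply rename the binding).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(n: int) -> int:
    """Hypothetical stand-in for a retrieval worker."""
    return n * n


def run_all(values: list[int]) -> list[int]:
    results: list[int] = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(square, v) for v in values]
        # Use the bound future instead of ignoring it: the type checker no
        # longer reports an unused variable, and any exception raised in a
        # worker is re-raised here by future.result().
        for future in as_completed(futures):
            results.append(future.result())
    # Completion order is not deterministic, so sort before returning.
    return sorted(results)
```

Calling `future.result()` inside the loop also surfaces worker exceptions at the point of collection rather than leaving them unobserved.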

Test plan

uv run --directory api --dev -- basedpyright --threads 8 passes with 0 errors
./dev/pyrefly-check-local passes with 0 errors
uv --directory api run mypy --exclude-gitignore --exclude 'tests/' --exclude 'migrations/' --check-untyped-defs --disable-error-code=import-untyped . passes with 0 errors
No test changes needed
No intended runtime behavior changes — typing cleanup, casts, and null-safety only

Part of #26412

Please review my previous PRs (#34809, #34702, #34796, #34938), which address the same issue (#26412).

github-actions · 2026-04-14T19:16:45Z

Pyrefly Diff

base → PR

--- /tmp/pyrefly_base.txt	2026-04-14 19:16:29.937002281 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 19:16:19.785880733 +0000
@@ -38,26 +38,6 @@
    --> core/llm_generator/llm_generator.py:394:60
 ERROR No matching overload found for function `core.model_manager.ModelInstance.invoke_llm` called with arguments: (prompt_messages=list[SystemPromptMessage | UserPromptMessage], model_parameters=dict[str, float], stream=Literal[False]) [no-matching-overload]
    --> core/llm_generator/llm_generator.py:582:60
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:172:17
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:197:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:217:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:239:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:256:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:281:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:309:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:328:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:344:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:361:13
 ERROR Argument `dict[str, list[str] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
    --> core/ops/mlflow_trace/mlflow_trace.py:271:24
 ERROR Argument `dict[str, dict[str, Any] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
@@ -598,16 +578,6 @@
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:14:5
 ERROR Class member `DuplicateDocumentIndexingTaskProxy.PRIORITY_TASK_FUNC` overrides parent class `BatchDocumentIndexingProxy` in an inconsistent manner [bad-override]
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:15:5
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any` is not assignable to parameter `top_k` with type `int` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:86:19
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | float | int | Any` is not assignable to parameter `score_threshold` with type `float | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:87:29
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `reranking_model` with type `RerankingModelDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:90:29
-ERROR Argument `Literal['reranking_model', True] | RetrievalMethod | dict[str, str] | int | Any` is not assignable to parameter `reranking_mode` with type `str` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:93:28
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `weights` with type `WeightsDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:94:21
 ERROR `handled_tenant_count` was assigned in the current scope before the nonlocal declaration [unknown-name]
   --> services/plugin/plugin_migration.py:92:34
 ERROR `dict[str, Any]` is not assignable to attribute `credentials` with type `Never` [bad-assignment]

Copilot

Pull request overview

This PR is part of the ongoing #26412 refactor to eliminate pyright ignores through static-typing fixes across api/core/rag and adjacent services, with no intended runtime behavior changes.

Changes:

Fix an unused as_completed() loop binding in RetrievalService.retrieve.
Tighten typing for retrieval-model dict handling in hit testing and for special-token parameters in text splitters.
Add an explicit initialization/guard for upload_file in the extract processor for better null-safety under type checking.
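
As a rough illustration of the special-token item, the sketch below shows one way to annotate such parameters narrowly while avoiding a mutable default. The function name `normalize_special` and the exact union types are assumptions made for this example; they are not copied from `text_splitter.py`.

```python
from __future__ import annotations

from collections.abc import Collection, Set
from typing import Literal


def normalize_special(
    allowed_special: Literal["all"] | Set[str] | None = None,
    disallowed_special: Literal["all"] | Collection[str] = "all",
) -> tuple[str | frozenset[str], str | frozenset[str]]:
    """Hypothetical helper: resolve special-token settings to immutable values."""
    # None replaces a mutable default such as set(); an empty frozenset
    # is created per call instead of being shared across calls.
    if allowed_special is None:
        allowed: str | frozenset[str] = frozenset()
    elif isinstance(allowed_special, str):
        allowed = allowed_special  # the literal "all"
    else:
        allowed = frozenset(allowed_special)

    if isinstance(disallowed_special, str):
        disallowed: str | frozenset[str] = disallowed_special
    else:
        disallowed = frozenset(disallowed_special)
    return allowed, disallowed
```

Spelling out `Literal["all"]` next to the collection type keeps a checker from inferring a broad `Collection[Any]` style type for the parameter.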

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

File	Description
api/services/hit_testing_service.py	Introduces a local typed dict + cast to narrow retrieval model handling.
api/core/rag/splitter/text_splitter.py	Narrows special-token parameter types and avoids mutable defaults.
api/core/rag/splitter/fixed_text_splitter.py	Aligns special-token typing and makes an intentionally-unused encoder explicit.
api/core/rag/extractor/extract_processor.py	Refactors `upload_file` initialization/guards for type-checking.
api/core/rag/datasource/retrieval_service.py	Fixes an unused loop variable in an `as_completed()` loop.
api/core/model_manager.py	Adjusts round-robin invocation calls to satisfy type-check stack.
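
The "local typed dict + cast" approach described for `hit_testing_service.py` can be sketched roughly as follows. The class `RetrievalModelSketch`, the function `retrieval_kwargs`, and every default value are hypothetical and chosen only for illustration; the field names are loosely based on the parameter names in the Pyrefly output.

```python
from __future__ import annotations

from typing import Any, TypedDict, cast


class RetrievalModelSketch(TypedDict, total=False):
    """Hypothetical local shape for a loosely typed retrieval-model dict."""

    top_k: int
    score_threshold_enabled: bool
    score_threshold: float
    reranking_mode: str


def retrieval_kwargs(raw: dict[str, Any]) -> dict[str, Any]:
    # cast() has no runtime effect; it only tells the type checker which
    # value type each key carries, instead of one wide union for all keys.
    model = cast(RetrievalModelSketch, raw)
    enabled = model.get("score_threshold_enabled", False)
    return {
        "top_k": model.get("top_k", 4),  # default is an assumption
        "score_threshold": model.get("score_threshold", 0.0) if enabled else 0.0,
        "reranking_mode": model.get("reranking_mode", "reranking_model"),
    }
```

Because `cast` performs no validation, this pattern only narrows what the checker sees; malformed input would still need a runtime check if that matters for the caller.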

Copilot · 2026-04-14T19:49:49Z

+            upload_file = extract_setting.upload_file
            with tempfile.TemporaryDirectory() as temp_dir:
                if not file_path:
-                    assert extract_setting.upload_file is not None, "upload_file is required"
-                    upload_file: UploadFile = extract_setting.upload_file
+                    assert upload_file is not None, "upload_file is required"
                    suffix = Path(upload_file.key).suffix
                    # FIXME mypy: Cannot determine type of 'tempfile._get_candidate_names' better not use it here
                    file_path = f"{temp_dir}/{next(tempfile._get_candidate_names())}{suffix}"  # type: ignore
                    storage.download(upload_file.key, file_path)
                input_file = Path(file_path)
                file_extension = input_file.suffix.lower()
+                assert upload_file is not None, "upload_file is required"


upload_file is asserted as non-None unconditionally after file_path resolution. This breaks callers that pass file_path with no upload_file (e.g. load_from_url() constructs ExtractSetting(datasource_type=FILE, document_model=...) without upload_file and calls extract(..., file_path=...)), causing an AssertionError for non-upload flows.

Consider only asserting upload_file in the branches that actually use upload_file fields (PDF/DOCX paths that need tenant_id/created_by), or restructure so URL-based extraction doesn’t require upload_file unless the chosen extractor needs it.
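
A minimal sketch of that suggestion, assuming a simplified setting: the guard moves into the branches that read `upload_file` fields, so path-only callers are unaffected. `UploadFileSketch` and `pick_extractor` are hypothetical names, and the returned strings merely stand in for real extractor objects.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class UploadFileSketch:
    """Hypothetical stand-in for the real UploadFile model."""

    key: str
    tenant_id: str


def pick_extractor(file_path: str, upload_file: UploadFileSketch | None) -> str:
    extension = Path(file_path).suffix.lower()
    if extension in {".pdf", ".docx"}:
        # Only these branches read upload_file fields, so only they require it.
        if upload_file is None:
            raise ValueError("upload_file is required for PDF/DOCX extraction")
        return f"{extension[1:]}-extractor(tenant={upload_file.tenant_id})"
    # Other formats work from file_path alone (e.g. URL-based extraction).
    return "text-extractor"
```

With this shape, a call such as `pick_extractor("page.txt", None)` succeeds, while a PDF without an upload record fails with an explicit error instead of a blanket assertion.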

github-actions · 2026-04-14T19:58:24Z

Pyrefly Diff

base → PR

--- /tmp/pyrefly_base.txt	2026-04-14 19:58:08.056549989 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 19:57:59.637314455 +0000
@@ -38,26 +38,6 @@
    --> core/llm_generator/llm_generator.py:394:60
 ERROR No matching overload found for function `core.model_manager.ModelInstance.invoke_llm` called with arguments: (prompt_messages=list[SystemPromptMessage | UserPromptMessage], model_parameters=dict[str, float], stream=Literal[False]) [no-matching-overload]
    --> core/llm_generator/llm_generator.py:582:60
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:172:17
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:197:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:217:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:239:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:256:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:281:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:309:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:328:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:344:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:361:13
 ERROR Argument `dict[str, list[str] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
    --> core/ops/mlflow_trace/mlflow_trace.py:271:24
 ERROR Argument `dict[str, dict[str, Any] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
@@ -70,14 +50,8 @@
    --> core/ops/mlflow_trace/mlflow_trace.py:415:24
 ERROR Class member `OpsTraceProviderConfigMap.__getitem__` overrides parent class `UserDict` in an inconsistent manner [bad-param-name-override]
    --> core/ops/ops_trace_manager.py:206:9
-ERROR Object of class `NoneType` has no attribute `data_source_type` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:142:36
-ERROR Object of class `NoneType` has no attribute `keyword_table` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:144:13
 ERROR Cannot index into `set[Any]` [bad-index]
-   --> core/rag/datasource/keyword/jieba/jieba.py:157:29
-ERROR Argument `object` is not assignable to parameter `iterable` with type `Iterable[@_]` in function `list.__init__` [bad-argument-type]
-  --> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:35
+   --> core/rag/datasource/keyword/jieba/jieba.py:159:29
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.post` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:106:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
@@ -598,16 +572,6 @@
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:14:5
 ERROR Class member `DuplicateDocumentIndexingTaskProxy.PRIORITY_TASK_FUNC` overrides parent class `BatchDocumentIndexingProxy` in an inconsistent manner [bad-override]
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:15:5
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any` is not assignable to parameter `top_k` with type `int` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:86:19
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | float | int | Any` is not assignable to parameter `score_threshold` with type `float | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:87:29
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `reranking_model` with type `RerankingModelDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:90:29
-ERROR Argument `Literal['reranking_model', True] | RetrievalMethod | dict[str, str] | int | Any` is not assignable to parameter `reranking_mode` with type `str` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:93:28
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `weights` with type `WeightsDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:94:21
 ERROR `handled_tenant_count` was assigned in the current scope before the nonlocal declaration [unknown-name]
   --> services/plugin/plugin_migration.py:92:34
 ERROR `dict[str, Any]` is not assignable to attribute `credentials` with type `Never` [bad-assignment]

github-actions · 2026-04-14T20:00:27Z

Pyrefly Diff

base → PR

--- /tmp/pyrefly_base.txt	2026-04-14 20:00:12.871661953 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 20:00:02.365745411 +0000
@@ -38,26 +38,6 @@
    --> core/llm_generator/llm_generator.py:394:60
 ERROR No matching overload found for function `core.model_manager.ModelInstance.invoke_llm` called with arguments: (prompt_messages=list[SystemPromptMessage | UserPromptMessage], model_parameters=dict[str, float], stream=Literal[False]) [no-matching-overload]
    --> core/llm_generator/llm_generator.py:582:60
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:172:17
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:197:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:217:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:239:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:256:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:281:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:309:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:328:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:344:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:361:13
 ERROR Argument `dict[str, list[str] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
    --> core/ops/mlflow_trace/mlflow_trace.py:271:24
 ERROR Argument `dict[str, dict[str, Any] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
@@ -70,14 +50,8 @@
    --> core/ops/mlflow_trace/mlflow_trace.py:415:24
 ERROR Class member `OpsTraceProviderConfigMap.__getitem__` overrides parent class `UserDict` in an inconsistent manner [bad-param-name-override]
    --> core/ops/ops_trace_manager.py:206:9
-ERROR Object of class `NoneType` has no attribute `data_source_type` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:142:36
-ERROR Object of class `NoneType` has no attribute `keyword_table` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:144:13
 ERROR Cannot index into `set[Any]` [bad-index]
-   --> core/rag/datasource/keyword/jieba/jieba.py:157:29
-ERROR Argument `object` is not assignable to parameter `iterable` with type `Iterable[@_]` in function `list.__init__` [bad-argument-type]
-  --> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:35
+   --> core/rag/datasource/keyword/jieba/jieba.py:159:29
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.post` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:106:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
@@ -598,16 +572,6 @@
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:14:5
 ERROR Class member `DuplicateDocumentIndexingTaskProxy.PRIORITY_TASK_FUNC` overrides parent class `BatchDocumentIndexingProxy` in an inconsistent manner [bad-override]
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:15:5
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any` is not assignable to parameter `top_k` with type `int` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:86:19
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | float | int | Any` is not assignable to parameter `score_threshold` with type `float | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:87:29
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `reranking_model` with type `RerankingModelDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:90:29
-ERROR Argument `Literal['reranking_model', True] | RetrievalMethod | dict[str, str] | int | Any` is not assignable to parameter `reranking_mode` with type `str` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:93:28
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `weights` with type `WeightsDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:94:21
 ERROR `handled_tenant_count` was assigned in the current scope before the nonlocal declaration [unknown-name]
   --> services/plugin/plugin_migration.py:92:34
 ERROR `dict[str, Any]` is not assignable to attribute `credentials` with type `Never` [bad-assignment]

refactor(api): tighten core rag typing batch 1

b764427

dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) and refactor labels Apr 14, 2026

refactor(api): tighten jieba keyword typing batch 2

fe195c5

tmimmanuel mentioned this pull request Apr 14, 2026

refactor(api): tighten jieba keyword typing batch 2 tmimmanuel/dify#1

Merged

3 tasks

asukaminato0721 requested a review from Copilot April 14, 2026 19:46

asukaminato0721 self-assigned this Apr 14, 2026

asukaminato0721 enabled auto-merge April 14, 2026 19:46

Copilot started reviewing on behalf of asukaminato0721 April 14, 2026 19:46

Copilot AI reviewed Apr 14, 2026

Merge pull request #1 from tmimmanuel/refactor/pyright-core-rag-keyword-router

refactor(api): tighten jieba keyword typing batch 2

2c112cc

auto-merge was automatically disabled April 14, 2026 19:56
Head branch was pushed to by a user without write access

tmimmanuel requested review from JohnJyong and QuantumGhost as code owners April 14, 2026 19:56

dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) and removed the size:M label (This PR changes 30-99 lines, ignoring generated files.) Apr 14, 2026

[autofix.ci] apply automated fixes

b9a04a9

dosubot bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) and removed the size:L label (This PR changes 100-499 lines, ignoring generated files.) Apr 14, 2026

tmimmanuel wants to merge 4 commits into langgenius:main from tmimmanuel:refactor/pyright-enable-core-rag


3 participants
