refactor(api): tighten core rag typing batch 1#35210

Open
tmimmanuel wants to merge 4 commits into langgenius:main from tmimmanuel:refactor/pyright-enable-core-rag

@tmimmanuel (Contributor) commented Apr 14, 2026

Summary

  • Fix an unused as_completed() loop binding in retrieval service
  • Tighten special-token typing in text splitters to avoid broad collection inference
  • Make the fixed text splitter’s unused token encoder explicit to the type checker
  • Add a stable null/initialization guard for upload_file in extract processor
  • Narrow hit-testing retrieval model handling with a local typed dict
  • Adjust round-robin invocation typing so the full type-check stack passes
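
The first bullet describes a common pattern: when `as_completed()` is drained only to wait for completion, a named loop variable that is never read gets flagged as unused. The following is a minimal sketch of that pattern, not the actual change in retrieval_service.py; `square_into` and `run_all` are hypothetical names introduced for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square_into(n: int, results: list[int], errors: list[str]) -> None:
    """Hypothetical worker: records a result or an error message."""
    if n < 0:
        errors.append(f"negative input: {n}")
        return
    results.append(n * n)


def run_all(values: list[int]) -> list[int]:
    results: list[int] = []
    errors: list[str] = []
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(square_into, v, results, errors) for v in values]
        # The loop only wakes up as each future finishes; the future itself is
        # never read, so the binding is `_` rather than an unused named variable.
        for _ in as_completed(futures, timeout=60):
            if errors:
                for pending in futures:
                    pending.cancel()
                break
    if errors:
        raise ValueError("; ".join(errors))
    return sorted(results)
```

Binding to `_` keeps the runtime behavior identical while telling both readers and the type checker that the value is intentionally ignored.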

Test plan

  • uv run --directory api --dev -- basedpyright --threads 8 passes with 0 errors
  • ./dev/pyrefly-check-local passes with 0 errors
  • uv --directory api run mypy --exclude-gitignore --exclude 'tests/' --exclude 'migrations/' --check-untyped-defs --disable-error-code=import-untyped . passes with 0 errors
  • No test changes needed
  • No intended runtime behavior changes — typing cleanup, casts, and null-safety only

Part of #26412

Please also review my previous PRs (#34809, #34702, #34796, #34938), which address the same issue (#26412).

@dosubot (bot) added the size:M (This PR changes 30-99 lines, ignoring generated files.) and refactor labels on Apr 14, 2026
@github-actions
Contributor

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-04-14 19:16:29.937002281 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 19:16:19.785880733 +0000
@@ -38,26 +38,6 @@
    --> core/llm_generator/llm_generator.py:394:60
 ERROR No matching overload found for function `core.model_manager.ModelInstance.invoke_llm` called with arguments: (prompt_messages=list[SystemPromptMessage | UserPromptMessage], model_parameters=dict[str, float], stream=Literal[False]) [no-matching-overload]
    --> core/llm_generator/llm_generator.py:582:60
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:172:17
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:197:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:217:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:239:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:256:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:281:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:309:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:328:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:344:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:361:13
 ERROR Argument `dict[str, list[str] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
    --> core/ops/mlflow_trace/mlflow_trace.py:271:24
 ERROR Argument `dict[str, dict[str, Any] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
@@ -598,16 +578,6 @@
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:14:5
 ERROR Class member `DuplicateDocumentIndexingTaskProxy.PRIORITY_TASK_FUNC` overrides parent class `BatchDocumentIndexingProxy` in an inconsistent manner [bad-override]
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:15:5
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any` is not assignable to parameter `top_k` with type `int` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:86:19
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | float | int | Any` is not assignable to parameter `score_threshold` with type `float | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:87:29
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `reranking_model` with type `RerankingModelDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:90:29
-ERROR Argument `Literal['reranking_model', True] | RetrievalMethod | dict[str, str] | int | Any` is not assignable to parameter `reranking_mode` with type `str` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:93:28
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `weights` with type `WeightsDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:94:21
 ERROR `handled_tenant_count` was assigned in the current scope before the nonlocal declaration [unknown-name]
   --> services/plugin/plugin_migration.py:92:34
 ERROR `dict[str, Any]` is not assignable to attribute `credentials` with type `Never` [bad-assignment]

Contributor

Copilot AI left a comment

Pull request overview

This PR is part of the ongoing #26412 refactor to eliminate pyright ignores through static-typing fixes across api/core/rag and adjacent services, with no intended runtime behavior changes.

Changes:

  • Fix an unused as_completed() loop binding in RetrievalService.retrieve.
  • Tighten typing for retrieval-model dict handling in hit testing and for special-token parameters in text splitters.
  • Add an explicit initialization/guard for upload_file in the extract processor for better null-safety under type checking.
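
To illustrate what tightening the special-token parameters can look like, here is a minimal sketch that assumes the parameters accept either the literal string "all" or a collection of token strings, and that uses `None` in place of a mutable default. `resolve_special_tokens` is a hypothetical helper, and the splitter signatures in the PR may differ.

```python
from __future__ import annotations

from collections.abc import Collection, Set
from typing import Literal


def resolve_special_tokens(
    allowed_special: Literal["all"] | Set[str] | None = None,
    disallowed_special: Literal["all"] | Collection[str] = "all",
) -> tuple[Literal["all"] | frozenset[str], Literal["all"] | frozenset[str]]:
    """Hypothetical helper: normalizes the two special-token parameters."""
    # None replaces a mutable `set()` default; an empty frozenset is built per call.
    allowed = "all" if allowed_special == "all" else frozenset(allowed_special or ())
    disallowed = "all" if disallowed_special == "all" else frozenset(disallowed_special)
    return allowed, disallowed
```

Spelling out `Literal["all"]` next to the collection type keeps the checker from widening the parameter to a bare `str` or an untyped collection.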

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| api/services/hit_testing_service.py | Introduces a local typed dict + cast to narrow retrieval model handling. |
| api/core/rag/splitter/text_splitter.py | Narrows special-token parameter types and avoids mutable defaults. |
| api/core/rag/splitter/fixed_text_splitter.py | Aligns special-token typing and makes an intentionally-unused encoder explicit. |
| api/core/rag/extractor/extract_processor.py | Refactors upload_file initialization/guards for type-checking. |
| api/core/rag/datasource/retrieval_service.py | Fixes an unused loop variable in an as_completed() loop. |
| api/core/model_manager.py | Adjusts round-robin invocation calls to satisfy type-check stack. |
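
The hit_testing_service.py errors in the Pyrefly diff show argument types such as `RetrievalMethod | bool | dict[str, str] | int | Any`, which is what a checker infers when values are read from a plain dict literal with mixed value types. A local typed dict plus `cast` gives each key its own type. The sketch below is an assumption about the general shape, not the PR's code; `RetrievalModelSketch`, `DEFAULT_MODEL`, and `retrieve_args` are hypothetical names.

```python
from __future__ import annotations

from typing import Any, TypedDict, cast


class RetrievalModelSketch(TypedDict, total=False):
    """Hypothetical shape; keys mirror parameters named in the Pyrefly errors."""

    top_k: int
    score_threshold: float | None
    reranking_mode: str


DEFAULT_MODEL: RetrievalModelSketch = {
    "top_k": 4,
    "score_threshold": None,
    "reranking_mode": "reranking_model",
}


def retrieve_args(raw: dict[str, Any] | None) -> tuple[int, float | None, str]:
    # cast() does nothing at runtime; it only tells the checker which
    # per-key types to use instead of a union over all dict values.
    model = cast(RetrievalModelSketch, raw) if raw else DEFAULT_MODEL
    return (
        model.get("top_k", 4),
        model.get("score_threshold"),
        model.get("reranking_mode", "reranking_model"),
    )
```

Because `cast` performs no validation, this approach only fits cases where the dict's shape is already guaranteed elsewhere.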

Comment on lines +97 to +107
+upload_file = extract_setting.upload_file
 with tempfile.TemporaryDirectory() as temp_dir:
     if not file_path:
-        assert extract_setting.upload_file is not None, "upload_file is required"
-        upload_file: UploadFile = extract_setting.upload_file
+        assert upload_file is not None, "upload_file is required"
         suffix = Path(upload_file.key).suffix
         # FIXME mypy: Cannot determine type of 'tempfile._get_candidate_names' better not use it here
         file_path = f"{temp_dir}/{next(tempfile._get_candidate_names())}{suffix}"  # type: ignore
         storage.download(upload_file.key, file_path)
     input_file = Path(file_path)
     file_extension = input_file.suffix.lower()
+    assert upload_file is not None, "upload_file is required"
Copilot AI Apr 14, 2026

upload_file is asserted as non-None unconditionally after file_path resolution. This breaks callers that pass file_path with no upload_file (e.g. load_from_url() constructs ExtractSetting(datasource_type=FILE, document_model=...) without upload_file and calls extract(..., file_path=...)), causing an AssertionError for non-upload flows.

Consider only asserting upload_file in the branches that actually use upload_file fields (PDF/DOCX paths that need tenant_id/created_by), or restructure so URL-based extraction doesn’t require upload_file unless the chosen extractor needs it.
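
A minimal sketch of the suggested restructuring, assuming that only the download branch and the PDF/DOCX branches read fields of upload_file. `UploadFileSketch`, `ExtractSettingSketch`, and `pick_extractor` are hypothetical stand-ins, not the real classes in extract_processor.py, and the returned strings only mark which branch ran.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class UploadFileSketch:
    key: str
    tenant_id: str


@dataclass
class ExtractSettingSketch:
    upload_file: UploadFileSketch | None = None


def pick_extractor(setting: ExtractSettingSketch, file_path: str | None = None) -> str:
    upload_file = setting.upload_file
    if not file_path:
        # Download branch: the storage key is needed, so the guard lives here.
        if upload_file is None:
            raise ValueError("upload_file is required when file_path is not given")
        file_path = f"/tmp/{Path(upload_file.key).name}"
    extension = Path(file_path).suffix.lower()
    if extension in {".pdf", ".docx"}:
        # Only these branches read tenant-scoped fields, so only they require upload_file.
        if upload_file is None:
            raise ValueError(f"upload_file is required for {extension} extraction")
        return f"{extension[1:]}:{upload_file.tenant_id}"
    # Flows that pass a ready file_path without upload_file reach this point unaffected.
    return f"text:{file_path}"
```

The sketch raises `ValueError` instead of using `assert` so the guard survives `python -O`; either form narrows the type for the checker.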

…rd-router

refactor(api): tighten jieba keyword typing batch 2
auto-merge was automatically disabled April 14, 2026 19:56

Head branch was pushed to by a user without write access

@dosubot (bot) added the size:L label (This PR changes 100-499 lines, ignoring generated files.) and removed the size:M label (This PR changes 30-99 lines, ignoring generated files.) on Apr 14, 2026
@github-actions
Contributor

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-04-14 19:58:08.056549989 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 19:57:59.637314455 +0000
@@ -38,26 +38,6 @@
    --> core/llm_generator/llm_generator.py:394:60
 ERROR No matching overload found for function `core.model_manager.ModelInstance.invoke_llm` called with arguments: (prompt_messages=list[SystemPromptMessage | UserPromptMessage], model_parameters=dict[str, float], stream=Literal[False]) [no-matching-overload]
    --> core/llm_generator/llm_generator.py:582:60
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:172:17
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:197:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:217:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:239:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:256:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:281:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:309:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:328:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:344:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:361:13
 ERROR Argument `dict[str, list[str] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
    --> core/ops/mlflow_trace/mlflow_trace.py:271:24
 ERROR Argument `dict[str, dict[str, Any] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
@@ -70,14 +50,8 @@
    --> core/ops/mlflow_trace/mlflow_trace.py:415:24
 ERROR Class member `OpsTraceProviderConfigMap.__getitem__` overrides parent class `UserDict` in an inconsistent manner [bad-param-name-override]
    --> core/ops/ops_trace_manager.py:206:9
-ERROR Object of class `NoneType` has no attribute `data_source_type` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:142:36
-ERROR Object of class `NoneType` has no attribute `keyword_table` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:144:13
 ERROR Cannot index into `set[Any]` [bad-index]
-   --> core/rag/datasource/keyword/jieba/jieba.py:157:29
-ERROR Argument `object` is not assignable to parameter `iterable` with type `Iterable[@_]` in function `list.__init__` [bad-argument-type]
-  --> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:35
+   --> core/rag/datasource/keyword/jieba/jieba.py:159:29
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.post` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:106:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
@@ -598,16 +572,6 @@
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:14:5
 ERROR Class member `DuplicateDocumentIndexingTaskProxy.PRIORITY_TASK_FUNC` overrides parent class `BatchDocumentIndexingProxy` in an inconsistent manner [bad-override]
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:15:5
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any` is not assignable to parameter `top_k` with type `int` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:86:19
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | float | int | Any` is not assignable to parameter `score_threshold` with type `float | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:87:29
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `reranking_model` with type `RerankingModelDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:90:29
-ERROR Argument `Literal['reranking_model', True] | RetrievalMethod | dict[str, str] | int | Any` is not assignable to parameter `reranking_mode` with type `str` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:93:28
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `weights` with type `WeightsDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:94:21
 ERROR `handled_tenant_count` was assigned in the current scope before the nonlocal declaration [unknown-name]
   --> services/plugin/plugin_migration.py:92:34
 ERROR `dict[str, Any]` is not assignable to attribute `credentials` with type `Never` [bad-assignment]

@dosubot (bot) added the size:M label (This PR changes 30-99 lines, ignoring generated files.) and removed the size:L label (This PR changes 100-499 lines, ignoring generated files.) on Apr 14, 2026
@github-actions
Contributor

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-04-14 20:00:12.871661953 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 20:00:02.365745411 +0000
@@ -38,26 +38,6 @@
    --> core/llm_generator/llm_generator.py:394:60
 ERROR No matching overload found for function `core.model_manager.ModelInstance.invoke_llm` called with arguments: (prompt_messages=list[SystemPromptMessage | UserPromptMessage], model_parameters=dict[str, float], stream=Literal[False]) [no-matching-overload]
    --> core/llm_generator/llm_generator.py:582:60
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:172:17
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:197:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:217:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:239:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:256:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:281:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:309:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:328:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:344:13
-ERROR Missing positional argument `function` in function `ModelInstance._round_robin_invoke` [bad-argument-count]
-   --> core/model_manager.py:361:13
 ERROR Argument `dict[str, list[str] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
    --> core/ops/mlflow_trace/mlflow_trace.py:271:24
 ERROR Argument `dict[str, dict[str, Any] | str | None]` is not assignable to parameter `attributes` with type `dict[str, str] | None` in function `mlflow.tracing.fluent.start_span_no_context` [bad-argument-type]
@@ -70,14 +50,8 @@
    --> core/ops/mlflow_trace/mlflow_trace.py:415:24
 ERROR Class member `OpsTraceProviderConfigMap.__getitem__` overrides parent class `UserDict` in an inconsistent manner [bad-param-name-override]
    --> core/ops/ops_trace_manager.py:206:9
-ERROR Object of class `NoneType` has no attribute `data_source_type` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:142:36
-ERROR Object of class `NoneType` has no attribute `keyword_table` [missing-attribute]
-   --> core/rag/datasource/keyword/jieba/jieba.py:144:13
 ERROR Cannot index into `set[Any]` [bad-index]
-   --> core/rag/datasource/keyword/jieba/jieba.py:157:29
-ERROR Argument `object` is not assignable to parameter `iterable` with type `Iterable[@_]` in function `list.__init__` [bad-argument-type]
-  --> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:35
+   --> core/rag/datasource/keyword/jieba/jieba.py:159:29
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.post` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:106:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
@@ -598,16 +572,6 @@
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:14:5
 ERROR Class member `DuplicateDocumentIndexingTaskProxy.PRIORITY_TASK_FUNC` overrides parent class `BatchDocumentIndexingProxy` in an inconsistent manner [bad-override]
   --> services/document_indexing_proxy/duplicate_document_indexing_task_proxy.py:15:5
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any` is not assignable to parameter `top_k` with type `int` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:86:19
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | float | int | Any` is not assignable to parameter `score_threshold` with type `float | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:87:29
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `reranking_model` with type `RerankingModelDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:90:29
-ERROR Argument `Literal['reranking_model', True] | RetrievalMethod | dict[str, str] | int | Any` is not assignable to parameter `reranking_mode` with type `str` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:93:28
-ERROR Argument `RetrievalMethod | bool | dict[str, str] | int | Any | None` is not assignable to parameter `weights` with type `WeightsDict | None` in function `core.rag.datasource.retrieval_service.RetrievalService.retrieve` [bad-argument-type]
-  --> services/hit_testing_service.py:94:21
 ERROR `handled_tenant_count` was assigned in the current scope before the nonlocal declaration [unknown-name]
   --> services/plugin/plugin_migration.py:92:34
 ERROR `dict[str, Any]` is not assignable to attribute `credentials` with type `Never` [bad-assignment]

Labels

refactor size:M This PR changes 30-99 lines, ignoring generated files.

3 participants