refactor(api): add null safety to extractor_processor and firecrawl#35209
Conversation
Pyrefly Diffbase → PR--- /tmp/pyrefly_base.txt 2026-04-14 18:08:08.803291603 +0000
+++ /tmp/pyrefly_pr.txt 2026-04-14 18:07:57.682175706 +0000
@@ -78,18 +78,6 @@
--> core/rag/datasource/keyword/jieba/jieba.py:157:29
ERROR Argument `object` is not assignable to parameter `iterable` with type `Iterable[@_]` in function `list.__init__` [bad-argument-type]
--> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:35
-ERROR `upload_file` may be uninitialized [unbound-name]
- --> core/rag/extractor/extract_processor.py:116:61
-ERROR `upload_file` may be uninitialized [unbound-name]
- --> core/rag/extractor/extract_processor.py:126:62
-ERROR `upload_file` may be uninitialized [unbound-name]
- --> core/rag/extractor/extract_processor.py:152:61
-ERROR `upload_file` may be uninitialized [unbound-name]
- --> core/rag/extractor/extract_processor.py:158:62
-ERROR `response` may be uninitialized [unbound-name]
- --> core/rag/extractor/firecrawl/firecrawl_app.py:183:16
-ERROR `response` may be uninitialized [unbound-name]
- --> core/rag/extractor/firecrawl/firecrawl_app.py:192:16
ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.post` [bad-argument-type]
--> core/rag/extractor/notion_extractor.py:106:25
ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
|
There was a problem hiding this comment.
Pull request overview
This PR improves static typing and null-safety around RAG extraction code paths (notably Firecrawl request helpers and file-based extraction) to reduce pyright/mypy issues as part of removing type-check ignores (#26412).
Changes:
- Add explicit non-
Nonetyping for Firecrawl HTTP responses across retry loops. - Refactor
ExtractProcessor.extract()to reuseupload_filelocal and add assertions to satisfy type-checking for PDF/DOCX extractors.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
api/core/rag/extractor/firecrawl/firecrawl_app.py |
Makes retry helper response variables explicitly optional then asserts non-null at loop end to satisfy typing. |
api/core/rag/extractor/extract_processor.py |
Introduces upload_file local and adds asserts before constructing extractors that require tenant/user context. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| elif file_extension == ".pdf": | ||
| assert upload_file is not None | ||
| extractor = PdfExtractor(file_path, upload_file.tenant_id, upload_file.created_by) | ||
| elif file_extension in {".md", ".markdown", ".mdx"}: |
There was a problem hiding this comment.
The new assert upload_file is not None is used as a type-narrowing guard but is also a runtime check that can be stripped with python -O, which would allow upload_file.tenant_id / created_by to raise AttributeError later. Prefer an explicit if upload_file is None: raise ValueError("upload_file is required for PDF extraction") (with a message) rather than a bare assert.
| elif file_extension == ".docx": | ||
| assert upload_file is not None | ||
| extractor = WordExtractor(file_path, upload_file.tenant_id, upload_file.created_by) | ||
| elif file_extension == ".doc": |
There was a problem hiding this comment.
Same issue for DOCX: the bare assert upload_file is not None is a runtime check that can be stripped, and it also raises AssertionError without context. Prefer an explicit exception with a clear message when upload_file is required for WordExtractor construction.
| elif file_extension == ".pdf": | ||
| assert upload_file is not None | ||
| extractor = PdfExtractor(file_path, upload_file.tenant_id, upload_file.created_by) | ||
| elif file_extension in {".md", ".markdown", ".mdx"}: |
There was a problem hiding this comment.
This branch also uses a bare assert upload_file is not None before constructing PdfExtractor. For consistency and safety (asserts can be stripped), use an explicit guard/exception with a message when upload_file is required.
| elif file_extension == ".docx": | ||
| assert upload_file is not None | ||
| extractor = WordExtractor(file_path, upload_file.tenant_id, upload_file.created_by) |
There was a problem hiding this comment.
The PR description says "No runtime behavior changes", but these new asserts change the failure mode when file_path is provided and extract_setting.upload_file is None (e.g., now AssertionError instead of an attribute error). If this is intended, it should be reflected in the description; otherwise, consider preserving the previous exception type or handling the missing upload_file case more gracefully.
| response: httpx.Response | None = None | ||
| for attempt in range(retries): | ||
| response = httpx.post(url, headers=headers, json=data) | ||
| if response.status_code == 502: | ||
| time.sleep(backoff_factor * (2**attempt)) | ||
| else: | ||
| return response | ||
| assert response is not None, "retries must be at least 1" | ||
| return response | ||
|
|
||
| def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response: | ||
| response: httpx.Response | None = None | ||
| for attempt in range(retries): | ||
| response = httpx.get(url, headers=headers) | ||
| if response.status_code == 502: | ||
| time.sleep(backoff_factor * (2**attempt)) | ||
| else: | ||
| return response | ||
| assert response is not None, "retries must be at least 1" |
There was a problem hiding this comment.
Using assert for validating retries is unsafe in production because asserts can be stripped with python -O. If retries is 0, this currently becomes an AssertionError (or would become returning None if asserts are removed), so callers can hit unexpected runtime behavior. Prefer an explicit guard like raising ValueError when retries < 1 and keep response non-optional thereafter.
| response: httpx.Response | None = None | |
| for attempt in range(retries): | |
| response = httpx.post(url, headers=headers, json=data) | |
| if response.status_code == 502: | |
| time.sleep(backoff_factor * (2**attempt)) | |
| else: | |
| return response | |
| assert response is not None, "retries must be at least 1" | |
| return response | |
| def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response: | |
| response: httpx.Response | None = None | |
| for attempt in range(retries): | |
| response = httpx.get(url, headers=headers) | |
| if response.status_code == 502: | |
| time.sleep(backoff_factor * (2**attempt)) | |
| else: | |
| return response | |
| assert response is not None, "retries must be at least 1" | |
| if retries < 1: | |
| raise ValueError("retries must be at least 1") | |
| for attempt in range(retries): | |
| response = httpx.post(url, headers=headers, json=data) | |
| if response.status_code == 502: | |
| time.sleep(backoff_factor * (2**attempt)) | |
| else: | |
| return response | |
| return response | |
| def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response: | |
| if retries < 1: | |
| raise ValueError("retries must be at least 1") | |
| for attempt in range(retries): | |
| response = httpx.get(url, headers=headers) | |
| if response.status_code == 502: | |
| time.sleep(backoff_factor * (2**attempt)) | |
| else: | |
| return response |
| def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response: | ||
| response: httpx.Response | None = None | ||
| for attempt in range(retries): | ||
| response = httpx.get(url, headers=headers) | ||
| if response.status_code == 502: | ||
| time.sleep(backoff_factor * (2**attempt)) | ||
| else: | ||
| return response | ||
| assert response is not None, "retries must be at least 1" | ||
| return response |
There was a problem hiding this comment.
Same concern as _post_request: assert response is not None is used for runtime validation and may be removed under optimization, reintroducing a None return / uninitialized variable scenario when retries is 0. Use an explicit if retries < 1: raise ValueError(...) guard instead of assert.
|
@asukaminato0721 |
Summary
dataset_keyword_tablein jieba keyword handlercut()return and provide default forNonetopKinvoke_llmoverload mismatch in multi-dataset function call routerpool_paramsdict type in oracle vector to acceptNonevaluesdescribe_collectionandget_server_versionreturns in milvus vectorTest plan
make type-check-corepasses with 0 errorsPart of #26412