Skip to content

refactor(api): add null safety to extractor_processor and firecrawl#35209

Merged
asukaminato0721 merged 2 commits intolanggenius:mainfrom
tmimmanuel:refactor/pyright-core-rag-key
Apr 14, 2026
Merged

refactor(api): add null safety to extractor_processor and firecrawl#35209
asukaminato0721 merged 2 commits intolanggenius:mainfrom
tmimmanuel:refactor/pyright-core-rag-key

Conversation

@tmimmanuel
Copy link
Copy Markdown
Contributor

Summary

  • Add null guard for dataset_keyword_table in jieba keyword handler
  • Cast untyped jieba cut() return and provide default for None topK
  • Suppress invoke_llm overload mismatch in multi-dataset function call router
  • Widen pool_params dict type in oracle vector to accept None values
  • Cast describe_collection and get_server_version returns in milvus vector
  • Suppress couchbase SDK type stub mismatches

Test plan

  • make type-check-core passes with 0 errors
  • All 237 affected unit tests pass (no test changes needed)
  • No runtime behavior changes — type annotations, casts, and null guards only

Part of #26412

@tmimmanuel tmimmanuel requested a review from JohnJyong as a code owner April 14, 2026 18:06
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-04-14 18:08:08.803291603 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 18:07:57.682175706 +0000
@@ -78,18 +78,6 @@
    --> core/rag/datasource/keyword/jieba/jieba.py:157:29
 ERROR Argument `object` is not assignable to parameter `iterable` with type `Iterable[@_]` in function `list.__init__` [bad-argument-type]
   --> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:35
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:116:61
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:126:62
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:152:61
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:158:62
-ERROR `response` may be uninitialized [unbound-name]
-   --> core/rag/extractor/firecrawl/firecrawl_app.py:183:16
-ERROR `response` may be uninitialized [unbound-name]
-   --> core/rag/extractor/firecrawl/firecrawl_app.py:192:16
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.post` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:106:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]

@tmimmanuel tmimmanuel changed the title Refactor: pyright core rag key refactor(api): add null safety to extractor_processor and firecrawl Apr 14, 2026
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Apr 14, 2026
@asukaminato0721 asukaminato0721 added this pull request to the merge queue Apr 14, 2026
@asukaminato0721 asukaminato0721 requested a review from Copilot April 14, 2026 18:23
@asukaminato0721 asukaminato0721 self-assigned this Apr 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves static typing and null-safety around RAG extraction code paths (notably Firecrawl request helpers and file-based extraction) to reduce pyright/mypy issues as part of removing type-check ignores (#26412).

Changes:

  • Add explicit non-None typing for Firecrawl HTTP responses across retry loops.
  • Refactor ExtractProcessor.extract() to reuse upload_file local and add assertions to satisfy type-checking for PDF/DOCX extractors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
api/core/rag/extractor/firecrawl/firecrawl_app.py Makes retry helper response variables explicitly optional then asserts non-null at loop end to satisfy typing.
api/core/rag/extractor/extract_processor.py Introduces upload_file local and adds asserts before constructing extractors that require tenant/user context.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 115 to 118
elif file_extension == ".pdf":
assert upload_file is not None
extractor = PdfExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
elif file_extension in {".md", ".markdown", ".mdx"}:
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new assert upload_file is not None is used as a type-narrowing guard but is also a runtime check that can be stripped with python -O, which would allow upload_file.tenant_id / created_by to raise AttributeError later. Prefer an explicit if upload_file is None: raise ValueError("upload_file is required for PDF extraction") (with a message) rather than a bare assert.

Copilot uses AI. Check for mistakes.
Comment on lines 126 to 129
elif file_extension == ".docx":
assert upload_file is not None
extractor = WordExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
elif file_extension == ".doc":
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue for DOCX: the bare assert upload_file is not None is a runtime check that can be stripped, and it also raises AssertionError without context. Prefer an explicit exception with a clear message when upload_file is required for WordExtractor construction.

Copilot uses AI. Check for mistakes.
Comment on lines 153 to 156
elif file_extension == ".pdf":
assert upload_file is not None
extractor = PdfExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
elif file_extension in {".md", ".markdown", ".mdx"}:
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch also uses a bare assert upload_file is not None before constructing PdfExtractor. For consistency and safety (asserts can be stripped), use an explicit guard/exception with a message when upload_file is required.

Copilot uses AI. Check for mistakes.
Comment on lines 160 to 162
elif file_extension == ".docx":
assert upload_file is not None
extractor = WordExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says "No runtime behavior changes", but these new asserts change the failure mode when file_path is provided and extract_setting.upload_file is None (e.g., now AssertionError instead of an attribute error). If this is intended, it should be reflected in the description; otherwise, consider preserving the previous exception type or handling the missing upload_file case more gracefully.

Copilot uses AI. Check for mistakes.
Comment on lines +177 to +195
response: httpx.Response | None = None
for attempt in range(retries):
response = httpx.post(url, headers=headers, json=data)
if response.status_code == 502:
time.sleep(backoff_factor * (2**attempt))
else:
return response
assert response is not None, "retries must be at least 1"
return response

def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
response: httpx.Response | None = None
for attempt in range(retries):
response = httpx.get(url, headers=headers)
if response.status_code == 502:
time.sleep(backoff_factor * (2**attempt))
else:
return response
assert response is not None, "retries must be at least 1"
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using assert for validating retries is unsafe in production because asserts can be stripped with python -O. If retries is 0, this currently becomes an AssertionError (or would become returning None if asserts are removed), so callers can hit unexpected runtime behavior. Prefer an explicit guard like raising ValueError when retries < 1 and keep response non-optional thereafter.

Suggested change
response: httpx.Response | None = None
for attempt in range(retries):
response = httpx.post(url, headers=headers, json=data)
if response.status_code == 502:
time.sleep(backoff_factor * (2**attempt))
else:
return response
assert response is not None, "retries must be at least 1"
return response
def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
response: httpx.Response | None = None
for attempt in range(retries):
response = httpx.get(url, headers=headers)
if response.status_code == 502:
time.sleep(backoff_factor * (2**attempt))
else:
return response
assert response is not None, "retries must be at least 1"
if retries < 1:
raise ValueError("retries must be at least 1")
for attempt in range(retries):
response = httpx.post(url, headers=headers, json=data)
if response.status_code == 502:
time.sleep(backoff_factor * (2**attempt))
else:
return response
return response
def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
if retries < 1:
raise ValueError("retries must be at least 1")
for attempt in range(retries):
response = httpx.get(url, headers=headers)
if response.status_code == 502:
time.sleep(backoff_factor * (2**attempt))
else:
return response

Copilot uses AI. Check for mistakes.
Comment on lines 187 to 196
def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
response: httpx.Response | None = None
for attempt in range(retries):
response = httpx.get(url, headers=headers)
if response.status_code == 502:
time.sleep(backoff_factor * (2**attempt))
else:
return response
assert response is not None, "retries must be at least 1"
return response
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern as _post_request: assert response is not None is used for runtime validation and may be removed under optimization, reintroducing a None return / uninitialized variable scenario when retries is 0. Use an explicit if retries < 1: raise ValueError(...) guard instead of assert.

Copilot uses AI. Check for mistakes.
Merged via the queue into langgenius:main with commit ef28a63 Apr 14, 2026
33 of 34 checks passed
@tmimmanuel
Copy link
Copy Markdown
Contributor Author

@asukaminato0721
Thanks for merging my PR.
Sorry for many pings. Please review my previous PRs (#34809 #34702 #34796 #34938).
Thanks again.

@tmimmanuel tmimmanuel deleted the refactor/pyright-core-rag-key branch April 14, 2026 18:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm This PR has been approved by a maintainer size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants