refactor(api): add null safety to extractor_processor and firecrawl by tmimmanuel · Pull Request #35209 · langgenius/dify

tmimmanuel · 2026-04-14T18:06:39Z

Summary

- Add null guard for `dataset_keyword_table` in jieba keyword handler
- Cast untyped jieba `cut()` return and provide default for None `topK`
- Suppress `invoke_llm` overload mismatch in multi-dataset function call router
- Widen `pool_params` dict type in oracle vector to accept None values
- Cast `describe_collection` and `get_server_version` returns in milvus vector
- Suppress couchbase SDK type stub mismatches

Test plan

- `make type-check-core` passes with 0 errors
- All 237 affected unit tests pass (no test changes needed)
- No runtime behavior changes: type annotations, casts, and null guards only

Part of #26412

github-actions · 2026-04-14T18:08:25Z

Pyrefly Diff

base → PR

--- /tmp/pyrefly_base.txt	2026-04-14 18:08:08.803291603 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 18:07:57.682175706 +0000
@@ -78,18 +78,6 @@
    --> core/rag/datasource/keyword/jieba/jieba.py:157:29
 ERROR Argument `object` is not assignable to parameter `iterable` with type `Iterable[@_]` in function `list.__init__` [bad-argument-type]
   --> core/rag/datasource/keyword/jieba/jieba_keyword_table_handler.py:88:35
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:116:61
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:126:62
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:152:61
-ERROR `upload_file` may be uninitialized [unbound-name]
-   --> core/rag/extractor/extract_processor.py:158:62
-ERROR `response` may be uninitialized [unbound-name]
-   --> core/rag/extractor/firecrawl/firecrawl_app.py:183:16
-ERROR `response` may be uninitialized [unbound-name]
-   --> core/rag/extractor/firecrawl/firecrawl_app.py:192:16
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.post` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:106:25
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
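The `unbound-name` errors removed in this diff come from variables that are bound only inside a loop or branch. A minimal sketch of that pattern follows; `last_value` and `describe` are hypothetical names used for illustration and do not appear in the PR.

```python
def last_value(count: int) -> int:
    # `value` is bound only inside the loop body.
    for i in range(count):
        value = i
    # When count == 0 the loop body never runs, so `value` was never assigned.
    return value


def describe(count: int) -> str:
    # Report what happens at runtime for a given loop count.
    try:
        return f"ok: {last_value(count)}"
    except UnboundLocalError:
        return "unbound"
```

A type checker flags the `return value` line for the same reason it flagged `response` in the retry helpers: nothing guarantees the loop runs at least once.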

Copilot

Pull request overview

This PR improves static typing and null-safety around RAG extraction code paths (notably Firecrawl request helpers and file-based extraction) to reduce pyright/mypy issues as part of removing type-check ignores (#26412).

Changes:

- Add explicit non-None typing for Firecrawl HTTP responses across retry loops.
- Refactor `ExtractProcessor.extract()` to reuse `upload_file` local and add assertions to satisfy type-checking for PDF/DOCX extractors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `api/core/rag/extractor/firecrawl/firecrawl_app.py` | Makes retry helper response variables explicitly optional, then asserts non-null at loop end to satisfy typing. |
| `api/core/rag/extractor/extract_processor.py` | Introduces `upload_file` local and adds asserts before constructing extractors that require tenant/user context. |

Copilot · 2026-04-14T18:25:59Z

                    elif file_extension == ".pdf":
+                        assert upload_file is not None
                        extractor = PdfExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
                    elif file_extension in {".md", ".markdown", ".mdx"}:


The new `assert upload_file is not None` is used as a type-narrowing guard, but it is also a runtime check that can be stripped with `python -O`, which would allow `upload_file.tenant_id` / `created_by` to raise `AttributeError` later. Prefer an explicit `if upload_file is None: raise ValueError("upload_file is required for PDF extraction")` (with a message) rather than a bare assert.
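The stripping behavior described above can be reproduced without restarting the interpreter. This is a sketch only: it relies on the `optimize` argument of the built-in `compile()`, where level 1 corresponds to `python -O`, and the `SOURCE` snippet and `run` helper are illustrative names, not code from the PR.

```python
SOURCE = "assert upload_file is not None\nresult = 'reached'"


def run(optimize: int) -> str:
    # optimize=0 keeps assert statements; optimize=1 drops them, like `python -O`.
    namespace: dict[str, object] = {"upload_file": None}
    code = compile(SOURCE, "<sketch>", "exec", optimize=optimize)
    try:
        exec(code, namespace)
    except AssertionError:
        return "AssertionError"
    # Only reached when the assert did not fire.
    return str(namespace["result"])
```

With `optimize=0` the guard fires; with `optimize=1` execution continues past it even though `upload_file` is `None`.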

Copilot · 2026-04-14T18:26:00Z

                    elif file_extension == ".docx":
+                        assert upload_file is not None
                        extractor = WordExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
                    elif file_extension == ".doc":


Same issue for DOCX: the bare `assert upload_file is not None` is a runtime check that can be stripped, and it also raises `AssertionError` without context. Prefer an explicit exception with a clear message when `upload_file` is required for `WordExtractor` construction.
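One way to apply this suggestion to both the PDF and DOCX branches is a small narrowing helper. The sketch below is an assumption, not the merged code: `UploadFile` is a stand-in dataclass with only the two attributes the review mentions, and `require_upload_file` is a hypothetical name.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class UploadFile:
    # Stand-in for the real model; only the fields used in the review are included.
    tenant_id: str
    created_by: str


def require_upload_file(upload_file: UploadFile | None, file_type: str) -> UploadFile:
    # Explicit guard: survives `python -O` and carries a descriptive message.
    if upload_file is None:
        raise ValueError(f"upload_file is required for {file_type} extraction")
    # Type checkers narrow the return type to UploadFile here.
    return upload_file
```

A caller could then write `upload_file = require_upload_file(upload_file, "PDF")` before constructing the extractor, which narrows the type for the checker and keeps the runtime check in optimized mode.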

Copilot · 2026-04-14T18:26:00Z

                    elif file_extension == ".pdf":
+                        assert upload_file is not None
                        extractor = PdfExtractor(file_path, upload_file.tenant_id, upload_file.created_by)
                    elif file_extension in {".md", ".markdown", ".mdx"}:


This branch also uses a bare `assert upload_file is not None` before constructing `PdfExtractor`. For consistency and safety (asserts can be stripped), use an explicit guard/exception with a message when `upload_file` is required.

Copilot · 2026-04-14T18:26:00Z

                    elif file_extension == ".docx":
+                        assert upload_file is not None
                        extractor = WordExtractor(file_path, upload_file.tenant_id, upload_file.created_by)


The PR description says "No runtime behavior changes", but these new asserts change the failure mode when `file_path` is provided and `extract_setting.upload_file` is `None` (e.g., now `AssertionError` instead of an attribute error). If this is intended, it should be reflected in the description; otherwise, consider preserving the previous exception type or handling the missing `upload_file` case more gracefully.
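The change in failure mode can be seen in isolation. In this sketch both function names are hypothetical, and the second result assumes the interpreter is not running with `-O` (otherwise the assert is stripped and both functions report `AttributeError`).

```python
def failure_before(upload_file: object) -> str:
    # Previous behavior: attribute access on None fails directly.
    try:
        upload_file.tenant_id  # type: ignore[attr-defined]
    except Exception as exc:
        return type(exc).__name__
    return "no error"


def failure_after(upload_file: object) -> str:
    # New behavior: the assert fires first, so the exception type changes.
    try:
        assert upload_file is not None
        upload_file.tenant_id  # type: ignore[attr-defined]
    except Exception as exc:
        return type(exc).__name__
    return "no error"
```

Any caller that catches `AttributeError` specifically would therefore stop catching the missing `upload_file` case after this change.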

Copilot · 2026-04-14T18:26:01Z

+        response: httpx.Response | None = None
        for attempt in range(retries):
            response = httpx.post(url, headers=headers, json=data)
            if response.status_code == 502:
                time.sleep(backoff_factor * (2**attempt))
            else:
                return response
+        assert response is not None, "retries must be at least 1"
        return response

    def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
+        response: httpx.Response | None = None
        for attempt in range(retries):
            response = httpx.get(url, headers=headers)
            if response.status_code == 502:
                time.sleep(backoff_factor * (2**attempt))
            else:
                return response
+        assert response is not None, "retries must be at least 1"


Using `assert` to validate `retries` is unsafe in production because asserts can be stripped with `python -O`. If `retries` is 0, this currently raises `AssertionError` (or returns `None` if asserts are stripped), so callers can hit unexpected runtime behavior. Prefer an explicit guard, such as raising `ValueError` when `retries < 1`, and keep `response` non-optional thereafter.

Suggested change

-        response: httpx.Response | None = None
-        for attempt in range(retries):
-            response = httpx.post(url, headers=headers, json=data)
-            if response.status_code == 502:
-                time.sleep(backoff_factor * (2**attempt))
-            else:
-                return response
-        assert response is not None, "retries must be at least 1"
-        return response
-
-    def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
-        response: httpx.Response | None = None
-        for attempt in range(retries):
-            response = httpx.get(url, headers=headers)
-            if response.status_code == 502:
-                time.sleep(backoff_factor * (2**attempt))
-            else:
-                return response
-        assert response is not None, "retries must be at least 1"
+        if retries < 1:
+            raise ValueError("retries must be at least 1")
+        for attempt in range(retries):
+            response = httpx.post(url, headers=headers, json=data)
+            if response.status_code == 502:
+                time.sleep(backoff_factor * (2**attempt))
+            else:
+                return response
+        return response
+
+    def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
+        if retries < 1:
+            raise ValueError("retries must be at least 1")
+        for attempt in range(retries):
+            response = httpx.get(url, headers=headers)
+            if response.status_code == 502:
+                time.sleep(backoff_factor * (2**attempt))
+            else:
+                return response

Copilot · 2026-04-14T18:26:01Z

    def _get_request(self, url, headers, retries=3, backoff_factor=0.5) -> httpx.Response:
+        response: httpx.Response | None = None
        for attempt in range(retries):
            response = httpx.get(url, headers=headers)
            if response.status_code == 502:
                time.sleep(backoff_factor * (2**attempt))
            else:
                return response
+        assert response is not None, "retries must be at least 1"
        return response


Same concern as `_post_request`: `assert response is not None` is used for runtime validation and may be removed under optimization, reintroducing a `None` return / uninitialized variable scenario when `retries` is 0. Use an explicit `if retries < 1: raise ValueError(...)` guard instead of assert.
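A retry helper can satisfy both the type checker and the `-O` concern by validating `retries` up front and making the first request outside the loop, so `response` is always bound. The sketch below is one possible shape, not the merged implementation: `request_with_retry`, the injected `send` callable, and the `sleep` parameter are assumptions introduced so the example runs without a network. Unlike the original loop, it does not sleep after the final attempt.

```python
import time
from typing import Callable, Protocol, TypeVar


class HasStatus(Protocol):
    status_code: int


R = TypeVar("R", bound=HasStatus)


def request_with_retry(
    send: Callable[[], R],
    retries: int = 3,
    backoff_factor: float = 0.5,
    sleep: Callable[[float], object] = time.sleep,
) -> R:
    # Explicit guard instead of assert: still enforced under `python -O`.
    if retries < 1:
        raise ValueError("retries must be at least 1")
    # First attempt happens unconditionally, so `response` is never unbound.
    response = send()
    for attempt in range(retries - 1):
        if response.status_code != 502:
            return response
        # Exponential backoff before the next attempt.
        sleep(backoff_factor * (2**attempt))
        response = send()
    return response
```

In the Firecrawl helpers, `send` could be something like `lambda: httpx.post(url, headers=headers, json=data)`; that wiring is a suggestion rather than what the PR does.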

tmimmanuel · 2026-04-14T18:38:07Z

@asukaminato0721
Thanks for merging my PR.
Sorry for the many pings. Please review my previous PRs (#34809, #34702, #34796, #34938).
Thanks again.

tmimmanuel and others added 2 commits April 10, 2026 19:54

refactor(api): add null safety to extract_processor and firecrawl retry methods

2550f8b

Merge branch 'main' into refactor/pyright-core-rag-key

9b3eb00

tmimmanuel requested a review from JohnJyong as a code owner April 14, 2026 18:06

dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Apr 14, 2026

tmimmanuel changed the title ~~Refactor: pyright core rag key~~ refactor(api): add null safety to extractor_processor and firecrawl Apr 14, 2026

asukaminato0721 approved these changes Apr 14, 2026

dosubot bot added the lgtm This PR has been approved by a maintainer label Apr 14, 2026

asukaminato0721 added this pull request to the merge queue Apr 14, 2026

asukaminato0721 requested a review from Copilot April 14, 2026 18:23

asukaminato0721 self-assigned this Apr 14, 2026

Copilot started reviewing on behalf of asukaminato0721 April 14, 2026 18:23 View session

Copilot AI reviewed Apr 14, 2026

Merged via the queue into langgenius:main with commit ef28a63 Apr 14, 2026
33 of 34 checks passed

tmimmanuel deleted the refactor/pyright-core-rag-key branch April 14, 2026 18:38

asukaminato0721 merged 2 commits into langgenius:main from tmimmanuel:refactor/pyright-core-rag-key

3 participants
