fix(api): narrow WaterCrawl JSON responses for type checking #35113

Open
claytonlin1110 wants to merge 2 commits into langgenius:main from claytonlin1110:fix/watercrawl-provider-json-dict-narrowing

Conversation

@claytonlin1110

Summary

Fixes #32791

Problem

The return type of get_crawl_request (and related client helpers) is inferred from process_response as dict | bytes | list | None | Generator, so subscripting the result or calling .get on it triggers [bad-index] / unsupported-subscript type-check errors (see langgenius/dify#32791).
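
To illustrate the inference problem, here is a minimal sketch. The bodies of process_response and get_crawl_request below are hypothetical stand-ins (the real client code is not shown in this PR); only the union of return types is taken from the description above.

```python
from collections.abc import Generator
from typing import Union

# Union taken from the PR description; everything else here is illustrative.
ResponseValue = Union[dict, bytes, list, None, Generator]


def process_response(kind: str) -> ResponseValue:
    """Hypothetical stand-in: returns a different shape per response kind."""
    if kind == "json":
        return {"uuid": "abc", "status": "running"}
    if kind == "binary":
        return b"raw"
    return None


def get_crawl_request(kind: str = "json"):
    # No return annotation, so a type checker infers the full union
    # from process_response.
    return process_response(kind)


result = get_crawl_request()
# result["status"] would be flagged by a type checker: bytes, None and
# Generator do not support indexing with a str key.
status = result["status"] if isinstance(result, dict) else None
```

The isinstance check on the last line is the narrowing step; repeating it at every call site is what a shared helper avoids.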

Solution

Introduce _require_json_dict(value, operation=...), which checks isinstance(value, dict) and returns the value typed as dict[str, Any]; otherwise it raises WaterCrawlError. Use it for create_crawl_request, get_crawl_request, and get_crawl_request_results in WaterCrawlProvider.

Docs

Add api/agent-notes/core/rag/extractor/watercrawl/provider.py.md describing the invariant.

Checklist

  • This change requires a documentation update, included: Dify Document
  • I understand that this PR may be closed in case there was no previous discussion or issues. (This doesn't apply to typos!)
  • I've added a test for each change that was introduced, and I tried as much as possible to make a single atomic change.
  • I've updated the documentation accordingly.
  • I ran make lint && make type-check (backend) and cd web && pnpm exec vp staged (frontend) to appease the lint gods

@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 13, 2026
@claytonlin1110 claytonlin1110 force-pushed the fix/watercrawl-provider-json-dict-narrowing branch from f83ba1f to 55c3e1c Compare April 13, 2026 22:03
@github-actions
Contributor

Pyrefly Diff

base → PR
--- /tmp/pyrefly_base.txt	2026-04-14 03:01:09.428394502 +0000
+++ /tmp/pyrefly_pr.txt	2026-04-14 03:00:59.255399164 +0000
@@ -101,7 +101,7 @@
 ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
    --> core/rag/extractor/notion_extractor.py:371:21
 ERROR Argument `Unknown | None` is not assignable to parameter `result_object` with type `dict[str, Any]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
-   --> core/rag/extractor/watercrawl/provider.py:108:37
+   --> core/rag/extractor/watercrawl/provider.py:120:37
 ERROR Object of class `BaseOxmlElement` has no attribute `body` [missing-attribute]
    --> core/rag/extractor/word_extractor.py:426:24
 ERROR Object of class `Document` has no attribute `score` [missing-attribute]

@claytonlin1110
Author

@asukaminato0721 Would you please review?

Labels

size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Refactor/Chore] ERROR Cannot index into Generator[Unknown, None, None] [bad-index]

1 participant