refactor(api): type WaterCrawl API responses with TypedDict #33700
asukaminato0721 merged 2 commits into langgenius:main
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the type safety and clarity of the WaterCrawl API integration by replacing untyped
Pyrefly Diff
base → PR
--- /tmp/pyrefly_base.txt 2026-03-18 20:11:36.786117604 +0000
+++ /tmp/pyrefly_pr.txt 2026-03-18 20:11:27.652002056 +0000
@@ -399,44 +399,44 @@
ERROR Argument `dict[str, bytes | str]` is not assignable to parameter `headers` with type `Headers | Mapping[bytes, bytes] | Mapping[str, str] | Sequence[tuple[bytes, bytes]] | Sequence[tuple[str, str]] | None` in function `httpx._api.request` [bad-argument-type]
--> core/rag/extractor/notion_extractor.py:368:21
ERROR Cannot index into `Generator[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/client.py:188:54
+ --> core/rag/extractor/watercrawl/client.py:210:54
ERROR Cannot index into `bytes` [bad-index]
- --> core/rag/extractor/watercrawl/client.py:188:54
+ --> core/rag/extractor/watercrawl/client.py:210:54
ERROR Cannot index into `list[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/client.py:188:54
+ --> core/rag/extractor/watercrawl/client.py:210:54
ERROR `None` is not subscriptable [unsupported-operation]
- --> core/rag/extractor/watercrawl/client.py:188:54
+ --> core/rag/extractor/watercrawl/client.py:210:54
ERROR Object of class `Generator` has no attribute `get`
ERROR Cannot index into `Generator[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:47:12
+ --> core/rag/extractor/watercrawl/provider.py:70:12
ERROR Cannot index into `bytes` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:47:12
+ --> core/rag/extractor/watercrawl/provider.py:70:12
ERROR Cannot index into `list[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:47:12
+ --> core/rag/extractor/watercrawl/provider.py:70:12
ERROR `None` is not subscriptable [unsupported-operation]
- --> core/rag/extractor/watercrawl/provider.py:47:12
+ --> core/rag/extractor/watercrawl/provider.py:70:12
ERROR Object of class `Generator` has no attribute `get`
ERROR Object of class `Generator` has no attribute `get`
ERROR Object of class `Generator` has no attribute `get`
ERROR Object of class `Generator` has no attribute `get`
-ERROR Argument `Generator[Unknown] | bytes | dict[Unknown, Unknown] | list[Unknown] | Unknown | None` is not assignable to parameter `result_object` with type `dict[Unknown, Unknown]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
- --> core/rag/extractor/watercrawl/provider.py:87:37
+ERROR Argument `Generator[Unknown] | bytes | dict[Unknown, Unknown] | list[Unknown] | Unknown | None` is not assignable to parameter `result_object` with type `dict[str, Any]` in function `WaterCrawlProvider._structure_data` [bad-argument-type]
+ --> core/rag/extractor/watercrawl/provider.py:110:37
ERROR Cannot index into `Generator[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:110:20
+ --> core/rag/extractor/watercrawl/provider.py:135:20
ERROR Cannot index into `bytes` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:110:20
+ --> core/rag/extractor/watercrawl/provider.py:135:20
ERROR Cannot index into `list[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:110:20
+ --> core/rag/extractor/watercrawl/provider.py:135:20
ERROR `None` is not subscriptable [unsupported-operation]
- --> core/rag/extractor/watercrawl/provider.py:110:20
+ --> core/rag/extractor/watercrawl/provider.py:135:20
ERROR Cannot index into `Generator[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:116:16
+ --> core/rag/extractor/watercrawl/provider.py:141:16
ERROR Cannot index into `bytes` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:116:16
+ --> core/rag/extractor/watercrawl/provider.py:141:16
ERROR Cannot index into `list[Unknown]` [bad-index]
- --> core/rag/extractor/watercrawl/provider.py:116:16
+ --> core/rag/extractor/watercrawl/provider.py:141:16
ERROR `None` is not subscriptable [unsupported-operation]
- --> core/rag/extractor/watercrawl/provider.py:116:16
+ --> core/rag/extractor/watercrawl/provider.py:141:16
ERROR Pyrefly detected conflicting types while breaking a dependency cycle: `str | Any | None` is not assignable to `None`. Adding explicit type annotations might possibly help. [bad-assignment]
--> core/rag/extractor/word_extractor.py:371:13
ERROR Pyrefly detected conflicting types while breaking a dependency cycle: `str | Any | None` is not assignable to `None`. Adding explicit type annotations might possibly help. [bad-assignment]
@@ -3346,6 +3346,8 @@
--> tests/unit_tests/core/datasource/test_datasource_manager.py:573:34
ERROR Object of class `StreamChunkEvent` has no attribute `node_run_result` [missing-attribute]
--> tests/unit_tests/core/datasource/test_datasource_manager.py:624:12
+ERROR `in` is not supported between `Literal['Single Page']` and `None` [not-iterable]
+ --> tests/unit_tests/core/datasource/test_website_crawl.py:989:16
ERROR Argument `Iterator[DatasourceMessage]` is not assignable to parameter `messages` with type `Generator[DatasourceMessage]` in function `core.datasource.utils.message_transformer.DatasourceFileMessageTransformer.transform_datasource_invoke_messages` [bad-argument-type]
--> tests/unit_tests/core/datasource/utils/test_message_transformer.py:28:26
ERROR Object of class `BlobChunkMessage` has no attribute `text`
Code Review
This pull request is a solid step towards improving type safety by introducing TypedDict for WaterCrawl API responses. The changes are well-implemented across client.py, provider.py, and website_service.py. My review includes a few suggestions to further enhance type specificity by replacing bare dict annotations with dict[str, Any], which aligns with the goals of this refactoring.
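The `dict[str, Any]` suggestion and the remaining Pyrefly errors about indexing into `Generator`, `bytes`, `list`, or `None` both come down to how precisely a value's type is declared. The following is a minimal sketch, assuming a hypothetical `fetch()` helper with a broad union return type; it is not the actual WaterCrawl client code, only an illustration of how an `isinstance` check narrows such a union before indexing.

```python
from __future__ import annotations

from typing import Any


def fetch(kind: str) -> dict[str, Any] | list[Any] | bytes | None:
    """Hypothetical helper returning a broad union, for illustration only."""
    if kind == "json":
        return {"uuid": "abc", "status": "running"}
    if kind == "raw":
        return b"payload"
    return None


def job_status(kind: str) -> str:
    result = fetch(kind)
    # Indexing `result` directly would be flagged by a type checker, because
    # bytes, list, and None do not support string keys.
    if not isinstance(result, dict):
        raise TypeError(f"expected a JSON object, got {type(result).__name__}")
    # After the isinstance check, `result` is narrowed to dict[str, Any].
    return str(result["status"])
```

Annotating the parameter or return value as `dict[str, Any]` rather than bare `dict` at least fixes the key type, which is the direction the review suggests.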
Summary
Types the WaterCrawl API request and response dicts in `client.py` and `provider.py` using explicit TypedDicts as part of the broader effort (#32863) to replace bare `dict`/`Mapping` annotations across `api/core/rag`.

`client.py` and `provider.py` had untyped `dict` for both crawl configuration options (`spider_options`, `page_options`) and all response shapes, making the WaterCrawl API contract invisible to the type system. Typing these enables IDE autocompletion on response fields and catches field-name mistakes at type-check time rather than at runtime.

Changes
- `client.py`: Add `SpiderOptions` TypedDict (`max_depth`, `page_limit`, `allowed_domains`, `exclude_paths`, `include_paths`) and `PageOptions` TypedDict (all page-level crawl settings); update `create_crawl_request()` and `scrape_url()` param types
- `provider.py`: Add `WatercrawlDocumentData`, `CrawlJobResponse`, `WatercrawlCrawlStatusResponse` TypedDicts; update `crawl_url()`, `get_crawl_status()`, `get_crawl_url_data()`, `scrape_url()`, `_structure_data()`, and `_get_results()` return types; annotate `spider_options` and `page_options` local variables
- `website_service.py`: Wrap TypedDict returns with `dict()` at the `dict[str, Any]` boundary (basedpyright requirement)

Test plan
- `make lint` passes
- `make type-check` passes (basedpyright + pyrefly + mypy: 0 errors)
- `uv run --project api pytest api/tests/unit_tests/core/rag/extractor/watercrawl/`: 27 tests pass

Part of #32863 (`core/rag/extractor/watercrawl/`)
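
The pattern described above can be sketched roughly as follows. The `SpiderOptions` field names are taken from the description, but their value types are assumed; the `CrawlJobResponse` fields, the body of `crawl_url()`, and the `service_boundary()` function are hypothetical stand-ins, since the real definitions are not shown here.

```python
from typing import Any, TypedDict


class SpiderOptions(TypedDict, total=False):
    # Field names come from the PR description; the value types are assumed.
    max_depth: int
    page_limit: int
    allowed_domains: list[str]
    exclude_paths: list[str]
    include_paths: list[str]


class CrawlJobResponse(TypedDict):
    # Hypothetical fields; the real response shape is not shown in the PR text.
    status: str
    job_id: str


def crawl_url(url: str, spider_options: SpiderOptions) -> CrawlJobResponse:
    """Stand-in for the provider method; it performs no network call."""
    depth = spider_options.get("max_depth", 1)
    return {"status": "active", "job_id": f"{url}#depth={depth}"}


def service_boundary(url: str) -> dict[str, Any]:
    """Hypothetical caller that must return a plain dict[str, Any]."""
    options: SpiderOptions = {"max_depth": 2, "page_limit": 10}
    # dict() copies the TypedDict into a plain dict for callers typed as
    # dict[str, Any], mirroring the wrap described for website_service.py.
    return dict(crawl_url(url, options))
```

With definitions like these, a misspelled key such as `"max_dept"` in `options` would be reported by the type checker instead of silently being ignored at runtime.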