Skip to content

refactor: strengthen extension and segment dict typing#35169

Open
bohdansolovie wants to merge 3 commits intolanggenius:mainfrom
bohdansolovie:refactor/typedict-segment-summary
Open

refactor: strengthen extension and segment dict typing#35169
bohdansolovie wants to merge 3 commits intolanggenius:mainfrom
bohdansolovie:refactor/typedict-segment-summary

Conversation

@bohdansolovie
Copy link
Copy Markdown
Contributor

Summary

  • Replace broad dict annotations in segment marshal helpers with a dedicated SegmentWithSummaryDict TypedDict.
  • Remove # type: ignore in extension factories by casting extension classes to concrete types (Moderation and ExternalDataTool).
  • Keep runtime behavior unchanged while improving static type safety and maintainability.

Changes

  • api/controllers/service_api/dataset/segment.py

    • Added SegmentWithSummaryDict (TypedDict) for marshaled segment payloads with optional summary.
    • Updated _marshal_segment_with_summary and _marshal_segments_with_summary return types to use the new typed structure.
    • Replaced generic local dict typing with cast(...) to align with the marshaled output shape.
  • api/core/moderation/factory.py

    • Typed extension_class via cast(type[Moderation], ...).
    • Removed type: ignore from validate_config call.
  • api/core/external_data_tool/factory.py

    • Typed extension_class via cast(type[ExternalDataTool], ...).
    • Removed type: ignore from validate_config call.

##Related Issues

Replace broad dict annotations with TypedDict/cast patterns in segment marshalling and extension factories to remove type ignores and improve static type safety without changing runtime behavior.
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. refactor labels Apr 14, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Pyrefly Diff

No changes detected.

@github-actions
Copy link
Copy Markdown
Contributor

Pyrefly Diff

No changes detected.

Drop redundant cast in query return path to satisfy basedpyright reportUnnecessaryCast without changing behavior.
@github-actions
Copy link
Copy Markdown
Contributor

Pyrefly Diff

No changes detected.

def _marshal_segment_with_summary(segment, dataset_id: str) -> SegmentWithSummaryDict:
"""Marshal a single segment and enrich it with summary content."""
segment_dict: dict[str, Any] = dict(marshal(segment, segment_fields)) # type: ignore[arg-type]
segment_dict = cast(SegmentWithSummaryDict, dict(marshal(segment, segment_fields))) # type: ignore[arg-type]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dosu do we need marshal segment, segment_fields here

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, marshal(segment, segment_fields) is needed here. The DocumentSegment model doesn't have a built-in to_dict() method, so marshal() from Flask-RESTX is the only mechanism to convert the SQLAlchemy ORM object into a serializable dictionary [1]. The segment_fields schema controls which fields appear in the response, handles nested objects like child_chunks and attachments, converts timestamps, and coerces types [2].

This is also the same pattern used in the console API counterpart [3], so it keeps behavior consistent across both API surfaces.

To reply, just mention @dosu.


How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other
Chat with Dosu Open in Cursor  Share on X

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asukaminato0721

@dosu will reply to this?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already replied above! To summarize: yes, marshal(segment, segment_fields) is needed here. The DocumentSegment model doesn't have a built-in to_dict() method, so marshal() is the mechanism that converts the SQLAlchemy ORM object into a serializable dictionary. The segment_fields schema controls which fields appear in the response, handles nested objects, converts timestamps, and coerces types. This is consistent with how the console API does it as well.

To reply, just mention @dosu.


How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other
Chat with Dosu Open in Cursor  Share on X

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dosu so it means, if DocumentSegment is TypeBase then this can be removed? Since it will become dataclass

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly — even after migrating to TypeBase (which uses MappedAsDataclass), marshal(segment, segment_fields) would still be needed for API responses. TypeBase models are still SQLAlchemy ORM models mapped to database tables, not pure dataclasses [1].

While dataclass-based models do gain asdict() for basic dict conversion, that doesn't give you the same thing as marshal() — specifically, control over which fields to expose, nested object serialization (child_chunks, attachments), and type coercion (e.g., timestamps). The codebase already has TypeBase models that still use marshal() in their API controllers — for example, WorkflowDraftVariableList is a frozen dataclass but is still served via @marshal_with() [2].

In the codebase, asdict() is only used for internal service purposes like queue payloads [3], not for API responses. So even once DocumentSegment is migrated to TypeBase, the marshal() call here would remain necessary to shape the API response correctly.

To reply, just mention @dosu.


How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other
Chat with Dosu Open in Cursor  Share on X

Comment thread api/controllers/service_api/dataset/segment.py
Comment thread api/core/external_data_tool/factory.py
@bohdansolovie
Copy link
Copy Markdown
Contributor Author

@dosu
reply to @asukaminato0721

@dosubot
Copy link
Copy Markdown

dosubot bot commented Apr 14, 2026

@asukaminato0721 I've already addressed all three of your review comments in the inline replies above. Here's a quick recap:

  1. marshal(segment, segment_fields) in _marshal_segment_with_summary — Yes, it's needed. DocumentSegment is a SQLAlchemy ORM object and marshal() is the mechanism to convert it into a serializable dict, selecting fields, formatting timestamps, and handling nested structures like child_chunks and attachments [1].

  2. marshal(segment, segment_fields) in _marshal_segments_with_summary — Same reason as above. The summary enrichment that follows also depends on the segment already being a dict [1].

  3. cast(type[ExternalDataTool], ...) in factory.py — Yes, the cast is needed because code_based_extension.extension_class() returns a generic type, and without the cast the type checker can't verify the constructor or validate_config calls. This replaces the old # type: ignore comments with explicit type narrowing [1].

Let me know if you have any further questions!

To reply, just mention @dosu.


How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other
Chat with Dosu Open in Cursor  Share on X

@bohdansolovie
Copy link
Copy Markdown
Contributor Author

Hi, @asukaminato0721
@dosu said all are needed.
what about your opinion?

@dosubot
Copy link
Copy Markdown

dosubot bot commented Apr 14, 2026

This question is directed at @asukaminato0721 for their reviewer opinion — I'll let them weigh in! My earlier analysis on why marshal() and cast() are needed still stands, but the final call on the review is theirs.

To reply, just mention @dosu.


How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other
Chat with Dosu Open in Cursor  Share on X

@bohdansolovie
Copy link
Copy Markdown
Contributor Author

bohdansolovie commented Apr 14, 2026

Hi, @asukaminato0721
Could you review my PR again please?
Thank you.

@bohdansolovie
Copy link
Copy Markdown
Contributor Author

@asukaminato0721
please check my PR again.

@bohdansolovie
Copy link
Copy Markdown
Contributor Author

hi, @asukaminato0721
could you review my PR please?

@bohdansolovie
Copy link
Copy Markdown
Contributor Author

hi, @asukaminato0721
If you have time, could you review my PR please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactor size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants