-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] DirectoryAsset BatchDefinition API #9888
Conversation
✅ Deploy Preview for niobium-lead-7998 canceled.
|
@@ -36,6 +39,7 @@ class PartitionerDatetimePart(pydantic.BaseModel): | |||
column_name: str | |||
sort_ascending: bool = True | |||
method_name: Literal["partition_on_date_parts"] = "partition_on_date_parts" | |||
param_names: List[str] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these Partitioners are not supported will be removed soon; this attribute is a placeholder for mypy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
found a cleaner solution that doesn't require changing these classes d2cd4ae
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #9888 +/- ##
===========================================
+ Coverage 77.77% 77.94% +0.17%
===========================================
Files 494 494
Lines 42389 42440 +51
===========================================
+ Hits 32967 33080 +113
+ Misses 9422 9360 -62
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -78,7 +77,7 @@ def test_batch_request_config_serialization_round_trips( | |||
batch_request_config: dict[str, Any] = { | |||
"datasource_name": datasource_name, | |||
"data_asset_name": data_asset_name, | |||
"partitioner": PartitionerColumnValue(column_name="my_column"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated to this PR, but we don't intend to support this going forward, so i replaced it with a time-based partitioner.
def add_batch_definition_whole_directory(self, name: str) -> BatchDefinition: | ||
"""Add a BatchDefinition which creates a single batch for the entire directory.""" | ||
return self.add_batch_definition(name=name, partitioner=None) | ||
@singledispatchmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL. This seems like it accomplishes the same as using @overload
, but now we get really clean small methods, rather than one big one that handles all the cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i like the syntax, but tend to avoid it because typing always gets wonky. not sure what is going on in the implementation but mypy has a very difficult time reasoning about what type is being returned. Also, it's not possible to use keyword args in the signature, which is unfortunate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I left a couple smaller asks you can ignore if you disagree, but there's at least one method that is missing typing.
|
||
return batch_spec_options | ||
|
||
def _add_partitioner_batch_parameters(self, batch_request, parameters) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add types to the params here? Also non-blocking, but pure functions tend to be simpler to reason about; I'd consider just returning the dict containing partitioner_method
and partitioner_kwargs
and merging in the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wild, great catch, i thought types were a hard requirement on new code. And love the suggestion of not mutating the dict in the helper method, updated in c432636.
def _get_sortable_partitioner( | ||
self, partitioner: Optional[PartitionerT] | ||
) -> Optional[PartitionerSortingProtocol]: | ||
# allow subclasses to determine sorting configuration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an abstractmethod
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case, yes (e89cf57) - but that makes me wonder why do we have a handful of other NotImplemented
methods on this class, can we make them abstractmethod
s as well? And the answer is no, we can't, because Pandas dynamic assets are given the type FileAsset
, but their implementations of these methods are dynamically generated, so mypy fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wow good to know, thanks for digging in!
|
||
|
||
@pytest.fixture | ||
def daily_batch_parameters_and_expected_result(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what these fixtures represent from looking at them. I'd probably rename something to the effect of expected_result
-> known_matching_row_count
or something along those lines. These tests that assert against real data are important, but it makes the reader have to guess where the vales came from. Or maybe even stop having them be fixtures, since it looks like only one of them is reused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm all for in-lining fixtures, the more context in the test, the better c89c7c6
], | ||
) | ||
def test_get_batch_parameters_keys_with_partitioner( | ||
self, directory_asset, partitioner: Partitioner, expected_keys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I miss these sometimes too, but probably good to get types on the test params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! updated c89c7c6
This PR introduces a fluent-style BatchDefinition API to DirectoryAssets:
Unlike other asset's BatchDefinition APIs, DirectoryAssets do not expose a sort parameter, since they require exact batch parameters and only ever return a single batch.
Followup Work
This PR does not test that the column a user indicates exists; that work is captured in V1-325.
Pre Work
This PR was preceded by #9874, which refactored the inheritance hierarchy of path-based assets to allow adding this feature.