[FEATURE] DirectoryAsset BatchDefinition API #9888

Merged 15 commits into develop on May 7, 2024

Conversation

joshua-stauffer
Member

@joshua-stauffer joshua-stauffer commented May 7, 2024

This PR introduces a fluent-style BatchDefinition API to DirectoryAssets:

class DirectoryAsset:
    def add_batch_definition_daily(self, name: str, column: str) -> BatchDefinition: 
        ...

    def add_batch_definition_monthly(self, name: str, column: str) -> BatchDefinition:
        ...

    def add_batch_definition_yearly(self, name: str, column: str) -> BatchDefinition:
        ...

    def add_batch_definition_whole_directory(self, name: str) -> BatchDefinition:
        ...

Unlike other assets' BatchDefinition APIs, DirectoryAssets do not expose a sort parameter, since they require exact batch parameters and only ever return a single batch.
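
As a rough illustration of how the four methods above might be called, here is a minimal, self-contained sketch. The `DatePartitioner` class, its `granularity` field, and the bodies of the methods are hypothetical stand-ins written for this example; only the method names and signatures come from the description above, and the real implementation in the PR may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class DatePartitioner:
    """Hypothetical stand-in for a time-based partitioner (not the real class)."""

    column_name: str
    granularity: str  # "daily", "monthly", or "yearly"


@dataclass
class BatchDefinition:
    """Simplified stand-in: a name plus an optional partitioner."""

    name: str
    partitioner: Optional[DatePartitioner] = None


@dataclass
class DirectoryAsset:
    """Stub asset exposing the method signatures listed in the PR description."""

    name: str
    batch_definitions: List[BatchDefinition] = field(default_factory=list)

    def add_batch_definition(
        self, name: str, partitioner: Optional[DatePartitioner]
    ) -> BatchDefinition:
        batch_definition = BatchDefinition(name=name, partitioner=partitioner)
        self.batch_definitions.append(batch_definition)
        return batch_definition

    # Note: none of these methods accept a sort parameter.
    def add_batch_definition_daily(self, name: str, column: str) -> BatchDefinition:
        return self.add_batch_definition(name, DatePartitioner(column, "daily"))

    def add_batch_definition_monthly(self, name: str, column: str) -> BatchDefinition:
        return self.add_batch_definition(name, DatePartitioner(column, "monthly"))

    def add_batch_definition_yearly(self, name: str, column: str) -> BatchDefinition:
        return self.add_batch_definition(name, DatePartitioner(column, "yearly"))

    def add_batch_definition_whole_directory(self, name: str) -> BatchDefinition:
        return self.add_batch_definition(name=name, partitioner=None)


asset = DirectoryAsset(name="taxi_directory")
daily = asset.add_batch_definition_daily(name="daily", column="pickup_datetime")
whole = asset.add_batch_definition_whole_directory(name="everything")
```

In this sketch the whole-directory definition carries no partitioner, while the time-based definitions record the column to partition on.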

Followup Work

This PR does not test that the column a user indicates exists; that work is captured in V1-325.

Pre Work

This PR was preceded by #9874, which refactored the inheritance hierarchy of path-based assets to allow adding this feature.

netlify bot commented May 7, 2024

Deploy Preview for niobium-lead-7998 canceled.

Name Link
🔨 Latest commit c89c7c6
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/663a8603184d020008e3a4e1

@@ -36,6 +39,7 @@ class PartitionerDatetimePart(pydantic.BaseModel):
    column_name: str
    sort_ascending: bool = True
    method_name: Literal["partition_on_date_parts"] = "partition_on_date_parts"
    param_names: List[str] = []
Member Author

these Partitioners are not supported and will be removed soon; this attribute is a placeholder for mypy.

Member Author

found a cleaner solution that doesn't require changing these classes d2cd4ae

codecov bot commented May 7, 2024

Codecov Report

Attention: Patch coverage is 97.50000%, with 2 lines in your changes missing coverage. Please review.

Project coverage is 77.94%. Comparing base (86a5862) to head (c89c7c6).

Files                                                  Patch %   Lines
...tasource/fluent/data_asset/path/directory_asset.py  98.11%    1 Missing ⚠️
...tasource/fluent/data_asset/path/path_data_asset.py  87.50%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9888      +/-   ##
===========================================
+ Coverage    77.77%   77.94%   +0.17%     
===========================================
  Files          494      494              
  Lines        42389    42440      +51     
===========================================
+ Hits         32967    33080     +113     
+ Misses        9422     9360      -62     
Flag Coverage Δ
3.10 64.33% <68.75%> (+0.10%) ⬆️
3.10 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds ?
3.10 aws_deps ?
3.10 big ?
3.10 databricks ?
3.10 filesystem ?
3.10 mssql ?
3.10 mysql ?
3.10 postgresql ?
3.10 snowflake ?
3.10 spark ?
3.10 trino ?
3.11 64.33% <68.75%> (+0.10%) ⬆️
3.11 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds 54.10% <53.75%> (+0.10%) ⬆️
3.11 aws_deps 45.01% <63.75%> (+0.12%) ⬆️
3.11 big 55.97% <53.75%> (+0.10%) ⬆️
3.11 databricks 46.18% <53.75%> (+0.11%) ⬆️
3.11 filesystem 60.92% <65.00%> (+0.10%) ⬆️
3.11 mssql 48.98% <53.75%> (+0.11%) ⬆️
3.11 mysql 49.04% <53.75%> (+0.11%) ⬆️
3.11 postgresql 52.88% <53.75%> (+0.10%) ⬆️
3.11 snowflake ?
3.11 spark 57.48% <96.25%> (+0.19%) ⬆️
3.11 trino 50.96% <53.75%> (+0.11%) ⬆️
3.8 64.35% <68.75%> (+0.10%) ⬆️
3.8 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds 54.11% <53.75%> (+0.10%) ⬆️
3.8 aws_deps 45.02% <63.75%> (+0.12%) ⬆️
3.8 big 55.98% <53.75%> (+0.10%) ⬆️
3.8 databricks 46.20% <53.75%> (+0.11%) ⬆️
3.8 filesystem 60.93% <65.00%> (+0.10%) ⬆️
3.8 mssql 48.97% <53.75%> (+0.11%) ⬆️
3.8 mysql 49.02% <53.75%> (+0.11%) ⬆️
3.8 postgresql 52.86% <53.75%> (+0.10%) ⬆️
3.8 snowflake 46.80% <53.75%> (+0.11%) ⬆️
3.8 spark 57.44% <96.25%> (+0.20%) ⬆️
3.8 trino 50.94% <53.75%> (+0.11%) ⬆️
3.9 64.35% <68.75%> (+0.10%) ⬆️
3.9 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds ?
3.9 aws_deps ?
3.9 big ?
3.9 databricks ?
3.9 filesystem ?
3.9 mssql ?
3.9 mysql ?
3.9 postgresql ?
3.9 snowflake ?
3.9 spark ?
3.9 trino ?
cloud 0.00% <0.00%> (ø)
docs-basic 49.03% <63.75%> (+0.11%) ⬆️
docs-creds-needed 50.46% <65.00%> (+0.11%) ⬆️
docs-spark 48.55% <63.75%> (+0.11%) ⬆️


@@ -78,7 +77,7 @@ def test_batch_request_config_serialization_round_trips(
    batch_request_config: dict[str, Any] = {
        "datasource_name": datasource_name,
        "data_asset_name": data_asset_name,
        "partitioner": PartitionerColumnValue(column_name="my_column"),
Member Author

unrelated to this PR, but we don't intend to support this going forward, so i replaced it with a time-based partitioner.

    def add_batch_definition_whole_directory(self, name: str) -> BatchDefinition:
        """Add a BatchDefinition which creates a single batch for the entire directory."""
        return self.add_batch_definition(name=name, partitioner=None)

    @singledispatchmethod
Contributor

TIL. This seems like it accomplishes the same as using @overload, but now we get really clean small methods, rather than one big one that handles all the cases?

Member Author

i like the syntax, but tend to avoid it because typing always gets wonky. not sure what is going on in the implementation but mypy has a very difficult time reasoning about what type is being returned. Also, it's not possible to use keyword args in the signature, which is unfortunate.
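
For readers unfamiliar with the decorator discussed in this thread, here is a small standalone sketch of `functools.singledispatchmethod` from the standard library. The `Describer` class and its handlers are made up for illustration and are not taken from the PR. It shows the small per-type methods and the keyword-argument limitation mentioned above.

```python
from functools import singledispatchmethod


class Describer:
    """Toy class for illustration only; not part of the PR."""

    @singledispatchmethod
    def describe(self, value) -> str:
        # Fallback used when no handler is registered for the argument's type.
        raise NotImplementedError(f"unsupported type: {type(value).__name__}")

    @describe.register
    def _(self, value: int) -> str:
        return f"int: {value}"

    @describe.register
    def _(self, value: str) -> str:
        return f"str: {value}"


describer = Describer()
int_result = describer.describe(3)
str_result = describer.describe("abc")

# Dispatch inspects the first positional argument, so passing it by keyword
# fails. The exact exception type depends on the Python version.
try:
    describer.describe(value=3)
    keyword_call_failed = False
except (TypeError, IndexError):
    keyword_call_failed = True
```

Each registered handler stays short, at the cost of the typing and keyword-argument drawbacks raised in the reply above.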

Contributor

@tyler-hoffman tyler-hoffman left a comment

Nice! I left a couple smaller asks you can ignore if you disagree, but there's at least one method that is missing typing.


        return batch_spec_options

    def _add_partitioner_batch_parameters(self, batch_request, parameters) -> dict:
Contributor

Can you add types to the params here? Also non-blocking, but pure functions tend to be simpler to reason about; I'd consider just returning the dict containing partitioner_method and partitioner_kwargs and merging in the caller.

Member Author

wild, great catch, i thought types were a hard requirement on new code. And love the suggestion of not mutating the dict in the helper method, updated in c432636.
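
The suggestion in this thread (return the partitioner entries from a pure helper and merge them in the caller, rather than mutating a dict passed in) could look roughly like the sketch below. The function names `partitioner_options` and `batch_spec_options` and the `reader_method` key are hypothetical; only the `partitioner_method` and `partitioner_kwargs` keys come from the review comment, and the actual change in c432636 may be shaped differently.

```python
from typing import Any, Dict


def partitioner_options(method_name: str, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Pure helper: returns a new dict and leaves its inputs untouched."""
    return {
        "partitioner_method": method_name,
        "partitioner_kwargs": dict(kwargs),  # copy so the result does not alias the input
    }


def batch_spec_options(
    base: Dict[str, Any], method_name: str, kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """The caller owns the merge; `base` is not modified."""
    return {**base, **partitioner_options(method_name, kwargs)}


base_options = {"reader_method": "csv"}  # hypothetical starting options
merged = batch_spec_options(
    base_options, "partition_on_date_parts", {"column_name": "pickup_datetime"}
)
```

Because the helper never writes to its arguments, `base_options` is unchanged after the call, which is what makes the function easier to reason about and test.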

    def _get_sortable_partitioner(
        self, partitioner: Optional[PartitionerT]
    ) -> Optional[PartitionerSortingProtocol]:
        # allow subclasses to determine sorting configuration.
Contributor

Can this be an abstractmethod?

Member Author

In this case, yes (e89cf57) - but that makes me wonder why do we have a handful of other NotImplemented methods on this class, can we make them abstractmethods as well? And the answer is no, we can't, because Pandas dynamic assets are given the type FileAsset, but their implementations of these methods are dynamically generated, so mypy fails.

Contributor

oh wow good to know, thanks for digging in!
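
To make the trade-off in this thread concrete, the sketch below contrasts a method that raises `NotImplementedError` with one marked `@abstractmethod`. All class names and the `sort_key` method are invented for the example and do not appear in the PR. The practical difference is when a missing override is detected: at call time versus at instantiation.

```python
from abc import ABC, abstractmethod


class RuntimeChecked:
    """Base class using the NotImplementedError convention."""

    def sort_key(self) -> str:
        raise NotImplementedError


class InstantiationChecked(ABC):
    """Base class using abc.abstractmethod."""

    @abstractmethod
    def sort_key(self) -> str:
        ...


class ForgotRuntime(RuntimeChecked):
    pass  # missing override goes unnoticed until sort_key() is called


class ForgotAbstract(InstantiationChecked):
    pass  # missing override blocks instantiation


runtime_obj = ForgotRuntime()  # succeeds
try:
    runtime_obj.sort_key()
    failed_at_call = False
except NotImplementedError:
    failed_at_call = True

try:
    ForgotAbstract()
    failed_at_instantiation = False
except TypeError:
    failed_at_instantiation = True
```

The stricter abstract variant only works when every concrete subclass defines the method statically, which is the constraint described above for the dynamically generated Pandas assets.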



@pytest.fixture
def daily_batch_parameters_and_expected_result():
Contributor

I don't understand what these fixtures represent from looking at them. I'd probably rename something to the effect of expected_result -> known_matching_row_count or something along those lines. These tests that assert against real data are important, but it makes the reader have to guess where the values came from. Or maybe even stop having them be fixtures, since it looks like only one of them is reused.

Member Author

i'm all for in-lining fixtures, the more context in the test, the better c89c7c6

        ],
    )
    def test_get_batch_parameters_keys_with_partitioner(
        self, directory_asset, partitioner: Partitioner, expected_keys
Contributor

I miss these sometimes too, but probably good to get types on the test params.

Member Author

thanks! updated c89c7c6

@joshua-stauffer joshua-stauffer added this pull request to the merge queue May 7, 2024
Merged via the queue into develop with commit aad13e6 May 7, 2024
68 of 69 checks passed
@joshua-stauffer joshua-stauffer deleted the f/v1-306/directory_asset_fluent_api branch May 7, 2024 20:32

2 participants