-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUGFIX] Fluent PandasFilesytemDatasources
data_connector fixes
#7414
Conversation
✅ Deploy Preview for niobium-lead-7998 canceled.
|
@@ -350,6 +351,8 @@ class Datasource( | |||
|
|||
# class attrs | |||
asset_types: ClassVar[Sequence[Type[DataAsset]]] = [] | |||
# Not all Datasources require a DataConnector | |||
data_connector_type: ClassVar[Optional[Type[DataConnector]]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kilo59 If this addition is toward including DataConnector
in Fluent configuration, then we are not supposed to move in that direction (because in Fluent, DataConnector
is an implementational utility, not a core concept). If this is for some other purpose, then please let me know and I will study more carefully. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's about ensuring we don't need DataConnector
in the configuration.
At least directly. We do need to be able to provide options that the data connector can ingest in config (even if we don't call it a data_connector). But before this change you could not provide any asset level data_connector params in config (glob_directive
, prefix
, delimiter
, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my edification: Sounds like if a property is None
, it will not end up in Configuration? Or if this is incorrect, how does one specify a property "to be part of configuration" and "to not be part of configuration"? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We check this class attribute to know whether or not the Datasource and data assets need to utilize a data_connector. If it does, we need to build that specific DataConnector
with this attribute.
@@ -508,7 +517,7 @@ def parse_batching_regex_string( | |||
) -> re.Pattern: | |||
pattern: re.Pattern | |||
if not batching_regex: | |||
pattern = re.compile(".*") | |||
pattern = MATCH_ALL_PATTERN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move MATCH_ALL_PATTERN
to the default argument and remove this if branch? That is def parse_batching_regex_string(batching_regex: Union[re.Pattern, str] = MATCH_ALL_PATTERN)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can! 😄
But some of the other Datasource
types rely on it. Once they've all been updated to follow the pattern in this PR we can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kilo59 Thank you for this exploratory work. I identified one action item and raised a number of questions. Happy to discuss. Thanks!
@@ -528,7 +528,6 @@ def test_update_datasource_with_datasource_object( | |||
"csv_asset": { | |||
"batching_regex": "(?P<file_name>.*).csv", | |||
"name": "csv_asset", | |||
"order_by": [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was present because this value was being set in the manually defined add_csv_asset()
method.
Fields that are not set should be excluded from serialization, so this is the desired behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
✅ following our synchronous review
@@ -40,6 +40,10 @@ class DataConnector(ABC): | |||
data_asset_name: The name of the DataAsset using this DataConnector instance | |||
""" | |||
|
|||
# needed to select the asset level kwargs needed to build the DataConnector | |||
asset_level_option_keys: ClassVar[tuple[str, ...]] = () | |||
asset_options_type: ClassVar[Type] = dict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kilo59 If possible, could you please provide a rich docstring for asset_options_type
and asset_level_option_keys
here in the top-level DataConnector
module? The reason for this request is that whenever a reader sees a class variable, which does not appear to be used anywhere in the present module, the experience of interpreting the code slows down appreciably. We have an analogous situation in the Metrics component (I have made some effort at documenting the class variables used, but a lot more is in order). On a separate thought, is the class variable the best way of achieving our goal here? Could this be an indicator that a better / cleaner design is lurking somewhere and we just have not yet nailed it? Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree this needs to be well explained.
Unfortunately, we do have to do something like this (even if not exactly this) if we want to preserve the simultaneous goals of...
- No exponential explosion of DataAsset classes.
- Auto-complete for specific DataAsset Dataconnector connect options.
@@ -33,6 +39,9 @@ class FilesystemDataConnector(FilePathDataConnector): | |||
file_path_template_map_fn: Format function mapping path to fully-qualified resource on filesystem (optional) | |||
""" | |||
|
|||
asset_level_option_keys: ClassVar[tuple[str, ...]] = ("glob_directive",) | |||
asset_options_type: ClassVar[Type[FilesystemOptions]] = FilesystemOptions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kilo59 Yes, at least documenting for how asset_level_option_keys
and asset_level_option_keys
will be utilized would be nice (here, rather than have to search for it elsewhere). Thanks.
@@ -354,6 +354,8 @@ class Datasource( | |||
|
|||
# class attrs | |||
asset_types: ClassVar[Sequence[Type[DataAsset]]] = [] | |||
# Not all Datasources require a DataConnector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kilo59 Side comment -- I suspect that down the road all Datasource
would do well by utilizing some kind of a DataConnector
in order to achieve simpler, cleaner, and consistent method interfaces. Thanks.
@@ -462,7 +464,9 @@ def get_asset(self, asset_name: str) -> _DataAssetT: | |||
f"'{asset_name}' not found. Available assets are {list(self.assets.keys())}" | |||
) from exc | |||
|
|||
def add_asset(self, asset: _DataAssetT) -> _DataAssetT: | |||
def add_asset( | |||
self, asset: _DataAssetT, connect_options: dict | None = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kilo59 Unless it is available elsewhere, some documentation and/or examples about connect_options
would be great. Thanks.
@@ -27,6 +27,31 @@ fluent_datasources: | |||
order_by: | |||
- year | |||
- -month | |||
sqlite_taxi: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ Nice new use cases in the configuration! Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kilo59 Thank you for this thrust for a massive improvement. I spotted an oddity, requested docstrings, and asked some questions for my own understanding. Happy to discuss! Thanks!
self, data_asset: _FilePathDataAsset, glob_directive: str = "**/*", **kwargs | ||
) -> None: | ||
"""Builds and attaches the `FilesystemDataConnector` to the asset.""" | ||
if kwargs: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ I like this technique!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM -- awesome! Thanks! P.S.: Really looking forward to making time to disentangle the DataConnector
now covering both connecting to data and partitioning the assets).
Proof of concept for "fixing" the DataConnector abstraction
Changes proposed in this pull request:
_build_data_connector
connect_options
fieldconnect_options
field to where it's neededadd_<ASSET_TYPE>_asset()
methodsupdate stubsdone in [MAINTENANCE] PandasFilesystemDatasource stubs and Schema corrections #7428Notes
These changes are initially only for
PandasFilesytemDatasource
s. Followup PR's re-implement this pattern for the remaining datasources which use `DataConnectors. This will mostly entail deleting code 😄 .