[BUGFIX] Fluent PandasFilesystemDatasources data_connector fixes #7414

Merged
Kilo59 merged 42 commits into develop from f/great-1750/dc-design
Mar 23, 2023

Kilo59
Member

@Kilo59 Kilo59 commented Mar 21, 2023

Proof of concept for "fixing" the DataConnector abstraction

Changes proposed in this pull request:

Notes

These changes initially apply only to PandasFilesystemDatasources. Follow-up PRs will re-implement this pattern for the remaining datasources that use `DataConnectors`. This will mostly entail deleting code 😄.

@netlify

netlify bot commented Mar 21, 2023

Deploy Preview for niobium-lead-7998 canceled.

Name Link
🔨 Latest commit 28ea231
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/641c82381bc54d0007ead4c6


@@ -350,6 +351,8 @@ class Datasource(

# class attrs
asset_types: ClassVar[Sequence[Type[DataAsset]]] = []
# Not all Datasources require a DataConnector
data_connector_type: ClassVar[Optional[Type[DataConnector]]] = None
Contributor

@Kilo59 If this addition is toward including DataConnector in Fluent configuration, then we are not supposed to move in that direction (because in Fluent, DataConnector is an implementational utility, not a core concept). If this is for some other purpose, then please let me know and I will study more carefully. Thanks.

Member Author

@Kilo59 Kilo59 Mar 21, 2023

No, it's about ensuring we don't need DataConnector in the configuration, at least not directly.
We do need to be able to provide options in config that the data connector can ingest (even if we don't call it a data_connector). Before this change, you could not provide any asset-level data_connector params in config (glob_directive, prefix, delimiter, etc.).

Contributor

For my edification: Sounds like if a property is None, it will not end up in Configuration? Or if this is incorrect, how does one specify a property "to be part of configuration" and "to not be part of configuration"? Thanks.

Member Author

We check this class attribute to know whether or not the Datasource and data assets need to utilize a data_connector. If it does, we need to build that specific DataConnector with this attribute.
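
A minimal sketch of the check described above, assuming a datasource simply inspects its class attribute before building anything. Every class and method name here (`DatasourceSketch`, `build_connector`, and so on) is hypothetical and stands in for the project's real classes, whose bodies are not shown in this conversation.

```python
from __future__ import annotations

from typing import ClassVar, Optional, Type


class DataConnectorSketch:
    """Hypothetical stand-in for a DataConnector."""

    def __init__(self, data_asset_name: str) -> None:
        self.data_asset_name = data_asset_name


class FilesystemConnectorSketch(DataConnectorSketch):
    """Hypothetical stand-in for a filesystem-specific connector."""


class DatasourceSketch:
    # Not all datasources require a data connector, so the default is None.
    data_connector_type: ClassVar[Optional[Type[DataConnectorSketch]]] = None

    def build_connector(self, asset_name: str) -> Optional[DataConnectorSketch]:
        """Build the connector named by the class attribute, if there is one."""
        connector_type = type(self).data_connector_type
        if connector_type is None:
            return None  # this datasource does not use a data connector
        return connector_type(data_asset_name=asset_name)


class PandasFilesystemSketch(DatasourceSketch):
    # Subclasses opt in by pointing the attribute at a concrete connector type.
    data_connector_type = FilesystemConnectorSketch


class SqlSketch(DatasourceSketch):
    """Keeps the default of None, so no connector is built."""
```

Because the attribute is a `ClassVar`, it describes the datasource type rather than any one instance, which is consistent with it staying out of the serialized configuration.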

@@ -508,7 +517,7 @@ def parse_batching_regex_string(
 ) -> re.Pattern:
     pattern: re.Pattern
     if not batching_regex:
-        pattern = re.compile(".*")
+        pattern = MATCH_ALL_PATTERN
Contributor

❤️

Contributor

Can we move MATCH_ALL_PATTERN to the default argument and remove this if branch? That is def parse_batching_regex_string(batching_regex: Union[re.Pattern, str] = MATCH_ALL_PATTERN).

Member Author

@Kilo59 Kilo59 Mar 22, 2023

Yes we can! 😄
But some of the other Datasource types rely on it. Once they've all been updated to follow the pattern in this PR we can remove it.
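
For reference, the suggested follow-up might look roughly like the sketch below. It assumes `MATCH_ALL_PATTERN` is a module-level compiled `".*"` pattern (as the replaced line suggests) and that the function only needs to compile strings. The real function body is not shown in this thread, so treat this as an illustration of the default-argument idea only.

```python
import re
from typing import Union

# Assumed definition: a compiled pattern that matches any string.
MATCH_ALL_PATTERN = re.compile(".*")


def parse_batching_regex_string(
    batching_regex: Union[re.Pattern, str] = MATCH_ALL_PATTERN,
) -> re.Pattern:
    """Return a compiled pattern, defaulting to match-all when omitted."""
    if isinstance(batching_regex, str):
        return re.compile(batching_regex)
    return batching_regex
```

One behavioral difference worth noting: with the `if not batching_regex` branch removed, callers that pass `None` or an empty string explicitly no longer get the match-all pattern, which is presumably why the other Datasource types need updating first.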

Contributor

@alexsherstinsky alexsherstinsky left a comment

@Kilo59 Thank you for this exploratory work. I identified one action item and raised a number of questions. Happy to discuss. Thanks!

@Kilo59 Kilo59 changed the title from "[BUGFIX/FEATURE??] Fluent Datasources - DataConnector usage update" to "[BUGFIX/FEATURE??] Fluent Datasources - DataConnector abstraction update" Mar 22, 2023
@@ -528,7 +528,6 @@ def test_update_datasource_with_datasource_object(
 "csv_asset": {
     "batching_regex": "(?P<file_name>.*).csv",
     "name": "csv_asset",
-    "order_by": [],
Member Author

@Kilo59 Kilo59 Mar 23, 2023

This was present because this value was being set in the manually defined add_csv_asset() method.
Fields that are not set should be excluded from serialization, so this is the desired behavior.
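
The behavior described here can be illustrated with a small standard-library sketch. It does not use the project's actual model classes or serialization mechanism, which are not shown in this thread. `AssetConfigSketch` and `to_config` are hypothetical names, and tracking explicitly passed fields in a set is just one assumed way to get the effect.

```python
from __future__ import annotations

from typing import Any


class AssetConfigSketch:
    """Hypothetical asset model that remembers which fields were set explicitly."""

    # Defaults that exist on the object but should not leak into the config.
    _defaults: dict[str, Any] = {"batching_regex": ".*", "order_by": ()}

    def __init__(self, name: str, **fields: Any) -> None:
        unknown = set(fields) - set(self._defaults)
        if unknown:
            raise TypeError(f"Unexpected fields: {sorted(unknown)}")
        self._values = {"name": name, **self._defaults, **fields}
        # Only the caller-provided fields count as "set".
        self._fields_set = {"name", *fields}

    def to_config(self) -> dict[str, Any]:
        """Serialize only the fields that were explicitly provided."""
        return {k: v for k, v in self._values.items() if k in self._fields_set}
```

Under this scheme, a helper such as `add_csv_asset()` that passes `order_by=[]` explicitly marks the field as set, which matches the explanation for why the key used to show up in the serialized output.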

Contributor

@NathanFarmer NathanFarmer left a comment

✅ following our synchronous review

@Kilo59 Kilo59 enabled auto-merge (squash) March 23, 2023 15:37
@@ -40,6 +40,10 @@ class DataConnector(ABC):
data_asset_name: The name of the DataAsset using this DataConnector instance
"""

# needed to select the asset level kwargs needed to build the DataConnector
asset_level_option_keys: ClassVar[tuple[str, ...]] = ()
asset_options_type: ClassVar[Type] = dict
Contributor

@Kilo59 If possible, could you please provide a rich docstring for asset_options_type and asset_level_option_keys here in the top-level DataConnector module? The reason for this request is that whenever a reader sees a class variable, which does not appear to be used anywhere in the present module, the experience of interpreting the code slows down appreciably. We have an analogous situation in the Metrics component (I have made some effort at documenting the class variables used, but a lot more is in order). On a separate thought, is the class variable the best way of achieving our goal here? Could this be an indicator that a better / cleaner design is lurking somewhere and we just have not yet nailed it? Thank you.

Member Author

Yes, I agree this needs to be well explained.

Unfortunately, we do have to do something like this (even if not exactly this) if we want to preserve the simultaneous goals of...

  1. No exponential explosion of DataAsset classes.
  2. Auto-complete for specific DataAsset DataConnector connect options.
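
To make the two class variables concrete, here is a hedged sketch of how they could be used together. It assumes `asset_level_option_keys` is used to pick the connector's options out of an asset's config and that `asset_options_type` is a typed mapping an editor can use for auto-complete. The `TypedDict` shape, the `*Sketch` classes, and `select_connect_options` are all hypothetical; the real `FilesystemOptions` definition is not shown in this thread.

```python
from __future__ import annotations

from typing import Any, ClassVar, TypedDict


class FilesystemOptionsSketch(TypedDict, total=False):
    """Hypothetical shape of the options a filesystem connector accepts."""

    glob_directive: str


class DataConnectorSketch:
    # Keys of asset-level kwargs needed to build this connector.
    asset_level_option_keys: ClassVar[tuple[str, ...]] = ()
    # Type describing those kwargs, usable for static typing and auto-complete.
    asset_options_type: ClassVar[type] = dict


class FilesystemConnectorSketch(DataConnectorSketch):
    asset_level_option_keys = ("glob_directive",)
    asset_options_type = FilesystemOptionsSketch


def select_connect_options(
    connector_type: type[DataConnectorSketch], asset_config: dict[str, Any]
) -> dict[str, Any]:
    """Keep only the config entries that the given connector type declares."""
    return {
        key: asset_config[key]
        for key in connector_type.asset_level_option_keys
        if key in asset_config
    }
```

With this arrangement a single asset class can be paired with different connectors, since each connector declares its own keys instead of requiring one asset subclass per connector.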

@@ -33,6 +39,9 @@ class FilesystemDataConnector(FilePathDataConnector):
file_path_template_map_fn: Format function mapping path to fully-qualified resource on filesystem (optional)
"""

asset_level_option_keys: ClassVar[tuple[str, ...]] = ("glob_directive",)
asset_options_type: ClassVar[Type[FilesystemOptions]] = FilesystemOptions
Contributor

@alexsherstinsky alexsherstinsky Mar 23, 2023

@Kilo59 Yes, at least documenting how asset_level_option_keys and asset_options_type will be utilized would be nice (here, rather than having to search for it elsewhere). Thanks.

@@ -354,6 +354,8 @@ class Datasource(

# class attrs
asset_types: ClassVar[Sequence[Type[DataAsset]]] = []
# Not all Datasources require a DataConnector
Contributor

@Kilo59 Side comment -- I suspect that down the road all Datasources would do well to utilize some kind of DataConnector in order to achieve simpler, cleaner, and more consistent method interfaces. Thanks.

@@ -462,7 +464,9 @@ def get_asset(self, asset_name: str) -> _DataAssetT:
f"'{asset_name}' not found. Available assets are {list(self.assets.keys())}"
) from exc

-    def add_asset(self, asset: _DataAssetT) -> _DataAssetT:
+    def add_asset(
+        self, asset: _DataAssetT, connect_options: dict | None = None
Contributor

@Kilo59 Unless it is available elsewhere, some documentation and/or examples about connect_options would be great. Thanks.

@@ -27,6 +27,31 @@ fluent_datasources:
order_by:
- year
- -month
sqlite_taxi:
Contributor

❤️ Nice new use cases in the configuration! Thanks!

Contributor

@alexsherstinsky alexsherstinsky left a comment

@Kilo59 Thank you for this thrust for a massive improvement. I spotted an oddity, requested docstrings, and asked some questions for my own understanding. Happy to discuss! Thanks!

self, data_asset: _FilePathDataAsset, glob_directive: str = "**/*", **kwargs
) -> None:
"""Builds and attaches the `FilesystemDataConnector` to the asset."""
if kwargs:
Contributor

❤️ I like this technique!
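
A sketch of the technique, for readers who have not seen it: named parameters cover the options this connector understands, and the `**kwargs` catch-all exists only so that unknown options can be rejected with a clear error. What the real `if kwargs:` branch does is not visible in this hunk, so raising `TypeError` is an assumption, and `AssetSketch` and `build_data_connector` are hypothetical names.

```python
from __future__ import annotations

from typing import Any


class AssetSketch:
    """Hypothetical stand-in for a file-path data asset."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data_connector: dict[str, Any] | None = None


def build_data_connector(
    data_asset: AssetSketch, glob_directive: str = "**/*", **kwargs: Any
) -> None:
    """Builds and attaches a (stand-in) connector description to the asset."""
    if kwargs:
        # Assumed behavior: options this connector does not know are an error.
        raise TypeError(
            f"build_data_connector() got unexpected keyword argument(s): {list(kwargs)}"
        )
    data_asset.data_connector = {
        "data_asset_name": data_asset.name,
        "glob_directive": glob_directive,
    }


# Usage: connect options collected elsewhere are unpacked into the builder.
asset = AssetSketch("csv_asset")
connect_options = {"glob_directive": "*.csv"}
build_data_connector(asset, **connect_options)
```

The benefit is that a misspelled or unsupported option fails loudly at the point where the connector is built instead of being silently ignored.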

Contributor

@alexsherstinsky alexsherstinsky left a comment

LGTM -- awesome! Thanks! P.S.: Really looking forward to making time to disentangle the DataConnector (now covering both connecting to data and partitioning the assets).

@Kilo59 Kilo59 merged commit 492fb1e into develop Mar 23, 2023
@Kilo59 Kilo59 deleted the f/great-1750/dc-design branch March 23, 2023 17:40
4 participants