New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-38953: Dynamic connection support and miscellaneous cleanups #325
Conversation
070126f
to
6e151e6
Compare
Codecov ReportPatch coverage has no change and project coverage change:
Additional details and impacted files@@ Coverage Diff @@
## main #325 +/- ##
=========================================
- Coverage 82.13% 0 -82.14%
=========================================
Files 60 0 -60
Lines 6723 0 -6723
Branches 1373 0 -1373
=========================================
- Hits 5522 0 -5522
+ Misses 923 0 -923
+ Partials 278 0 -278 ☔ View full report in Codecov by Sentry. |
1745152
to
4c54a65
Compare
2736066
to
8e23725
Compare
6d2c714
to
5a19dd5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would still like to think/talk about the dispatching of types, but I do think this is an improvement from what is there.
|
||
|
||
@dataclasses.dataclass(frozen=True) | ||
class InitOutput(BaseConnection): | ||
pass | ||
_connection_type_set: ClassVar[str] = "initOutputs" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I like duplicating all the information about the class, just to have a string of it. I am really not a fan of _ attributes declared at class scope, and especially no in dataclasses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any problem with private class attributes, but I'll concede having to use getattr
on a string isn't great. Would you prefer a method like
def _get_connections_set(self, connections: PipelineTaskConnectionsMetaclass | PipelineTaskConnections) -> set[str]:
return connections.initOutputs
?
As for the duplication, I think the BaseConnection
hierarchy and PipelineTaskConnections
are already super tightly coupled since one is a descriptor that assumes a lot about the internals of the other, and having one attribute or method on each BaseConnection
is much better than a series of isinstance
checks (or the slightly prettier, but otherwise identical match
equivalent) somewhere in PipelineTaskConnetions
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but that _get_connections_set
method idea doesn't work inside PipelineTaskConnectionDict
, where we do need the string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nate and I discussed this in person, so for posterity:
Nate's type
-keyed mapping idea doesn't quite work as-is, because there are different sets on PipelineTaskConnectionsMetaclass
and PipelineTaskConnections
. It could be made to work as a mapping with string values, but that brings in some of the disadvantages of the approach that's already on the branch, so we've decided to just stick with what's there now, since it then boils down to just slight philosophical/stylistic differences in preferred programming paradigms.
self.data["inputs"] = set() | ||
self.data["prerequisiteInputs"] = set() | ||
self.data["outputs"] = set() | ||
self.data["initInputs"] = set() | ||
self.data["initOutputs"] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is sort of duplicated info here as well, but a bit differently, and it is hidden as an implementation detail, not a public visible thing on a dataclass.
self.data["inputs"] = set() | |
self.data["prerequisiteInputs"] = set() | |
self.data["outputs"] = set() | |
self.data["initInputs"] = set() | |
self.data["initOutputs"] = set() | |
aliasMap = {} | |
self.data["inputs"] = aliasMap[Input]= set() | |
self.data["prerequisiteInputs"] = aliasMap[PrerequisiteInput] = set() | |
self.data["outputs"] = aliasMap[Output] = set() | |
self.data["initInputs"] = aliasMap[InitInput] = set() | |
self.data["initOutputs"] = aliasMap[InitOutput] = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"public visible on a data class" is why I made it a private attribute.
object.__setattr__(value, "varName", name) | ||
self.data["allConnections"][name] = value | ||
self.data[value._connection_type_set].add(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then
self.data[value._connection_type_set].add(name) | |
self.aliasMap[type(value)].add(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if using type
is more or less of an antipattern here than using ininstance
/ match
, but I do think they're all antipatterns. Sure, sometimes that kind of inverted dispatch is the best solution, and I used it a ton in daf_relation (for instance). But I think it's what you use only you've determined that the traditional OO "delegate to a possilbly-overridden method or attribute" doesn't work well.
instance.initOutputs = set(cls.initOutputs) | ||
|
||
# Set self.config. It's a bit strange that we claim to accept None but | ||
# really just raise here, but it's not worth changing now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a fragment from wanting kw only config= behavior, but it not letting you not have something after =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, so we could remove it now that we can say __init__(self, *, config)
. I did actually try that last weekend in my "too ambitious" version of these changes, but in the end I don't think I want to do it here without also going and trying to fix downstream methods that copied the signature and also take None
, and I definitely don't want to do that.
if isinstance(value, BaseConnection): | ||
previous = self._allConnections.get(name) | ||
try: | ||
getattr(self, value._connection_type_set).add(name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could put alasMap
in the class dictionary, possibly as __aliasMap
and use it here as well. I'm not saying you have to do this, just hoping to spark some ideas around this topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(see related thread)
Sorting was added here in response to a request from me to make pipeline outputs more deterministic, so refactored pipelines could be compared to the originals by diffing their on-disk forms. But sorting was just a tiny fraction of that, and it turns out it wasn't really possible to make this pipeline form usefully comparable anyway, since that would require expanding out all config overrides. Deterministic comparisons will possible with the expanded output format being added on DM-33027, and in the meantime removing the sorting here makes writes preserve order, which is actually more useful since that's usually a human-friendly order.
This changes the type name of the generated connections config class, not the name of the config field. That type name is usually invisible, and it's not a terrible useful type name as it can't be imported directly anyway. But it does appear in the AttributeError message when using the wrong name for a connection, so having a meaningful type name makes that error message far more useful, as it tells you which task actually had a problem.
This is motivated by the plan to add a PipelineGraph class that will end up taking over some of the responsibilities of PipelineGraph and TaskDef, and while that's not being done on this ticket, I think this refactor is good more generally.
These are sets on instances and are annotated as sets on the class, so it looks like it's this usage of list that's wrong.
This is private, but it's never used anywhere.
This is not the name of the class; it's just PipelineTaskConnections.
There was already nothing stopping subclasses from assigning directly to self.dimensions and shadowing the class dimensions, except mypy and a lack of documentation. But making it an explicit instance `set` object allows for in-place updates in the same manner as the type-of-connection sets (e.g. inputs, outputs), which is nice.
This addresses a bug that left the allConnections attribute on PipelineTaskConnections instances holding the class-level (i.e. unconfigured) BaseConnection objects instead of their configured counterparts. It then goes a step further by using __setattr__ and __delattr__ methods on PipelineTaskConnections and a PipelineTaskConnectionsMetaclass.__call__ method to keep connection attributes, connection-type name sets (e.g. inputs, outputs), and the allConnections mapping in sync throughout and after __init__ as much as possible (they're fully consistent after __init__, but we don't try to watch for updates to the connection-type name sets during __init__). A side effect of this is that the "construct on first use and then cache" behavior of accessing the BaseConnection descriptor interface has been replaced by simpler logic in the metaclass __call__ that just constructs the configured instance connection objects up front, allowing the BaseConnection descriptor interface to delegate to the allConnections instead of adding another one.
This brings this page in line with all other docs. I think it may still play a bit fast and loose with "dataset" vs "dataset type", but I also think it might be a bit of a distraction to be too pedantic about that too early in a tutorial.
3691b3c
to
e47edb9
Compare
e47edb9
to
67b6095
Compare
Checklist
doc/changes