Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-38953: Dynamic connection support and miscellaneous cleanups #325

Merged
merged 20 commits into from May 10, 2023

Conversation

TallJimbo
Copy link
Member

@TallJimbo TallJimbo commented Apr 29, 2023

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@codecov
Copy link

codecov bot commented Apr 29, 2023

Codecov Report

Patch coverage has no change and project coverage change: -82.14 ⚠️

Comparison is base (3b0cfa4) 82.13% compared to head (e47edb9) 0.00%.

❗ Current head e47edb9 differs from pull request most recent head 67b6095. Consider uploading reports for the commit 67b6095 to get more accurate results

Additional details and impacted files
@@            Coverage Diff            @@
##             main   #325       +/-   ##
=========================================
- Coverage   82.13%      0   -82.14%     
=========================================
  Files          60      0       -60     
  Lines        6723      0     -6723     
  Branches     1373      0     -1373     
=========================================
- Hits         5522      0     -5522     
+ Misses        923      0      -923     
+ Partials      278      0      -278     

see 60 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@TallJimbo TallJimbo force-pushed the tickets/DM-38953 branch 4 times, most recently from 1745152 to 4c54a65 Compare May 2, 2023 14:48
@TallJimbo TallJimbo force-pushed the tickets/DM-38953 branch 7 times, most recently from 2736066 to 8e23725 Compare May 8, 2023 21:04
@TallJimbo TallJimbo changed the title DM-38953: miscellaneous cleanups from pipeline graph work DM-38953: Dynamic connection support and miscellaneous cleanups May 9, 2023
@TallJimbo TallJimbo force-pushed the tickets/DM-38953 branch 2 times, most recently from 6d2c714 to 5a19dd5 Compare May 9, 2023 15:14
Copy link
Contributor

@natelust natelust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still like to think/talk about the dispatching of types, but I do think this is an improvement from what is there.



@dataclasses.dataclass(frozen=True)
class InitOutput(BaseConnection):
pass
_connection_type_set: ClassVar[str] = "initOutputs"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I like duplicating all the information about the class, just to have a string of it. I am really not a fan of _ attributes declared at class scope, and especially no in dataclasses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any problem with private class attributes, but I'll concede having to use getattr on a string isn't great. Would you prefer a method like

def _get_connections_set(self, connections: PipelineTaskConnectionsMetaclass | PipelineTaskConnections) -> set[str]:
    return connections.initOutputs

?

As for the duplication, I think the BaseConnection hierarchy and PipelineTaskConnections are already super tightly coupled since one is a descriptor that assumes a lot about the internals of the other, and having one attribute or method on each BaseConnection is much better than a series of isinstance checks (or the slightly prettier, but otherwise identical match equivalent) somewhere in PipelineTaskConnetions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but that _get_connections_set method idea doesn't work inside PipelineTaskConnectionDict, where we do need the string.

Copy link
Member Author

@TallJimbo TallJimbo May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nate and I discussed this in person, so for posterity:

Nate's type-keyed mapping idea doesn't quite work as-is, because there are different sets on PipelineTaskConnectionsMetaclass and PipelineTaskConnections. It could be made to work as a mapping with string values, but that brings in some of the disadvantages of the approach that's already on the branch, so we've decided to just stick with what's there now, since it then boils down to just slight philosophical/stylistic differences in preferred programming paradigms.

Comment on lines +81 to +85
self.data["inputs"] = set()
self.data["prerequisiteInputs"] = set()
self.data["outputs"] = set()
self.data["initInputs"] = set()
self.data["initOutputs"] = set()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it is sort of duplicated info here as well, but a bit differently, and it is hidden as an implementation detail, not a public visible thing on a dataclass.

Suggested change
self.data["inputs"] = set()
self.data["prerequisiteInputs"] = set()
self.data["outputs"] = set()
self.data["initInputs"] = set()
self.data["initOutputs"] = set()
aliasMap = {}
self.data["inputs"] = aliasMap[Input]= set()
self.data["prerequisiteInputs"] = aliasMap[PrerequisiteInput] = set()
self.data["outputs"] = aliasMap[Output] = set()
self.data["initInputs"] = aliasMap[InitInput] = set()
self.data["initOutputs"] = aliasMap[InitOutput] = set()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"public visible on a data class" is why I made it a private attribute.

object.__setattr__(value, "varName", name)
self.data["allConnections"][name] = value
self.data[value._connection_type_set].add(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then

Suggested change
self.data[value._connection_type_set].add(name)
self.aliasMap[type(value)].add(name)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if using type is more or less of an antipattern here than using ininstance / match, but I do think they're all antipatterns. Sure, sometimes that kind of inverted dispatch is the best solution, and I used it a ton in daf_relation (for instance). But I think it's what you use only you've determined that the traditional OO "delegate to a possilbly-overridden method or attribute" doesn't work well.

instance.initOutputs = set(cls.initOutputs)

# Set self.config. It's a bit strange that we claim to accept None but
# really just raise here, but it's not worth changing now.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a fragment from wanting kw only config= behavior, but it not letting you not have something after =

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, so we could remove it now that we can say __init__(self, *, config). I did actually try that last weekend in my "too ambitious" version of these changes, but in the end I don't think I want to do it here without also going and trying to fix downstream methods that copied the signature and also take None, and I definitely don't want to do that.

if isinstance(value, BaseConnection):
previous = self._allConnections.get(name)
try:
getattr(self, value._connection_type_set).add(name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you could put alasMap in the class dictionary, possibly as __aliasMap and use it here as well. I'm not saying you have to do this, just hoping to spark some ideas around this topic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting was added here in response to a request from me to make
pipeline outputs more deterministic, so refactored pipelines could be
compared to the originals by diffing their on-disk forms.  But sorting
was just a tiny fraction of that, and it turns out it wasn't really
possible to make this pipeline form usefully comparable anyway, since
that would require expanding out all config overrides.  Deterministic
comparisons will possible with the expanded output format being added
on DM-33027, and in the meantime removing the sorting here makes
writes preserve order, which is actually more useful since that's
usually a human-friendly order.
This changes the type name of the generated connections config class,
not the name of the config field.  That type name is usually
invisible, and it's not a terrible useful type name as it can't be
imported directly anyway.  But it does appear in the AttributeError
message when using the wrong name for a connection, so having a
meaningful type name makes that error message far more useful, as it
tells you which task actually had a problem.
This is motivated by the plan to add a PipelineGraph class that will
end up taking over some of the responsibilities of PipelineGraph and
TaskDef, and while that's not being done on this ticket, I think this
refactor is good more generally.
These are sets on instances and are annotated as sets on the class,
so it looks like it's this usage of list that's wrong.
This is private, but it's never used anywhere.
This is not the name of the class; it's just PipelineTaskConnections.
There was already nothing stopping subclasses from assigning directly
to self.dimensions and shadowing the class dimensions, except mypy and
a lack of documentation.  But making it an explicit instance `set`
object allows for in-place updates in the same manner as the
type-of-connection sets (e.g. inputs, outputs), which is nice.
This addresses a bug that left the allConnections attribute on
PipelineTaskConnections instances holding the class-level (i.e.
unconfigured) BaseConnection objects instead of their configured
counterparts.

It then goes a step further by using __setattr__ and __delattr__
methods on PipelineTaskConnections and a
PipelineTaskConnectionsMetaclass.__call__ method to keep connection
attributes, connection-type name sets (e.g. inputs, outputs), and the
allConnections mapping in sync throughout and after __init__ as much as
possible (they're fully consistent after __init__, but we don't try to
watch for updates to the connection-type name sets during __init__).

A side effect of this is that the "construct on first use and then
cache" behavior of accessing the BaseConnection descriptor interface
has been replaced by simpler logic in the metaclass __call__ that
just constructs the configured instance connection objects up front,
allowing the BaseConnection descriptor interface to delegate to the
allConnections instead of adding another one.
This brings this page in line with all other docs.  I think it may
still play a bit fast and loose with "dataset" vs "dataset type", but
I also think it might be a bit of a distraction to be too pedantic
about that too early in a tutorial.
@TallJimbo TallJimbo force-pushed the tickets/DM-38953 branch 2 times, most recently from 3691b3c to e47edb9 Compare May 10, 2023 20:10
@TallJimbo TallJimbo merged commit 5702c59 into main May 10, 2023
5 checks passed
@TallJimbo TallJimbo deleted the tickets/DM-38953 branch May 10, 2023 21:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants