Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-27140 Add named subsets to pipelines #154

Merged
merged 2 commits into from Oct 21, 2020
Merged

DM-27140 Add named subsets to pipelines #154

merged 2 commits into from Oct 21, 2020

Conversation

natelust
Copy link
Contributor

Add a directive to pipeline documents that allows a subset of
labels in the pipeline to be declared with a name. This
modification allows portions of a pipeline to be run with using
this named subset instead of specifying a list of labels.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from a preliminary perspective; don't have much to say structurally. I haven't done a close read for e.g. docstrings etc. yet.

for label in labelSpecifier.labels:
if label in pipeline._pipelineIR.named_subsets:
toRemove.add(label)
toAdd.update(pipeline._pipelineIR.named_subsets[label].subset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you thought about whether this should be recursive (i.e. can subsets nest)? I don't see a compelling reason why it should be, but it's probably something that should be a conscious choice rather than an implementation artifact.

Copy link
Member

@TallJimbo TallJimbo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, but lots of little style nitpicks.

# is needed because a user may only configure the tasks they intend
# to run, which may cause some contracts to fail if they will later
# be dropped
pipeline = copy.copy(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure a one-level deep shallow copy is the right level of copying here?

(I always find copy.copy to be scary, and I might go so far as to call it an antipattern, because whether it is or is not the right amount of copying to do is really an implementation detail in general.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had deep copy then switched to copy because I knew the implementation details (there are only a few top level containers of things that nothing really modifies) but it probably is better to always deep copy to avoid any future issues that may arise, it certainly is worth the small amount of memory overhead.

closed = True
return pipeline
labelSet.add(label)
if labelSpecifier.end and label == labelSpecifier.end:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the first test here could be labelSpecifier.end is not None, it probably should be.

"""A set of task labels contained in this subset.
"""
description: Optional[str]
"""A description of what this sub set of tasks is intended to do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"sub set" -> "subset"

@staticmethod
def from_yaml(label: str, value: Union[List[str], dict]) -> LabeledSubset:
"""This static method generates `LabeledSubset` objects given a properly
formatted object that as been created by a yaml loader.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

----------
label : `str`
The label that will be used to identify this labeled subset.
value : `list` of str or `dict`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add backticks around str.

subset = value
description = None
else:
raise ValueError(f"There was a problem parsing the labeled subset {label}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a very helpful error message. Can you at least report what syntax was expected?


Parameters
----------
loaded_yaml: `Mapping`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, I think this needs to be MutableMapping (and you should probably state below that it's modified in-place).

raise ValueError(f"There was a problem parsing the labeled subset {label}")
return LabeledSubset(label, set(subset), description)

def to_primitives(self) -> dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the "to" method to_primitives, but the "from" method from_yaml? It looks like "primitives" and "yaml" mean the same thing, given that from_yaml operates on a dict or list rather than a file or string.

# existing labeled subsets, and with existing task labels.
overlapping_subsets = accumulate_labeled_subsets.keys() & tmp_IR.labeled_subsets.keys()
task_subset_overlap = (accumulate_labeled_subsets.keys() | tmp_IR.labeled_subsets.keys()) \
& accumulate_tasks.keys()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An extra parentheses for line continuation is preferred over backslash.


# create a copy of the object to iterate over
labeled_subsets = copy.copy(pipeline.labeled_subsets)
# remove any labeled subsets that no longer have a complete set
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like the right behavior, but it's sufficiently non-obvious that it should probably be mentioned in the docstring as well.

Add a directive to pipeline documents that allows a subset of
labels in the pipeline to be declared with a name. This
modification allows portions of a pipeline to be run with using
this named subset instead of specifying a list of labels.
There were a few modules issues with documentation generation,
and some documentation was not being generated at all. This
commit makes sure all the appropriate documentation builds.
@natelust natelust merged commit e1cbb79 into master Oct 21, 2020
@timj timj deleted the tickets/DM-27140 branch April 13, 2022 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants