New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-23024: Allow multiple datasets to be associated with a single file on ingest #218
Conversation
ea12654
to
88e3645
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good aside from some doc/annotation type stuff. And it seems you've duplicated butler.py
in addition to modifying _butler.py
(presumably when resolving a rebase conflict).
My rebase completely failed to notice |
4e8f1c6
to
26f1d7f
Compare
FileDataset now uses a list of DatasetRef and ingest iterates over them. Formatters can now access the associated dataId so that they can extract a subset from a file.
We never really expected `iterable(dict)` to return the keys of the dict and are not relying on this anywhere. Change the behavior so that `iterable(dict)` returns `[dict]`. This allows DataCoordinate to be passed to iterable.
26f1d7f
to
17e54a1
Compare
Previously we were writing a whole new file each time.
b78a7e9
to
0738d34
Compare
Fixes problem in Butler.remove where we were trying to remove everything and catching exceptions. Now we only try to remove things that we know the datastore was given.
0738d34
to
be69ac2
Compare
Adjust tests such that butler.remove remembers a component. This leads to a problem unless subsequent butler.remove on the composite checks for dataset existence first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; only minor comments, though this increases my unease with ChainedDatastore
as a way to associate multiple Datastores
with a butler, and I'd like to address that eventually.
(e.g. ``[i for i in a]``) | ||
- a `str` -> return single element iterable (e.g. ``[a]``) | ||
- a Mapping -> return single element iterable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes me a little nervous, as I think it's hard to remember what makes a class qualify as a Mapping
. Could we at least make this not the default behavior? Or maybe allow general isinstance
exceptions via an optional kwarg, i.e.:
iterable(x, except=collections.abc.Mapping)
(though that's not terribly readable either).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you ever want iterable to return the keys of a Mapping? The problem here is dataIds and compatbility with old YAML files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair point; I think you're right that this is the only behavior we'd ever actually want for mappings - my concern is more that it's not the behavior I would have naively expected. But it's not a big deal either way.
I do think Python mappings would be much cleaner if they weren't iterable at all and you had to explicitly say keys()
, but of course there's nothing we can do about that.
try: | ||
# There is a difference between a concrete composite and virtual | ||
# composite. In a virtual composite the datastore is never | ||
# given the top level DatasetRef. In the concrete composite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean there is a difference in behavior at the butler level when we try to delete different kinds of composites? Or is it just butler's responsibility rather than datastore's to iterate over components in the virtual composite case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In one case butler sends the composite ref to the datastore and datastore stores its own entries because it is writing a single file. In the virtual case (I still get these the wrong way round) the butler keeps the top level composite ref to itself and only sends the components to the datastore. Previously we were doing a "send all refs regardless" approach and mostly things were throwing exceptions because either the ref was never known to butler, or else after the first ref was removed the component refs were already deleted. Now at least the two paths are separate just as in getDirect.
# If a dataset was removed previously but remembered | ||
# in registry, skip the removal in the datastore. | ||
if self.registry.getDatasetLocations(r): | ||
self.datastore.remove(r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this attempts to remove the dataset from this particular datastore if it is present in any datastore. I think that's okay for now, but we should open a ticket to address it in the future. The more I think about it, the more I think we'd be better off if butler had a dict of datastores instead of having ChainedDatastore
, so it could refer to them by name in this and other places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine so long as the datastores used by the registry haven't changed because of a config change. I can filter the returned list by asking for the equivalent from the datastore itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I've fixed it now so that it does use the datastore name before deciding whether to remove.
This is only interesting for ChainedDatastore but it provides a standard way of getting all the names in a list from a datastore.
Prior to this change any dataset stored in a datastore would trigger the remove even if the butler had changed datastores with reconfiguration.
The most work was in changing all the code that used FileDataset.ref to instead use FileDataset.refs.
cc/ @parejkoj
I will now check what other packages need to change because of FileDataset changes.