DM-23024: Allow multiple datasets to be associated with a single file on ingest #218

Merged: 13 commits into master on Jan 16, 2020

Conversation

@timj (Member) commented Jan 13, 2020

The most work was in changing all the code that used FileDataset.ref to instead use FileDataset.refs.

cc/ @parejkoj

I will now check what other packages need to change because of FileDataset changes.
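The change described above (the singular `FileDataset.ref` becoming a list `FileDataset.refs`, with ingest iterating over it) can be sketched roughly as follows. This is a simplified stand-in, not the real `lsst.daf.butler.FileDataset` class:

```python
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Tuple


@dataclass
class FileDataset:
    """Simplified stand-in: one file path now carries a list of refs."""

    path: str
    refs: List[Any] = field(default_factory=list)  # previously a single `ref`


def ingest(datasets: List[FileDataset]) -> Iterator[Tuple[str, Any]]:
    """Sketch of ingest: iterate over every ref associated with each file."""
    for dataset in datasets:
        for ref in dataset.refs:
            yield dataset.path, ref
```

With this shape, a single on-disk file can register several datasets at once, which is the point of the ticket.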

@TallJimbo (Member) left a comment

Looks good aside from some doc/annotation type stuff. And it seems you've duplicated butler.py in addition to modifying _butler.py (presumably when resolving a rebase conflict).

Resolved review comments on: python/lsst/daf/butler/core/formatter.py (×2), python/lsst/daf/butler/core/repoTransfers.py
@timj (Author) commented Jan 13, 2020

My rebase completely failed to notice _butler.py turning up. 👎

@timj force-pushed the tickets/DM-23024 branch 2 times, most recently from 4e8f1c6 to 26f1d7f on January 14, 2020 at 02:19
FileDataset now uses a list of DatasetRef and ingest iterates
over them.
Formatters can now access the associated dataId so that they
can extract a subset from a file.
We never really expected `iterable(dict)` to return the keys of
the dict and are not relying on this anywhere. Change the
behavior so that `iterable(dict)` returns `[dict]`.

This allows DataCoordinate to be passed to iterable.
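The semantics described in this commit message can be sketched as below. This is a minimal illustration of the documented behavior (strings and mappings are treated as single elements), not the actual `lsst.daf.butler` implementation:

```python
from collections.abc import Mapping
from typing import Any, Iterator


def iterable(a: Any) -> Iterator[Any]:
    """Make input iterable, treating str and Mapping as single elements.

    A Mapping (such as a dict-like dataId) is yielded whole rather than
    iterating over its keys, so a DataCoordinate-style mapping passed in
    comes back out as a one-element sequence.
    """
    if isinstance(a, (str, Mapping)):
        yield a
        return
    try:
        yield from a
    except TypeError:
        # Not iterable at all: wrap it in a single-element sequence.
        yield a
```

For example, `list(iterable({"visit": 42}))` yields the whole dict, whereas plain iteration over that dict would have produced its keys.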
@timj force-pushed the tickets/DM-23024 branch 2 times, most recently from b78a7e9 to 0738d34 on January 15, 2020 at 19:15
Fixes problem in Butler.remove where we were trying to remove
everything and catching exceptions. Now we only try to remove
things that we know the datastore was given.
Adjust tests such that butler.remove remembers a component.
This leads to a problem unless subsequent butler.remove
on the composite checks for dataset existence first.
@TallJimbo (Member) left a comment

Looks good; only minor comments, though this increases my unease with ChainedDatastore as a way to associate multiple Datastores with a butler, and I'd like to address that eventually.

(e.g. ``[i for i in a]``)
- a `str` -> return single element iterable (e.g. ``[a]``)
- a Mapping -> return single element iterable
Member:

This makes me a little nervous, as I think it's hard to remember what makes a class qualify as a Mapping. Could we at least make this not the default behavior? Or maybe allow general isinstance exceptions via an optional kwarg, i.e.:
iterable(x, except=collections.abc.Mapping) (though that's not terribly readable either).

Member Author:

Do you ever want iterable to return the keys of a Mapping? The problem here is dataIds and compatibility with old YAML files.

Member:

That's a fair point; I think you're right that this is the only behavior we'd ever actually want for mappings - my concern is more that it's not the behavior I would have naively expected. But it's not a big deal either way.

I do think Python mappings would be much cleaner if they weren't iterable at all and you had to explicitly say keys(), but of course there's nothing we can do about that.

Resolved review comment on: python/lsst/daf/butler/datastores/fileLikeDatastore.py
try:
# There is a difference between a concrete composite and virtual
# composite. In a virtual composite the datastore is never
# given the top level DatasetRef. In the concrete composite
Member:

Does this mean there is a difference in behavior at the butler level when we try to delete different kinds of composites? Or is it just butler's responsibility rather than datastore's to iterate over components in the virtual composite case?

Member Author:

In one case butler sends the composite ref to the datastore and datastore stores its own entries because it is writing a single file. In the virtual case (I still get these the wrong way round) the butler keeps the top level composite ref to itself and only sends the components to the datastore. Previously we were doing a "send all refs regardless" approach and mostly things were throwing exceptions because either the ref was never known to butler, or else after the first ref was removed the component refs were already deleted. Now at least the two paths are separate just as in getDirect.
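The two removal paths described in this comment can be sketched roughly as follows. The class and function names here are hypothetical stand-ins for illustration, not the actual `lsst.daf.butler` API:

```python
class SketchDatastore:
    """Toy datastore that tracks which refs it was actually given."""

    def __init__(self, stored):
        self.stored = set(stored)

    def exists(self, ref):
        return ref in self.stored

    def remove(self, ref):
        self.stored.discard(ref)


def remove_composite(datastore, parent_ref, component_refs):
    """Sketch: separate removal paths for concrete vs. virtual composites."""
    if datastore.exists(parent_ref):
        # Concrete composite: the datastore was given the top-level ref
        # because it wrote a single file, so remove that ref directly.
        datastore.remove(parent_ref)
    else:
        # Virtual composite: the datastore only ever saw the components,
        # so iterate over those rather than sending the parent ref.
        for ref in component_refs:
            if datastore.exists(ref):
                datastore.remove(ref)
```

Keeping the two paths separate avoids the old "send all refs regardless" approach, where exceptions were thrown for refs the datastore never knew about.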

Resolved review comment on: python/lsst/daf/butler/datastores/fileLikeDatastore.py
# If a dataset was removed previously but remembered
# in registry, skip the removal in the datastore.
if self.registry.getDatasetLocations(r):
self.datastore.remove(r)
Member:

Note that this attempts to remove the dataset from this particular datastore if it is present in any datastore. I think that's okay for now, but we should open a ticket to address it in the future. The more I think about it, the more I think we'd be better off if butler had a dict of datastores instead of having ChainedDatastore, so it could refer to them by name in this and other places.

Member Author:

This is fine so long as the datastores used by the registry haven't changed because of a config change. I can filter the returned list by asking for the equivalent from the datastore itself.

Member Author:

I think I've fixed it now so that it does use the datastore name before deciding whether to remove.

This is only interesting for ChainedDatastore but it provides
a standard way of getting all the names in a list from a
datastore.
Prior to this change any dataset stored in a datastore would
trigger the remove even if the butler had changed datastores
with reconfiguration.
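The name-based check described in these commits might look roughly like this. It is a sketch with hypothetical names (the real registry and datastore interfaces differ): only refs that the registry attributes to one of this datastore's constituent names are removed:

```python
from typing import Any, Dict, List, Sequence


def refs_to_remove(
    refs: Sequence[Any],
    registry_locations: Dict[Any, Sequence[str]],
    datastore_names: Sequence[str],
) -> List[Any]:
    """Filter refs down to those held by one of this datastore's names.

    registry_locations maps each ref to the datastore names the registry
    believes hold it; datastore_names lists this (possibly chained)
    datastore's constituent names.
    """
    names = set(datastore_names)
    return [
        ref
        for ref in refs
        if names & set(registry_locations.get(ref, ()))
    ]
```

This avoids triggering a removal when a ref is only known to a datastore that was dropped or renamed in a reconfiguration.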
@timj merged commit 5e0d73f into master on Jan 16, 2020
@timj deleted the tickets/DM-23024 branch on January 16, 2020 at 14:37