API: raise if kwds are ignored in require_dataset #897

Closed · wants to merge 1 commit

Conversation

@tacaswell (Member)

This raises if user kwds would be silently ignored.

This was prompted by a mailing list question about automatic resizing, where I noticed that the new data is simply ignored!

A better solution may be to check all of the kwargs?

A less disruptive change would be to put a prominent warning in the docstring.
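For context, a minimal sketch of the silent-ignore behaviour in question (the file name demo.h5 is just for illustration):

```python
import numpy as np
import h5py

with h5py.File("demo.h5", "w") as f:
    f.create_dataset("x", data=np.arange(5, dtype="i8"))

with h5py.File("demo.h5", "a") as f:
    # "x" already exists, so require_dataset only validates shape and
    # dtype, returns the existing dataset, and silently drops the data
    # keyword -- nothing is written.
    dset = f.require_dataset("x", shape=(5,), dtype="i8",
                             data=np.zeros(5, dtype="i8"))
    print(dset[:])  # -> [0 1 2 3 4]; the zeros were never written
```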

@aragilar (Member) commented Jun 20, 2017

I feel this should be a warning rather than an exception, as I'm not sure why someone would use require_dataset with kwargs unless they had some expectation that the dataset existed (otherwise, why not just check whether the dataset exists with the correct settings, and create it with create_dataset if necessary?).
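For comparison, a sketch of the explicit check-and-create pattern alluded to above (the dataset name and settings are illustrative, not from the PR):

```python
import numpy as np
import h5py

with h5py.File("demo.h5", "a") as f:
    if "x" in f:
        dset = f["x"]
        # Validate only the settings this application actually cares about.
        if dset.shape != (5,) or dset.dtype != np.dtype("i8"):
            raise TypeError("existing dataset 'x' has incompatible settings")
    else:
        dset = f.create_dataset("x", shape=(5,), dtype="i8",
                                compression="gzip")
```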

@vasole (Contributor) commented Jun 20, 2017

@aragilar

Perhaps as a way to make sure the existing dataset is compatible with the one intended to be used/extended/overwritten?

I have to acknowledge I did not understand the idea behind the PR until reading your post. If the idea is to make sure the required dataset is compatible with what was requested, and therefore with what the user expects, then I think raising an exception makes perfect sense.

@vasole (Contributor) commented Jun 20, 2017

BTW, what is the behavior of HDF5 itself?

@aragilar (Member) commented Jun 20, 2017

@vasole require_dataset is an h5py thing only, so there is no guidance from HDF5.

If require_dataset had just been introduced, then I could see there being an intent to validate the properties of the dataset, which we could explicitly document (and so avoid confusion). The problem is that require_dataset has been part of h5py for at least 6 years and seems to lack an example in the docs, so it's quite likely this change could break code (whether that code is behaving correctly or not is the question). We could do the validation for the kwargs, which would be consistent with the part of the docstring which says "Raises TypeError if an incompatible object already exists". But what about the data keyword? Under the old system it's ignored; should we overwrite (which would result in data loss), always raise an error (consistent with the idea of validating the properties of the dataset), or raise an error (or warn) only if the dataset exists?

I feel that a warning is a change which avoids breaking anything while also highlighting potential problems, while we decide what the correct behaviour (and intent) should be.
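A sketch of what that warning could look like; warn_if_kwds_ignored is a hypothetical helper written for illustration, not actual h5py code:

```python
import warnings

def warn_if_kwds_ignored(group, name, **kwds):
    """Hypothetical helper: warn when extra kwds would be silently
    dropped because the dataset already exists."""
    if name in group and kwds:
        warnings.warn(
            "dataset %r already exists; keyword arguments %s will be "
            "ignored" % (name, sorted(kwds)),
            UserWarning,
            stacklevel=2,
        )
```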

@tacaswell tacaswell added this to the 2.8 milestone Aug 20, 2017
@tacaswell tacaswell modified the milestones: 2.8, 2.9 Feb 11, 2018
@tacaswell tacaswell modified the milestones: 2.9, 2.10 Oct 16, 2018
@tacaswell (Member, Author)

Punting to next release.

@takluyver (Member)

I don't think this makes sense as-is. If you know the dataset shouldn't already exist, you'd use create_dataset. You use require_dataset if the dataset may or may not exist. It would be annoying to have parameters which may cause an error depending on state that the code can't be expected to know.

Perhaps instead we could do some combination of:

  • More checks that an existing dataset matches specified parameters (e.g. maxshape, chunks); a sketch of such a check follows this list
  • Deprecate parameters which it doesn't make sense to check, or which can't efficiently be checked (data?)
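A sketch of what the first kind of check could look like; check_existing is a hypothetical helper, not part of h5py, though maxshape and chunks are real Dataset attributes:

```python
def check_existing(dset, maxshape=None, chunks=None):
    """Hypothetical validation against an existing h5py Dataset."""
    if maxshape is not None and dset.maxshape != maxshape:
        raise TypeError("maxshape mismatch: existing %s, requested %s"
                        % (dset.maxshape, maxshape))
    if chunks is not None and dset.chunks != chunks:
        raise TypeError("chunks mismatch: existing %s, requested %s"
                        % (dset.chunks, chunks))
```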

@takluyver (Member)

I'm dropping the 2.10 milestone for this, as it hasn't been changed for a long time, and I think there are issues with it as it stands.

@takluyver takluyver removed this from the 2.10 milestone Apr 25, 2019
@takluyver (Member)

Do we want to make any backward-incompatible changes to require_dataset for 3.0? E.g. checking more parameters against the existing dataset?

I'm in favour of leaving it as is. Shape and dtype are the fundamental properties of a dataset. If we start checking more properties of existing datasets, I don't think there's any good reason not to check all specified properties, and that sounds like it would be hard to get right - e.g. if you specify gzip compression with the default level, and the existing dataset has a different compression level, should it throw an error? I'd say leave it up to application code to check the specific things it cares about.
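For instance, application code can already assert the one property it depends on; compression is a real Dataset attribute, though the policy here is just an example:

```python
import h5py

with h5py.File("demo.h5", "a") as f:
    dset = f.require_dataset("x", shape=(5,), dtype="i8")
    # Check only what this application relies on, rather than having
    # require_dataset guess which properties matter.
    if dset.compression != "gzip":
        raise ValueError("expected gzip compression, got %r"
                         % dset.compression)
```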

@tacaswell (Member, Author)

Coming back to this, I suspect the right path forward is to start checking the extra kwargs, seeing whether they match the settings on the dset, and raising if strict==True.

That would still leave the case that started me down this path 5 years ago unresolved (and with a bit more hindsight, being able to provide a fill dataset that is non-constant may be useful).
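A rough sketch of that idea as a wrapper; require_dataset_strict and its attribute-matching policy are hypothetical, not something h5py provides:

```python
def require_dataset_strict(group, name, shape, dtype, **kwds):
    """Hypothetical: validate extra kwds against an existing dataset
    instead of silently ignoring them."""
    if name not in group:
        return group.create_dataset(name, shape=shape, dtype=dtype, **kwds)
    dset = group.require_dataset(name, shape=shape, dtype=dtype)
    for key, value in kwds.items():
        # compression, compression_opts, chunks, maxshape, fillvalue, ...
        # are all readable attributes on an existing Dataset; data is not,
        # which is exactly the unresolved case mentioned above.
        if getattr(dset, key, None) != value:
            raise TypeError("%s mismatch: existing %r, requested %r"
                            % (key, getattr(dset, key, None), value))
    return dset
```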

@takluyver (Member) commented Dec 15, 2021

I'm inclined to say it's not worth it - it's not clear to me precisely what it's useful to check (e.g. issue #2018, and my last comment), and this PR has hung around for years without generating any enthusiasm. None of the links from other issues are 'this PR would solve the problem you describe'. Maybe we just accept that require_dataset is a kind of scruffy, awkward bit of the API, make sure it's clearly documented and let it be. 🤷

If you do still want to check more inputs, a new strict=True keyword argument would be the only way to do it without breaking the API. (Edit: OK, that specific interface is not the only way. But any new checks should be opt in, so we'd still be maintaining the existing code path.)

@tacaswell (Member, Author)

Yeah, I think I'm going to give up on this one 😞

I agree that any changes that would make me happy would be quite backwards incompatible and the work required to make them opt-in is just not worth it.

@tacaswell tacaswell closed this Dec 15, 2021
@tacaswell tacaswell deleted the mnt_raise_on_ignored_kwds branch December 15, 2021 13:24