Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset add `sources_policy` is confusing and should be split into two options #451

Closed
Kirill888 opened this issue May 17, 2018 · 5 comments

Comments

@Kirill888
Copy link
Contributor

commented May 17, 2018

I'm talking about this one:

def add(self, dataset, sources_policy='verify', **kwargs):
"""
Add ``dataset`` to the index. No-op if it is already present.
:param Dataset dataset: dataset to add
:param str sources_policy: how should source datasets included in this dataset be handled:
``verify``
Verify that each source exists in the index, and that they are identical.
``ensure``
Add source datasets to the index if they doesn't exist.
``skip``
don't attempt to index source datasets (use when sources are already indexed)

First off, verify and ensure are way too similar in normal language use. Also verify implies a kind of "read-only" action: check everything is fine, but don't change a thing. Even documentation linked above implies that much. But that's not what's happening.

Both ensure and verify insert new datasets into the database from the lineage sources. The real difference between them is the level of verification two of them perform. One of them just checks presence of uuids in the database (if missing it will add it to the DB), the other one checks uuids as well as metadata documents. Which does what? I can never remember, but here is code for you to read:

if sources_policy == 'ensure':
for source in dataset.sources.values():
if not self.has(source.id):
self.add(source, sources_policy=sources_policy)
elif sources_policy == 'verify':
for source in dataset.sources.values():
self.add(source, sources_policy=sources_policy)

And then skip, is the only one that doesn't insert new datasets from the lineage, but otherwise performs a most complete check of consistency of lineage datasets.

Proposed interface

There are really two axis here

  1. What to do when lineage datasets are missing from DB
    • Add them to the database (current default action, verify and ensure)
    • Abort processing of the input dataset (skip)
  2. What level of verification to apply to lineage datasets that are already in the DB
    • None (I think that's what ensure does)
    • Full check that lineage dataset is exactly the same as in DB (skip and verify do that)

So this all quite confusing, but worst of all we bubble up that confusion all the way to the end-user, have a look at datacube dataset add --help output, I won't be pasting it here.

@uchchwhash

This comment has been minimized.

Copy link
Contributor

commented May 17, 2018

While you are at it, seems like @harshurampur needs #398 too.

@Kirill888

This comment has been minimized.

Copy link
Contributor Author

commented May 17, 2018

@uchchwhash not sure I would do it at this level, wholesale lineage rejection is a useful "app-level" feature we should have as part of datacube dataset add, and it's trivial to add at the app level, just remove lineage subtree from the datasets as they are being parsed. In fact I would say that it should probably be a feature at the dataset generation level as well: "don't record lineage for these datasets, I'm still experimenting and this not a final run of the algorithm, I don't won't to deal and pay in time and effort for all this lineage management, just yet"

This is index API level and at this point having an option like "just ignore missing lineage datasets and proceed like nothing happened" is not ideal.

But in general I agree that we need to have a serious discussion about lineage management options for summary datasets, as it gets quite unwieldy in that case. Also cross datacube data movement is made practically impossible because of the way we deal with lineage.

@uchchwhash

This comment has been minimized.

Copy link
Contributor

commented May 18, 2018

I understand your concerns, but from the users' perspective the option to opt out of the lineage record keeping (and thus indexing the parents if they are not there) could be (and probably is) valuable on its own.

@Kirill888

This comment has been minimized.

Copy link
Contributor Author

commented May 18, 2018

Users should not be calling datacube.index._dataset.add directly, and we do not expose indexing apis to "users" other than via command line/ui tools, as far as I know off, there is only talk about "datacube save" which would be a high-level user-facing api for indexing.

datacube dataset add will get --ignore-lineage option that will prompt the user if they are sure or will abort with error message if not running interactively, and --confirm-ignore-lineage for scripts.

@uchchwhash

This comment has been minimized.

Copy link
Contributor

commented May 18, 2018

👍 thanks.

Kirill888 added a commit that referenced this issue Jun 20, 2018

Large change to datasets.add
- Removed verification step, not part of add, #489
- Deprecated `sources_policy` instead just `with_lineage=True|False` which is
  equivalent to `ensure|skip` of the old, `verify` is gone and will be done as a
  separate higher level step #451
- Addition of lineage sources is now done in one transaction #475
- Fixed warnings for duplicate lineage datasets #477
- Duplicate lineage datasets are processed once now #478
- Low-level `insert_dataset` doesn't raise DuplicateError, instead returns False
- Added method to check presence of several datasets in the DB
- Don't attempt insert for datasets present in DB

Kirill888 added a commit that referenced this issue Jun 21, 2018

Remove sources_policy option from `dataset add` #451
- Verification is not part of dataset add anymore
- The default behaviour is what `sources_policy=ensure` used to do
  - add lineage datasets if they are missing,
    but don't check for consistency with the DB for those that are present
- `sources_policy=skip` is achieved with `--no-auto-add-lineage` now

We might re-add verification of lineage option in the future, but it would have
to allow for greater flexibility than previous implementation, at the very least
it should support the same mechanism currently used by "safe to update", except
it will be "divergence from DB for these fields is ok", but this probably needs
to be extended too. What is an ok divergence for product "A" inside the lineage,
might not be ok for product "B".

Kirill888 added a commit that referenced this issue Jun 25, 2018

Large change to datasets.add
- Removed verification step, not part of add, #489
- Deprecated `sources_policy` instead just `with_lineage=True|False` which is
  equivalent to `ensure|skip` of the old, `verify` is gone and will be done as a
  separate higher level step #451
- Addition of lineage sources is now done in one transaction #475
- Fixed warnings for duplicate lineage datasets #477
- Duplicate lineage datasets are processed once now #478
- Low-level `insert_dataset` doesn't raise DuplicateError, instead returns False
- Added method to check presence of several datasets in the DB
- Don't attempt insert for datasets present in DB

Kirill888 added a commit that referenced this issue Jun 25, 2018

Remove sources_policy option from `dataset add` #451
- Verification is not part of dataset add anymore
- The default behaviour is what `sources_policy=ensure` used to do
  - add lineage datasets if they are missing,
    but don't check for consistency with the DB for those that are present
- `sources_policy=skip` is achieved with `--no-auto-add-lineage` now

We might re-add verification of lineage option in the future, but it would have
to allow for greater flexibility than previous implementation, at the very least
it should support the same mechanism currently used by "safe to update", except
it will be "divergence from DB for these fields is ok", but this probably needs
to be extended too. What is an ok divergence for product "A" inside the lineage,
might not be ok for product "B".

Kirill888 added a commit that referenced this issue Jul 4, 2018

Large change to datasets.add
- Removed verification step, not part of add, #489
- Deprecated `sources_policy` instead just `with_lineage=True|False` which is
  equivalent to `ensure|skip` of the old, `verify` is gone and will be done as a
  separate higher level step #451
- Addition of lineage sources is now done in one transaction #475
- Fixed warnings for duplicate lineage datasets #477
- Duplicate lineage datasets are processed once now #478
- Low-level `insert_dataset` doesn't raise DuplicateError, instead returns False
- Added method to check presence of several datasets in the DB
- Don't attempt insert for datasets present in DB

Kirill888 added a commit that referenced this issue Jul 4, 2018

Remove sources_policy option from `dataset add` #451
- Verification is not part of dataset add anymore
- The default behaviour is what `sources_policy=ensure` used to do
  - add lineage datasets if they are missing,
    but don't check for consistency with the DB for those that are present
- `sources_policy=skip` is achieved with `--no-auto-add-lineage` now

We might re-add verification of lineage option in the future, but it would have
to allow for greater flexibility than previous implementation, at the very least
it should support the same mechanism currently used by "safe to update", except
it will be "divergence from DB for these fields is ok", but this probably needs
to be extended too. What is an ok divergence for product "A" inside the lineage,
might not be ok for product "B".

Kirill888 added a commit that referenced this issue Jul 4, 2018

Large change to datasets.add
- Removed verification step, not part of add, #489
- Deprecated `sources_policy` instead just `with_lineage=True|False` which is
  equivalent to `ensure|skip` of the old, `verify` is gone and will be done as a
  separate higher level step #451
- Addition of lineage sources is now done in one transaction #475
- Fixed warnings for duplicate lineage datasets #477
- Duplicate lineage datasets are processed once now #478
- Low-level `insert_dataset` doesn't raise DuplicateError, instead returns False
- Added method to check presence of several datasets in the DB
- Don't attempt insert for datasets present in DB

Kirill888 added a commit that referenced this issue Jul 4, 2018

Remove sources_policy option from `dataset add` #451
- Verification is not part of dataset add anymore
- The default behaviour is what `sources_policy=ensure` used to do
  - add lineage datasets if they are missing,
    but don't check for consistency with the DB for those that are present
- `sources_policy=skip` is achieved with `--no-auto-add-lineage` now

We might re-add verification of lineage option in the future, but it would have
to allow for greater flexibility than previous implementation, at the very least
it should support the same mechanism currently used by "safe to update", except
it will be "divergence from DB for these fields is ok", but this probably needs
to be extended too. What is an ok divergence for product "A" inside the lineage,
might not be ok for product "B".
@Kirill888 Kirill888 referenced this issue Jul 5, 2018
9 of 9 tasks complete

@omad omad closed this in #485 Jul 13, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.