Allow indexing without provenance tracking #398

Closed
uchchwhash opened this issue Mar 21, 2018 · 5 comments

Comments

@uchchwhash
Contributor

commented Mar 21, 2018

Rationale

For testing and debugging, it would be helpful if datacube allowed indexing datasets
with their lineage information stripped off. A test datacube that only needs to hold,
for example, simple statistics on a product would then not be filled with irrelevant
lineage data tracing back to acquisition.

Expected behaviour

Something along the lines of
$ datacube dataset add --no-lineage DATASET
should index the dataset with lineage.source_datasets set to {}.
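
To make the requested effect concrete, here is a minimal sketch of stripping lineage from a metadata document before indexing. It assumes the document is a plain dict with a `lineage.source_datasets` mapping, as named above; `strip_lineage` and the example ids are hypothetical and not part of datacube.

```python
import copy


def strip_lineage(metadata_doc):
    """Return a copy of a dataset metadata document with no source datasets.

    Hypothetical helper: it only illustrates the requested behaviour and is
    not a datacube function.
    """
    doc = copy.deepcopy(metadata_doc)  # leave the caller's document untouched
    # Keep the lineage key, but drop every nested source dataset.
    doc.setdefault("lineage", {})["source_datasets"] = {}
    return doc


# Made-up document: a statistics dataset with one nested source.
example = {
    "id": "hypothetical-stats-dataset",
    "lineage": {
        "source_datasets": {
            "nbar": {
                "id": "hypothetical-nbar-dataset",
                "lineage": {"source_datasets": {}},
            }
        }
    },
}

stripped = strip_lineage(example)
```

The copy is deliberate: the on-disk metadata still describes the full lineage, and only what goes into the test index is trimmed.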

Environment information

  • Which datacube --version are you using?
    Open Data Cube core, version 1.5.1+516.g7046071
@omad

Member

commented Mar 21, 2018

We already have a command line option, --sources-policy. How about adding another policy named drop? For example:

$ datacube dataset add --sources-policy drop <datasets>
@Kirill888

Contributor

commented Mar 21, 2018

But what does --sources-policy=X actually do? I remember expecting something else from it than what it actually does.

@Kirill888

Contributor

commented Mar 21, 2018

It already has "skip", which is very close to "drop", but what it skips is the dataset being added, not its source datasets.

@Kirill888

Contributor

commented Mar 21, 2018

Also there is both "verify" and "ensure", but which one ensures more?

@jeremyh

Contributor

commented Mar 21, 2018

There's a docstring that explains it, but perhaps it belongs in one of the guides too?

ensure: ensure that it exists in the index (adding it if needed)

verify (default): ensure that it exists in the index (adding it if needed), and throw an error if any metadata differs from what we already have.

skip: Confusingly named: it skips adding the sources, but will still link them to this dataset.

  • From a user perspective, skip really means "throw an error if we don't already have all the sources".
  • Which is slightly more useful than it sounds: most of our processing is on datasets that are already in the cube, so if it's quietly adding new sources we'd want to know about it.
    • Example previous bug: an old scene processor that inappropriately scanned the filesystem for datasets to process was picking up datasets that were partially-written/corrupt.
  • I think it was actually added before ensure existed, as a brute-force way to "skip" the metadata verification. ensure is a more forgiving alternative.
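
The three policies described above can be modelled as a toy function. This is a sketch of the described semantics only, not datacube's implementation: the index is assumed to be a plain dict keyed by dataset id, and `resolve_sources` is a hypothetical name.

```python
def resolve_sources(indexed, sources, policy="verify"):
    """Apply a sources policy to a toy in-memory index.

    indexed: dict of id -> metadata for datasets already in the index
    sources: dict of id -> metadata for the sources of the dataset being added
    Returns the ids that were newly added to `indexed`.
    """
    if policy not in ("ensure", "verify", "skip"):
        raise ValueError(f"unknown policy: {policy}")

    added = []
    for ds_id, metadata in sources.items():
        if ds_id not in indexed:
            if policy == "skip":
                # skip never adds sources, so a missing one is an error.
                raise ValueError(f"source {ds_id} is not in the index")
            indexed[ds_id] = metadata  # ensure and verify add missing sources
            added.append(ds_id)
        elif policy == "verify" and indexed[ds_id] != metadata:
            # Only verify compares metadata against what is already indexed.
            raise ValueError(f"metadata for source {ds_id} differs from the index")
    return added
```

Under this model, ensure is the most permissive of the three: verify additionally rejects mismatched metadata, and skip rejects missing sources while doing no metadata comparison at all.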