Add modification metadata (user/time) to datasets #35

Closed
hschellman opened this issue Aug 26, 2023 · 16 comments

Comments

@hschellman

MetaCat files have modification metadata as well as creation metadata. Can that feature be added for datasets, especially since we will be modifying them during production?

@ivmfnal
Owner

ivmfnal commented Sep 1, 2023

How exactly are you planning to modify datasets?

@hschellman
Author

hschellman commented Sep 4, 2023 via email

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

What do you mean by "as more files ... come in, the dataset needs to be updated"?
I am trying to understand what you mean.
Does a single file addition count as an "update", or do you mean that some separate action would be a dataset update?
Do you mean an update of the dataset metadata?

So if we record a dataset update user and timestamp, what exactly would they be the initiator and the timestamp of?

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

Also, just to remind you, the following dataset attributes are available for queries: https://metacat.readthedocs.io/en/latest/mql.html#file-dataset-attributes

@hschellman
Author

hschellman commented Sep 4, 2023 via email

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

Because that is not a dataset attribute.

@hschellman
Author

hschellman commented Sep 4, 2023 via email

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

To get the file count, use the API function:

   get_dataset(did=None, namespace=None, name=None, exact_file_count=False)

Keep in mind, though, that getting the file count takes time proportional to the number of files in the dataset.
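
As a minimal sketch of the call above, the helper below requests the exact count by passing `exact_file_count=True`. The `client` argument is assumed to be a connected MetaCat client object exposing `get_dataset` with the signature quoted above. The helper name `exact_file_count`, the `"file_count"` key in the returned dictionary, and the `None` return for a missing dataset are assumptions made for illustration, so check them against your client version.

```python
def exact_file_count(client, did):
    """Return the exact number of files in dataset `did`, or None if not found.

    `client` is assumed to expose get_dataset(did=..., exact_file_count=...).
    The "file_count" key is an assumption about the returned dictionary.
    """
    # exact_file_count=True asks for a real count; this is the slow path,
    # with cost proportional to the number of files in the dataset.
    info = client.get_dataset(did=did, exact_file_count=True)
    if info is None:
        return None
    return info.get("file_count")
```

Because of the cost noted above, it is probably better to call this only when the exact number is really needed rather than inside a tight loop.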

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

I guess I do not see much point in knowing who added the last file to a dataset and when, without knowing the complete history of additions and removals.
Maintaining such a history would be very expensive, and I would like to see a good supporting use case for it.

@hschellman
Author

hschellman commented Sep 4, 2023 via email

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

Here are the changes that can be made to a dataset (not all of them are necessarily implemented as of now):

  • metadata updated
  • file added
  • file removed
  • subset added
  • subset removed
  • description changed
  • metadata requirements changed

So which of those do you think should be reflected in the update time/author?

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

My point is that, if you are lucky, the change you are interested in will be the most recent one. But there is a good chance that the change you are looking for is no longer the most recent one, and therefore the information is gone.
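
To illustrate the point, here is a toy sketch (not MetaCat code) of a record that keeps only the latest update. The class `DatasetRecord`, its fields, and the user names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetRecord:
    """Toy model: only the most recent update is remembered."""
    updated_by: Optional[str] = None
    updated_what: Optional[str] = None

    def record(self, user: str, what: str) -> None:
        # Each new event overwrites the previous one; no history is kept.
        self.updated_by = user
        self.updated_what = what


ds = DatasetRecord()
ds.record("alice", "metadata updated")  # the change we care about
ds.record("bob", "file added")          # a routine change arrives later
# Only the later file addition is visible now; the metadata update is lost.
```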

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

If we want this feature to be useful without maintaining the whole modification history, we need to narrow the list of qualifying events as much as possible and include only rare and significant events. I would suggest a change of metadata, because that is supposed to be a very rare action and yet it can be significant.

@ivmfnal
Owner

ivmfnal commented Sep 4, 2023

FYI: for files, the qualifying updates are:

  • metadata
  • file size
  • checksums
  • parentage

As you can see, file parentage, size, and checksums are never supposed to change. Metadata can be changed, but only on special occasions such as a change to the metadata namespace or the correction of errors.

@ivmfnal
Owner

ivmfnal commented Sep 9, 2023

Done. Please upgrade your client to 3.40.0.
Datasets now have two new attributes, updated_by and updated_timestamp, accessible via the API and the UI.
The following events trigger changes to these attributes:

  • Dataset metadata is changed
  • A subset is added to or removed from the dataset
  • Dataset flags (frozen or monotonic) change
  • Dataset description changes
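
The rule above can be sketched as a simple filter. This is an illustrative model only, not the server implementation: the event labels, the function `apply_event`, and the dictionary layout are assumptions. File additions and removals are not in the list, so the sketch treats them as non-qualifying.

```python
import time

# Event kinds taken from the list above; the string labels are illustrative.
QUALIFYING_EVENTS = {
    "metadata_changed",
    "subset_added",
    "subset_removed",
    "flags_changed",
    "description_changed",
}


def apply_event(dataset: dict, event: str, user: str, now=None) -> dict:
    """Return a copy of `dataset`, stamping it only if `event` qualifies."""
    updated = dict(dataset)
    if event in QUALIFYING_EVENTS:
        updated["updated_by"] = user
        updated["updated_timestamp"] = time.time() if now is None else now
    return updated
```

With this model, a call such as `apply_event(ds, "file_added", "alice")` leaves both attributes unchanged, while `apply_event(ds, "metadata_changed", "bob")` stamps them.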

@ivmfnal ivmfnal closed this as completed Sep 9, 2023
@hschellman
Author

hschellman commented Sep 9, 2023 via email
