Add modification metadata (user/time) to datasets #35

hschellman · 2023-08-26T14:31:38Z

MetaCat files have modification metadata as well as creation metadata. Can that feature be added for datasets, especially as we will be modifying them during production?

ivmfnal · 2023-09-01T11:45:14Z

How exactly are you planning to modify datasets?

hschellman · 2023-09-04T16:44:42Z

Mainly by adding files. One sets up a dataset early in production; as more files with the same characteristics come in, the dataset needs to be updated. If one automates this process, knowing that it has been done and by whom is very useful. To some extent I’m requesting that datasets share many of the global attributes of files (status, updates, owners …). One could of course implement this using the metadata fields, but using the metacat structure for files makes it more consistent.

ivmfnal · 2023-09-04T17:10:28Z

What do you mean by "as more files ... come in, the dataset needs to be updated"?
I am trying to understand what you mean.
Does a single file addition count as an "update", or do you mean that a dataset update is some separate action?
Do you mean an update of the dataset metadata?

So if we record a dataset update user and timestamp, what exactly are those the initiator and the timestamp of?

ivmfnal · 2023-09-04T17:12:23Z

Also, just to remind you, the following dataset attributes are available for queries: https://metacat.readthedocs.io/en/latest/mql.html#file-dataset-attributes

hschellman · 2023-09-04T17:48:03Z

Yes, I use those and it works, and I can in fact make a URL that does the query. But it does not provide information such as the number of files, which the dataset list page does.

ivmfnal · 2023-09-04T17:49:39Z

Because the file count is not a dataset attribute.

hschellman · 2023-09-04T17:50:30Z

Here I mean adding files to the dataset. Since datasets are static, something has to do that addition, and it would be good to know what.

ivmfnal · 2023-09-04T17:52:52Z

To get the file count, use the API function:

   get_dataset(did=None, namespace=None, name=None, exact_file_count=False)

Keep in mind, though, that getting the file count will take time proportional to the number of files in the dataset.
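
As a rough sketch of how that call might be wrapped, the helper below asks for the exact count and returns it. It assumes that passing exact_file_count=True requests the exact (slow) count, that get_dataset returns a dictionary or None when the dataset is not found, and that the count is reported under a "file_count" key; the helper name and the example dataset name are hypothetical.

```python
def dataset_file_count(client, did):
    """Return the file count of a dataset, or None if the dataset is not found.

    `client` is any object exposing get_dataset() with the signature quoted
    above, for example a MetaCat client instance.
    """
    # exact_file_count=True is assumed to request the exact count; this is the
    # call whose run time grows with the number of files in the dataset.
    info = client.get_dataset(did=did, exact_file_count=True)
    if info is None:
        return None
    # Assumption: the count is stored under the "file_count" key.
    return info.get("file_count")


# Hypothetical usage (requires the metacat client package and a configured server):
#   from metacat.webapi import MetaCatClient
#   client = MetaCatClient()
#   print(dataset_file_count(client, "my_namespace:my_dataset"))
```

Because of the cost, it is probably better to call this once per dataset and cache the result than to call it in a loop.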

ivmfnal · 2023-09-04T17:54:42Z

I guess I do not see much point in knowing who added the last file to a dataset, and when, without knowing the complete history of additions and removals.
Maintaining such a history would be very expensive, and I would like to see a good supporting use case for it.

hschellman · 2023-09-04T17:56:18Z

The most recent change is useful info. I agree that the full history would not be useful.

ivmfnal · 2023-09-04T17:58:27Z

Here are the changes that can be made to a dataset (not all of them are necessarily implemented as of now):

metadata updated
file added
file removed
subset added
subset removed
description changed
metadata requirements changed

So which of those do you think should be reflected in the update time/author?

ivmfnal · 2023-09-04T18:02:58Z

The point I am trying to make is that, if you are lucky, the change you are interested in will be the most recent one. But there is a good chance that the change you are looking for is no longer the most recent one, and therefore the information is gone.

ivmfnal · 2023-09-04T18:24:03Z

If we want this feature to be useful without maintaining the whole modification history, we need to narrow the list of qualifying events as much as possible and include only rare and significant events. I would suggest a change of metadata, because that is supposed to be a very rare action and yet it can be significant.
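
To illustrate the idea only (this is a toy in-memory model, not MetaCat's implementation), the sketch below keeps a single "last update" record and overwrites it only for qualifying events. The event names, class name and field names are all hypothetical.

```python
import time

# Hypothetical event names; only rare, significant events qualify.
QUALIFYING_EVENTS = {"metadata_updated"}


class DatasetRecord:
    """Toy dataset record that remembers only the last qualifying update."""

    def __init__(self, name):
        self.name = name
        self.last_updated_by = None
        self.last_updated_at = None

    def apply(self, event, user, now=None):
        """Record user and time if the event qualifies; return True if recorded."""
        if event not in QUALIFYING_EVENTS:
            # Frequent events such as file additions leave the record untouched,
            # so they cannot overwrite the information about a significant change.
            return False
        self.last_updated_by = user
        self.last_updated_at = time.time() if now is None else now
        return True
```

With this model, any number of "file_added" events after a metadata change leave the recorded user and time intact, which is what keeps the single record meaningful without a full history.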

ivmfnal · 2023-09-04T18:40:08Z

FYI: for files, qualified updates are:

metadata
file size
checksums
parentage

As you can see, file parentage, size and checksums are not supposed to ever change. Metadata can be changed, but only on special occasions such as changes to the metadata namespace or error corrections.

ivmfnal · 2023-09-09T21:59:42Z

Done. Please upgrade your client to 3.40.0.
Datasets now have two new attributes, updated_by and updated_timestamp, accessible via the API and the UI.
The following events update these attributes:

Dataset metadata is changed
A subset is added to or removed from the dataset
Dataset flags (frozen or monotonic) change
Dataset description changes
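
A minimal sketch of reading the two attributes from a dataset info dictionary follows. It assumes the attributes appear as "updated_by" and "updated_timestamp" keys in the dictionary returned by the API and that the timestamp is in seconds since the Unix epoch; neither detail is confirmed here, and the helper name is hypothetical.

```python
from datetime import datetime, timezone


def describe_last_update(info):
    """Summarize the last qualifying update from a dataset info dictionary."""
    user = info.get("updated_by")
    ts = info.get("updated_timestamp")
    if user is None or ts is None:
        # Nothing recorded yet, e.g. the dataset was never changed by one of
        # the qualifying events.
        return "no qualifying update recorded"
    # Assumption: the timestamp is seconds since the Unix epoch (UTC).
    when = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"last updated by {user} at {when.isoformat()}"
```

Note that adding or removing files is not among the triggering events, so these attributes say nothing about when the file list last changed.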

hschellman · 2023-09-09T23:17:56Z

Thanks, this is good. Hopefully we don’t change those items frequently.

ivmfnal closed this as completed Sep 9, 2023
