Add modification metadata (user/time) to datasets #35
Comments
How exactly are you planning to modify datasets? |
Mainly by adding files.
One sets up a dataset early in production. As more files with the same characteristics come in, the dataset needs to be updated. If one automates this process, knowing that it has been done, and by whom, is very useful.
To some extent I’m requesting that datasets share many of the global attributes of files (status, updates, owners, …).
One could of course implement this using the metadata fields, but using the same structure MetaCat uses for files makes it more consistent.
On Sep 1, 2023, at 4:45 AM, Igor Mandrichenko ***@***.***> wrote:
How exactly are you planning to modify datasets?
|
What do you mean by "as more files ... come in, the dataset needs to be updated"? If we record a dataset update user and timestamp, what exactly would those be the initiator and the timestamp of? |
Also, just to remind you, the following dataset attributes are available for queries: https://metacat.readthedocs.io/en/latest/mql.html#file-dataset-attributes |
Yes, I use those and they work, and I can in fact build a URL that runs the query. But the query does not provide information such as the number of files, which the dataset list page does show.
On Sep 4, 2023, at 10:12 AM, Igor Mandrichenko ***@***.***> wrote:
Also, just to remind you, the following dataset attributes are available for queries: https://metacat.readthedocs.io/en/latest/mql.html#dataset-queries
|
Because the file count is not a dataset attribute. |
Here I mean adding files to the dataset. Since datasets are static, something has to do that addition, and it would be good to know what.
On Sep 4, 2023, at 10:10 AM, Igor Mandrichenko ***@***.***> wrote:
What do you mean by "as more files ... come in, the dataset needs to be updated"?
I am trying to understand what you mean.
Does a single file addition count as an "update", or do you mean some separate action that constitutes a dataset update?
Do you mean an update of the dataset metadata?
|
To get the file count, use the API function:
Keep in mind, though, that getting the file count takes time proportional to the number of files in the dataset. |
I do not see much point in knowing what added the last file to a dataset, and when, without knowing the complete history of additions and removals. |
The most recent change is useful information. I agree that the full history would not be useful.
On Sep 4, 2023, at 10:54 AM, Igor Mandrichenko ***@***.***> wrote:
I do not see much point in knowing what added the last file to a dataset, and when, without knowing the complete history of additions and removals.
Maintaining such a history would be very expensive, and I would like to see a good supporting use case for it.
|
Here are the changes that can be made to a dataset (not all of them are necessarily implemented as of now):
So which of those do you think should be reflected in the update time and author? |
My point is that, if you are lucky, the change you are interested in will be the most recent one. But there is a good chance that the change you are looking for is no longer the most recent one, in which case the information is gone. |
If we want this feature to be useful without maintaining the whole modification history, we need to narrow the list of qualifying events as much as possible and include only rare and significant events. I would suggest a change of metadata, because that is supposed to be a very rare action and yet it can be significant. |
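The trade-off described above can be sketched in a few lines of Python. This is only an illustration, not MetaCat code: `LastChange`, `replay`, and the event names `file_added` and `metadata_changed` are all hypothetical. It shows that a single "most recent change" slot is overwritten by frequent events unless the set of qualifying events is kept narrow.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass
class LastChange:
    """A single 'most recent change' slot (hypothetical, for illustration)."""
    event: Optional[str] = None
    user: Optional[str] = None

    def record(self, event: str, user: str,
               qualified: Optional[Set[str]] = None) -> None:
        # Without a filter every event overwrites the slot;
        # with a filter only qualifying events do.
        if qualified is None or event in qualified:
            self.event, self.user = event, user


def replay(events: Iterable[Tuple[str, str]],
           qualified: Optional[Set[str]] = None) -> LastChange:
    """Apply a sequence of (event, user) pairs to a fresh slot."""
    slot = LastChange()
    for event, user in events:
        slot.record(event, user, qualified)
    return slot


# Hypothetical history: one rare metadata change, then routine file additions.
events = [
    ("metadata_changed", "alice"),
    ("file_added", "production_bot"),
    ("file_added", "production_bot"),
]

unfiltered = replay(events)
filtered = replay(events, qualified={"metadata_changed"})

print(unfiltered.event, unfiltered.user)  # file_added production_bot
print(filtered.event, filtered.user)      # metadata_changed alice
```

With no filter, the metadata change by `alice` is lost as soon as the next file is added. With the narrow filter it survives, at the cost of not recording file additions at all.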
FYI: for files, qualified updates are:
As you can see, file parentage, size, and checksums are never supposed to change. Metadata can be changed, but only on special occasions, such as a change of the metadata namespace or the correction of errors. |
Done. Please upgrade your client to 3.40.0
|
Thanks, this is good. Hopefully we don’t change those items frequently.
On Sep 9, 2023, at 2:59 PM, Igor Mandrichenko ***@***.***> wrote:
Done. Please upgrade your client to 3.40.0
Datasets now have two new attributes, updated_by and updated_timestamp, accessible via the API and the UI.
The following events update these attributes:
* Dataset metadata is changed
* A subset is added to or removed from the dataset
* Dataset flags (frozen or monotonic) change
* Dataset description changes
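
As a rough illustration of the behavior listed above, here is a toy in-memory model in Python. It is not the MetaCat implementation or client API: the class `DatasetModel` and all of its methods are hypothetical, and only the attribute names `updated_by` and `updated_timestamp` come from this thread. File additions are not in the list above, so the sketch assumes they leave both attributes untouched.

```python
from datetime import datetime, timezone
from typing import Optional, Set


class DatasetModel:
    """Toy model of a dataset record (hypothetical, not MetaCat code)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.metadata: dict = {}
        self.description = ""
        self.frozen = False
        self.monotonic = False
        self.subsets: Set[str] = set()
        self.files: Set[str] = set()
        # Attribute names taken from the thread; unset until a qualifying event.
        self.updated_by: Optional[str] = None
        self.updated_timestamp: Optional[datetime] = None

    def _touch(self, user: str) -> None:
        # Record who made the most recent qualifying change, and when.
        self.updated_by = user
        self.updated_timestamp = datetime.now(timezone.utc)

    def set_metadata(self, user: str, **values) -> None:
        self.metadata.update(values)
        self._touch(user)

    def add_subset(self, user: str, subset: str) -> None:
        self.subsets.add(subset)
        self._touch(user)

    def remove_subset(self, user: str, subset: str) -> None:
        self.subsets.discard(subset)
        self._touch(user)

    def set_flags(self, user: str, frozen: Optional[bool] = None,
                  monotonic: Optional[bool] = None) -> None:
        if frozen is not None:
            self.frozen = frozen
        if monotonic is not None:
            self.monotonic = monotonic
        self._touch(user)

    def set_description(self, user: str, text: str) -> None:
        self.description = text
        self._touch(user)

    def add_file(self, user: str, file_id: str) -> None:
        # Assumption: adding a file is not a qualifying event, so no _touch().
        self.files.add(file_id)


ds = DatasetModel("prod:raw")
ds.add_file("production_bot", "f1")
before = ds.updated_by                 # still None: no qualifying event yet
ds.set_flags("carol", frozen=True)
ds.add_file("production_bot", "f2")
after = ds.updated_by                  # "carol": the file addition did not overwrite it
```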
|
MetaCat files have modification metadata as well as creation metadata. Can that feature be added for datasets, especially since we will be modifying them during production?