Collection Quotas (8549) #10144
Conversation
@@ -139,39 +143,6 @@ public class DataFileServiceBean implements java.io.Serializable {
     */
    public static final String MIME_TYPE_PACKAGE_FILE = "application/vnd.dataverse.file-package";

public class UserStorageQuota {
This class was used during the phase one/proof of concept development.
First pass review. Overall, looks good. I hope performance is good enough! It'll be a great feature, one that's often asked for!
@@ -0,0 +1,3 @@
This release adds support for defining storage size quotas for collections. Please see the API guide for details. This is an experimental feature that has not yet been used in production on any real-life Dataverse instance, but we are planning to try it out at Harvard/IQSS.
Please note that this release includes a database update (via a Flyway script) that will calculate the storage sizes of all the existing datasets and collections on the first deployment. On a large production database with tens of thousands of datasets this may add a couple of extra minutes to the deployment.
Now that we have a API changelog we could say something like, "See the API changelog for details: http://preview.guides.gdcc.io/en/develop/api/changelog.html "
Will do.
curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota"

Will output the storage quota allocated (in bytes), or a message indicating that the quota is not defined for the collection.
This makes me wonder if a quota can be defined globally or not. That is, a default quota for any random collection.
(Time passes.)
I see from some Javadoc below there is some inheritance going on. This should be added to the guides somewhere.
* Checks if the supplied DvObjectContainer...
* has a quota configured, and if not, keeps checking if any of the direct
* ancestor Collections further up have a configured quota. If it finds one,
* it will retrieve the current total content size for that specific ancestor
* dvObjectContainer and use it to define the quota limit for the upload
* session in progress.
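As a reading aid, here is a minimal, self-contained sketch of the walk-up logic that Javadoc describes. All names here (`Container`, `remainingQuota`) are hypothetical stand-ins, not the actual Dataverse classes:

```java
// Hypothetical, simplified model of the ancestor walk described in the
// Javadoc above: start at the container the upload targets and climb
// toward the root until a configured quota is found.
public class QuotaResolver {

    // Minimal stand-in for DvObjectContainer: each node knows its owner
    // (parent collection) and an optional configured quota in bytes.
    static class Container {
        final Container owner;        // null for the root collection
        final Long quotaBytes;        // null means "no quota configured here"
        final long currentSizeBytes;  // current total content size

        Container(Container owner, Long quotaBytes, long currentSizeBytes) {
            this.owner = owner;
            this.quotaBytes = quotaBytes;
            this.currentSizeBytes = currentSizeBytes;
        }
    }

    // Returns the remaining bytes allowed for an upload session, or null
    // if neither the container nor any ancestor has a quota configured.
    static Long remainingQuota(Container c) {
        for (Container cur = c; cur != null; cur = cur.owner) {
            if (cur.quotaBytes != null) {
                return Math.max(0L, cur.quotaBytes - cur.currentSizeBytes);
            }
        }
        return null; // unlimited
    }

    public static void main(String[] args) {
        Container root = new Container(null, null, 0L);
        Container parent = new Container(root, 1048576L, 900000L); // 1 MiB quota, 900000 bytes used
        Container dataset = new Container(parent, null, 100000L);
        System.out.println(remainingQuota(dataset)); // prints 148576: quota comes from the parent
    }
}
```

Note that the limit is defined by the ancestor that owns the quota, using that ancestor's own accumulated size, exactly as the Javadoc states.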
Yes, there's definitely going to be more documentation, and more tests. I'm still working on these parts, and this is the main reason this is still a draft PR. But I really wanted other developers to look at the mechanics of the implementation.
I think (correct me if I'm wrong) that comment is less about a default global quota, and more that, since collections can contain collections, you have to go up the tree.
src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java
} else {
    // Direct upload.
It shouldn't be too hard to add tests for direct upload and quotas once we merge this PR:
/*
 * Click nbfs://nbhost/SystemFileSystem/Templates/Licenses/license-default.txt to change this license
 * Click nbfs://nbhost/SystemFileSystem/Templates/Classes/Class.java to edit this template
 */
// Upload a small file:

// [To be continued/work in progress]
Yes, great to continue this testing.
// @todo: Do we want to do this after *each* file is saved? - there may be
// quite a few files being saved here all at once. We could alternatively
// perform this update only once, after this loop is completed (are there any
// risks/accuracy loss?)
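For what it's worth, one possible shape for the "update once after the loop" alternative mentioned in the @todo - accumulate the byte counts inside the loop and apply a single increment afterwards. Everything here is illustrative (`StorageUseService`, `saveAll`, `incrementStorageUse` are made-up names, not the actual Dataverse API):

```java
import java.util.List;

// Illustrative sketch only: accumulate sizes per saved file and record
// storage use once after the loop, instead of once per file.
public class BatchedStorageUpdate {

    interface StorageUseService {
        // stand-in for the bean method that runs the recursive native
        // UPDATE against the StorageUse table
        void incrementStorageUse(long containerId, long deltaBytes);
    }

    static long saveAll(List<Long> fileSizes, long containerId, StorageUseService svc) {
        long total = 0L;
        for (long size : fileSizes) {
            // ... permanently save the file here ...
            total += size;          // defer the DB update
        }
        if (total > 0) {
            svc.incrementStorageUse(containerId, total); // one update per batch
        }
        return total;
    }

    public static void main(String[] args) {
        long[] calls = {0};
        long total = saveAll(List.of(100L, 200L, 300L), 42L,
                (id, delta) -> calls[0]++);
        System.out.println(total + " bytes recorded in " + calls[0] + " update(s)");
    }
}
```

The accuracy trade-off the comment asks about: if the batch fails partway through, the deferred variant must only count the files that actually saved, which the per-file variant gets for free.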
Good that we're thinking about the "many files" case.
Yes, this is still one of the work-in-progress/still-experimenting parts. I may have a better solution in the works.
savedSuccess = true;
logger.fine("Success: permanently saved file " + dataFile.getFileMetadata().getLabel());

// TODO: reformat this file to remove the many tabs added in cc08330
Oh good, tabs are being removed, I think, I hope.
Collection Storage Quotas
~~~~~~~~~~~~~~~~~~~~~~~~~
Yes, the remaining quota info will be shown to the user on the file upload page, above. The code for that is already in the develop branch; it was added in the proof of concept/phase one PR (#9409).
…nd some related documentation (#8549)
My one big concern about this scheme is its reliance on the new StorageUse table that needs to be modified every time a file (or a batch of files) is saved successfully. Specifically, the one entry in it for the root collection will need to be updated any time a file is uploaded anywhere else in the dvobject tree. The same applies to any large collection, or any collection where multiple active uploads are happening in parallel in different parts of the tree. The update itself is quite simple and fast, and the order in which the actual increments are applied is irrelevant, but I'm still wondering if there is some potential race condition that could lock things up (?).
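A toy illustration (not Dataverse code) of why the order of application is indeed irrelevant: plain additive increments commute, so concurrent uploads bumping the same row converge on the correct total regardless of interleaving. The remaining concern is therefore row-lock contention on the one hot root-collection row, not lost updates, assuming each increment is a single atomic UPDATE at the database level:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Toy demonstration: many concurrent additive increments against one
// shared counter (standing in for the root collection's StorageUse row)
// always sum to the same total, whatever order they run in.
public class IncrementCommutes {

    static long parallelIncrements(int tasks, long delta) throws InterruptedException {
        AtomicLong rootRow = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> rootRow.addAndGet(delta));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return rootRow.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(parallelIncrements(1000, 10)); // 10000 regardless of interleaving
    }
}
```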
…ng the em.merge()/em.persist() to the djb. #8549
I'm doing the QA now. What, if anything, should happen if the quota is set to a value less than the current size of the existing data?
The collection will be over the storage quota limit right away, thus disabling further file uploads.
(working on adding the "how to test" section now)
What this PR does / why we need it:
Which issue(s) this PR closes:
Closes #8549
Special notes for your reviewer:
I'd like to double-check/get a second and third opinion on the viability of this approach as described below.
[another edit: In line with the scope of the issue, the quota mechanism added is for collections only. But nothing in the implementation prevents extending it to datasets as well.]
The main idea behind the implementation is to maintain a real-time record of storage use for all the DvObjectContainers (Datasets and Collections), since we cannot possibly afford to calculate it on the fly for every file upload. We record storage use for the entire hierarchy of nested objects. I.e., adding a file to the dataset increments the storage use for the parent dataset, and all the parent collections up to the root.
This information is stored in its own dedicated table, StorageUse, as opposed to an extra column in the DvObject table, since the latter would require updating all the parent DvObjects every time we upload a file, likely causing merge conflicts on a busy instance or in large collections.
These updates are performed via native recursive queries in new transactions, avoiding any cascading updates.
These recursive updates need to be performed not just every time a new file is uploaded, but every time a file is successfully ingested (this adds the size of the archival tab-delimited file to the total), when an unpublished file is deleted and when a tabular file is un-ingested.
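To make the "native recursive query" idea concrete, here is a sketch of what such a statement could look like, with assumed table and column names (`storageuse`, `dvobject.owner_id`, `sizeinbytes`) that may differ from the actual schema in this PR:

```java
// Sketch only: walk the dvobject ownership chain from the uploaded-to
// container up to the root, and bump every ancestor's StorageUse row in
// a single statement (PostgreSQL-style recursive CTE). Table and column
// names are assumptions, not taken from the actual Flyway schema.
public class RecursiveStorageUpdate {

    static String recursiveUpdateSql() {
        return "WITH RECURSIVE ancestors (id) AS ("
             + " SELECT id FROM dvobject WHERE id = ?"
             + " UNION ALL"
             + " SELECT o.owner_id FROM dvobject o, ancestors a"
             + "  WHERE o.id = a.id AND o.owner_id IS NOT NULL)"
             + " UPDATE storageuse SET sizeinbytes = sizeinbytes + ?"
             + " WHERE dvobjectcontainer_id IN (SELECT id FROM ancestors)";
    }

    public static void main(String[] args) {
        System.out.println(recursiveUpdateSql());
    }
}
```

Because the whole increment is one statement executed in its own transaction, no entities are loaded or merged, which is what avoids the cascading-update problem described above.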
I'm only counting the sizes of the main datafiles, and of the produced tab-delimited versions for the ingested tabular files, for these purposes. I.e., I consider these to be the archival "payload" of datasets and collections, as opposed to all the file- and dataset-level auxiliary files, such as the resized image thumbnails and metadata exports, under the assumption that those are transient, i.e., can be deleted and automatically re-generated. If necessary, we can extend the scheme to also count the sizes of (some, select?) auxiliaries going forward. [edit: In the context of the IQSS prod. instance, this feature is specifically needed for the larger collections and datasets, on the order of TBs or similar, where the sizes of these aux. files should not amount to any statistically significant portion of the overall storage]
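The counting rule above can be reduced to a tiny function. This is an illustration of the stated policy, not the PR's actual code, and the names are made up:

```java
// Illustrative only: the "payload" counted against the quota, per the
// description above - the main datafile plus, for ingested tabular
// files, the generated archival tab-delimited version. Thumbnails,
// metadata exports, and other auxiliary files are excluded as
// re-generable.
public class PayloadSize {

    static long payloadBytes(long originalSize, boolean tabularIngested, long archivalTabSize) {
        return tabularIngested ? originalSize + archivalTabSize : originalSize;
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes(5000L, true, 1200L));  // ingested tabular file: 6200
        System.out.println(payloadBytes(5000L, false, 0L));    // plain file: 5000
    }
}
```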
I chose to have this system of keeping a real-time record of storage use automatically enabled, regardless of whether quota enforcement is configured on the instance, the rationale being that this information can be very useful regardless. (The existing commands for calculating storage sizes perform the calculation in real time and are thus less efficient.) Also, the idea is that these updates via direct native queries should be fairly efficient and not result in any serious overhead. A somewhat bulky Flyway script is provided to populate this map for the existing objects. I tested it on a copy of the prod. db and the performance of the current version is adequate. But, like with everything else in the PR, I would appreciate another developer taking a quick look at it.
Suggestions on how to test this:
The functionality is mostly for admins: both instance admins (aka "superusers"), who can set, change and delete these quotas for specific collections, and collection admins (users who own and manage collections), who will have read access and be able to see the quotas defined on their collections and the storage use already accumulated on them. It is for the most part API-only; i.e., at this point we don't want to invest much work into adding anything to the existing UI.
The only UI impact is that the end user of a collection that has a quota configured will be warned about it on the file upload page, as in
... and of course they will be getting error messages on that page attempting to upload anything that goes over the quota.
So the way to test the functionality would be to
curl -X POST -H "X-Dataverse-key: xxx" http://localhost:8080/api/dataverses/<COLLECTIONALIAS>/storage/quota/1048576
(probably makes sense to use something stingy like that for testing...)
Quota enforcement also needs to be enabled via the database setting:
curl -X PUT -d true http://localhost:8080/api/admin/settings/:UseStorageQuotas
So the above is the simplest functionality test. How much effort to invest into trying to find a way to break the feature is a judgement call/matter of balance between thoroughness and efficiency.
Reviewing the documentation introduced in the PR is an integral part of QA. The goal should be to read it from the point of view of an admin managing a Dataverse instance somewhere else - the intended audience of the guide - and see whether it's sufficient for them to be able to use the feature, whether it's clear enough, etc.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: