Collection Quotas (8549) #10144

Merged
merged 36 commits into from
Dec 5, 2023

Conversation


@landreev landreev commented Nov 27, 2023

(working on adding the "how to test" section now)

What this PR does / why we need it:

Which issue(s) this PR closes:

Closes #8549

Special notes for your reviewer:
I'd like a second (and a third) opinion on the viability of this approach, as described below.

[another edit: In line with the scope of the issue, the quotas mechanism added is for collections only, but nothing in the implementation prevents extending it to datasets as well.]

The main idea behind the implementation is to maintain a real-time record of storage use for all the DvObjectContainers (Datasets and Collections), since we cannot possibly afford to calculate it on the fly for every file upload. We record storage use for the entire hierarchy of nested objects: adding a file to a dataset increments the storage use of that dataset and of all its parent collections, up to the root.
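To make the bookkeeping concrete, here is a minimal in-memory sketch of that hierarchical increment. The class and method names are illustrative only; the actual implementation records the totals in the StorageUse database table via native SQL queries.

```python
# Illustrative sketch only: models how one file-size increment propagates
# up the ownership chain. Not the actual Dataverse classes.

class Container:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.storage_use = 0  # bytes, maintained incrementally

    def increment_storage(self, size_in_bytes):
        # Walk up the ownership chain, incrementing every ancestor,
        # so each collection always knows its total nested usage.
        node = self
        while node is not None:
            node.storage_use += size_in_bytes
            node = node.parent

root = Container("root")
collection = Container("my-collection", parent=root)
dataset = Container("my-dataset", parent=collection)

dataset.increment_storage(1000)  # upload a 1000-byte file
dataset.increment_storage(250)   # ingest adds the tab-delimited version

assert dataset.storage_use == 1250
assert collection.storage_use == 1250
assert root.storage_use == 1250
```

The key property is that each node's counter is always the total for its entire subtree, so reading any collection's usage is a single lookup rather than a tree walk.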

This information is stored in its own dedicated table, StorageUse, rather than as an extra column in the DvObject table, since the latter would require updating all the parent DvObjects every time we upload a file, likely causing merge conflicts on a busy instance or in large collections.

These updates are performed via native recursive queries in new transactions, avoiding any cascading updates.

These recursive updates need to be performed not just every time a new file is uploaded, but also every time a file is successfully ingested (which adds the size of the archival tab-delimited file to the total), when an unpublished file is deleted, and when a tabular file is un-ingested.

For these purposes I'm only counting the sizes of the main datafiles, plus the generated tab-delimited versions of ingested tabular files. I.e., I consider these to be the archival "payload" of datasets and collections, as opposed to all the file- and dataset-level auxiliary files, such as resized image thumbnails and metadata exports, under the assumption that those are transient, i.e., can be deleted and automatically re-generated. If necessary, we can extend the scheme to also count the sizes of (some, select?) auxiliaries going forward. [edit: In the context of the IQSS prod. instance, this feature is specifically needed for the larger collections and datasets, on the order of TBs, where the sizes of these aux. files should not amount to any statistically significant portion of the overall storage.]

I chose to have this system of keeping a real-time record of storage use enabled automatically, regardless of whether quota enforcement is configured on the instance, with the rationale that this information can be very useful on its own. (The existing commands for calculating storage sizes do so on the fly and are thus less efficient.) Also, the idea is that these updates via direct native queries should be fairly efficient and not add any serious overhead. A somewhat bulky Flyway script is provided to populate this map for the existing objects. I tested it on a copy of the prod. db and the performance of the current version is adequate, but, like with everything else in the PR, I would appreciate another developer taking a quick look at it.
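For intuition, what the one-time backfill has to compute (done in the PR by a Flyway SQL script, not by Python) is each container's total: its own files plus everything nested beneath it. A toy sketch, with hypothetical dictionaries standing in for the database:

```python
# Hypothetical sketch of the backfill computation: aggregate file sizes
# bottom-up so every container's total covers its whole subtree.

def backfill_storage_use(tree, file_sizes):
    """tree maps container -> list of child containers;
    file_sizes maps container -> bytes of files stored directly in it."""
    totals = {}

    def total(node):
        if node not in totals:
            totals[node] = file_sizes.get(node, 0) + sum(
                total(child) for child in tree.get(node, [])
            )
        return totals[node]

    for node in tree:
        total(node)
    return totals

tree = {"root": ["coll"], "coll": ["ds1", "ds2"]}
file_sizes = {"ds1": 100, "ds2": 300}
totals = backfill_storage_use(tree, file_sizes)
assert totals["coll"] == 400
assert totals["root"] == 400
```

Once this initial map exists, the incremental updates described above keep it current without ever re-walking the tree.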

Suggestions on how to test this:

The functionality is mostly for admins: both instance admins (aka "superusers"), who can set, change, and delete these quotas for specific collections, and collection admins (users who own and manage collections), who have read access and can see the quotas defined on their collections and the storage use already accumulated under them. It is for the most part API-only; at this point we don't want to invest much work into adding anything to the existing UI.

The only UI impact is that the end user of a collection that has a quota configured will be warned about it on the file upload page:
[screenshot: file upload page showing the collection quota notice]
... and of course they will get error messages on that page when attempting to upload anything that goes over the quota.

So the way to test the functionality would be to

  • create a collection
  • use a superuser api token to configure a quota, along the lines of
    curl -X POST -H "X-Dataverse-key: xxx" http://localhost:8080/api/dataverses/<COLLECTIONALIAS>/storage/quota/1048576
    (probably makes sense to use something stingy like that for testing...)
  • use the settings api to enable quota enforcement on the instance: (configuring a quota on a specific collection does not by itself enable the enforcement!)
    curl -X PUT -d true http://localhost:8080/api/admin/settings/:UseStorageQuotas
  • proceed to upload files, making sure files can be uploaded until the specified byte size is reached, and that Dataverse starts rejecting further uploads once that happens...
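The enforcement exercised in that last step can be sketched as follows. This is a simplified model, not the actual Dataverse code; the 1048576-byte limit matches the quota set via curl above.

```python
# Simplified model of quota enforcement: uploads succeed until the
# configured byte limit is reached, then are rejected. Illustrative only.

QUOTA_BYTES = 1048576  # the 1 MiB quota from the curl example above

def try_upload(storage_used, file_size, quota=QUOTA_BYTES):
    """Return the new storage total, or raise if the upload would exceed the quota."""
    if storage_used + file_size > quota:
        raise ValueError("Upload rejected: collection storage quota exceeded")
    return storage_used + file_size

used = 0
used = try_upload(used, 900_000)   # fits within the quota
used = try_upload(used, 148_576)   # exactly reaches 1048576 -> still allowed
try:
    try_upload(used, 1)            # one byte over -> rejected
except ValueError:
    pass  # this is the rejection the test plan expects to see
```

Note the check is against the effective quota for the collection hierarchy, so uploads into any dataset under the quota'd collection count toward the same limit.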

So the above is the simplest functionality test. How much effort to invest in trying to break the feature is a judgment call, a balance between thoroughness and efficiency.

Reviewing the documentation introduced in the PR is an integral part of QA. The goal should be to read it from the point of view of an admin managing a Dataverse instance somewhere else (the intended audience of the guide) and see whether it's sufficient for them to use the feature, whether it's clear enough, etc.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:


coveralls commented Nov 27, 2023

Coverage Status

coverage: 19.983% (-0.02%) from 20.006%
when pulling cda06a3 on 8549-collection-quotas
into 1ea692b on develop.

@@ -139,39 +143,6 @@ public class DataFileServiceBean implements java.io.Serializable {
*/
public static final String MIME_TYPE_PACKAGE_FILE = "application/vnd.dataverse.file-package";

public class UserStorageQuota {
Contributor Author

@landreev landreev Nov 27, 2023

This class was used during the phase one/proof of concept development.


@landreev landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Nov 27, 2023
@pdurbin pdurbin added this to the 6.1 milestone Nov 28, 2023
@pdurbin pdurbin moved this from Ready for Review ⏩ to In Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Nov 28, 2023
@pdurbin pdurbin self-assigned this Nov 28, 2023
Member

@pdurbin pdurbin left a comment


First pass review. Overall, looks good. I hope performance is good enough! It'll be a great feature, one that's often asked for!

@@ -0,0 +1,3 @@
This release adds support for defining storage size quotas for collections. Please see the API guide for details. This is an experimental feature that has not yet been used in production on any real-life Dataverse instance, but we are planning to try it out at Harvard/IQSS.
Please note that this release includes a database update (via a Flyway script) that will calculate the storage sizes of all the existing datasets and collections on the first deployment. On a large production database with tens of thousands of datasets, this may add a couple of extra minutes to the deployment.

Member

Now that we have a API changelog we could say something like, "See the API changelog for details: http://preview.guides.gdcc.io/en/develop/api/changelog.html "

Contributor Author

Will do.

doc/release-notes/8549-collection-quotas.md

curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota"

Will output the storage quota allocated (in bytes), or a message indicating that the quota is not defined for the collection.
Member

This makes me wonder if a quota can be defined globally or not. That is, a default quota for any random collection.

(Time passes.)

I see from some Javadoc below there is some inheritance going on. This should be added to the guides somewhere.

     * Checks if the supplied DvObjectContainer...
     * has a quota configured, and if not, keeps checking if any of the direct
     * ancestor Collections further up have a configured quota. If it finds one, 
     * it will retrieve the current total content size for that specific ancestor 
     * dvObjectContainer and use it to define the quota limit for the upload
     * session in progress. 
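A simplified sketch of the ancestor walk that Javadoc describes, using a toy parent map rather than the real DvObjectContainer API (all names here are illustrative):

```python
# Illustrative model of quota inheritance: walk up the collection tree
# until a configured quota is found; that ancestor's total then bounds
# the upload session.

def effective_quota(container, parents, quotas):
    """parents maps container -> parent (or None at the root);
    quotas maps container -> configured quota in bytes, if any."""
    node = container
    while node is not None:
        if node in quotas:
            return node, quotas[node]  # quota applies to this ancestor's total
        node = parents.get(node)
    return None, None  # no quota anywhere up the tree

parents = {"dataset": "subcoll", "subcoll": "topcoll", "topcoll": None}
quotas = {"topcoll": 10 * 1024**3}  # hypothetical 10 GiB quota on the top collection

owner, limit = effective_quota("dataset", parents, quotas)
assert owner == "topcoll"
assert limit == 10 * 1024**3
```

So there is no global default quota as such; a collection inherits the nearest configured quota above it, and the usage counted against that quota is the ancestor's total, not just the immediate dataset's.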

Contributor Author

Yes, there's definitely going to be more documentation, and more tests. I'm still working on these parts, and this is the main reason this is still a draft PR. But I really wanted other developers to look at the mechanics of the implementation.

Contributor

I think (correct me if I'm wrong) that that comment is less about a default global quota, and more about the fact that, since collections can contain collections, you have to go up the tree.

} else {
// Direct upload.
Member

It shouldn't be too hard to add tests for direct upload and quotas once we merge this PR:

Comment on lines 1 to 4
/*
* Click nbfs://nbhost/SystemFileSystem/Templates/Licenses/license-default.txt to change this license
* Click nbfs://nbhost/SystemFileSystem/Templates/Classes/Class.java to edit this template
*/
Member

Suggested change
/*
* Click nbfs://nbhost/SystemFileSystem/Templates/Licenses/license-default.txt to change this license
* Click nbfs://nbhost/SystemFileSystem/Templates/Classes/Class.java to edit this template
*/


// Upload a small file:

// [To be continued/work in progress]
Member

Yes, great to continue this testing.

Comment on lines 385 to 388
// @todo: Do we want to do this after *each* file is saved? - there may be
// quite a few files being saved here all at once. We could alternatively
// perform this update only once, after this loop is completed (are there any
// risks/accuracy loss?)
Member

Good that we're thinking about the "many files" case.

Contributor Author

Yes, this is still one of the work-in-progress/still-experimenting parts. I may have a better solution in the works.

savedSuccess = true;
logger.fine("Success: permanently saved file " + dataFile.getFileMetadata().getLabel());

// TODO: reformat this file to remove the many tabs added in cc08330
Member

Oh good, tabs are being removed, I think, I hope.

Comment on lines +759 to +760
Collection Storage Quotas
~~~~~~~~~~~~~~~~~~~~~~~~~
Member

Any UI impact? Should we add to the long, dynamic list of rules? Or defer until we're off JSF? Screenshot below:

[screenshot: the dynamic list of upload rules on the file upload page]

Contributor Author

@landreev landreev Nov 29, 2023

Yes, the remaining quota info will be shown to the user on the file upload page, above. The code for that is already in the develop branch, it was added in the proof of concept/phase one pr (#9409).

Contributor Author

Like this:
[screenshot: remaining quota info shown on the file upload page]

@pdurbin pdurbin removed their assignment Nov 28, 2023
@pdurbin pdurbin moved this from In Review 🔎 to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Nov 28, 2023


@landreev
Contributor Author

landreev commented Dec 4, 2023

My one big concern about this scheme is its reliance on the new StorageUse table, which needs to be modified every time a file (or a batch of files) is saved successfully. Specifically, the one entry in it for the root collection will need to be updated any time a file is uploaded anywhere in the dvobject tree. The same applies to any large collection, or any collection where multiple active uploads are happening in parallel in different parts of the tree. The update itself is quite simple and fast, and the order in which the actual increments are applied is irrelevant, but I'm still wondering if there is some potential race condition that can lock things up(?).
In practical terms, my response is that we'll never know for sure until we deploy the feature on a busy prod. instance like ours; and in case this ends up causing any database conflicts, I added a kill switch, a JVM/mpconfig option that stops the updates from happening in real time.


@cmbz cmbz added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Dec 4, 2023


IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from In Review 🔎 to Ready for QA ⏩ Dec 5, 2023
@scolapasta scolapasta removed their assignment Dec 5, 2023
@stevenwinship
Contributor

I'm doing the QA now. What, if anything, should happen if the quota is set to a value less than the current size of the existing data?


@landreev
Contributor Author

landreev commented Dec 5, 2023

I'm doing the QA now. What, if anything, should happen if the quota is set to a value less than the current size of the existing data?

The collection will immediately be over the storage quota limit, thus disabling further file uploads.
So this can be considered a way for an admin to make a specific collection read-only.


github-actions bot commented Dec 5, 2023

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8549-collection-quotas
ghcr.io/gdcc/configbaker:8549-collection-quotas

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

@stevenwinship stevenwinship merged commit b1c22d8 into develop Dec 5, 2023
20 checks passed
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Dec 5, 2023
Recherche Data Gouv (formerly Data INRAE) automation moved this from 🔍 Interest to Done Dec 5, 2023
Labels
Size: 30 A percentage of a sprint. 21 hours. (formerly size:33)
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Add mechanism for collection-wise storage size quotas
6 participants