Collection Quotas (8549) #10144

Merged
merged 36 commits into from
Dec 5, 2023

Conversation


@landreev landreev commented Nov 27, 2023

(working on adding the "how to test" section now)

What this PR does / why we need it:

Which issue(s) this PR closes:

Closes #8549

Special notes for your reviewer:
I'd like a second (and a third) opinion on the viability of this approach, as described below.

[another edit: In line with the scope of the issue, the quotas mechanism added is for collections only, but nothing in the implementation prevents extending it to datasets as well.]

The main idea behind the implementation is to maintain a real-time record of storage use for all the DvObjectContainers (Datasets and Collections), since we cannot possibly afford to calculate it on the fly for every file upload. We record storage use for the entire hierarchy of nested objects: adding a file to a dataset increments the storage use of that dataset and of all its parent collections, up to the root.
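To make the bookkeeping concrete, here is a minimal in-memory sketch of that hierarchical increment. The class and method names are illustrative only; the actual implementation records the totals in the StorageUse database table via native SQL queries.

```python
# Illustrative sketch only: models how one file-size increment propagates
# up the ownership chain. Not the actual Dataverse classes.

class Container:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.storage_use = 0  # bytes, maintained incrementally

    def increment_storage(self, size_in_bytes):
        # Walk up the ownership chain, incrementing every ancestor,
        # so each collection always knows its total nested usage.
        node = self
        while node is not None:
            node.storage_use += size_in_bytes
            node = node.parent

root = Container("root")
collection = Container("my-collection", parent=root)
dataset = Container("my-dataset", parent=collection)

dataset.increment_storage(1000)  # upload a 1000-byte file
dataset.increment_storage(250)   # ingest adds the tab-delimited version

assert dataset.storage_use == 1250
assert collection.storage_use == 1250
assert root.storage_use == 1250
```

The key property is that each node's counter is always the total for its entire subtree, so reading any collection's usage is a single lookup rather than a tree walk.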

This information is stored in its own dedicated table, StorageUse, rather than as an extra column in the DvObject table, since the latter would require updating all the parent DvObjects every time we upload a file, likely causing merge conflicts on a busy instance or in large collections.

These updates are performed via native recursive queries in new transactions, avoiding any cascading updates.

These recursive updates need to be performed not just every time a new file is uploaded, but also every time a file is successfully ingested (which adds the size of the archival tab-delimited file to the total), when an unpublished file is deleted, and when a tabular file is un-ingested.

For these purposes I'm only counting the sizes of the main datafiles, plus the generated tab-delimited versions of ingested tabular files. I.e., I consider these to be the archival "payload" of datasets and collections, as opposed to all the file- and dataset-level auxiliary files, such as resized image thumbnails and metadata exports, under the assumption that those are transient, i.e., can be deleted and automatically re-generated. If necessary, we can extend the scheme to also count the sizes of (some, select?) auxiliaries going forward. [edit: In the context of the IQSS prod. instance, this feature is specifically needed for the larger collections and datasets, on the order of TBs, where the sizes of these aux. files should not amount to any statistically significant portion of the overall storage.]

I chose to have this system of keeping a real-time record of storage use enabled automatically, regardless of whether quota enforcement is configured on the instance, with the rationale that this information can be very useful on its own. (The existing commands for calculating storage sizes do so on the fly and are thus less efficient.) Also, the idea is that these updates via direct native queries should be fairly efficient and not add any serious overhead. A somewhat bulky Flyway script is provided to populate this map for the existing objects. I tested it on a copy of the prod. db and the performance of the current version is adequate, but, like with everything else in the PR, I would appreciate another developer taking a quick look at it.
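For intuition, what the one-time backfill has to compute (done in the PR by a Flyway SQL script, not by Python) is each container's total: its own files plus everything nested beneath it. A toy sketch, with hypothetical dictionaries standing in for the database:

```python
# Hypothetical sketch of the backfill computation: aggregate file sizes
# bottom-up so every container's total covers its whole subtree.

def backfill_storage_use(tree, file_sizes):
    """tree maps container -> list of child containers;
    file_sizes maps container -> bytes of files stored directly in it."""
    totals = {}

    def total(node):
        if node not in totals:
            totals[node] = file_sizes.get(node, 0) + sum(
                total(child) for child in tree.get(node, [])
            )
        return totals[node]

    for node in tree:
        total(node)
    return totals

tree = {"root": ["coll"], "coll": ["ds1", "ds2"]}
file_sizes = {"ds1": 100, "ds2": 300}
totals = backfill_storage_use(tree, file_sizes)
assert totals["coll"] == 400
assert totals["root"] == 400
```

Once this initial map exists, the incremental updates described above keep it current without ever re-walking the tree.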

Suggestions on how to test this:

The functionality is mostly for admins: both instance admins (aka "superusers"), who can set, change, and delete these quotas for specific collections, and collection admins (users who own and manage collections), who have read access and can see the quotas defined on their collections and the storage use already accumulated under them. It is for the most part API-only; at this point we don't want to invest much work into adding anything to the existing UI.

The only UI impact is that the end user of a collection that has a quota configured will be warned about it on the file upload page:
[screenshot: file upload page showing the collection quota notice]
... and of course they will get error messages on that page when attempting to upload anything that goes over the quota.

So the way to test the functionality would be to

  • create a collection
  • use a superuser api token to configure a quota, along the lines of
    curl -X POST -H "X-Dataverse-key: xxx" http://localhost:8080/api/dataverses/<COLLECTIONALIAS>/storage/quota/1048576
    (probably makes sense to use something stingy like that for testing...)
  • use the settings api to enable quota enforcement on the instance: (configuring a quota on a specific collection does not by itself enable the enforcement!)
    curl -X PUT -d true http://localhost:8080/api/admin/settings/:UseStorageQuotas
  • proceed to upload files, making sure files can be uploaded until the specified byte size is reached, and that Dataverse starts rejecting further uploads once that happens...
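The enforcement exercised in that last step can be sketched as follows. This is a simplified model, not the actual Dataverse code; the 1048576-byte limit matches the quota set via curl above.

```python
# Simplified model of quota enforcement: uploads succeed until the
# configured byte limit is reached, then are rejected. Illustrative only.

QUOTA_BYTES = 1048576  # the 1 MiB quota from the curl example above

def try_upload(storage_used, file_size, quota=QUOTA_BYTES):
    """Return the new storage total, or raise if the upload would exceed the quota."""
    if storage_used + file_size > quota:
        raise ValueError("Upload rejected: collection storage quota exceeded")
    return storage_used + file_size

used = 0
used = try_upload(used, 900_000)   # fits within the quota
used = try_upload(used, 148_576)   # exactly reaches 1048576 -> still allowed
try:
    try_upload(used, 1)            # one byte over -> rejected
except ValueError:
    pass  # this is the rejection the test plan expects to see
```

Note the check is against the effective quota for the collection hierarchy, so uploads into any dataset under the quota'd collection count toward the same limit.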

So the above is the simplest functionality test. How much effort to invest in trying to break the feature is a judgment call, a balance between thoroughness and efficiency.

Reviewing the documentation introduced in the PR is an integral part of QA. The goal should be to read it from the point of view of an admin managing a Dataverse instance somewhere else (the intended audience of the guide) and see whether it's sufficient for them to use the feature, whether it's clear enough, etc.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:


coveralls commented Nov 27, 2023

Coverage Status

coverage: 19.983% (-0.02%) from 20.006%
when pulling cda06a3 on 8549-collection-quotas
into 1ea692b on develop.

@@ -139,39 +143,6 @@ public class DataFileServiceBean implements java.io.Serializable {
*/
public static final String MIME_TYPE_PACKAGE_FILE = "application/vnd.dataverse.file-package";

public class UserStorageQuota {
Contributor Author

@landreev landreev Nov 27, 2023

This class was used during the phase one/proof of concept development.


@landreev landreev added this to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) via automation Nov 27, 2023
@pdurbin pdurbin added this to the 6.1 milestone Nov 28, 2023
@pdurbin pdurbin moved this from Ready for Review ⏩ to In Review 🔎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Nov 28, 2023
@pdurbin pdurbin self-assigned this Nov 28, 2023
Member

@pdurbin pdurbin left a comment


First pass review. Overall, looks good. I hope performance is good enough! It'll be a great feature, one that's often asked for!

@@ -0,0 +1,3 @@
This release adds support for defining storage size quotas for collections. Please see the API guide for details. This is an experimental feature that has not yet been used in production on any real-life Dataverse instance, but we are planning to try it out at Harvard/IQSS.
Please note that this release includes a database update (via a Flyway script) that will calculate the storage sizes of all the existing datasets and collections on the first deployment. On a large production database with tens of thousands of datasets, this may add a couple of extra minutes to the deployment.

Member

Now that we have a API changelog we could say something like, "See the API changelog for details: http://preview.guides.gdcc.io/en/develop/api/changelog.html "

Contributor Author

Will do.

doc/release-notes/8549-collection-quotas.md

curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/dataverses/$ID/storage/quota"

Will output the storage quota allocated (in bytes), or a message indicating that the quota is not defined for the collection.
Member

This makes me wonder if a quota can be defined globally or not. That is, a default quota for any random collection.

(Time passes.)

I see from some Javadoc below there is some inheritance going on. This should be added to the guides somewhere.

     * Checks if the supplied DvObjectContainer...
     * has a quota configured, and if not, keeps checking if any of the direct
     * ancestor Collections further up have a configured quota. If it finds one, 
     * it will retrieve the current total content size for that specific ancestor 
     * dvObjectContainer and use it to define the quota limit for the upload
     * session in progress. 
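A simplified sketch of the ancestor walk that Javadoc describes, using a toy parent map rather than the real DvObjectContainer API (all names here are illustrative):

```python
# Illustrative model of quota inheritance: walk up the collection tree
# until a configured quota is found; that ancestor's total then bounds
# the upload session.

def effective_quota(container, parents, quotas):
    """parents maps container -> parent (or None at the root);
    quotas maps container -> configured quota in bytes, if any."""
    node = container
    while node is not None:
        if node in quotas:
            return node, quotas[node]  # quota applies to this ancestor's total
        node = parents.get(node)
    return None, None  # no quota anywhere up the tree

parents = {"dataset": "subcoll", "subcoll": "topcoll", "topcoll": None}
quotas = {"topcoll": 10 * 1024**3}  # hypothetical 10 GiB quota on the top collection

owner, limit = effective_quota("dataset", parents, quotas)
assert owner == "topcoll"
assert limit == 10 * 1024**3
```

So there is no global default quota as such; a collection inherits the nearest configured quota above it, and the usage counted against that quota is the ancestor's total, not just the immediate dataset's.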

Contributor Author

Yes, there's definitely going to be more documentation, and more tests. I'm still working on these parts, and this is the main reason this is still a draft PR. But I really wanted other developers to look at the mechanics of the implementation.

Contributor

I think (correct me if I'm wrong) that that comment is less about a default global quota, and more about the fact that, since collections can contain collections, you have to go up the tree.

} else {
// Direct upload.
Member

It shouldn't be too hard to add tests for direct upload and quotas once we merge this PR:

Comment on lines 1 to 4
/*
* Click nbfs://nbhost/SystemFileSystem/Templates/Licenses/license-default.txt to change this license
* Click nbfs://nbhost/SystemFileSystem/Templates/Classes/Class.java to edit this template
*/
Member

Suggested change
/*
* Click nbfs://nbhost/SystemFileSystem/Templates/Licenses/license-default.txt to change this license
* Click nbfs://nbhost/SystemFileSystem/Templates/Classes/Class.java to edit this template
*/


// Upload a small file:

// [To be continued/work in progress]
Member

Yes, great to continue this testing.

Comment on lines 385 to 388
// @todo: Do we want to do this after *each* file is saved? - there may be
// quite a few files being saved here all at once. We could alternatively
// perform this update only once, after this loop is completed (are there any
// risks/accuracy loss?)
Member

Good that we're thinking about the "many files" case.

Contributor Author

Yes, this is still one of the work-in-progress/still-experimenting parts. I may have a better solution in the works.

savedSuccess = true;
logger.fine("Success: permanently saved file " + dataFile.getFileMetadata().getLabel());

// TODO: reformat this file to remove the many tabs added in cc08330
Member

Oh good, tabs are being removed, I think, I hope.

Comment on lines +759 to +760
Collection Storage Quotas
~~~~~~~~~~~~~~~~~~~~~~~~~
Member

Any UI impact? Should we add to the long, dynamic list of rules? Or defer until we're off JSF? Screenshot below:

[screenshot: the dynamic list of upload rules on the file upload page]

Contributor Author

@landreev landreev Nov 29, 2023

Yes, the remaining quota info will be shown to the user on the file upload page, above. The code for that is already in the develop branch, it was added in the proof of concept/phase one pr (#9409).

Contributor Author

Like this:
[screenshot: remaining quota info shown on the file upload page]

@pdurbin pdurbin removed their assignment Nov 28, 2023
@pdurbin pdurbin moved this from In Review 🔎 to Ready for Review ⏩ in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Nov 28, 2023


@landreev
Contributor Author

landreev commented Dec 4, 2023

My one big concern about this scheme is its reliance on the new StorageUse table, which needs to be modified every time a file (or a batch of files) is saved successfully. Specifically, the one entry in it for the root collection will need to be updated any time a file is uploaded anywhere in the dvobject tree. The same applies to any large collection, or any collection where multiple active uploads are happening in parallel in different parts of the tree. The update itself is quite simple and fast, and the order in which the actual increments are applied is irrelevant, but I'm still wondering if there is some potential race condition that can lock things up(?).
In practical terms, my response is that we'll never know for sure until we deploy the feature on a busy prod. instance like ours; and in case this ends up causing any database conflicts, I added a kill switch, a JVM/mpconfig option that stops the updates from happening in real time.


@cmbz cmbz added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Dec 4, 2023


IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from In Review 🔎 to Ready for QA ⏩ Dec 5, 2023
@scolapasta scolapasta removed their assignment Dec 5, 2023
@stevenwinship
Contributor

I'm doing the QA now. What, if anything, should happen if the quota is set to a value less than the current size of the existing data?


@landreev
Contributor Author

landreev commented Dec 5, 2023

I'm doing the QA now. What, if anything, should happen if the quota is set to a value less than the current size of the existing data?

The collection will immediately be over the storage quota limit, thus disabling further file uploads.
So this can be considered a way for an admin to make a specific collection read-only.


github-actions bot commented Dec 5, 2023

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8549-collection-quotas
ghcr.io/gdcc/configbaker:8549-collection-quotas

🚢 See on GHCR. Use by referencing the full name as printed above; mind the registry name.

@stevenwinship stevenwinship merged commit b1c22d8 into develop Dec 5, 2023
20 checks passed
IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) automation moved this from QA ✅ to Done 🚀 Dec 5, 2023
Recherche Data Gouv (formerly Data INRAE) automation moved this from 🔍 Interest to Done Dec 5, 2023
Labels
Size: 30 A percentage of a sprint. 21 hours. (formerly size:33)
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Add mechanism for collection-wise storage size quotas
6 participants