
Performance: Slow response for the versions API call with large number of files or versions #9763

Closed
tainguyenbui opened this issue Apr 10, 2020 · 24 comments · Fixed by #9883
Labels: D: Dataset: large number of files (https://github.com/IQSS/dataverse-pm/issues/27), Feature: Performance & Stability, Size: 10 (a percentage of a sprint: 7 hours), Type: Bug (a defect), User Role: API User (makes use of APIs)
Milestone: 6.1

@tainguyenbui (Contributor) commented Apr 10, 2020

Hi Dataverse Team,

I just wanted to share something that affects datasets with a high number of versions or files.

There have been some related issues that were already solved/closed. However, I strongly believe the problem can still occur. I will back up the issue with some interesting data.

The problem below occurs when retrieving dataset version information through the native API, hitting the endpoint: http://demo.dataverse.org/api/datasets/<dataset-id>/versions

Given that I am a user with many files and versions in a dataset
When I retrieve all the dataset versions
Then I would like to receive a fairly "fast" response
So the user experience is smooth

Current behavior

When the dataset has a large number of files, and also a large number of versions, the response time increases dramatically. This can be seen in the table below:

| dataset id | # of versions | # of files | response size (MB) | response time | # lines in response |
|------------|---------------|------------|--------------------|---------------|---------------------|
| 767863     | 4             | 1349       | 2.92               | 11.42s        | 120k                |
| 396086     | 82            | 36         | 1.66               | 10.83s        | 80k                 |
| 774618     | 14            | 69         | 0.51               | 3.57s         | 23k                 |
| 770972     | 2             | 60         | 0.08               | 1.14s         | 3.3k                |

One of our concerns here is that the dataset with id 767863 already takes a long time with only 4 versions, which means that once it reaches, for instance, 10 versions, it could easily take more than 20 seconds to respond and potentially cause a timeout in Dataverse.

Additionally, Dataverse currently returns, as part of the response, all the files and their metadata for each of the available versions. That produces a very large response payload that may be unnecessary.

Note: The number of files also seems to affect the speed at which a version is published.

Expected behavior

To have the ability to retrieve dataset version information in an efficient way that does not massively impact the response time.

Possible solution

  • To return basic metadata about each of the available dataset versions, without the file information, which could well be the real source of the problem.

  • To review whether there are possible parallelization improvements

It is likely that the user will not need all the information for every version up front. Normally, they would click on the version they are interested in, at which point we could perform another request such as http://demo.dataverse.org/api/datasets/<dataset-id>/versions/<version>
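The two-step flow described above can be sketched like this (a Python sketch; the base URL and dataset id are placeholders for illustration only):

```python
# Sketch of the two-step flow: first list the versions, then fetch full
# metadata only for the version the user selects.

BASE = "https://demo.dataverse.org/api"

def versions_url(dataset_id):
    # The endpoint that currently returns everything (the slow call).
    return f"{BASE}/datasets/{dataset_id}/versions"

def version_url(dataset_id, version):
    # The per-version endpoint, hit only once a version is selected.
    return f"{BASE}/datasets/{dataset_id}/versions/{version}"
```

With a slimmed-down first call, the heavy file metadata would only be transferred for the one version the user actually opens.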

Thanks a lot!

@tainguyenbui tainguyenbui changed the title Performance: Slow response for datasets with high number of files Performance: Slow response for datasets with high number of files or versions Apr 10, 2020
@djbrooke (Contributor):

Thanks @tainguyenbui for the detailed writeup (with data!)!

@djbrooke (Contributor):

Moving to project board so @scolapasta and dev team can discuss and bring into a future sprint

@Gerafp (Contributor) commented Nov 3, 2022

Unfortunately, we have a similar problem in our Dataverse installation. We have a dataset with 34,618 files, all stored in S3. When someone requests this dataset, Dataverse is slow to respond and users are reporting the problem.
We increased the response timeout in the Apache server to prevent 500 errors, but we have no idea what is going on, because no errors appear in the logs.

Do you have any other suggestions we could apply in Dataverse?

Regards

KMIT - CIMMYT

@sbarbosadataverse commented Nov 3, 2022 via email

@pdurbin (Member) commented Nov 3, 2022

@Gerafp when you say "make a request", do you mean download? If so, you could try wget with the new-ish "dirindex" view of files: https://guides.dataverse.org/en/5.12/api/native-api.html#view-dataset-files-and-folders-as-a-directory-index

@Gerafp (Contributor) commented Nov 8, 2022

Hi @pdurbin

Thanks for your response. I mean when a user views the dataset's page, or when a curator needs to edit the metadata.

@mreekie mreekie added the D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 label Feb 13, 2023
@pdurbin (Member) commented Mar 9, 2023

Last time (3 weeks ago), we took some notes here: #8928 (comment)

Huge issue. Not clear. Not ready to be estimated. Our current strategy is a workaround, to not have 30,000 files in a single dataset.

We'd like to use this issue as an umbrella issue for all problems regarding datasets with too many files. We should split off separate, smaller chunks.

@mreekie commented Mar 14, 2023

> Last time (3 weeks ago), we took some notes here: #8928 (comment)
>
> Huge issue. Not clear. Not ready to be estimated. Our current strategy is a workaround, to not have 30,000 files in a single dataset.
>
> We'd like to use this issue as an umbrella issue for all problems regarding datasets with too many files. We should split off separate, smaller chunks.

Sizing

  • Let's break off the first chunk in our next sizing meeting

@mreekie mreekie added the zbklog: Deliverable This is an item synched from the product planning process label Mar 14, 2023
@mreekie mreekie changed the title Performance: Slow response for datasets with high number of files or versions bklog: Deliverable - Performance: Slow response for datasets with high number of files or versions Mar 15, 2023
@mreekie mreekie transferred this issue from IQSS/dataverse Mar 15, 2023
@mreekie mreekie changed the title bklog: Deliverable - Performance: Slow response for datasets with high number of files or versions Performance: Slow response for datasets with high number of files or versions Mar 22, 2023
@mreekie mreekie changed the title Performance: Slow response for datasets with high number of files or versions Performance: Slow response for the versions API call with large number of files or versions Mar 22, 2023
@mreekie commented Mar 22, 2023

Sizing:

  • This is no longer the umbrella issue. The umbrella issue is: Deliverable: Slow response for datasets with high number of files dataverse-pm#29
  • Changed title to reflect what the issue is: Performance: Slow response for the versions API call with large number of files or versions
  • A suggested possible solution is a basic API call that does less work, so that you get less information more quickly.
  • We are sizing this for this suggestion.
  • Estimated size: 33

Next steps:

  • Add the API change.
  • Check in with the customer.

@mreekie mreekie added Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) and removed zbklog: Deliverable This is an item synched from the product planning process labels Mar 22, 2023
@jggautier (Contributor):

In case this is helpful:

When I use the APIs to get the metadata of all versions of a dataset, and the API call is slow because the dataset has many versions and might also have many files (and maybe a lot of metadata), I've looked for a way to break the task into multiple API calls, one for each version of the dataset.

In one API call, we can get the metadata of a particular version of a dataset. But as far as I know, the only way to use the APIs to get a list of version numbers of a dataset is to use the endpoint that returns information about all versions of the dataset, which is the very API call I would like to avoid using.

So I'm wondering if another solution might be better support for getting information about all dataset versions, by making it easier to break the task into multiple API calls.

For example, if there was an endpoint that returns the list of version numbers of a given dataset, I could then use that list to make multiple API calls, one for each version.
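That fan-out idea could look like this (a Python sketch; the lightweight endpoint returning only version numbers is hypothetical and does not exist at this point):

```python
# Sketch of the fan-out idea: given a list of version numbers (obtained
# from a hypothetical version-list endpoint), build one small per-version
# request each, instead of one huge /versions call.

def per_version_urls(dataset_id, version_numbers):
    base = f"https://demo.dataverse.org/api/datasets/{dataset_id}/versions"
    return [f"{base}/{v}" for v in version_numbers]
```

Each resulting URL could then be fetched independently (or in parallel), keeping every individual response small.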

@tainguyenbui (Contributor, Author):

Not expanding all the information would definitely help with the payload size. I've not looked at the data structure of versions, but if we could do some lazy fetching it could also speed up the query.

Your solution would have been good enough for us, despite needing the metadata and files once a version is selected. So, as a result, two endpoints:

GET /datasets/<dataset-id>/versions - returns a pageable list of versions (just in case there are thousands).
GET /datasets/<dataset-id>/versions/<version-id> - returns all the information about the given version.

However, take into consideration that this could mean breaking changes for existing applications that rely on the /versions endpoint returning the metadata and file information. So it would require a new version of the endpoint, or a parameter that defines whether you want a minified response.

@landreev (Contributor):

@GPortas I just want to confirm quickly that you are ok with what I'm doing in this branch wrt the datasets/.../versions api:

  • Dropping the files listing from the (default) output of both the /api/datasets/{id}/versions and /api/datasets/{id}/versions/{vid}.
  • Adding the optional includeFiles=true flag to the apis above, in case anyone needs backward compatibility.
  • Adding optional pagination to /api/datasets/{id}/versions via offset= and limit=.

I'm very open to any input/if any changes are needed.
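A rough illustration of how a client might call the API with these parameters (a sketch; the parameter names includeFiles, offset, and limit are taken from the list above, but the default for include_files here is illustrative only):

```python
from urllib.parse import urlencode

def versions_request(dataset_id, include_files=False, offset=None, limit=None):
    # Build a /versions request using the proposed optional parameters.
    params = {}
    if include_files:
        params["includeFiles"] = "true"
    if offset is not None:
        params["offset"] = str(offset)
    if limit is not None:
        params["limit"] = str(limit)
    url = f"https://demo.dataverse.org/api/datasets/{dataset_id}/versions"
    return f"{url}?{urlencode(params)}" if params else url
```

For example, `versions_request(5255036, limit=10)` would fetch only the first ten versions, without any file listings.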

@GPortas (Contributor) commented Aug 21, 2023

@landreev

Looks good to me! Just a couple of considerations:

  • Making the includeFiles option true by default, even if it is not specified as a query param, for backwards compatibility. I guess you've considered it, but just in case.

  • Extending the documentation to warn about the performance issues that can arise if these new parameters are not used.

  • I would also reference the /api/datasets/{id}/versions/{versionId}/files endpoint (the one we are extending) in the documentation as the preferred option when the user is working closer to the files and looking for more filtering and sorting options.

@landreev landreev added the Size: 10 A percentage of a sprint. 7 hours. label Aug 30, 2023
@landreev (Contributor):

To summarize, in this branch I'm addressing the performance issues in the versions api via a combination of the following approaches:

  • (optional) pagination is being added to /api/datasets/{id}/versions
  • a new flag includeFiles is added to both /api/datasets/{id}/versions and /api/datasets/{id}/versions/{vid} (true by default), providing an option to drop the file information from the output
  • when files are requested to be included, the filemetadatas are looked up using the "left join fetch" optimization, which retrieves the information from the extended table tree in a single query, instead of allowing EJB to perform one SELECT query for each 1:1 relation on each FileMetadata entry. On a version with N filemetadatas this saves 3*N individual queries. This is not free resources-wise by any means (it primarily costs more memory), but it still shows a measurable improvement on datasets with large numbers of versions and/or files.
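As a back-of-the-envelope illustration of the third point (plain Python arithmetic, not the actual JPA code):

```python
# Query counts for the N+1 pattern vs. a single "left join fetch" query,
# per the 3*N figure above (3 individually-fetched 1:1 relations per entry).

def naive_query_count(n_filemetadatas, relations_per_entry=3):
    # One query for the filemetadata list itself, plus one SELECT per
    # 1:1 relation for each FileMetadata entry (the classic N+1 pattern).
    return 1 + relations_per_entry * n_filemetadatas

def join_fetch_query_count(n_filemetadatas):
    # A single join-fetch query retrieves the extended table tree in one go.
    return 1
```

For the ~25,000-file dataset below, that is roughly 75,001 queries reduced to 1, which is consistent with the 2m16s-to-12s improvement measured.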

The following real life datasets from IQSS prod. service are used in the sample tests below:

  • 5255036: 237 versions (!), ~2,400 files. Files are relatively sparsely spread between versions.
  • 4554342: 2 versions, 1 published, 1 draft. ~25,000 files (all appear in both versions).
  • 2710927: a control "reasonable" dataset, 100 files, 3 published versions.

/api/datasets/{id}/versions:

| dataset id | # of versions | # of files | response size | v5.14 response time | new response time | new size w/out files | new time w/out files |
|------------|---------------|------------|---------------|---------------------|-------------------|----------------------|----------------------|
| 5255036    | 237           | 2400       | 38MB          | 1m30s               | 48s               | 2MB                  | 12s                  |
| 4554342    | 2             | 25000      | 17MB          | 2m16s               | 12s               | 2.5K                 | 1s                   |
| 2710927    | 3             | 100        | .2MB          | 1s                  | <1s               | 26K                  | <1s                  |

/api/datasets/{id}/versions/{vid}:

# 
| dataset id | version | # of files in version | response size | v5.14 response time | new response time | new size w/out files | new time w/out files |
|------------|---------|-----------------------|---------------|---------------------|-------------------|----------------------|----------------------|
| 5255036    | 87.1    | 12                    | 24K           | 1s                  | <1s               | 12K                  | <1s                  |
| 4554342    | draft   | 25000                 | 17MB          | 2m17s               | 12s               | 4K                   | <1s                  |
| 2710927    | 3.0     | 100                   | 78K           | 1s                  | <1s               | 9K                   | <1s                  |

Note: /api/datasets/4554342/versions and /api/datasets/4554342/versions/draft produce the same amount of output because the former API call was made without authentication, so only the one published version was included.

Note: tests above were run on the dedicated IQSS test system (not in prod., which is a beefier and faster system).

I will add more info, and will include this in the pr as well.

@landreev (Contributor):

(to clarify, the results in the last update are with the extra "citation date" logic commented out from the code; I am working on addressing that)

@ErykKul (Contributor) commented Sep 1, 2023

I did something similar some time ago with the left-join hints in this PR: #9684
The issue I had opened, #9683, looks redundant with this one; sorry I had missed it.

landreev added commits that referenced this issue on Sep 6 and Sep 13, 2023
@pdurbin pdurbin added Type: Bug a defect User Role: API User Makes use of APIs labels Oct 9, 2023
landreev added a commit that referenced this issue Oct 11, 2023: "… filemetadatas retrieval method, not directly used in the PR). (#9763)"
pdurbin added a commit that referenced this issue Oct 13, 2023: "avoid conflict with V6.0.0.1__9599-guestbook-at-request.sql"
@pdurbin pdurbin added this to the 6.1 milestone Oct 18, 2023