
Performance: Slow response for the versions API call with large number of files or versions #9763

Closed
tainguyenbui opened this issue Apr 10, 2020 · 24 comments · Fixed by #9883
Labels: D: Dataset: large number of files (https://github.com/IQSS/dataverse-pm/issues/27), Feature: Performance & Stability, Size: 10 (a percentage of a sprint: 7 hours), Type: Bug (a defect), User Role: API User (makes use of APIs)
Milestone: 6.1

@tainguyenbui (Contributor) commented Apr 10, 2020

Hi Dataverse Team,

I just wanted to share something that affects datasets with a high number of versions or files.

There have been some related issues that were already solved/closed. However, I strongly believe the problem can still occur. I will back up the issue with some interesting data.

The problem below occurs when retrieving dataset version information through the native API, hitting the endpoint: http://demo.dataverse.org/api/datasets/<dataset-id>/versions

Given that I am a user with many files and versions in a dataset
When I retrieve all the dataset versions
Then I would like to receive a fairly "fast" response
So the user experience is smooth

Current behavior

When the dataset has a large number of files, and also a large number of versions, the response time increases dramatically. This can be seen in the table below:

| dataset id | # of versions | # of files | response size (MB) | response time | # lines in response |
|------------|---------------|------------|--------------------|---------------|---------------------|
| 767863     | 4             | 1349       | 2.92               | 11.42s        | 120k                |
| 396086     | 82            | 36         | 1.66               | 10.83s        | 80k                 |
| 774618     | 14            | 69         | 0.51               | 3.57s         | 23k                 |
| 770972     | 2             | 60         | 0.08               | 1.14s         | 3.3k                |

One of our concerns here is that the dataset with id 767863 already takes a long time with only 4 versions, which means that once it reaches, for instance, 10 versions, it could easily take more than 20 seconds to respond and potentially cause a timeout in Dataverse.

Additionally, Dataverse currently returns, as part of the response, all the files and their metadata for each of the available versions. That produces a very large response payload that may be unnecessary.

Note: The number of files also seems to affect the speed at which a version is published.

Expected behavior

To have the ability to retrieve dataset version information in an efficient way that does not massively impact the response time.

Possible solution

  • To return basic metadata about each of the available dataset versions, without the file information, which could well be the real source of the problem.

  • To review whether there are possible parallelization improvements

It is likely that the user will not need all the information for every version up front. Normally, they would click on the version they are interested in, at which point we could perform another request such as http://demo.dataverse.org/api/datasets/<dataset-id>/versions/<version>
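The two-step flow described above can be sketched like this (a Python sketch; the base URL and dataset id are placeholders for illustration only):

```python
# Sketch of the two-step flow: first list the versions, then fetch full
# metadata only for the version the user selects.

BASE = "https://demo.dataverse.org/api"

def versions_url(dataset_id):
    # The endpoint that currently returns everything (the slow call).
    return f"{BASE}/datasets/{dataset_id}/versions"

def version_url(dataset_id, version):
    # The per-version endpoint, hit only once a version is selected.
    return f"{BASE}/datasets/{dataset_id}/versions/{version}"
```

With a slimmed-down first call, the heavy file metadata would only be transferred for the one version the user actually opens.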

Thanks a lot!

@tainguyenbui tainguyenbui changed the title Performance: Slow response for datasets with high number of files Performance: Slow response for datasets with high number of files or versions Apr 10, 2020
@djbrooke (Contributor):

Thanks @tainguyenbui for the detailed writeup (with data!)!

@djbrooke (Contributor):

Moving to project board so @scolapasta and dev team can discuss and bring into a future sprint

@Gerafp (Contributor) commented Nov 3, 2022

Unfortunately, we have a similar problem in our Dataverse installation. We have a dataset with 34,618 files, all stored in S3. When someone requests this dataset, Dataverse is slow to respond and users are reporting the problem.
We increased the response timeout in the Apache server to prevent 500 errors, but we have no idea what is going on, because no errors appear in the logs.

Do you have any other suggestions we could apply in Dataverse?

Regards

KMIT - CIMMYT

@sbarbosadataverse commented Nov 3, 2022 via email

@pdurbin (Member) commented Nov 3, 2022

@Gerafp when you say "make a request", do you mean download? If so, you could try wget with the new-ish "dirindex" view of files: https://guides.dataverse.org/en/5.12/api/native-api.html#view-dataset-files-and-folders-as-a-directory-index

@Gerafp (Contributor) commented Nov 8, 2022

Hi @pdurbin

Thanks for your response. I mean when a user views the dataset's page, or when a curator needs to edit the metadata.

@mreekie mreekie added the D: Dataset: large number of files https://github.com/IQSS/dataverse-pm/issues/27 label Feb 13, 2023
@pdurbin (Member) commented Mar 9, 2023

Last time (3 weeks ago), we took some notes here: #8928 (comment)

Huge issue. Not clear. Not ready to be estimated. Our current strategy is a workaround, to not have 30,000 files in a single dataset.

We'd like to use this issue as an umbrella issue for all problems regarding datasets with too many files. We should split off separate, smaller chunks.

@mreekie commented Mar 14, 2023

> Last time (3 weeks ago), we took some notes here: #8928 (comment)
>
> Huge issue. Not clear. Not ready to be estimated. Our current strategy is a workaround, to not have 30,000 files in a single dataset.
>
> We'd like to use this issue as an umbrella issue for all problems regarding datasets with too many files. We should split off separate, smaller chunks.

Sizing

  • Let's break off the first chunk in our next sizing meeting

@mreekie mreekie added the zbklog: Deliverable This is an item synched from the product planning process label Mar 14, 2023
@mreekie mreekie changed the title Performance: Slow response for datasets with high number of files or versions bklog: Deliverable - Performance: Slow response for datasets with high number of files or versions Mar 15, 2023
@mreekie mreekie transferred this issue from IQSS/dataverse Mar 15, 2023
@mreekie mreekie changed the title bklog: Deliverable - Performance: Slow response for datasets with high number of files or versions Performance: Slow response for datasets with high number of files or versions Mar 22, 2023
@mreekie mreekie changed the title Performance: Slow response for datasets with high number of files or versions Performance: Slow response for the versions API call with large number of files or versions Mar 22, 2023
@mreekie commented Mar 22, 2023

Sizing:

  • This is no longer the umbrella issue. The umbrella issue is: Deliverable: Slow response for datasets with high number of files dataverse-pm#29
  • Changed title to reflect what the issue is: Performance: Slow response for the versions API call with large number of files or versions
  • A suggested possible solution is a basic API call that does less work, so that you get less information more quickly.
  • We are sizing this for this suggestion.
  • Estimated size: 33

Next steps:

  • Add the API change.
  • Check in with the customer.

@mreekie mreekie added Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) and removed zbklog: Deliverable This is an item synched from the product planning process labels Mar 22, 2023
@jggautier (Contributor):

In case this is helpful:

When I use the APIs to get the metadata of all versions of a dataset, and the API call is slow because the dataset has many versions and might also have many files (and maybe a lot of metadata), I've looked for a way to break the task into multiple API calls, one for each version of the dataset.

In one API call, we can get the metadata of a particular version of a dataset. But as far as I know, the only way to use the APIs to get a list of version numbers of a dataset is to use the endpoint that returns information about all versions of the dataset, which is the very API call I would like to avoid using.

So I'm wondering if another solution might be better support for getting information about all dataset versions, by making it easier to break the task into multiple API calls.

For example, if there was an endpoint that returns the list of version numbers of a given dataset, I could then use that list to make multiple API calls, one for each version.
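That fan-out idea could look like this (a Python sketch; the lightweight endpoint returning only version numbers is hypothetical and does not exist at this point):

```python
# Sketch of the fan-out idea: given a list of version numbers (obtained
# from a hypothetical version-list endpoint), build one small per-version
# request each, instead of one huge /versions call.

def per_version_urls(dataset_id, version_numbers):
    base = f"https://demo.dataverse.org/api/datasets/{dataset_id}/versions"
    return [f"{base}/{v}" for v in version_numbers]
```

Each resulting URL could then be fetched independently (or in parallel), keeping every individual response small.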

@tainguyenbui (Contributor, Author):

Not expanding all the information would definitely help with the payload size. I've not looked at the data structure of versions, but if we could do some lazy fetching it could also speed up the query.

Your solution would have been good enough for us, despite needing the metadata and files once a version is selected. So, as a result, two endpoints:

GET /datasets/<dataset-id>/versions - returns a pageable list of versions (just in case there are thousands).
GET /datasets/<dataset-id>/versions/<version-id> - returns all the information about the given version.

However, take into consideration that this could mean breaking changes for existing applications that rely on the /versions endpoint returning the metadata and file information. So it would require a new version of the endpoint, or a parameter that defines whether you want a minified response.

@landreev (Contributor):

@GPortas I just want to confirm quickly that you are ok with what I'm doing in this branch wrt the datasets/.../versions api:

  • Dropping the files listing from the (default) output of both the /api/datasets/{id}/versions and /api/datasets/{id}/versions/{vid}.
  • Adding the optional includeFiles=true flag to the apis above, in case anyone needs backward compatibility.
  • Adding optional pagination to /api/datasets/{id}/versions via offset= and limit=.

I'm very open to any input/if any changes are needed.
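A rough illustration of how a client might call the API with these parameters (a sketch; the parameter names includeFiles, offset, and limit are taken from the list above, but the default for include_files here is illustrative only):

```python
from urllib.parse import urlencode

def versions_request(dataset_id, include_files=False, offset=None, limit=None):
    # Build a /versions request using the proposed optional parameters.
    params = {}
    if include_files:
        params["includeFiles"] = "true"
    if offset is not None:
        params["offset"] = str(offset)
    if limit is not None:
        params["limit"] = str(limit)
    url = f"https://demo.dataverse.org/api/datasets/{dataset_id}/versions"
    return f"{url}?{urlencode(params)}" if params else url
```

For example, `versions_request(5255036, limit=10)` would fetch only the first ten versions, without any file listings.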

@GPortas (Contributor) commented Aug 21, 2023

@landreev

Looks good to me! Just a couple of considerations:

  • Making the includeFiles option true by default, even if it is not specified as a query param, for backwards compatibility. I guess you've considered it, but just in case.

  • Extending the documentation to warn about the performance issues that can arise if these new parameters are not used.

  • I would also reference the /api/datasets/{id}/versions/{versionId}/files endpoint (the one we are extending) in the documentation as the preferred option when the user is working closer to the files and looking for more filtering and sorting options.

@landreev landreev added the Size: 10 A percentage of a sprint. 7 hours. label Aug 30, 2023
@landreev (Contributor):

To summarize, in this branch I'm addressing the performance issues in the versions api via a combination of the following approaches:

  • (optional) pagination is being added to /api/datasets/{id}/versions
  • a new flag includeFiles is added to both /api/datasets/{id}/versions and /api/datasets/{id}/versions/{vid} (true by default), providing an option to drop the file information from the output
  • when files are requested to be included, the filemetadatas are looked up using the "left join fetch" optimization, which retrieves the information from the extended table tree in a single query, instead of allowing EJB to perform one SELECT query for each 1:1 relation on each FileMetadata entry. On a version with N filemetadatas this saves 3*N individual queries. This is not free resources-wise by any means (it primarily costs more memory), but it still shows a measurable improvement on datasets with large numbers of versions and/or files.
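As a back-of-the-envelope illustration of the third point (plain Python arithmetic, not the actual JPA code):

```python
# Query counts for the N+1 pattern vs. a single "left join fetch" query,
# per the 3*N figure above (3 individually-fetched 1:1 relations per entry).

def naive_query_count(n_filemetadatas, relations_per_entry=3):
    # One query for the filemetadata list itself, plus one SELECT per
    # 1:1 relation for each FileMetadata entry (the classic N+1 pattern).
    return 1 + relations_per_entry * n_filemetadatas

def join_fetch_query_count(n_filemetadatas):
    # A single join-fetch query retrieves the extended table tree in one go.
    return 1
```

For the ~25,000-file dataset below, that is roughly 75,001 queries reduced to 1, which is consistent with the 2m16s-to-12s improvement measured.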

The following real life datasets from IQSS prod. service are used in the sample tests below:

  • 5255036: 237 versions (!), ~2,400 files. Files are relatively sparsely spread between versions.
  • 4554342: 2 versions, 1 published, 1 draft. ~25,000 files (all appear in both versions).
  • 2710927: a control "reasonable" dataset, 100 files, 3 published versions.

/api/datasets/{id}/versions:

| dataset id | # of versions | # of files | response size | v5.14 response time | new response time | new size w/out files | new time w/out files |
|------------|---------------|------------|---------------|---------------------|-------------------|----------------------|----------------------|
| 5255036    | 237           | 2400       | 38MB          | 1m30s               | 48s               | 2MB                  | 12s                  |
| 4554342    | 2             | 25000      | 17MB          | 2m16s               | 12s               | 2.5K                 | 1s                   |
| 2710927    | 3             | 100        | .2MB          | 1s                  | <1s               | 26K                  | <1s                  |

/api/datasets/{id}/versions/{vid}:

# 
| dataset id | version | # of files in version | response size | v5.14 response time | new response time | new size w/out files | new time w/out files |
|------------|---------|-----------------------|---------------|---------------------|-------------------|----------------------|----------------------|
| 5255036    | 87.1    | 12                    | 24K           | 1s                  | <1s               | 12K                  | <1s                  |
| 4554342    | draft   | 25000                 | 17MB          | 2m17s               | 12s               | 4K                   | <1s                  |
| 2710927    | 3.0     | 100                   | 78K           | 1s                  | <1s               | 9K                   | <1s                  |

Note: /api/datasets/4554342/versions and /api/datasets/4554342/versions/draft produce the same amount of output because the former API call was made without authentication, so only the one published version was included.

Note: tests above were run on the dedicated IQSS test system (not in prod., which is a beefier and faster system).

I will add more info, and will include this in the pr as well.

@landreev (Contributor):

(to clarify, the results in the last update are with the extra "citation date" logic commented out from the code; I am working on addressing that)

@ErykKul (Contributor) commented Sep 1, 2023

I did something similar some time ago with the left-join hints in this PR: #9684
The issue I had opened, #9683, looks redundant with this one; sorry I had missed it.

landreev added commits that referenced this issue on Sep 6 and Sep 13, 2023
@pdurbin pdurbin added Type: Bug a defect User Role: API User Makes use of APIs labels Oct 9, 2023
landreev added a commit that referenced this issue Oct 11, 2023: "… filemetadatas retrieval method, not directly used in the PR). (#9763)"
pdurbin added a commit that referenced this issue Oct 13, 2023: "avoid conflict with V6.0.0.1__9599-guestbook-at-request.sql"
@pdurbin pdurbin added this to the 6.1 milestone Oct 18, 2023