
Performance issues as a side effect of the file lookup optimizations #9725

Closed
landreev opened this issue Jul 20, 2023 · 6 comments · Fixed by #9736

Comments

@landreev
Contributor

landreev commented Jul 20, 2023

Adding @ErykKul, since there may be some kinks that need to be ironed out after the optimization PR #9558 was merged.

The PR above significantly improved performance in many scenarios, but there appear to be some unintended consequences/side effects. As part of releasing v5.14, performance tests were run on some real-life IQSS production datasets, and we noticed that dataset page performance dropped for some of them, in a few cases catastrophically. The slowdown did not appear to be a direct function of the number of files in the dataset; rather, it appears to be related to the number of cascade-related objects. Specifically, the first thing I noticed was that there was now 1 extra query for every entry in the guestbookresponse table for every file in the dataset, i.e.:

SELECT DOWNLOADTIMESTAMP, DOWNLOADTYPE, GUESTBOOKRESPONSE_ID, SESSIONID FROM FILEDOWNLOAD WHERE (GUESTBOOKRESPONSE_ID = ...)

For one specific dataset that became particularly unusable, that meant 68K extra queries (and we definitely have datasets with many more downloads). This is the result of the 1:1 relationship between GuestbookResponse and FileDownload. It's not just GuestbookResponse, though. The optimization relies on multiple LEFT OUTER JOINs in the underlying query to fetch all the files in one pass. Here's the monstrous query issued:
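To make the cost concrete, here is a toy arithmetic sketch (plain Java, not Dataverse code; the helper names are made up) of what a 1:1 relation that is not join-fetched costs in query counts:

```java
// Toy illustration of the N+1 pattern described above: without a join-fetch,
// each GuestbookResponse row costs one extra SELECT against FILEDOWNLOAD
// to resolve its side of the 1:1 relationship.
public class NPlusOneDemo {

    // 1 query for the responses, plus 1 follow-up query per row.
    static long queriesWithoutJoinFetch(long responseRows) {
        return 1 + responseRows;
    }

    // A single LEFT OUTER JOIN fetches both tables in one pass.
    static long queriesWithJoinFetch(long responseRows) {
        return 1;
    }

    public static void main(String[] args) {
        // The dataset in this report had ~68K guestbookresponse entries:
        System.out.println(queriesWithoutJoinFetch(68_000)); // 68001
        System.out.println(queriesWithJoinFetch(68_000));    // 1
    }
}
```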

SELECT t2.ID, t2.DTYPE, t2.AUTHORITY, t2.CREATEDATE, t2.GLOBALIDCREATETIME, t2.IDENTIFIER, t2.IDENTIFIERREGISTERED, t2.INDEXTIME, t2.MODIFICATIONTIME, t2.PERMISSIONINDEXTIME, t2.PERMISSIONMODIFICATIONTIME, t2.PREVIEWIMAGEAVAILABLE, t2.PROTOCOL, t2.PUBLICATIONDATE, t2.STORAGEIDENTIFIER, t2.CREATOR_ID, t2.OWNER_ID, t2.RELEASEUSER_ID, t3.ID, t3.EXTERNALLABELSETNAME, t3.FILEACCESSREQUEST, t3.HARVESTIDENTIFIER, t3.LASTEXPORTTIME, t3.METADATALANGUAGE, t3.STORAGEDRIVER, t3.USEGENERICTHUMBNAIL, t3.citationDateDatasetFieldType_id, t3.harvestingClient_id, t3.template_id, t3.guestbook_id, t3.thumbnailfile_id, t0.ID, t0.DTYPE, t0.AUTHORITY, t0.CREATEDATE, t0.GLOBALIDCREATETIME, t0.IDENTIFIER, t0.IDENTIFIERREGISTERED, t0.INDEXTIME, t0.MODIFICATIONTIME, t0.PERMISSIONINDEXTIME, t0.PERMISSIONMODIFICATIONTIME, t0.PREVIEWIMAGEAVAILABLE, t0.PROTOCOL, t0.PUBLICATIONDATE, t0.STORAGEIDENTIFIER, t0.CREATOR_ID, t0.OWNER_ID, t0.RELEASEUSER_ID, t1.ID, t1.CHECKSUMTYPE, t1.CHECKSUMVALUE, t1.CONTENTTYPE, t1.FILESIZE, t1.INGESTSTATUS, t1.PREVIOUSDATAFILEID, t1.prov_entityname, t1.RESTRICTED, t1.ROOTDATAFILEID, t1.embargo_id, t4.ID, t4.CONTROLCARD, t4.FORCETYPECHECK, t4.LABELSFILE, t4.TEXTENCODING, t4.datafile_id, t5.ID, t5.DTYPE, t5.AUTHORITY, t5.CREATEDATE, t5.GLOBALIDCREATETIME, t5.IDENTIFIER, t5.IDENTIFIERREGISTERED, t5.INDEXTIME, t5.MODIFICATIONTIME, t5.PERMISSIONINDEXTIME, t5.PERMISSIONMODIFICATIONTIME, t5.PREVIEWIMAGEAVAILABLE, t5.PROTOCOL, t5.PUBLICATIONDATE, t5.STORAGEIDENTIFIER, t5.CREATOR_ID, t5.OWNER_ID, t5.RELEASEUSER_ID, t6.ID, t6.EXTERNALLABELSETNAME, t6.FILEACCESSREQUEST, t6.HARVESTIDENTIFIER, t6.LASTEXPORTTIME, t6.METADATALANGUAGE, t6.STORAGEDRIVER, t6.USEGENERICTHUMBNAIL, t6.citationDateDatasetFieldType_id, t6.harvestingClient_id, t6.template_id, t6.guestbook_id, t6.thumbnailfile_id, t7.ID, t7.CASEQUANTITY, t7.ORIGINALFILEFORMAT, t7.ORIGINALFILENAME, t7.ORIGINALFILESIZE, t7.ORIGINALFORMATVERSION, t7.RECORDSPERCASE, t7.UNF, t7.VARQUANTITY, t7.DATAFILE_ID, t8.ID, t8.CHECKSUM, 
t8.CONTENTTYPE, t8.FILESIZE, t8.FORMATTAG, t8.FORMATVERSION, t8.ISPUBLIC, t8.ORIGIN, t8.TYPE, t8.DATAFILE_ID, t9.ID, t9.ENDTIME, t9.REPORT, t9.STARTTIME, t9.STATUS, t9.TYPE, t9.DATAFILE_ID, t10.ID, t10.TYPE, t10.DATAFILE_ID, t11.ID, t11.DESCRIPTION, t11.DIRECTORYLABEL, t11.LABEL, t11.prov_freeform, t11.RESTRICTED, t11.VERSION, t11.DATAFILE_ID, t11.DATASETVERSION_ID, t12.ID, t12.NAME, t12.DATASET_ID, t13.ID, t13.EMAIL, t13.INSTITUTION, t13.NAME, t13.POSITION, t13.RESPONSETIME, t13.AUTHENTICATEDUSER_ID, t13.DATAFILE_ID, t13.DATASET_ID, t13.DATASETVERSION_ID, t13.GUESTBOOK_ID, t14.ID, t14.DATEAVAILABLE, t14.REASON, t15.creation_time, t15.authenticated_user_id, t15.datafile_id, t16.ID, t16.DTYPE, t16.AUTHORITY, t16.CREATEDATE, t16.GLOBALIDCREATETIME, t16.IDENTIFIER, t16.IDENTIFIERREGISTERED, t16.INDEXTIME, t16.MODIFICATIONTIME, t16.PERMISSIONINDEXTIME, t16.PERMISSIONMODIFICATIONTIME, t16.PREVIEWIMAGEAVAILABLE, t16.PROTOCOL, t16.PUBLICATIONDATE, t16.STORAGEIDENTIFIER, t16.CREATOR_ID, t16.OWNER_ID, t16.RELEASEUSER_ID, t17.ID, t17.CHECKSUMTYPE, t17.CHECKSUMVALUE, t17.CONTENTTYPE, t17.FILESIZE, t17.INGESTSTATUS, t17.PREVIOUSDATAFILEID, t17.prov_entityname, t17.RESTRICTED, t17.ROOTDATAFILEID, t17.embargo_id, t18.ID, t18.AFFILIATION, t18.ALIAS, t18.DATAVERSETYPE, t18.description, t18.EXTERNALLABELSETNAME, t18.FACETROOT, t18.FILEPIDSENABLED, t18.GUESTBOOKROOT, t18.METADATABLOCKFACETROOT, t18.METADATABLOCKROOT, t18.METADATALANGUAGE, t18.NAME, t18.PERMISSIONROOT, t18.STORAGEDRIVER, t18.TEMPLATEROOT, t18.THEMEROOT, t18.DEFAULTCONTRIBUTORROLE_ID, t18.DEFAULTTEMPLATE_ID, t19.ID, t19.EXTERNALLABELSETNAME, t19.FILEACCESSREQUEST, t19.HARVESTIDENTIFIER, t19.LASTEXPORTTIME, t19.METADATALANGUAGE, t19.STORAGEDRIVER, t19.USEGENERICTHUMBNAIL, t19.citationDateDatasetFieldType_id, t19.harvestingClient_id, t19.template_id, t19.guestbook_id, t19.thumbnailfile_id, t20.ID, t20.AFFILIATION, t20.CREATEDTIME, t20.DEACTIVATED, t20.DEACTIVATEDTIME, t20.EMAIL, t20.EMAILCONFIRMED, t20.FIRSTNAME, 
t20.LASTAPIUSETIME, t20.LASTLOGINTIME, t20.LASTNAME, t20.MUTEDEMAILS, t20.MUTEDNOTIFICATIONS, t20.POSITION, t20.SUPERUSER, t20.USERIDENTIFIER, t21.ID, t21.AFFILIATION, t21.CREATEDTIME, t21.DEACTIVATED, t21.DEACTIVATEDTIME, t21.EMAIL, t21.EMAILCONFIRMED, t21.FIRSTNAME, t21.LASTAPIUSETIME, t21.LASTLOGINTIME, t21.LASTNAME, t21.MUTEDEMAILS, t21.MUTEDNOTIFICATIONS, t21.POSITION, t21.SUPERUSER, t21.USERIDENTIFIER, t22.ID, t22.AUTHORITY, t22.GLOBALIDCREATETIME, t22.IDENTIFIER, t22.IDENTIFIERREGISTERED, t22.PROTOCOL, t22.STORAGELOCATIONDESIGNATOR, t22.DVOBJECT_ID, t23.ID, t23.ASSIGNEEIDENTIFIER, t23.PRIVATEURLANONYMIZEDACCESS, t23.PRIVATEURLTOKEN, t23.DEFINITIONPOINT_ID, t23.ROLE_ID FROM DVOBJECT t2 LEFT OUTER JOIN (DVOBJECT t0 JOIN DATAFILE t1 ON (t1.ID = t0.ID)) ON (t0.OWNER_ID = t2.ID) LEFT OUTER JOIN INGESTREQUEST t4 ON (t4.datafile_id = t0.ID) LEFT OUTER JOIN (DVOBJECT t5 JOIN DATASET t6 ON (t6.ID = t5.ID)) ON (t6.thumbnailfile_id = t0.ID) LEFT OUTER JOIN DATATABLE t7 ON (t7.DATAFILE_ID = t0.ID) LEFT OUTER JOIN AUXILIARYFILE t8 ON (t8.DATAFILE_ID = t0.ID) LEFT OUTER JOIN INGESTREPORT t9 ON (t9.DATAFILE_ID = t0.ID) LEFT OUTER JOIN DATAFILETAG t10 ON (t10.DATAFILE_ID = t0.ID) LEFT OUTER JOIN FILEMETADATA t11 ON (t11.DATAFILE_ID = t0.ID) LEFT OUTER JOIN (FILEMETADATA_DATAFILECATEGORY t24 JOIN DATAFILECATEGORY t12 ON (t12.ID = t24.fileCategories_ID)) ON (t24.fileMetadatas_ID = t11.ID) LEFT OUTER JOIN GUESTBOOKRESPONSE t13 ON (t13.DATAFILE_ID = t0.ID) LEFT OUTER JOIN EMBARGO t14 ON (t14.ID = t1.embargo_id) LEFT OUTER JOIN fileaccessrequests t15 ON (t15.datafile_id = t0.ID) LEFT OUTER JOIN (DVOBJECT t16 LEFT OUTER JOIN DATAFILE t17 ON (t17.ID = t16.ID) LEFT OUTER JOIN DATAVERSE t18 ON (t18.ID = t16.ID) LEFT OUTER JOIN DATASET t19 ON (t19.ID = t16.ID)) ON (t16.ID = t0.OWNER_ID) LEFT OUTER JOIN AUTHENTICATEDUSER t20 ON (t20.ID = t0.RELEASEUSER_ID) LEFT OUTER JOIN AUTHENTICATEDUSER t21 ON (t21.ID = t0.CREATOR_ID) LEFT OUTER JOIN ALTERNATIVEPERSISTENTIDENTIFIER t22 ON 
(t22.DVOBJECT_ID = t0.ID) LEFT OUTER JOIN ROLEASSIGNMENT t23 ON (t23.DEFINITIONPOINT_ID = t0.ID), DATASET t3 WHERE ((t2.ID = '57776') AND ((t3.ID = t2.ID) AND (t2.DTYPE = 'Dataset')))

Because of all these JOINs, for this specific dataset (2,200 files) the query takes 3.5 minutes when run on the command line with psql, and produces 2M rows and 9.5GB of output (that's text output, of course; it should take less space as in-memory objects, but is most likely still prohibitive for the application to handle). Dropping the guestbookresponse from the query (i.e., removing all the t13 entries from it) makes it run in just over a minute and produce 46K rows and 200MB of output. (This is probably the equivalent of dropping setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses") from the code? Not sure.) I haven't tested yet whether this makes the page work for the dataset.
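The 2M-row result set hints at why: fetching several independent to-many relations with LEFT OUTER JOINs in one statement multiplies their cardinalities per root row. A toy sketch of that fanout (illustrative sizes, not Dataverse code):

```java
// Toy sketch of the Cartesian-product effect of left-join-fetching several
// independent to-many relations in one query: the result set grows with the
// PRODUCT, not the sum, of the per-row collection sizes.
public class JoinFanout {

    // Rows the database returns for one root entity when each of the given
    // collections is fetched via its own LEFT OUTER JOIN.
    static long rowsPerRoot(long... collectionSizes) {
        long rows = 1;
        for (long size : collectionSizes) {
            rows *= Math.max(size, 1); // a LEFT JOIN still yields 1 row when empty
        }
        return rows;
    }

    public static void main(String[] args) {
        // e.g. one file with 3 guestbook responses, 2 filemetadatas and
        // 2 categories already fans out to 12 rows; multiplied across
        // thousands of files, millions of result rows follow.
        System.out.println(rowsPerRoot(3, 2, 2)); // 12
    }
}
```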

There are definitely many more repeated queries on many other tables; it was just the filedownload ones that were the most dramatic and jumped out at me right away.

@scolapasta scolapasta added this to the 5.14 milestone Jul 20, 2023
@scolapasta scolapasta moved this from ▶ SPRINT READY to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jul 20, 2023
@landreev
Contributor Author

To clarify, we have no reason to assume that the code underneath tries to keep the entire output of the lookup query in memory all at once (although the way that particular dataset crashed the application on our performance test system, at least once, was suspicious). I would assume that the code goes through the list of rows sequentially and builds the entities out of it as needed, which does not require caching it all. Still, going through 2 million rows and gigabytes of data in order to build a list of 2,200 files feels a bit excessive, even if it does not make the application run out of memory.

What makes this specific dataset even trickier is that it has 220 versions, with the 2,200 files spread between them. There are only 12 files in the last published version, and those are the only files that need to be looked up for the purposes of the dataset page load; but the new code goes to all the trouble above to look up all of them. This is an exotic case, but we definitely have more like it. And we are seeing performance degradation on datasets with less exotic structure too. The number of guestbookresponse entries appears to be the main factor in the cases observed so far, but I don't think it is safe to assume that no other type of object that cascades off of DataFile could cause a similar issue.

@landreev
Contributor Author

(documenting what was discussed locally on slack)
Removing setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses") from the findDeep() method does eliminate the extra FileDownload lookups, making it possible to load the dataset page for most of our prod. datasets. (We do have datasets with millions of downloads; with the hint above still in the code, findDeep() simply does not work on such datasets.)

(While it is of course possible to add a hint in that method for the FileDownload entities as well (setHint("eclipselink.left-join-fetch", "o.files.guestbookResponses.fileDownload")?), that would make the resulting query produce even more output for popular datasets; and I don't think we are going to be using findDeep() in any situation where we will actually need the guestbook responses for the files underneath, so better to just drop it, I think.)

Performance is still not very good on datasets with a more exotic version structure, like the one described in the previous comment (2,200 files in 220 versions): while the page loads in 3-4 seconds under 5.13 (there are only 12 files in the latest published version), it takes more than a minute under the current release candidate. We don't have many datasets like this one, but we do have them, so we need to figure out a way to address this. As an experiment, I'm trying to put together a similar findDeep() method, but for an individual version, with just the list of filemetadatas that belong to it, and modifying the page to use that instead, which may address this.
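As a rough illustration of that experiment, a version-scoped lookup might be shaped like the JPQL below. The entity and field names here are my assumptions for illustration, not the actual Dataverse query:

```java
// Hedged sketch of the per-version idea: a findDeep-style lookup scoped to a
// single DatasetVersion, so only that version's filemetadatas (12, in the
// exotic dataset above) get fetched instead of all 2,200 files.
public class VersionScopedLookup {

    static String findDeepByVersionJpql() {
        return "SELECT fm FROM FileMetadata fm "
             + "JOIN FETCH fm.dataFile "                    // still one pass for the files...
             + "WHERE fm.datasetVersion.id = :versionId";   // ...but only one version's worth
    }

    public static void main(String[] args) {
        System.out.println(findDeepByVersionJpql());
    }
}
```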

I am still not 100% sure that there are no other special/exotic cases where this single-query approach may become problematic ("too many" of something else in the tree structure of the dataset/version: too many fileAccessRequests? too many categories? etc.), so we need to think about this.

Also, we will retest more carefully, on more datasets, how the performance of the dataset page in its current state (with the GuestbookResponse lookup removed) compares to that of 5.13.

landreev added a commit that referenced this issue Jul 24, 2023
… "findDeep" on the requested version, instead of the full dataset. (Needs more testing). #9725
@landreev
Contributor Author

The approach I described above (making the dataset page use an optimization similar to the findDeep() method from #9558, but only for the version needed instead of the whole dataset) appears to be working, knock on wood. Performance on the "site killer" dataset with 220 versions is back to roughly what it is in 5.13: a few seconds instead of a minute-plus. We'll test more tomorrow, but I'm cautiously optimistic.

@landreev
Contributor Author

landreev commented Jul 25, 2023

In our IQSS production instance, this solution (retrieving all the files plus nested objects with one query with multiple left joins) appears to become somewhat counterproductive for datasets with roughly 10,000+ files; at that point the time it takes to execute that single query exceeds the time it takes to load the page in 5.13. We only have ~10 such datasets, and the degradation is less than a factor of 2 in the worst case. Considering that performance is better than under 5.13 for most of our datasets, we can probably live with that in a release.

As a longer-term solution, we will need to modify the page so that it does not initialize ALL the files in the version, only the page-worth of files that we are actually displaying. We will also want to create an API for this paging-style retrieval of files. We already have an issue scheduled for this, so we can start working on it immediately after 5.14. (#9763)
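A minimal sketch of the offset arithmetic such a paging-style retrieval would feed into a JPA query's setFirstResult()/setMaxResults() (illustrative names, not the planned implementation):

```java
// Sketch of the paging-style retrieval proposed above: the page would ask for
// one page of files at a time rather than initializing all of them. These are
// the values a JPA TypedQuery would receive via setFirstResult()/setMaxResults().
public class FilePaging {

    // Zero-based offset of the first file on a one-based page.
    static int firstResult(int page, int pageSize) {
        if (page < 1 || pageSize < 1) {
            throw new IllegalArgumentException("page and pageSize must be >= 1");
        }
        return (page - 1) * pageSize;
    }

    // Total number of pages, rounding up.
    static int pageCount(int totalFiles, int pageSize) {
        return (totalFiles + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        // e.g. a 42K-file dataset displayed 10 files per page:
        System.out.println(firstResult(3, 10));    // 20
        System.out.println(pageCount(42_000, 10)); // 4200
    }
}
```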

@landreev
Contributor Author

(The good news is that this back-end pagination is already being addressed in #9693.)

@landreev
Contributor Author

I said earlier:

In our IQSS production instance, this solution - retrieving all the files + nested objects with one query with multiple left joins - appears to become somewhat counter-productive with datasets with roughly 10,000+ files; at which point the time it takes to execute that single query exceeds the time it takes to load the page in 5.13

I may have finally gotten it to the point where the above is no longer true, mostly by further trimming the lookup query. The monstrous dataset we have in prod. that has 42K files now loads much faster than under 5.13.
