
semaphore for async indexing and sync index in transaction after publish #10388

Merged
merged 14 commits into IQSS:develop from ErykKul:10381_index_after_publish on Mar 27, 2024

Conversation

ErykKul
Collaborator

@ErykKul ErykKul commented Mar 19, 2024

What this PR does / why we need it:
Hopefully this fixes the indexing issues after publishing of datasets and lowers the risk of Solr getting overwhelmed.

Which issue(s) this PR closes:

Closes #10381

Special notes for your reviewer:
I do not have much experience with Java semaphores yet, although I have used them extensively in other languages, so we need to be sure this works as intended before merging. I tried to balance this solution so as not to move the problem to having too many threads sleeping and waiting on the semaphore instead of relying on Solr to cope with the load. The new parameter for the maximum number of concurrently running async indexing jobs should then be tuned for Harvard Dataverse; I think (as a guesstimate) 4 is OK for most instances, but Harvard might want a bigger number.
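
For illustration, a minimal sketch of the throttling pattern this relies on (not the actual Dataverse code; the class and method names below are made up for the example): a Semaphore with a fixed number of permits caps how many indexing jobs hit Solr at the same time, while the remaining jobs block until a permit frees up.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class ThrottledIndexer {

    // Plays the role of the new "max running async indexes" setting; 4 is the suggested default.
    private static final int MAX_CONCURRENT_INDEXES = 4;

    private final Semaphore permits = new Semaphore(MAX_CONCURRENT_INDEXES);
    private final ExecutorService pool = Executors.newCachedThreadPool();

    public void indexAsync(long datasetId) {
        pool.submit(() -> {
            try {
                // Blocks while MAX_CONCURRENT_INDEXES jobs are already running.
                permits.acquire();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            try {
                doIndex(datasetId); // stand-in for the real Solr indexing call
            } finally {
                permits.release();
            }
        });
    }

    private void doIndex(long datasetId) {
        System.out.println("Indexing dataset " + datasetId);
    }
}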

Suggestions on how to test this:
Ideally, under realistic load in an environment similar to production: publish and edit many datasets concurrently. Also, edit some collections/dataverses, as that appears to be the worst-case scenario: after editing a large collection, all of its datasets get reindexed asynchronously, causing a large load on the server and Solr. This might be the root cause of the problems on Harvard Dataverse.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.

Is there a release notes update needed for this change?:
Yes, it includes a new property.

Additional documentation:
Needs testing. I hope it will pass the unit and integration tests without problems. I will also try testing it on my local installation. Any help with testing it in a realistic setting is welcome.

@coveralls

coveralls commented Mar 19, 2024

Coverage Status

coverage: 20.736% (+0.08%) from 20.659%
when pulling 1479403 on ErykKul:10381_index_after_publish
into 30666f9 on IQSS:develop.

Member

@pdurbin pdurbin left a comment


Just a quick doc review.

Review comment on doc/sphinx-guides/source/installation/config.rst (outdated, resolved)
@scolapasta scolapasta added this to the 6.2 milestone Mar 20, 2024
Member

@qqmyers qqmyers left a comment


Looks good. I guess the proof is in the QA to see how well this handles the issue.

@poikilotherm
Contributor

poikilotherm commented Mar 20, 2024

thought: Do we need a counter, exposed as a metric, to see if there are threads piling up waiting for their turn? It's hard to get a feeling for the right number here. Ideally you want to see in your monitoring system a correlation between the setting, the number of threads waiting, and the loadavg on Solr...

MicroProfile Metrics to the rescue here, if we want such a metric done easily.
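
As a rough sketch of how easy that could be with plain MicroProfile Metrics annotations (the bean and metric names here are illustrative, not existing Dataverse code):

import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.metrics.annotation.Counted;
import org.eclipse.microprofile.metrics.annotation.Timed;

@ApplicationScoped
public class IndexThrottleMetrics {

    // Counts how often an indexing request had to queue up for a semaphore permit.
    @Counted(name = "index_permit_waits", absolute = true)
    public void permitWaitStarted() {
        // the annotation does the counting; nothing else needed here
    }

    // Times the actual indexing run; exposed on the MicroProfile /metrics endpoint.
    @Timed(name = "index_duration", absolute = true)
    public void indexDataset(long datasetId) {
        // placeholder for the real indexing work
    }
}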

@landreev
Contributor

Sorry, what is the status of this PR? I see that it's been approved, but it is still sitting in the "Ready for Review" column, and it's showing the red "Merging is blocked" and "Changes Requested" banners. The latter appears to be pointing to the change @poikilotherm requested the other day, which appears to have been addressed... Oliver, is it just waiting for you to approve the change? (Or do you actually want more to be done, in response to the question about threads?)
This is somehow more confusing than it should be. 🙂 ... please drag it into "Ready for QA" one way or another.

@pdurbin
Member

pdurbin commented Mar 21, 2024

Huh. Jim did approve the PR but maybe it didn't automatically move to "ready for QA" because Oliver requested changes?

It looks like we can dismiss Oliver's request for changes, if necessary:

[Screenshot, 2024-03-20: the review UI showing the option to dismiss the requested changes]

Maybe that would unlock the merge button? Not sure.

@landreev landreev self-assigned this Mar 21, 2024
@landreev
Contributor

Thanks all, into QA it goes.

    ctxt.index().indexDataset(readyDataset, true);
} catch (SolrServerException | IOException e) {
    throw new CommandException("Indexing failed: " + e.getMessage(), this);
}
Contributor


I am trying to understand what's going on here. I understand why we may want to change indexing from asynchronous to synchronous. But why are we moving the indexing from onSuccess() and back into the execute() method?
Is this just to address the original issue with timestamps (#10381)?
Are we positive we want to do this? If I understand correctly, this would bring us back to a failure to index resulting in a failure to publish, which I thought we specifically wanted to avoid.
I may easily be missing something of course.

Collaborator Author


@landreev @scolapasta I do not fully understand the issue myself. Moving the indexing back to execute makes the indexing happen earlier, not later, and making it synchronous instead of async also makes it happen earlier. If the issue is that the indexing did happen, but its timestamp is before the publish moment, this PR would make the problem worse, not better. I just do not see yet why the publish date would be greater than the index moment, unless the indexing was interrupted (e.g., by a circuit breaker) or the wrong version of the dataset was indexed (e.g., the draft and not the published version). I will go through the code again to see if I am missing something, and whether maybe the wrong version gets indexed, etc. As it turns out, we just migrated to version 6.1 (from 5.14) and a recently published dataset has the same problem, so I think the core cause is still not addressed. I will keep digging.

Collaborator Author


@landreev @scolapasta I did not see any errors on Solr, no failure log in Payara, etc. Also, the index looks fine, with the right version indexed. I assume that indexing went fine after publish, only the "indextime" was not updated in the DB. I reverted to the async update in finalize-after-publish, and added em.flush() after the indextime update, which should fix the problem.
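
Roughly, the change described here amounts to something like the following (a hedged sketch; the JPA calls are standard, but the surrounding method and entity names only mirror the indextime update and are not quoted from the diff):

import java.sql.Timestamp;
import java.util.Date;
import jakarta.persistence.EntityManager;

// ...
void updateIndexTime(EntityManager em, Dataset dataset) {
    dataset.setIndexTime(new Timestamp(new Date().getTime()));
    dataset = em.merge(dataset);
    // Force the pending UPDATE to be written to the database now,
    // rather than waiting for the transaction to commit.
    em.flush();
}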

Collaborator Author


I'm not sure if em.flush changes anything; from what I understand, the change should have been written anyway, so I think I am still missing something. In the example on our server, the publishing took 20+ hours due to the checksum checks on 20+ TiB. I got an e-mail that the dataset was published the next day after publishing started. From the code, the indexing should happen directly after sending the notifications. I would assume that it either succeeded but the indextime did not update, or it failed, but then I should see it in the process-failures log, which is empty.

Collaborator Author


Now I think that the index was done and the correct indextime was written, but it then got overwritten by the merge in:

try {
    ExportService instance = ExportService.getInstance();
    instance.exportAllFormats(dataset);
    dataset = ctxt.datasets().merge(dataset);
}

I moved the indexing to the bottom; I am more confident now that it is fixed.
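
In other words, the fix amounts to an ordering like the sketch below (based on the snippet above; the catch clause and the exact indexing call are not shown in this thread, so treat them as assumptions):

try {
    ExportService instance = ExportService.getInstance();
    instance.exportAllFormats(dataset);
    // This merge can overwrite an indextime that was written to a different managed copy earlier.
    dataset = ctxt.datasets().merge(dataset);
} catch (Exception ex) {
    // (original error handling omitted here)
}
// Index last, after the final merge, so the indextime recorded by the indexer is not clobbered.
ctxt.index().asyncIndexDataset(dataset, true);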

Collaborator Author


It looks to me that it is safe to merge the current code of this PR and leave it like this until we see how it behaves in production. I saved the link to this issue as it contains some interesting discussion, especially on running batches and metrics; we can revisit these ideas at some point. As for adding the sleep calls, I am against that, since it would make the queue holding indexing jobs longer than necessary. I like the idea of adding the index logic to the timer for fixing exports, but I agree it is out of scope for this PR. We can pick it up later, if needed.

Contributor


I just asked a question in dv-tech Slack about simplifying the whole scheme somewhat radically (but probably not in the context of this PR either).
I'm generally in favor of merging this PR as is (I just want to run a couple more experiments with multiple indexing attempts in parallel).
I would, however, consider including one small extra change: adding a few extra hours to the index timestamp in the test in the isIndexedVersion() method in DatasetPage.java:

return (workingVersion.getDataset().getIndexTime() != null) &&
        workingVersion.getDataset().getIndexTime().after(workingVersion.getReleaseTime());

i.e., something like

(dataset.getIndexTime().getTime() + (a few hours worth of milliseconds)) > version.getReleaseTime().getTime()

instead?
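
As a sketch of that suggestion (using the three-hour window that was later adopted in this PR; the method signature here is simplified compared to the real DatasetPage.java):

private static final long INDEX_GRACE_PERIOD_MS = 3L * 60 * 60 * 1000; // 3 hours

private boolean isIndexedVersion(DatasetVersion workingVersion) {
    Date indexTime = workingVersion.getDataset().getIndexTime();
    Date releaseTime = workingVersion.getReleaseTime();
    // Treat the version as indexed if the index timestamp falls within the grace window
    // of the release timestamp, rather than requiring it to be strictly later.
    return indexTime != null && releaseTime != null
            && indexTime.getTime() + INDEX_GRACE_PERIOD_MS > releaseTime.getTime();
}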

Contributor


I have looked at the differences, but nothing struck me as suspicious.

In solrconfig.xml some request handlers are disabled, but I do not think these were used anyway. Those are the /browse, /spell, /update/extract, /tvhr and /elevate endpoints.

In schema.xml I did find it strange that the schema version in 6.1 is lower than that of 5.14, but I do not know what the impact of that version number is; it could be just informative. The other differences are replacing the now-obsolete solr.TrieXXXField with solr.XXXPointField, as one would expect for Solr 9. There are also some changes to the dynamic field definitions, but AFAIK Dataverse does not use dynamic field types in Solr.

Contributor


Could it be the move from Solr 8 to 9 itself that changed something dramatically?
From a brief look, I'm not seeing any information online suggesting 9 is dramatically slower than 8 in general... But it could be something Dataverse specifically relies on when indexing that is behaving differently.

Collaborator Author


I moved the indexed time in the isIndexedVersion test forward by 3 hours as suggested. This should fix the problem for already-published datasets with an indextime before the releasetime.

@poikilotherm
Contributor

poikilotherm commented Mar 25, 2024

May I suggest taking a look at ErykKul#10 (a PR to this PR), which adds timers? It's a very simple change, but it will provide a better way to understand whether there actually are any load issues. The timers:

- Measure wait times for a permit, i.e. how many indexing requests are currently stuck because of high demand (so you can tune the number of available parallel indexing threads)

- Measure how long the actual indexing task runs, which might be an indication of lots of very large datasets or not enough resources for your index

These might be used in production as well as for benchmarking, as it is a common "standard" for metrics exposure.

@ErykKul
Collaborator Author

ErykKul commented Mar 25, 2024

@poikilotherm Thanks! I have just merged your code, it looks great! I used Prometheus with Grafana in the past; it would be cool to have something like that. It would help with finding other issues as well.

@poikilotherm
Contributor

> @poikilotherm Thanks! I have just merged your code, it looks great! I used Prometheus with Grafana in the past; it would be cool to have something like that. It would help with finding other issues as well.

Splendid! Thanks for merging! Two more questions:

  1. Should we add some more details to the docs (guides/release notes) about the availability of these metrics?
  2. Are there other metrics we should add now, as a part of this PR? Like histograms of sizes, or X/Y data of size vs. time, etc.?

@ErykKul
Collaborator Author

ErykKul commented Mar 25, 2024

@poikilotherm A histogram would be more useful than raw data. Also, documentation would be great! Do we have any on how to access metrics? If it is a new thing, it might be out of scope for this PR; I think it deserves a new section in the documentation specifically on metrics. We should also add more metrics in other places.

@poikilotherm
Contributor

There is doc/sphinx-guides/source/admin/monitoring. We could add a few words there that can be reformatted / moved / extended later on.

@landreev
Contributor

I started testing the effects of the semaphores on our test instance, running off a clone of the production database, with somewhat inconclusive results so far. But I'm only doing it in parallel with having been press-ganged into working on other production and admin tasks here.

@ErykKul
Collaborator Author

ErykKul commented Mar 27, 2024

I added a mention in doc/sphinx-guides/source/admin/monitoring.rst about the newly added metrics. I am not sure if it is sufficient, and I think it will need some cleaning up later on.

@landreev
Contributor

The guide entry for the metrics is perfect.
I'm going to merge the PR now. Will share some observations from running tests on a large database when I have a moment. But definitely looking forward to seeing the effects, and the metrics in real production use.
Thank you again, @ErykKul and @poikilotherm for this effort!

@landreev landreev merged commit d5bb0c2 into IQSS:develop Mar 27, 2024
11 of 12 checks passed
@landreev landreev removed their assignment Mar 27, 2024