HSEARCH-2764 Improve orchestration of Elasticsearch works #1500

Closed
yrodiere wants to merge 20 commits into master from yrodiere:HSEARCH-2764


yrodiere commented Jul 28, 2017

This PR is based on #1491 (HSEARCH-2827 and HSEARCH-2828), which should be merged first.
We should also probably merge #1492 (HSEARCH-2830) before this one.
=> Done.

Relevant JIRA ticket: https://hibernate.atlassian.net/browse/HSEARCH-2764

So, that's a big one... Sorry about that, but once again most of the changes are linked. I tried to organize the commits in such a way that they are easier to understand.

Purpose

Performance improvements

First, if you're dreaming about dramatic performance improvements, forget about it. Currently, Hibernate Search spends most of its time waiting for Elasticsearch to process works. And I mean this literally: threads are usually parked or blocked on a monitor, waiting for an Elasticsearch response, more than 90% of the time.
So improving raw throughput in Hibernate Search, while possible, likely won't have much of an effect on overall throughput (when we take Elasticsearch into account).
In order to improve performance, I tried to be a bit smarter about what we ask Elasticsearch to do, and how we ask Elasticsearch to do it:

  • Avoid performing unnecessary works (like unnecessary refreshes)
  • Minimize the impact of network latency (minimize the number of round-trips between Hibernate Search and the Elasticsearch cluster)
  • Leverage the distributed nature of an Elasticsearch cluster (send more unrelated works in parallel)

The last one, in particular, is made much easier by embracing the reactive nature of the underlying HTTP client: by submitting a response handler along with each request, instead of submitting a request and waiting for it to finish, we can achieve a high level of parallelism with a small number of threads. That's what the commits introducing CompletableFuture everywhere are about: making Elasticsearch work processing reactive.
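As a rough illustration of what "reactive" means here (this is not the actual Hibernate Search code; the client interface and names below are made up for the example), the difference between blocking on each request and attaching a response handler looks like this:

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for whatever asynchronous HTTP client the backend uses.
interface AsyncElasticsearchClient {
    // Returns immediately; the future completes when the Elasticsearch response arrives.
    CompletableFuture<String> submit(String requestBody);
}

class ReactiveSubmissionSketch {

    // Blocking style: the calling thread is parked until Elasticsearch answers.
    static String blockingStyle(AsyncElasticsearchClient client, String work) {
        return client.submit(work).join();
    }

    // Reactive style: attach a handler and hand the future back to the caller,
    // so a small number of threads can keep many requests in flight at once.
    static CompletableFuture<Void> reactiveStyle(AsyncElasticsearchClient client, String work) {
        return client.submit(work)
                .thenAccept(response -> {
                    // Process the response here, without ever having blocked
                    // the thread that submitted the work.
                });
    }
}
```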

Technical improvements

Along the way, I had to tackle some technical issues, mainly because if I didn't, the performance tests would be skewed, or would simply crash. Both changes concern AsyncRequestProcessor (now BatchingSharedElasticsearchWorkOrchestrator).

  1. I discovered a race condition in which a thread waiting for async works to be finished would be released too early. See commit "HSEARCH-2764 Avoid a very rare race condition in async work processing ...".
  2. I replaced the MultiWriteDrainableLinkedList with a safer, bounded ArrayBlockingQueue. Because MultiWriteDrainableLinkedList is unbounded, we very quickly get an OutOfMemoryError when works are submitted faster than they are consumed. And this is, in fact, an issue that's been reported before: https://forum.hibernate.org/viewtopic.php?f=9&t=1044534 . Note we don't have this issue in the Lucene backend, because there we only use it for synchronous execution, while with Elasticsearch we use it for asynchronous execution, meaning the submitting threads are released after having submitted a work and can submit more work before the first one has even finished executing. See "HSEARCH-2764 Limit the size of Elasticsearch work queues" and the sketch below.
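Here is a minimal sketch of the bounded hand-off described in point 2 (class and method names are made up; the real orchestrator does much more bookkeeping):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a bounded work queue: a full queue makes submitters block,
// which is the back-pressure that prevents OutOfMemoryError.
class BoundedWorkQueueSketch<W> {

    private final BlockingQueue<W> queue = new ArrayBlockingQueue<>(1_000);

    // Called by the threads producing works; blocks while the queue is full.
    void submit(W work) throws InterruptedException {
        queue.put(work);
    }

    // Called by the processing thread: wait for at least one work,
    // then drain everything else that is already queued into one batch.
    List<W> drainNextBatch() throws InterruptedException {
        List<W> batch = new ArrayList<>();
        batch.add(queue.take());
        queue.drainTo(batch);
        return batch;
    }
}
```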

Functional improvements

A side effect of one of the changes in this PR is that synchronously-executed works from different threads, but relating to the same documents, are no longer executed out of order. That's because we now submit synchronous works for each index to a single queue, which is then processed serially.

Obviously it's not something that will improve performance per se, since we previously executed these works in parallel.
But on the other hand, this change allows us to skip the refresh=true parameter in bulk requests when refreshAfterWrite is disabled.
So, all in all... I figured we may as well give this change a try.

See commit "HSEARCH-2764 Make Elasticsearch non-stream orchestrators index-specific".
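To give a rough idea of what "index-specific" means here (the class below is a made-up sketch, not the actual implementation): each index gets its own serial orchestrator, so synchronous works targeting the same index, and therefore potentially the same documents, all go through the same queue.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: one serial orchestrator per index. "Orchestrator" is a
// placeholder type parameter, not the real Hibernate Search interface.
class PerIndexOrchestratorsSketch<Orchestrator> {

    private final Map<String, Orchestrator> orchestrators = new ConcurrentHashMap<>();
    private final Function<String, Orchestrator> factory;

    PerIndexOrchestratorsSketch(Function<String, Orchestrator> factory) {
        this.factory = factory;
    }

    Orchestrator orchestratorFor(String indexName) {
        // Lazily create exactly one serial orchestrator per index, so works for
        // that index are executed in submission order.
        return orchestrators.computeIfAbsent(indexName, factory);
    }
}
```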

Results

Complexity

The overall architecture of Elasticsearch work processing is arguably more complex after this patch. I think that's counterbalanced by more thorough testing and more clearly separated responsibilities, but I'll let you be the judge of that.

Performance tests results

By running the performance tests we merged previously (with a few improvements, see #1492) before and after this patch, I got some numbers: https://docs.google.com/spreadsheets/d/14pLNWIewrCNr_pj_rzMyq6UTMf7lf9VBHMjY8b_gapU/edit#gid=2063776958

The interesting sheet is "nonbuggy-2017-07-28-2". Here is how to read it:

  • column A is the client mode:
    • blackhole-5.4.0 means a stub Elasticsearch 5.4 server processing requests in almost-constant time
    • default means an actual Elasticsearch 5.3 server hosted on AWS (a c4.large instance)
  • column B is the benchmark:
    • Stream.write is about IndexManager.performStreamOperation
    • NonStream.concurrentReadWriteTest:readWriteTestWriter is about IndexManager.performOperation with concurrent search queries
    • NonStream.write is about IndexManager.performOperation without concurrent search queries
    • NonStream.queryBooksByBestRating is about querying only
  • column C is the worker execution mode (only relevant for non-stream writes): sync or async
  • column D is the value of refreshAfterWrite
  • columns E and F show the scores and errors for the original code (without this patch)
  • columns G and H show the scores and errors for the "reactive" code (with this patch)
  • columns I and J show the relative improvements with this patch. Color indicates improvements/decreases from the original build to the "reactive" build:
    • values in green are statistically significant improvements
    • values in red are statistically significant decreases
    • other values do not reveal any statistically significant difference
  • columns K to N show the scores, errors, and improvements when applying this patch and, on top of that, using parallel processing everywhere (even for non-stream work). This means we run the risk of conflicts between add/updates and deletes on the same document, so it's not an option, but it shows us the cost of serial processing.
  • columns O to R show the scores, errors, and improvements when applying this patch and, on top of that, using serial processing everywhere (even for stream work). It shows the benefit of parallel processing in that case.

Note that due to what's probably a bug in JMH, the auxiliary counters are skewed. So please don't use the :changeset and :add counters when analyzing the results.

Analysis of performance tests results

Inconclusive results

"query" method in concurrent tests

For concurrent benchmarks, the "query" threads and the "write" threads are competing for one resource: the Elasticsearch server in "default" mode, and the CPU in "black hole" mode.

Therefore, looking at the performance of the "query" benchmark is not very conclusive, because this performance will automatically degrade when write performance improves, and will automatically improve when write performance decreases.

Blackhole mode

In blackhole mode, we either have seemingly huge improvements or inconclusive results. I'd argue that the huge improvements are not conclusive either, because they mostly result from the last commit in this PR, where we remove the artificial 100ms delay between two async processing batches.

I think the only conclusion to draw from blackhole tests is that, be it before or after this PR, the internal processing in Hibernate Search is faster than the actual indexing by Elasticsearch by several orders of magnitude. Thus we shouldn't bother ourselves too much about raw code performance for now, at least not if our goal is to increase throughput.

Actual Elasticsearch server

In real-world conditions (with an actual Elasticsearch server), the numbers are more conclusive, and quite encouraging.

My comments below only look at the "worst case" improvements, i.e. the difference between (score before + error) and (score after - error); thus the improvements are probably better than what I mention here.

  • query throughput seems to have decreased. I cannot explain why exactly, since I barely even touched query code. Note that the numbers are quite different depending on the worker execution mode and the value of refreshAfterWrite, which is a bit strange since we don't perform any write during these tests. I would be inclined to attribute the changes to error margins, but we should probably work a bit on query performance in another ticket.
  • non-stream throughput with refreshAfterWrite=false, be it sync or async, improved massively (+20% to +240%), probably because we no longer do unnecessary refreshes in bulk API calls.
  • non-stream async throughput with refreshAfterWrite=true decreased, though only in a non-concurrent setup. This is likely a result of more frequent refreshes: while we used to do a refresh for each batch of works, we now do a refresh for every bulk API call (in this setup, 1 batch = at most 2500 changesets = at most 30000 works = approximately 250 times fewer bulk API calls, i.e. at most 120). We could optimize it, but I would say "we don't care", since this setup is particularly pointless.
  • non-stream sync throughput with refreshAfterWrite=true increased in the non-concurrent setup, but decreased in the concurrent setup. I don't think it's very important, because refreshAfterWrite=true will ruin performance anyway, but if I had to explain why, I would say:
    • the increase is caused by the fact we now bulk together requests from different threads, leading to lower average latency when the Elasticsearch server is the bottleneck
    • the decrease is probably caused by the fact we now (correctly) execute sync works serially, which must reduce the throughput when write threads compete with read threads, somehow.
  • streamed write throughput improved by about 20%, probably because of parallel work processing. This is confirmed by the fact that the Elasticsearch server showed significantly higher CPU usage during those tests, as shown on a CPU graph taken during the runs (the "original" run is on the left, the "patched" run on the right and slightly shorter).

Conclusion

Performance improvements seem to be here, at least in the important cases (stream, and non-stream with refreshAfterWrite=false).
Let's merge it, and work on the sore points in the scope of another ticket if you think it's necessary?

yrodiere added some commits Jun 30, 2017

HSEARCH-2764 Only include request/response in ElasticsearchWork wrapping exceptions

This will make it much easier to handle exceptions in CompletableFutures
in the next commits.

Granted, this makes the exception traces longer, but to be fair it only
*adds* to the traces, so users won't have to scroll more unless they
want to know more.
HSEARCH-2764 Allow null roots when simply getting a value from Json
We already return an Optional, so callers already have to deal with
missing values. Better return a missing value than throw an
AssertionFailure...

This will make error handling for bulk works easier, since we won't have
to care about null responses anymore (which may happen in particular in
unit tests, when mocks are incorrectly configured).
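As a rough illustration of the behavior change (using Gson types, but with a made-up accessor; the actual accessor classes are more elaborate): a null root simply yields an empty Optional instead of an AssertionFailure.

```java
import java.util.Optional;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;

// Made-up accessor illustrating the change: callers already deal with Optional,
// so a null root (e.g. a mock that was never given a response) is just "missing".
final class JsonValueSketch {

    static Optional<JsonElement> get(JsonObject root, String propertyName) {
        if (root == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(root.get(propertyName));
    }
}
```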
HSEARCH-2764 Use CompletableFuture in ElasticsearchWorks
This is the first step toward making asynchronous requests more
"reactive".
HSEARCH-2764 Use CompletableFuture for orchestration in ElasticsearchWorkProcessor

But still execute works sequentially for now.
HSEARCH-2764 Introduce ElasticsearchWorkBulker and ElasticsearchWorkSequenceBuilder

Those two classes encapsulate the logic of bulking and building a
sequence of works, making it easier to orchestrate works in many
different ways (see the following commits).

Also, compared to the previous way of executing works, this fixes the
following issues:

 1. In async mode, a failure will now only affect the changeset of the
    failing work, subsequent changesets will execute normally.
    And (that's the hard part) bulks can still span multiple changesets:
    each changeset will only be affected by failures from its own
    bulked works (see the sketch after this list).
 2. The stack traces of failures in bulked works are now much more
    similar to failures in non-bulked works.
 3. That's just a side-effect, but bulked works can now return a result,
    though for now the result is ignored. This mainly means that
    if one day we need to inspect the result of bulked works
    (for statistics, in particular), it will be that much easier.
 4. We now have thorough unit tests for work bulking and sequencing.
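To make point 1 more concrete, here is a heavily simplified sketch (all names are made up) of how a per-changeset future can be derived from the future of a shared bulk request, so that a failed bulk item only fails the changeset it belongs to:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Simplified sketch of per-changeset failure isolation within a shared bulk.
final class BulkSequencingSketch {

    // Placeholder for the parsed result of a whole bulk request.
    interface BulkResult {
        boolean itemSucceeded(int index);
    }

    static CompletableFuture<Void> changesetFuture(
            CompletableFuture<BulkResult> bulkFuture, List<Integer> itemIndexesOfThisChangeset) {
        return bulkFuture.thenAccept(result -> {
            for (int index : itemIndexesOfThisChangeset) {
                if (!result.itemSucceeded(index)) {
                    // Only this changeset's future completes exceptionally; other
                    // changesets sharing the same bulk complete normally.
                    throw new IllegalStateException("Work at bulk index " + index + " failed");
                }
            }
        });
    }
}
```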
HSEARCH-2764 Introduce ElasticsearchWorkOrchestrator to better manage inter-work dependency when running asynchronously

This will allow more flexibility in work orchestration in the following
commits.
HSEARCH-2764 Avoid a very rare race condition in async work processing
If, between the end of the processing loop and the call to
processingScheduled.set( false ) at the end of processing, another
thread somehow managed to submit a changeset and call awaitCompletion(),
then this thread ended up not waiting for its changeset, but only for
the previous ones.

This commit fixes the issue by avoiding the use of multiple instances of
CountDownLatch, and instead relying on Phaser so that we can safely
change what waiting threads are waiting for (i.e. we can just say
"oh sorry, you were waiting for the previous runnable, but another one
needs to be run before I let you go"). Also part of the solution is
systematically checking whether a new processing runnable must be
scheduled before arriving at the phaser.
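For reference, a heavily simplified sketch of the Phaser idea described above (not the actual code; it ignores the case where nothing is pending at all): the processor only advances the phase when no further processing run is scheduled, so waiters can be made to wait for a run that was scheduled after they started waiting, which a one-shot CountDownLatch cannot do.

```java
import java.util.concurrent.Phaser;

// Simplified sketch of the Phaser-based waiting: the "finish line" can be moved.
final class PhaserWaitSketch {

    // One registered party: the work processor.
    private final Phaser phaser = new Phaser(1);

    // Called by the processor at the end of a processing run.
    void onProcessingRunFinished(boolean anotherRunScheduled) {
        if (!anotherRunScheduled) {
            // Nothing else pending: advance the phase and release the waiters.
            phaser.arrive();
        }
        // Otherwise, do not arrive yet: current waiters will also wait for the next run.
    }

    // Called by threads that submitted works and want to wait for their completion.
    void awaitCompletion() {
        // Wait until the phase observed now has advanced.
        phaser.awaitAdvance(phaser.getPhase());
    }
}
```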
HSEARCH-2764 Add a parallel orchestrator for streamed work
This could lead to better performance with large Elasticsearch
connection pools when works affect multiple indexes.
HSEARCH-2764 Don't refresh indexes for streamed Elasticsearch works
Those works are executed out of order anyway, and the only way for
the client to be sure they've been executed is to perform a flush (which
is followed by a refresh), so there's no point trying to refresh for
every single work.
HSEARCH-2764 Set the minimum bulk size to 1 for stream work orchestration
For stream works, we only submit single-work changesets, which means
the decision on whether to bulk the work or not will always happen
immediately after each work, when we only have one work to bulk.
Thus if we set the minimum to a value higher than 1, we would always
decide not to start a bulk (because there would always be only one
work to bulk), which would result in terrible performance.
HSEARCH-2764 Allow synchronous works from different threads to be bulked together

We still don't fix the issue of works being executed out of order,
because that's not our concern in this commit.
Ultimately we may want to have one shared, serial orchestrator per index
manager.
HSEARCH-2764 Make Elasticsearch non-stream orchestrators index-specific
The downside is we may not be able to bulk as much as we used to, but
there are a few advantages too:

1. We're finally able to force executing synchronous works in order
(by using one serial orchestrator per index). Note that this may impact
performance negatively, but at least we'll avoid some errors.
2. We can finally disable the 'refresh' in bulk API calls when
'refreshAfterWrite' is disabled for the index. Previously we couldn't,
because this parameter can take a different value for each index
manager.
HSEARCH-2764 Limit the size of Elasticsearch work queues
If we don't, we run the risk of OutOfMemoryErrors when a huge stream
of works is pushed continuously to the index manager (for instance,
when mass indexing).
HSEARCH-2764 Remove the delay in BatchingSharedElasticsearchWorkOrchestrator

There's no need for such a delay:

 * if works are submitted more slowly than they are processed, then
there's no need to try doing more bulking (especially if it means adding
an artificial delay)
 * if works are submitted faster than they are processed, then the queue
should progressively fill up, we'll start doing bulking, and we'll end
up ignoring the delay anyway.
@Sanne

More comments (possibly) coming. Just flushing some ;)

Sanne commented Aug 3, 2017

First, if you're dreaming about dramatic performance improvements, forget about it. Currently, Hibernate Search spends most of its time waiting for Elasticsearch to process works.

I see you might be a bit disappointed because of the figures you got, but this work might actually provide some dramatic improvements for people who have an higher-end configured ES server, or just a better tuned local network link.

In order to improve performance, I tried to be a bit smarter about what we ask Elasticsearch to do

See, that's great and exactly what people hope to get as "free lunch" when using a library like this.

The figures we get here are just not representative of what someone else will get from it - unless they use the same cheap AWS configuration. Still that doesn't make our figures irrelevant, they are an essential factor of this whole exercise which helps us to understand and acquire the knowledge to figure out what "smarter about what we ask Elasticsearch to do" requires... today and in the future.

Sanne commented Aug 9, 2017

merged!

Sanne closed this Aug 9, 2017

yrodiere deleted the yrodiere:HSEARCH-2764 branch Jan 12, 2018
