Allow usage of publishOn scheduler for reactive signal emission #905

Closed
mp911de opened this issue Oct 26, 2018 · 23 comments
Labels: type: enhancement (A general enhancement)
Milestone: 5.1.4

Comments

mp911de commented Oct 26, 2018

Motivated by https://stackoverflow.com/questions/53001283/reactive-redis-lettuce-always-publishing-to-single-thread, we should investigate (benchmark) thread switching for response processing.

In a reactive/async scenario where responses are processed on an I/O thread, congestion may occur. We should create various scenarios to measure the impact and determine how to address it.

@mp911de mp911de added for: team-attention An issue we need to discuss as a team to make progress type: task A general task labels Oct 26, 2018
jmnght commented Oct 31, 2018

@mp911de I'm seeing a similar issue with Reactor Netty + WebFlux + Reactor + Lettuce. I was stress testing my application, and if I send 100 requests to my web server, the 4 Netty threads are busy processing (essentially calling subscribe on) all of the requests. The single Lettuce thread responsible for fetching data from Redis doesn't really get its turn until the 4 Netty threads have subscribed to almost all of the 100 requests. The Lettuce thread gets to process a few times during this window, but once the flood of requests is fully subscribed, Lettuce picks up speed. It seems unfair in the sense that if requests keep coming, there is a significant impact on request processing times. In a real production environment, this is going to create a huge delay in processing.

mp911de commented Nov 7, 2018

First test results:

Synchronous/Spring Boot/Tomcat

$ wrk -c 4 -d 10 -t 4 http://localhost:8080  
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   706.04us  410.71us  10.01ms   96.65%
    Req/Sec     1.47k   221.33     1.61k    94.06%
  59167 requests in 10.10s, 10.05MB read
Requests/sec:   5856.68
Transfer/sec:      1.00MB

$ wrk -c 400 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    42.26ms   12.08ms  68.05ms   71.66%
    Req/Sec     2.01k   219.85     2.69k    90.00%
  80186 requests in 10.02s, 13.62MB read
  Socket errors: connect 0, read 371, write 0, timeout 0
Requests/sec:   8005.37
Transfer/sec:      1.36MB

Reactive/Spring Boot/WebFlux

$ wrk -c 4 -d 10 -t 4 http://localhost:8080  
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   772.73us  408.56us  12.88ms   95.66%
    Req/Sec     1.34k   225.01     1.60k    86.88%
  53713 requests in 10.10s, 6.35MB read
Requests/sec:   5317.69
Transfer/sec:    643.96KB

$ wrk -c 400 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    53.91ms   15.57ms 125.04ms   69.35%
    Req/Sec     1.59k   300.71     2.02k    63.00%
  63762 requests in 10.07s, 7.54MB read
  Socket errors: connect 0, read 384, write 0, timeout 0
Requests/sec:   6332.87
Transfer/sec:    766.87KB

Application under test: https://gist.github.com/mp911de/ba9c46fe8e8bbe4558ca445e996f108a

Conclusion: I cannot reproduce a significant difference between reactive and synchronous processing times. Higher load (400 concurrent requests) results in a difference of about 10 ms.

mp911de commented Nov 7, 2018

Please provide a reproducer, otherwise we cannot trace the issue.

@mp911de mp911de added status: waiting-for-feedback We need additional information before we can continue status: blocked An issue that is blocked on an external project change labels Nov 7, 2018
jmnght commented Nov 16, 2018

As you asked, I have created a sample project to show the Lettuce performance issue, based on your demo application:
https://github.com/jmnght/lettucetest
It uses embedded Redis and exposes two endpoints. The first GET just returns a list of items; the second gets the list of items from Redis and returns it.
When I run JMeter, the application handles around 4000 requests/sec when not using the /getFromRedis endpoint. When getting from Redis, it drops down to ~100 req/sec.
@mp911de Can you take a look and provide some feedback?

jmnght commented Nov 16, 2018

$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    20.11ms   84.63ms 615.90ms   95.13%
    Req/Sec   271.16    247.91     0.90k    56.36%
  1952 requests in 10.10s, 352.75KB read
  Socket errors: connect 0, read 0, write 0, timeout 12
Requests/sec:    193.21
Transfer/sec:     34.92KB

$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   529.69us    1.03ms  32.24ms   96.46%
    Req/Sec     2.44k   382.23     3.19k    75.00%
  98242 requests in 10.10s, 17.33MB read
Requests/sec:   9723.83
Transfer/sec:      1.72MB

mp911de commented Nov 19, 2018

Here are my test results with the code repo you provided:

$ wrk -c 4 -d 10 -t 4 http://localhost:8080
Running 10s test @ http://localhost:8080
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.11ms    8.28ms 130.24ms   98.83%
    Req/Sec     3.51k   527.91     3.89k    95.00%
  139963 requests in 10.10s, 33.77MB read
Requests/sec:  13857.60
Transfer/sec:      3.34MB

$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.39ms    2.18ms  44.80ms   98.14%
    Req/Sec   848.70    131.12     1.13k    71.50%
  33810 requests in 10.01s, 8.16MB read
Requests/sec:   3377.87
Transfer/sec:    834.60KB

After removing the System.out calls:

$ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.47ms    8.46ms 131.72ms   98.59%
    Req/Sec     1.78k   317.14     2.05k    92.68%
  70206 requests in 10.01s, 16.94MB read
Requests/sec:   7016.71
Transfer/sec:      1.69MB

These numbers seem pretty reasonable to me: 7000 operations per second on a single machine. Not sure why you see a slowdown to 190 operations per second. After I removed the System.out calls, throughput almost doubled. It's not healthy to use System.out on a hot path, since all output goes through PrintStream, which is synchronized.
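For context, PrintStream.println takes a lock on every call, so any per-response logging funnels all threads through that lock. An illustrative peek like the following (not the exact code from the sample project; the commands handle and key are placeholders) is enough to create that contention:

// Illustrative only: a per-response System.out call makes all threads contend on
// PrintStream's internal lock and throttles the whole pipeline.
commands.hgetall("master-key")                                                    // Mono<Map<String, String>> in Lettuce 5.1
    .doOnNext(map -> System.out.println("loaded " + map.size() + " fields"))      // synchronized PrintStream on the hot path
    .subscribe();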

jmnght commented Nov 19, 2018

@mp911de What machine are you running it on? The most I'm getting is:

➜  bin wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    11.09ms   17.88ms 214.67ms   97.28%
    Req/Sec   115.18     30.45   170.00     63.38%
  4584 requests in 10.06s, 1.11MB read
Requests/sec:    455.53
Transfer/sec:    112.60KB
➜  bin wrk -c 4 -d 10 -t 4 http://localhost:8080/
Running 10s test @ http://localhost:8080/
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.03ms    2.82ms  56.02ms   95.63%
    Req/Sec     1.93k   474.56     2.73k    76.75%
  76931 requests in 10.01s, 18.56MB read
Requests/sec:   7686.60
Transfer/sec:      1.85MB

mp911de commented Nov 20, 2018

Just a regular MacBook (2.8 GHz Intel Core i7), nothing crazy.
Do you have a profiler to measure what's causing the slowdown?

jmnght commented Nov 26, 2018

@mp911de I'm running on a 3.1 GHz i7 MacBook, dual core. Not sure what was causing the slowdown, but redoing the test after restarting the laptop gave me ~4900 requests/sec:

➜  ~ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   816.76us  376.35us  15.16ms   97.18%
    Req/Sec     1.25k   106.36     1.51k    79.70%
  50095 requests in 10.10s, 12.09MB read
Requests/sec:   4959.46
Transfer/sec:      1.20MB

But to follow up on the actual issue, I updated the response size to ~300 KB and the req/sec dropped to ~100, which is what I'm seeing on our app servers. Is that expected? I have updated the sample code here: https://github.com/jmnght/lettucetest

Results:

➜  ~ wrk -c 4 -d 10 -t 4 http://localhost:8080/fromRedis
Running 10s test @ http://localhost:8080/fromRedis
  4 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    35.57ms   19.86ms 139.73ms   92.87%
    Req/Sec    30.30      7.86    50.00     57.88%
  1214 requests in 10.07s, 448.51MB read
Requests/sec:    120.53
Transfer/sec:     44.53MB

Thanks!

mp911de commented Dec 3, 2018

Thanks a lot for the additional details, @jmnght. Increasing the payload size really makes a difference and helps to reveal code that does not scale well.

I spotted several things here and tried different approaches:

  1. Jackson requires a lot of time to decode binary data into JSON objects.

[profiler screenshot: read as JSON]

  2. I tried to simplify things by replacing Jackson with plain String. Things improve, but in doing so I found a performance issue in Spring WebFlux. The throughput isn't perfect, but it improved.

[profiler screenshot: write as String]

  3. Another simplification is passing objects through as ByteBuffer/byte[]. This is by far the fastest (see the sketch below).
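A minimal sketch of what such a pass-through handler could look like (not the exact code from jmnght/lettucetest; it uses a plain GET for brevity, and key and class names are illustrative):

import java.nio.charset.StandardCharsets;

import io.lettuce.core.RedisClient;
import io.lettuce.core.api.reactive.RedisReactiveCommands;
import io.lettuce.core.codec.ByteArrayCodec;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Mono;

@RestController
class RawRedisController {

  private final RedisReactiveCommands<byte[], byte[]> commands;

  RawRedisController(RedisClient client) {
    // ByteArrayCodec keeps keys and values as byte[] end to end, so no Jackson work
    // happens on the Netty I/O thread.
    this.commands = client.connect(ByteArrayCodec.INSTANCE).reactive();
  }

  @GetMapping(value = "/fromRedis", produces = MediaType.APPLICATION_OCTET_STREAM_VALUE)
  Mono<byte[]> fromRedis() {
    return commands.get("master-key".getBytes(StandardCharsets.UTF_8));
  }
}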

I attached my wrk results below.

Okay, so what's the conclusion? I think the combination of an expensive serialization mechanism paired with the single-threaded nature of a single connection has a measurable impact. I will do further experiments with multi-threaded scenarios to see how we can improve here.

Using Jackson (original code at jmnght/lettucetest@6f2c221)

 ~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    21.12ms    1.14ms  28.53ms   93.99%
    Req/Sec   189.62      6.54   202.00     72.00%
  948 requests in 5.02s, 350.21MB read
Requests/sec:    188.94
Transfer/sec:     69.80MB

Write HGETALL content as String:

~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.24ms    0.88ms  13.20ms   85.06%
    Req/Sec   553.86     33.64   610.00     74.51%
  2812 requests in 5.10s, 1.01GB read
Requests/sec:    551.37
Transfer/sec:    203.75MB

Write HGETALL content as byte[]:

~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.19ms  225.60us   4.82ms   83.02%
    Req/Sec     3.25k   190.54     3.44k    88.24%
  16511 requests in 5.10s, 5.96GB read
Requests/sec:   3237.87
Transfer/sec:      1.17GB

@mp911de mp911de added type: enhancement A general enhancement and removed status: blocked An issue that is blocked on an external project change status: waiting-for-feedback We need additional information before we can continue type: task A general task labels Dec 3, 2018
mp911de commented Dec 3, 2018

I introduced a Scheduler option to Lettuce so that data signals are dispatched on ClientResources#eventExecutorGroup() for now. I will likely make this configurable via ClientOptions, so that a specific client instance can either emit items directly (the default) or use a different scheduler. So basically, we're re-introducing multi-threading with this change, whereas a single connection otherwise effectively leads to single-threaded behavior.

I deployed a snapshot build that emits data signals on a multi-threaded scheduler.
@jmnght Can you give version 5.1.4.gh905-SNAPSHOT a try and report back whether you see a similar positive effect?

<dependency>
  <groupId>io.lettuce</groupId>
  <artifactId>lettuce-core</artifactId>
  <version>5.1.4.gh905-SNAPSHOT</version>
</dependency>

<repositories>
  <repository>
    <id>sonatype-snapshots</id>
    <name>Sonatype Snapshot Repository</name>
    <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
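For plain Lettuce (outside Spring Boot), enabling the snapshot behavior would look roughly like this; a sketch only, using the publishOnScheduler flag exposed on the ClientOptions builder:

RedisClient client = RedisClient.create("redis://localhost");
client.setOptions(ClientOptions.builder()
    .publishOnScheduler(true) // emit data/completion signals on the EventExecutorGroup instead of the I/O thread
    .build());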

Emitting data signals directly:

 ~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    21.80ms    2.06ms  32.30ms   93.14%
    Req/Sec   183.64     14.14   202.00     76.00%
  918 requests in 5.02s, 339.13MB read
Requests/sec:    183.00
Transfer/sec:     67.60MB

Emitting data signals on a scheduler:

 ~  wrk -c 4 -d 5 -t 1 http://localhost:8080/fromRedis
Running 5s test @ http://localhost:8080/fromRedis
  1 threads and 4 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.09ms    3.42ms  22.06ms   64.09%
    Req/Sec   659.58    124.00     0.96k    72.00%
  3287 requests in 5.01s, 0.87GB read
Requests/sec:    656.40
Transfer/sec:    177.22MB

jmnght commented Dec 3, 2018

@mp911de Thank you for analyzing it. I will try the SNAPSHOT version and report back.
I have a few questions though:

  1. What profiler are you using? It looks like VisualVM, but the flame icons look neat.
  2. You mentioned passing through ByteBuffer/byte[] had much better performance. What's your thought on the Lettuce thread just returning byte[] without really deserializing, and then doing the deserialization on a separate pool via a publishOn() in the controller, so that the Lettuce thread is not impacted?
  3. I'm using a dual-core MacBook Pro with a 3.1 GHz i7, but my requests/sec are mostly 50% of what you are seeing. Are you using a quad-core machine? Any thoughts on how to debug that?

mp911de commented Dec 3, 2018

Regarding your questions:

  1. It's YourKit.
  2. That's what is/will happen in Spring Data Redis: there, Lettuce is used with a binary codec and all serialization happens in Spring Data Redis itself. From all the reports I see, addressing the issue in Lettuce is the best option. This allows users to decide whether they want to offload deserialization/data signals to a separate thread or keep things as-is.
  3. See #905 (comment). If you have two cores, you can effectively run only two threads. Any number higher than 2 increases overhead and decreases throughput.

jmnght commented Dec 5, 2018

@mp911de I tried the snapshot and the performance improvements I'm seeing are similar to yours. The number of requests per second handled is a little more than double what it is currently.
Until this fix gets released, I think I will have to have the deserializer wrap the byte[] in a ByteBuffer and return it (without actually deserializing), so that the Lettuce thread is not blocked, unless you have another suggestion. :)

mp911de commented Dec 5, 2018

Thanks for reporting back. There's no sensible workaround beyond consuming raw bytes, handling deserialization in your own code, and applying publishOn(Schedulers.parallel()) before deserialization.
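A sketch of that workaround, assuming a byte[] codec, a Jackson ObjectMapper named mapper, and a hypothetical Item target type:

Mono<Item> item = commands.get("master-key".getBytes(StandardCharsets.UTF_8))
    .publishOn(Schedulers.parallel()) // leave the Netty I/O thread before CPU-heavy work
    .map(bytes -> {
      try {
        return mapper.readValue(bytes, Item.class); // Jackson now runs on the parallel scheduler
      } catch (IOException e) {
        throw Exceptions.propagate(e);
      }
    });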

@mp911de mp911de added this to the 5.1.4 milestone Dec 5, 2018
jmnght commented Dec 5, 2018

@mp911de Thank you for adding this to the 5.1.4 milestone. For now, I've changed usage to something like:
redisTemplate.execute(conn -> conn.hashCommands().hGetAll(ByteBuffer.wrap("master-key-proto".getBytes())))
    .publishOn(Schedulers.parallel())
    .map(Map.Entry::getValue)
    .map(val -> mapper.readValue(val, Parent.class));

Throughput is close to twice as high (~400 req/sec) as on version 5.1.2, but nowhere near the 3237 req/sec you were seeing with HGETALL content as byte[].

jmnght commented Dec 7, 2018

@mp911de, I integrated this change into the actual application, and the improvement was much lower than what I saw in the test application. Could you share the code snippet that gave you 3237 req/sec with byte[]? (Or feel free to commit to the jmnght/lettucetest repo.)
I will also be trying this in the actual production environment and will report back on what I find.

mp911de added a commit that referenced this issue Dec 10, 2018
Emission of data and completion signals using the reactive API can now use Lettuce's EventExecutorGroup instead of the I/O thread. Scheduler usage allows multiplexing of threads and early I/O thread release.

Using a single Redis connection in a reactive context can cause effectively single-threaded behavior, as processing of downstream operators typically happens on the I/O thread. Compute-intensive processing consumes I/O thread capacity and results in a performance decrease.
mp911de added a commit that referenced this issue Dec 10, 2018
mp911de commented Dec 10, 2018

@jmnght I pushed the change to master and 5.1.x; you can consume the new artifacts using versions 5.2.0.BUILD-SNAPSHOT and 5.1.4.BUILD-SNAPSHOT, respectively.

When using Spring Boot, you can customize Lettuce via:

@Configuration
class LettuceCustomizer implements LettuceClientConfigurationBuilderCustomizer {

  @Override
  public void customize(LettuceClientConfigurationBuilder clientConfigurationBuilder) {
    clientConfigurationBuilder.clientOptions(ClientOptions.builder().publishOnScheduler(true).build());
  }
}

@mp911de mp911de removed the for: team-attention An issue we need to discuss as a team to make progress label Dec 10, 2018
jmnght commented Dec 13, 2018

@mp911de Any metrics on performance with the above changes?

@adrianotr

I will not be able to share my code here, but I can confirm that my use case is running 4x faster with the above change. I'm heavily using Spring Data Redis in my project.

mp911de commented Dec 14, 2018

We've already seen metrics measured from an external system's perspective: throughput scales with the number of utilized cores instead of being capped at single-thread (single-core) capacity.

From a JMH perspective there's a notable penalty; however, users have the choice to enable or disable publishing on a scheduler:

Direct Publish:
Benchmark                               Mode  Cnt      Score       Error  Units
RedisClientBenchmark.reactiveSetBatch  thrpt    5  72636,061 ± 14977,276  ops/s

Use Publish Scheduler:
Benchmark                               Mode  Cnt      Score       Error  Units
RedisClientBenchmark.reactiveSetBatch  thrpt    5  56656,378 ± 18703,055  ops/s

I'm closing this ticket as it is resolved now.

@mp911de mp911de closed this as completed Dec 14, 2018
@mp911de mp911de changed the title Investigate on response thread switching Allow usage of publishOn scheduler for reactive signal emission Feb 12, 2019
trueinsider commented Feb 27, 2019

@mp911de This change introduces some kind of bug where the result is lost. I was able to reproduce this with script execution:

import static io.lettuce.core.ScriptOutputType.INTEGER;

RedisClient client1 = RedisClient.create("redis://localhost");
RedisClient client2 = RedisClient.create("redis://localhost");
client1.setOptions(ClientOptions.builder().publishOnScheduler(false).build());
client2.setOptions(ClientOptions.builder().publishOnScheduler(true).build());
RedisReactiveCommands<String, String> redis1 = client1.connect().reactive();
RedisReactiveCommands<String, String> redis2 = client2.connect().reactive();

// Count how often EVAL yields no element at all (the lost-result case).
int counter1 = 0;
int counter2 = 0;
for (int i = 0; i < 1000; i++) {
    if (redis1.eval("return 1", INTEGER).next().block() == null) counter1++;
    if (redis2.eval("return 1", INTEGER).next().block() == null) counter2++;
}
System.out.println(counter1);
System.out.println(counter2);

Basically, both counter1 and counter2 should be 0, but counter2 ends up somewhere between 1 and 100 instead.

Also, it would be great to be able to specify our own executor instead of Lettuce's.

mp911de commented Feb 27, 2019

@trueinsider Can you please file a ticket for the bug report?

Re Executor: You can do so by providing an EventExecutorGroup object.
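A sketch of wiring in your own executor, assuming the DefaultClientResources builder accepts a shared Netty EventExecutorGroup as the comment above suggests:

DefaultEventExecutorGroup executors = new DefaultEventExecutorGroup(4); // io.netty.util.concurrent

ClientResources resources = DefaultClientResources.builder()
    .eventExecutorGroup(executors) // reactive signals are emitted on this group
    .build();

RedisClient client = RedisClient.create(resources, "redis://localhost");
client.setOptions(ClientOptions.builder().publishOnScheduler(true).build());

Keep in mind that an externally provided group is typically not shut down by Lettuce, so the application needs to terminate it.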
