
Optimize AgentByteBoundedQueue to avoid excessive latency in concurrent scenarios #282

Merged
merged 14 commits into megaease:master on Sep 26, 2022

Conversation

asasas234
Contributor

@asasas234 asasas234 commented Sep 22, 2022

We found a large latency gap between running with and without EaseAgent. A flame graph traced the cause to the com.megaease.easeagent.report.async.zipkin.AgentByteBoundedQueue#offer method: it calls ReentrantLock#lock at its entry point, and threads get stuck waiting there under high concurrency.
Others on GitHub have hit the same problem: openzipkin/zipkin-reporter-java#101
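The optimization replaces the lock-guarded offer with non-blocking structures. Below is a minimal sketch of that idea, assuming a ConcurrentLinkedQueue plus atomic counters to enforce the element and byte bounds; the class name and bounds here are illustrative, not the actual AgentByteBoundedQueue code:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative lock-free bounded byte queue (hypothetical names, not EaseAgent's code).
class LockFreeByteBoundedQueue {
    private final ConcurrentLinkedQueue<byte[]> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger count = new AtomicInteger(0);
    private final AtomicInteger sizeInBytes = new AtomicInteger(0);
    private final int maxSize;
    private final int maxBytes;

    LockFreeByteBoundedQueue(int maxSize, int maxBytes) {
        this.maxSize = maxSize;
        this.maxBytes = maxBytes;
    }

    // Returns false (drops the span) instead of blocking when a bound is hit,
    // so hot request threads never park on a lock.
    boolean offer(byte[] next) {
        // Reserve capacity first; roll back the reservation if a bound is exceeded.
        if (count.incrementAndGet() > maxSize) {
            count.decrementAndGet();
            return false;
        }
        if (sizeInBytes.addAndGet(next.length) > maxBytes) {
            sizeInBytes.addAndGet(-next.length);
            count.decrementAndGet();
            return false;
        }
        queue.offer(next);
        return true;
    }

    byte[] poll() {
        byte[] next = queue.poll();
        if (next != null) {
            count.decrementAndGet();
            sizeInBytes.addAndGet(-next.length);
        }
        return next;
    }
}
```

Because every path uses CAS-based atomics instead of a monitor, producers never wait for each other; when the queue is full the span is dropped immediately, trading tracing completeness for request latency.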

@asasas234
Contributor Author

asasas234 commented Sep 22, 2022

My test setup: I started two local test projects:

  1. A Spring Cloud Gateway that performs simple downstream forwarding
  2. A Spring MVC project whose endpoint uses JdbcTemplate to execute one SQL statement: select * from test (the test table has only 3 rows)

Here is the load test data before the optimization:

 default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 105 MB 401 kB/s
     data_sent......................: 56 MB  216 kB/s
     http_req_blocked...............: avg=11.16µs  min=0s     med=2µs   max=17.26ms  p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=8.83µs   min=0s     med=0s    max=17.23ms  p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=825.88ms min=1.26ms med=1.04s max=2.44s    p(90)=1.28s p(95)=1.37s
       { expected_response:true }...: avg=826.33ms min=1.26ms med=1.04s max=2.44s    p(90)=1.28s p(95)=1.38s
     http_req_failed................: 0.06%  ✓ 392         ✗ 626137
     http_req_receiving.............: avg=73.3µs   min=5µs    med=38µs  max=115.79ms p(90)=130µs p(95)=217µs
     http_req_sending...............: avg=7.71µs   min=2µs    med=6µs   max=2.24ms   p(90)=11µs  p(95)=19µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s    max=0s       p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=825.8ms  min=1.23ms med=1.04s max=2.44s    p(90)=1.28s p(95)=1.37s
     http_reqs......................: 626529 2401.813723/s
     iteration_duration.............: avg=1.82s    min=1s     med=2.04s max=3.44s    p(90)=2.28s p(95)=2.38s
     iterations.....................: 626529 2401.813723/s
     vus............................: 48     min=48        max=6000
     vus_max........................: 6000   min=6000      max=6000

     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 104 MB 398 kB/s
     data_sent......................: 56 MB  215 kB/s
     http_req_blocked...............: avg=13.68µs  min=0s     med=2µs   max=108.47ms p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=11.38µs  min=0s     med=0s    max=108.44ms p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=838.14ms min=1.28ms med=1.07s max=2.38s    p(90)=1.3s  p(95)=1.38s
       { expected_response:true }...: avg=838.73ms min=1.28ms med=1.07s max=2.38s    p(90)=1.3s  p(95)=1.38s
     http_req_failed................: 0.07%  ✓ 479        ✗ 621846
     http_req_receiving.............: avg=79.95µs  min=5µs    med=39µs  max=154.13ms p(90)=137µs p(95)=242µs
     http_req_sending...............: avg=7.86µs   min=2µs    med=6µs   max=4.51ms   p(90)=12µs  p(95)=19µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s    max=0s       p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=838.05ms min=1.26ms med=1.07s max=2.38s    p(90)=1.3s  p(95)=1.38s
     http_reqs......................: 622325 2384.77698/s
     iteration_duration.............: avg=1.83s    min=1s     med=2.07s max=3.38s    p(90)=2.3s  p(95)=2.38s
     iterations.....................: 622325 2384.77698/s
     vus............................: 43     min=43       max=6000
     vus_max........................: 6000   min=6000     max=6000

     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 104 MB 399 kB/s
     data_sent......................: 56 MB  215 kB/s
     http_req_blocked...............: avg=18.83µs  min=0s     med=2µs   max=369.86ms p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=16.54µs  min=0s     med=0s    max=369.83ms p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=837.22ms min=1.27ms med=1.07s max=2.41s    p(90)=1.29s p(95)=1.37s
       { expected_response:true }...: avg=838.21ms min=1.27ms med=1.07s max=2.41s    p(90)=1.29s p(95)=1.37s
     http_req_failed................: 0.13%  ✓ 822         ✗ 621808
     http_req_receiving.............: avg=75.86µs  min=5µs    med=38µs  max=104.57ms p(90)=129µs p(95)=215µs
     http_req_sending...............: avg=7.74µs   min=2µs    med=6µs   max=2.07ms   p(90)=11µs  p(95)=19µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s    max=0s       p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=837.14ms min=1.24ms med=1.07s max=2.41s    p(90)=1.29s p(95)=1.37s
     http_reqs......................: 622630 2387.268865/s
     iteration_duration.............: avg=1.83s    min=1s     med=2.07s max=3.41s    p(90)=2.29s p(95)=2.37s
     iterations.....................: 622630 2387.268865/s
     vus............................: 51     min=51        max=6000
     vus_max........................: 6000   min=6000      max=6000

Here is the load test data after the optimization:

default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 113 MB 433 kB/s
     data_sent......................: 61 MB  233 kB/s
     http_req_blocked...............: avg=10.75µs  min=0s     med=2µs      max=26.93ms p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=8.48µs   min=0s     med=0s       max=26.89ms p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=691.29ms min=1.29ms med=874.78ms max=2.07s   p(90)=1.11s p(95)=1.19s
       { expected_response:true }...: avg=691.82ms min=1.29ms med=874.99ms max=2.07s   p(90)=1.11s p(95)=1.19s
     http_req_failed................: 0.08%  ✓ 602         ✗ 675619
     http_req_receiving.............: avg=54.04µs  min=4µs    med=34µs     max=76.11ms p(90)=96µs  p(95)=131µs
     http_req_sending...............: avg=7.67µs   min=2µs    med=6µs      max=6.06ms  p(90)=11µs  p(95)=18µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s       max=0s      p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=691.23ms min=1.25ms med=874.72ms max=2.07s   p(90)=1.11s p(95)=1.19s
     http_reqs......................: 676221 2593.195764/s
     iteration_duration.............: avg=1.69s    min=1s     med=1.87s    max=3.07s   p(90)=2.11s p(95)=2.19s
     iterations.....................: 676221 2593.195764/s
     vus............................: 51     min=51        max=6000
     vus_max........................: 6000   min=6000      max=6000


     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 117 MB 448 kB/s
     data_sent......................: 63 MB  241 kB/s
     http_req_blocked...............: avg=10.89µs  min=0s     med=2µs      max=13.9ms  p(90)=3µs      p(95)=4µs  
     http_req_connecting............: avg=8.56µs   min=0s     med=0s       max=13.82ms p(90)=0s       p(95)=0s   
     http_req_duration..............: avg=635.62ms min=1.27ms med=817.71ms max=1.94s   p(90)=984.92ms p(95)=1.05s
       { expected_response:true }...: avg=636.05ms min=1.27ms med=817.88ms max=1.94s   p(90)=985.01ms p(95)=1.05s
     http_req_failed................: 0.07%  ✓ 553         ✗ 698610
     http_req_receiving.............: avg=52.35µs  min=4µs    med=33µs     max=54.57ms p(90)=95µs     p(95)=128µs
     http_req_sending...............: avg=7.82µs   min=2µs    med=6µs      max=6.2ms   p(90)=11µs     p(95)=18µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s       max=0s      p(90)=0s       p(95)=0s   
     http_req_waiting...............: avg=635.56ms min=1.23ms med=817.66ms max=1.94s   p(90)=984.86ms p(95)=1.05s
     http_reqs......................: 699163 2681.117921/s
     iteration_duration.............: avg=1.63s    min=1s     med=1.81s    max=2.94s   p(90)=1.98s    p(95)=2.05s
     iterations.....................: 699163 2681.117921/s
     vus............................: 52     min=52        max=6000
     vus_max........................: 6000   min=6000      max=6000

     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 117 MB 448 kB/s
     data_sent......................: 63 MB  242 kB/s
     http_req_blocked...............: avg=10.17µs  min=0s     med=2µs      max=60.28ms p(90)=3µs      p(95)=4µs  
     http_req_connecting............: avg=8.03µs   min=0s     med=0s       max=60.24ms p(90)=0s       p(95)=0s   
     http_req_duration..............: avg=631.68ms min=1.25ms med=822.73ms max=1.79s   p(90)=988.91ms p(95)=1.04s
       { expected_response:true }...: avg=632.09ms min=1.25ms med=822.88ms max=1.79s   p(90)=988.97ms p(95)=1.04s
     http_req_failed................: 0.07%  ✓ 536         ✗ 700327
     http_req_receiving.............: avg=55.3µs   min=4µs    med=34µs     max=67.8ms  p(90)=95µs     p(95)=129µs
     http_req_sending...............: avg=7.29µs   min=2µs    med=5µs      max=2.31ms  p(90)=10µs     p(95)=17µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s       max=0s      p(90)=0s       p(95)=0s   
     http_req_waiting...............: avg=631.62ms min=1.22ms med=822.67ms max=1.79s   p(90)=988.84ms p(95)=1.04s
     http_reqs......................: 700863 2685.511346/s
     iteration_duration.............: avg=1.63s    min=1s     med=1.82s    max=2.79s   p(90)=1.98s    p(95)=2.04s
     iterations.....................: 700863 2685.511346/s
     vus............................: 44     min=44        max=6000
     vus_max........................: 6000   min=6000      max=6000

k6 was used as the load testing tool; the test script is below:

import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '20s', target: 1500 }, // below normal load
    { duration: '20s', target: 3000 },
    { duration: '20s', target: 4500 }, // normal load
    { duration: '20s', target: 6000 },
    { duration: '2m', target: 6000 }, // around the breaking point
    { duration: '1m', target: 0 }, // scale down. Recovery stage.
  ],
};

export default function () {
  const BASE_URL = 'http://localhost:8134'; // make sure this is not production

  const responses = http.batch([
    ['GET', `${BASE_URL}/api/getALL`, null, { tags: { name: 'owner' } }]
  ]);

  sleep(1);
}

@asasas234 asasas234 changed the title from "Optimize the bounded blocking queue to avoid long lock waits under high concurrency" (优化有界阻塞队列，避免在高并发场景下锁等待时间太久) to "Optimize AgentByteBoundedQueue to avoid excessive latency in concurrent scenarios" on Sep 22, 2022
@zhao-kun
Contributor

Thanks for your contribution. Nice job.
But I have several concerns about the test case; could you share more background details:

  1. What machine spec did you use for the test?
  2. What was the machine's resource usage during the test?
  3. Were any tracing packages dropped because the queue was full?

@asasas234
Contributor Author

  1. What machine spec did you use for the test?
     I ran the load test on my local machine; both k6 and the test projects were local. The machine is an Apple M1 Max with 10 CPU cores, 64 GB of RAM, and a 1 TB SSD.
  2. What was the machine's resource usage during the test?
     The machine's load stayed between 18 and 20 most of the time, peaking at 21.42, and CPU utilization was about 50%.
  3. Were any tracing packages dropped because the queue was full?
     I don't believe so, since the load generator and the test projects are all local.

@observeralone
Contributor

To make sure the functionality of this class is fully migrated, I hope you can add unit tests

@asasas234
Contributor Author

To make sure the functionality of this class is fully migrated, I hope you can add unit tests

OK, I'll deal with the atomicity problem and then add unit tests before submitting.

@asasas234
Contributor Author

@observeralone It has been addressed; please take a look.

@landyking
Copy link
Contributor

Overall, this is a good optimization.
The biggest difference is that, under concurrency, the order in which consumers consume data may not match the order in which it was written. But that does not seem to be a problem in the tracing scenario (the data is eventually organized by timestamp and trace ID).
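The reordering concern above is harmless for tracing because spans can be regrouped downstream. A toy sketch of that regrouping (the Span record and organize helper are hypothetical, not EaseAgent or Zipkin APIs):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical span shape: just enough fields to show the regrouping.
record Span(String traceId, long timestampMicros, String name) {}

class SpanOrdering {
    // Even if the queue hands spans to the reporter out of order,
    // a consumer can regroup them by trace ID and sort by timestamp.
    static List<Span> organize(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparing(Span::traceId)
                              .thenComparingLong(Span::timestampMicros));
        return sorted;
    }
}
```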

@landyking landyking (Contributor) left a comment

LGTM

@zhao-kun zhao-kun (Contributor) left a comment

LGTM, thanks for your contribution.

@landyking landyking merged commit 6336f5d into megaease:master Sep 26, 2022