
Optimize AgentByteBoundedQueue to avoid excessive latency in concurrent scenarios #282

Merged
merged 14 commits into megaease:master on Sep 26, 2022

Conversation

asasas234
Contributor

@asasas234 asasas234 commented Sep 22, 2022

We found a large latency gap between running with and without EaseAgent. A flame graph traced the cause to the com.megaease.easeagent.report.async.zipkin.AgentByteBoundedQueue#offer method: it calls ReentrantLock#lock at its entry point, and threads get stuck waiting there under high concurrency.
Others on GitHub have hit the same problem: openzipkin/zipkin-reporter-java#101
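The optimization replaces the lock-guarded offer with non-blocking structures. Below is a minimal sketch of that idea, assuming a ConcurrentLinkedQueue plus atomic counters to enforce the element and byte bounds; the class name and bounds here are illustrative, not the actual AgentByteBoundedQueue code:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative lock-free bounded byte queue (hypothetical names, not EaseAgent's code).
class LockFreeByteBoundedQueue {
    private final ConcurrentLinkedQueue<byte[]> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger count = new AtomicInteger(0);
    private final AtomicInteger sizeInBytes = new AtomicInteger(0);
    private final int maxSize;
    private final int maxBytes;

    LockFreeByteBoundedQueue(int maxSize, int maxBytes) {
        this.maxSize = maxSize;
        this.maxBytes = maxBytes;
    }

    // Returns false (drops the span) instead of blocking when a bound is hit,
    // so hot request threads never park on a lock.
    boolean offer(byte[] next) {
        // Reserve capacity first; roll back the reservation if a bound is exceeded.
        if (count.incrementAndGet() > maxSize) {
            count.decrementAndGet();
            return false;
        }
        if (sizeInBytes.addAndGet(next.length) > maxBytes) {
            sizeInBytes.addAndGet(-next.length);
            count.decrementAndGet();
            return false;
        }
        queue.offer(next);
        return true;
    }

    byte[] poll() {
        byte[] next = queue.poll();
        if (next != null) {
            count.decrementAndGet();
            sizeInBytes.addAndGet(-next.length);
        }
        return next;
    }
}
```

Because every path uses CAS-based atomics instead of a monitor, producers never wait for each other; when the queue is full the span is dropped immediately, trading tracing completeness for request latency.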

@asasas234
Contributor Author

asasas234 commented Sep 22, 2022

My test setup: I started two local test projects:

  1. A Spring Cloud Gateway that performs simple downstream forwarding
  2. A Spring MVC project whose endpoint uses JdbcTemplate to execute one SQL statement: select * from test (the test table has only 3 rows)

Here is the load test data before the optimization:

 default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 105 MB 401 kB/s
     data_sent......................: 56 MB  216 kB/s
     http_req_blocked...............: avg=11.16µs  min=0s     med=2µs   max=17.26ms  p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=8.83µs   min=0s     med=0s    max=17.23ms  p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=825.88ms min=1.26ms med=1.04s max=2.44s    p(90)=1.28s p(95)=1.37s
       { expected_response:true }...: avg=826.33ms min=1.26ms med=1.04s max=2.44s    p(90)=1.28s p(95)=1.38s
     http_req_failed................: 0.06%  ✓ 392         ✗ 626137
     http_req_receiving.............: avg=73.3µs   min=5µs    med=38µs  max=115.79ms p(90)=130µs p(95)=217µs
     http_req_sending...............: avg=7.71µs   min=2µs    med=6µs   max=2.24ms   p(90)=11µs  p(95)=19µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s    max=0s       p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=825.8ms  min=1.23ms med=1.04s max=2.44s    p(90)=1.28s p(95)=1.37s
     http_reqs......................: 626529 2401.813723/s
     iteration_duration.............: avg=1.82s    min=1s     med=2.04s max=3.44s    p(90)=2.28s p(95)=2.38s
     iterations.....................: 626529 2401.813723/s
     vus............................: 48     min=48        max=6000
     vus_max........................: 6000   min=6000      max=6000

     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 104 MB 398 kB/s
     data_sent......................: 56 MB  215 kB/s
     http_req_blocked...............: avg=13.68µs  min=0s     med=2µs   max=108.47ms p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=11.38µs  min=0s     med=0s    max=108.44ms p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=838.14ms min=1.28ms med=1.07s max=2.38s    p(90)=1.3s  p(95)=1.38s
       { expected_response:true }...: avg=838.73ms min=1.28ms med=1.07s max=2.38s    p(90)=1.3s  p(95)=1.38s
     http_req_failed................: 0.07%  ✓ 479        ✗ 621846
     http_req_receiving.............: avg=79.95µs  min=5µs    med=39µs  max=154.13ms p(90)=137µs p(95)=242µs
     http_req_sending...............: avg=7.86µs   min=2µs    med=6µs   max=4.51ms   p(90)=12µs  p(95)=19µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s    max=0s       p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=838.05ms min=1.26ms med=1.07s max=2.38s    p(90)=1.3s  p(95)=1.38s
     http_reqs......................: 622325 2384.77698/s
     iteration_duration.............: avg=1.83s    min=1s     med=2.07s max=3.38s    p(90)=2.3s  p(95)=2.38s
     iterations.....................: 622325 2384.77698/s
     vus............................: 43     min=43       max=6000
     vus_max........................: 6000   min=6000     max=6000

     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 104 MB 399 kB/s
     data_sent......................: 56 MB  215 kB/s
     http_req_blocked...............: avg=18.83µs  min=0s     med=2µs   max=369.86ms p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=16.54µs  min=0s     med=0s    max=369.83ms p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=837.22ms min=1.27ms med=1.07s max=2.41s    p(90)=1.29s p(95)=1.37s
       { expected_response:true }...: avg=838.21ms min=1.27ms med=1.07s max=2.41s    p(90)=1.29s p(95)=1.37s
     http_req_failed................: 0.13%  ✓ 822         ✗ 621808
     http_req_receiving.............: avg=75.86µs  min=5µs    med=38µs  max=104.57ms p(90)=129µs p(95)=215µs
     http_req_sending...............: avg=7.74µs   min=2µs    med=6µs   max=2.07ms   p(90)=11µs  p(95)=19µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s    max=0s       p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=837.14ms min=1.24ms med=1.07s max=2.41s    p(90)=1.29s p(95)=1.37s
     http_reqs......................: 622630 2387.268865/s
     iteration_duration.............: avg=1.83s    min=1s     med=2.07s max=3.41s    p(90)=2.29s p(95)=2.37s
     iterations.....................: 622630 2387.268865/s
     vus............................: 51     min=51        max=6000
     vus_max........................: 6000   min=6000      max=6000

Here is the load test data after the optimization:

default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 113 MB 433 kB/s
     data_sent......................: 61 MB  233 kB/s
     http_req_blocked...............: avg=10.75µs  min=0s     med=2µs      max=26.93ms p(90)=3µs   p(95)=4µs  
     http_req_connecting............: avg=8.48µs   min=0s     med=0s       max=26.89ms p(90)=0s    p(95)=0s   
     http_req_duration..............: avg=691.29ms min=1.29ms med=874.78ms max=2.07s   p(90)=1.11s p(95)=1.19s
       { expected_response:true }...: avg=691.82ms min=1.29ms med=874.99ms max=2.07s   p(90)=1.11s p(95)=1.19s
     http_req_failed................: 0.08%  ✓ 602         ✗ 675619
     http_req_receiving.............: avg=54.04µs  min=4µs    med=34µs     max=76.11ms p(90)=96µs  p(95)=131µs
     http_req_sending...............: avg=7.67µs   min=2µs    med=6µs      max=6.06ms  p(90)=11µs  p(95)=18µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s       max=0s      p(90)=0s    p(95)=0s   
     http_req_waiting...............: avg=691.23ms min=1.25ms med=874.72ms max=2.07s   p(90)=1.11s p(95)=1.19s
     http_reqs......................: 676221 2593.195764/s
     iteration_duration.............: avg=1.69s    min=1s     med=1.87s    max=3.07s   p(90)=2.11s p(95)=2.19s
     iterations.....................: 676221 2593.195764/s
     vus............................: 51     min=51        max=6000
     vus_max........................: 6000   min=6000      max=6000


     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 117 MB 448 kB/s
     data_sent......................: 63 MB  241 kB/s
     http_req_blocked...............: avg=10.89µs  min=0s     med=2µs      max=13.9ms  p(90)=3µs      p(95)=4µs  
     http_req_connecting............: avg=8.56µs   min=0s     med=0s       max=13.82ms p(90)=0s       p(95)=0s   
     http_req_duration..............: avg=635.62ms min=1.27ms med=817.71ms max=1.94s   p(90)=984.92ms p(95)=1.05s
       { expected_response:true }...: avg=636.05ms min=1.27ms med=817.88ms max=1.94s   p(90)=985.01ms p(95)=1.05s
     http_req_failed................: 0.07%  ✓ 553         ✗ 698610
     http_req_receiving.............: avg=52.35µs  min=4µs    med=33µs     max=54.57ms p(90)=95µs     p(95)=128µs
     http_req_sending...............: avg=7.82µs   min=2µs    med=6µs      max=6.2ms   p(90)=11µs     p(95)=18µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s       max=0s      p(90)=0s       p(95)=0s   
     http_req_waiting...............: avg=635.56ms min=1.23ms med=817.66ms max=1.94s   p(90)=984.86ms p(95)=1.05s
     http_reqs......................: 699163 2681.117921/s
     iteration_duration.............: avg=1.63s    min=1s     med=1.81s    max=2.94s   p(90)=1.98s    p(95)=2.05s
     iterations.....................: 699163 2681.117921/s
     vus............................: 52     min=52        max=6000
     vus_max........................: 6000   min=6000      max=6000

     default ✓ [======================================] 0000/6000 VUs  4m20s

     data_received..................: 117 MB 448 kB/s
     data_sent......................: 63 MB  242 kB/s
     http_req_blocked...............: avg=10.17µs  min=0s     med=2µs      max=60.28ms p(90)=3µs      p(95)=4µs  
     http_req_connecting............: avg=8.03µs   min=0s     med=0s       max=60.24ms p(90)=0s       p(95)=0s   
     http_req_duration..............: avg=631.68ms min=1.25ms med=822.73ms max=1.79s   p(90)=988.91ms p(95)=1.04s
       { expected_response:true }...: avg=632.09ms min=1.25ms med=822.88ms max=1.79s   p(90)=988.97ms p(95)=1.04s
     http_req_failed................: 0.07%  ✓ 536         ✗ 700327
     http_req_receiving.............: avg=55.3µs   min=4µs    med=34µs     max=67.8ms  p(90)=95µs     p(95)=129µs
     http_req_sending...............: avg=7.29µs   min=2µs    med=5µs      max=2.31ms  p(90)=10µs     p(95)=17µs 
     http_req_tls_handshaking.......: avg=0s       min=0s     med=0s       max=0s      p(90)=0s       p(95)=0s   
     http_req_waiting...............: avg=631.62ms min=1.22ms med=822.67ms max=1.79s   p(90)=988.84ms p(95)=1.04s
     http_reqs......................: 700863 2685.511346/s
     iteration_duration.............: avg=1.63s    min=1s     med=1.82s    max=2.79s   p(90)=1.98s    p(95)=2.04s
     iterations.....................: 700863 2685.511346/s
     vus............................: 44     min=44        max=6000
     vus_max........................: 6000   min=6000      max=6000

k6 was used as the load testing tool; the test script is below:

import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  stages: [
    { duration: '20s', target: 1500 }, // below normal load
    { duration: '20s', target: 3000 },
    { duration: '20s', target: 4500 }, // normal load
    { duration: '20s', target: 6000 },
    { duration: '2m', target: 6000 }, // around the breaking point
    { duration: '1m', target: 0 }, // scale down. Recovery stage.
  ],
};

export default function () {
  const BASE_URL = 'http://localhost:8134'; // make sure this is not production

  const responses = http.batch([
    ['GET', `${BASE_URL}/api/getALL`, null, { tags: { name: 'owner' } }]
  ]);

  sleep(1);
}

@asasas234 asasas234 changed the title from "Optimize the bounded blocking queue to avoid long lock waits under high concurrency" (优化有界阻塞队列，避免在高并发场景下锁等待时间太久) to "Optimize AgentByteBoundedQueue to avoid excessive latency in concurrent scenarios" on Sep 22, 2022
@zhao-kun
Contributor

Thanks for your contribution. Nice job.
But I have several concerns about the test case; could you share more background details:

  1. What machine spec did you use for the test?
  2. What was the machine's resource usage during the test?
  3. Were any tracing packages dropped because the queue was full?

@asasas234
Contributor Author

  1. What machine spec did you use for the test?
     I ran the load test on my local machine; both k6 and the test projects were local. The machine is an Apple M1 Max with 10 CPU cores, 64 GB of RAM, and a 1 TB SSD.
  2. What was the machine's resource usage during the test?
     The machine's load stayed between 18 and 20 most of the time, peaking at 21.42, and CPU utilization was about 50%.
  3. Were any tracing packages dropped because the queue was full?
     I don't believe so, since the load generator and the test projects are all local.

@observeralone
Contributor

To make sure the functionality of this class is fully migrated, I hope you can add unit tests

@asasas234
Contributor Author

To make sure the functionality of this class is fully migrated, I hope you can add unit tests

OK, I'll deal with the atomicity problem and then add unit tests before submitting.

@asasas234
Contributor Author

@observeralone It has been addressed; please take a look.

@landyking
Copy link
Contributor

Overall, this is a good optimization.
The biggest difference is that, under concurrency, the order in which consumers consume data may not match the order in which it was written. But that does not seem to be a problem in the tracing scenario (the data is eventually organized by timestamp and trace ID).
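The reordering concern above is harmless for tracing because spans can be regrouped downstream. A toy sketch of that regrouping (the Span record and organize helper are hypothetical, not EaseAgent or Zipkin APIs):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical span shape: just enough fields to show the regrouping.
record Span(String traceId, long timestampMicros, String name) {}

class SpanOrdering {
    // Even if the queue hands spans to the reporter out of order,
    // a consumer can regroup them by trace ID and sort by timestamp.
    static List<Span> organize(List<Span> spans) {
        List<Span> sorted = new ArrayList<>(spans);
        sorted.sort(Comparator.comparing(Span::traceId)
                              .thenComparingLong(Span::timestampMicros));
        return sorted;
    }
}
```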

@landyking landyking (Contributor) left a comment

LGTM

@zhao-kun zhao-kun (Contributor) left a comment

LGTM, thanks for your contribution.

@landyking landyking merged commit 6336f5d into megaease:master Sep 26, 2022