
Improved ACLK memory management and shutdown sequence #8611

Merged 5 commits into netdata:master on Apr 9, 2020

Conversation

stelfrag
Collaborator

@stelfrag stelfrag commented Apr 6, 2020

Fixes #8483

Summary
  • Improve the ACLK main thread shutdown sequence.
    • The thread's cancelability is disabled and shutdown is detected via the netdata_exit variable.
      • Previously, the netdata core would immediately cancel/kill the ACLK main thread during agent shutdown, so the code after the while loop was never executed.
  • The ACLK query thread empties the query queue and frees resources on shutdown
  • Fixed memory leak (payload type parameter)
  • Improved error / info message on disconnection from cloud
    • When the connection is dropped because the agent is shutting down, it logs an informational message
    • When the connection actually fails, it logs an error message
  • Trying to queue identical update commands to the query engine (i.e. an identical update is already in the queue) is now considered a success
Component Name

ACLK

Test Plan
  • Compile agent with ACLK enabled (must be claimed)
  • To test fixed memory leaks
    • After starting the agent as described below, switch to the cloud and observe charts to make sure the ACLK is in use (i.e. the agent will accept and process cloud queries)
    • Use valgrind, e.g.:
      • valgrind --undef-value-errors=no --leak-check=full -s <netdata executable>
      • Allow it to run for some time, stop the server, and observe the valgrind report
  • To test that the error "ACLK failed to queue chart_update command" is gone
    • Start the agent with as many plugins enabled as possible
      • Rapid initialization of the plugins (creating charts and adding dimensions within the same second) would previously trigger "ACLK failed to queue chart_update command" in the error.log


@underhood
Contributor

@stelfrag can you rebase on master? There seems to be a conflict.

* Return success where an identical update command is found in the queue
* Disabled cancelability for main ACLK thread to handle graceful shutdown
* Fixed memory leak on incoming payload 'type_id'
* Fixed memory leak on ACLK query thread shutdown
@amoss
Contributor

amoss commented Apr 9, 2020

In aclk_main, should we push the cleanup if !netdata_cloud_setting?

@amoss
Contributor

amoss commented Apr 9, 2020

I see this in valgrind; can we determine whether it comes from the ACLK code or somewhere else in Netdata?

==24264== Thread 1:
==24264== 1,584 (176 direct, 1,408 indirect) bytes in 1 blocks are definitely lost in loss record 11,258 of 11,440
==24264==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==24264==    by 0x4D90558: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DBF4FE: RSA_new_method (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DBDB64: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB53CC: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB2D6A: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB394C: ASN1_item_ex_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB39CA: ASN1_item_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DBC26D: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CAEFEA: d2i_PrivateKey (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DA4A3A: PEM_read_bio_PrivateKey (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DA0DCA: PEM_read_bio_RSAPrivateKey (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)

@amoss
Contributor

amoss commented Apr 9, 2020

There seems to be a mixture of:

RSA_free(aclk_private_key);

and

freez(private_key);

I think the freez call is wrong (L172 in agent_cloud_link.c) and should be RSA_free.

Ignore this, it is completely wrong.

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

In aclk_main, should we push the cleanup if !netdata_cloud_setting?

Good point -- no need to push cleanup before that check

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

I think the freez call is wrong (L172 in agent_cloud_link.c) and should be RSA_free.

No, that looks good. It comes from read_by_filename, which uses callocz for that buffer.

@underhood
Contributor

OK, tested this with cloud (after reinstalling json-c) and reviewed. Apart from the Valgrind thing (which is probably the freeing of the key) that Andrew mentioned already, I do not notice any obvious issues.

@underhood
Contributor

underhood commented Apr 9, 2020

After adding --num-callers=50 to valgrind, you can see a longer backtrace:

==71908== 1,584 (176 direct, 1,408 indirect) bytes in 1 blocks are definitely lost in loss record 11,357 of 11,570
==71908==    at 0x483980B: malloc (vg_replace_malloc.c:309)
==71908==    by 0x4B545CD: CRYPTO_zalloc (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B86533: RSA_new_method (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B83BD4: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6D984: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6B30A: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6BEB1: ASN1_item_ex_d2i (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6BF3E: ASN1_item_d2i (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B82241: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6749B: d2i_PrivateKey (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B6948A: PEM_read_bio_PrivateKey (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B6541E: PEM_read_bio_RSAPrivateKey (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4BC067: create_private_key (agent_cloud_link.c:168)
==71908==    by 0x4BF4DC: aclk_main (agent_cloud_link.c:1313)
==71908==    by 0x43C9B1: thread_start (threads.c:170)
==71908==    by 0x4F054E1: start_thread (in /usr/lib64/libpthread-2.30.so)
==71908==    by 0x501F6A2: clone (in /usr/lib64/libc-2.30.so)

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

OK, tested this with cloud (after reinstalling json-c) and reviewed. Apart from the Valgrind thing (which is probably the freeing of the key) that Andrew mentioned already, I do not notice any obvious issues.

That issue is caused by create_private_key() being called twice (at about L1318 and L1327), allocating aclk_private_key twice.

@amoss
Contributor

amoss commented Apr 9, 2020

The call after the loop (that does not check the return value) can be removed. Perhaps we should check if aclk_private_key != NULL inside create_private_key and just return success in that case?

All the tests look good though, I'm about to approve this.

…ed and the ACLK main thread terminates immediately

* Remove redundant aclk_private_key generation
@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

The call after the loop (that does not check the return value) can be removed. Perhaps we should check if aclk_private_key != NULL inside create_private_key and just return success in that case?

All the tests look good though, I'm about to approve this.

Yes, I removed that call. The Valgrind error went away.

@amoss
Contributor

amoss commented Apr 9, 2020

Question:

We disable thread cancelability before the main loop, but we have a cleanup handler registered and it is removed as the last step in the thread main.

  • Why do we have a cleanup if it is not used?
  • Is there anything in that cleanup that we should be doing?

@underhood
Contributor

Valgrind looks good now:

==89546== LEAK SUMMARY:
==89546==    definitely lost: 0 bytes in 0 blocks
==89546==    indirectly lost: 0 bytes in 0 blocks
==89546==      possibly lost: 0 bytes in 0 blocks
==89546==    still reachable: 4,918,562 bytes in 42,172 blocks
==89546==         suppressed: 104 bytes in 1 blocks

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

Question:

We disable thread cancelability before the main loop, but we have a cleanup handler registered and it is removed as the last step in the thread main.

  • Why do we have a cleanup if it is not used?
  • Is there anything in that cleanup that we should be doing?

The netdata_thread_cleanup_pop will execute the cleanup.

@amoss
Contributor

amoss commented Apr 9, 2020

The netdata_thread_cleanup_pop will execute the cleanup

Yes :) But why? If it is only ever called in that way, why are we using a cleanup handler? I.e., I know it is called in the normal flow, but it is never called in an abnormal flow.

@mfundul
Contributor

mfundul commented Apr 9, 2020

Guys, ideally we shouldn't add anything POSIX-thread-specific, to facilitate the move to libuv threads. Thread cancelability is not supported in libuv.

@amoss
Contributor

amoss commented Apr 9, 2020

True, but without disabling cancelability, the code after the main loop does not get called: the threads are killed before the netdata_exit flag is set.

@amoss
Contributor

amoss commented Apr 9, 2020

@stelfrag spotted this when he was cleaning up. I was going to ask you about 28d5018f3.

underhood previously approved these changes Apr 9, 2020

@amoss amoss (Contributor) left a comment

Tests all work, random poking around seems stable. Changes to the code are good.

@underhood underhood (Contributor) left a comment

LGTM

@stelfrag stelfrag merged commit 764a067 into netdata:master Apr 9, 2020
@stelfrag stelfrag deleted the aclk_cleanup_8483 branch April 9, 2020 16:06
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request May 21, 2020
Improved ACLK memory management and shutdown sequence

Successfully merging this pull request may close these issues.

ACLK code cleanup
4 participants