
Improved ACLK memory management and shutdown sequence #8611

Merged 5 commits into netdata:master on Apr 9, 2020

Conversation

stelfrag
Collaborator

@stelfrag stelfrag commented Apr 6, 2020

Fixes #8483

Summary
  • Improve the ACLK main thread shutdown sequence.
    • The thread's cancelability is disabled and shutdown is detected via the netdata_exit variable.
      • Previously, the netdata core would immediately cancel/kill the ACLK main thread during agent shutdown, so the code after the while loop was never executed.
  • The ACLK query thread empties the query queue and frees resources on shutdown
  • Fixed memory leak (payload type parameter)
  • Improved error / info message on disconnection from cloud
    • When the connection is dropped because the agent is shutting down, it logs an informational message
    • When the connection actually fails, it logs an error message
  • Trying to queue identical update commands to the query engine (i.e. an identical update is already in the queue) is now considered a success
Component Name

ACLK

Test Plan
  • Compile agent with ACLK enabled (must be claimed)
  • To test fixed memory leaks
    • After starting the agent as described below, switch to the cloud and observe charts to make sure the ACLK is in use (i.e. the agent will accept and process cloud queries)
    • Use valgrind, e.g.:
      • valgrind --undef-value-errors=no --leak-check=full -s <netdata executable>
      • Allow it to run for some time, stop the server, and observe the valgrind report
  • To test that the error "ACLK failed to queue chart_update command" is gone
    • Start the agent with as many plugins enabled as possible
      • Rapid initialization of the plugins (creating charts and adding dimensions within the same second) would previously trigger "ACLK failed to queue chart_update command" in the error.log


@underhood
Contributor

@stelfrag can you rebase on master? There seems to be a conflict.

* Return success where an identical update command is found in the queue
* Disabled cancelability for main ACLK thread to handle graceful shutdown
* Fixed memory leak on incoming payload 'type_id'
* Fixed memory leak on ACLK query thread shutdown
@amoss
Contributor

amoss commented Apr 9, 2020

In aclk_main, should we push the cleanup if !netdata_cloud_setting?

@amoss
Contributor

amoss commented Apr 9, 2020

I see this in valgrind; can we determine whether it comes from the ACLK code or somewhere else in Netdata?

==24264== Thread 1:
==24264== 1,584 (176 direct, 1,408 indirect) bytes in 1 blocks are definitely lost in loss record 11,258 of 11,440
==24264==    at 0x483577F: malloc (vg_replace_malloc.c:299)
==24264==    by 0x4D90558: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DBF4FE: RSA_new_method (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DBDB64: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB53CC: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB2D6A: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB394C: ASN1_item_ex_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CB39CA: ASN1_item_d2i (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DBC26D: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4CAEFEA: d2i_PrivateKey (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DA4A3A: PEM_read_bio_PrivateKey (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==24264==    by 0x4DA0DCA: PEM_read_bio_RSAPrivateKey (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)

@amoss
Contributor

amoss commented Apr 9, 2020

There seems to be a mixture of:

RSA_free(aclk_private_key);

and

freez(private_key);

I think the freez call is wrong (L172 in agent_cloud_link.c) and should be RSA_free.

Ignore this, it is completely wrong.

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

In aclk_main, should we push the cleanup if !netdata_cloud_setting?

Good point -- no need to push cleanup before that check

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

I think the freez call is wrong (L172 in agent_cloud_link.c) and should be RSA_free.

No, that looks good. It comes from read_by_filename, which uses callocz for that buffer.

@underhood
Contributor

OK, tested this with cloud (after reinstalling json-c) and reviewed. Apart from the Valgrind thing (which is probably the freeing of the key) that Andrew mentioned already, I do not notice any obvious issues.

@underhood
Contributor

underhood commented Apr 9, 2020

After adding --num-callers=50 to valgrind, you can see a longer backtrace:

==71908== 1,584 (176 direct, 1,408 indirect) bytes in 1 blocks are definitely lost in loss record 11,357 of 11,570
==71908==    at 0x483980B: malloc (vg_replace_malloc.c:309)
==71908==    by 0x4B545CD: CRYPTO_zalloc (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B86533: RSA_new_method (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B83BD4: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6D984: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6B30A: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6BEB1: ASN1_item_ex_d2i (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6BF3E: ASN1_item_d2i (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B82241: ??? (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4A6749B: d2i_PrivateKey (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B6948A: PEM_read_bio_PrivateKey (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4B6541E: PEM_read_bio_RSAPrivateKey (in /usr/lib64/libcrypto.so.1.1.1d)
==71908==    by 0x4BC067: create_private_key (agent_cloud_link.c:168)
==71908==    by 0x4BF4DC: aclk_main (agent_cloud_link.c:1313)
==71908==    by 0x43C9B1: thread_start (threads.c:170)
==71908==    by 0x4F054E1: start_thread (in /usr/lib64/libpthread-2.30.so)
==71908==    by 0x501F6A2: clone (in /usr/lib64/libc-2.30.so)

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

OK, tested this with cloud (after reinstalling json-c) and reviewed. Apart from the Valgrind thing (which is probably the freeing of the key) that Andrew mentioned already, I do not notice any obvious issues.

That issue is caused by create_private_key() being called twice (at about L1318 and L1327), allocating aclk_private_key twice.

@amoss
Contributor

amoss commented Apr 9, 2020

The call after the loop (that does not check the return value) can be removed. Perhaps we should check if aclk_private_key != NULL inside create_private_key and just return success in that case?

All the tests look good though, I'm about to approve this.

…ed and the ACLK main thread terminates immediately

* Remove redundant aclk_private_key generation
@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

The call after the loop (that does not check the return value) can be removed. Perhaps we should check if aclk_private_key != NULL inside create_private_key and just return success in that case?

All the tests look good though, I'm about to approve this.

Yes, I removed that call. The Valgrind error went away.

@amoss
Contributor

amoss commented Apr 9, 2020

Question:

We disable thread cancelability before the main loop, but we have a cleanup handler registered and it is removed as the last step in the thread main.

  • Why do we have a cleanup if it is not used?
  • Is there anything in that cleanup that we should be doing?

@underhood
Contributor

Valgrind looks good now:

==89546== LEAK SUMMARY:
==89546==    definitely lost: 0 bytes in 0 blocks
==89546==    indirectly lost: 0 bytes in 0 blocks
==89546==      possibly lost: 0 bytes in 0 blocks
==89546==    still reachable: 4,918,562 bytes in 42,172 blocks
==89546==         suppressed: 104 bytes in 1 blocks

@stelfrag
Collaborator Author

stelfrag commented Apr 9, 2020

Question:

We disable thread cancelability before the main loop, but we have a cleanup handler registered and it is removed as the last step in the thread main.

  • Why do we have a cleanup if it is not used?
  • Is there anything in that cleanup that we should be doing?

The netdata_thread_cleanup_pop will execute the cleanup.

@amoss
Contributor

amoss commented Apr 9, 2020

The netdata_thread_cleanup_pop will execute the cleanup

Yes :) But why? If it is only ever called in that way, why are we using a cleanup handler? I.e., I know it is called in the normal flow, but it is never called in an abnormal flow.

@mfundul
Contributor

mfundul commented Apr 9, 2020

Guys, ideally we shouldn't add anything POSIX-thread-specific, to facilitate the move to libuv threads. Thread cancelability is not supported in libuv.

@amoss
Contributor

amoss commented Apr 9, 2020

True, but without disabling cancelability, the code after the main loop does not get called: the threads are killed before the netdata_exit flag is set.

@amoss
Contributor

amoss commented Apr 9, 2020

@stelfrag spotted this when he was cleaning up. I was going to ask you about 28d5018f3.

underhood previously approved these changes Apr 9, 2020

@amoss amoss (Contributor) left a comment

Tests all work, random poking around seems stable. Changes to the code are good.

@underhood underhood (Contributor) left a comment

LGTM

@stelfrag stelfrag merged commit 764a067 into netdata:master Apr 9, 2020
@stelfrag stelfrag deleted the aclk_cleanup_8483 branch April 9, 2020 16:06
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request May 21, 2020
Improved ACLK memory management and shutdown sequence

Successfully merging this pull request may close these issues.

ACLK code cleanup
4 participants