Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACLK: Implemented Last Will and Testament #8410

Merged
merged 14 commits into from Mar 18, 2020

Conversation

stelfrag
Copy link
Collaborator

Fixes #8369

Summary

Defines a Last Will and Testament message that the broker will send to the outbound/meta
topic if an unexpected (agent or link) failure occurs.

When the agent performs a normal shutdown an MQTT disconnect packet notifies the broker
that the LWT message should not be sent. In that case a different disconnect message is transmitted and indicates a graceful agent shutdown.

Component Name

ACLK

Test Plan
Graceful agent shutdown
  • Start a claimed agent
  • Subscribe to the broker on the outbound/meta topic
  • Verify ACLK is up and running
  • Notice the on connect messages
  • Press control - C or send a kill signal to the agent
    • Notice a LWT message on outbound/meta similar to
      • { "type" : "disconnect", "version" : 1, "msg-id" : "e7558504-eb6f-4bdf-9700-df972e0940e6", "timestamp" : 1584372301, "payload" : "graceful" }

Abnormal agent shutdown
  • Start a claimed agent
  • Subscribe to the broker on the outbound/meta topic
  • Verify ACLK is up and running
  • Notice the on connect messages
  • Send kill -9 signal to the agent
    • Notice a LWT message on outbound/meta similar to
      • { "type" : "disconnect", "version" : 1, "msg-id" : "8ecde43f-5e97-4443-b314-5abab793a8d7", "timestamp" : 1584374934, "payload" : "unexpected" }
Additional Information

@squash-labs
Copy link

squash-labs bot commented Mar 16, 2020

Manage this branch in Squash

Test this branch here: https://stelfraglast-will-testament-83-fy4pe.squash.io

@stelfrag stelfrag requested review from amoss and underhood and removed request for cosmix and ktsaou March 16, 2020 16:16
@stelfrag stelfrag changed the title ACLK: Implemented Last Will Testament ACLK: Implemented Last Will and Testament Mar 16, 2020
@underhood
Copy link
Contributor

underhood commented Mar 16, 2020

When i CTRL+C i get SEGV

==253313== Memcheck, a memory error detector
==253313== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==253313== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==253313== Command: ./netdata -D
==253313== 
2020-03-16 18:14:56: netdata INFO  : MAIN : SIGNAL: Not enabling reaper
^C==253313== Thread 9 ACLK_Main:
==253313== Invalid read of size 8
==253313==    at 0x50F40A: mosquitto_want_write (in /home/timo/projects/netdata/netdata/netdata)
==253313==    by 0x4A5F0C: _link_mosquitto_write (mqtt.c:183)
==253313==    by 0x4A60D8: _link_event_loop (mqtt.c:227)
==253313==    by 0x4A300A: aclk_main_cleanup (agent_cloud_link.c:955)
==253313==    by 0x4A4263: aclk_main (agent_cloud_link.c:1314)
==253313==    by 0x437120: thread_start (threads.c:170)
==253313==    by 0x4E154E1: start_thread (in /usr/lib64/libpthread-2.30.so)
==253313==    by 0x4F316D2: clone (in /usr/lib64/libc-2.30.so)
==253313==  Address 0x80 is not stack'd, malloc'd or (recently) free'd
==253313== 
==253313== 
==253313== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==253313==  Access not within mapped region at address 0x80
==253313==    at 0x50F40A: mosquitto_want_write (in /home/timo/projects/netdata/netdata/netdata)
==253313==    by 0x4A5F0C: _link_mosquitto_write (mqtt.c:183)
==253313==    by 0x4A60D8: _link_event_loop (mqtt.c:227)
==253313==    by 0x4A300A: aclk_main_cleanup (agent_cloud_link.c:955)
==253313==    by 0x4A4263: aclk_main (agent_cloud_link.c:1314)
==253313==    by 0x437120: thread_start (threads.c:170)
==253313==    by 0x4E154E1: start_thread (in /usr/lib64/libpthread-2.30.so)
==253313==    by 0x4F316D2: clone (in /usr/lib64/libc-2.30.so)
==253313==  If you believe this happened as a result of a stack
==253313==  overflow in your program's main thread (unlikely but
==253313==  possible), you can try to increase the size of the
==253313==  main thread stack using the --main-stacksize= flag.
==253313==  The main thread stack size used in this run was 8388608.
==253313== 
==253313== HEAP SUMMARY:
==253313==     in use at exit: 5,336,840 bytes in 37,997 blocks
==253313==   total heap usage: 63,693 allocs, 25,696 frees, 12,616,620 bytes allocated
==253313== 
==253313== LEAK SUMMARY:
==253313==    definitely lost: 114 bytes in 1 blocks
==253313==    indirectly lost: 0 bytes in 0 blocks
==253313==      possibly lost: 8,016 bytes in 152 blocks
==253313==    still reachable: 5,328,606 bytes in 37,843 blocks
==253313==         suppressed: 104 bytes in 1 blocks
==253313== Rerun with --leak-check=full to see details of leaked memory
==253313== 
==253313== For lists of detected and suppressed errors, rerun with: -s
==253313== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Seems to be related to CTRL+C when node not claimed.

2020-03-16 18:14:35: netdata INFO  : MAIN : EXIT: Stopping master thread: ACLK_Main
2020-03-16 18:14:35: netdata INFO  : ACLK_Main : cleaning up...
2020-03-16 18:14:39: netdata INFO  : ACLK_Main : thread created with task id 252934
2020-03-16 18:14:39: netdata INFO  : ACLK_Main : set name of thread 252934 to ACLK_Main
2020-03-16 18:14:39: netdata INFO  : ACLK_Main : Waiting for netdata to be ready
2020-03-16 18:14:39: netdata INFO  : ACLK_Main : Setting ACLK target host=netdata.cloud port=443 from https://netdata.cloud
2020-03-16 18:14:39: netdata INFO  : ACLK_Main : Waiting for netdata to be claimed
2020-03-16 18:14:47: netdata INFO  : MAIN : EXIT: Stopping master thread: ACLK_Main
2020-03-16 18:14:47: netdata INFO  : ACLK_Main : cleaning up...
2020-03-16 18:14:59: netdata INFO  : ACLK_Main : thread created with task id 253493
2020-03-16 18:15:00: netdata INFO  : ACLK_Main : set name of thread 253493 to ACLK_Main
2020-03-16 18:15:00: netdata INFO  : ACLK_Main : Waiting for netdata to be ready
2020-03-16 18:15:00: netdata INFO  : MAIN : EXIT: Stopping master thread: ACLK_Main
2020-03-16 18:15:00: netdata INFO  : ACLK_Main : Setting ACLK target host=netdata.cloud port=443 from https://netdata.cloud
2020-03-16 18:15:00: netdata INFO  : ACLK_Main : Waiting for netdata to be claimed
2020-03-16 18:15:00: netdata INFO  : ACLK_Main : cleaning up...

When we are stuck waiting for claiming and you call _link_event_loop from aclk_main_cleanup the mosquitto is not initialized yet. Causing SEGV

Copy link
Contributor

@underhood underhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SEGV.
When we are stuck waiting for claiming (unclaimed agent) and you call _link_event_loop from aclk_main_cleanup (by hitting CTRL+C) the mosquitto is not initialized yet. Causing SEGV

@stelfrag
Copy link
Collaborator Author

_link_event_loop

Thanks Timo, will fix !!

@stelfrag stelfrag requested a review from underhood March 16, 2020 18:37
@stelfrag stelfrag dismissed underhood’s stale review March 16, 2020 18:41

Added check to make sure agent is claimed

amoss
amoss previously requested changes Mar 16, 2020
aclk/agent_cloud_link.c Outdated Show resolved Hide resolved
aclk/agent_cloud_link.h Outdated Show resolved Hide resolved
aclk/mqtt.c Outdated Show resolved Hide resolved
Copy link
Contributor

@underhood underhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some warrnings:

 CC       aclk/agent_cloud_link.o
aclk/agent_cloud_link.c: In function ‘aclk_main_cleanup’:
aclk/agent_cloud_link.c:962:9: warning: implicit declaration of function ‘aclk_lws_wss_mqtt_layer_disconect_notif’ [-Wimplicit-function-declaration]
  962 |         aclk_lws_wss_mqtt_layer_disconect_notif();
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  CC       aclk/mqtt.o
aclk/mqtt.c: In function ‘_link_set_lwt’:
aclk/mqtt.c:271:19: warning: implicit declaration of function ‘get_topic’ [-Wimplicit-function-declaration]
  271 |     final_topic = get_topic(sub_topic, topic, ACLK_MAX_TOPIC);
      |                   ^~~~~~~~~
aclk/mqtt.c:271:17: warning: assignment to ‘char *’ from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
  271 |     final_topic = get_topic(sub_topic, topic, ACLK_MAX_TOPIC);
      |                 ^
  CC       aclk/aclk_lws_wss_client.o


aclk/mqtt.c:271:19: warning: type of ‘get_topic’ does not match original declaration [-Wlto-type-mismatch]
  271 |     final_topic = get_topic(sub_topic, topic, ACLK_MAX_TOPIC);
      |                   ^
aclk/agent_cloud_link.c:469:7: note: return value type mismatch
  469 | char *get_topic(char *sub_topic, char *final_topic, int max_size)
      |       ^
aclk/agent_cloud_link.c:469:7: note: ‘get_topic’ was previously declared here
aclk/agent_cloud_link.c:469:7: note: code may be misoptimized unless ‘-fno-strict-aliasing’ is used

@underhood
Copy link
Contributor

The unclaimed agent SEGV appears to be fixed and both test cases with claimed agent seem to work. I will approve when compiler warnings and related problems are fixed.

@stelfrag
Copy link
Collaborator Author

The unclaimed agent SEGV appears to be fixed and both test cases with claimed agent seem to work. I will approve when compiler warnings and related problems are fixed.

Added extern for:

  • `aclk_lws_wss_mqtt_layer_disconect_notif
  • get_topic

functions to clear the compilter warnings.

aclk/agent_cloud_link.h Outdated Show resolved Hide resolved
@stelfrag stelfrag dismissed stale reviews from amoss and underhood March 17, 2020 11:19

Now probing the lws layer to call the event loop enough times

@stelfrag stelfrag requested a review from underhood March 17, 2020 11:21
@stelfrag stelfrag requested review from amoss and removed request for Ferroin, joelhans, knatsakis, ncmans, netdatabot, prologic, a user, thiagoftsm and vlvkobal March 17, 2020 14:28
@github-actions github-actions bot added area/exporting area/build Build system (autotools and cmake). area/ci area/docs area/packaging Packaging and operating systems support labels Mar 17, 2020
amoss
amoss previously requested changes Mar 17, 2020
Copy link
Contributor

@amoss amoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It segfaults when I hit ctrl-c at random points during the startup-up / pop-corning phase:

==12081== Process terminating with default action of signal 11 (SIGSEGV)
==12081==  Access not within mapped region at address 0x80
==12081==    at 0x203B97: mosquitto_want_write (in /home/amoss/netdata/dev-install3/netdata/usr/sbin/netdata)
==12081==    by 0x1704FB: UnknownInlinedFun (mqtt.c:183)
==12081==    by 0x1704FB: UnknownInlinedFun (mqtt.c:227)
==12081==    by 0x1704FB: aclk_main_cleanup (agent_cloud_link.c:982)
==12081==    by 0x1728B7: aclk_main (agent_cloud_link.c:1293)
==12081==    by 0x1E3B34: thread_start (threads.c:170)
==12081==    by 0x523FFA2: start_thread (pthread_create.c:486)
==12081==    by 0x53544CE: clone (clone.S:95)

@stelfrag
Copy link
Collaborator Author

It segfaults when I hit ctrl-c at random points during the startup-up / pop-corning phase:

==12081== Process terminating with default action of signal 11 (SIGSEGV)
==12081==  Access not within mapped region at address 0x80
==12081==    at 0x203B97: mosquitto_want_write (in /home/amoss/netdata/dev-install3/netdata/usr/sbin/netdata)
==12081==    by 0x1704FB: UnknownInlinedFun (mqtt.c:183)
==12081==    by 0x1704FB: UnknownInlinedFun (mqtt.c:227)
==12081==    by 0x1704FB: aclk_main_cleanup (agent_cloud_link.c:982)
==12081==    by 0x1728B7: aclk_main (agent_cloud_link.c:1293)
==12081==    by 0x1E3B34: thread_start (threads.c:170)
==12081==    by 0x523FFA2: start_thread (pthread_create.c:486)
==12081==    by 0x53544CE: clone (clone.S:95)

Ok will take a look asap

@amoss
Copy link
Contributor

amoss commented Mar 17, 2020

Here is another one (might not be due to this PR) :

==17942== Process terminating with default action of signal 11 (SIGSEGV)
==17942==  Access not within mapped region at address 0x168B39D0
==17942==    at 0x52476C9: pthread_cancel (pthread_cancel.c:33)
==17942==    by 0x1DFED0: netdata_thread_cancel (threads.c:203)
==17942==    by 0x1732D0: socket_listen_main_static_threaded_cleanup (static-threaded.c:428)
==17942==    by 0x175E30: socket_listen_main_static_threaded (static-threaded.c:458)
==17942==    by 0x1E3B34: thread_start (threads.c:170)
==17942==    by 0x523FFA2: start_thread (pthread_create.c:486)
==17942==    by 0x53544CE: clone (clone.S:95)

@stelfrag stelfrag dismissed amoss’s stale review March 17, 2020 18:47

Fixed random segfaults

@stelfrag stelfrag requested a review from amoss March 17, 2020 18:47
@stelfrag
Copy link
Collaborator Author

It segfaults when I hit ctrl-c at random points during the startup-up / pop-corning phase:

==12081== Process terminating with default action of signal 11 (SIGSEGV)
==12081==  Access not within mapped region at address 0x80
==12081==    at 0x203B97: mosquitto_want_write (in /home/amoss/netdata/dev-install3/netdata/usr/sbin/netdata)
==12081==    by 0x1704FB: UnknownInlinedFun (mqtt.c:183)
==12081==    by 0x1704FB: UnknownInlinedFun (mqtt.c:227)
==12081==    by 0x1704FB: aclk_main_cleanup (agent_cloud_link.c:982)
==12081==    by 0x1728B7: aclk_main (agent_cloud_link.c:1293)
==12081==    by 0x1E3B34: thread_start (threads.c:170)
==12081==    by 0x523FFA2: start_thread (pthread_create.c:486)
==12081==    by 0x53544CE: clone (clone.S:95)

Ok will take a look asap

Cause by the internal mosq structure being destroyed and set to null.

When the cleanup code is called it is not certain that netdata_exit variable is set to 1 (yet) which the code that tears down the link assumed.

Copy link
Contributor

@amoss amoss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-tested. Both cases work. LGTM.

@stelfrag stelfrag merged commit 2c716dc into netdata:master Mar 18, 2020
stelfrag added a commit to stelfrag/netdata that referenced this pull request Mar 19, 2020
* Added support for Last Will and Testament to the ACLK
* On normal agent shutdown an alternate "graceful shutdown" message is published
@stelfrag stelfrag deleted the last_will_testament_8369 branch April 4, 2020 10:32
Saruspete pushed a commit to Saruspete/netdata that referenced this pull request May 21, 2020
* Added support for Last Will and Testament to the ACLK
* On normal agent shutdown an alternate "graceful shutdown" message is published
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/build Build system (autotools and cmake). area/ci area/docs area/exporting area/packaging Packaging and operating systems support
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ACLK Last Will Testament implementation
3 participants