[BUG] Parent agent ACLK is connected, but cloud says is offline #293
Tried to reinstall from source (this is a Gentoo system):
# netdata -v
netdata v1.33.1-30-g0a51695ef
# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy
I can't understand why this is not New. This happened once:
Still not online at the cloud: |
The connection is being dropped:
|
This seems to be affected by https://github.com/netdata/product/issues/2803. As far as the cloud is concerned, your agent is connected but is missing relevant entries in other tables. Because those entries are missing, your connection status is not represented correctly and the agent is kicked out by the sanitization script every 15 minutes. As a temporary workaround, re-claiming should fix the problem while we investigate and fix the root cause. Could you also clarify a few things? This might help the investigation.
|
Yes, very old indeed
Yes, it was working up to October or November. It is a Gentoo box and it is a pain to update, so it stayed online for as long as the certificates were valid.
Yes, yesterday. |
Even an agent that supports the new protocol still has to get the green light from the cloud to use it. If the cloud does not allow the new protocol, the agent falls back to the old one.
So the connection is established successfully, and even the connect payload is sent and confirmed as received by VerneMQ. We know that by
This suggests the connection was closed on purpose by the cloud, by closing the websocket link with EC 1000. Now I have a few comments about this:
I know everyone has a lot of work, but we should definitely rethink and improve this. We have multiple opportunities and clean ways to close the agent connection cleanly and with a "reason message". |
We've just had a look with @papazach and in regards to this:
It's also caused by the missing entries in the database. In essence, the missing entries cause this whitelisted space to be seen as not whitelisted, because the cloud is unable to match this claim ID to a space. If we fix the underlying issue, the env service will correctly respond with proto as the encoding. |
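The fallback described in this exchange can be sketched as follows. This is only an illustration in Python, not the agent's code: the function name `select_protocol` and its parameters are hypothetical, and the only value taken from the thread is `proto` as the encoding the env service returns when the new protocol is allowed.

```python
def select_protocol(agent_supports_new: bool, cloud_encoding: str) -> str:
    """Return the protocol label as printed by `netdatacli aclk-state`.

    Hypothetical helper: the agent uses the new protocol only when it
    supports it AND the cloud answered with "proto" as the encoding.
    Any other answer is treated as "not allowed" and falls back to Legacy.
    """
    if agent_supports_new and cloud_encoding == "proto":
        return "New"
    return "Legacy"


# An agent with new-protocol support whose space is not recognized by the
# cloud ("other" is a placeholder for any non-"proto" answer).
print(select_protocol(True, "other"))   # Legacy
print(select_protocol(True, "proto"))   # New
```

Under this reading, the `Used Cloud Protocol: Legacy` line reported above is consistent with an agent that supports the new protocol but was not given the green light by the cloud.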
@underhood @vkuznecovas thank you both for the info. All ACLK problems are very important and we should deal with them as a first priority. Another important issue is that the agent believes it is connected:
# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy
But it is not really! It is only connected at the transport level. At the application level, our backend is not acknowledging the agent. We should probably introduce another message from the cloud node service to the agent, to let the agent know it is really connected. So, while in this state, the agent should say:
|
I've deployed a fix for this recently, could you please check if everything looks ok on your side @ktsaou? |
Still offline on the UI.
# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy
Restarting netdata to verify:
# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: New
OK, New protocol now... And netdata crashed after a minute... |
# gdb $(which netdata) /tmp/core-AS_pi2-11-0-0-3212-1645527438
GNU gdb (Gentoo 11.2 vanilla) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/netdata...
[New LWP 3769]
[New LWP 3212]
[New LWP 3365]
[New LWP 3363]
[New LWP 3378]
[New LWP 3364]
[New LWP 3358]
[New LWP 3379]
[New LWP 3214]
[New LWP 3384]
[New LWP 3362]
[New LWP 3369]
[New LWP 3391]
[New LWP 3360]
[New LWP 3375]
[New LWP 3370]
[New LWP 3746]
[New LWP 3361]
[New LWP 3770]
[New LWP 3371]
[New LWP 3374]
[New LWP 3751]
[New LWP 3785]
[New LWP 3372]
[New LWP 3376]
[New LWP 3766]
[New LWP 3380]
[New LWP 3812]
[New LWP 3382]
[New LWP 3767]
[New LWP 3396]
[New LWP 3813]
[New LWP 3385]
[New LWP 3784]
[New LWP 3397]
[New LWP 3786]
[New LWP 3386]
[New LWP 3844]
[New LWP 3718]
[New LWP 3787]
[New LWP 3412]
[New LWP 3421]
[New LWP 3703]
[New LWP 3708]
[New LWP 3761]
[New LWP 3768]
[New LWP 3399]
[New LWP 3845]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/sbin/netdata -P /run/netdata/netdata.pid'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055e0c6f991fa in sql_build_node_info ()
[Current thread is 1 (Thread 0x7fa90e61b640 (LWP 3769))]
(gdb)
(gdb)
(gdb)
(gdb) bt
#0 0x000055e0c6f991fa in sql_build_node_info ()
#1 0x000055e0c6f95c52 in aclk_database_worker ()
#2 0x00007fa92565efef in start_thread (arg=0x7fa90e61b640) at pthread_create.c:463
#3 0x00007fa92559600f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) |
Rebuilding netdata with debug symbols without optimization flags, to make sure the trace is right. Let's hope it will crash again... |
The crash is probably related to netdata/netdata#12210 |
It doesn't crash now. I see these in the logs:
Still the agent is not online. In the agent access.log, I see only 2 children are being queried:
|
I've checked the SoT and materialized views. What you should be seeing on the cloud is 4 reachable nodes and 2 unreachable; only one of them seems to be connected directly to the cloud, which is the
If you're not seeing this, could you clear the IndexedDB in the browser and refresh the page? I know that some issues related to that were reported recently. |
@vkuznecovas I deleted indexdb and refreshed. Still no luck...
@underhood how can I find the claim IDs of the children on the parent? Where are they stored?
|
I think I might have found the cause for this. Let me verify the findings and I'll update this ticket. There might be a desync issue between the nodes and the spaceroom service. |
@underhood @stelfrag I have these logs every 2 seconds on my parent (
|
will check ! |
I've done a detailed write-up of what we've discovered on the cloud side of things in the internal ticket here. A short summary: nodes that have been inactive for more than 60 days are deleted from the cloud. However, the deletion is not propagated properly, so you see nodes that are no longer there. In reality, the nodes were re-created once you turned your machine back on, but they are not assigned to any room. What you're seeing in the rooms are skeletons. |
@vlvkobal please take a look (rrdset_set_name) |
Nothing has changed except variable names in this part of the code. We should keep the message at the |
Crashed again:
# gdb $(which netdata) core-AS_pi3-11-0-0-11483-1645556781
GNU gdb (Gentoo 11.2 vanilla) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/netdata...
[New LWP 12101]
[New LWP 11483]
[New LWP 11635]
[New LWP 11636]
[New LWP 11628]
[New LWP 11640]
[New LWP 11485]
[New LWP 11647]
[New LWP 11643]
[New LWP 11630]
[New LWP 11629]
[New LWP 11654]
[New LWP 11631]
[New LWP 11632]
[New LWP 11644]
[New LWP 11656]
[New LWP 11637]
[New LWP 11646]
[New LWP 11633]
[New LWP 11658]
[New LWP 11638]
[New LWP 11649]
[New LWP 11663]
[New LWP 11639]
[New LWP 11657]
[New LWP 11662]
[New LWP 11692]
[New LWP 11645]
[New LWP 11660]
[New LWP 11933]
[New LWP 11710]
[New LWP 11648]
[New LWP 11676]
[New LWP 12013]
[New LWP 11650]
[New LWP 11970]
[New LWP 12012]
[New LWP 11653]
[New LWP 12057]
[New LWP 12037]
[New LWP 12059]
[New LWP 12038]
[New LWP 11664]
[New LWP 12023]
[New LWP 12054]
[New LWP 12056]
[New LWP 12058]
[New LWP 12091]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/sbin/netdata -P /run/netdata/netdata.pid'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 str2uint32_t (s=0x0) at ./libnetdata/inlined.h:91
91 for(c = *s; c >= '0' && c <= '9' ; c = *(++s)) {
[Current thread is 1 (Thread 0x7f34b6e0e640 (LWP 12101))]
(gdb) bt
#0 str2uint32_t (s=0x0) at ./libnetdata/inlined.h:91
#1 sql_build_node_info (wc=wc@entry=0x5589886e1780, cmd=...) at database/sqlite/sqlite_aclk_node.c:38
#2 0x000055897f9f6a01 in aclk_database_worker (arg=0x5589886e1780) at database/sqlite/sqlite_aclk.c:499
#3 0x00007f34d3e8ffef in start_thread (arg=0x7f34b6e0e640) at pthread_create.c:463
#4 0x00007f34d3dc700f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) |
@stelfrag found the reason for these crashes. A few child Netdata agents are very old and some of their host info structures are null; the parent does not check for that before using them, so it crashes. |
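To make the failure mode concrete: the backtrace shows `str2uint32_t` being called with `s=0x0`, and the digit loop at `inlined.h:91` dereferences `s` immediately. Below is a minimal Python rendition of that loop with a guard for a missing value. It is an illustration of the kind of check that was missing, not the actual fix merged into netdata; the name `str2uint32` and the choice to return 0 for a missing value are assumptions.

```python
from typing import Optional


def str2uint32(s: Optional[str]) -> int:
    """Parse the leading decimal digits of s, stopping at the first non-digit.

    A missing value (None, standing in for the NULL pointer in the core dump)
    yields 0 instead of crashing. Returning 0 is an assumption made for this
    sketch.
    """
    if s is None:
        return 0
    n = 0
    for c in s:
        if not ("0" <= c <= "9"):
            break
        # Keep the result within 32 bits, mimicking unsigned wraparound.
        n = (n * 10 + (ord(c) - ord("0"))) & 0xFFFFFFFF
    return n


print(str2uint32("42abc"))  # 42
print(str2uint32(None))     # 0
```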
@vkuznecovas check this:
# curl 'http://localhost:19999/api/v1/info'
{
"version": "v1.33.1-45-g4d0750620",
"uid": "c7a9cf8c-1882-11e6-944b-74d435e7ace6",
"mirrored_hosts": [
"box",
"rpi2b-1",
"pi2",
"bedtv",
"pi1",
"pi3"
],
"mirrored_hosts_status": [
{ "guid": "c7a9cf8c-1882-11e6-944b-74d435e7ace6", "reachable": true, "hops": 0, "claim_id": "c7a9cf8c-1882-11e6-944b-74d435e7ace6" },
{ "guid": "44bbfb16-827f-11ea-bc9f-b827eb91870b", "reachable": true, "hops": 1, "claim_id": "44bbfb16-827f-11ea-bc9f-b827eb91870b" },
{ "guid": "d5874cc6-1afb-11e6-859b-b827ebd15026", "reachable": true, "hops": 1, "claim_id": null },
{ "guid": "eef2e7b4-1976-11e6-ae19-7cdd9077342a", "reachable": true, "hops": 1, "claim_id": "eef2e7b4-1976-11e6-ae19-7cdd9077342a" },
{ "guid": "a3bb4986-197a-11e6-9324-b827ebe850c4", "reachable": true, "hops": 1, "claim_id": null },
{ "guid": "ac79784a-1afb-11e6-b71c-b827eb19c746", "reachable": true, "hops": 1, "claim_id": null }
], |
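The entries with a null `claim_id` in the response above can be picked out with a few lines of Python. This is a minimal sketch: the sample below copies only the `mirrored_hosts_status` array from the output in this comment, and the helper name `unclaimed_guids` is hypothetical.

```python
import json
from typing import List

# mirrored_hosts_status as pasted above (the rest of the response is omitted).
SAMPLE = """
{
  "mirrored_hosts_status": [
    { "guid": "c7a9cf8c-1882-11e6-944b-74d435e7ace6", "reachable": true, "hops": 0, "claim_id": "c7a9cf8c-1882-11e6-944b-74d435e7ace6" },
    { "guid": "44bbfb16-827f-11ea-bc9f-b827eb91870b", "reachable": true, "hops": 1, "claim_id": "44bbfb16-827f-11ea-bc9f-b827eb91870b" },
    { "guid": "d5874cc6-1afb-11e6-859b-b827ebd15026", "reachable": true, "hops": 1, "claim_id": null },
    { "guid": "eef2e7b4-1976-11e6-ae19-7cdd9077342a", "reachable": true, "hops": 1, "claim_id": "eef2e7b4-1976-11e6-ae19-7cdd9077342a" },
    { "guid": "a3bb4986-197a-11e6-9324-b827ebe850c4", "reachable": true, "hops": 1, "claim_id": null },
    { "guid": "ac79784a-1afb-11e6-b71c-b827eb19c746", "reachable": true, "hops": 1, "claim_id": null }
  ]
}
"""


def unclaimed_guids(info: dict) -> List[str]:
    """Return the machine GUIDs of mirrored hosts whose claim_id is null."""
    return [
        host["guid"]
        for host in info.get("mirrored_hosts_status", [])
        if host.get("claim_id") is None
    ]


for guid in unclaimed_guids(json.loads(SAMPLE)):
    print(guid)
```

On a live parent, the same function could be fed the parsed body of `http://localhost:19999/api/v1/info` instead of the sample.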
@stelfrag @underhood the nodes above that have null claim id ( |
It seems that the agent didn't claim the
# grep 'Queuing registration for ' /var/log/netdata/error.log
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=c7a9cf8c-1882-11e6-944b-74d435e7ace6, hops=0
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=44bbfb16-827f-11ea-bc9f-b827eb91870b, hops=1
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=9a212702-827f-11ea-abb0-b827ebd78c48, hops=1
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=5e58f59c-31f5-11e8-bcf2-408d5c6fcbd0, hops=1
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=eef2e7b4-1976-11e6-ae19-7cdd9077342a, hops=1
The machine GUIDs of the 3 nodes are not in the log. |
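The same comparison can be done mechanically. The sketch below extracts the host GUIDs from the `Queuing registration for` lines quoted in this comment and checks them against the three GUIDs that `/api/v1/info` reported with a null `claim_id` earlier in the thread. The regular expression and the helper name `queued_hosts` are assumptions based only on the log format shown here.

```python
import re
from typing import Iterable, Set

# Matches the log format quoted above; an assumption, not a documented format.
QUEUED_RE = re.compile(r"Queuing registration for host=([0-9a-f-]{36}), hops=(\d+)")

LOG = """\
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=c7a9cf8c-1882-11e6-944b-74d435e7ace6, hops=0
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=44bbfb16-827f-11ea-bc9f-b827eb91870b, hops=1
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=9a212702-827f-11ea-abb0-b827ebd78c48, hops=1
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=5e58f59c-31f5-11e8-bcf2-408d5c6fcbd0, hops=1
2022-02-22 12:55:55: netdata INFO : ACLK_Main : Queuing registration for host=eef2e7b4-1976-11e6-ae19-7cdd9077342a, hops=1
"""

# GUIDs that /api/v1/info listed with "claim_id": null.
UNCLAIMED = {
    "d5874cc6-1afb-11e6-859b-b827ebd15026",
    "a3bb4986-197a-11e6-9324-b827ebe850c4",
    "ac79784a-1afb-11e6-b71c-b827eb19c746",
}


def queued_hosts(log_lines: Iterable[str]) -> Set[str]:
    """Collect the host GUIDs for which a registration was queued."""
    hosts = set()
    for line in log_lines:
        match = QUEUED_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return hosts


missing = sorted(UNCLAIMED - queued_hosts(LOG.splitlines()))
print(missing)  # GUIDs with a null claim_id that never appear in the log
```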
The |
@vlvkobal We can add the info (move it a bit) if there is actually a clash |
Update: I installed netdata/netdata#12223. I now see all the nodes at the cloud, but:
Generally, no |
Identified the problem with this. Children running a version < 1.31 will not correctly register to the cloud (via the parent). Reason: |
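For anyone wanting to check which children fall under that threshold, here is a rough sketch that compares a version string in the form printed by `netdata -v` or the `version` field of `/api/v1/info` against 1.31. The helper names are hypothetical, and treating an unparsable version as "too old" is an assumption made to stay on the safe side.

```python
import re
from typing import Optional, Tuple


def major_minor(version: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from strings such as 'v1.33.1-45-g4d0750620'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def too_old_for_cloud(version: str) -> bool:
    """True when the version is below 1.31, the threshold named in this comment.

    Unparsable strings are reported as too old (an assumption of this sketch).
    """
    parsed = major_minor(version)
    return parsed is None or parsed < (1, 31)


print(too_old_for_cloud("v1.33.1-45-g4d0750620"))  # False
print(too_old_for_cloud("v1.30.0"))                # True
```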
I reinstalled netdata with the latest merges. No change. The status is exactly the same as above #293 (comment) |
Yes, the issue I mentioned in #293 (comment) is not fixed (PR coming up), but the linked "crash fix" auto-closed this issue. |
Which PR is this? |
|
I have an agent running:
# netdata -v
netdata v1.33.1-30-nightly
ACLK log:
Claim Id:
# cat /var/lib/netdata/cloud.d/claimed_id
c7a9cf8c-1882-11e6-944b-74d435e7ace6
# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy
The node is called box and is a parent node.