[BUG] Parent agent ACLK is connected, but cloud says is offline #293

Closed
ktsaou opened this issue Feb 20, 2022 · 37 comments · Fixed by netdata/netdata#12223
Labels: bug (Something isn't working), mgmt-navigation-team

Comments

ktsaou (Member) commented Feb 20, 2022

I have an agent running:

# netdata -v
netdata v1.33.1-30-nightly

ACLK log:

2022-02-20 13:06:42: netdata ERROR : ACLK_Main : ACLK localhost popcorn timer finished
2022-02-20 13:06:42: netdata INFO  : ACLK_Main : Starting 2 query threads.
2022-02-20 13:06:42: netdata INFO  : ACLK_Query_0 : thread created with task id 5540
2022-02-20 13:06:42: netdata INFO  : ACLK_Query_0 : set name of thread 5540 to ACLK_Query_0
2022-02-20 13:06:42: netdata INFO  : ACLK_Query_1 : thread created with task id 5541
2022-02-20 13:06:42: netdata INFO  : ACLK_Query_1 : set name of thread 5541 to ACLK_Query_1
2022-02-20 13:15:11: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2022-02-20 13:15:11: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-02-20 13:15:11: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-02-20 13:15:11: netdata INFO  : ACLK_Main : Attempting connection now
2022-02-20 13:15:11: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 13:15:11: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-02-20 13:15:12: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 13:15:12: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-02-20 13:15:12: netdata INFO  : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-02-20 13:15:12: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-02-20 13:15:13: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-02-20 13:15:13: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-02-20 13:15:13: netdata ERROR : ACLK_Main : Sending `connect` payload immediately as popcorning was finished already.
2022-02-20 13:30:10: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2022-02-20 13:30:10: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-02-20 13:30:10: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-02-20 13:30:10: netdata INFO  : ACLK_Main : Attempting connection now
2022-02-20 13:30:11: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 13:30:11: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-02-20 13:30:11: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 13:30:11: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-02-20 13:30:12: netdata INFO  : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-02-20 13:30:12: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-02-20 13:30:12: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-02-20 13:30:13: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-02-20 13:30:13: netdata ERROR : ACLK_Main : Sending `connect` payload immediately as popcorning was finished already.
2022-02-20 13:45:12: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2022-02-20 13:45:12: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-02-20 13:45:12: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds
2022-02-20 13:45:12: netdata INFO  : ACLK_Main : Attempting connection now
2022-02-20 13:45:13: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 13:45:13: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-02-20 13:45:13: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 13:45:13: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-02-20 13:45:14: netdata INFO  : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-02-20 13:45:14: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-02-20 13:45:14: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-02-20 13:45:14: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-02-20 13:45:14: netdata ERROR : ACLK_Main : Sending `connect` payload immediately as popcorning was finished already.

Claim Id:

# cat /var/lib/netdata/cloud.d/claimed_id 
c7a9cf8c-1882-11e6-944b-74d435e7ace6
# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy

The node is called box and is a parent node.

ktsaou added the bug label on Feb 20, 2022
ktsaou (Member, Author) commented Feb 20, 2022

I tried reinstalling from source (this is a Gentoo system):

# netdata -v
netdata v1.33.1-30-g0a51695ef
# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy

I can't understand why Used Cloud Protocol is Legacy and not New.

This happened once:

2022-02-20 15:15:07: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2022-02-20 15:15:07: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-02-20 15:15:07: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds

2022-02-20 15:15:07: netdata INFO  : ACLK_Main : Attempting connection now
2022-02-20 15:15:07: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 15:15:07: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-02-20 15:15:08: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 409
2022-02-20 15:15:08: netdata ERROR : ACLK_Main : ACLK_OTP Challenge HTTP code not 200 OK (got 409)
2022-02-20 15:15:08: netdata ERROR : ACLK_Main : Cloud returned EC="TODO trace-id", Msg-Key:"ErrDuplicatedChallenge", Msg:"delay retry 1m0s: duplicated challenge", BlockRetry:false, Backoff:60s (-1 unset by cloud)
2022-02-20 15:15:08: netdata ERROR : ACLK_Main : Error passing Challenge/Response to get OTP
2022-02-20 15:15:08: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 60.000 seconds

Still not online at the cloud:

[screenshot: the node still shown as offline in Netdata Cloud]

ktsaou (Member, Author) commented Feb 20, 2022

The connection is being dropped:

2022-02-20 15:30:18: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2022-02-20 15:30:18: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-02-20 15:30:18: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds

2022-02-20 15:30:18: netdata INFO  : ACLK_Main : Attempting connection now
2022-02-20 15:30:19: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 15:30:19: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-02-20 15:30:19: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-20 15:30:19: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-02-20 15:30:20: netdata INFO  : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-02-20 15:30:20: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-02-20 15:30:20: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-02-20 15:30:21: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-02-20 15:30:21: netdata ERROR : ACLK_Main : Sending `connect` payload immediately as popcorning was finished already.

vkuznecovas commented:

This seems to be affected by https://github.com/netdata/product/issues/2803.

As far as the cloud is concerned, your agent is connected but is missing relevant entries in other tables. Since those entries are missing, your connection status is not represented correctly and the agent is kicked out by the sanitization script every 15 minutes.

As a temporary workaround, re-claiming should fix the problem while we investigate and fix the root cause.

Could you also clarify a few things? This might help the investigation.

  1. Is this an old agent?
  2. Has it ever worked with the cloud?
  3. Did you "resurrect" it recently?

ktsaou (Member, Author) commented Feb 21, 2022

  • Is this an old agent?

Yes, very old indeed

  • Has it ever worked with the cloud?

Yes, it was working up to October or November. It is a Gentoo box and it is a pain to update, so it was online up to the point the certificates were valid.

  • Did you "resurrect" it recently?

Yes, yesterday.

underhood commented Feb 21, 2022

I can't understand why Used Cloud Protocol is Legacy and not New.

Even an agent that supports the new protocol still has to get the green light from the cloud to use it. If the new protocol is not allowed by the cloud, we fall back to the old one.

2022-02-20 13:30:13: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-02-20 13:30:13: netdata ERROR : ACLK_Main : Sending `connect` payload immediately as popcorning was finished already.
2022-02-20 13:45:12: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.
2022-02-20 13:45:12: netdata ERROR : ACLK_Main : Connection Error or Dropped
2022-02-20 13:45:12: netdata INFO  : ACLK_Main : Wait before attempting to reconnect in 0.000 seconds

So the connection is established successfully, and even the connect payload is sent and confirmed as received by VerneMQ. We know that from Wait before attempting to reconnect in 0.000 seconds on the next connection attempt. This timer is reset (to 0) only if we get at least 3 PUBACKs on the previous connection, proving that VerneMQ confirmed delivery of at least 3 messages during the lifetime of that connection.
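To make the above concrete, here is a minimal sketch of that backoff-reset rule, assuming illustrative names (they are not the actual netdata identifiers):

```c
/* Hedged sketch of the reconnect-backoff reset described above.
 * Struct and function names are assumptions, not netdata's real code. */
#include <stdint.h>

#define PUBACKS_FOR_RESET 3           /* at least 3 confirmed deliveries on the last session */

struct aclk_session {
    uint32_t pubacks_received;        /* PUBACKs counted on the current connection */
    uint32_t reconnect_delay_s;       /* delay before the next reconnect attempt */
};

static void on_connection_dropped(struct aclk_session *s)
{
    if (s->pubacks_received >= PUBACKS_FOR_RESET)
        s->reconnect_delay_s = 0;     /* healthy session: retry immediately (the 0.000 seconds above) */
    else
        s->reconnect_delay_s = s->reconnect_delay_s ? s->reconnect_delay_s * 2 : 1;  /* back off */

    s->pubacks_received = 0;          /* start counting again on the next connection */
}
```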

2022-02-20 13:45:12: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: WebSocket server closed the connection with EC=1000. Without message.

This suggests the connection was closed on purpose by the cloud, by closing the WebSocket link with EC 1000.

Now I have a few comments about this:

  • Why not close the MQTT connection first?
  • Why close it with WebSocket EC = 1000? The WebSockets RFC says: "1000 indicates a normal closure, meaning that the purpose for which the connection was established has been fulfilled." The same EC was seen as the reason for cloud-initiated closures on other occasions as well. This is IMHO wrong: the purpose of the connection is permanent monitoring, so I do not think "its purpose has been fulfilled" is correct. Additionally, we also have the option to include an error message that will be logged for the user.
  • Additionally, we have the application-layer message DisconnectReq, which again can provide a description and a reason for the disconnect to be logged in the agent log.

I know everyone has a lot of work, but we should definitely rethink/improve this.

We have multiple opportunities and clean ways to close the agent connection cleanly and with a "reason message".
Yet we keep just closing the connection at the WebSocket layer... why not then just close the TCP socket? And even at that point we are not using the option we have to provide an error code and a reason.
Providing a reason and a proper EC (and, even better, using nicer methods of disconnecting) will in the end help us investigate things much faster.
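As a concrete illustration of "provide error code and reason" at the WebSocket layer: RFC 6455 lets the close frame carry a 2-byte status code followed by a UTF-8 reason string. A minimal, generic sketch (not the actual mqtt_wss code):

```c
/* Sketch: build a WebSocket close payload with a status code and a
 * human-readable reason, per RFC 6455 section 5.5.1.
 * Generic illustration only, not netdata's mqtt_wss implementation. */
#include <stdint.h>
#include <string.h>

static size_t ws_build_close_payload(uint8_t *buf, size_t buf_len,
                                     uint16_t status_code, const char *reason)
{
    size_t reason_len = reason ? strlen(reason) : 0;

    if (buf_len < 2 + reason_len)
        return 0;                                /* caller's buffer is too small */

    buf[0] = (uint8_t)(status_code >> 8);        /* status code, network byte order */
    buf[1] = (uint8_t)(status_code & 0xff);
    if (reason_len)
        memcpy(buf + 2, reason, reason_len);     /* optional UTF-8 reason text */

    return 2 + reason_len;
}

/* e.g. ws_build_close_payload(buf, sizeof(buf), 1001, "node removed from space"); */
```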

vkuznecovas commented Feb 21, 2022

We've just had a look with @papazach and in regards to this:

Used Cloud Protocol: Legacy

It's also caused by the missing entries in the database. In essence, the missing entries cause this whitelisted space to be seen as not whitelisted, because the cloud is unable to match this claim ID to a space. If we fix the underlying issue, the /env service will correctly respond with proto as the encoding.

ktsaou (Member, Author) commented Feb 21, 2022

@underhood @vkuznecovas thank you both for the info.

All ACLK problems are very important and we should deal with them as first priority.

Another important issue is that the agent believes it is connected:

# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy

But it is not really! It is only connected at the transport level. At the application level, our backend is not acknowledging the agent. We should probably introduce another message from the cloud node service to the agent, to let the agent know it is really connected.

So, while in this state, the agent should say:

Online: transport only
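For illustration, a rough sketch of the three states this would distinguish; the enum and names are hypothetical, only meant to show the proposal:

```c
/* Hypothetical sketch of the proposed reporting; not netdata's actual code. */
typedef enum {
    ACLK_OFFLINE = 0,        /* no MQTT-over-WSS session at all */
    ACLK_TRANSPORT_ONLY,     /* session established, cloud has not acknowledged the agent yet */
    ACLK_ONLINE              /* cloud node service confirmed the agent at the application level */
} aclk_online_state_t;

static const char *aclk_online_state_to_string(aclk_online_state_t state)
{
    switch (state) {
        case ACLK_TRANSPORT_ONLY: return "transport only";
        case ACLK_ONLINE:         return "Yes";
        default:                  return "No";
    }
}
```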

vkuznecovas commented:

I've deployed a fix for this recently; could you please check if everything looks OK on your side, @ktsaou?

ktsaou (Member, Author) commented Feb 22, 2022

Still offline on the UI.

# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: Legacy

Restarting netdata to verify:

# netdatacli aclk-state
ACLK Available: Yes
ACLK Implementation: Next Generation
New Cloud Protocol Support: Yes
Claimed: Yes
Claimed Id: c7a9cf8c-1882-11e6-944b-74d435e7ace6
Online: Yes
Used Cloud Protocol: New

ok, New protocol now...

And netdata crashed after a minute...

ktsaou (Member, Author) commented Feb 22, 2022

# gdb $(which netdata) /tmp/core-AS_pi2-11-0-0-3212-1645527438
GNU gdb (Gentoo 11.2 vanilla) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/netdata...
[New LWP 3769]
[New LWP 3212]
[New LWP 3365]
[New LWP 3363]
[New LWP 3378]
[New LWP 3364]
[New LWP 3358]
[New LWP 3379]
[New LWP 3214]
[New LWP 3384]
[New LWP 3362]
[New LWP 3369]
[New LWP 3391]
[New LWP 3360]
[New LWP 3375]
[New LWP 3370]
[New LWP 3746]
[New LWP 3361]
[New LWP 3770]
[New LWP 3371]
[New LWP 3374]
[New LWP 3751]
[New LWP 3785]
[New LWP 3372]
[New LWP 3376]
[New LWP 3766]
[New LWP 3380]
[New LWP 3812]
[New LWP 3382]
[New LWP 3767]
[New LWP 3396]
[New LWP 3813]
[New LWP 3385]
[New LWP 3784]
[New LWP 3397]
[New LWP 3786]
[New LWP 3386]
[New LWP 3844]
[New LWP 3718]
[New LWP 3787]
[New LWP 3412]
[New LWP 3421]
[New LWP 3703]
[New LWP 3708]
[New LWP 3761]
[New LWP 3768]
[New LWP 3399]
[New LWP 3845]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/sbin/netdata -P /run/netdata/netdata.pid'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055e0c6f991fa in sql_build_node_info ()
[Current thread is 1 (Thread 0x7fa90e61b640 (LWP 3769))]
(gdb) 
(gdb) 
(gdb) 
(gdb) bt
#0  0x000055e0c6f991fa in sql_build_node_info ()
#1  0x000055e0c6f95c52 in aclk_database_worker ()
#2  0x00007fa92565efef in start_thread (arg=0x7fa90e61b640) at pthread_create.c:463
#3  0x00007fa92559600f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) 

ktsaou (Member, Author) commented Feb 22, 2022

Rebuilding netdata with debug symbols and without optimization flags, to make sure the trace is right. Let's hope it crashes again...

ktsaou (Member, Author) commented Feb 22, 2022

The crash is probably related to netdata/netdata#12210

ktsaou (Member, Author) commented Feb 22, 2022

It doesn't crash now.

I see these in the logs:

2022-02-22 13:16:20: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-22 13:16:20: netdata INFO  : ACLK_Main : Getting Cloud /env successful
2022-02-22 13:16:20: netdata INFO  : ACLK_Main : Switching ACLK to new protobuf protocol. Due to /env response.
2022-02-22 13:16:21: netdata INFO  : ACLK_Main : HTTPS "GET" request to "app.netdata.cloud" finished with HTTP code: 200
2022-02-22 13:16:21: netdata INFO  : ACLK_Main : ACLK_OTP Got Challenge from Cloud
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : HTTPS "POST" request to "app.netdata.cloud" finished with HTTP code: 201
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : ACLK_OTP Got Password from Cloud
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : [mqtt_wss] I: ws_client: Websocket Connection Accepted By Server
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : ACLK connection successfully established
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Starting 2 query threads.
2022-02-22 13:16:22: netdata INFO  : ACLK_Query_0 : thread created with task id 28092
2022-02-22 13:16:22: netdata INFO  : ACLK_Query_1 : thread created with task id 28093
2022-02-22 13:16:22: netdata INFO  : ACLK_Query_0 : set name of thread 28092 to ACLK_Query_0
2022-02-22 13:16:22: netdata INFO  : ACLK_Query_1 : set name of thread 28093 to ACLK_Query_1
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=01702efd-af72-45b0-8e2f-33d806e5699f, live=1, hops=0
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=2626d41a-39f1-407f-8965-c08d146e8c4d, live=0, hops=1
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=465400f5-41b1-4bd3-80d7-30762cc9cfaf, live=0, hops=1
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=0a841ab0-bc12-41a0-894a-e41fd5440eb8, live=1, hops=1
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11, live=0, hops=1

Still, the agent is not online.
@vkuznecovas, out of its children, only a couple are online in the UI (probably because they are already online by themselves). The rest (including the agent itself) are offline.

In the agent access.log, I see only 2 children are being queried:

2022-02-22 13:27:37: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5610, last sequence 5625
2022-02-22 13:27:40: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5626, last sequence 5641
2022-02-22 13:27:47: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5642, last sequence 5657
2022-02-22 13:27:48: OG [0a841ab0-bc12-41a0-894a-e41fd5440eb8 (costa-pc)]: Sending charts and dimensions update, batch_id 2, first sequence 8129, last sequence 8144
2022-02-22 13:27:50: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5658, last sequence 5673
2022-02-22 13:27:51: OG [0a841ab0-bc12-41a0-894a-e41fd5440eb8 (costa-pc)]: Sending charts and dimensions update, batch_id 2, first sequence 8145, last sequence 8160
2022-02-22 13:27:57: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5674, last sequence 5689
2022-02-22 13:27:58: OG [0a841ab0-bc12-41a0-894a-e41fd5440eb8 (costa-pc)]: Sending charts and dimensions update, batch_id 2, first sequence 8161, last sequence 8176
2022-02-22 13:28:00: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5690, last sequence 5705
2022-02-22 13:28:01: OG [0a841ab0-bc12-41a0-894a-e41fd5440eb8 (costa-pc)]: Sending charts and dimensions update, batch_id 2, first sequence 8177, last sequence 8192
2022-02-22 13:28:07: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5706, last sequence 5721
2022-02-22 13:28:10: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5722, last sequence 5737
2022-02-22 13:28:17: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5738, last sequence 5753
2022-02-22 13:28:20: OG [9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11 (bedtv)]: Sending charts and dimensions update, batch_id 2, first sequence 5754, last sequence 5769

vkuznecovas commented:

So I've checked the SoT and the materialized views: what you should be seeing on the cloud is 4 reachable nodes and 2 unreachable. Only one of them seems to be connected directly to the cloud, which is the box.

If you're not seeing this, could you clear the IndexedDB in the browser and refresh the page? I know there were some issues reported regarding that recently.

ktsaou (Member, Author) commented Feb 22, 2022

@vkuznecovas I deleted the IndexedDB and refreshed. Still no luck...

  • box (the parent agent, c7a9cf8c-1882-11e6-944b-74d435e7ace6) is unreachable.
  • costa-pc (0a841ab0-bc12-41a0-894a-e41fd5440eb8) is live and responding via box (but this is also claimed directly and is live too). I see the parent sending chart and dimension updates.
  • rpi2b-1 (2626d41a-39f1-407f-8965-c08d146e8c4d), is live and is answering queries by itself, not through this parent.
  • bedtv (9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11), is not reachable in the cloud, although I see the parent sending charts and dimensions updates.
  • pi1 does not even appear in the cloud
  • pi2 does not even appear in the cloud
  • pi3 does not even appear in the cloud

@underhood how can I find the claim IDs of the children at the parent? Where are they stored?
Please include the hostname in these logs:

netdata INFO  : ACLK_Main : Queuing status update for node=01702efd-af72-45b0-8e2f-33d806e5699f, live=1, hops=0
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=2626d41a-39f1-407f-8965-c08d146e8c4d, live=0, hops=1
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=465400f5-41b1-4bd3-80d7-30762cc9cfaf, live=0, hops=1
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=0a841ab0-bc12-41a0-894a-e41fd5440eb8, live=1, hops=1
2022-02-22 13:16:22: netdata INFO  : ACLK_Main : Queuing status update for node=9b4c8fff-bc1f-4aee-9d8f-a8ae93e8bd11, live=0, hops=1

vkuznecovas commented Feb 22, 2022

I think I might have found the cause for this. Let me verify the findings and I'll update this ticket. There might be a desync issue between the nodes and the spaceroom service.

ktsaou (Member, Author) commented Feb 22, 2022

@underhood @stelfrag I have these logs every 2 seconds on my parent (NETDATA_INTERNAL_CHECKS is NOT set):

2022-02-22 15:18:33: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:18:35: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:18:36: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:18:39: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:18:40: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:18:43: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:18:44: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:13: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:14: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:22: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:23: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:25: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:26: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:30: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:31: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:35: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:36: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:41: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:42: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:43: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.
2022-02-22 15:19:45: netdata INFO  : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.

stelfrag commented:

2022-02-22 15:19:45: netdata INFO : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.

Will check!

vkuznecovas commented:

I've done a detailed write-up of what we've discovered on the cloud side of things in the internal ticket here. A short summary:

Nodes that have been inactive for more than 60 days are deleted from the cloud. However, the deletion is not propagated properly, resulting in you seeing nodes that are no longer there. In reality, the nodes were re-created once you turned your machines back on, but they are not assigned to any room. What you're seeing in the rooms are skeletons.

stelfrag commented:

2022-02-22 15:19:45: netdata INFO : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.

@vlvkobal please take a look (rrdset_set_name)

vlvkobal commented Feb 22, 2022

2022-02-22 15:19:45: netdata INFO : STREAM_RECEIVER[rpi2b-1,[192.168.1.141]:37928] : RRDSET: chart name 'apps.files' on host 'rpi2b-1' already exists.

@vlvkobal please take a look (rrdset_set_name)

Nothing has changed except variable names in this part of the code. We should keep the message at the INFO level because it's very difficult to debug chart name clashes on the user side without these messages.

ktsaou (Member, Author) commented Feb 22, 2022

Crashed again:

# gdb $(which netdata) core-AS_pi3-11-0-0-11483-1645556781
GNU gdb (Gentoo 11.2 vanilla) 11.2
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://bugs.gentoo.org/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/sbin/netdata...
[New LWP 12101]
[New LWP 11483]
[New LWP 11635]
[New LWP 11636]
[New LWP 11628]
[New LWP 11640]
[New LWP 11485]
[New LWP 11647]
[New LWP 11643]
[New LWP 11630]
[New LWP 11629]
[New LWP 11654]
[New LWP 11631]
[New LWP 11632]
[New LWP 11644]
[New LWP 11656]
[New LWP 11637]
[New LWP 11646]
[New LWP 11633]
[New LWP 11658]
[New LWP 11638]
[New LWP 11649]
[New LWP 11663]
[New LWP 11639]
[New LWP 11657]
[New LWP 11662]
[New LWP 11692]
[New LWP 11645]
[New LWP 11660]
[New LWP 11933]
[New LWP 11710]
[New LWP 11648]
[New LWP 11676]
[New LWP 12013]
[New LWP 11650]
[New LWP 11970]
[New LWP 12012]
[New LWP 11653]
[New LWP 12057]
[New LWP 12037]
[New LWP 12059]
[New LWP 12038]
[New LWP 11664]
[New LWP 12023]
[New LWP 12054]
[New LWP 12056]
[New LWP 12058]
[New LWP 12091]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
Core was generated by `/usr/sbin/netdata -P /run/netdata/netdata.pid'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  str2uint32_t (s=0x0) at ./libnetdata/inlined.h:91
91	    for(c = *s; c >= '0' && c <= '9' ; c = *(++s)) {
[Current thread is 1 (Thread 0x7f34b6e0e640 (LWP 12101))]
(gdb) bt
#0  str2uint32_t (s=0x0) at ./libnetdata/inlined.h:91
#1  sql_build_node_info (wc=wc@entry=0x5589886e1780, cmd=...) at database/sqlite/sqlite_aclk_node.c:38
#2  0x000055897f9f6a01 in aclk_database_worker (arg=0x5589886e1780) at database/sqlite/sqlite_aclk.c:499
#3  0x00007f34d3e8ffef in start_thread (arg=0x7f34b6e0e640) at pthread_create.c:463
#4  0x00007f34d3dc700f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) 

ktsaou (Member, Author) commented Feb 22, 2022

@stelfrag found the reason for these crashes. A few child netdata agents are very old and some host info structures are NULL, but the parent does not check for that before using them, so it crashes.
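For illustration only, the class of fix is a NULL guard in front of the conversion seen in the backtrace; the sketch below is modeled on the str2uint32_t(s=0x0) frame and is not the actual patch:

```c
/* Hedged sketch: very old children can leave host-info fields NULL, so the
 * conversion must tolerate a NULL pointer. Modeled on the crashing frame
 * above; not the real fix. */
#include <stdint.h>

static inline uint32_t str2uint32_t_or_zero(const char *s)
{
    uint32_t n = 0;

    if (!s)
        return 0;                          /* missing field from an old child */

    for (char c = *s; c >= '0' && c <= '9'; c = *(++s))
        n = n * 10 + (uint32_t)(c - '0');

    return n;
}
```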

ktsaou (Member, Author) commented Feb 22, 2022

@vkuznecovas check this:

# curl 'http://localhost:19999/api/v1/info'
{
	"version": "v1.33.1-45-g4d0750620",
	"uid": "c7a9cf8c-1882-11e6-944b-74d435e7ace6",
	"mirrored_hosts": [
		"box",
		"rpi2b-1",
		"pi2",
		"bedtv",
		"pi1",
		"pi3"
	],
	"mirrored_hosts_status": [
		{ "guid": "c7a9cf8c-1882-11e6-944b-74d435e7ace6", "reachable": true, "hops": 0, "claim_id": "c7a9cf8c-1882-11e6-944b-74d435e7ace6" },
		{ "guid": "44bbfb16-827f-11ea-bc9f-b827eb91870b", "reachable": true, "hops": 1, "claim_id": "44bbfb16-827f-11ea-bc9f-b827eb91870b" },
		{ "guid": "d5874cc6-1afb-11e6-859b-b827ebd15026", "reachable": true, "hops": 1, "claim_id": null },
		{ "guid": "eef2e7b4-1976-11e6-ae19-7cdd9077342a", "reachable": true, "hops": 1, "claim_id": "eef2e7b4-1976-11e6-ae19-7cdd9077342a" },
		{ "guid": "a3bb4986-197a-11e6-9324-b827ebe850c4", "reachable": true, "hops": 1, "claim_id": null },
		{ "guid": "ac79784a-1afb-11e6-b71c-b827eb19c746", "reachable": true, "hops": 1, "claim_id": null }
	],

ktsaou (Member, Author) commented Feb 22, 2022

@stelfrag @underhood for the nodes above that have a null claim id (pi1, pi2, pi3), is it possible that the agent does not claim them at all? These nodes are old (running netdata 1.19), but the parent is recent (1.33.1-45). So the parent should claim them, but it seems it does not.

ktsaou (Member, Author) commented Feb 22, 2022

It seems that the agent didn't claim pi1, pi2 and pi3:

# grep 'Queuing registration for ' /var/log/netdata/error.log
2022-02-22 12:55:55: netdata INFO  : ACLK_Main : Queuing registration for host=c7a9cf8c-1882-11e6-944b-74d435e7ace6, hops=0
2022-02-22 12:55:55: netdata INFO  : ACLK_Main : Queuing registration for host=44bbfb16-827f-11ea-bc9f-b827eb91870b, hops=1
2022-02-22 12:55:55: netdata INFO  : ACLK_Main : Queuing registration for host=9a212702-827f-11ea-abb0-b827ebd78c48, hops=1
2022-02-22 12:55:55: netdata INFO  : ACLK_Main : Queuing registration for host=5e58f59c-31f5-11e8-bcf2-408d5c6fcbd0, hops=1
2022-02-22 12:55:55: netdata INFO  : ACLK_Main : Queuing registration for host=eef2e7b4-1976-11e6-ae19-7cdd9077342a, hops=1

The machine GUIDs of the 3 nodes are not in the log.

stelfrag commented:

@stelfrag @underhood for the nodes above that have a null claim id (pi1, pi2, pi3), is it possible that the agent does not claim them at all? These nodes are old (running netdata 1.19), but the parent is recent (1.33.1-45). So the parent should claim them, but it seems it does not.

The claim id indicates whether that node is claimed by itself. It is sent from the child to the parent via streaming.

stelfrag commented:

Nothing has changed except variable names in this part of the code. We should keep the message at the INFO level because it's very difficult to debug chart name clashes on the user side without these messages.

@vlvkobal We can add the info (moving it a bit) so that it is logged only if there is actually a clash.

ktsaou (Member, Author) commented Feb 23, 2022

Update: I installed netdata/netdata#12223.
The crashes are now gone.

I now see all the nodes at the cloud, but:

  1. box, the agent node, works perfectly
  2. bedtv, works, but not through this parent. It works directly.
  3. rpi2b-1, works, but not through this parent. It works directly.
  4. pi1, is shown as online, but no data requests are sent to the parent.
  5. pi2, is shown as online, but no data requests are sent to the parent.
  6. pi3, is shown as offline.

Generally, no /data requests are received from the cloud at this node for any of its children.

stelfrag commented:

  • bedtv, works, but not through this parent. It works directly.

Identified the problem with this: children running version < 1.31 will not correctly register to the cloud (via the parent).
A fix is being prepared.

Reason:
The node_instance registration on the parent happens with a new CLAIM streaming command, which is not available in older versions. The fix will make sure the parent takes care of that.
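For clarity, a hedged sketch of that fallback; the version threshold, struct, and function names below are assumptions for illustration only:

```c
/* Hedged sketch of the described behaviour: if a child streams with a protocol
 * version that predates the CLAIM command, the parent registers the
 * node_instance itself. All names and the threshold are assumptions. */
#include <stdio.h>

#define STREAM_VERSION_WITH_CLAIM 4          /* hypothetical threshold, not the real value */

struct child_host {
    int stream_version;
    const char *machine_guid;
};

/* stub standing in for the parent-side registration path */
static void queue_node_instance_registration(const char *machine_guid)
{
    printf("registering node_instance for %s on behalf of the child\n", machine_guid);
}

static void on_child_connected(const struct child_host *child)
{
    if (child->stream_version < STREAM_VERSION_WITH_CLAIM)
        queue_node_instance_registration(child->machine_guid);   /* old child: parent does it */
    /* newer children trigger registration themselves via the CLAIM streaming command */
}
```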

ktsaou (Member, Author) commented Feb 24, 2022

I reinstalled netdata with the latest merges. No change. The status is exactly the same as above: #293 (comment)

ktsaou reopened this on Feb 24, 2022
stelfrag commented:

I reinstalled netdata with the latest merges. No change. The status is exactly the same as above: #293 (comment)

Yes, the issue I mentioned in #293 (comment) is not fixed (PR coming up), but the "crash fix" that was linked auto-closed the issue.

ktsaou (Member, Author) commented Feb 24, 2022

Yes, the issue I mentioned in #293 (comment) is not fixed (PR coming up), but the "crash fix" that was linked auto-closed the issue.

Which PR is this?

stelfrag commented:

Yes, the issue I mentioned in #293 (comment) is not fixed (PR coming up), but the "crash fix" that was linked auto-closed the issue.

Which PR is this?

netdata/netdata#12241

ktsaou (Member, Author) commented Mar 7, 2022

This problem is fixed.

The only remaining issue is cleaning up the cloud, because some nodes appear twice. I have removed the duplicate, non-working ones from the room, but they still appear in node management, like this:

[screenshot: node management showing the duplicate nodes]

I am closing this issue.

ktsaou closed this as completed on Mar 7, 2022