ACLK-available-is-false #8977

Closed
evilalmus opened this issue May 12, 2020 · 11 comments
Labels
bug needs triage Issues which need to be manually labelled

Comments

@evilalmus

evilalmus commented May 12, 2020

Bug report summary

I was able to "claim" this node, but it shows as unavailable in the netdata.cloud console.

OS / Environment
# uname -a; grep -Hv "^#" /etc/*release
Linux live 5.4.0-26-generic #30-Ubuntu SMP Mon Apr 20 16:58:30 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
/etc/lsb-release:DISTRIB_ID=Ubuntu
/etc/lsb-release:DISTRIB_RELEASE=20.04
/etc/lsb-release:DISTRIB_CODENAME=focal
/etc/lsb-release:DISTRIB_DESCRIPTION="Ubuntu 20.04 LTS"
/etc/os-release:NAME="Ubuntu"
/etc/os-release:VERSION="20.04 LTS (Focal Fossa)"
/etc/os-release:ID=ubuntu
/etc/os-release:ID_LIKE=debian
/etc/os-release:PRETTY_NAME="Ubuntu 20.04 LTS"
/etc/os-release:VERSION_ID="20.04"
/etc/os-release:HOME_URL="https://www.ubuntu.com/"
/etc/os-release:SUPPORT_URL="https://help.ubuntu.com/"
/etc/os-release:BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
/etc/os-release:PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
/etc/os-release:VERSION_CODENAME=focal
/etc/os-release:UBUNTU_CODENAME=focal
Netdata version
# netdata -V
netdata v1.22.0-12-nightly

If Netdata is running, execute: $(ps aux | grep -E -o "[a-zA-Z/]+netdata ") -V

# $(ps aux | grep -E -o "[a-zA-Z/]+netdata ") -V
-V: command not found
# $(ps aux | grep -E -o "[a-zA-Z/]+netdata ")
# ps aux | grep netdata
netdata   157856  1.0  0.6 255840 50560 ?        Sl   05:47   0:13 /usr/sbin/netdata
netdata   157976  0.2  0.6  90492 54028 ?        Sl   05:47   0:03 /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
netdata   157979  0.0  0.2 125952 17964 ?        Sl   05:47   0:00 /usr/libexec/netdata/plugins.d/go.d.plugin 1
netdata   157983  0.0  0.0   4036  3028 ?        S    05:47   0:01 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
netdata   157989  1.1  0.0  53264  3480 ?        S    05:47   0:15 /usr/libexec/netdata/plugins.d/apps.plugin 1
root      158373  0.0  0.0   8160   736 pts/1    S+   06:09   0:00 grep --color=auto netdata
Component Name

ACLK

Steps To Reproduce

1. Installed netdata
2. Ran the claim script from netdata.cloud
3. That's it...
Info from /var/log/netdata/error.log

2020-05-12 05:30:12: netdata INFO  : ACLK_Main : thread created with task id 156551
2020-05-12 05:30:12: netdata INFO  : ACLK_Main : set name of thread 156551 to ACLK_Main
2020-05-12 05:30:12: netdata INFO  : ACLK_Main : Waiting for netdata to be ready
2020-05-12 05:30:12: netdata INFO  : ACLK_Main : Waiting for Cloud to be enabled
2020-05-12 05:30:12: netdata INFO  : ACLK_Main : Waiting for netdata to be claimed
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : Setting ACLK target host=app.netdata.cloud port=443 from https://app.netdata.cloud
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : Attempting to establish the agent cloud link
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/ab6a7318-9352-11ea-a52e-7bd303518e53/challenge
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : aclk_send_https_request GET
2020-05-12 05:33:04: netdata ERROR : ACLK_Main : Decryption of the challenge failed: error:04099079:rsa routines:RSA_padding_check_PKCS1_OAEP_mgf1:oaep decoding error
2020-05-12 05:33:04: netdata ERROR : ACLK_Main : Output buffer for encoding size=512 is not large enough for 18446744073709551615-bytes input
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : aclk_send_https_request POST
2020-05-12 05:33:04: netdata ERROR : ACLK_Main : Challenge-response failed: {"errorCode":"TODO trace-id","errorMsgKey":"ErrIncorrectResponse","errorMessage":"incorrect challenge response"}
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : Retrying to establish the ACLK connection in 0.000 seconds
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : Attempting to establish the agent cloud link
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : Retrieving challenge from cloud: app.netdata.cloud 443 /api/v1/auth/node/ab6a7318-9352-11ea-a52e-7bd303518e53/challenge
2020-05-12 05:33:04: netdata INFO  : ACLK_Main : aclk_send_https_request GET
2020-05-12 05:33:05: netdata ERROR : ACLK_Main : Decryption of the challenge failed: error:04099079:rsa routines:RSA_padding_check_PKCS1_OAEP_mgf1:oaep decoding error
2020-05-12 05:33:05: netdata ERROR : ACLK_Main : Output buffer for encoding size=512 is not large enough for 18446744073709551615-bytes input
2020-05-12 05:33:05: netdata INFO  : ACLK_Main : aclk_send_https_request POST
2020-05-12 05:33:05: netdata ERROR : ACLK_Main : Challenge-response failed: {"errorCode":"TODO trace-id","errorMsgKey":"ErrIncorrectResponse","errorMessage":"incorrect challenge response"}
2020-05-12 05:33:05: netdata INFO  : ACLK_Main : Retrying to establish the ACLK connection in 1.351 seconds
Expected behavior

node appears in cloud console.

@underhood
Contributor

I will try to reproduce this with Ubuntu LTS in a VM. I have also contacted the cloud team to see whether they can get any helpful information from the other (cloud) side.

@underhood
Contributor

@evilalmus Does the machine running the affected agent have a connection to https://app.netdata.cloud?

@underhood
Contributor

underhood commented May 12, 2020

@evilalmus It seems the claiming got messed up somehow. Can you please delete the following files:

  • /var/lib/netdata/cloud.d/claimed_id
  • /var/lib/netdata/cloud.d/private.pem
  • /var/lib/netdata/cloud.d/public.pem
  • /var/lib/netdata/registry/netdata.public.unique.id

Then remove the node from the cloud.

After that, please restart Netdata. A new /var/lib/netdata/registry/netdata.public.unique.id will be generated, after which you can try to claim the node with the cloud again.
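The file cleanup described above can be sketched as a small shell function. This is only an illustration: the function name `reset_claim_state` and its optional directory argument are hypothetical, and the four relative paths are the ones listed in this comment, assuming the default /var/lib/netdata layout.

```shell
#!/bin/sh
# Sketch only: reset_claim_state is a hypothetical helper, not part of Netdata.
# It removes the four claiming-state files named in the comment above.
# The optional first argument is the Netdata lib directory
# (assumed default: /var/lib/netdata).
reset_claim_state() {
    lib_dir="${1:-/var/lib/netdata}"
    for rel in \
        cloud.d/claimed_id \
        cloud.d/private.pem \
        cloud.d/public.pem \
        registry/netdata.public.unique.id
    do
        # rm -f does not fail when a file is already missing
        rm -f -- "$lib_dir/$rel"
    done
}
```

Run as root, `reset_claim_state /var/lib/netdata` would be followed by removing the node in the cloud UI and then by `systemctl restart netdata`, as done later in this thread.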

@evilalmus
Author

Thank you. How do you remove the node from the cloud? The docs say to use the "Node View" (https://learn.netdata.cloud/docs/agent/netdata-cloud/nodes-view) which doesn't seem to exist anymore. I can remove a node from a War Room using the "Manage War Room" screen, but that does not seem to remove it from the cloud registry.

@underhood
Contributor

@evilalmus Yes, removing it via "Manage War Room/Nodes", then the three-dot menu and "Remove", will do. I should have been clearer.

After you do that, claiming the node under a new ID will work. There will be an "unclaim" node feature in the future, but for now claiming under a new unique ID will do.

@evilalmus
Author

After following these steps it seems that the node did not generate the new ID:

root@live:/var/lib/netdata# rm /var/lib/netdata/cloud.d/claimed_id
root@live:/var/lib/netdata# rm /var/lib/netdata/cloud.d/private.pem
root@live:/var/lib/netdata# rm /var/lib/netdata/cloud.d/public.pem
root@live:/var/lib/netdata# rm /var/lib/netdata/registry/netdata.public.unique.id
root@live:/var/lib/netdata# systemctl restart netdata
root@live:/var/lib/netdata# sudo netdata-claim.sh -token=r9vlAtuDrjyDBneFyIz81qLQ7ESpfE0FkvfBDa0RFRkOKFfiyj9AjDiYbGN2ok9l2NSZViurbgvMAnKen2-fEzk20vLuzX7K3dRELUXSHlbUTRuyxQ2bCmB4jOrdOC2oYV3GoWA -rooms=1bd9aabc-cfe3-4166-90d0-414b8956a00d -url=https://app.netdata.cloud
Token: ****************
Base URL: https://app.netdata.cloud
Id: unknown
Rooms: 1bd9aabc-cfe3-4166-90d0-414b8956a00d
Hostname: live
Proxy:
Netdata user: netdata
Generating private/public key for the first time.
Generating RSA private key, 2048 bit long modulus (2 primes)
...............+++++
..................+++++
e is 65537 (0x010001)
Extracting public key from private key.
writing RSA key
Failed to claim node with the following error message:"invalid node id"
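
The `Id: unknown` line in the output above suggests the claim script ran before the agent had written a new ID file. A minimal sketch of a pre-claim check follows. The helper name `wait_for_node_id` and its timeout argument are hypothetical, and it assumes the ID lives in /var/lib/netdata/registry/netdata.public.unique.id as stated earlier in the thread.

```shell
#!/bin/sh
# Sketch only: wait_for_node_id is a hypothetical helper, not part of Netdata.
# It polls once per second until the ID file exists and is non-empty,
# then prints the file's contents.
# Arguments: [id_file] [timeout_seconds]; both defaults are assumptions.
wait_for_node_id() {
    id_file="${1:-/var/lib/netdata/registry/netdata.public.unique.id}"
    timeout="${2:-30}"
    waited=0
    # -s is true only when the file exists and has a size greater than zero
    while [ ! -s "$id_file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "node ID file not found: $id_file" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cat "$id_file"
}
```

Calling `wait_for_node_id` after the restart, and running `netdata-claim.sh` only when it returns successfully, would avoid attempting a claim without a node ID.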

@mfundul
Contributor

mfundul commented May 12, 2020

Looks like /var/lib/netdata/registry/netdata.public.unique.id did not get regenerated by the agent. Take a look at /var/log/netdata/error.log for any suspicious error messages.

Is it possible you deleted some of the /var/lib/netdata subdirectories and recreated them as the root user?

Take a look at the permissions for the full path of /var/lib/netdata/registry/netdata.public.unique.id. Those directories are normally owned by netdata, or by the user configured in netdata.conf through the option run as user =. Nothing under those paths should normally be owned by root.
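
One way to run the ownership check suggested above is to list everything under the directory that is not owned by the expected user. This is a sketch under assumptions: the function name `list_foreign_owned` is hypothetical, and the defaults (/var/lib/netdata and the user netdata) come from this thread and may differ if run as user is set to something else.

```shell
#!/bin/sh
# Sketch only: list_foreign_owned is a hypothetical helper, not part of Netdata.
# It prints every file and directory under the given directory whose owner
# is not the expected user. Empty output means ownership looks consistent.
# Arguments: [directory] [user name or numeric UID]; defaults are assumptions.
list_foreign_owned() {
    dir="${1:-/var/lib/netdata}"
    user="${2:-netdata}"
    # "! -user" negates the ownership match, so only mismatches are printed
    find "$dir" ! -user "$user" -print
}
```

For example, `list_foreign_owned /var/lib/netdata netdata` would print any root-owned entries. On Linux systems with util-linux, `namei -l` on the full path shows the owner and mode of each path component.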

@evilalmus
Author

I don't see anything that looks relevant in the error log. In fact, I don't see anything after an exit from one of the restarts, so maybe it's not saving logs correctly now?
Last message in the error log:

2020-05-12 06:37:49: netdata INFO  : MAIN : EXIT: all done - netdata is now exiting - bye bye...

When following the instructions in the Wiki I deleted all of /var/lib/netdata/cloud.d/, but it was recreated when I started Netdata again and is owned by netdata:netdata.
I never created any files manually as the root user. The system did create public and private keys in the cloud.d directory that are owned by root. I assume this is correct.

"/var/lib/netdata/registry" is owned by netdata:netdata but there are no files in that directory.

Maybe I should just uninstall and start over with a fresh install? If that is the case should I use the instructions here: https://learn.netdata.cloud/docs/agent/packaging/installer/uninstall/

@mfundul
Contributor

mfundul commented May 12, 2020

This all sounds broken to me. I don't see how the netdata binary of the agent failed to recreate /var/lib/netdata/registry/netdata.public.unique.id when the path is owned by netdata:netdata and how at the same time the netdata-claim.sh command created files owned by root.

An uninstall may fix the issue for you; it's worth a try.

@evilalmus
Author

Okay, I uninstalled and reinstalled, and the problem went away. Thank you.

@Joshua2504

I ran into the same issue using Docker.

I started Netdata using the given docker-compose.yml.
