diff --git a/src/content/docs/new-relic-control/agent-control/troubleshooting.mdx b/src/content/docs/new-relic-control/agent-control/troubleshooting.mdx index 91e6b09245b..32e1de4581f 100644 --- a/src/content/docs/new-relic-control/agent-control/troubleshooting.mdx +++ b/src/content/docs/new-relic-control/agent-control/troubleshooting.mdx @@ -12,172 +12,220 @@ This document covers the steps to troubleshoot common issues when installing or ## Kubernetes troubleshooting -### Enable debug logging -To diagnose errors during the installation process, you can increase the log level for Agent Control by adding the following setting in your `values-newrelic.yaml` file: - -```yaml -agent-control-deployment: - config: - agentControl: - content: - log: - level: trace -``` - -- **Default log level:** `info`. -- **Other supported log levels:** `debug` and `trace`. -- **OTel collector logs:** To enable debug logs in the OpenTelemetry collector, add `verboseLog: true`. + + + To diagnose errors during the installation process, you can increase the log level for Agent Control by adding the following setting in your `values-newrelic.yaml` file: -To inspect the Agent Control logs, run the following command, replacing `agent-control-***` with the name of your Agent Control pod: + ```yaml + agent-control-deployment: + config: + agentControl: + content: + log: + level: trace + ``` -```shell -# Find the Agent Control pod name -kubectl get pods -n newrelic-agent-control + - **Default log level:** `info`. + - **Other supported log levels:** `debug` and `trace`. + - **OTel collector logs:** To enable debug logs in the OpenTelemetry collector, add `verboseLog: true`. 
-# Inspect the logs, replacing `agent-control-***` with your pod's name -kubectl logs agent-control-*** -n newrelic-agent-control -``` + To inspect the Agent Control logs, run the following command, replacing `agent-control-***` with the name of your Agent Control pod: -### Status endpoint -Agent Control exposes a local status endpoint you can use to check the health of Agent Control and its managed agents. This endpoint is enabled by default on port `51200`. Follow these steps to query the cluster status: + ```shell + # Find the Agent Control pod name + kubectl get pods -n newrelic-agent-control -Forward a local port to the main `agent-control` pod: -```shell -kubectl port-forward 51200:51200 -``` -Request the agent status: -```shell -curl localhost:51200/status -``` -### Helm release failure -Agent Control requires a valid authentication credential to securely connect to Fleet Control. Initially, this credential is automatically generated through the Agent Control installation UI and is represented by the `identityClientId` and `identityClientSecret` fields in the values file. For security reasons, the credential necessary for installing Agent Control expires after 12 hours. + # Inspect the logs, replacing `agent-control-***` with your pod's name + kubectl logs agent-control-*** -n newrelic-agent-control + ``` + -If the installation fails with a BackoffLimitExceeded error, it often indicates an expired or invalid credential. -```shell -[output] Error: UPGRADE FAILED: pre-upgrade hooks failed: job failed: BackoffLimitExceeded -``` + + Agent Control exposes a local status endpoint you can use to check the health of Agent Control and its managed agents. This endpoint is enabled by default on port `51200`. Follow these steps to query the cluster status: -Check the logs of the Kubernetes job responsible for setting up the Agent Control system identity. 
+ Forward a local port to the main `agent-control` pod, replacing `agent-control-***` with your pod's name:
+ ```shell
+ kubectl port-forward agent-control-*** 51200:51200 -n newrelic-agent-control
+ ```
+ Request the agent status:
+ ```shell
+ curl localhost:51200/status
+ ```
+ 

-First, identify the job’s pods:
-```shell
-kubectl describe job agent-control-generate-system-identity -n 
-```

+ 
+ When the agent-control-bootstrap chart is installed, a job is launched that installs all the required resources and charts. The installation may fail with a BackoffLimitExceeded error:
+ ```shell
+ [output] Error: UPGRADE FAILED: pre-upgrade hooks failed: job failed: BackoffLimitExceeded
+ ```

-In the `Events` section, look for entries for the specific pods, as follows:

+ You can debug installation errors by inspecting the logs of the install job:
+ ```shell
+ kubectl logs agent-control-bootstrap-install-job-**** -n newrelic-agent-control
+ ```

-```shell
-[output] Events:
-[output] Type Reason Age From Message
-[output] ---- ------ ---- ---- -------
-[output] Normal SuccessfulCreate 88s job-controller Created pod: agent-control-generate-system-identity-jr6cg
-[output] Normal SuccessfulCreate 73s job-controller Created pod: agent-control-generate-system-identity-wnx2v
-[output] Normal SuccessfulCreate 50s job-controller Created pod: agent-control-generate-system-identity-8zsqd
-[output] Normal SuccessfulCreate 7s job-controller Created pod: agent-control-generate-system-identity-btqh7
-[output] Warning BackoffLimitExceeded 1s job-controller Job has reached the specified backoff limit
-```

+ Agent Control requires a valid authentication credential to securely connect to Fleet Control. Initially, this credential is automatically generated through the Agent Control installation UI and is represented by the `identityClientId` and `identityClientSecret` fields in the values file. For security reasons, the credential necessary for installing Agent Control expires after 12 hours. 
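The Helm failure modes covered in this document have recognizable error strings. As a hypothetical sketch (not part of the product), a wrapper script could branch on the captured output of `helm upgrade ... 2>&1`; here the message is hard-coded to the errors quoted in this document:

```shell
# Hypothetical triage helper: classify a captured Helm error message.
# A real script would capture `helm upgrade ... 2>&1` instead of this sample.
helm_output='Error: UPGRADE FAILED: pre-upgrade hooks failed: job failed: BackoffLimitExceeded'

case "$helm_output" in
  *BackoffLimitExceeded*)
    echo "install job hit its backoff limit: check the install job logs" ;;
  *"has no deployed releases"*)
    echo "stale Helm release secret: delete the release secrets and retry" ;;
  *)
    echo "unrecognized Helm error" ;;
esac
```

The patterns above only cover the errors described on this page; extend them for your own environment.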
-View the logs of the failing pods:

+ If the installation fails with a BackoffLimitExceeded error, it often indicates an expired or invalid credential.

-```shell
-kubectl logs -n 
-```

+ Check the logs of the Kubernetes job responsible for setting up the Agent Control system identity.

-Example:

+ First, identify the job’s pods:
+ ```shell
+ kubectl describe job agent-control-generate-system-identity -n <namespace>
+ ```

-```shell
-kubectl logs agent-control-generate-system-identity-btqh7 -n newrelic-agent-control
-```

+ In the `Events` section, look for entries for the specific pods, as follows:

-After reviewing the logs, retry the installation using Helm while watching for specific error messages and checking the logs for potential problems. Below are some known issues and how to interpret them:

+ ```shell
+ [output] Events:
+ [output] Type Reason Age From Message
+ [output] ---- ------ ---- ---- -------
+ [output] Normal SuccessfulCreate 88s job-controller Created pod: agent-control-generate-system-identity-jr6cg
+ [output] Normal SuccessfulCreate 73s job-controller Created pod: agent-control-generate-system-identity-wnx2v
+ [output] Normal SuccessfulCreate 50s job-controller Created pod: agent-control-generate-system-identity-8zsqd
+ [output] Normal SuccessfulCreate 7s job-controller Created pod: agent-control-generate-system-identity-btqh7
+ [output] Warning BackoffLimitExceeded 1s job-controller Job has reached the specified backoff limit
+ ```

-- **Invalid identityClientId:**
- `Error getting system identity auth token. The API endpoint returned 404: Failed to find Identity: `
- **Invalid identityClientSecret:**
- `Error getting system identity auth token. The API endpoint returned 400: Bad client secret.`
- **Identity expired:**
- `Error getting system identity auth token. The API endpoint returned 400: Expired client secret.`
- **Missing required permissions:**
- `Failed to create a New Relic System Identity for Fleet Control communication authentication. 
Please verify that your User Key is valid and that your Account Organization has the necessary permissions to create a System Identity: Exception while fetching data (/create) : Not authorized to perform this action or the entity is not found.`

+ View the logs of the failing pods:

-### Invalid New Relic license
-If you see an error message like the one below in the logs of the OpenTelemetry collector deployment pod, it may indicate an invalid New Relic license key. This prevents the collector from being able to export telemetry data to New Relic:

+ ```shell
+ kubectl logs <pod-name> -n <namespace>
+ ```

-```shell
-[output] 2024-06-13T13:46:05.898Z error exporterhelper/retry_sender.go:126 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/newrelic", "error": "Permanent error: error exporting items, request to https://otlp.nr-dat ││ go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
-```

+ Example:

-#### Solution
-Confirm that you're using a valid New Relic license key in your configuration.

+ ```shell
+ kubectl logs agent-control-generate-system-identity-btqh7 -n newrelic-agent-control
+ ```

-### HelmRelease Failure for Managed Agents

+ After reviewing the logs, retry the installation using Helm while watching for specific error messages and checking the logs for potential problems. Below are some known issues and how to interpret them:

-If a managed agent's pods are not being created, there may be an issue with its HelmRelease.

+ - **Invalid identityClientId:**
+ `Error getting system identity auth token. The API endpoint returned 404: Failed to find Identity: `
+ - **Invalid identityClientSecret:**
+ `Error getting system identity auth token. The API endpoint returned 400: Bad client secret.`
+ - **Identity expired:**
+ `Error getting system identity auth token. 
The API endpoint returned 400: Expired client secret.` + - **Missing required permissions:** + `Failed to create a New Relic System Identity for Fleet Control communication authentication. Please verify that your User Key is valid and that your Account Organization has the necessary permissions to create a System Identity: Exception while fetching data (/create) : Not authorized to perform this action or the entity is not found.` + -Check the status of the Helm release: + + If you see an error message like the one below in the logs of the OpenTelemetry collector deployment pod, it may indicate an invalid New Relic license key. This prevents the collector from being able to export telemetry data to New Relic: -```shell -kubectl get helmrelease open-telemetry -n newrelic -``` + ```shell + [output] 2024-06-13T13:46:05.898Z error exporterhelper/retry_sender.go:126 Exporting failed. The error is not retryable. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "otlphttp/newrelic", "error": "Permanent error: error exporting items, request to https://otlp.nr-dat ││ go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send + ``` -A successful and healthy release should show `READY: True` and `STATUS: InstallSucceeded`. + **Solution** -If the release failed, the `STATUS` and `READY` fields will indicate the problem. Depending on the type of error, the root problem might not be fully reflected in the status field. To get more details, use `kubectl` to describe the HelmRelease resource: + Confirm that you're using a valid New Relic license key in your configuration. + -```shell -kubectl describe helmrelease open-telemetry -n newrelic -``` + + If a managed agent's pods are not being created, there may be an issue with its HelmRelease. 
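If you manage several releases, the output of `kubectl get helmrelease` can be filtered in a script to surface unhealthy ones. This is an illustrative sketch only: the sample rows are fabricated and the assumed column order (NAME, READY, STATUS) may differ across Flux versions:

```shell
# Illustrative only: hard-coded sample of `kubectl get helmrelease` output;
# a real script would pipe the live kubectl output instead.
sample='NAME             READY   STATUS
open-telemetry   False   InstallFailed
nr-infra         True    InstallSucceeded'

# Print the names of releases whose READY column is not "True".
printf '%s\n' "$sample" | awk 'NR > 1 && $2 != "True" { print $1 }'
# prints: open-telemetry
```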
-### Troubleshoot with NRDiag [#nrdiag]

+ Check the status of the Helm release:

-New Relic diagnostics tool `NRDiag` is a utility that gathers resources and logs related to agent-control in your cluster for debugging.
-Follow these steps to gather all the data:

+ ```shell
+ kubectl get helmrelease open-telemetry -n newrelic
+ ```

-1. On your host, install the `NRDiag` tool using the [getting started guide](/docs/new-relic-solutions/solve-common-issues/diagnostics-cli-nrdiag/diagnostics-cli-nrdiag/#get-started).

+ A successful and healthy release should show `READY: True` and `STATUS: InstallSucceeded`.

-2. Run the K8s agent control suite:
-
- - Ensure that `kubectl` and `helm` are installed.
-
+ If the release failed, the `STATUS` and `READY` fields will indicate the problem. Depending on the type of error, the root problem might not be fully reflected in the status field. To get more details, use `kubectl` to describe the HelmRelease resource:

- - Run the command in the namespace set in kubeconfig's context:
- ```bash
- ./nrdiag -suites K8s-agent-control
+ ```shell
+ kubectl describe helmrelease open-telemetry -n newrelic
 ```
+ 

- - Specify a different namespace for Agent Control using the `--k8s-namespace` flag:
- ```bash
- ./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic
- ```

+ 
+ When you delete agent-control-bootstrap, a job is launched that removes all the resources and charts that were created.

- - Specify a different namespace for sub-agents using the `ac-agents-namespace` flag:
- ```bash
- ./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic-agent-control --ac-agents-namespace=newrelic

+ If the uninstallation shows an error like:
+ `* job agent-control-bootstrap-uninstall-job failed: BackoffLimitExceeded`
+ 
+ You can view the job's logs to debug the error:
+ ```shell
+ kubectl logs agent-control-bootstrap-uninstall-job-*** -n newrelic-agent-control
 ```
+ 

-3. 
The expected output should look like the following report:

+ 
+ If the helm delete command is canceled while it's running, the uninstall job keeps working and deletes the charts and resources, but the agent-control-bootstrap Helm release secret may still exist.
+ In that case you won't be able to upgrade or install the chart, and you'll get the following error:

- ```bash
- [output] Check Results
- [output] -------------------------------------------------
- [output] Info K8s/Flux/Charts [Successfully collected Flux Helm Charts]
- [output] Info K8s/Resources/Config [Successfully collected K8s configMaps ]
- [output] Info K8s/AgentControl/agent-control-status-server [Successfully collected K8s agent-control status se...]
- [output] Info K8s/Resources/Daemonset [Successfully collected K8s newrelic-infrastructure...]
- [output] Info K8s/Resources/Pods [Successfully collected K8s newrelic-infrastructure...]
- [output] Info K8s/Flux/Repositories [Successfully collected Flux Helm Repositories]
- [output] Info K8s/AgentControl/helm-controller-logs [Successfully collected K8s agent-control helm-cont...]
- [output] Info K8s/Env/Version [kubectl version output successfully collected]
- [output] Info K8s/Resources/Deploy [Successfully collected K8s newrelic-infrastructure...]
- [output] Info K8s/Helm/Releases [Successfully collected the list of helm releases]
- [output] Info K8s/AgentControl/agent-control-logs [Successfully collected K8s agent-control agent-con...]
- [output] Info K8s/Flux/Releases [Successfully collected Flux Helm Releases]
- [output] Info K8s/AgentControl/source-controller-logs [Successfully collected K8s agent-control source-co...]
- [output] See nrdiag-output.json for full results. 
`Error: UPGRADE FAILED: "agent-control-bootstrap" has no deployed releases`
+ 
+ Running the uninstallation again won't work; the logs from the uninstall job will show an error like:
+ 
+ `Error: uninstall: Release not loaded: agent-control-cd: release: not found`
+ 
+ **Solution**
+ 
+ Delete all Helm secrets from your release (replace agent-control-bootstrap with the name of your release if you changed it):
+ ```shell
+ kubectl delete secrets -l "name=agent-control-bootstrap"
 ```

-4. All the logs and resources related to the agent-control are saved in the `nrdiag_output.zip` file in the current directory. You can analyze the contents of the zip file or open a support ticket with [New Relic support](https://support.newrelic.com) for further assistance.

+ Then you can run the installation again.
+ 

+ 
+ The New Relic diagnostics tool `NRDiag` gathers resources and logs related to Agent Control in your cluster for debugging.
+ Follow these steps to gather all the data:
+ 
+ 1. On your host, install the `NRDiag` tool using the [getting started guide](/docs/new-relic-solutions/solve-common-issues/diagnostics-cli-nrdiag/diagnostics-cli-nrdiag/#get-started).
+ 
+ 2. Run the K8s agent control suite:
+ 
+ 
+ Ensure that `kubectl` and `helm` are installed.
+ 
+ 
+ - Run the command in the namespace set in kubeconfig's context:
+ ```bash
+ ./nrdiag -suites K8s-agent-control
+ ```
+ 
+ - Specify a different namespace for Agent Control using the `--k8s-namespace` flag:
+ ```bash
+ ./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic
+ ```
+ 
+ - Specify a different namespace for sub-agents using the `--ac-agents-namespace` flag:
+ ```bash
+ ./nrdiag -suites K8s-agent-control --k8s-namespace=newrelic-agent-control --ac-agents-namespace=newrelic
+ ```
+ 
+ 3. 
The expected output should look like the following report:
+ 
+ ```bash
+ [output] Check Results
+ [output] -------------------------------------------------
+ [output] Info K8s/Flux/Charts [Successfully collected Flux Helm Charts]
+ [output] Info K8s/Resources/Config [Successfully collected K8s configMaps ]
+ [output] Info K8s/AgentControl/agent-control-status-server [Successfully collected K8s agent-control status se...]
+ [output] Info K8s/Resources/Daemonset [Successfully collected K8s newrelic-infrastructure...]
+ [output] Info K8s/Resources/Pods [Successfully collected K8s newrelic-infrastructure...]
+ [output] Info K8s/Flux/Repositories [Successfully collected Flux Helm Repositories]
+ [output] Info K8s/AgentControl/helm-controller-logs [Successfully collected K8s agent-control helm-cont...]
+ [output] Info K8s/Env/Version [kubectl version output successfully collected]
+ [output] Info K8s/Resources/Deploy [Successfully collected K8s newrelic-infrastructure...]
+ [output] Info K8s/Helm/Releases [Successfully collected the list of helm releases]
+ [output] Info K8s/AgentControl/agent-control-logs [Successfully collected K8s agent-control agent-con...]
+ [output] Info K8s/Flux/Releases [Successfully collected Flux Helm Releases]
+ [output] Info K8s/AgentControl/source-controller-logs [Successfully collected K8s agent-control source-co...]
+ [output] See nrdiag-output.json for full results.
+ ```
+ 
+ 4. All the logs and resources related to Agent Control are saved in the `nrdiag_output.zip` file in the current directory. You can analyze the contents of the zip file or open a support ticket with [New Relic support](https://support.newrelic.com) for further assistance. 
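When reviewing a saved copy of this report, filtering out the successful `Info` rows makes any other result stand out. A minimal sketch, using a hard-coded sample with a fabricated `Warning` row for illustration:

```shell
# Illustrative only: a fabricated two-row NRDiag report; a real script would
# read the saved report text instead of this sample.
report='Info     K8s/Flux/Charts    [Successfully collected Flux Helm Charts]
Warning  K8s/Resources/Pods [Could not collect pods]'

# Print every row whose first column is not "Info".
printf '%s\n' "$report" | awk '$1 != "Info"'
```

The command above prints only the fabricated `Warning` row.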
+ 
+ 
+ 

## Linux hosts troubleshooting

@@ -191,8 +239,8 @@ Follow these steps to gather all the data:
 * Check the logs provided with the installation script:
 - If you see `Error creating an identity`, please ensure your user key belongs to a platform user with the [All product admin](https://docs.newrelic.com/docs/accounts/accounts-billing/new-relic-one-user-management/user-management-concepts/#standard-roles) role.
- * Check the status of the `newrelic-agent-control` service:
- 
+ * Check the status of the `newrelic-agent-control` service:
+ 
 ```bash
 sudo systemctl status newrelic-agent-control
 ```
@@ -200,18 +248,18 @@ Follow these steps to gather all the data:
 If the service appears in `failed` or `stopped` state, this means the agent got installed but there's an issue preventing its normal operation. Check the agent service's logs using `journalctl` (or any similar Linux tool):
- ```bash
+ ```bash
 journalctl -u newrelic-agent-control
 ```
- 
- If no insights are available, check how to [run the agent in debug mode](/docs/new-relic-agent-control#debug) to access detailed logs explaining why the service cannot be started.
+ 
+ If no insights are available, check how to [run the agent in debug mode](/docs/new-relic-agent-control#debug) to access detailed logs explaining why the service cannot be started.
 * If the service is not installed, try appending `--debug` at the end of the [install command](#cli) and run it again. This will enable verbose logging for the installation script. See if the verbose output has additional context explaining the error.
 * Optionally, answer `yes` when asked to send logs to New Relic to help troubleshoot the installation. 
Once submitted, logs can be accessed with the following NRQL query:
- 
+ 
 ```sql
 SELECT * FROM Log WHERE hostname = 'your-host-name'
 ```
- 
+ 
 To access logs, you'll need to first enable agent logging by following these steps:

@@ -305,7 +353,7 @@ Follow these steps to gather all the data:

 Agent Control performs certain validations before receiving and applying remote configuration from Fleet Control.

- Additionally, configurations might have a valid format (for example, valid .yaml structure) but include unexpected values for certain settings (in example, string when integer is expected).
+ Additionally, configurations might have a valid format (for example, valid .yaml structure) but include unexpected values for certain settings (for example, string when integer is expected).

 The following table shows common errors for the different supported agents: