From aa02c94555fce2d86a4c895261da2f295345416c Mon Sep 17 00:00:00 2001 From: Stanimir Ivanov Date: Sat, 7 Jun 2025 17:26:38 +0300 Subject: [PATCH 1/3] Added runbook for existing alerts --- .gitignore | 1 + .../databasecomponentunreadyreplicas.md | 200 ++++++++++++++++++ .../troubleshoot/databasereleaseoutofsync.md | 187 ++++++++++++++++ .../troubleshoot/databaseunavailable.md | 38 ++++ .../domaincomponentunreadyreplicas.md | 134 ++++++++++++ .../troubleshoot/domainreleaseoutofsync.md | 113 ++++++++++ .../troubleshoot/domainunavailable.md | 40 ++++ 7 files changed, 713 insertions(+) create mode 100644 content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md create mode 100644 content/docs/administration/troubleshoot/databasereleaseoutofsync.md create mode 100644 content/docs/administration/troubleshoot/databaseunavailable.md create mode 100644 content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md create mode 100644 content/docs/administration/troubleshoot/domainreleaseoutofsync.md create mode 100644 content/docs/administration/troubleshoot/domainunavailable.md diff --git a/.gitignore b/.gitignore index ca421e6..636681b 100644 --- a/.gitignore +++ b/.gitignore @@ -5,3 +5,4 @@ node_modules public resources/_gen hugo_stats.json +.DS_Store diff --git a/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md b/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md new file mode 100644 index 0000000..faddcee --- /dev/null +++ b/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md @@ -0,0 +1,200 @@ +--- +title: "DatabaseComponentUnreadyReplicas" +description: "" +summary: "" +date: 2025-06-05T13:52:09+03:00 +lastmod: 2025-06-05T13:52:09+03:00 +draft: false +weight: 100 +toc: true +seo: + title: "" # custom title (optional) + description: "" # custom description (recommended) + canonical: "" # custom canonical URL (optional) + noindex: false # false (default) or true 
+---
+
+## Meaning
+
+Database component has unready replicas.
+
+{{< details "Full context" open >}}
+Database resource has a component with replicas which were declared to be unready.
+Database components impacted by this alert are Transaction Engines (TEs) and Storage Managers (SMs)
+For example, it is expected for a database to have 2 TE replicas, but it has less than that for a noticeable period of time.
+
+On rare occasions, there may be more replicas than it should and system did not clean it up.
+{{< /details >}}
+
+## Impact
+
+Service degradation or unavailability.
+
+A NuoDB database is fault-tolerant and remains available even if a certain number of database processes are down.
+Depending on the database configuration, however, this might have an impact on the database availability of certain data partitions (storage groups) or client applications using custom load-balancing rules.
+
+## Diagnosis
+
+- Check the database state using `kubectl describe database <name>`.
+- Check the database component state and message.
+- Check how many replicas are declared for this component.
+- List and check the status of all pods associated with the database's Helm release.
+- Check if there are issues with provisioning or attaching disks to pods.
+- Check if the cluster-autoscaler is able to create new nodes.
+- Check pod logs and identify issues during database process startup.
+- Check the NuoDB process state.
+Kubernetes readiness probes require that the database processes are in `MONITORED:RUNNING` state.
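The readiness check in the Diagnosis list can be scripted. A minimal sketch, assuming `jq` is available and using a hypothetical inline sample of the `.status.components` payload (the field layout mirrors the example output later in this runbook; names like `te-demo` are illustrative only):

```sh
# Hypothetical sample of what the following command returns on a live cluster:
#   kubectl get database <name> -o jsonpath='{.status.components}'
# Embedded inline so the filter can be demonstrated without a cluster.
components='
{
  "storageManagers": [
    {"kind": "StatefulSet", "name": "sm-demo", "readyReplicas": 2, "replicas": 2, "state": "Ready"}
  ],
  "transactionEngines": [
    {"kind": "Deployment", "name": "te-demo", "readyReplicas": 5, "replicas": 6, "state": "Updating"}
  ]
}'

# Flag every component whose readyReplicas lags behind replicas,
# which is the condition that triggers this alert.
echo "$components" | jq -r '
  [.storageManagers[]?, .transactionEngines[]?]
  | .[]
  | select(.readyReplicas != .replicas)
  | "\(.kind)/\(.name): \(.readyReplicas)/\(.replicas) ready (\(.state))"'
# prints: Deployment/te-demo: 5/6 ready (Updating)
```

On a live cluster, the embedded sample would be replaced by the real `kubectl get database <name> -o jsonpath='{.status.components}'` output.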
+ +### Scenarios + +{{< details "Symptom 1: Pod in `Pending` status for a long time" >}} + +Possible causes for a Pod not being scheduled: + +- A container on the Pod requests a resource not available in the cluster +- The Pod has affinity rules that do not match any available worker node +- One of the containers mounts a volume provisioned in the availability zone (AZ) where no Kubernetes worker is available + +{{< /details >}} + +{{< details "Symptom 2: Pod in `CreateContainerConfigError` status for a long time" >}} + +Possible causes for a container not being created: + +- The container depends on a resource that does not exist yet (e.g. ConfigMap or Secret) +- NuoDB Control Plane external operator did not populate the database connection details yet + +{{< /details >}} + +{{< details "Symptom 3: Database process fails to join the domain" >}} + +Upon startup, the main _engine_ container process communicates with the NuoDB Admin to register the database process with the domain and start it using the NuoDB binary. + +Possible causes for unsuccessful startup during this phase are: + +- Network issues prevent communication between the container entrypoint client scripts and NuoDB Admin REST API +- The NuoDB Admin layer is not available or has no Raft leader +- No Raft quorum in the NuoDB Admin prevents committing new Raft commands +- AP with ordinal 0 formed a separate domain. In case of catastrophic loss of the `admin-0` container (i.e. its durable domain state `raftlog` file is lost), it might form a new domain causing a split-brain scenario. For more information, see [Setting _bootstrapServers_ Helm value](https://github.com/nuodb/nuodb-helm-charts/blob/v3.10.0/stable/admin/values.yaml#L106). + +{{< /details >}} + +{{< details "Symptom 4: Database process fails to join the database" >}} + +Once started, a database process communicates with the rest of the database and executes an entry protocol. 
+
+Possible causes for unsuccessful startup during this phase are:
+
+- Network issues prevent communication between NuoDB database processes
+- No suitable entry node is available
+- The database process binary version is too old
+
+{{< /details >}}
+
+{{< details "Symptom 5: An SM in `TRACKED` state for a long time" >}}
+
+The database state might be `AWAITING_ARCHIVE_HISTORIES_MSG` indicating that the database leader assignment is in progress.
+NuoDB Admin must collect archive history information from all provisioned archives on database cold start.
+This requires all SM processes to start and connect to the NuoDB Admin within the configured timeout period.
+
+Possible causes for unsuccessful leader assignment:
+
+- Not all SMs have been scheduled by Kubernetes or not all SM processes have started
+- Some of the SM pods are in `CrashLoopBackOff` state with long back-off
+- There is a _ghost_ archive metadata provisioned in the domain which is not served by an actual SM
+
+{{< /details >}}
+
+{{< details "Symptom 6: A TE in `TRACKED` state for a long time" >}}
+
+A TE process joins the database via an entry node which is normally the first SM that goes to `RUNNING` state.
+NuoDB Admin performs synchronization tasks so that TEs are started after the entry node is available.
+
+Possible causes for missing entry node:
+
+- Database leader assignment is not performed after cold start. See _Symptom 5_
+- The `UNPARTITIONED` storage group is not in `RUNNING` state
+
+{{< /details >}}
+
+{{< details "Symptom 7: SM in `CONFIGURED:RECOVERING_JOURNAL` state for a long time" >}}
+
+Upon startup, SM processes perform a journal recovery procedure by applying any transaction messages to the atoms.
+This involves extensive disk IO and may continue for a while depending on the backlog of messages and the number of atoms to which they are applied.
+The SM process reports the progress of the journal recovery which is displayed in `nuocmd show domain` output.
+
+Possible causes for slow journal recovery:
+
+- High latency of the archive disk caused by reaching the IOPS limit
+
+{{< /details >}}
+
+### Example
+
+Get the database name and its namespace from the alert's labels.
+Inspect the database state in the Kubernetes cluster.
+
+```sh
+kubectl get database acme-messaging-demo -n nuodb-cp-system
+```
+
+Notice that the `READY` status condition is `False` which means that the database is in a degraded state.
+
+```text
+NAME                  TIER       VERSION   READY   SYNCED   DISABLED   AGE
+acme-messaging-demo   n0.small   6.0.2     False   True     False      46h
+```
+
+Inspect the database components state.
+
+```sh
+kubectl get database acme-messaging-demo -o jsonpath='{.status.components}' | jq
+```
+
+The output below indicates issues with scheduling the `te-acme-messaging-demo-zfb77wc-5cd8b5f7c4-qnplm` Pod because of insufficient memory on the cluster.
+The mismatch between `replicas` and `readyReplicas` for this component triggers this alert.
+
+```json
+{
+  "lastUpdateTime": "2025-06-06T13:08:19Z",
+  "storageManagers": [
+    {
+      "kind": "StatefulSet",
+      "name": "sm-acme-messaging-demo-zfb77wc",
+      "readyReplicas": 2,
+      "replicas": 2,
+      "state": "Ready",
+      "version": "v1"
+    }
+  ],
+  "transactionEngines": [
+    {
+      "kind": "Deployment",
+      "message": "there is an active rollout for deployment/te-acme-messaging-demo-zfb77wc; pod/te-acme-messaging-demo-zfb77wc-5cd8b5f7c4-qnplm: 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.",
+      "name": "te-acme-messaging-demo-zfb77wc",
+      "readyReplicas": 5,
+      "replicas": 6,
+      "state": "Updating",
+      "version": "v1"
+    }
+  ]
+}
+```
+
+If needed, drill down to the Pod resources associated with the database using the command below.
+```sh
+RELEASE_NAME=$(kubectl get database acme-messaging-demo -o jsonpath='{.spec.template.releaseName}')
+kubectl get pods -l release=$RELEASE_NAME
+```
+
+Obtain NuoDB domain state by running [nuocmd show domain](https://doc.nuodb.com/nuodb/latest/reference-information/command-line-tools/nuodb-command/nuocmd-reference/#show-domain) and [nuocmd show database](https://doc.nuodb.com/nuodb/latest/reference-information/command-line-tools/nuodb-command/nuocmd-reference/#show-database) inside any NuoDB pod that has `Running` status.
+
+```sh
+SM_POD=$(kubectl get pod \
+    -l release=${RELEASE_NAME},component=sm \
+    --field-selector=status.phase==Running \
+    -o jsonpath='{.items[0].metadata.name}')
+
+kubectl exec -ti $SM_POD -- nuocmd show domain
+```
diff --git a/content/docs/administration/troubleshoot/databasereleaseoutofsync.md b/content/docs/administration/troubleshoot/databasereleaseoutofsync.md
new file mode 100644
index 0000000..ce093a9
--- /dev/null
+++ b/content/docs/administration/troubleshoot/databasereleaseoutofsync.md
@@ -0,0 +1,187 @@
+---
+title: "DatabaseReleaseOutOfSync"
+description: ""
+summary: ""
+date: 2025-06-05T13:52:09+03:00
+lastmod: 2025-06-05T13:52:09+03:00
+draft: false
+weight: 110
+toc: true
+seo:
+  title: "" # custom title (optional)
+  description: "" # custom description (recommended)
+  canonical: "" # custom canonical URL (optional)
+  noindex: false # false (default) or true
+---
+
+## Meaning
+
+Database release is out of sync.
+
+{{< details "Full context" open >}}
+Database resource desired state is out of sync.
+The corresponding database Helm release install/upgrade operation failed to apply the latest Helm values.
+{{< /details >}}
+
+## Impact
+
+Latest database configuration is not enforced.
+
+A new database won't become available.
+Connectivity to already available databases is not impacted by this issue, however, some features that require applying database configuration changes might be unavailable (e.g.
start/stop database, TLS, and DBA password rotation).
+
+## Diagnosis
+
+- Check the database state using `kubectl describe database <name>`.
+- Check the database `Released` condition's state and message.
+- Check the `Released` condition's state and message for the corresponding _HelmApp_ resource.
+- List the Helm revisions for the Helm release associated with the database.
+- Check the latest Helm values for the failed Helm release.
+- Check that Helm chart repository services are available.
+By default, the public NuoDB Helm charts [repository](https://nuodb.github.io/nuodb-helm-charts) in GitHub is used, however, this can be overridden.
+
+### Scenarios
+
+{{< details "Symptom 1: Helm charts repository not available" >}}
+
+The NuoDB operator fails to reach the Helm chart repository and reports the following error:
+
+```text
+unable to fetch chart:
+  unable to resolve chart:
+  looks like \"http://nuodb-helm-repo\" is not a valid chart repository or cannot be reached
+```
+
+Possible causes for Helm repository unreachable:
+
+- Public Helm repository has an outage
+- Private Helm in-cluster repository is not running
+- Helm repository URL is incorrect
+- Authentication is required to allow access to the Helm repository
+
+{{< /details >}}
+
+{{< details "Symptom 2: Helm chart name or version is not found" >}}
+
+The NuoDB operator fails to download the Helm chart and reports the following error:
+
+```text
+unable to fetch chart:
+  unable to resolve chart:
+  chart \"database\" version \"3.99.0\" not found in http://nuodb-helm-repo repository
+```
+
+Possible causes for Helm chart not found:
+
+- Helm chart name or version is incorrect
+- Helm repository URL is incorrect
+
+{{< /details >}}
+
+{{< details "Symptom 3: Helm chart resource create/update failure" >}}
+
+Helm operations target the Kubernetes API server directly.
+The Kubernetes API server and admission controllers validate incoming resources, and any errors will result in failure of the entire Helm operation.
+NuoDB Control Plane (CP) enforces extensive validation on NuoDB resources to prevent invalid configuration, however, there are other factors that result in Helm operation errors.
+
+Possible causes for resource creation/update failure:
+
+- Kubernetes API server is not available
+- Configured admission controller is unavailable and validation/mutation webhooks can't be executed
+- The NuoDB operator doesn't have required RBAC permissions to create resources of a specific group-kind
+- There is a _ResourceQuota_ which limits a specific resource
+- An immutable resource field is updated during a `helm upgrade` operation
+
+{{< /details >}}
+
+{{< callout context="caution" title="Exhausting resource sync attempts" icon="outline/alert-triangle" >}}
+
+The NuoDB operator will retry failed Helm operations with the configured retry count (`20` by default) and increasing backoff (starting at `60s` by default).
+Once the retries are exhausted, the Helm operation reconciliation will be suspended.
+To re-activate Helm release reconciliation for such resources after the root cause is fixed, see [Reset Helm operation attempts](#reset-helm-operation-attempts).
+
+{{< /callout >}}
+
+### Example
+
+Get the database name and its namespace from the alert's labels.
+Inspect the database state in the Kubernetes cluster.
+
+```sh
+kubectl get database acme-messaging-demo -n nuodb-cp-system
+```
+
+Notice that the `SYNCED` value is `False` which means that the database desired state is not enforced.
+
+```text
+NAME                  TIER       VERSION   READY   SYNCED   DISABLED   AGE
+acme-messaging-demo   n0.small   6.0.2     False   False    False      62h
+```
+
+Inspect the database `Released` condition.
+
+```sh
+kubectl get database acme-messaging-demo -o jsonpath='{.status.conditions[?(@.type=="Released")]}' | jq
+```
+
+The output below indicates issues with the corresponding `acme-messaging-demo-zfb77wc` release because an existing _ResourceQuota_ limits creation of the Helm values Secret resource.
+This failure happens even before invoking the Helm operation.
+
+```json
+{
+  "lastTransitionTime": "2025-06-10T09:32:08Z",
+  "message": "failed to reconcile database release acme-messaging-demo-zfb77wc: unable to process Secret default/acme-messaging-demo-zfb77wc-values: secrets \"acme-messaging-demo-zfb77wc-values\" is forbidden: exceeded quota: quota-account, requested: count/secrets=1, used: count/secrets=782, limited: count/secrets=782",
+  "observedGeneration": 1,
+  "reason": "ReconciliationFailed",
+  "status": "False",
+  "type": "Released"
+}
+```
+
+If needed, drill down to the _HelmApp_ resources and Helm revisions associated with the database using the commands below.
+
+```sh
+RELEASE_NAME=$(kubectl get database acme-messaging-demo -o jsonpath='{.spec.template.releaseName}')
+kubectl describe helmapp $RELEASE_NAME
+helm history $RELEASE_NAME
+```
+
+To inspect the Helm values supplied during the Helm operation, execute:
+
+```sh
+kubectl get secret "${RELEASE_NAME}-values" -o jsonpath='{.data.values}' | base64 -d
+```
+
+To inspect Helm values associated with a particular Helm release, execute:
+
+```sh
+helm get values $RELEASE_NAME
+```
+
+### Reset Helm operation attempts
+
+Validate that the _HelmApp_ retries are exhausted.
+
+```sh
+RELEASE_NAME=$(kubectl get database acme-messaging-demo -o jsonpath='{.spec.template.releaseName}')
+kubectl get helmapp $RELEASE_NAME -o jsonpath='{.status.conditions[?(@.type=="Released")]}' | jq
+```
+
+Notice that the `Released` condition has the `RetriesExhausted` reason, meaning that Helm release reconciliation won't be retried anymore.
+ +```json +{ + "reason": "RetriesExhausted", + "status": "False", + "type": "Released" +} +``` + +Reset the failure count by patching the status sub-resource. + +```sh +kubectl patch helmapp $RELEASE_NAME \ + --type merge \ + --subresource status \ + -p '{"status": {"release": {"failures": 0}}}' +``` diff --git a/content/docs/administration/troubleshoot/databaseunavailable.md b/content/docs/administration/troubleshoot/databaseunavailable.md new file mode 100644 index 0000000..5e113af --- /dev/null +++ b/content/docs/administration/troubleshoot/databaseunavailable.md @@ -0,0 +1,38 @@ +--- +title: "DatabaseUnavailable" +description: "" +summary: "" +date: 2025-06-05T13:52:09+03:00 +lastmod: 2025-06-05T13:52:09+03:00 +draft: false +weight: 103 +toc: true +seo: + title: "" # custom title (optional) + description: "" # custom description (recommended) + canonical: "" # custom canonical URL (optional) + noindex: false # false (default) or true +--- + +## Meaning + +Database is not available. + +{{< details "Full context" open >}} +Database is not available to SQL applications. +There are no Transaction Engines (TEs) ready to service clients. +{{< /details >}} + +## Impact + +Service unavailability. + +The database is down and SQL applications can't connect. + +## Diagnosis + +See [Diagnosing database component]({{< ref "databasecomponentunreadyreplicas#diagnosis" >}}). + +### Scenarios + +See [Database component failures]({{< ref "databasecomponentunreadyreplicas#scenarios" >}}). 
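The condition behind the DatabaseUnavailable alert (no ready TEs at all) can be spot-checked from the same `.status.components` document used above. A sketch, assuming `jq` is available; the payload is a hypothetical inline sample mirroring the component layout shown in these runbooks:

```sh
# Hypothetical sample of:
#   kubectl get database <name> -o jsonpath='{.status.components}'
components='{"transactionEngines":[{"kind":"Deployment","name":"te-demo","readyReplicas":0,"replicas":2}]}'

# Sum ready TE replicas across all TE components; a total of 0 means
# no TE can service SQL clients, i.e. the database is unavailable.
ready_tes=$(echo "$components" | jq '[.transactionEngines[]?.readyReplicas] | add // 0')

if [ "$ready_tes" -eq 0 ]; then
  echo "no ready TEs: database unavailable to SQL applications"
fi
```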
diff --git a/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md b/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md new file mode 100644 index 0000000..749fc10 --- /dev/null +++ b/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md @@ -0,0 +1,134 @@ +--- +title: "DomainComponentUnreadyReplicas" +description: "" +summary: "" +date: 2025-06-05T13:52:09+03:00 +lastmod: 2025-06-05T13:52:09+03:00 +draft: false +weight: 105 +toc: true +seo: + title: "" # custom title (optional) + description: "" # custom description (recommended) + canonical: "" # custom canonical URL (optional) + noindex: false # false (default) or true +--- + +## Meaning + +Domain component has unready replicas. + +{{< details "Full context" open >}} +Domain resource has replicas which were declared to be unready. +Domain component impacted by this alert is NuoDB Admin Process (AP). +For example, it is expected for a domain to have 3 AP replicas, but it has less than that for a noticeable period of time. +{{< /details >}} + +## Impact + +Service degradation or unavailability. + +The NuoDB domain is fault-tolerant and remains available even if a certain number of APs are down. +If half of the APs go down unexpectedly, this impacts the ability to commit Raft commands such as performing domain configuration changes and starting database processes. + +The APs perform load-balancing for SQL connections to Transaction Engines (TEs) which are not in `UNKNOWN` state. Unavailable APs might impact obtaining new SQL connections for all databases in the domain. 
+
+{{< callout context="note" title="Note" icon="outline/info-circle" >}}
+For more information on NuoDB Admin quorum, see [Admin Process (AP) Quorum](https://doc.nuodb.com/nuodb/latest/domain-admin/admin-process-quorum/) and [Admin Scale-down with Kubernetes Aware Admin](https://doc.nuodb.com/nuodb/latest/deployment-models/kubernetes-environments/kubernetes-aware-admin/#admin-scaledown).
+{{< /callout >}}
+
+## Diagnosis
+
+- Check the domain state using `kubectl describe domain <name>`.
+- Check the domain component state and message.
+- Check how many replicas are declared for this component.
+- List and check the status of all pods associated with the domain's Helm release.
+- Check if there are issues with provisioning or attaching disks to pods.
+- Check if the cluster-autoscaler is able to create new nodes.
+- Check pod logs and identify issues during AP startup.
+- Check the NuoDB process state.
+Kubernetes readiness probes require that the APs are in `Connected` state and caught up with the Raft leader.
+
+### Scenarios
+
+{{< details "Symptom 1: Pod in `Pending` status for a long time" >}}
+
+Possible causes for a Pod not being scheduled:
+
+- A container on the Pod requests a resource not available in the cluster
+- The Pod has affinity rules that do not match any available worker node
+- One of the containers mounts a volume provisioned in the availability zone (AZ) where no Kubernetes worker is available
+
+{{< /details >}}
+
+{{< details "Symptom 2: AP fails to join the domain" >}}
+
+Upon startup, the AP communicates with its peers to join the domain and receives the domain state from the Raft leader.
+For more information, check [Admin Process Peering](https://doc.nuodb.com/nuodb/latest/domain-admin/admin-process/#_admin_process_ap_peering).
+ +Possible causes for unsuccessful startup during this phase are: + +- Network issues prevent communication between the AP and its peers +- Incorrect initial domain membership or `peer` configuration + +{{< /details >}} + +### Example + +Get the domain name and its namespace from the alert's labels. +Inspect the domain state in the Kubernetes cluster. + +```sh +kubectl get domain acme-messaging -n nuodb-cp-system +``` + +Notice that the `READY` status condition is `False` which means that the domain is in a degraded state. + +```text +NAME TIER VERSION READY SYNCED DISABLED AGE +acme-messaging n0.small 6.0.2 False True False 46h +``` + +Inspect the domain components state. + +```sh +kubectl get domain acme-messaging -o jsonpath='{.status.components}' | jq +``` + +The output below indicates issues with scheduling `acme-messaging-fc4bwd8-2` Pod because the `acme-messaging-fc4bwd8-2-eph-volume` volume is not provisioned by the persistent volume controller. +The mismatch between `replicas` and `readyReplicas` for this component triggers this alert. + +```json +{ + "admins": [ + { + "kind": "StatefulSet", + "message": "pod/acme-messaging-fc4bwd8-2: 0/1 nodes are available: waiting for ephemeral volume controller to create the persistentvolumeclaim \"acme-messaging-fc4bwd8-2-eph-volume\". preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.", + "name": "acme-messaging-fc4bwd8", + "readyReplicas": 2, + "replicas": 3, + "state": "NotReady", + "version": "v1" + } + ], + "lastUpdateTime": "2025-06-06T14:14:57Z" +} +``` + +If needed, drill down to the Pod and PVC resources associated with the domain by using the below command. 
+
+```sh
+RELEASE_NAME=$(kubectl get domain acme-messaging -o jsonpath='{.spec.template.releaseName}')
+kubectl get pods,pvc -l release=$RELEASE_NAME
+```
+
+Obtain NuoDB domain state by running [nuocmd show domain](https://doc.nuodb.com/nuodb/latest/reference-information/command-line-tools/nuodb-command/nuocmd-reference/#show-domain) inside any NuoDB pod that has `Running` status.
+
+```sh
+ADMIN_POD=$(kubectl get pod \
+    -l release=${RELEASE_NAME},component=admin \
+    --field-selector=status.phase==Running \
+    -o jsonpath='{.items[0].metadata.name}')
+
+kubectl exec -ti $ADMIN_POD -- nuocmd show domain
+```
diff --git a/content/docs/administration/troubleshoot/domainreleaseoutofsync.md b/content/docs/administration/troubleshoot/domainreleaseoutofsync.md
new file mode 100644
index 0000000..bf9df2e
--- /dev/null
+++ b/content/docs/administration/troubleshoot/domainreleaseoutofsync.md
@@ -0,0 +1,113 @@
+---
+title: "DomainReleaseOutOfSync"
+description: ""
+summary: ""
+date: 2025-06-05T13:52:09+03:00
+lastmod: 2025-06-05T13:52:09+03:00
+draft: false
+weight: 115
+toc: true
+seo:
+  title: "" # custom title (optional)
+  description: "" # custom description (recommended)
+  canonical: "" # custom canonical URL (optional)
+  noindex: false # false (default) or true
+---
+
+## Meaning
+
+Domain release is out of sync.
+
+{{< details "Full context" open >}}
+Domain resource desired state is out of sync.
+The corresponding domain Helm release install/upgrade operation failed to apply the latest Helm values.
+{{< /details >}}
+
+## Impact
+
+Latest domain configuration is not enforced.
+
+A new domain won't become available.
+Connectivity to already provisioned domains is not impacted by this issue, however, some features that require applying domain configuration changes might be unavailable (e.g. start/stop domain, TLS rotation).
+
+## Diagnosis
+
+- Check the domain state using `kubectl describe domain <name>`.
+- Check the domain `Released` condition's state and message.
+- Check the `Released` condition's state and message for the corresponding _HelmApp_ resource.
+- List the Helm revisions for the Helm release associated with the domain.
+- Check the latest Helm values for the failed Helm release.
+- Check that Helm chart repository services are available.
+By default, the public NuoDB Helm charts [repository](https://nuodb.github.io/nuodb-helm-charts) in GitHub is used, however, this can be overridden.
+
+### Scenarios
+
+See [Helm operation failures]({{< ref "databasereleaseoutofsync#scenarios" >}}).
+
+### Example
+
+Get the domain name and its namespace from the alert's labels.
+Inspect the domain state in the Kubernetes cluster.
+
+```sh
+kubectl get domain acme-messaging -n nuodb-cp-system
+```
+
+Notice that the `SYNCED` value is `False` which means that the domain desired state is not enforced.
+
+```text
+NAME             TIER       VERSION   READY   SYNCED   DISABLED   AGE
+acme-messaging   n0.small   6.0.2     False   False    False      62h
+```
+
+Inspect the domain `Released` condition.
+
+```sh
+kubectl get domain acme-messaging -o jsonpath='{.status.conditions[?(@.type=="Released")]}' | jq
+```
+
+The output below indicates issues with the corresponding _HelmApp_ resource `acme-messaging-fc4bwd8`.
+
+```json
+{
+  "lastTransitionTime": "2025-06-10T10:27:21Z",
+  "message": "synchronization failed for applications [acme-messaging-fc4bwd8]",
+  "observedGeneration": 1,
+  "reason": "ReconciliationFailed",
+  "status": "False",
+  "type": "Released"
+}
+```
+
+Inspect the _HelmApp_ resource associated with the domain.
+
+```sh
+RELEASE_NAME=$(kubectl get domain acme-messaging -o jsonpath='{.spec.template.releaseName}')
+kubectl describe helmapp $RELEASE_NAME
+```
+
+Check the `Released` status condition of the _HelmApp_.
+
+```sh
+kubectl get helmapp $RELEASE_NAME -o jsonpath='{.status.conditions[?(@.type=="Released")]}' | jq
+```
+
+The output below indicates issues with the Helm upgrade operation.
+An existing _ResourceQuota_ limits creation of ConfigMap resources.
+ +```json +{ + "lastTransitionTime": "2025-06-10T10:27:21Z", + "message": "HelmApp upgrade failed: error='failed to create resource: configmaps \"acme-messaging-fc4bwd8-readinessprobe\" is forbidden: exceeded quota: quota-account, requested: count/configmaps=1, used: count/configmaps=15, limited: count/configmaps=15', values='{\"admin\":{\"domain\":\"acme-messaging-fc4bwd8\", ... }'", + "observedGeneration": 1, + "reason": "UpgradeFailed", + "status": "False", + "type": "Released" +} +``` + +If needed, drill down to the Helm revisions associated with the domain by using the below commands. + +```sh +helm history $RELEASE_NAME +``` diff --git a/content/docs/administration/troubleshoot/domainunavailable.md b/content/docs/administration/troubleshoot/domainunavailable.md new file mode 100644 index 0000000..8d6a4b8 --- /dev/null +++ b/content/docs/administration/troubleshoot/domainunavailable.md @@ -0,0 +1,40 @@ +--- +title: "DomainUnavailable" +description: "" +summary: "" +date: 2025-06-05T13:52:09+03:00 +lastmod: 2025-06-05T13:52:09+03:00 +draft: false +weight: 107 +toc: true +seo: + title: "" # custom title (optional) + description: "" # custom description (recommended) + canonical: "" # custom canonical URL (optional) + noindex: false # false (default) or true +--- + +## Meaning + +Domain is not available. + +{{< details "Full context" open >}} +Domain is not available to load-balance SQL clients or accept database configuration changes. +There are no NuoDB Admin processes (APs) ready in the NuoDB domain. +{{< /details >}} + +## Impact + +Service unavailability. + +New SQL connections to any database in the domain will fail. +Already established SQL connections are not impacted. +No new database processes can be started in this domain. + +## Diagnosis + +See [Diagnosing domain component]({{< ref "domaincomponentunreadyreplicas#diagnosis" >}}). + +### Scenarios + +See [Domain component failures]({{< ref "domaincomponentunreadyreplicas#scenarios" >}}). 
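The DomainUnavailable condition (zero ready APs) can be spot-checked the same way as the database variant, from the domain's `.status.components` document. A sketch, assuming `jq` is available; the payload is a hypothetical inline sample mirroring the component layout shown in the domain runbook examples:

```sh
# Hypothetical sample of:
#   kubectl get domain <name> -o jsonpath='{.status.components}'
components='{"admins":[{"kind":"StatefulSet","name":"acme-messaging-fc4bwd8","readyReplicas":0,"replicas":3}]}'

# Sum ready AP replicas; a total of 0 means no AP can load-balance
# SQL clients or accept domain configuration changes.
ready_aps=$(echo "$components" | jq '[.admins[]?.readyReplicas] | add // 0')

if [ "$ready_aps" -eq 0 ]; then
  echo "no ready APs: domain unavailable"
fi
```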
From 32888d6a7e5cf618da6a5c2012b0b7ef63e40298 Mon Sep 17 00:00:00 2001 From: Stanimir Ivanov Date: Mon, 23 Jun 2025 13:28:57 +0300 Subject: [PATCH 2/3] Address review comments --- .../databasecomponentunreadyreplicas.md | 48 +++++++++++++------ .../troubleshoot/databasereleaseoutofsync.md | 26 ++++++++-- .../troubleshoot/databaseunavailable.md | 6 ++- .../domaincomponentunreadyreplicas.md | 24 ++++++++-- .../troubleshoot/domainreleaseoutofsync.md | 20 +++++++- .../troubleshoot/domainunavailable.md | 6 ++- 6 files changed, 105 insertions(+), 25 deletions(-) diff --git a/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md b/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md index faddcee..d3b76c9 100644 --- a/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md +++ b/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md @@ -1,6 +1,6 @@ --- title: "DatabaseComponentUnreadyReplicas" -description: "" +description: "Database resource has a component with replicas which were declared to be unready" summary: "" date: 2025-06-05T13:52:09+03:00 lastmod: 2025-06-05T13:52:09+03:00 @@ -9,7 +9,7 @@ weight: 100 toc: true seo: title: "" # custom title (optional) - description: "" # custom description (recommended) + description: "Database resource has a component with replicas which were declared to be unready" # custom description (recommended) canonical: "" # custom canonical URL (optional) noindex: false # false (default) or true --- @@ -20,12 +20,30 @@ Database component has unready replicas. {{< details "Full context" open >}} Database resource has a component with replicas which were declared to be unready. -Database components impacted by this alert are Transaction Engines (TEs) and Storage Managers (SMs) +Database components impacted by this alert are Transaction Engines (TEs) and Storage Managers (SMs). 
 For example, it is expected for a database to have 2 TE replicas, but it has less than that for a noticeable period of time.
 
-On rare occasions, there may be more replicas than it should and system did not clean it up.
+On rare occasions, there may be more replicas than requested and the system did not clean them up.
 {{< /details >}}
 
+### Symptom
+
+To manually evaluate the conditions for this alert, follow the steps below.
+
+A database which has a component with unready replicas will have the `Ready` status condition set to `False`.
+List all unready databases.
+
+```sh
+JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[?(@.type=="Ready")]}{@.type}={@.status}{"\n"}{end}{end}'
+kubectl get database -o jsonpath="$JSONPATH" | grep "Ready=False"
+```
+
+Inspect the database component status and compare the `replicas` and `readyReplicas` fields.
+
+```sh
+kubectl get database <name> -o jsonpath='{.status.components}' | jq
+```
+
 ## Impact
 
 Service degradation or unavailability.
@@ -47,17 +65,17 @@ Kubernetes readiness probes require that the database processes are in `MONITORED:RUNNING` state.
 ### Scenarios
 
-{{< details "Symptom 1: Pod in `Pending` status for a long time" >}}
+{{< details "Scenario 1: Pod in `Pending` status for a long time" >}}
 
 Possible causes for a Pod not being scheduled:
 
 - A container on the Pod requests a resource not available in the cluster
 - The Pod has affinity rules that do not match any available worker node
-- One of the containers mounts a volume provisioned in the availability zone (AZ) where no Kubernetes worker is available
+- One of the containers mounts a volume provisioned in an availability zone (AZ) where no Kubernetes worker is available
 
 {{< /details >}}
 
-{{< details "Symptom 2: Pod in `CreateContainerConfigError` status for a long time" >}}
+{{< details "Scenario 2: Pod in `CreateContainerConfigError` status for a long time" >}}
 
 Possible causes for a container not
being created: {{< /details >}} -{{< details "Symptom 3: Database process fails to join the domain" >}} +{{< details "Scenario 3: Database process fails to join the domain" >}} Upon startup, the main _engine_ container process communicates with the NuoDB Admin to register the database process with the domain and start it using the NuoDB binary. @@ -79,7 +97,7 @@ Possible causes for unsuccessful startup during this phase are: {{< /details >}} -{{< details "Symptom 4: Database process fails to join the database" >}} +{{< details "Scenario 4: Database process fails to join the database" >}} Once started, a database process communicates with the rest of the database and executes an entry protocol. @@ -91,7 +109,7 @@ Possible causes for unsuccessful startup during this phase are: {{< /details >}} -{{< details "Symptom 5: An SM in `TRACKED` state for a long time" >}} +{{< details "Scenario 5: An SM in `TRACKED` state for a long time" >}} The database state might be `AWAITING_ARCHIVE_HISTORIES_MSG` indicating that the database leader assignment is in progress. NuoDB Admin must collect archive history information from all provisioned archives on database cold start. @@ -101,11 +119,11 @@ Possible causes for unsuccessful leader assignment: - Not all SMs have been scheduled by Kubernetes or not all SM processes have started - Some of the SM pods are in `CrashLoopBackOff` state with long back-off -- There is a _ghost_ archive metadata provisioned in the domain which is not served by an actual SM +- There is a _defunct_ archive metadata provisioned in the domain which is not served by an actual SM {{< /details >}} -{{< details "Symptom 6: An TE in `TRACKED` state for a long time" >}} +{{< details "Scenario 6: A TE in `TRACKED` state for a long time" >}} A TE process joins the database via an entry node which is normally the first SM that goes to `RUNNING` state. NuoDB Admin performs synchronization tasks so that TEs are started after the entry node is available.
@@ -117,10 +135,10 @@ Possible causes for missing entry node: {{< /details >}} -{{< details "Symptom 7: SM in `CONFIGURED:RECOVERING_JOURNAL` state for a long time" >}} +{{< details "Scenario 7: SM in `CONFIGURED:RECOVERING_JOURNAL` state for a long time" >}} -Upon startup, SM processes perform a journal recovery procedure by applying any transaction messages to the atoms. -This involves extensive disk IO and may continue for a while depending on the backlog of messages and the number of atoms to which they are applied. +Upon startup, SM processes perform a journal recovery. +This may be time consuming if there are many journal entries to recover. The SM process reports the progress of the journal recovery which is displayed in `nuocmd show domain` output. Possible causes for slow journal recovery: diff --git a/content/docs/administration/troubleshoot/databasereleaseoutofsync.md b/content/docs/administration/troubleshoot/databasereleaseoutofsync.md index ce093a9..92a30ac 100644 --- a/content/docs/administration/troubleshoot/databasereleaseoutofsync.md +++ b/content/docs/administration/troubleshoot/databasereleaseoutofsync.md @@ -9,7 +9,7 @@ weight: 110 toc: true seo: title: "" # custom title (optional) - description: "" # custom description (recommended) + description: "Database resource desired state is out of sync" # custom description (recommended) canonical: "" # custom canonical URL (optional) noindex: false # false (default) or true --- @@ -23,6 +23,24 @@ Database resource desired state is out of sync. The corresponding database Helm release install/upgrade operation failed to apply the latest Helm values. {{< /details >}} +### Symptom + +To manually evaluate the conditions for this alert follow the steps below. + +Database which desired state is out of sync will have the `Released` status condition set to `False`. +List all out of sync databases. 
+ +```sh +JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[?(@.type=="Released")]}{@.type}={@.status}{"\n"}{end}{end}' +kubectl get database -o jsonpath="$JSONPATH" | grep "Released=False" +``` + +Inspect the database `Released` condition message for more details. + +```sh +kubectl get database -o jsonpath='{.status.conditions[?(@.type=="Released")]}' | jq +``` + ## Impact Latest database configuration is not enforced. @@ -42,7 +60,7 @@ By default, the public NuoDB Helm charts [repository](https://nuodb.github.io/nu ### Scenarios -{{< details "Symptom 1: Helm charts repository not available" >}} +{{< details "Scenario 1: Helm charts repository not available" >}} The NuoDB operator fails to reach the Helm chart repository and reports the following error: @@ -61,7 +79,7 @@ Possible causes for Helm repository unreachable: {{< /details >}} -{{< details "Symptom 2: Helm chart name or version are not found" >}} +{{< details "Scenario 2: Helm chart name or version are not found" >}} The NuoDB operator fails to download the Helm chart and reports the following error: @@ -78,7 +96,7 @@ Possible causes for Helm chart not found: {{< /details >}} -{{< details "Symptom 3: Helm chart resource create/update failure" >}} +{{< details "Scenario 3: Helm chart resource create/update failure" >}} Helm operations are targeting the Kubernetes API server directly. Kubernetes API server and admission controllers are validating incoming resources and any errors will result in failure of the entire Helm operation. 
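The `Released=False` filtering added in the symptom section above can be tried without a live cluster. The sketch below substitutes a hypothetical captured listing (the database names are made up) for the real `kubectl get database -o jsonpath=…` output, then applies the same `grep`/`cut` filtering:

```shell
# Hypothetical output of the JSONPath listing from the symptom section,
# captured as a literal so the filter can be exercised without a cluster
LISTING='acme-messaging-demo:Released=False
acme-billing-prod:Released=True
acme-search-dev:Released=False'

# Keep only the out-of-sync databases and strip the condition suffix;
# prints acme-messaging-demo and acme-search-dev
OUT_OF_SYNC=$(printf '%s\n' "$LISTING" | grep 'Released=False' | cut -d: -f1)
printf '%s\n' "$OUT_OF_SYNC"
```

The same pipeline works for the domain variant by swapping in `kubectl get domain` output.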
diff --git a/content/docs/administration/troubleshoot/databaseunavailable.md b/content/docs/administration/troubleshoot/databaseunavailable.md index 5e113af..949409f 100644 --- a/content/docs/administration/troubleshoot/databaseunavailable.md +++ b/content/docs/administration/troubleshoot/databaseunavailable.md @@ -9,7 +9,7 @@ weight: 103 toc: true seo: title: "" # custom title (optional) - description: "" # custom description (recommended) + description: "Database is not available to SQL applications" # custom description (recommended) canonical: "" # custom canonical URL (optional) noindex: false # false (default) or true --- @@ -23,6 +23,10 @@ Database is not available to SQL applications. There are no Transaction Engines (TEs) ready to service clients. {{< /details >}} +### Symptom + +To manually evaluate the conditions for this alert, see [Unready database component symptom]({{< ref "databasecomponentunreadyreplicas#symptom" >}}). + ## Impact Service unavailability. diff --git a/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md b/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md index 749fc10..ca31b88 100644 --- a/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md +++ b/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md @@ -9,7 +9,7 @@ weight: 105 toc: true seo: title: "" # custom title (optional) - description: "" # custom description (recommended) + description: "Domain resource has replicas which were declared to be unready" # custom description (recommended) canonical: "" # custom canonical URL (optional) noindex: false # false (default) or true --- @@ -24,6 +24,24 @@ Domain component impacted by this alert is NuoDB Admin Process (AP). For example, it is expected for a domain to have 3 AP replicas, but it has less than that for a noticeable period of time. 
{{< /details >}} +### Symptom + +To manually evaluate the conditions for this alert follow the steps below. + +Domain which has a component with unready replicas will have the `Ready` status condition set to `False`. +List all unready domains. + +```sh +JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[?(@.type=="Ready")]}{@.type}={@.status}{"\n"}{end}{end}' +kubectl get domain -o jsonpath="$JSONPATH" | grep "Ready=False" +``` + +Inspect the domain component status and compare the `replicas` and `readyReplicas` fields. + +```sh +kubectl get domain -o jsonpath='{.status.components.admins}' | jq +``` + ## Impact Service degradation or unavailability. @@ -51,7 +69,7 @@ Kubernetes readiness probes require that the APs are in `Connected` state and ca ### Scenarios -{{< details "Symptom 1: Pod in `Pending` status for a long time" >}} +{{< details "Scenario 1: Pod in `Pending` status for a long time" >}} Possible causes for a Pod not being scheduled: @@ -61,7 +79,7 @@ Possible causes for a Pod not being scheduled: {{< /details >}} -{{< details "Symptom 2: AP fails to join the domain" >}} +{{< details "Scenario 2: AP fails to join the domain" >}} Upon startup, the AP communicates with its peers to join the domain and receives the domain state from the Raft leader. For more information, check [Admin Process Peering](https://doc.nuodb.com/nuodb/latest/domain-admin/admin-process/#_admin_process_ap_peering). 
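The replica comparison described in the symptom section above reduces to "readyReplicas less than replicas". A minimal sketch of that check, where the two numbers are hypothetical stand-ins for the fields returned by `kubectl get domain -o jsonpath='{.status.components.admins}'`:

```shell
# Hypothetical values standing in for the `replicas` and `readyReplicas`
# fields of the domain component status
REPLICAS=3
READY_REPLICAS=1

# The alert condition: fewer ready replicas than requested
if [ "$READY_REPLICAS" -lt "$REPLICAS" ]; then
  UNREADY=$((REPLICAS - READY_REPLICAS))
  echo "unready AP replicas: $UNREADY"   # prints: unready AP replicas: 2
fi
```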
diff --git a/content/docs/administration/troubleshoot/domainreleaseoutofsync.md b/content/docs/administration/troubleshoot/domainreleaseoutofsync.md index bf9df2e..e0ac87e 100644 --- a/content/docs/administration/troubleshoot/domainreleaseoutofsync.md +++ b/content/docs/administration/troubleshoot/domainreleaseoutofsync.md @@ -9,7 +9,7 @@ weight: 115 toc: true seo: title: "" # custom title (optional) - description: "" # custom description (recommended) + description: "Domain resource desired state is out of sync" # custom description (recommended) canonical: "" # custom canonical URL (optional) noindex: false # false (default) or true --- @@ -23,6 +23,24 @@ Domain resource desired state is out of sync. The corresponding domain Helm release install/upgrade operation failed to apply the latest Helm values. {{< /details >}} +### Symptom + +To manually evaluate the conditions for this alert follow the steps below. + +Domain which desired state is out of sync will have the `Released` status condition set to `False`. +List all out of sync domains. + +```sh +JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[?(@.type=="Released")]}{@.type}={@.status}{"\n"}{end}{end}' +kubectl get domain -o jsonpath="$JSONPATH" | grep "Released=False" +``` + +Inspect the domain `Released` condition message for more details. + +```sh +kubectl get domain -o jsonpath='{.status.conditions[?(@.type=="Released")]}' | jq +``` + ## Impact Latest domain configuration is not enforced.
diff --git a/content/docs/administration/troubleshoot/domainunavailable.md b/content/docs/administration/troubleshoot/domainunavailable.md index 8d6a4b8..7106f70 100644 --- a/content/docs/administration/troubleshoot/domainunavailable.md +++ b/content/docs/administration/troubleshoot/domainunavailable.md @@ -9,7 +9,7 @@ weight: 107 toc: true seo: title: "" # custom title (optional) - description: "" # custom description (recommended) + description: "Domain is not available to load-balance SQL clients or accept database configuration changes" # custom description (recommended) canonical: "" # custom canonical URL (optional) noindex: false # false (default) or true --- @@ -23,6 +23,10 @@ Domain is not available to load-balance SQL clients or accept database configura There are no NuoDB Admin processes (APs) ready in the NuoDB domain. {{< /details >}} +### Symptom + +To manually evaluate the conditions for this alert, see [Unready domain component symptom]({{< ref "domaincomponentunreadyreplicas#symptom" >}}). + ## Impact Service unavailability. 
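The DomainUnavailable condition above amounts to "zero ready APs in the domain". A minimal sketch of that check, using a hypothetical captured readiness listing (the pod names are made up) in place of live output from the referenced symptom commands:

```shell
# Hypothetical AP readiness listing, shaped like the symptom check output
APS='demo-admin-0:Ready=True
demo-admin-1:Ready=False
demo-admin-2:Ready=False'

# Count ready APs; the domain is unavailable only when none are ready
READY_COUNT=$(printf '%s\n' "$APS" | grep -c 'Ready=True')
if [ "$READY_COUNT" -eq 0 ]; then
  echo "domain unavailable: no ready APs"
else
  echo "ready APs: $READY_COUNT"   # prints: ready APs: 1
fi
```

Note that a single ready AP clears this alert but may still trigger DomainComponentUnreadyReplicas, since fewer replicas are ready than requested.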
From a3ea3edb3ff52c5dc2247e8005f52d8cab983661 Mon Sep 17 00:00:00 2001 From: Stanimir Ivanov Date: Tue, 24 Jun 2025 14:35:21 +0300 Subject: [PATCH 3/3] Address review comments (round 2) --- .../troubleshoot/databasecomponentunreadyreplicas.md | 5 +++-- .../troubleshoot/databasereleaseoutofsync.md | 11 ++++++----- .../troubleshoot/domaincomponentunreadyreplicas.md | 3 ++- .../troubleshoot/domainreleaseoutofsync.md | 6 +++--- 4 files changed, 14 insertions(+), 11 deletions(-) diff --git a/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md b/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md index d3b76c9..95dd01f 100644 --- a/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md +++ b/content/docs/administration/troubleshoot/databasecomponentunreadyreplicas.md @@ -23,14 +23,14 @@ Database resource has a component with replicas which were declared to be unread Database components impacted by this alert are Transaction Engines (TEs) and Storage Managers (SMs). For example, it is expected for a database to have 2 TE replicas, but it has less than that for a noticeable period of time. -On rare occasions, there may be more replicas than request and the system did not clean them up. +On rare occasions, there may be more replicas than requested and the system did not clean them up. {{< /details >}} ### Symptom To manually evaluate the conditions for this alert follow the steps below. -Database which has a component with unready replicas will have the `Ready` status condition set to `False`. +A database, which has a component with unready replicas, will have the `Ready` status condition set to `False`. List all unready databases. 
```sh @@ -72,6 +72,7 @@ Possible causes for a Pod not being scheduled: - A container on the Pod requests a resource not available in the cluster - The Pod has affinity rules that do not match any available worker node - One of the containers mounts a volume provisioned in an availability zone (AZ) where no Kubernetes worker is available +- A Persistent volume claim (PVC) created for this Pod has a storage class that may be misconfigured or unusable {{< /details >}} diff --git a/content/docs/administration/troubleshoot/databasereleaseoutofsync.md b/content/docs/administration/troubleshoot/databasereleaseoutofsync.md index 92a30ac..995f80e 100644 --- a/content/docs/administration/troubleshoot/databasereleaseoutofsync.md +++ b/content/docs/administration/troubleshoot/databasereleaseoutofsync.md @@ -27,7 +27,7 @@ The corresponding database Helm release install/upgrade operation failed to appl To manually evaluate the conditions for this alert follow the steps below. -Database which desired state is out of sync will have the `Released` status condition set to `False`. +A database, in which the desired state is out of sync, will have the `Released` status condition set to `False`. List all out of sync databases. ```sh @@ -98,8 +98,8 @@ Possible causes for Helm chart not found: {{< details "Scenario 3: Helm chart resource create/update failure" >}} -Helm operations are targeting the Kubernetes API server directly. -Kubernetes API server and admission controllers are validating incoming resources and any errors will result in failure of the entire Helm operation. +Helm operations make requests on the Kubernetes API server directly. +Kubernetes API server and admission controllers validate the incoming resources and any errors will result in failure of the entire Helm operation. NuoDB Control Plane (CP) enforces extensive validation on NuoDB resources to prevent invalid configuration, however, there are other factors that result in Helm operation errors. 
Possible causes for resource creation/update failure: @@ -109,10 +109,11 @@ Possible causes for resource creation/update failure: - The NuoDB operator doesn't have required RBAC permissions to create resources of a specific group-kind - There is a _ResourceQuota_ which limits a specific resource - A resource immutable field is updated during `helm upgrade` operation +- Collisions with resources that exist in the cluster but are not managed by Helm or by the NuoDB operator {{< /details >}} -{{< callout context="caution" title="Exhausting resource sync attemts" icon="outline/alert-triangle" >}} +{{< callout context="caution" title="Exhausting resource sync attempts" icon="outline/alert-triangle" >}} NuoDB operator will retry failed Helm operations with the configured retry count (`20` by default) and increasing backoff (starting at `60s` by default). Once the retries are exhausted, the Helm operation reconciliation will be suspended. @@ -148,7 +149,7 @@ This failure happens even before invoking the Helm operation.
```json { "lastTransitionTime": "2025-06-10T09:32:08Z", - "message": "failed to reconcile database release acme-messaging-demo-zfb77wc: unable to process Secret default/acme-messaging-demo-zfb77wc-values: secrets \"acme-messaging-demo-zfb77wc-values\" is forbidden: exceeded quota: quota-account, requested: count/secrets=1, used: count/secrets=782, limited: count/secrets=782", + "message": "failed to reconcile database release acme-messaging-demo-zfb77wc: unable to process Secret default/acme-messaging-demo-zfb77wc-values: secrets \"acme-messaging-demo-zfb77wc-values\" is forbidden: exceeded quota: quota-account, requested: count/secrets=1, used: count/secrets=500, limited: count/secrets=500", "observedGeneration": 1, "reason": "ReconciliationFailed", "status": "False", diff --git a/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md b/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md index ca31b88..dd88f89 100644 --- a/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md +++ b/content/docs/administration/troubleshoot/domaincomponentunreadyreplicas.md @@ -28,7 +28,7 @@ For example, it is expected for a domain to have 3 AP replicas, but it has less To manually evaluate the conditions for this alert follow the steps below. -Domain which has a component with unready replicas will have the `Ready` status condition set to `False`. +A domain, which has a component with unready replicas, will have the `Ready` status condition set to `False`. List all unready domains. 
```sh @@ -76,6 +76,7 @@ Possible causes for a Pod not being scheduled: - A container on the Pod requests a resource not available in the cluster - The Pod has affinity rules that do not match any available worker node - One of the containers mounts a volume provisioned in the availability zone (AZ) where no Kubernetes worker is available +- A Persistent volume claim (PVC) created for this Pod has a storage class that may be misconfigured or unusable {{< /details >}} diff --git a/content/docs/administration/troubleshoot/domainreleaseoutofsync.md b/content/docs/administration/troubleshoot/domainreleaseoutofsync.md index e0ac87e..97f4892 100644 --- a/content/docs/administration/troubleshoot/domainreleaseoutofsync.md +++ b/content/docs/administration/troubleshoot/domainreleaseoutofsync.md @@ -27,7 +27,7 @@ The coresponding domain Helm release install/upgrade operation failed to apply t To manually evaluate the conditions for this alert follow the steps below. -Domain which desired state is out of sync will have the `Released` status condition set to `False`. +A domain, in which the desired state is out of sync, will have the `Released` status condition set to `False`. List all out of sync domains. ```sh @@ -46,7 +46,7 @@ kubectl get domain -o jsonpath='{.status.conditions[?(@.type=="Released") Latest domain configuration is not enforced. New domain won't become available. -Connectivity to already provisioned domains is not impacted by this issue, however, some features that require applying domain configuration changes migth be unavailable (e.g. start/stop domain, TLS rotation, etc.). +Connectivity to already provisioned domains is not impacted by this issue, however, some features that require applying domain configuration changes might be unavailable (e.g. start/stop domain, TLS rotation, etc.). 
## Diagnosis @@ -56,7 +56,7 @@ Connectivity to already provisioned domains is not impacted by this issue, howev - List the Helm revisions for the Helm release associated with the domain. - Check the latest Helm values for the failed Helm release. - Check that Helm chart repository services are available. -By default, the public NuoDB Helm charts [repository](https://nuodb.github.io/nuodb-helm-charts) in GitHub is used, however, this can be overriden. +By default, the public NuoDB Helm charts [repository](https://nuodb.github.io/nuodb-helm-charts) in GitHub is used, however, this can be overridden. ### Scenarios
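The retry behavior described in the callout for failed Helm operations (a configured retry count with back-off starting at `60s` by default) means an out-of-sync release can take a long time to self-heal before reconciliation is suspended. The loop below illustrates how such an increasing back-off accumulates; only the `60s` initial delay comes from the text, while the doubling factor and the five iterations shown are assumptions for illustration, not the operator's exact schedule:

```shell
# Illustrative back-off accumulation: 60s initial delay per the callout,
# doubling factor assumed for this sketch
DELAY=60
TOTAL=0
for RETRY in 1 2 3 4 5; do
  TOTAL=$((TOTAL + DELAY))
  echo "retry $RETRY after ${DELAY}s (cumulative: ${TOTAL}s)"
  DELAY=$((DELAY * 2))
done
```

With these assumed parameters, just five retries already span over half an hour, which is why fixing the underlying cause and not waiting for retries is the faster path.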