---
title: "DatabaseComponentUnreadyReplicas"
description: "Database resource has a component with replicas which were declared to be unready"
summary: ""
date: 2025-06-05T13:52:09+03:00
lastmod: 2025-06-05T13:52:09+03:00
draft: false
weight: 100
toc: true
seo:
title: "" # custom title (optional)
description: "Database resource has a component with replicas which were declared to be unready" # custom description (recommended)
canonical: "" # custom canonical URL (optional)
noindex: false # false (default) or true
---

## Meaning

Database component has unready replicas.

{{< details "Full context" open >}}
The Database resource has a component with replicas that were declared unready.
Database components impacted by this alert are Transaction Engines (TEs) and Storage Managers (SMs).
For example, a database is expected to have 2 TE replicas, but has had fewer than that for a noticeable period of time.

On rare occasions, there may be more replicas than requested that the system has not cleaned up.
{{< /details >}}

### Symptom

To manually evaluate the conditions for this alert, follow the steps below.

A database that has a component with unready replicas will have the `Ready` status condition set to `False`.
List all unready databases.

```sh
JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[?(@.type=="Ready")]}{@.type}={@.status}{"\n"}{end}{end}'
kubectl get database -o jsonpath="$JSONPATH" | grep "Ready=False"
```

Inspect the database component status and compare the `replicas` and `readyReplicas` fields.

```sh
kubectl get database <name> -o jsonpath='{.status.components}' | jq
```

## Impact

Service degradation or unavailability.

A NuoDB database is fault-tolerant and remains available even if a certain number of database processes are down.
Depending on the database configuration, however, this might impact the availability of certain data partitions (storage groups) or of client applications that use custom load-balancing rules.

## Diagnosis

- Check the database state using `kubectl describe database <name>`.
- Check the database component state and message.
- Check how many replicas are declared for this component.
- List and check the status of all pods associated with the database's Helm release.
- Check if there are issues with provisioning or attaching disks to pods.
- Check if the cluster-autoscaler is able to create new nodes.
- Check pod logs and identify issues during database process startup.
- Check the NuoDB process state.
  Kubernetes readiness probes require that the database processes are in the `MONITORED:RUNNING` state.
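
The commands below are a minimal sketch of these checks; `<name>`, `<pod-name>`, and `<sm-pod>` are placeholders, and the `release` label and `engine` container name assume the defaults used by the NuoDB Helm charts.

```sh
# Database status and component summary
kubectl describe database <name>

# Pods backing the database's Helm release, including scheduling status
RELEASE_NAME=$(kubectl get database <name> -o jsonpath='{.spec.template.releaseName}')
kubectl get pods -l release=$RELEASE_NAME -o wide

# Recent warning events (disk provisioning/attachment, scheduling, autoscaling)
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp

# Logs from a specific pod's engine container
kubectl logs <pod-name> -c engine

# NuoDB process state; readiness requires MONITORED:RUNNING
kubectl exec -ti <sm-pod> -- nuocmd show domain
```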

### Scenarios

{{< details "Scenario 1: Pod in `Pending` status for a long time" >}}

Possible causes for a Pod not being scheduled:

- A container on the Pod requests a resource not available in the cluster
- The Pod has affinity rules that do not match any available worker node
- One of the containers mounts a volume provisioned in an availability zone (AZ) where no Kubernetes worker is available
- A PersistentVolumeClaim (PVC) created for this Pod has a storage class that may be misconfigured or unusable
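
To tell which of these applies, read the Pod's scheduling events and the state of its volume claims; `<pod-name>` and `<pvc-name>` are placeholders.

```sh
# Scheduling failures (resources, affinity, zone, unbound PVCs) are recorded
# as events on the Pod itself
kubectl describe pod <pod-name>

# Inspect the PersistentVolumeClaims used by the Pod and their storage class
kubectl get pvc
kubectl describe pvc <pvc-name>
```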

{{< /details >}}

{{< details "Scenario 2: Pod in `CreateContainerConfigError` status for a long time" >}}

Possible causes for a container not being created:

- The container depends on a resource that does not exist yet (e.g. ConfigMap or Secret)
- NuoDB Control Plane external operator did not populate the database connection details yet
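
The missing resource is reported in the container's waiting reason; the command below is a sketch with `<pod-name>` as a placeholder.

```sh
# Show why each container cannot be created, e.g. a ConfigMap or Secret
# that does not exist yet
kubectl get pod <pod-name> -o jsonpath='{range .status.containerStatuses[*]}{@.name}: {@.state.waiting.reason} {@.state.waiting.message}{"\n"}{end}'
```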

{{< /details >}}

{{< details "Scenario 3: Database process fails to join the domain" >}}

Upon startup, the main _engine_ container process communicates with the NuoDB Admin to register the database process with the domain and start it using the NuoDB binary.

Possible causes for unsuccessful startup during this phase are:

- Network issues prevent communication between the container entrypoint client scripts and NuoDB Admin REST API
- The NuoDB Admin layer is not available or has no Raft leader
- No Raft quorum in the NuoDB Admin prevents committing new Raft commands
- AP with ordinal 0 formed a separate domain. In case of catastrophic loss of the `admin-0` container (i.e. its durable domain state `raftlog` file is lost), it might form a new domain causing a split-brain scenario. For more information, see [Setting _bootstrapServers_ Helm value](https://github.com/nuodb/nuodb-helm-charts/blob/v3.10.0/stable/admin/values.yaml#L106).
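
To verify that the NuoDB Admin layer is reachable and has a Raft leader, run the commands below from a running pod; the pod names, admin service name, and port 8888 are assumptions based on the defaults in the NuoDB Helm charts.

```sh
# Show the admin servers and their Raft roles; exactly one LEADER is expected
kubectl exec -ti <admin-pod> -- nuocmd show domain

# Verify the engine pod can reach the admin REST API
kubectl exec -ti <engine-pod> -- nuocmd --api-server https://<admin-service>:8888 get servers
```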

{{< /details >}}

{{< details "Scenario 4: Database process fails to join the database" >}}

Once started, a database process communicates with the rest of the database and executes an entry protocol.

Possible causes for unsuccessful startup during this phase are:

- Network issues prevent communication between NuoDB database processes
- No suitable entry node is available
- The database process binary version is too old
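
The state of each database process can be inspected with nuocmd; pod and database names are placeholders.

```sh
# Show all processes of the database and their current states
kubectl exec -ti <admin-pod> -- nuocmd show database --db-name <db-name>

# Raw process list, useful for spotting engine version mismatches
kubectl exec -ti <admin-pod> -- nuocmd get processes --db-name <db-name>
```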

{{< /details >}}

{{< details "Scenario 5: An SM in `TRACKED` state for a long time" >}}

The database state might be `AWAITING_ARCHIVE_HISTORIES_MSG` indicating that the database leader assignment is in progress.
NuoDB Admin must collect archive history information from all provisioned archives on database cold start.
This requires all SM processes to start and connect to the NuoDB Admin within the configured timeout period.

Possible causes for unsuccessful leader assignment:

- Not all SMs have been scheduled by Kubernetes or not all SM processes have started
- Some of the SM pods are in `CrashLoopBackOff` state with long back-off
- There is a _defunct_ archive metadata provisioned in the domain which is not served by an actual SM
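
A sketch for checking the database state and the provisioned archives; pod and database names are placeholders.

```sh
# The database state is shown next to the database name,
# e.g. AWAITING_ARCHIVE_HISTORIES_MSG during leader assignment
kubectl exec -ti <admin-pod> -- nuocmd show domain

# List archive metadata; an archive with no running SM process may be
# defunct metadata left in the domain
kubectl exec -ti <admin-pod> -- nuocmd show archives --db-name <db-name>
```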

{{< /details >}}

{{< details "Scenario 6: An TE in `TRACKED` state for a long time" >}}

A TE process joins the database via an entry node, which is normally the first SM that reaches the `RUNNING` state.
NuoDB Admin performs synchronization tasks so that TEs are started after the entry node is available.

Possible causes for missing entry node:

- Database leader assignment is not performed after a cold start. See _Scenario 5_
- The `UNPARTITIONED` storage group is not in `RUNNING` state
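
To confirm whether an entry node is available, check the SM processes and storage groups; pod and database names are placeholders.

```sh
# At least one SM must be RUNNING to act as the entry node for TEs
kubectl exec -ti <admin-pod> -- nuocmd show database --db-name <db-name>

# List the storage groups for the database and check the UNPARTITIONED group
kubectl exec -ti <admin-pod> -- nuocmd get storage-groups --db-name <db-name>
```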

{{< /details >}}

{{< details "Scenario 7: SM in `CONFIGURED:RECOVERING_JOURNAL` state for a long time" >}}

Upon startup, SM processes perform a journal recovery.
This may be time-consuming if there are many journal entries to recover.

The SM process reports the progress of the journal recovery which is displayed in `nuocmd show domain` output.

Possible causes for slow journal recovery:

- High latency of the archive disk caused by reaching the IOPS limit

{{< /details >}}

### Example

Get the database name and its namespace from the alert's labels.
Inspect the database state in the Kubernetes cluster.

```sh
kubectl get database acme-messaging-demo -n nuodb-cp-system
```

Notice that the `READY` status condition is `False`, which means that the database is in a degraded state.

```text
NAME                  TIER       VERSION   READY   SYNCED   DISABLED   AGE
acme-messaging-demo   n0.small   6.0.2     False   True     False      46h
```

Inspect the database components state.

```sh
kubectl get database acme-messaging-demo -o jsonpath='{.status.components}' | jq
```

The output below indicates issues with scheduling the `te-acme-messaging-demo-zfb77wc-5cd8b5f7c4-qnplm` Pod because of insufficient memory in the cluster.
The mismatch between `replicas` and `readyReplicas` for this component triggers this alert.

```json
{
  "lastUpdateTime": "2025-06-06T13:08:19Z",
  "storageManagers": [
    {
      "kind": "StatefulSet",
      "name": "sm-acme-messaging-demo-zfb77wc",
      "readyReplicas": 2,
      "replicas": 2,
      "state": "Ready",
      "version": "v1"
    }
  ],
  "transactionEngines": [
    {
      "kind": "Deployment",
      "message": "there is an active rollout for deployment/te-acme-messaging-demo-zfb77wc; pod/te-acme-messaging-demo-zfb77wc-5cd8b5f7c4-qnplm: 0/1 nodes are available: 1 Insufficient memory. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.",
      "name": "te-acme-messaging-demo-zfb77wc",
      "readyReplicas": 5,
      "replicas": 6,
      "state": "Updating",
      "version": "v1"
    }
  ]
}
```

If needed, drill down to the Pod resources associated with the database by using the command below.

```sh
RELEASE_NAME=$(kubectl get database acme-messaging-demo -o jsonpath='{.spec.template.releaseName}')
kubectl get pods -l release=$RELEASE_NAME
```

Obtain the NuoDB domain state by running [nuocmd show domain](https://doc.nuodb.com/nuodb/latest/reference-information/command-line-tools/nuodb-command/nuocmd-reference/#show-domain) and [nuocmd show database](https://doc.nuodb.com/nuodb/latest/reference-information/command-line-tools/nuodb-command/nuocmd-reference/#show-database) inside any NuoDB pod that has `Running` status.

```sh
SM_POD=$(kubectl get pod \
  -l release=${RELEASE_NAME},component=sm \
  --field-selector=status.phase==Running \
  -o jsonpath='{.items[0].metadata.name}')

kubectl exec -ti $SM_POD -- nuocmd show domain
```