Operator doesn't revalidate cluster state if node decommission failed due to disk size check #639

Closed
quercusnick opened this issue Apr 22, 2024 · 2 comments · Fixed by #643
Labels: bug, done

@quercusnick

What happened?

During node decommission, the operator detected that a pod in the statefulset doesn't have enough free space to absorb data from the decommissioned node.
Operator log:

2024-04-22T12:59:54.887Z	ERROR	Reconciler error	{"controller": "cassandradatacenter_controller", "controllerGroup": "cassandra.datastax.com", "controllerKind": "CassandraDatacenter", "CassandraDatacenter": {"name":"data","namespace":"NAMESPACE"}, "namespace": "NAMESPACE", "name": "data", "reconcileID": "25b7c6a7-a9dc-4fb3-9e7f-64a1c858c3ab", "error": "datacenter data is not in a valid state: Not enough free space available to decommission. k8ssandra-data-default-sts-5 has 1414103935512 free space, but 2202792832087 is needed."}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:326
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2

Cluster status:

Status:
  Cassandra Operator Progress:  Updating
  Conditions:
    Last Transition Time:    2023-11-21T07:18:34Z
    Message:
    Reason:
    Status:                  True
    Type:                    Healthy
    Last Transition Time:    2023-08-10T13:19:30Z
    Message:
    Reason:
    Status:                  False
    Type:                    Stopped
    Last Transition Time:    2023-08-10T13:19:30Z
    Message:
    Reason:
    Status:                  False
    Type:                    ReplacingNodes
    Last Transition Time:    2023-08-10T13:19:30Z
    Message:
    Reason:
    Status:                  False
    Type:                    Updating
    Last Transition Time:    2023-08-10T13:19:30Z
    Message:
    Reason:
    Status:                  False
    Type:                    RollingRestart
    Last Transition Time:    2023-08-10T13:19:30Z
    Message:
    Reason:
    Status:                  False
    Type:                    Resuming
    Last Transition Time:    2024-03-28T13:42:51Z
    Message:
    Reason:
    Status:                  True
    Type:                    ScalingDown
    Last Transition Time:    2024-03-28T13:42:52Z
    Message:                 Not enough free space available to decommission. k8ssandra-data-default-sts-5 has 1414103935512 free space, but 2202792832087 is needed.
    Reason:                  notEnoughSpaceToScaleDown
    Status:                  False
    Type:                    Valid
    Last Transition Time:    2023-08-10T13:19:30Z
    Message:
    Reason:
    Status:                  True
    Type:                    Initialized
    Last Transition Time:    2023-08-10T13:19:30Z
    Message:
    Reason:
    Status:                  True
    Type:                    Ready

We've increased the PVCs for all pods in the statefulset, but the operator doesn't revalidate the cluster:

 kubectl -n NAMESPACE exec k8ssandra-data-default-sts-5 -c cassandra -- df -B1 /var/lib/cassandra
Filesystem         1B-blocks          Used     Available Use% Mounted on
/dev/nvme1n1   4755807707136 1840917585920 2914873344000  39% /var/lib/cassandra
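
For reference, the expansion itself was done roughly as sketched below; the PVC name (server-data-…) and the target size are assumptions based on the statefulset's volume claim naming, and expansion requires a storage class with allowVolumeExpansion enabled.

# Sketch only: expand the data PVC of one pod (repeat for each pod in the statefulset);
# the PVC name and target size are assumed, adjust to your environment.
kubectl -n NAMESPACE patch pvc server-data-k8ssandra-data-default-sts-5 \
  --type merge \
  -p '{"spec":{"resources":{"requests":{"storage":"4500Gi"}}}}'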

Currently the cluster is effectively locked, as we can neither add nor remove a node. It looks like the operator always stops at this step https://github.com/k8ssandra/cass-operator/blob/v1.14.0/pkg/reconciliation/reconcile_racks.go#L2293 and never proceeds to updating the cluster status.

What did you expect to happen?

The operator revalidates the state of the cluster and decommissions the node.

How can we reproduce it (as minimally and precisely as possible)?

  • Create a cluster and generate some amount of data.
  • Inside any pod of the cluster, create a file that consumes disk space on /var/lib/cassandra, e.g. fallocate -l SIZEG file_name
  • Remove one node by decreasing size in the corresponding CassandraDatacenter object (see the sketch after this list).
  • Remove the generated file and try to either increase or decrease the cluster size again.
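
For illustration, the scale-down step (third bullet) could be triggered roughly as sketched below; the datacenter name (data), namespace, and target size are assumptions.

# Sketch only: scale the datacenter down by one node by decreasing spec.size;
# datacenter name, namespace, and target size are assumed.
kubectl -n NAMESPACE patch cassandradatacenter data --type merge -p '{"spec":{"size":5}}'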

cass-operator version

v1.14.0

Kubernetes version

Server Version: v1.23.2

Method of installation

Argo

Anything else we need to know?

No response

@quercusnick added the bug label Apr 22, 2024
@burmanm
Contributor

burmanm commented Apr 24, 2024

Hey, you're right, the check should include something like a Generation check to verify whether the CassandraDatacenter has been modified to fix the Valid state.

You can work around this with some manual status changes, however:

# Get the index of the Valid condition
kubectl get cassdc dc1 -o yaml | yq '.status.conditions[] | select(.type=="Valid") | path | .[-1]'

# Update
kubectl patch cassdc dc1 --subresource=status --type=json -p='[{"op": "replace", "path": "/status/conditions/7/status", "value": "True"}]'

Replace the /7 in the latter command with the index returned by the first command, and adjust the resource name (dc1) to match your CassandraDatacenter.
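
Put together, and assuming a shell with yq v4 and a kubectl recent enough to support --subresource=status, the two steps above could be combined roughly like this (sketch only):

# Sketch only: look up the index of the Valid condition and flip it back to True in one go.
IDX=$(kubectl get cassdc dc1 -o yaml | yq '.status.conditions[] | select(.type=="Valid") | path | .[-1]')
kubectl patch cassdc dc1 --subresource=status --type=json \
  -p="[{\"op\": \"replace\", \"path\": \"/status/conditions/${IDX}/status\", \"value\": \"True\"}]"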

@quercusnick
Author

Hi @burmanm, thank you a lot for the assistance. I confirm that the workaround you've provided helps!

@adejanovski added the ready-for-review label Apr 26, 2024
@adejanovski added the review label and removed the ready-for-review label May 6, 2024
@adejanovski added the done label and removed the review label May 6, 2024