
Unable to scale the cassandra cluster up when Medusa is enabled. #1037

Closed
chandapukiran opened this issue Aug 9, 2023 · 11 comments · Fixed by #1052

@chandapukiran

What happened?
When trying to increase the cluster size from 3 to 4 with Medusa enabled, the newly deployed pod goes into CrashLoopBackOff state.
Did you expect to see something different?
The new pod should be up and running
How to reproduce it (as minimally and precisely as possible):

  1. Deploy a cassandra cluster with 3 nodes and Medusa enabled.
  2. Add some test data and back it up to S3 using Medusa (an illustrative backup manifest is sketched after this list).
  3. Try to scale the cluster up from 3 nodes to 4.
  4. Notice that the new pod comes up but ends in CrashLoopBackOff state.
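
A backup for step 2 can be requested with a MedusaBackupJob resource along these lines; the resource name here is illustrative and the spec fields should be double-checked against the operator version in use:

apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupJob
metadata:
  name: backup2                    # illustrative name
  namespace: kiran-cassandra-auto
spec:
  cassandraDatacenter: dc1         # assumed field: datacenter to back up

Step 3 is then just a matter of bumping the datacenter size from 3 to 4 in the K8ssandraCluster manifest shown under Manifests below.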
Environment
  • K8ssandra Operator version:
    k8ssandra-operator:v1.5.2

cass-operator:v1.14.0

  • Kubernetes version information:

    kubectl version

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"4c9411232e10168d7b050c49a1b59f6df9d7ea4b", GitTreeState:"clean", BuildDate:"2023-04-14T13:14:41Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.12", GitCommit:"ef70d260f3d036fc22b30538576bbf6b36329995", GitTreeState:"clean", BuildDate:"2023-03-15T13:30:13Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"linux/amd64"}
  • Kubernetes cluster kind:

    created with kops

  • Manifests:

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
spec:
  cassandra:
    serverVersion: "4.0.1"
    datacenters:
      - metadata:
          name: dc1
        size: 4
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: kiran-k8ssandra-poc
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
        config:
          jvmOptions:
            heapSize: 512M
  medusa:
    containerImage:
      registry: <removed>
      repository: docker-dev/cassandra
      name: medusa-aws
      tag: 0.13.4
      pullPolicy: IfNotPresent
    storageProperties:
      # Can be either of local, google_storage, azure_blobs, s3, s3_compatible, s3_rgw or ibm_storage
      storageProvider: s3
      # Name of the secret containing the credentials file to access the backup storage backend
      storageSecretRef:
        name: medusa-bucket-key
      # Name of the storage bucket
      bucketName: k8ssandra-dbapp-medusa
      # Prefix for this cluster in the storage bucket directory structure, used for multitenancy
      prefix: test5
      # Host to connect to the storage backend (Omitted for GCS, S3, Azure and local).
      #host: minio.minio.svc.cluster.local
      # Port to connect to the storage backend (Omitted for GCS, S3, Azure and local).
      #port: 9000
      # Region of the storage bucket
      region: us-east-1
  • K8ssandra Operator Logs:
2023-08-09T06:00:44.703Z	ERROR	Failed to prepare restore	{"controller": "medusarestorejob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaRestoreJob", "MedusaRestoreJob": {"name":"restore-backup2","namespace":"kiran-cassandra-auto"}, "namespace": "kiran-cassandra-auto", "name": "restore-backup2", "reconcileID": "c1063005-6bd8-4445-8569-a9971f26cbd4", "medusarestorejob": "kiran-cassandra-auto/restore-backup2", "error": "prepare restore task failed for restore restore-backup2"}
2023-08-09T06:00:44.703Z	ERROR	Reconciler error	{"controller": "medusarestorejob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaRestoreJob", "MedusaRestoreJob": {"name":"restore-backup2","namespace":"kiran-cassandra-auto"}, "namespace": "kiran-cassandra-auto", "name": "restore-backup2", "reconcileID": "c1063005-6bd8-4445-8569-a9971f26cbd4", "error": "prepare restore task failed for restore restore-backup2"}
2023-08-09T06:00:44.709Z	ERROR	Failed to prepare restore	{"controller": "medusarestorejob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaRestoreJob", "MedusaRestoreJob": {"name":"restore-backup2","namespace":"kiran-cassandra-auto"}, "namespace": "kiran-cassandra-auto", "name": "restore-backup2", "reconcileID": "2f56a29a-f072-426b-8806-f488ac0f9f38", "medusarestorejob": "kiran-cassandra-auto/restore-backup2", "error": "prepare restore task failed for restore restore-backup2"}
2023-08-09T06:00:44.709Z	ERROR	Reconciler error	{"controller": "medusarestorejob", "controllerGroup": "medusa.k8ssandra.io", "controllerKind": "MedusaRestoreJob", "MedusaRestoreJob": {"name":"restore-backup2","namespace":"kiran-cassandra-auto"}, "namespace": "kiran-cassandra-auto", "name": "restore-backup2", "reconcileID": "2f56a29a-f072-426b-8806-f488ac0f9f38", "error": "prepare restore task failed for restore restore-backup2"}
2023-08-09T06:00:59.741Z	INFO	Initial token computation could not be performed or is not required in this cluster	{"controller": "k8ssandracluster", "controllerGroup": "k8ssandra.io", "controllerKind": "K8ssandraCluster", "K8ssandraCluster": {"name":"demo","namespace":"kiran-cassandra-auto"}, "namespace": "kiran-cassandra-auto", "name": "demo", "reconcileID": "b4552f7e-94e4-4f2f-94ed-a11d90baf1d2", "K8ssandraCluster": "kiran-cassandra-auto/demo", "error": "cannot compute initial tokens: at least one DC has num_tokens >= 16"}

Anything else we need to know?:

k describe pod demo-dc1-default-sts-3
  Normal   Created                 26m (x5 over 27m)      kubelet                  Created container medusa-restore
  Normal   Started                 26m (x5 over 27m)      kubelet                  Started container medusa-restore
  Warning  BackOff                 2m35s (x113 over 27m)  kubelet                  Back-off restarting failed container
NAME                                             READY   STATUS                  RESTARTS         AGE
demo-dc1-default-sts-0                           3/3     Running                 0                42m
demo-dc1-default-sts-1                           3/3     Running                 0                42m
demo-dc1-default-sts-2                           3/3     Running                 0                42m
demo-dc1-default-sts-3                           0/3     Init:CrashLoopBackOff   10 (2m30s ago)   29m
medusa-poc-cass-operator-588778c6cf-rcg7h        1/1     Running                 0                25h
medusa-poc-k8ssandra-operator-5bf6fdfddb-78mh9   1/1     Running                 0                25h
@chandapukiran chandapukiran added the bug Something isn't working label Aug 9, 2023
@adejanovski adejanovski added the assess Issues in the state 'assess' label Aug 9, 2023
@adejanovski
Contributor

Hi @chandapukiran, I'm able to reproduce the issue but with slightly different steps.
It requires a restore to be done before scaling the cluster.
It's actually an old bug that we had fixed a long time ago, but a regression was introduced in Medusa at some point (the problem is in Medusa, not k8ssandra-operator).
We'll create a patch for this.

@adejanovski
Contributor

More details on the issue:

When a restore is triggered, env variables are added to the statefulset definition to instruct the medusa-restore init container to perform the restore, along with a restore id and the backup that needs to be restored.
Upon restore completion, a file is created with the restore id to avoid performing the restore again after a pod restart.
When a restore has been done and a node is then added, the corresponding pod gets the env variables, but Medusa fails to find a backup for this node, since the backup was created before the node was added.
The fix is to ignore the restore operation when the requested backup cannot be found for the node and to exit the init container gracefully.
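
For illustration, the env variables injected on the medusa-restore init container in this situation would look roughly like the snippet below; the variable names are an assumption made for the example, not taken from this issue:

  initContainers:
    - name: medusa-restore
      env:
        - name: BACKUP_NAME     # assumed name: the backup to restore
          value: backup2
        - name: RESTORE_KEY     # assumed name: restore id, written to a marker file on completion
          value: <restore-id>

With the fix, when the requested backup cannot be found for the node, the init container skips the restore and exits with a zero status code so the pod can keep starting.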

Definition of Done

  • Medusa's restore.py exits gracefully if the backup cannot be found for the node
  • A scale up operation is added to the Medusa restore e2e test to prevent future regressions

@adejanovski adejanovski added ready Issues in the state 'ready' and removed assess Issues in the state 'assess' labels Aug 9, 2023
@adejanovski adejanovski self-assigned this Aug 9, 2023
@adejanovski adejanovski added in-progress Issues in the state 'in-progress' and removed ready Issues in the state 'ready' labels Sep 8, 2023
@adejanovski adejanovski added done Issues in the state 'done' and removed in-progress Issues in the state 'in-progress' labels Sep 20, 2023
@chandapukiran
Author

@adejanovski, is the fix going to be backported to Medusa 0.13.4? We are using k8ssandra-operator with Medusa 0.13.4.

@adejanovski
Contributor

@chandapukiran, I have a fix here.

Would you be able to test it and confirm that it fixes the issue?

@chandapukiran
Author

@adejanovski I have built a new image with your fix and tested it, and it is now working as expected. I tested the following steps:

  • Initially built a 3-node cluster and added data
  • Took a backup using Medusa
  • Added an additional Cassandra node, increasing the size from 3 to 4
  • The new pod came up and data replication started from the existing cluster
  • Added some more data and took another backup
  • Reduced the cluster size from 4 to 3
  • Increased the size from 3 to 4 again
  • The new pod was able to come up healthy

@chandapukiran
Author

@adejanovski, could you confirm whether the test I have described above is sufficient to validate the fix you provided?

@adejanovski
Contributor

Hi @chandapukiran,

What needs to be tested is the following:

  • Spin up a 3-node cluster
  • Take a backup
  • Restore that backup
  • Scale up to 4 nodes

It's only after a restore that this bug triggers in my experience.
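
For reference, the restore step corresponds to the MedusaRestoreJob already visible in the operator logs earlier in this issue; an illustrative manifest (the spec field names are an assumption) would be:

apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaRestoreJob
metadata:
  name: restore-backup2
  namespace: kiran-cassandra-auto
spec:
  cassandraDatacenter: dc1   # assumed field: datacenter to restore into
  backup: backup2            # assumed field: backup to restore

Without the fix, scaling the K8ssandraCluster from 3 to 4 nodes after such a restore is what leaves the new pod's medusa-restore init container in CrashLoopBackOff.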

@chandapukiran
Author

Hi @adejanovski,
I have followed these steps:

  • Spin up a 3-node cluster
  • Take a backup
  • Restore that backup
  • Scale up to 4 nodes

The new pod is coming up, but I see a rolling restart of the pods. Is that expected?

@adejanovski
Contributor

@chandapukiran, I've seen that as well and created a ticket to track this.
So we can put it like this: it's expected but we should make it so it doesn't happen 😅

@chandapukiran
Author

Thank you @adejanovski. Is this fix going to be included in all Medusa versions, or only in newer versions?

@chandapukiran
Author

@adejanovski, could you please confirm whether the fix is going to be backported?
