What happened?

On AWS (EKS with an S3 bucket), the Cassandra StatefulSet sometimes fails to come up after a restore.
On a K8ssandra cluster with 1 DC of 3 nodes, 2 pods never pass their readiness probe.
On the failed pods, the following error shows up in the logs:
```
java.lang.RuntimeException: A node with address /172.0.238.69:7000 already exists, cancelling join. Use cassandra.replace_address if you want to replace this node.
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:749)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:1024)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:874)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:819)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:418)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:759)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:893)
```
I have observed this issue many times in our AWS environments and have never seen it in our GCP or Azure environments.
It seems AWS-specific.
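For reference, the `cassandra.replace_address` property named in the error is a Cassandra JVM flag. Below is a minimal sketch of where it could be set on a K8ssandraCluster, assuming the `jvmOptions.additionalOptions` field and reusing the IP from the log above; this is only an illustration, not a confirmed workaround for this bug:

```yaml
# Sketch only: where a replace_address flag could be injected into a DC's JVM options.
# The jvmOptions.additionalOptions field name and the IP are assumptions, not a verified fix.
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
spec:
  cassandra:
    config:
      jvmOptions:
        additionalOptions:
          # IP taken from the error message above
          - "-Dcassandra.replace_address=172.0.238.69"
```

Whether that flag even applies here is unclear, since the collision seems to involve gossip state carried over from the old cluster rather than a genuinely dead peer.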
Did you expect to see something different?
I expect the Cassandra StatefulSet to be up and running after restoration.
How to reproduce it (as minimally and precisely as possible):

1. Set up an S3 bucket and an EKS cluster.
2. Create a K8ssandra cluster with 3 nodes and Medusa enabled.
3. Take a backup using a MedusaBackupJob.
4. Delete the K8ssandra cluster (make sure the pods and PVCs are gone).
5. Re-create the K8ssandra cluster.
6. Create a MedusaRestoreJob.

Example manifests for these steps are sketched below.
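To make the steps concrete, here is roughly what the manifests could look like. This is a sketch only: the names (demo, dc1, medusa-bucket-key, backup-001, restore-001), bucket, region, storage class, credentials format, and versions are all placeholders, not taken from the attached k8ssandra.yaml, medusabackup.yaml, or medusarestorejob.yaml.

```yaml
# Credentials secret for Medusa (step 1); the "credentials" key format is assumed
apiVersion: v1
kind: Secret
metadata:
  name: medusa-bucket-key
type: Opaque
stringData:
  credentials: |
    [default]
    aws_access_key_id = <key id>
    aws_secret_access_key = <secret>
---
# Cluster with Medusa enabled against an S3 bucket (step 2)
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo
spec:
  cassandra:
    serverVersion: "4.0.7"
    datacenters:
      - metadata:
          name: dc1
        size: 3
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: gp2
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 10Gi
  medusa:
    storageProperties:
      storageProvider: s3
      bucketName: my-medusa-bucket       # placeholder S3 bucket
      region: us-east-1                  # placeholder region
      storageSecretRef:
        name: medusa-bucket-key
---
# Backup (step 3)
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaBackupJob
metadata:
  name: backup-001
spec:
  cassandraDatacenter: dc1
---
# Restore (step 6), referencing the backup created above by name
apiVersion: medusa.k8ssandra.io/v1alpha1
kind: MedusaRestoreJob
metadata:
  name: restore-001
spec:
  cassandraDatacenter: dc1
  backup: backup-001
```

The MedusaRestoreJob targets the same datacenter and references the backup by name, which is why the cluster has to be re-created (step 5) before the restore is created.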
The issue does not happen consistently. I would say it happens on roughly 1 out of 5 attempts.
Environment

K8ssandra Operator version: 1.14
Kubernetes version information: v1.29.3-eks-adc7111
Kubernetes cluster kind: EKS
Medusa version: 0.20.1

Manifests:
k8ssandra.yaml:
medusarestorejob.yaml:
medusabackup.yaml:

k8ssandra-operator-logs.txt
Anything else we need to know?

Note that after the issue happens, I can delete the K8ssandra cluster, re-create it, and re-create a Medusa restore job. The restoration then succeeds and all 3 Cassandra pods come up.
This makes me think that something is going wrong in the restoration rather than in the backup.

k8scass-cs-001-k8scass-001-default-sts-1 medusa-restore logs: medusa-restore-logs.txt
k8scass-cs-001-k8scass-001-default-sts-1 server-system-logger logs: server-system-logger.txt

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: K8OP-22
Hello @c3-clement! I've spent some time trying to reproduce this on an AWS EKS cluster, but with MinIO instead of S3. I'm sorry to conclude that I did not manage to reproduce the issue.
I only saw two things that were a bit odd. First, after the restore, the new nodes attempted to gossip with one or two of the old ones. This can be explained by the operator restoring the system.peers table.
Second, I did see two extra conditions on my datacenter (the oldest ones), but I can't imagine how those would impact the restore.