Full cluster shutdown of all k8s nodes leaves redis-cluster pods in Error state #70
@cin can you please take a look?
Thanks for the issue. This looks like a bug, @TANISH-18. I don't currently have an OpenStack cluster to test against; maybe there's a way to reproduce the issue without OpenStack. So you basically shut down all the worker nodes and then brought them back up to get into this state? Do you see anything in the operator logs related to the issue? Also, what is the actual error on the pods? Thanks in advance!
> So you basically shut down all the worker nodes and then brought them back up to get into this state?

It looks like the operator is trying to connect to the pods in the error state, but it does not try to delete those pods or spawn new ones. One thing I observed: as soon as I delete the Redis pods stuck in the error state using `kubectl delete`, new pods are spawned and all the pods come back up.

> What is the actual error on the pods?
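The manual workaround described above amounts to finding pods stuck in a failed phase and deleting them so the operator respawns replacements. A minimal sketch of that selection step in Python; the pod dicts and field names here are illustrative stand-ins, not the operator's actual Kubernetes API objects:

```python
def failed_pod_names(pods):
    """Return names of pods stuck in a failed/error phase."""
    return [p["name"] for p in pods if p["phase"] in ("Failed", "Error")]

# Example mimicking the pod listing from this issue:
pods = [
    {"name": "rediscluster-node-for-redis-9fhdz", "phase": "Error"},
    {"name": "rediscluster-node-for-redis-w2m79", "phase": "Running"},
]
print(failed_pod_names(pods))  # → ['rediscluster-node-for-redis-9fhdz']
```

Each name returned would then be fed to a `kubectl delete pod` call (or its API equivalent); the operator notices the missing pods and spawns fresh ones.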
Oh, it sounds like the operator is not recognizing the failed state and is just waiting for the pods to "fix themselves" (which they won't in this case, because the pods aren't managed by any kind of ReplicaSet; they're managed by the operator directly). I'll see if I can reproduce locally with kind.
Hey @cin, did you get a chance to reproduce the issue?
Sorry, I got pulled into some other things yesterday and didn't get a chance to test. Will make some time today.
@TANISH-18, I was able to try this in one of our clusters (there's no good way to "restart" nodes in kind). I just rebooted all 3 nodes in the cluster at once. The pods all restarted and came back up fine. I think the main difference with what I just did is that my pods were still there and were just restarted, whereas I'm guessing your pods were recreated? Here's how my pods look after the reboot.
Did your operator or Redis pods ever restart? Am I correct in thinking you got your cluster back to a healthy state by deleting the errored-out pods? The prospect of automating that is a bit scary; not because it's hard, but because it means deleting resources. Since you've effectively lost all cache at that point, you can just reinstall the CR as well (easier than deleting all the pods). That's not a great answer if you want to go autopilot on disaster recovery, though. In the meantime, I'll see if I can get an OpenShift cluster approved.
@cin yes, my operator pod restarted. Deleting all the Redis pods does not reproduce this issue; in that case even my redis-cluster pods came back. The issue only occurs when shutting down all the K8s nodes from OpenStack and starting them again after 5-10 minutes, i.e. the power-outage case. Anyway, I found a fix: during reconciliation the operator tries to connect to the pods in the error state, so instead we need to delete failed pods while polling for failed Redis pods. Once all the pods in the failed state are deleted, the redis-cluster pods come back.
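The fix described above, deleting failed pods during reconciliation so the operator recreates them rather than retrying connections to dead pods, can be sketched like this in Python. `PodClient` and the helper names are hypothetical stand-ins, not the operator's real Kubernetes client API:

```python
class PodClient:
    """Hypothetical stand-in for a Kubernetes client; records deletions."""

    def __init__(self):
        self.deleted = []

    def delete_pod(self, name):
        self.deleted.append(name)


def reconcile_failed_pods(client, pods):
    """During reconciliation, delete pods in a failed/error phase so the
    operator respawns them, instead of waiting on connections to pods
    that will never recover on their own."""
    failed = [p for p in pods if p["phase"] in ("Failed", "Error")]
    for p in failed:
        client.delete_pod(p["name"])
    return [p["name"] for p in failed]


client = PodClient()
pods = [
    {"name": "rediscluster-node-for-redis-9fhdz", "phase": "Error"},
    {"name": "rediscluster-node-for-redis-j69st", "phase": "Error"},
    {"name": "rediscluster-node-for-redis-w2m79", "phase": "Running"},
]
reconcile_failed_pods(client, pods)
print(client.deleted)
```

The key design point is that deletion happens inside the reconcile loop, so the operator's normal pod-creation path handles respawning; no separate recovery mechanism is needed.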
UPDATE: I went a step further and deleted the only worker pool in my test cluster. The Redis node pods completely went away, as expected. The operator and metrics pods went into a pending state (also expected). I then recreated a new worker pool, and everything came back without incident. I'm starting to think this is an OpenShift-only thing. I wonder what's different...
@TANISH-18 you may want to try out the latest version of the operator, as #84 may have resolved this issue as well.
Steps to reproduce the issue:

1. Redis-operator and redis-cluster are up and running.
2. Shut off all K8s instances from OpenStack, then start them again after 5 minutes.
3. The redis-cluster pods get stuck in the Error state.
4. Once all the K8s nodes come back, the Redis-operator is up and running, but it is able to bring up only one redis-cluster pod.

There are two issues once all the K8s cluster nodes come back up:
```
$ kc get pods -n fed-redis-cluster -o wide
rediscluster-node-for-redis-9fhdz   0/2   Error     2   91m
rediscluster-node-for-redis-hj59b   0/2   Error     2   93m
rediscluster-node-for-redis-j69st   0/2   Error     0   37m
rediscluster-node-for-redis-w2m79   2/2   Running   0   36m
```
This is an issue on the Redis operator side; it was seen while performing a resiliency run for the power-outage case.