Issue Replacing Node Via Heketi #1744
Comments
The error you are seeing in the logs under "Other useful information" is not caused by the device remove at all; device remove doesn't trigger volume delete. It is caused by a pending volume delete operation, triggered either by you or by a PVC deletion from k8s. A volume delete cannot complete until all nodes are up: Heketi can only delete a volume when every node is running glusterd. How many bricks do you have on the node you want to remove?
Thanks for the response. We have about 30 bricks on the node we want to remove, and unfortunately we can't bring it back up again as the disk is corrupt. What would you recommend we do? We have considered removing the bad node from the cluster at the gluster level, but didn't want to put the real state out of sync with the Heketi database.
I was looking at the code after reviewing your logs, and I realized that we do have a bug at [1]. It is trying to get the vol info from the same node where the brick is present. It should get it from another node. [1] https://github.com/heketi/heketi/blob/master/apps/glusterfs/operations_device.go#L510
Agreed. This appears to be a regression introduced by the brick eviction patches. Previously, the code would attempt to run on the node that owns the brick-to-be-replaced, but would then fall back to any working node (one that passes a glusterd check) in the cluster. The newer code lacks the fallback. Fixing the new function to fall back like the old code did ought to fix this issue. @iamniting do you want to supply a PR? If so, I'll assign the issue to you.
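The fallback described above can be sketched roughly as follows. This is an illustrative Go snippet, not heketi's actual API: `Node`, `volInfoFrom`, and `volInfoWithFallback` are hypothetical names standing in for the node structures and gluster queries in operations_device.go. The point is the shape of the fix: prefer the brick's own node, but try every other node that passes a health check before giving up.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a hypothetical stand-in for a heketi cluster node.
// Healthy stands in for the glusterd liveness check mentioned above.
type Node struct {
	ID      string
	Healthy bool
}

// volInfoFrom simulates fetching gluster volume info via a given node.
// It fails when that node's glusterd is unreachable.
func volInfoFrom(n Node, volume string) (string, error) {
	if !n.Healthy {
		return "", fmt.Errorf("node %s: glusterd unreachable", n.ID)
	}
	return "info for " + volume + " from " + n.ID, nil
}

// volInfoWithFallback prefers the node that owns the brick, then falls
// back to any other healthy node in the cluster, mirroring the
// pre-regression behavior described in the comment above.
func volInfoWithFallback(owner Node, peers []Node, volume string) (string, error) {
	if info, err := volInfoFrom(owner, volume); err == nil {
		return info, nil
	}
	for _, p := range peers {
		if p.ID == owner.ID {
			continue // already tried the owner
		}
		if info, err := volInfoFrom(p, volume); err == nil {
			return info, nil
		}
	}
	return "", errors.New("no healthy node could serve volume info")
}

func main() {
	dead := Node{ID: "node-a", Healthy: false}
	peers := []Node{dead, {ID: "node-b", Healthy: true}}
	info, err := volInfoWithFallback(dead, peers, "vol1")
	fmt.Println(info, err)
}
```

With the owner node down, the query still succeeds via the healthy peer, which is exactly the behavior the dead-node removal scenario in this issue needs.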
Hi John, I have assigned this issue to myself and am working on the patch.
Hello @iwalmsley we've merged a fix for this issue. At your earliest convenience please update your heketi and try out the fix. |
Thanks all that sounds very promising. We're going to try again tomorrow morning so will update then. |
Just to update: this fix did work, in that the node delete proceeded past heal info; however, the operation failed after that point. Going to raise a separate issue for it. Thanks for the help.
Kind of issue
Unsure - perhaps a Bug?
Observed behavior
We're seeing a scenario where, within a 3-node glusterfs cluster, one node has become corrupt/unreachable. We have provisioned a new node (temporarily making it a 4-node gluster cluster) and are trying to remove the corrupt node in order to migrate its bricks to the new node and achieve a 3-replica cluster again (currently all volumes have only 2 bricks). This is failing with an error indicating that Heketi is trying to SSH to the dead node (IP address 192.168.3.25) to get volume information. That can never work, as the node is down and won't be coming back up; we need it to use another node to get volume information.
heketi-cli node remove 0ac9ed991748d6eeb8659db6b40e7094
Expected/desired behavior
It should retrieve volume information from a different node in the cluster, and not rely on accessing a node which is down when removing it.
Details on how to reproduce (minimal and precise)
This is a tricky one, which is why I've said it may not be a bug. We've provisioned an entire second Gluster/Heketi cluster in order to reproduce the issue, but that one doesn't seem to be affected. You can see the same SSH timeout happening, but there it doesn't cause the operation to fail, and the removal works fine.
Information about the environment:
Other useful information
We have tried other methods of removing the node, such as first removing its devices; however, that fails with a different error indicating that peers are down (the only peer that is down is the offline node in the 4-node cluster that we subsequently want to remove).
Any help pointing in the right direction would be very helpful.