
Issue Replacing Node Via Heketi #1744

Closed
iwalmsley opened this issue Jul 7, 2020 · 8 comments · Fixed by #1745

iwalmsley commented Jul 7, 2020

Kind of issue

Unsure - perhaps a Bug?

Observed behavior

We're seeing a scenario where, within a 3-node GlusterFS cluster, one node has become corrupt/unreachable. We have provisioned a new node (temporarily making it a 4-node Gluster cluster) and are trying to remove the corrupt node so that its bricks migrate to the new node and we get back to a 3-replica cluster (currently all volumes have only 2 bricks). This fails with the error below, which shows Heketi trying to SSH to the dead node (IP address 192.168.3.25) to get volume information. That can never work, since the node is down and won't be coming back; we need Heketi to use another node to get the volume information.

heketi-cli node remove 0ac9ed991748d6eeb8659db6b40e7094

[asynchttp] INFO 2020/07/07 09:13:34 Started job 9f801a7dfb4c588338821c5693db4915
[heketi] INFO 2020/07/07 09:13:34 Running Remove Device
[heketi] INFO 2020/07/07 09:13:34 Trying Remove Device (attempt #1/1)
[heketi] INFO 2020/07/07 09:13:34 Running Remove Brick from Device
[heketi] INFO 2020/07/07 09:13:34 Trying Remove Brick from Device (attempt #1/1)
[heketi] INFO 2020/07/07 09:13:34 Backup successful
[negroni] 2020-07-07T09:13:34Z | 202 |   49.681971ms | localhost:8080 | POST /nodes/0ac9ed991748d6eeb8659db6b40e7094/state
[negroni] 2020-07-07T09:13:34Z | 200 |   102.987µs | localhost:8080 | GET /queue/9f801a7dfb4c588338821c5693db4915
[cmdexec] WARNING 2020/07/07 09:13:34 Failed to create SSH connection to 192.168.3.25:22: dial tcp 192.168.3.25:22: connect: no route to host
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/volume_entry_allocate.go:100:glusterfs.(*VolumeEntry).getBrickSetForBrickId: Unable to get volume info from gluster node 192.168.3.25 for volume vol_0371ee52c589f64b27fa942a289d17b8: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:167:glusterfs.runOperationAfterBuild: Remove Brick from Device Failed: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[heketi] INFO 2020/07/07 09:13:34 Starting Clean for Evict Brick op:c10f1d091f0cde8322981ca239d1dcde
[heketi] INFO 2020/07/07 09:13:34 found brick to be replaced: 1836f4f73ffd968b68e00fe04129b470
[heketi] INFO 2020/07/07 09:13:34 this is a child op of: 927d81a2be1673d81e440b3a6211cdbf
[heketi] INFO 2020/07/07 09:13:34 Clean is done for Evict Brick op:c10f1d091f0cde8322981ca239d1dcde
[heketi] INFO 2020/07/07 09:13:34 found brick to be replaced: 1836f4f73ffd968b68e00fe04129b470
[heketi] INFO 2020/07/07 09:13:34 this is a child op of: 927d81a2be1673d81e440b3a6211cdbf
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:181:glusterfs.runOperationAfterBuild: Max tries (1) consumed
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:251:glusterfs.RunOperation.func1: Error in Remove Brick from Device: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:167:glusterfs.runOperationAfterBuild: Remove Device Failed: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[negroni] 2020-07-07T09:13:34Z | 200 |   124.553µs | localhost:8080 | GET /queue/9f801a7dfb4c588338821c5693db4915
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:181:glusterfs.runOperationAfterBuild: Max tries (1) consumed
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:251:glusterfs.RunOperation.func1: Error in Remove Device: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[asynchttp] INFO 2020/07/07 09:13:34 Completed job 9f801a7dfb4c588338821c5693db4915 in 673.264329ms

Expected/desired behavior

Heketi should retrieve volume information from a different node in the cluster, rather than relying on access to the very node that is down and being removed.

Details on how to reproduce (minimal and precise)

This is a tricky one, which is why I've said it may not be a bug. We provisioned another entire Gluster/Heketi cluster in order to replicate the issue, but that cluster doesn't seem to be affected: the same SSH timeout shows up in its logs, yet it doesn't cause the operation to fail, and the removal works fine.

Information about the environment:

  • Heketi version used (e.g. v6.0.0 or master): v9.0.0-193-gf30f11b3
  • Operating system used: Gluster Cluster - Ubuntu 18.04, Heketi via Docker image
  • Heketi compiled from sources, as a package (rpm/deb), or container: Container
  • If container, which container image: heketi/heketi:dev
  • Using kubernetes, openshift, or direct install: Kubernetes
  • If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside: Outside
  • If kubernetes/openshift, how was it deployed (gk-deploy, openshift-ansible, other, custom): gk-deploy

Other useful information

We have tried other ways of removing the node, such as first removing its devices, but that fails with a different error indicating that peers are down (the only peer that is down is the offline node in the 4-node cluster that we subsequently want to remove):

[cmdexec] DEBUG 2020/07/06 14:12:05 heketi/pkg/remoteexec/log/commandlog.go:46:log.(*CommandLogger).Success: Ran command [gluster --mode=script --timeout=600 volume stop vol_c4ae08c3d6aaaa6c79fed349cdae5e12 force] on [192.168.3.47:22]: Stdout [volume stop: vol_c4ae08c3d6aaaa6c79fed349cdae5e12: success]: Stderr []
[cmdexec] DEBUG 2020/07/06 14:12:05 heketi/pkg/remoteexec/log/commandlog.go:34:log.(*CommandLogger).Before: Will run command [gluster --mode=script --timeout=600 volume delete vol_c4ae08c3d6aaaa6c79fed349cdae5e12] on [192.168.3.47:22]
[heketi] INFO 2020/07/06 14:12:05 Trying Remove Device (attempt #1/1)
[heketi] INFO 2020/07/06 14:12:05 Running Remove Brick from Device
[heketi] INFO 2020/07/06 14:12:05 Trying Remove Brick from Device (attempt #1/1)
[heketi] INFO 2020/07/06 14:12:05 Backup successful
[negroni] 2020-07-06T14:12:05Z | 202 |   284.778664ms | localhost:8080 | POST /devices/42ae95bb11ef638edb3d65f53f1c1d4a/state
[negroni] 2020-07-06T14:12:05Z | 200 |   98.278µs | localhost:8080 | GET /queue/b901fcd7fbdc7b3dfadf293d13234aa0
[cmdexec] ERROR 2020/07/06 14:12:05 heketi/pkg/remoteexec/log/commandlog.go:56:log.(*CommandLogger).Error: Failed to run command [gluster --mode=script --timeout=600 volume delete vol_c4ae08c3d6aaaa6c79fed349cdae5e12] on [192.168.3.47:22]: Err[Process exited with status 1]: Stdout []: Stderr [volume delete: vol_c4ae08c3d6aaaa6c79fed349cdae5e12: failed: Some of the peers are down]
[cmdexec] ERROR 2020/07/06 14:12:05 heketi/executors/cmdexec/volume.go:160:cmdexec.(*CmdExecutor).VolumeDestroy: Unable to delete volume vol_c4ae08c3d6aaaa6c79fed349cdae5e12: volume delete: vol_c4ae08c3d6aaaa6c79fed349cdae5e12: failed: Some of the peers are down

Any help pointing in the right direction would be very helpful.

@iamniting (Member)

The error you are seeing in the logs under Other useful information is not caused by the device remove at all; a device remove doesn't trigger a volume delete. It comes from a pending volume delete operation, triggered either by you or by a PVC deletion in k8s. A volume delete cannot be completed until all nodes are up: Heketi can only delete a volume when every node is running glusterd.
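
As an illustration only (a hypothetical standalone helper, not Heketi code), the Go sketch below shells out to gluster peer status on one of the surviving nodes and flags any peer reported as Disconnected; it assumes that command's usual "State: ... (Connected)/(Disconnected)" output lines.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Ask the local glusterd for the state of every peer in the pool.
	out, err := exec.Command("gluster", "--mode=script", "peer", "status").CombinedOutput()
	if err != nil {
		fmt.Printf("gluster peer status failed: %v\n%s\n", err, out)
		return
	}
	// "gluster peer status" normally prints one "State: ... (Connected)" or
	// "(Disconnected)" line per peer.
	down := 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, "(Disconnected)") {
			down++
			fmt.Println("peer without a running glusterd:", strings.TrimSpace(line))
		}
	}
	if down == 0 {
		fmt.Println("all peers connected; the pending volume delete should be able to proceed")
	}
}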

How many bricks do you have on the node that you want to remove?

@iwalmsley (Author)

Thanks for the response.

We have about 30 bricks on the node we want to remove, and unfortunately we can't bring it back up because the disk is corrupt. What would you recommend we do? We have considered removing the bad node from the cluster at the gluster level, but didn't want the real state to fall out of sync with the Heketi database.

@iamniting (Member)

I was looking at the code after looking at your logs, and I realized that we do have a bug at [1]. It is trying to get the volume info from the same node where the brick is present; it should get it from another node.

[1]. https://github.com/heketi/heketi/blob/master/apps/glusterfs/operations_device.go#L510
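
To illustrate the shape of the problem (with hypothetical types and names, not Heketi's real ones), this sketch pins the volume-info query to the node that owns the brick being evicted, which in this report is exactly the node that is unreachable:

package main

import (
	"errors"
	"fmt"
)

// node and brick are simplified stand-ins for Heketi's node/brick entries.
type node struct {
	host string
	up   bool // stand-in for "glusterd is reachable on this node"
}

type brick struct {
	volume string
	owner  node
}

// fetchVolumeInfo stands in for running "gluster volume info" over SSH.
func fetchVolumeInfo(n node, volume string) (string, error) {
	if !n.up {
		return "", errors.New("dial tcp " + n.host + ":22: no route to host")
	}
	return "volume info for " + volume + " from " + n.host, nil
}

func main() {
	dead := node{host: "192.168.3.25", up: false}
	b := brick{volume: "vol_0371ee52c589f64b27fa942a289d17b8", owner: dead}

	// Problematic pattern: only the brick's own node is ever asked.
	_, err := fetchVolumeInfo(b.owner, b.volume)
	fmt.Println("querying only the brick's node fails:", err)
}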

@phlogistonjohn (Contributor)

Agreed. This appears to be a regression brought in by the brick eviction patches.

Previously, the code would attempt to run on the node that owns the brick-to-be-replaced, but would then fall back to any working node in the cluster (one that passes a glusterd check). The newer code lacks the fallback. Fixing the new function to fall back the way the old code did ought to fix this issue.
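
A minimal, self-contained sketch of that fallback, again with hypothetical names rather than Heketi's real types: prefer the node that owns the brick, otherwise use the first other cluster node that passes a glusterd-style health check.

package main

import (
	"errors"
	"fmt"
)

// peer is a simplified stand-in for a Heketi node entry.
type peer struct {
	host string
	up   bool // stand-in for "glusterd responds on this node"
}

// volumeInfoHost picks the host to query for "gluster volume info":
// the brick's own node if it is healthy, otherwise any other healthy node.
func volumeInfoHost(owner peer, cluster []peer) (string, error) {
	if owner.up {
		return owner.host, nil
	}
	for _, p := range cluster {
		if p.host != owner.host && p.up {
			return p.host, nil // first node that passes the health check wins
		}
	}
	return "", errors.New("no node with a running glusterd found")
}

func main() {
	owner := peer{host: "192.168.3.25", up: false} // the dead node
	cluster := []peer{owner, {host: "192.168.3.47", up: true}}

	host, err := volumeInfoHost(owner, cluster)
	fmt.Println(host, err) // prints "192.168.3.47 <nil>"
}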

@iamniting do you want to supply a PR? If so, I'll assign the issue to you.

iamniting self-assigned this Jul 9, 2020
@iamniting (Member)

Hi John, I have assigned this issue to myself and am working on the patch.

@phlogistonjohn (Contributor)

Hello @iwalmsley, we've merged a fix for this issue. At your earliest convenience, please update your heketi and try out the fix. If you're not building from source yourself, a new build on Docker Hub should be done within the hour, as it is already building now. Please let us know whether this works for you or whether you continue to have issues. Thanks!

@iwalmsley (Author)

Thanks all, that sounds very promising. We're going to try again tomorrow morning and will update then.

@iwalmsley (Author)

Just to update: the fix did work, insofar as the node delete proceeded past heal info; however, the operation failed after that. Going to raise a separate issue for it. Thanks for the help.
