
Issue Replacing Node Via Heketi #1744

Closed
iwalmsley opened this issue Jul 7, 2020 · 8 comments · Fixed by #1745

iwalmsley commented Jul 7, 2020

Kind of issue

Unsure - perhaps a Bug?

Observed behavior

We're seeing a scenario where, within a 3-node GlusterFS cluster, one node has become corrupt/unreachable. We have provisioned a new node (temporarily making it a 4-node Gluster cluster) and are trying to remove the corrupt node so that its bricks migrate to the new node and we get back to a 3-replica cluster (currently all volumes have only 2 bricks). This fails with the error below, which shows Heketi trying to SSH to the dead node (IP address 192.168.3.25) to get volume information. That can never work, since the node is down and won't be coming back; we need Heketi to use another node to get the volume information.

heketi-cli node remove 0ac9ed991748d6eeb8659db6b40e7094

[asynchttp] INFO 2020/07/07 09:13:34 Started job 9f801a7dfb4c588338821c5693db4915
[heketi] INFO 2020/07/07 09:13:34 Running Remove Device
[heketi] INFO 2020/07/07 09:13:34 Trying Remove Device (attempt #1/1)
[heketi] INFO 2020/07/07 09:13:34 Running Remove Brick from Device
[heketi] INFO 2020/07/07 09:13:34 Trying Remove Brick from Device (attempt #1/1)
[heketi] INFO 2020/07/07 09:13:34 Backup successful
[negroni] 2020-07-07T09:13:34Z | 202 |   49.681971ms | localhost:8080 | POST /nodes/0ac9ed991748d6eeb8659db6b40e7094/state
[negroni] 2020-07-07T09:13:34Z | 200 |   102.987µs | localhost:8080 | GET /queue/9f801a7dfb4c588338821c5693db4915
[cmdexec] WARNING 2020/07/07 09:13:34 Failed to create SSH connection to 192.168.3.25:22: dial tcp 192.168.3.25:22: connect: no route to host
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/volume_entry_allocate.go:100:glusterfs.(*VolumeEntry).getBrickSetForBrickId: Unable to get volume info from gluster node 192.168.3.25 for volume vol_0371ee52c589f64b27fa942a289d17b8: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:167:glusterfs.runOperationAfterBuild: Remove Brick from Device Failed: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[heketi] INFO 2020/07/07 09:13:34 Starting Clean for Evict Brick op:c10f1d091f0cde8322981ca239d1dcde
[heketi] INFO 2020/07/07 09:13:34 found brick to be replaced: 1836f4f73ffd968b68e00fe04129b470
[heketi] INFO 2020/07/07 09:13:34 this is a child op of: 927d81a2be1673d81e440b3a6211cdbf
[heketi] INFO 2020/07/07 09:13:34 Clean is done for Evict Brick op:c10f1d091f0cde8322981ca239d1dcde
[heketi] INFO 2020/07/07 09:13:34 found brick to be replaced: 1836f4f73ffd968b68e00fe04129b470
[heketi] INFO 2020/07/07 09:13:34 this is a child op of: 927d81a2be1673d81e440b3a6211cdbf
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:181:glusterfs.runOperationAfterBuild: Max tries (1) consumed
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:251:glusterfs.RunOperation.func1: Error in Remove Brick from Device: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:167:glusterfs.runOperationAfterBuild: Remove Device Failed: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[negroni] 2020-07-07T09:13:34Z | 200 |   124.553µs | localhost:8080 | GET /queue/9f801a7dfb4c588338821c5693db4915
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:181:glusterfs.runOperationAfterBuild: Max tries (1) consumed
[heketi] ERROR 2020/07/07 09:13:34 heketi/apps/glusterfs/operations_manage.go:251:glusterfs.RunOperation.func1: Error in Remove Device: Unable to get volume info of volume name: vol_0371ee52c589f64b27fa942a289d17b8
[asynchttp] INFO 2020/07/07 09:13:34 Completed job 9f801a7dfb4c588338821c5693db4915 in 673.264329ms

Expected/desired behavior

Heketi should retrieve volume information from a different node in the cluster, rather than relying on access to the very node that is down and being removed.

Details on how to reproduce (minimal and precise)

This is a tricky one, which is why I've said it may not be a bug. We provisioned another entire Gluster/Heketi cluster in order to replicate the issue, but that cluster doesn't seem to be affected: the same SSH timeout shows up in its logs, yet it doesn't cause the operation to fail, and the removal works fine.

Information about the environment:

  • Heketi version used (e.g. v6.0.0 or master): v9.0.0-193-gf30f11b3
  • Operating system used: Gluster Cluster - Ubuntu 18.04, Heketi via Docker image
  • Heketi compiled from sources, as a package (rpm/deb), or container: Container
  • If container, which container image: heketi/heketi:dev
  • Using kubernetes, openshift, or direct install: Kubernetes
  • If kubernetes/openshift, is gluster running inside kubernetes/openshift or outside: Outside
  • If kubernetes/openshift, how was it deployed (gk-deploy, openshift-ansible, other, custom): gk-deploy

Other useful information

We have tried other ways of removing the node, such as first removing its devices, but that fails with a different error indicating that peers are down (the only peer that is down is the offline node in the 4-node cluster that we subsequently want to remove):

[cmdexec] DEBUG 2020/07/06 14:12:05 heketi/pkg/remoteexec/log/commandlog.go:46:log.(*CommandLogger).Success: Ran command [gluster --mode=script --timeout=600 volume stop vol_c4ae08c3d6aaaa6c79fed349cdae5e12 force] on [192.168.3.47:22]: Stdout [volume stop: vol_c4ae08c3d6aaaa6c79fed349cdae5e12: success]: Stderr []
[cmdexec] DEBUG 2020/07/06 14:12:05 heketi/pkg/remoteexec/log/commandlog.go:34:log.(*CommandLogger).Before: Will run command [gluster --mode=script --timeout=600 volume delete vol_c4ae08c3d6aaaa6c79fed349cdae5e12] on [192.168.3.47:22]
[heketi] INFO 2020/07/06 14:12:05 Trying Remove Device (attempt #1/1)
[heketi] INFO 2020/07/06 14:12:05 Running Remove Brick from Device
[heketi] INFO 2020/07/06 14:12:05 Trying Remove Brick from Device (attempt #1/1)
[heketi] INFO 2020/07/06 14:12:05 Backup successful
[negroni] 2020-07-06T14:12:05Z | 202 |   284.778664ms | localhost:8080 | POST /devices/42ae95bb11ef638edb3d65f53f1c1d4a/state
[negroni] 2020-07-06T14:12:05Z | 200 |   98.278µs | localhost:8080 | GET /queue/b901fcd7fbdc7b3dfadf293d13234aa0
[cmdexec] ERROR 2020/07/06 14:12:05 heketi/pkg/remoteexec/log/commandlog.go:56:log.(*CommandLogger).Error: Failed to run command [gluster --mode=script --timeout=600 volume delete vol_c4ae08c3d6aaaa6c79fed349cdae5e12] on [192.168.3.47:22]: Err[Process exited with status 1]: Stdout []: Stderr [volume delete: vol_c4ae08c3d6aaaa6c79fed349cdae5e12: failed: Some of the peers are down]
[cmdexec] ERROR 2020/07/06 14:12:05 heketi/executors/cmdexec/volume.go:160:cmdexec.(*CmdExecutor).VolumeDestroy: Unable to delete volume vol_c4ae08c3d6aaaa6c79fed349cdae5e12: volume delete: vol_c4ae08c3d6aaaa6c79fed349cdae5e12: failed: Some of the peers are down

Any help pointing in the right direction would be very helpful.

@iamniting (Member)

The error you are seeing in the logs under Other useful information is not caused by the device remove at all; a device remove doesn't trigger a volume delete. It comes from a pending volume delete operation, triggered either by you or by a PVC deletion in k8s. A volume delete cannot be completed until all nodes are up: Heketi can only delete a volume when every node is running glusterd.
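
As an illustration only (a hypothetical standalone helper, not Heketi code), the Go sketch below shells out to gluster peer status on one of the surviving nodes and flags any peer reported as Disconnected; it assumes that command's usual "State: ... (Connected)/(Disconnected)" output lines.

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Ask the local glusterd for the state of every peer in the pool.
	out, err := exec.Command("gluster", "--mode=script", "peer", "status").CombinedOutput()
	if err != nil {
		fmt.Printf("gluster peer status failed: %v\n%s\n", err, out)
		return
	}
	// "gluster peer status" normally prints one "State: ... (Connected)" or
	// "(Disconnected)" line per peer.
	down := 0
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, "(Disconnected)") {
			down++
			fmt.Println("peer without a running glusterd:", strings.TrimSpace(line))
		}
	}
	if down == 0 {
		fmt.Println("all peers connected; the pending volume delete should be able to proceed")
	}
}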

How many bricks do you have on the node that you want to remove?

@iwalmsley (Author)

Thanks for the response.

We have about 30 bricks on the node we want to remove, and unfortunately we can't bring it back up because the disk is corrupt. What would you recommend we do? We have considered removing the bad node from the cluster at the gluster level, but didn't want the real state to fall out of sync with the Heketi database.

@iamniting (Member)

I was looking at the code after looking at your logs, and I realized that we do have a bug at [1]. It is trying to get the volume info from the same node where the brick is present; it should get it from another node.

[1]. https://github.com/heketi/heketi/blob/master/apps/glusterfs/operations_device.go#L510
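
To illustrate the shape of the problem (with hypothetical types and names, not Heketi's real ones), this sketch pins the volume-info query to the node that owns the brick being evicted, which in this report is exactly the node that is unreachable:

package main

import (
	"errors"
	"fmt"
)

// node and brick are simplified stand-ins for Heketi's node/brick entries.
type node struct {
	host string
	up   bool // stand-in for "glusterd is reachable on this node"
}

type brick struct {
	volume string
	owner  node
}

// fetchVolumeInfo stands in for running "gluster volume info" over SSH.
func fetchVolumeInfo(n node, volume string) (string, error) {
	if !n.up {
		return "", errors.New("dial tcp " + n.host + ":22: no route to host")
	}
	return "volume info for " + volume + " from " + n.host, nil
}

func main() {
	dead := node{host: "192.168.3.25", up: false}
	b := brick{volume: "vol_0371ee52c589f64b27fa942a289d17b8", owner: dead}

	// Problematic pattern: only the brick's own node is ever asked.
	_, err := fetchVolumeInfo(b.owner, b.volume)
	fmt.Println("querying only the brick's node fails:", err)
}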

@phlogistonjohn (Contributor)

Agreed. This appears to be a regression brought in by the brick eviction patches.

Previously, the code would attempt to run on the node that owns the brick-to-be-replaced, but would then fall back to any working node in the cluster (one that passes a glusterd check). The newer code lacks the fallback. Fixing the new function to fall back the way the old code did ought to fix this issue.
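
A minimal, self-contained sketch of that fallback, again with hypothetical names rather than Heketi's real types: prefer the node that owns the brick, otherwise use the first other cluster node that passes a glusterd-style health check.

package main

import (
	"errors"
	"fmt"
)

// peer is a simplified stand-in for a Heketi node entry.
type peer struct {
	host string
	up   bool // stand-in for "glusterd responds on this node"
}

// volumeInfoHost picks the host to query for "gluster volume info":
// the brick's own node if it is healthy, otherwise any other healthy node.
func volumeInfoHost(owner peer, cluster []peer) (string, error) {
	if owner.up {
		return owner.host, nil
	}
	for _, p := range cluster {
		if p.host != owner.host && p.up {
			return p.host, nil // first node that passes the health check wins
		}
	}
	return "", errors.New("no node with a running glusterd found")
}

func main() {
	owner := peer{host: "192.168.3.25", up: false} // the dead node
	cluster := []peer{owner, {host: "192.168.3.47", up: true}}

	host, err := volumeInfoHost(owner, cluster)
	fmt.Println(host, err) // prints "192.168.3.47 <nil>"
}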

@iamniting do you want to supply a PR? If so, I'll assign the issue to you.

iamniting self-assigned this Jul 9, 2020
@iamniting (Member)

Hi John, I have assigned this issue to myself and am working on the patch.

@phlogistonjohn (Contributor)

Hello @iwalmsley, we've merged a fix for this issue. At your earliest convenience, please update your heketi and try out the fix. If you're not building from source yourself, a new build on Docker Hub should be done within the hour, as it is already building now. Please let us know whether this works for you or whether you continue to have issues. Thanks!

@iwalmsley (Author)

Thanks all, that sounds very promising. We're going to try again tomorrow morning and will update then.

@iwalmsley (Author)

Just to update: the fix did work, insofar as the node delete proceeded past heal info; however, the operation failed after that. Going to raise a separate issue for it. Thanks for the help.
