error out replace brick if self heal is in progress #718
Conversation
@heketi/maintainers This is still a work in progress. I need to add tests, but would like your review of whether the approach is right or not.
Can one of the admins verify this patch?
@raghavendra-talur The approach looks good and the code is clean enough. But as you said: it's WIP, tests are missing.
@obnoxxx Yes, the WIP tag is because there are no tests. I will add a test.
@@ -131,6 +131,29 @@ func (v *VolumeEntry) replaceBrickInVolume(db *bolt.DB, executor executors.Executor
		return err
	}

	// Get self heal status for this brick's volume
	healinfo, err := executor.HealInfo(oldBrickNodeEntry.ManageHostName(), v.Info.Name)
Getting heal info from only one node may not be sufficient; can't we have a fallback?
	for _, brickHealStatus := range healinfo.Bricks.BrickList {
		// Gluster has a bug where it does not send Name for bricks that are down.
		// Skip such bricks; this is safe because a down brick cannot be a heal source.
		if brickHealStatus.Name == "information not available" {
Are there any other statuses available besides this one?
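The skip logic in the hunk above can be sketched as a small predicate. This is a minimal sketch, assuming the field names from the diff (`BrickHealStatus.Name`) without knowing the full struct definitions in the PR:

```go
package main

import "fmt"

// BrickHealStatus is a hypothetical mirror of the heal-info entry type
// used in the PR; only the fields visible in the diff are assumed here.
type BrickHealStatus struct {
	Name            string
	NumberOfEntries string
}

// brickNeedsHealCheck reports whether a brick's heal status should be
// inspected. Due to the Gluster bug mentioned in the diff, bricks that
// are down report "information not available" instead of a name; such
// bricks are skipped because a down brick cannot be a heal source.
func brickNeedsHealCheck(b BrickHealStatus) bool {
	return b.Name != "information not available"
}

func main() {
	bricks := []BrickHealStatus{
		{Name: "10.0.0.1:/b1", NumberOfEntries: "0"},
		{Name: "information not available"},
	}
	for _, b := range bricks {
		fmt.Println(b.Name, brickNeedsHealCheck(b))
	}
}
```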
executors/sshexec/volume.go (Outdated)
	output, err := s.RemoteExecutor.RemoteCommandExecute(host, command, 10)
	if err != nil {
		return nil, fmt.Errorf("Unable to get heal info of volume name: %v", volume)
You can remove "name" from the above log message.
executors/sshexec/volume.go (Outdated)
	var healInfo CliOutput
	err = xml.Unmarshal([]byte(output[0]), &healInfo)
	if err != nil {
		return nil, fmt.Errorf("Unable to determine heal info of volume name: %v", volume)
same here
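The parsing step being reviewed can be sketched in isolation. This is a minimal sketch, assuming an XML layout matching `gluster volume heal <vol> info --xml` and a `CliOutput` shape compatible with the `healinfo.Bricks.BrickList` access seen in the first hunk; the real types in executors/sshexec may differ, and the error message here already applies the reviewer's suggestion to drop "name":

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Brick is one <brick> entry from the heal-info XML (assumed layout).
type Brick struct {
	Name            string `xml:"name"`
	NumberOfEntries string `xml:"numberOfEntries"`
}

// Bricks holds the list of <brick> elements.
type Bricks struct {
	BrickList []Brick `xml:"brick"`
}

// CliOutput maps the <cliOutput> root; the a>b tag path digs into
// <healInfo><bricks> so callers can use out.Bricks.BrickList directly.
type CliOutput struct {
	Bricks Bricks `xml:"healInfo>bricks"`
}

func parseHealInfo(output, volume string) (*CliOutput, error) {
	var healInfo CliOutput
	if err := xml.Unmarshal([]byte(output), &healInfo); err != nil {
		return nil, fmt.Errorf("Unable to determine heal info of volume: %v", volume)
	}
	return &healInfo, nil
}

func main() {
	sample := `<cliOutput><healInfo><bricks>
  <brick><name>10.0.0.1:/b1</name><numberOfEntries>0</numberOfEntries></brick>
  <brick><name>information not available</name></brick>
</bricks></healInfo></cliOutput>`
	hi, err := parseHealInfo(sample, "vol1")
	if err != nil {
		panic(err)
	}
	for _, b := range hi.Bricks.BrickList {
		fmt.Println(b.Name)
	}
}
```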
@raghavendra-talur Sounds like this is a state of the volume. If the volume is self-healing, then I'm assuming that no other state change can occur until it is either canceled or completed, right?
@lpabon Yes, it's a state of the volume, and I guess no other state change should occur, except possibly remove? Not sure about expansion.
Closes #32
We also need some tests. Any status update, @raghavendra-talur?
Force-pushed from c3c9d8a to 53a8fdd (Compare)
The conditions to be considered are:
1. If the brick to be replaced is the source for some files, don't proceed.
2. We kill the brick to be replaced; if that would make the replica set lose quorum, don't proceed.

This has a side effect that remove device won't work on volumes which have replica count 2. Choosing data safety over migration features.

Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
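The second condition in the commit message can be sketched as a small check. This is a hedged sketch: the majority-based quorum formula and the function name are assumptions for illustration, not heketi's actual implementation:

```go
package main

import "fmt"

// losesQuorum reports whether killing the brick being replaced would
// drop the replica set below quorum. Quorum is taken here as a strict
// majority of replicaCount (an assumption for this sketch).
func losesQuorum(replicaCount, bricksUp int) bool {
	// After killing the target brick, one fewer brick is up.
	remaining := bricksUp - 1
	return remaining < replicaCount/2+1
}

func main() {
	// replica 3, all bricks up: killing one leaves 2 of 3, quorum holds.
	fmt.Println(losesQuorum(3, 3)) // false
	// replica 2: killing one leaves 1 of 2, quorum is lost -- which is
	// why remove device cannot work on replica-2 volumes.
	fmt.Println(losesQuorum(2, 2)) // true
}
```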
Force-pushed from 53a8fdd to e3d7f21 (Compare)
@lpabon Added tests, ready for review. While adding tests, I found a bug in Gluster which affects heketi. Filed a bug in Gluster https://bugzilla.redhat.com/show_bug.cgi?id=1473026 and accounted for it in heketi. @obnoxxx @MohamedAshiqrh @jarrpa @humblec @ramkrsna Please have a look.
@heketi/dev ping
@raghavendra-talur Sweet! What else is needed? What do you suggest we do with this PR?
@lpabon Nothing else pending here. Just review and merge.
@@ -26,6 +26,7 @@ type Executor interface {
	VolumeExpand(host string, volume *VolumeRequest) (*Volume, error)
	VolumeReplaceBrick(host string, volume string, oldBrick *BrickInfo, newBrick *BrickInfo) error
	VolumeInfo(host string, volume string) (*Volume, error)
	HealInfo(host string, volume string) (*HealInfo, error)
Do not return GlusterFS-specific XML data back to Heketi. The executor should interpret the information and return what the result is, instead of letting the caller figure it out.
@lpabon We need to match the db data with the gluster data. So I should either send db data to the executor or get executor data to the app_* part. I don't see any way to reconcile the data without doing so. Any suggestions?
@raghavendra-talur: I think @lpabon means not to return the raw XML data, but to have the executor extract all the info we need into an internal structure (defined by heketi/your needs). That does not mean losing any data.
Correct. The executor's job is to abstract any of the fine details of the call. This is letting the caller deal with raw XML data. Instead let the executor look at the XML data and return a simple struct of what the call is supposed to provide.
@lpabon I was trying to have a look at this today, and I don't understand your request. it looks like there's already a function that extracts the relevant data into a structure? https://github.com/heketi/heketi/pull/718/files#diff-2320cf68f8e9328f659fe584a9a33620
@jarrpa I checked it again, and since this is information about the system, it could be that at some point we look at other values in the structure. I withdraw this request. It is OK as it is.
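For context, the alternative design @lpabon originally suggested would have the executor answer the caller's question rather than hand back parsed XML. This is a hedged sketch under assumed names (`HealSummary`, `healInProgress` are hypothetical, not heketi APIs):

```go
package main

import "fmt"

// HealSummary is a hypothetical interpreted result the executor could
// return instead of gluster's XML-derived structures: just the pending
// heal entry counts per brick.
type HealSummary struct {
	PendingEntries map[string]int // brick name -> entries awaiting heal
}

// healInProgress interprets the summary so callers never see XML details.
func healInProgress(s HealSummary) bool {
	for _, n := range s.PendingEntries {
		if n > 0 {
			return true
		}
	}
	return false
}

func main() {
	s := HealSummary{PendingEntries: map[string]int{
		"10.0.0.1:/b1": 0,
		"10.0.0.2:/b1": 3,
	}}
	fmt.Println(healInProgress(s)) // true: one brick still has entries to heal
}
```

The trade-off, as the thread concludes, is that an interpreted struct hides system information the caller may later need, which is why the raw structure was kept.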
@raghavendra-talur Please resolve the conflict. @heketi/dev If the team is OK with this change, it LGTM.
@raghavendra-talur @obnoxxx I would like to see this pass the CI functional tests after the conflict is resolved. Also, @raghavendra-talur, could you create a new issue for Release 6 to add a functional test for this change? Thanks.
@lpabon Since @raghavendra-talur is not available currently, I rebased the branch and opened a new PR in #834 . PTAL there...
👍 |
When the brick to be replaced is a "source" brick in a replicate or disperse volume, we should fail the brick replacement. This should ideally be handled in Gluster, but there is no harm in having an extra check in Heketi.
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>