error out replace brick if self heal is in progress #718
Conversation
@heketi/maintainers This is still a work in progress. I need to add tests, but would like your review of whether the approach is right or not.
Can one of the admins verify this patch?
@raghavendra-talur The approach looks good and the code is clean enough. But as you said: it's WIP, tests are missing.
@obnoxxx Yes, the WIP tag is because there are no tests. I will add a test.
@@ -131,6 +131,29 @@ func (v *VolumeEntry) replaceBrickInVolume(db *bolt.DB, executor executors.Executor
		return err
	}

	// Get self heal status for this brick's volume
	healinfo, err := executor.HealInfo(oldBrickNodeEntry.ManageHostName(), v.Info.Name)
Getting heal info from only one node may not be sufficient; can't we have a fallback?
	for _, brickHealStatus := range healinfo.Bricks.BrickList {
		// Gluster has a bug where it does not send Name for bricks that are down.
		// Skip such bricks; this is safe because a down brick cannot be a heal source.
		if brickHealStatus.Name == "information not available" {
Are there any other statuses available besides this one?
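The skip logic in the hunk above can be sketched as a small predicate. This is a minimal sketch, assuming the field names from the diff (`BrickHealStatus.Name`) without knowing the full struct definitions in the PR:

```go
package main

import "fmt"

// BrickHealStatus is a hypothetical mirror of the heal-info entry type
// used in the PR; only the fields visible in the diff are assumed here.
type BrickHealStatus struct {
	Name            string
	NumberOfEntries string
}

// brickNeedsHealCheck reports whether a brick's heal status should be
// inspected. Due to the Gluster bug mentioned in the diff, bricks that
// are down report "information not available" instead of a name; such
// bricks are skipped because a down brick cannot be a heal source.
func brickNeedsHealCheck(b BrickHealStatus) bool {
	return b.Name != "information not available"
}

func main() {
	bricks := []BrickHealStatus{
		{Name: "10.0.0.1:/b1", NumberOfEntries: "0"},
		{Name: "information not available"},
	}
	for _, b := range bricks {
		fmt.Println(b.Name, brickNeedsHealCheck(b))
	}
}
```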
executors/sshexec/volume.go (Outdated)
	output, err := s.RemoteExecutor.RemoteCommandExecute(host, command, 10)
	if err != nil {
		return nil, fmt.Errorf("Unable to get heal info of volume name: %v", volume)
You can remove "name" from the above log message.
executors/sshexec/volume.go (Outdated)
	var healInfo CliOutput
	err = xml.Unmarshal([]byte(output[0]), &healInfo)
	if err != nil {
		return nil, fmt.Errorf("Unable to determine heal info of volume name: %v", volume)
same here
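The parsing step being reviewed can be sketched in isolation. This is a minimal sketch, assuming an XML layout matching `gluster volume heal <vol> info --xml` and a `CliOutput` shape compatible with the `healinfo.Bricks.BrickList` access seen in the first hunk; the real types in executors/sshexec may differ, and the error message here already applies the reviewer's suggestion to drop "name":

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Brick is one <brick> entry from the heal-info XML (assumed layout).
type Brick struct {
	Name            string `xml:"name"`
	NumberOfEntries string `xml:"numberOfEntries"`
}

// Bricks holds the list of <brick> elements.
type Bricks struct {
	BrickList []Brick `xml:"brick"`
}

// CliOutput maps the <cliOutput> root; the a>b tag path digs into
// <healInfo><bricks> so callers can use out.Bricks.BrickList directly.
type CliOutput struct {
	Bricks Bricks `xml:"healInfo>bricks"`
}

func parseHealInfo(output, volume string) (*CliOutput, error) {
	var healInfo CliOutput
	if err := xml.Unmarshal([]byte(output), &healInfo); err != nil {
		return nil, fmt.Errorf("Unable to determine heal info of volume: %v", volume)
	}
	return &healInfo, nil
}

func main() {
	sample := `<cliOutput><healInfo><bricks>
  <brick><name>10.0.0.1:/b1</name><numberOfEntries>0</numberOfEntries></brick>
  <brick><name>information not available</name></brick>
</bricks></healInfo></cliOutput>`
	hi, err := parseHealInfo(sample, "vol1")
	if err != nil {
		panic(err)
	}
	for _, b := range hi.Bricks.BrickList {
		fmt.Println(b.Name)
	}
}
```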
@raghavendra-talur Sounds like this is a state of the volume. If the volume is self-healing, then I'm assuming that no other state change can occur until it is either canceled or completed, right?
@lpabon Yes, it's a state of the volume, and I guess no other state change should occur, except possibly remove? Not sure about expansion.
Closes #32
We also need some tests. Any status update, @raghavendra-talur?
Force-pushed from c3c9d8a to 53a8fdd (Compare)
The conditions to be considered are:
1. If the brick to be replaced is the source for some files, don't proceed.
2. We kill the brick to be replaced; if that would make the replica set lose quorum, don't proceed.

This has a side effect that remove device won't work on volumes which have replica count 2. Choosing data safety over migration features.

Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
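The second condition in the commit message can be sketched as a small check. This is a hedged sketch: the majority-based quorum formula and the function name are assumptions for illustration, not heketi's actual implementation:

```go
package main

import "fmt"

// losesQuorum reports whether killing the brick being replaced would
// drop the replica set below quorum. Quorum is taken here as a strict
// majority of replicaCount (an assumption for this sketch).
func losesQuorum(replicaCount, bricksUp int) bool {
	// After killing the target brick, one fewer brick is up.
	remaining := bricksUp - 1
	return remaining < replicaCount/2+1
}

func main() {
	// replica 3, all bricks up: killing one leaves 2 of 3, quorum holds.
	fmt.Println(losesQuorum(3, 3)) // false
	// replica 2: killing one leaves 1 of 2, quorum is lost -- which is
	// why remove device cannot work on replica-2 volumes.
	fmt.Println(losesQuorum(2, 2)) // true
}
```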
Force-pushed from 53a8fdd to e3d7f21 (Compare)
@lpabon Added tests, ready for review. While adding tests, I found a bug in Gluster which affects heketi. Filed a bug in Gluster https://bugzilla.redhat.com/show_bug.cgi?id=1473026 and accounted for it in heketi. @obnoxxx @MohamedAshiqrh @jarrpa @humblec @ramkrsna Please have a look.
@heketi/dev ping
@raghavendra-talur Sweet! What else is needed? What do you suggest we do with this PR?
@lpabon Nothing else pending here. Just review and merge.
@@ -26,6 +26,7 @@ type Executor interface {
	VolumeExpand(host string, volume *VolumeRequest) (*Volume, error)
	VolumeReplaceBrick(host string, volume string, oldBrick *BrickInfo, newBrick *BrickInfo) error
	VolumeInfo(host string, volume string) (*Volume, error)
	HealInfo(host string, volume string) (*HealInfo, error)
Do not return GlusterFS-specific XML data back to Heketi. The executor should interpret the information and return what the result is, instead of letting the caller figure it out.
@lpabon We need to match the db data with the gluster data. So I should either send db data to the executor or get executor data to the app_* part. I don't see any way to reconcile the data without doing so. Any suggestions?
@raghavendra-talur: I think @lpabon means not to return the raw XML data, but to have the executor extract all the info we need into an internal structure (defined by heketi/your needs). That does not mean losing any data.
Correct. The executor's job is to abstract any of the fine details of the call. This is letting the caller deal with raw XML data. Instead let the executor look at the XML data and return a simple struct of what the call is supposed to provide.
@lpabon I was trying to have a look at this today, and I don't understand your request. it looks like there's already a function that extracts the relevant data into a structure? https://github.com/heketi/heketi/pull/718/files#diff-2320cf68f8e9328f659fe584a9a33620
@jarrpa I checked it again, and since this is information about the system, it could be that at some point we look at other values in the structure. I withdraw this request. It is OK as it is.
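For context, the alternative design @lpabon originally suggested would have the executor answer the caller's question rather than hand back parsed XML. This is a hedged sketch under assumed names (`HealSummary`, `healInProgress` are hypothetical, not heketi APIs):

```go
package main

import "fmt"

// HealSummary is a hypothetical interpreted result the executor could
// return instead of gluster's XML-derived structures: just the pending
// heal entry counts per brick.
type HealSummary struct {
	PendingEntries map[string]int // brick name -> entries awaiting heal
}

// healInProgress interprets the summary so callers never see XML details.
func healInProgress(s HealSummary) bool {
	for _, n := range s.PendingEntries {
		if n > 0 {
			return true
		}
	}
	return false
}

func main() {
	s := HealSummary{PendingEntries: map[string]int{
		"10.0.0.1:/b1": 0,
		"10.0.0.2:/b1": 3,
	}}
	fmt.Println(healInProgress(s)) // true: one brick still has entries to heal
}
```

The trade-off, as the thread concludes, is that an interpreted struct hides system information the caller may later need, which is why the raw structure was kept.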
@raghavendra-talur Please resolve the conflict. @heketi/dev If the team is OK with this change, it LGTM.
@raghavendra-talur @obnoxxx I would like to see this pass the CI functional tests after the conflict is resolved. Also, @raghavendra-talur, could you create a new issue for Release 6 to add a functional test for this change? Thanks.
@lpabon Since @raghavendra-talur is not available currently, I rebased the branch and opened a new PR in #834 . PTAL there...
👍 |
When the brick to be replaced is a "source" brick in a replicate or disperse volume, we should fail the brick replacement. This should ideally be handled in Gluster, but there is no harm in having an extra check in Heketi.
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>