Skip to content
This repository has been archived by the owner on Jul 6, 2023. It is now read-only.

fix race between device remove and other ops #736

Merged
merged 2 commits into from Apr 6, 2017

Conversation

raghavendra-talur
Copy link
Member

It is possible that a device gets more bricks added to it in a parallel
volume create/expand operation when we are adding a brick to it as part
of device remove opeation. What we save to disk should be the final
state which is consequence of both the operations.

Signed-off-by: Raghavendra Talur rtalur@redhat.com

@centos-ci
Copy link
Collaborator

Can one of the admins verify this patch?

@obnoxxx
Copy link
Contributor

obnoxxx commented Apr 5, 2017

@centos-ci ok to test

@raghavendra-talur
Copy link
Member Author

@lpabon @obnoxxx @MohamedAshiqrh This patch is ready for review. I have run unit tests and SmokeTests on the code.

Copy link
Contributor

@obnoxxx obnoxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aproach looks good in general, but I think there are still places where we operate on stale data.

v.gidRequested, v.Info.Id)
deviceEntry.Save(tx)
return nil
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about checking err?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

float64(v.Info.Snapshot.Factor),
v.gidRequested, v.Info.Id)
err = db.Update(func(tx *bolt.Tx) error {
deviceEntry, _ := NewDeviceEntryFromId(tx, deviceId)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment explaining why you re-fetch the device here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

v.gidRequested, v.Info.Id)
deviceEntry.Save(tx)
return nil
})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you re-fetched the device entry in the transaction, and changed it, you should re-fetch it outside? Otherwise the next operationon newDeviceEntry will operate on stale data?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

defer func() {
if e != nil {
db.Update(func(tx *bolt.Tx) error {
newDeviceEntry.StorageFree(newBrickEntry.TotalSize())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

newDeviceEntry is stale here, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

if err != nil {
logger.Err(err)
return err
}
return nil
})
if err != nil {
logger.LogError("Replace brick operation succeeded but failed to update db")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

message is not fully making sense. please reword...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed this line

if err != nil {
logger.Err(err)
return err
}
return nil
})
if err != nil {
logger.LogError("Replace brick operation succeeded but failed to update db")
logger.Err(err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we combine the two log calls?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@obnoxxx obnoxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks good so far.

  • Were you able to reliably reproduce the problem with an unpatched server?
  • Can we also have unit tests for the new code?

v.removeBrickFromDb(tx, oldBrickEntry)
err = v.Save(tx)
reReadVolEntry.BrickAdd(newBrickEntry.Id())
reReadVolEntry.removeBrickFromDb(tx, oldBrickEntry)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any possibility of an error on above 2 operations?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@raghavendra-talur
Copy link
Member Author

raghavendra-talur commented Apr 6, 2017

@obnoxxx

Were you able to reliably reproduce the problem with an unpatched server?

Yes, ran test 5 times, got the following list of brickNumbers 8,12,6,13,8 instead of 16.

It is possible that a device gets more bricks added to it in a parallel
volume create/expand operation when we are adding a brick to it as part
of device remove opeation. What we save to disk should be the final
state which is consequence of both the operations.

Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
@raghavendra-talur raghavendra-talur force-pushed the fix-db-inconsistency branch 2 times, most recently from d333461 to 0e640de Compare April 6, 2017 17:06
Copy link
Contributor

@obnoxxx obnoxxx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good, except for the one request...

return err
}
newDeviceEntry.StorageFree(newBrickEntry.TotalSize())
newDeviceEntry.Save(tx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StorageFree returns error. shouldn't we catch it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gosh, Save returns error, not StorageFree ... ;-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not entirely sure about this. I guess it might be ok in this case. Also, we are not fully consistent with this across the code base. Inclined to merge it like this and we can sort it out later in full generality..

@obnoxxx
Copy link
Contributor

obnoxxx commented Apr 6, 2017

LGTM

@obnoxxx obnoxxx merged commit 696d002 into heketi:master Apr 6, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants