fix race between device remove and other ops #736
Conversation
Can one of the admins verify this patch? |
@centos-ci ok to test |
64dec70
to
504b492
Compare
@lpabon @obnoxxx @MohamedAshiqrh This patch is ready for review. I have run unit tests and SmokeTests on the code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aproach looks good in general, but I think there are still places where we operate on stale data.
v.gidRequested, v.Info.Id) | ||
deviceEntry.Save(tx) | ||
return nil | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about checking err?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
float64(v.Info.Snapshot.Factor), | ||
v.gidRequested, v.Info.Id) | ||
err = db.Update(func(tx *bolt.Tx) error { | ||
deviceEntry, _ := NewDeviceEntryFromId(tx, deviceId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment explaining why you re-fetch the device here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
v.gidRequested, v.Info.Id) | ||
deviceEntry.Save(tx) | ||
return nil | ||
}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now you re-fetched the device entry in the transaction, and changed it, you should re-fetch it outside? Otherwise the next operationon newDeviceEntry will operate on stale data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
defer func() { | ||
if e != nil { | ||
db.Update(func(tx *bolt.Tx) error { | ||
newDeviceEntry.StorageFree(newBrickEntry.TotalSize()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newDeviceEntry is stale here, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
if err != nil { | ||
logger.Err(err) | ||
return err | ||
} | ||
return nil | ||
}) | ||
if err != nil { | ||
logger.LogError("Replace brick operation succeeded but failed to update db") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
message is not fully making sense. please reword...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed this line
if err != nil { | ||
logger.Err(err) | ||
return err | ||
} | ||
return nil | ||
}) | ||
if err != nil { | ||
logger.LogError("Replace brick operation succeeded but failed to update db") | ||
logger.Err(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we combine the two log calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test looks good so far.
- Were you able to reliably reproduce the problem with an unpatched server?
- Can we also have unit tests for the new code?
v.removeBrickFromDb(tx, oldBrickEntry) | ||
err = v.Save(tx) | ||
reReadVolEntry.BrickAdd(newBrickEntry.Id()) | ||
reReadVolEntry.removeBrickFromDb(tx, oldBrickEntry) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any possibility of an error on above 2 operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Yes, ran test 5 times, got the following list of brickNumbers 8,12,6,13,8 instead of 16. |
It is possible that a device gets more bricks added to it in a parallel volume create/expand operation when we are adding a brick to it as part of device remove opeation. What we save to disk should be the final state which is consequence of both the operations. Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
Signed-off-by: Raghavendra Talur <rtalur@redhat.com>
d333461
to
0e640de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good, except for the one request...
return err | ||
} | ||
newDeviceEntry.StorageFree(newBrickEntry.TotalSize()) | ||
newDeviceEntry.Save(tx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StorageFree returns error. shouldn't we catch it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gosh, Save returns error, not StorageFree ... ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not entirely sure about this. I guess it might be ok in this case. Also, we are not fully consistent with this across the code base. Inclined to merge it like this and we can sort it out later in full generality..
LGTM |
It is possible that a device gets more bricks added to it in a parallel
volume create/expand operation when we are adding a brick to it as part
of device remove opeation. What we save to disk should be the final
state which is consequence of both the operations.
Signed-off-by: Raghavendra Talur rtalur@redhat.com