-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock issue around reaper's mutex during async-nsenter execution #36
Conversation
Problem diagnosis has been well documented as part of issue [#266](nestybox/sysbox#266). Here I'm briefly explaining the fix that I've identified for this bug. As described in the issue's notes, the problem is caused by a deadlock scenario while dealing with the reaper's RWmutex. The fix is fairly simple: have 'async' nsenter logic remove its reaper's dependency. See that there's no real need for the reaper in the 'async' nsenter case as the SendRequest() callee will always sigkill() the nsenter process, so we can safely remove all this concurrency complexity in this case. Fix has been heavily tested over the last couple of weeks in the same scaling scenario where the problem was originally reported. No deadlock issues have been observed ever since. Signed-off-by: Rodny Molina <rmolina@nestybox.com>
nsenter/event.go
Outdated
@@ -548,6 +566,14 @@ func (e *NSenterEvent) ReceiveResponse() *domain.NSenterMessage { | |||
return e.ResMsg | |||
} | |||
|
|||
func (e *NSenterEvent) ReapStartRequest() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see this new function called from anywhere; same for ReapEndRequest() below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, i ended up not needing these new methods. Will remove them and merge.
domain/nsenter.go
Outdated
@@ -132,6 +132,8 @@ type NSenterEventIface interface { | |||
SetResponseMsg(m *NSenterMessage) | |||
GetResponseMsg() *NSenterMessage | |||
GetProcessID() uint32 | |||
ReapStartRequest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any callers for these new interface methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG, just a couple of comments.
Signed-off-by: Rodny Molina <rmolina@nestybox.com>
Problem diagnosis has been well documented as part of issue #266. Here I'm briefly explaining the fix that I've identified for this bug.
As described in the issue's notes, the problem is caused by a deadlock scenario while dealing with the reaper's RWmutex. The fix is fairly simple: have 'async' nsenter logic remove its reaper's dependency. See that there's no real need for the reaper in the 'async' nsenter case as the SendRequest() callee will always sigkill() the nsenter process, so we can safely remove all this concurrency complexity in this case.
Fix has been heavily tested over the last couple of weeks in the same scaling scenario where the problem was originally reported. No deadlock issues has been observed ever since.
Signed-off-by: Rodny Molina rmolina@nestybox.com