Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock issue around reaper's mutex during async-nsenter execution #36

Merged
merged 2 commits into from
May 15, 2021

Conversation

rodnymolina
Copy link
Member

Problem diagnosis has been well documented as part of issue #266. Here I'm briefly explaining the fix that I've identified for this bug.

As described in the issue's notes, the problem is caused by a deadlock scenario while dealing with the reaper's RWmutex. The fix is fairly simple: have 'async' nsenter logic remove its reaper's dependency. See that there's no real need for the reaper in the 'async' nsenter case as the SendRequest() callee will always sigkill() the nsenter process, so we can safely remove all this concurrency complexity in this case.

Fix has been heavily tested over the last couple of weeks in the same scaling scenario where the problem was originally reported. No deadlock issues has been observed ever since.

Signed-off-by: Rodny Molina rmolina@nestybox.com

Problem	diagnosis has been well documented as part of issue [#266](nestybox/sysbox#266). Here	I'm briefly explaining the fix that I've identified for this bug.

As described in	the issue's notes, the problem is caused by a deadlock scenario while dealing with the reaper's RWmutex. The fix is fairly simple: have 'async' nsenter logic remove its reaper's dependency. See that there's no real need for the reaper in the 'async' nsenter case as the SendRequest() callee will always sigkill() the nsenter process, so we can safely remove all this concurrency complexity in this case.

Fix has	been heavily tested over the last couple of weeks in the same scaling scenario where the problem was originally reported. No deadlock issues have been observed ever since.

Signed-off-by: Rodny Molina <rmolina@nestybox.com>
@rodnymolina rodnymolina requested a review from ctalledo May 15, 2021 06:21
nsenter/event.go Outdated
@@ -548,6 +566,14 @@ func (e *NSenterEvent) ReceiveResponse() *domain.NSenterMessage {
return e.ResMsg
}

func (e *NSenterEvent) ReapStartRequest() {
Copy link
Member

@ctalledo ctalledo May 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see this new function called from anywhere; same for ReapEndRequest() below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, i ended up not needing these new methods. Will remove them and merge.

@@ -132,6 +132,8 @@ type NSenterEventIface interface {
SetResponseMsg(m *NSenterMessage)
GetResponseMsg() *NSenterMessage
GetProcessID() uint32
ReapStartRequest()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any callers for these new interface methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

Copy link
Member

@ctalledo ctalledo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG, just a couple of comments.

Signed-off-by: Rodny Molina <rmolina@nestybox.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants