Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Back-port: Prevent the router from deadlocking itself when calling Commit() #13744

Conversation

knobunc
Copy link
Contributor

@knobunc knobunc commented Apr 12, 2017

Backport of #13717

The router reload function (Commit()) was changed so that it could be
rate limited. The rate limiter tracks changes in a kcache.Queue
object. It will coalesce changes to the same call so that if three
calls to Invoke() the rate limited function happen before the next
time it is allowed to process, then only one will occur.

Our problem was that we were doing:
Thread 1 (the rate-limiter background process):

  • Wake up
  • Ratelimit sees there is work to be done and calls fifo.go's Pop() function
  • Fifo.go acquires a fifo lock and call the processing function
  • Router.go's commitAndReload() function acquires a lock on the router object
    Thread 2 (triggered by the event handler that commit's changes to the router):
  • Get the event and process it
  • Since there are changes to be made, call router.Commit()
  • Commit() grabs a lock on the router object
  • Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
  • In order to queue the work... it acquires a lock on the fifo

So thread 1 locks: fifo then router; thread 2 locks: router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.

The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function. The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.

Bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)

@knobunc knobunc added this to the 1.5.0 milestone Apr 12, 2017
@knobunc knobunc self-assigned this Apr 12, 2017
@knobunc
Copy link
Contributor Author

knobunc commented Apr 12, 2017

[test]

@smarterclayton
Copy link
Contributor

LGTM [merge]

@openshift-bot
Copy link
Contributor

openshift-bot commented Apr 13, 2017

continuous-integration/openshift-jenkins/merge Waiting: You are in the build queue at position: 3

@smarterclayton
Copy link
Contributor

smarterclayton commented Apr 13, 2017 via email

@smarterclayton
Copy link
Contributor

I wonder if this actually regresses a previous fix for duplicate resync (possibly a bug in rate limited function).

@stevekuznetsov
Copy link
Contributor

You mean the TestRouterReloadSuppressionOnSync failure? Not sure

Backport of openshift#13717

The router reload function (Commit()) was changed so that it could be
rate limited.  The rate limiter tracks changes in a kcache.Queue
object.  It will coalesce changes to the same call so that if three
calls to Invoke() the rate limited function happen before the next
time it is allowed to process, then only one will occur.

Our problem was that we were doing:
 Thread 1 (the rate-limiter background process):
   - Wake up
   - Ratelimit sees there is work to be done and calls fifo.go's Pop() function
   - Fifo.go acquires a fifo lock and call the processing function
   - Router.go's commitAndReload() function acquires a lock on the router object
 Thread 2 (triggered by the event handler that commit's changes to the router):
   - Get the event and process it
   - Since there are changes to be made, call router.Commit()
   - Commit() grabs a lock on the router object
   - Then calls the rate-limiter wrapper around commitAndReload() using Invoke() to queue work
   - In order to queue the work... it acquires a lock on the fifo

So thread 1 locks: fifo then router; thread 2 locks: router then fifo.
If you get unlucky, those threads deadlock and you never process
another event.

The fix is to release the lock on the router object in our Commit()
function before we call Invoke on the rate limited function.  The lock
is not actually protecting anything at that point since the rate
limited function does its own locking, and is run in a separate thread
anyway.

Bug 1440977 (https://bugzilla.redhat.com/show_bug.cgi?id=1440977)
@knobunc knobunc force-pushed the bug/bz1440977-router-deadlock-fix-1.5 branch from 9f0e757 to 92ede84 Compare April 13, 2017 17:45
@openshift-bot
Copy link
Contributor

Evaluated for origin merge up to 92ede84

@openshift-bot
Copy link
Contributor

Evaluated for origin test up to 92ede84

@knobunc
Copy link
Contributor Author

knobunc commented Apr 13, 2017

The problem was that I needed to update the background commit function in the test. I've done that now. I'm pretty sure the double reload problem is not with the rate-limiter. I put the rate limiter code through its paces to make sure it worked correctly: https://gist.github.com/knobunc/c03ec0c70ec23f79129b8d7f1584aaa1

@openshift-bot
Copy link
Contributor

continuous-integration/openshift-jenkins/test UNSTABLE (https://ci.openshift.redhat.com/jenkins/job/test_pull_request_origin/738/) (Base Commit: 0b3bc05)

@stevekuznetsov
Copy link
Contributor

The UNSTABLE is due to release-1.5 needing a backport of 42516d1

@smarterclayton
Copy link
Contributor

Backported in #13764

@smarterclayton smarterclayton merged commit e43422d into openshift:release-1.5 Apr 13, 2017
@knobunc knobunc deleted the bug/bz1440977-router-deadlock-fix-1.5 branch June 7, 2018 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants