Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lock and semaphore timeouts #4346

Merged
merged 1 commit into from
Jul 10, 2018
Merged

Conversation

hamishforbes
Copy link
Contributor

Currently when contending a semaphore (or lock) decrements the query wait time by the elapsed time since the attempt to acquire started.
The overall timeout check for the acquire call is also performed against the query timeout.

When there are many attempts at contending (e.g. you have many processes running consul lock) this can cause the timeout to be reduced by far too much far too quickly.

With some debug logging in place I ran 8 consul lock processes and then killed the sleep command after a short period of time:

consul:master> bin/consul lock -verbose --timeout 20m -n 2 "test_lock" sleep 86400
Setting up semaphore (limit 2) at prefix: test_lock
Attempting lock acquisition
Query Wait Time: 20m0s
Elapsed: 7.056498ms
Query Wait Time: 19m59.992943502s

Elapsed: 40.010927567s
Query Wait Time: 18m55.326596046s
Elapsed: 40.02293547s
Query Wait Time: 18m15.303660576s

Elapsed: 1m41.701636479s
Query Wait Time: 16m33.602024097s
Elapsed: 1m41.751224579s
Query Wait Time: 14m51.850799518s

Elapsed: 1m53.171078743s
Query Wait Time: 12m58.679720775s
Elapsed: 1m53.176185899s
Query Wait Time: 11m5.503534876s

Elapsed: 1m58.033721344s
Query Wait Time: 9m7.469813532s
Elapsed: 1m58.040545154s
Query Wait Time: 7m9.429268378s

You can see that after less than 2 minutes we've reduced the wait time by over 12 minutes.

The acquire attempt shouldn't timeout until the originally specified timeout has been reached,
and the query wait time should be the original timeout, decremented by the elapsed time so that the query won't block beyond the overall timeout.

With these changes:

consul:lock_timeout> bin/consul lock -verbose --timeout 20m -n 2 "test_lock" sleep 86400
Setting up semaphore (limit 2) at prefix: test_lock
Attempting lock acquisition
Query Wait Time: 20m0s
Elapsed: 25.799709ms
Query Wait Time: 19m59.974200291s

Elapsed: 1m17.384652686s
Query Wait Time: 18m42.615347314s
Elapsed: 1m17.399565454s
Query Wait Time: 18m42.600434546s


Elapsed: 1m30.830220906s
Query Wait Time: 18m29.169779094s
Elapsed: 1m30.85710184s
Query Wait Time: 18m29.14289816s

Elapsed: 1m40.2700322s
Query Wait Time: 18m19.7299678s

Writing tests for this is tricky as you need to have multiple clients contending at once to trigger it.

Fixes #4003 #3262 #2399

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. The logic looks good and the manual demonstration in PR is useful sanity check.

It's unfortunate it's hard to write tests for this kind of thing which would be ideal but I'm happy that the logic changes here are correct.

The key bug as you noted was in decrementing the wait time rather than subtracting elapsed form the configured max wait.

@mkeeler @kyhavlov any concerns before we merge this?

Copy link
Member

@mkeeler mkeeler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@banks
Copy link
Member

banks commented Jul 10, 2018

Thanks @hamishforbes !

@banks banks merged commit 7ed3b23 into hashicorp:master Jul 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lock.Lock() returns without the lock way before LockWaitTime
3 participants