Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lock.Lock() returns without the lock way before LockWaitTime #4003

Closed
laurentganne opened this issue Mar 29, 2018 · 2 comments · Fixed by #4346
Closed

lock.Lock() returns without the lock way before LockWaitTime #4003

laurentganne opened this issue Mar 29, 2018 · 2 comments · Fixed by #4346
Milestone

Comments

@laurentganne
Copy link

laurentganne commented Mar 29, 2018

Description of the Issue (and unexpected/desired result)

Using the lock api with these LockOptions :

  • LockTryOnce: true,
  • LockWaitTime: a duration of 45 seconds,

would expect that a call to lock.Lock(nil) would attempt to acquire the lock for 45 seconds before returning without having succeeded.

But actually, in a multi-threaded environment where many threads attempted to acquire the lock, after 5 seconds many of them could get the lock but several threads saw lock.Lock(nil) return nil,nil.
Although only 5 seconds had passed, and the LockWaitTime specified was 45 seconds.

Looking at the code, in the following lines
at https://github.com/hashicorp/consul/blob/v1.0.6/api/lock.go#L181-L189

181	// Handle the one-shot mode.
182	if l.opts.LockTryOnce && attempts > 0 {
183		elapsed := time.Since(start)
184		if elapsed > qOpts.WaitTime {
185			return nil, nil
186		}
187
188		qOpts.WaitTime -= elapsed
189 }

Line 184, the elapsed time since start is compared to qOpts.WaitTime, which seems correct.
But the issue is that line 188, qOpts.WaitTime is lowered by substracting the elapsed time.
Let's say the elapsed time since start is around 5 seconds, after each attempt which can be quick,
you will substract around 5 seconds to qOpts.WaitTime.
So elapsed time will be > qOpts.WaitTime way sooner than expected.

I think a fix should be to replace line 184 :

184		if elapsed > qOpts.WaitTime {

by :

184		if elapsed > l.opts.LockWaitTime {

to compare the elapsed time with the LockWaitTime provided by the user.

consul version for both Client and Server

Using the latest version 1.0.6

@benagricola
Copy link

Am running into this as well using consul lock (the semaphore usage with multiple locks), and using much longer wait times (8h for example).

I think your fix works in that it will stop the consul lock process from returning early - however the qOpts are still set to an invalid value (?wait= has a 10 minute maximum, according to the api doco at https://www.consul.io/api/index.html).

I'm not sure what the correct behaviour is here. consul lock will happily wait for 8h with this fix in, and there seems to be no issue with the long polling as I suspect it is just set to 10m if the value is too high - however, this means that towards the end of the timeout (in the last 10 minutes) it's possible for consul lock to wait for more than the timeout, as the 'wait' var will always be 10 (or might default to 5).

Suspect this is why WaitTime was being decremented, albeit incorrectly, so that the long poll will return 'on time' if the remaining time is less than 10 minutes.

@banks
Copy link
Member

banks commented Jul 10, 2018

@benagricola the documentation for LockWaitTime (at least in code) points out that it is a tradeoff that affects how soon you might return if lock is not acquired.

Note that the timeout specified is used as the wait param in the blocking query, but Lock also loops around until it's expired. So in your example, each blocking query will only block for 10 mins, but the Lock() method won't return unless it either acquires the lock or the full 8 hours elapses.

I think the wait time decrementing was just a bug with no tests to catch it not an intentional way to mitigate the fact that you can't separate the individual blocking request time (and hence max delay before returning) and the actual time the lock attempt will block for.

The fix in the PR should make this behaviour work correctly though - the last blocking query when there is less than 10 mins of timeout left will subtract elapsed from 8hrs and get back the smaller wait time required to unblock right at the 8hr elapsed mark.

@banks banks added this to the 1.2.1 milestone Jul 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants