consul lock -monitor-retry failed during election #5437

Open
yeukhon opened this issue Mar 6, 2019 · 2 comments
Labels
help-wanted (We encourage community PRs for these issues!), type/bug (Feature does not function as expected)

Comments


yeukhon commented Mar 6, 2019

Overview of the Issue

I have 5 Consul servers (quorum is 3). -monitor-retry did not prevent consul lock from failing during a leader election. Running version 1.4.2.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 5 nodes, quorum=3.
  2. Run the following on node 1 (non-leader):
     consul lock -verbose -token=xxxx mylock 'sleep 60'
  3. Run the following on node 2 (non-leader):
     consul lock -token=xxxx -monitor-retry 60 mylock 'sleep 1'
  4. Run the following on the leader node:
     sudo systemctl restart consul

Consul info for both Client and Server

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 4
	services = 4
build:
	prerelease =
	revision = c97c712e
	version = 1.4.2
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.10.9.164:8300
	server = true
raft:
	applied_index = 1558164
	commit_index = 1558164
	fsm_pending = 0
	last_contact = 22.8813ms
	last_log_index = 1558164
	last_log_term = 32
	last_snapshot_index = 1549174
	last_snapshot_term = 20
	latest_configuration = ......
	latest_configuration_index = 1558146
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	term = 32
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 87
	max_procs = 10
	os = linux
	version = go1.11.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 16
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 40
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 127
	members = 5
	query_queue = 0
	query_time = 1

Operating system and Environment details

CentOS Linux release 7.6.1810 (Core)

Log Fragments

On node 2 (where -monitor-retry is used), the failure occurs roughly within the default timeout (3 seconds, according to the docs):

[myuser@myhost ~]$ consul lock -token=xxxx -monitor-retry 60 mylock 'sleep 1'
Lock acquisition failed: failed to read lock: Unexpected response code: 500
pearkes added the type/bug and help-wanted labels Apr 4, 2019
Contributor

pearkes commented Apr 4, 2019

It's possible -monitor-retry isn't respected all the way through. It'd be great if you could put together a simple reproduction for this if possible. Otherwise I'm going to tag this as a bug with help wanted, since the usage of the flag as you describe it should work.

pearkes changed the title from "consul lock -monitor-try failed during election" to "consul lock -monitor-retry failed during election" Apr 4, 2019
Contributor

gmichelo commented Aug 7, 2021

I am a new contributor, so take this with a grain of salt.

I just took a quick look at the code, and I think there is a misunderstanding of what -monitor-retry actually does.

As per documentation:

Retry up to this number of times if Consul returns a 500 error while monitoring the lock. [...] increases the amount of time required to detect a lost lock in some cases.

So, my understanding is that the lock must already be held and this flag just tells how many times the lock-monitor goroutine needs to retry before deciding that the lock is lost.
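
To make that reading of the flag concrete, here is a minimal, self-contained Go sketch of a monitor-style check with a retry budget. It is not the code in consul/api/lock.go: `stillHeld`, `flakyRead`, and `errServer` are hypothetical names invented for illustration, and the fake reader stands in for a KV read against Consul.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errServer is a hypothetical stand-in for a 500 response from Consul.
var errServer = errors.New("Unexpected response code: 500")

// flakyRead returns a fake lock read that fails with errServer for the first
// `failures` calls and afterwards reports the lock as still held.
func flakyRead(failures int) func() (bool, error) {
	calls := 0
	return func() (bool, error) {
		calls++
		if calls <= failures {
			return false, errServer
		}
		return true, nil
	}
}

// stillHeld performs one monitoring check on a lock that is already held.
// Server errors are retried up to `retries` times, waiting `wait` between
// attempts. Once the budget is spent, the lock is treated as lost.
func stillHeld(read func() (bool, error), retries int, wait time.Duration) bool {
	for {
		held, err := read()
		if err == nil {
			return held
		}
		if retries <= 0 || !errors.Is(err, errServer) {
			return false
		}
		retries--
		time.Sleep(wait)
	}
}

func main() {
	// Two consecutive server errors, budget of three retries.
	fmt.Println(stillHeld(flakyRead(2), 3, 10*time.Millisecond))
	// Same outage, budget of one retry.
	fmt.Println(stillHeld(flakyRead(2), 1, 10*time.Millisecond))
}
```

The point of the sketch is only that the retry budget is consulted while a held lock is being watched; nothing in it runs before the lock has been acquired.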

The error "Lock acquisition failed: failed to read lock: Unexpected response code: 500" probably comes from the lines of code below, i.e. before the monitor goroutine is even started, because the lock has not been acquired yet.

consul/api/lock.go

Lines 201 to 205 in d3325b0

// Look for an existing lock, blocking until not taken
pair, meta, err := kv.Get(l.opts.Key, &qOpts)
if err != nil {
	return nil, fmt.Errorf("failed to read lock: %v", err)
}

The monitor goroutine is started when the lock acquisition is done:

consul/api/lock.go

Lines 251 to 260 in d3325b0

HELD:
	// Watch to ensure we maintain leadership
	leaderCh := make(chan struct{})
	go l.monitorLock(l.lockSession, leaderCh)

	// Set that we own the lock
	l.isHeld = true

	// Locked! All done
	return leaderCh, nil
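
If that reading is right, a caller who wants to ride out an election during acquisition would have to retry the acquisition itself. The following is a rough Go sketch of such a wrapper, not a tested fix and not part of the Consul API: `acquireWithRetry`, `isServerError`, and `flakyAcquire` are hypothetical helpers, the `acquire` callback stands in for whatever call takes the lock, and matching on the error text from the log above is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// isServerError reports whether err looks like the 500 seen in the log above.
// Matching on the message text is an assumption made for this sketch.
func isServerError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Unexpected response code: 500")
}

// acquireWithRetry calls acquire until it succeeds, returns an error that is
// not a server error, or the attempts are used up.
func acquireWithRetry(acquire func() error, attempts int, wait time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = acquire(); err == nil {
			return nil
		}
		if !isServerError(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(wait)
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// flakyAcquire simulates an acquisition that fails with a 500 for the first
// `failures` calls (for example during a leader election) and then succeeds.
func flakyAcquire(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errors.New("failed to read lock: Unexpected response code: 500")
		}
		return nil
	}
}

func main() {
	err := acquireWithRetry(flakyAcquire(2), 5, 10*time.Millisecond)
	fmt.Println("acquired:", err == nil)
}
```

Errors other than a 500 are returned immediately, so a real failure such as a permission problem is not hidden behind the retries.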
