Confused by watch event behavior. Maybe bug? #3742

Open
nugend opened this Issue Dec 12, 2017 · 17 comments

nugend commented Dec 12, 2017

consul version for both Client and Server

Client: 1.0.1
Server: 1.0.1

consul info for both Client and Server

Client:

consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 9564c29
        version = 1.0.1
consul:
        bootstrap = false
        known_datacenters = 1
        leader = true
        leader_addr = 127.0.0.1:8300
        server = true
raft:
        applied_index = 10
        commit_index = 10
        fsm_pending = 0
        last_contact = 0
        last_log_index = 10
        last_log_term = 2
        last_snapshot_index = 0
        last_snapshot_term = 0
        latest_configuration = [{Suffrage:Voter ID:0079fb17-5098-6d39-b6be-fde0b482cde6 Address:127.0.0.1:8300}]
        latest_configuration_index = 1
        num_peers = 0
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Leader
        term = 2
runtime:
        arch = amd64
        cpu_count = 32
        goroutines = 73
        max_procs = 32
        os = linux
        version = go1.9.2
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 5
        event_time = 6
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1

Server:

Same

Operating system and Environment details

uname -a
Linux REDACTED 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Description of the Issue (and unexpected/desired result)

Started consul in dev mode:

$ consul agent -dev
$ cat ~/tmp/extract.py
#!/usr/bin/env python
import json
import sys
import base64
import datetime
import time

# Read the list of events handed to the watch handler on stdin
o = json.load(sys.stdin)
if len(o):
  # Print the time, how many events were in the payload, and the newest payload
  print((datetime.datetime.now(), len(o), base64.b64decode(o[-1]['Payload'])))

Sent some events

$ consul event -name=sample "Hello1"
Event ID: bc61dd84-bde2-d3e8-54cc-e93834c464e4
$ consul event -name=sample "Hello2"
Event ID: 6d395e5a-1173-ffea-7bab-35bf6e389666
$ consul event -name=sample "Hello3"
Event ID: a1b79dcb-9bb2-2ee0-67cc-af9f4da7268f
$ consul event -name=sample "Hello4"
Event ID: ce467f09-8768-f120-e0c6-68cc040b593a

Ran the following watch:

$ consul watch -type=event -name=sample $PWD/extract.py
(datetime.datetime(2017, 12, 12, 16, 43, 34, 834371), 1, b'Hello1')
(datetime.datetime(2017, 12, 12, 16, 43, 36, 693774), 1, b'Hello2')
(datetime.datetime(2017, 12, 12, 16, 43, 36, 744399), 2, b'Hello2')
(datetime.datetime(2017, 12, 12, 16, 43, 38, 887691), 1, b'Hello3')
(datetime.datetime(2017, 12, 12, 16, 43, 41, 205318), 1, b'Hello4')
(datetime.datetime(2017, 12, 12, 16, 43, 41, 253713), 4, b'Hello4')

So, basically, sometimes the watch gets fired twice: once for an individual event and again for the entire list. Frankly, I don't really care, as long as there's some rationale for what's going on so I can reason through it.

Log Fragments or Link to gist

==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.0.1'
           Node ID: '0079fb17-5098-6d39-b6be-fde0b482cde6'
         Node name: 'REDACTED'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/12/12 16:43:30 [DEBUG] Using random ID "0079fb17-5098-6d39-b6be-fde0b482cde6" as node ID
    2017/12/12 16:43:30 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:0079fb17-5098-6d39-b6be-fde0b482cde6 Address:127.0.0.1:8300}]
    2017/12/12 16:43:30 [INFO] raft: Node at 127.0.0.1:8300 [Follower] entering Follower state (Leader: "")
    2017/12/12 16:43:30 [INFO] serf: EventMemberJoin: REDACTED.dc1 127.0.0.1
    2017/12/12 16:43:30 [INFO] serf: EventMemberJoin: REDACTED 127.0.0.1
    2017/12/12 16:43:30 [INFO] consul: Adding LAN server REDACTED (Addr: tcp/127.0.0.1:8300) (DC: dc1)
    2017/12/12 16:43:30 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
    2017/12/12 16:43:30 [INFO] consul: Handled member-join event for server "REDACTED.dc1" in area "wan"
    2017/12/12 16:43:30 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
    2017/12/12 16:43:30 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
    2017/12/12 16:43:30 [INFO] agent: started state syncer
    2017/12/12 16:43:30 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/12/12 16:43:30 [INFO] raft: Node at 127.0.0.1:8300 [Candidate] entering Candidate state in term 2
    2017/12/12 16:43:30 [DEBUG] raft: Votes needed: 1
    2017/12/12 16:43:30 [DEBUG] raft: Vote granted from 0079fb17-5098-6d39-b6be-fde0b482cde6 in term 2. Tally: 1
    2017/12/12 16:43:30 [INFO] raft: Election won. Tally: 1
    2017/12/12 16:43:30 [INFO] raft: Node at 127.0.0.1:8300 [Leader] entering Leader state
    2017/12/12 16:43:30 [INFO] consul: cluster leadership acquired
    2017/12/12 16:43:30 [INFO] consul: New leader elected: REDACTED 
    2017/12/12 16:43:30 [DEBUG] consul: Skipping self join check for "REDACTED" since the cluster is too small
    2017/12/12 16:43:30 [INFO] consul: member 'REDACTED' joined, marking health alive
    2017/12/12 16:43:30 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
    2017/12/12 16:43:30 [INFO] agent: Synced node info
    2017/12/12 16:43:30 [DEBUG] agent: Node info in sync
    2017/12/12 16:43:31 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
    2017/12/12 16:43:31 [DEBUG] agent: Node info in sync
    2017/12/12 16:43:32 [DEBUG] http: Request GET /v1/agent/self (1.015ms) from=127.0.0.1:43564
    2017/12/12 16:43:32 [DEBUG] http: Request GET /v1/event/list?name=sample (112µs) from=127.0.0.1:43565
    2017/12/12 16:43:34 [DEBUG] http: Request GET /v1/agent/self (823µs) from=127.0.0.1:43566
    2017/12/12 16:43:34 [DEBUG] http: Request PUT /v1/event/fire/sample (175µs) from=127.0.0.1:43566
    2017/12/12 16:43:34 [DEBUG] consul: User event: sample
    2017/12/12 16:43:34 [DEBUG] agent: new event: sample (bc61dd84-bde2-d3e8-54cc-e93834c464e4)
    2017/12/12 16:43:34 [DEBUG] http: Request GET /v1/event/list?index=1&name=sample (2.058725s) from=127.0.0.1:43565
    2017/12/12 16:43:36 [DEBUG] http: Request GET /v1/agent/self (739µs) from=127.0.0.1:43567
    2017/12/12 16:43:36 [DEBUG] http: Request PUT /v1/event/fire/sample (132µs) from=127.0.0.1:43567
    2017/12/12 16:43:36 [DEBUG] consul: User event: sample
    2017/12/12 16:43:36 [DEBUG] agent: new event: sample (6d395e5a-1173-ffea-7bab-35bf6e389666)
    2017/12/12 16:43:36 [DEBUG] http: Request GET /v1/event/list?index=16766114972129736460&name=sample (1.809057s) from=127.0.0.1:43565
    2017/12/12 16:43:36 [DEBUG] http: Request GET /v1/event/list?name=sample (37µs) from=127.0.0.1:43565
    2017/12/12 16:43:38 [DEBUG] http: Request GET /v1/agent/self (2.18ms) from=127.0.0.1:43568
    2017/12/12 16:43:38 [DEBUG] http: Request PUT /v1/event/fire/sample (202µs) from=127.0.0.1:43568
    2017/12/12 16:43:38 [DEBUG] consul: User event: sample
    2017/12/12 16:43:38 [DEBUG] agent: new event: sample (a1b79dcb-9bb2-2ee0-67cc-af9f4da7268f)
    2017/12/12 16:43:38 [DEBUG] http: Request GET /v1/event/list?index=1626481048861501836&name=sample (2.09145s) from=127.0.0.1:43565
    2017/12/12 16:43:41 [DEBUG] http: Request GET /v1/agent/self (751µs) from=127.0.0.1:43569
    2017/12/12 16:43:41 [DEBUG] http: Request PUT /v1/event/fire/sample (198µs) from=127.0.0.1:43569
    2017/12/12 16:43:41 [DEBUG] consul: User event: sample
    2017/12/12 16:43:41 [DEBUG] agent: new event: sample (ce467f09-8768-f120-e0c6-68cc040b593a)
    2017/12/12 16:43:41 [DEBUG] http: Request GET /v1/event/list?index=14302080381595486319&name=sample (2.268265s) from=127.0.0.1:43565
    2017/12/12 16:43:41 [DEBUG] http: Request GET /v1/event/list?name=sample (54µs) from=127.0.0.1:43565
==> Failed to check for updates: Get https://checkpoint-api.hashicorp.com/v1/check/consul?arch=amd64&os=linux&signature=&version=1.0.1: Proxy Authentication Required
    2017/12/12 16:44:30 [DEBUG] consul: Skipping self join check for "REDACTED" since the cluster is too small
    2017/12/12 16:44:49 [DEBUG] http: Request GET /v1/agent/self (2.2ms) from=127.0.0.1:43571
    2017/12/12 16:44:58 [DEBUG] http: Request GET /v1/agent/self (793µs) from=127.0.0.1:43572
    2017/12/12 16:45:14 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
    2017/12/12 16:45:14 [DEBUG] agent: Node info in sync
    2017/12/12 16:45:30 [DEBUG] manager: Rebalanced 1 servers, next active server is REDACTED.dc1 (Addr: tcp/127.0.0.1:8300) (DC: dc1)
    2017/12/12 16:45:30 [DEBUG] consul: Skipping self join check for "REDACTED" since the cluster is too small
    2017/12/12 16:46:30 [DEBUG] consul: Skipping self join check for "REDACTED" since the cluster is too small
    2017/12/12 16:47:07 [DEBUG] Skipping remote check "serfHealth" since it is managed automatically
    2017/12/12 16:47:07 [DEBUG] agent: Node info in sync
    2017/12/12 16:47:30 [DEBUG] consul: Skipping self join check for "REDACTED" since the cluster is too small

nugend changed the title from "Confused by watch event behavior. Mabye bug?" to "Confused by watch event behavior. Maybe bug?" on Dec 12, 2017

nugend commented Dec 12, 2017

This seems to be due to IDToIndex which tries to convert an event UUID into an index value for the watch planner to use.

Is there a good reason LTime isn't used instead? Lamport Time should monotonically increase as events occur, right?

edit: Oh, wait, does the LTime come from the client firing the event?

nugend commented Dec 12, 2017

I guess if it has to be a UUID, then maybe UserEvents should be generated with a v1 UUID instead of a v4 UUID?

That way there might be something resembling monotonicity for the events?
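
To illustrate what I mean, here is a quick Python sketch (not Consul code, just an illustration of the suggestion): a v1 UUID embeds a timestamp in its fields, so successive IDs generated on the same host carry an increasing time component, whereas a v4 UUID is entirely random.

#!/usr/bin/env python
# Illustration only: compare the ordering properties of v1 vs v4 UUIDs.
import uuid

a, b = uuid.uuid1(), uuid.uuid1()
print(a.time <= b.time)   # True: the embedded 60-bit timestamp increases

c, d = uuid.uuid4(), uuid.uuid4()
print(c.int <= d.int)     # Coin flip: v4 IDs carry no ordering at all

(Note the ordering lives in the timestamp fields, not in the UUID's string form, so the index derivation would still need to extract it rather than XOR the two halves.)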

leprechau commented Dec 13, 2017

This is normal behavior from what I can see. We rely on user events for broadcasts between a few of our golang applications. The event handler will return all of the events in the buffer whenever the event index changes. You will see the event index in the X-Consul-Index header. Whenever you get a hit on the watcher, the events should be sorted by increasing Lamport time (newest at the bottom of the list). You will either need to keep track of the events already processed, or just assume you won't miss an event (trust consul) and look at the last event in the payload on each hit of the watcher.

leprechau commented Dec 13, 2017

You can see the index in the X-Consul-Index header, along with the full payload, with something like: curl -vvv localhost:8500/v1/event/list and looking at the headers.

Also to your question about Lamport Time ... The Lamport clock is a monotonic tick that is kept in sync across the cluster.

leprechau commented Dec 13, 2017

The event payload also includes the LTime as well as the ID ...

  {
    "ID": "5fb0f019-24ca-5f20-bb96-390bd79c3db7",
    "Name": "xxx-fake-event",
    "Payload": "xxx-censored-base64-payload",
    "NodeFilter": "",
    "ServiceFilter": "",
    "TagFilter": "",
    "Version": 1,
    "LTime": 8644212
  }

nugend commented Dec 13, 2017

The problem isn't that I get the list of recent events. The problem is that I get the most recent event and THEN I get the list of recent events sometimes.

This seems to be due to the behavior of these lines:

consul/watch/plan.go

Lines 89 to 91 in 6be97eb

if p.lastIndex < oldIndex {
	p.lastIndex = 0
}

Combined with the way that the index is calculated here:

consul/api/event.go

Lines 92 to 104 in 6f0a3b9

func (e *Event) IDToIndex(uuid string) uint64 {
	lower := uuid[0:8] + uuid[9:13] + uuid[14:18]
	upper := uuid[19:23] + uuid[24:36]
	lowVal, err := strconv.ParseUint(lower, 16, 64)
	if err != nil {
		panic("Failed to convert " + lower)
	}
	highVal, err := strconv.ParseUint(upper, 16, 64)
	if err != nil {
		panic("Failed to convert " + upper)
	}
	return lowVal ^ highVal
}

And the index is created here:

if params.ID, err = uuid.GenerateUUID(); err != nil {

And as I mentioned, because the UUID is a randomly generated one (UUID code here: http://github.com/hashicorp/go-uuid), there's no ordering of event IDs for p.lastIndex to consistently compare with oldIndex.
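
To make that concrete, here is a rough Python mirror of that IDToIndex arithmetic (illustration only, not the actual client code), applied to the four event IDs from the run above:

# Re-implementation of the lower/upper split and XOR from api/event.go,
# applied to the event IDs fired earlier, to show the derived "index"
# values are effectively random 64-bit numbers rather than increasing.
ids = [
    "bc61dd84-bde2-d3e8-54cc-e93834c464e4",  # Hello1
    "6d395e5a-1173-ffea-7bab-35bf6e389666",  # Hello2
    "a1b79dcb-9bb2-2ee0-67cc-af9f4da7268f",  # Hello3
    "ce467f09-8768-f120-e0c6-68cc040b593a",  # Hello4
]

def id_to_index(u):
    lower = u[0:8] + u[9:13] + u[14:18]  # first 64 bits of the UUID, as hex
    upper = u[19:23] + u[24:36]          # last 64 bits of the UUID, as hex
    return int(lower, 16) ^ int(upper, 16)

for i in ids:
    print(id_to_index(i))
# Whether each value is higher or lower than the previous one is a coin flip.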

Consequently, you can see in the server logs that the index parameter in the GET event/list requests is not incremental, and that when a higher index precedes a lower one, a GET request is issued without the index parameter, retrieving the whole list.

If I happened to get very lucky, I might get a run of several events which had a higher "index" value than the previous ones and not get the entire list republished.

If my analysis of the code behavior is wrong, please highlight where it is incorrect. If this really is the intended behavior of event watches, I'd appreciate someone from the Hashi team saying so. If the Lamport clock is really synchronized (which somehow seems not right given that the events are supposed to be published even if the masters are locked, but I might be misunderstanding their role in the cluster behavior), then I don't see why the Lamport Index isn't used to set the index value for the purposes of event comparison in the watch. And if it isn't synchronized, then I really think that v1 UUIDs would be a fairly reasonable and useful alternative given the stated semantics of events in the documentation: even assuming really badly synchronized time across multiple hosts, it's better than random event indices.

leprechau commented Dec 13, 2017

Random UUIDs are not the cause of the problem. The event id is not required to be incremental or related. Watches in consul are performed via blocking queries:

https://www.consul.io/api/index.html#blocking-queries

If you do not pass an index to the /v1/event/list endpoint you will simply get the full list of events. That is not a watch.

The randomness of the event id is also referenced as a caveat in the event endpoint documentation:

https://www.consul.io/api/event.html#caveat

In short ... a watch (blocking query) only occurs if the request passes the X-Consul-Index header with a value from a previous list call. When you make the event list call without an index you should expect an immediate dump of all events in the buffer.

leprechau commented Dec 13, 2017

One other point, the behavior you see with a /v1/event/list call being made without an index and then being made again with an index is the only way to implement a watch. You have to get the current state (index) before you can pass it and force a blocking query/watch.
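
Roughly, the sequence looks like this; a sketch in Python, assuming a local agent on 127.0.0.1:8500 and the third-party requests library (this is not the actual watch implementation, just the shape of it):

#!/usr/bin/env python
# First request has no index and returns immediately with the current list;
# every later request passes the last X-Consul-Index and blocks until it changes.
import requests

URL = "http://127.0.0.1:8500/v1/event/list"
index = None

while True:
    params = {"name": "sample"}
    if index is not None:
        params["index"] = index
    resp = requests.get(URL, params=params, timeout=600)
    resp.raise_for_status()
    index = resp.headers.get("X-Consul-Index")
    events = resp.json()
    print(len(events), "event(s) in payload, index is now", index)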

nugend commented Dec 13, 2017

The queries block when the current index for an endpoint is equal to the provided index.

The caveat implies that the restriction is due to the gossip protocol delivering events out of order, but that's not why these events are out of order.

I understand that on the first invocation of the event/list endpoint, the total list needs to be returned, but the problem is that I don't see why for subsequent events, a single event is returned and then, randomly, the entire list is returned again. That seems to be only because of the choice of event id being based on a random UUID.

Do you at least see what I'm talking about in the code I cited?

leprechau commented Dec 13, 2017

In the first block I see where you reference the comparison of the indexes. That should be checking the wait index vs the passed index either via index query parameter or the X-Consul-Index header.

The IDToIndex function is indeed a hack, as documented in the comment before the function, and that could affect how many events are returned when you look at the code in the internal watch function that attempts to prune the returned results:

consul/watch/funcs.go

Lines 222 to 229 in 9564c29

// Prune to only the new events
for i := 0; i < len(events); i++ {
	if event.IDToIndex(events[i].ID) == p.lastIndex {
		events = events[i+1:]
		break
	}
}
return meta.LastIndex, events, err

I do understand the confusion as to why you sometimes see a single event and sometimes you see the whole event buffer. In our usage of the event watches we take the last element of the returned result as the newest event when the watch triggers. If you want to be sure you get ALL events since the last incarnation of the watch, and not just the absolute newest, you would need to keep track of the LTime associated with the last event you processed.

nugend commented Dec 13, 2017

If it were consistently sending the entire list for each new event, that'd be fine by me. I really don't care.

It's this:

(datetime.datetime(2017, 12, 12, 16, 43, 36, 693774), 1, b'Hello2')
(datetime.datetime(2017, 12, 12, 16, 43, 36, 744399), 2, b'Hello2')
(datetime.datetime(2017, 12, 12, 16, 43, 38, 887691), 1, b'Hello3')
(datetime.datetime(2017, 12, 12, 16, 43, 41, 205318), 1, b'Hello4')
(datetime.datetime(2017, 12, 12, 16, 43, 41, 253713), 4, b'Hello4')

That is a problem. Do you see how there are two invocations for "Hello2", the second of which contains the entire list, one invocation for "Hello3" which just has the most recent event, and then two for "Hello4"?

It's just random whether you get two invocations of your watch function or not, because of the randomized event indices. If the design were just "Okay, every time there's an event published, you'll get everything" or "You only ever get the most recent event; if you want the full list, you have to ask for it yourself", that'd be fine. It'd be easy enough to work with. The non-determinism of frequent double invocations is hard to work with, though.

Or hey, maybe the watch func should actually track last non-zero index and discard based on that rather than p.lastIndex.

Would this work? (I don't really know go, so please forgive any syntax errors)

func eventWatch(params map[string]interface{}) (WatcherFunc, error) {
	// The stale setting doesn't apply to events.

	var lastNonZeroIndex uint64 = 0
	var name string
	if err := assignValue(params, "name", &name); err != nil {
		return nil, err
	}

	fn := func(p *Plan) (uint64, interface{}, error) {
		event := p.client.Event()
		opts := makeQueryOptionsWithContext(p, false)
		defer p.cancelFunc()
		events, meta, err := event.List(name, &opts)
		if err != nil {
			return 0, nil, err
		}

		// Remember the last non-zero index so a reset to 0 doesn't
		// cause the whole list to be republished.
		if p.lastIndex != 0 {
			lastNonZeroIndex = p.lastIndex
		}

		// Prune to only the new events
		for i := 0; i < len(events); i++ {
			if event.IDToIndex(events[i].ID) == lastNonZeroIndex {
				events = events[i+1:]
				break
			}
		}
		return meta.LastIndex, events, err
	}
	return fn, nil
}

leprechau commented Dec 13, 2017

Like I said, I can understand your frustration and I can see where you may be getting confused. However, I don't see a need to modify the watch plan code in the core. Taking a very similar example to your first, look at the output with the full payloads:

Window 1: consul agent -dev

    2017/12/13 12:13:33 [DEBUG] http: Request GET /v1/agent/self (1.142439ms) from=127.0.0.1:63121
    2017/12/13 12:13:34 [DEBUG] http: Request GET /v1/event/list?name=test (122.733µs) from=127.0.0.1:63122
    2017/12/13 12:13:39 [DEBUG] http: Request GET /v1/agent/self (374.559µs) from=127.0.0.1:63125
    2017/12/13 12:13:39 [DEBUG] consul: User event: test
    2017/12/13 12:13:39 [DEBUG] http: Request PUT /v1/event/fire/test (160.729µs) from=127.0.0.1:63125
    2017/12/13 12:13:39 [DEBUG] agent: new event: test (13fa61a3-8fe4-140a-bffd-8599499bcc3a)
    2017/12/13 12:13:39 [DEBUG] http: Request GET /v1/event/list?index=1&name=test (4.982070097s) from=127.0.0.1:63122
    2017/12/13 12:13:43 [DEBUG] http: Request GET /v1/agent/self (331.579µs) from=127.0.0.1:63129
    2017/12/13 12:13:43 [DEBUG] consul: User event: test
    2017/12/13 12:13:43 [DEBUG] http: Request PUT /v1/event/fire/test (130.639µs) from=127.0.0.1:63129
    2017/12/13 12:13:43 [DEBUG] agent: new event: test (08296359-4d76-7bda-14ff-28fa3dc37125)
    2017/12/13 12:13:43 [DEBUG] http: Request GET /v1/event/list?index=12396127440450082864&name=test (3.947802445s) from=127.0.0.1:63122
    2017/12/13 12:13:43 [DEBUG] http: Request GET /v1/event/list?name=test (66.634µs) from=127.0.0.1:63122
==> Newer Consul version available: 1.0.1 (currently running: 0.9.3)
    2017/12/13 12:14:05 [DEBUG] http: Request GET /v1/agent/self (394.77µs) from=127.0.0.1:63136
    2017/12/13 12:14:05 [DEBUG] consul: User event: test
    2017/12/13 12:14:05 [DEBUG] http: Request PUT /v1/event/fire/test (136.664µs) from=127.0.0.1:63136
    2017/12/13 12:14:05 [DEBUG] agent: new event: test (7501e379-a9ec-499f-ffe6-70d126f7bf04)
    2017/12/13 12:14:05 [DEBUG] http: Request GET /v1/event/list?index=2077931443420728063&name=test (22.533914826s) from=127.0.0.1:63122
    2017/12/13 12:14:10 [DEBUG] http: Request GET /v1/agent/self (341.497µs) from=127.0.0.1:63139
    2017/12/13 12:14:10 [DEBUG] consul: User event: test
    2017/12/13 12:14:10 [DEBUG] http: Request PUT /v1/event/fire/test (115.587µs) from=127.0.0.1:63139
    2017/12/13 12:14:10 [DEBUG] agent: new event: test (04c9d19e-2854-b135-ccdb-3ad28598b1f0)
    2017/12/13 12:14:10 [DEBUG] http: Request GET /v1/event/list?index=10009131049018979995&name=test (4.681049627s) from=127.0.0.1:63122
    2017/12/13 12:14:28 [DEBUG] consul: Skipping self join check for "ahurt1.local" since the cluster is too small
    2017/12/13 12:14:34 [DEBUG] agent: Node info in sync
    2017/12/13 12:15:28 [DEBUG] manager: Rebalanced 1 servers, next active server is ahurt1.local.dc1 (Addr: tcp/127.0.0.1:8300) (DC: dc1)
    2017/12/13 12:15:28 [DEBUG] consul: Skipping self join check for "ahurt1.local" since the cluster is too small
    2017/12/13 12:15:43 [DEBUG] agent: Node info in sync

Window 2: consul watch -type=event -name=test jq

Immediate result after starting watch

[]

First event

[
  {
    "ID": "13fa61a3-8fe4-140a-bffd-8599499bcc3a",
    "Name": "test",
    "Payload": "dGVzdDE=",
    "NodeFilter": "",
    "ServiceFilter": "",
    "TagFilter": "",
    "Version": 1,
    "LTime": 2
  }
]

Second event

[
  {
    "ID": "08296359-4d76-7bda-14ff-28fa3dc37125",
    "Name": "test",
    "Payload": "dGVzdDI=",
    "NodeFilter": "",
    "ServiceFilter": "",
    "TagFilter": "",
    "Version": 1,
    "LTime": 3
  }
]
[
  {
    "ID": "13fa61a3-8fe4-140a-bffd-8599499bcc3a",
    "Name": "test",
    "Payload": "dGVzdDE=",
    "NodeFilter": "",
    "ServiceFilter": "",
    "TagFilter": "",
    "Version": 1,
    "LTime": 2
  },
  {
    "ID": "08296359-4d76-7bda-14ff-28fa3dc37125",
    "Name": "test",
    "Payload": "dGVzdDI=",
    "NodeFilter": "",
    "ServiceFilter": "",
    "TagFilter": "",
    "Version": 1,
    "LTime": 3
  }
]

Third event

[
  {
    "ID": "7501e379-a9ec-499f-ffe6-70d126f7bf04",
    "Name": "test",
    "Payload": "dGVzdDM=",
    "NodeFilter": "",
    "ServiceFilter": "",
    "TagFilter": "",
    "Version": 1,
    "LTime": 4
  }
]

Fourth event

[
  {
    "ID": "04c9d19e-2854-b135-ccdb-3ad28598b1f0",
    "Name": "test",
    "Payload": "dGVzdDQ=",
    "NodeFilter": "",
    "ServiceFilter": "",
    "TagFilter": "",
    "Version": 1,
    "LTime": 5
  }
]

Window 3: Fire!

ahurt1:~ ahurt$ consul event -name=test test1
Event ID: 13fa61a3-8fe4-140a-bffd-8599499bcc3a
ahurt1:~ ahurt$ consul event -name=test test2
Event ID: 08296359-4d76-7bda-14ff-28fa3dc37125
ahurt1:~ ahurt$ consul event -name=test test3
Event ID: 7501e379-a9ec-499f-ffe6-70d126f7bf04
ahurt1:~ ahurt$ consul event -name=test test4
Event ID: 04c9d19e-2854-b135-ccdb-3ad28598b1f0

It doesn't really matter how many elements you get back in the watch payload. The last element is the most recent, with the newest Lamport Time, and the only element you should actually be concerned about. If you really want to ensure you never miss an event due to network issues or agent downtime, you could persist and check the Lamport Time (LTime) of each returned element in the payload and only process events that have a newer LTime than you saw in the last payload. Does that make sense?
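
For example, a handler in the spirit of the extract.py above could persist the newest LTime it has handled and skip anything older; just a sketch, with a made-up state-file location:

#!/usr/bin/env python
# LTime-deduplicating watch handler (illustration only).
import base64
import json
import os
import sys

STATE = os.path.expanduser("~/tmp/last_ltime")  # made-up path for the example

def last_seen():
    try:
        with open(STATE) as f:
            return int(f.read().strip())
    except (IOError, ValueError):
        return 0

events = json.load(sys.stdin)
seen = last_seen()
for e in events:
    if e["LTime"] > seen:  # only process events newer than the last payload
        print(e["LTime"], base64.b64decode(e["Payload"]))
        seen = e["LTime"]

with open(STATE, "w") as f:  # remember the newest LTime we handled
    f.write(str(seen))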

nugend commented Dec 13, 2017

Right, I understand how it is working now. You keep implying that I don't understand how it works. I have read through the code and observed the behavior. I understand what mitigation strategies might be used to work around the behavior as it currently stands and get it to behave in a more useful way.

Do you agree that the presence of code in the event handler intended to drop records that are “older” than the most recently seen index indicates that the current behavior is not necessarily the intended behavior?

leprechau commented Dec 13, 2017

Sorry, it was not my intention to be derogatory.

Yes, I agree that, reading the comments, it appears the desired behavior was to filter out "old" events from the watch payload. The comment above IDToIndex also seems to indicate they are aware it's not going to work in all situations.

slackpad commented Dec 18, 2017

Pulled some duplicate issues into this and tagging for a closer look - it seems like there's a possible code change to make this better, or at least some better documentation.

pierresouchay commented Feb 15, 2018

might be related to #3890 as well

nugend commented Feb 15, 2018

@pierresouchay I dunno. I thought this was only on the events endpoint since it was using random indices.
