Graphing of lag over time #4

Closed · splee opened this issue Jun 15, 2015 · 28 comments

@splee

splee commented Jun 15, 2015

Hi,

Sorry for posting this as an issue - if there is a better way to get in contact with the team, I haven't been able to find it on the wiki so far :)

As an engineer I love to see graphs over time for our systems, and we've been using raw offset requests for this in our apps. I was hoping that Burrow would let us ingest consumer metrics (specifically lag, so we know whether we need to scale up our consumers) into our InfluxDB instance for this purpose, but it doesn't seem to be possible.

At the moment, from what I can tell, you can only have the HTTP Notifier or the HTTP endpoints return lag when the consumers are not in an "OK" state.

Is there a reason for that, before I start working on a pull request? It'd be great to be able to have some kind of "metrics bridge" where the notifier will trigger and return lag metrics regularly regardless of the state of the consumer group so lag metrics can be graphed at all times.

It's possible that I've missed the point of the project, but it does seem like it would be helpful to have continuous lag metrics in addition to alerts when consumer groups etc fall behind or go offline.

@splee
Author

splee commented Jun 16, 2015

After playing with the HTTP API I whipped up a little process which pulls the offsets for the topics and consumer groups from Burrow and pushes the calculated lag to Influx.

I still think that having this as part of the project might be beneficial, so if I get time I'll look into integrating the functionality if it makes sense to do so.

@toddpalino
Contributor

Thanks for the feedback, @splee. I'll see if we can get something else for discussion set up!

As far as the feature you're looking for goes, we do something like this internally, but as you've worked out, we use a separate app to pull the lag and push it into our metrics system. The original intent behind Burrow was to provide consumer lag checking, not to duplicate this. That said, I've also put generalized API features into it to start with. I don't think that what you're suggesting is a bad idea at all, especially as a configurable piece like the existing notifiers.

@pfischermx

@splee ... and I was just about to ask this. By any chance, can you share your code where you calculate the lag?

Thx!

@splee
Author

splee commented Jul 29, 2015

@pfischermx I would have to extract it from within the project where it currently resides, but I will try and do that soon.

@splee
Author

splee commented Jul 29, 2015

@pfischermx http://github.com/splee/burrower

Hope this helps, let me know if you have any questions about the code. I'm guessing there will be things about it that aren't useful for your use case, but happy to accept pull requests :)

@toddpalino
Contributor

As a note, I've added a new HTTP endpoint, /lag, to complement /status. It does the same thing as the status check, except it returns every partition rather than just the bad ones. This should make it easier to export the lag to another application (like Influx).

I'm also looking at adding some simple rendered HTML pages to Burrow that will allow you to display consumer group information, including a graph of lag over the period that Burrow has information for.
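
For anyone wanting to wire this into an external metrics system, here is a rough sketch of what a small polling bridge could look like. The URL path, port, cluster, and group names are placeholders, and the response fields shown are only the ones discussed in this thread; treat the exact schema as an assumption and verify it against your Burrow version.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Field names follow the response excerpts shown later in this thread;
// the exact schema is an assumption, not a documented contract.
type lagResponse struct {
	Status struct {
		Partitions []struct {
			Topic     string `json:"topic"`
			Partition int32  `json:"partition"`
			End       struct {
				Offset int64 `json:"offset"`
				Lag    int64 `json:"lag"`
			} `json:"end"`
		} `json:"partitions"`
	} `json:"status"`
}

func main() {
	// Placeholder URL: substitute your own Burrow host, cluster, and group.
	url := "http://localhost:8000/v2/kafka/local/consumer/my-group/lag"
	for {
		resp, err := http.Get(url)
		if err == nil {
			var body lagResponse
			if err := json.NewDecoder(resp.Body).Decode(&body); err == nil {
				for _, p := range body.Status.Partitions {
					// Forward the lag to InfluxDB, statsd, etc. here.
					fmt.Printf("%s[%d] lag=%d\n", p.Topic, p.Partition, p.End.Lag)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(60 * time.Second)
	}
}
```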

@splee
Author

splee commented Oct 13, 2015

@toddpalino nice! Is this on a specific branch, or has it dropped into master already?

@toddpalino
Contributor

That is on master, and it is in the new 0.1.0 release.

@zzmoss

zzmoss commented Jan 19, 2016

Hi! I work on Analytics @wikimedia and we've been using Burrow to monitor consumer lag. We are interested in graphing lag over time too, and we use statsd and graphite for aggregation. I was wondering if periodically sending metrics to statsd is something that would belong in Burrow, since it seems to be a really common tool for this. I imagine it would work like the HttpNotifier, except it would report lag irrespective of whether consumers are in a bad state. If it's something that can be added to Burrow, I'd be happy to work on a pull request for a StatsdNotifier.

@toddpalino
Contributor

I think this would be interesting to see, but there are some things to be aware of (since we use something separate for lag monitoring, we've got some experience here)...

  1. The amount of traffic can be fairly overwhelming for larger clusters. In this case polling may be better because you can control which groups you are getting lag for. Another option to consider is emitting an aggregate lag number for the consumer group as a whole.

  2. Metrics typically attach to hostnames. Because the hosts in a consumer group can shift around and change, and because when the group is stopped there are technically no hosts, you may not be able to use actual hostnames. One of my solutions uses a synthetic hostname to emit the lag metrics under, so we don't have this problem.

@zzmoss

zzmoss commented Jan 20, 2016

@toddpalino

  1. Is the amount of traffic emitted still overwhelming if it can be configured to emit the metrics every 60 seconds or more? Aggregating lag numbers for the whole group would be interesting, and maybe it could be a configurable setting that can be turned on?

  2. I'm not understanding the hostname thing. In the statsd/graphite world, metrics are associated with a prefix, and it could be cluster.consumer_group or something like that. Could you explain how hostnames play into this? It looks like I'm missing something.

@toddpalino
Contributor

On the first one - yes, it can be significant at that interval. Consider a cluster that has 1000 consumers and more topics than that, with some of the consumers consuming multiple, or even all, topics. It's not necessarily a reason not to add this functionality, but it's something to keep in mind.

The second one is probably my own biases with the monitoring system I'm working with internally. As long as you tag the metrics to the group itself, and not any real hosts, there's no issue.
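
To make the hostname point concrete, here is a minimal, hypothetical sketch of emitting a group-level lag gauge over the plain statsd UDP protocol, keyed on cluster and consumer group rather than on any host. The metric naming scheme is made up for the example; Burrow does not prescribe one.

```go
package main

import (
	"fmt"
	"net"
)

// emitGroupLag sends one statsd gauge keyed on cluster and consumer group
// rather than on any host, per the suggestion above.
func emitGroupLag(conn net.Conn, cluster, group string, totalLag int64) error {
	metric := fmt.Sprintf("kafka.%s.%s.total_lag:%d|g", cluster, group, totalLag)
	_, err := conn.Write([]byte(metric))
	return err
}

func main() {
	// Placeholder statsd address.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if err := emitGroupLag(conn, "local", "my-group", 12345); err != nil {
		fmt.Println("emit failed:", err)
	}
}
```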

@adamdubiel

I've been trying to implement some kind of metric store lag reporting, but I found one big blocker, which might be a deliberate design decision, so I would like to consult about it.

When operating in a normal state, the lag (in my clusters) is virtually zero on all partitions, since broker offsets are refreshed less often than consumer offsets. This is a bit inaccurate for lag reporting (in reality each partition has some small "natural" lag), but it is not a blocker.

What blocks me is that when a consumer goes down, the lag will no longer be calculated. Lag calculation is tied to receiving updates about consumer offsets. When the consumer is down, there are no more updates, so the lag is not updated. This leads to a situation in which a partition with a "real" lag of thousands of events still reports a few dozen events in the /lag endpoint (and the ConsumerGroupStatus struct). The data is still there, since I can access the stopped consumer group's offsets, but the lag is not refreshed.

Is it okay to change the implementation to refresh the lag even when there are no consumer offset updates?

@toddpalino
Contributor

This is an interesting question. How do you propose doing this? I'm not sure there's a good way to continually do this, as you do not want to modify the consumer offset ring unless there are offsets coming in. Otherwise, you will interfere with detecting a stopped consumer.

I think it would be more reasonable in what @madhuvishy is proposing to calculate the lag emitted using the last committed consumer offset and the current broker offset, ignoring the "Lag" field in the offset ring entirely. This will give a more current view of lag without causing problems with status evaluation.

@adamdubiel

I think it is possible without touching the existing data structures. I would propose moving the lag calculation from the consumer offset update to the broker offset update. In the current implementation, lag is always calculated from the most recent consumer offset and a (usually) stale broker offset. If we changed it, lag would always be calculated from the most recent values for both the broker and the consumer. This would give better accuracy and would also report correct lag even when the consumer is down.

This might have some implications in the form of computational complexity, as we would have to iterate over the whole consumer offsets structure each time broker offsets are updated.

@josselin-c

Do you suggest moving the lag computation and ring update to the addBrokerOffset function or doing the computation/update in both addConsumerOffset and addBrokerOffset?

@adamdubiel

I think it should be moved entirely. The data updated in addConsumerOffset has little value when observing lag, because in most cases it computes 0 even though there is a small, natural lag on the partition. Having the lag computed in both places would just cause noise (values jumping between 0 and the real value). A rough illustration of the idea follows.
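
As a simplified sketch of the proposal (hypothetical types, not Burrow's actual storage, ring, or locking code), moving the calculation into the broker offset update could look something like this: when a broker head offset arrives, lag is recomputed from the last committed consumer offset for every group consuming that partition.

```go
package main

import "fmt"

type partitionKey struct {
	topic     string
	partition int32
}

type store struct {
	consumerOffsets map[string]map[partitionKey]int64 // group -> last committed offset
	lag             map[string]map[partitionKey]int64 // group -> current lag
}

// addBrokerOffset refreshes the lag for every group consuming this partition,
// so the reported lag keeps growing even when a consumer has stopped committing.
func (s *store) addBrokerOffset(k partitionKey, headOffset int64) {
	for group, offsets := range s.consumerOffsets {
		if committed, ok := offsets[k]; ok {
			if s.lag[group] == nil {
				s.lag[group] = make(map[partitionKey]int64)
			}
			s.lag[group][k] = headOffset - committed
		}
	}
}

func main() {
	key := partitionKey{topic: "logs", partition: 0}
	s := &store{
		consumerOffsets: map[string]map[partitionKey]int64{
			"my-group": {key: 100},
		},
		lag: map[string]map[partitionKey]int64{},
	}
	// The consumer has stopped committing, but the broker head keeps moving.
	s.addBrokerOffset(key, 5100)
	fmt.Println(s.lag["my-group"][key]) // 5000
}
```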

@toddpalino
Contributor

I tend to agree with @adamdubiel here, however I think there's more work required to do that. Updating lag at that time requires interaction of two locks, the one for the broker offset storage and the one for the consumer offset storage. The addBrokerOffset routine is called individually for each partition offset that is retrieved. This is already an inefficient use of the broker offset lock, because it has to be locked and unlocked for each partition. I'm very nervous about locking the consumer offset lock in the same way.

I think the first thing to do is refactor broker offset fetching in kafka_client.go to deliver the offsets to storage in a batch. This will make the locking much more efficient, and I think there's less of a concern about locking the consumer lock for long enough to iterate over all the consumers and update lag.

If it does become a performance issue with large topic or consumer counts, what we can do is add a linkage for the topic in broker offsets to the consumers that consume that topic (that's a little vague - I know what it looks like in my head if it's needed). That way when you update an offset, you'll know exactly what consumers to go update lag for without iterating the entire tree. It might be premature to do that optimization now, though.
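
For what it's worth, the batching idea might look roughly like the following. The names are hypothetical and this is not the actual kafka_client.go or storage code: the point is simply that the fetcher hands storage a slice of offsets and the lock is taken once per batch rather than once per partition.

```go
package main

import (
	"fmt"
	"sync"
)

type brokerOffset struct {
	topic     string
	partition int32
	offset    int64
}

type offsetStore struct {
	mu      sync.Mutex
	offsets map[string]map[int32]int64 // topic -> partition -> head offset
}

// addBrokerOffsetBatch takes the lock once for a whole batch of partition
// offsets instead of locking and unlocking per partition.
func (s *offsetStore) addBrokerOffsetBatch(batch []brokerOffset) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, o := range batch {
		if s.offsets[o.topic] == nil {
			s.offsets[o.topic] = make(map[int32]int64)
		}
		s.offsets[o.topic][o.partition] = o.offset
		// While the lock is held, this is also a natural point to walk the
		// consumers of o.topic and refresh their lag, as discussed above.
	}
}

func main() {
	s := &offsetStore{offsets: map[string]map[int32]int64{}}
	s.addBrokerOffsetBatch([]brokerOffset{
		{topic: "logs", partition: 0, offset: 5100},
		{topic: "logs", partition: 1, offset: 4000},
	})
	fmt.Println(s.offsets["logs"])
}
```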

@adamdubiel

Thanks for the insights from the technical point of view. I have Burrow forked, and as soon as I find some more time I will try to propose a solution that incorporates your remarks.

@travisjeffery

I started working on a small/simple solution for sending stats to statsd here: https://github.com/travisjeffery/kafka-statsd. Working OK so far with the small number of topics/partitions we have, might need to drop in some concurrency/perf improvements talked about in the issue for larger setups.

@toddpalino
Contributor

I actually had something along these lines come up as a new requirement for something I wanted to do internally, so I've been working on it and I've committed most of the changes to trunk. It's not documented on the wiki yet. The idea is that I wanted to be able to continually emit a count of the number of good and bad (with the various levels of bad) partitions for each consumer group.

Essentially, I've added a configurable threshold for the severity level to send out notification POSTs for, and you can set it to "OK", which means send every time regardless of status. It currently only provides the maxlag partition (not a full list of all good partitions), but I did add a total partition count to what can be used in the template, and that way you can figure out how many partitions are good. I also have some uncommitted changes to do a total lag calculation for the consumer group. This all works pretty well even for a large number of partitions and consumers on the clusters.

What I haven't worked on yet is when the lag calculation is done (at broker update vs. offset commit). It's also probably worthwhile to consider a way to provide all partitions, even the good ones so you could do full lag graphing in an external system.

@cberry777

Todd,
This sounds like a wonderful addition.
Then external metrics agents, such as DataDog or collectd, can simply poll /lag on some known interval.
Letting them all do with the data as they wish...

@splee
Author

splee commented Apr 26, 2016

@toddpalino Thanks for adding lag tracking to the consumer status endpoint!

I have a question about the fields in the response, as it's not clear to me which I should be reporting on.

Each item within the "partitions" key contains two child objects with the same fields: "start" and "end".

What is the difference between these two values? It looks like the "maxlag" key is calculating the max lag using the value from "start.lag" in those objects, not "end.lag". I don't understand why there are two values, and so far the reason for the two keys is not clear to me from the source.

@toddpalino
Contributor

So the start and end keys are supposed to give you the first offset commit (and the lag number) that Burrow has for the partition, and the last one. The maxlag value in the status object is set to the partition, not a specific offset within that partition. Given that, it's up to you (in the template) to pull out the value you want to use.

@splee
Author

splee commented Apr 29, 2016

@toddpalino Ah, that makes sense now when I take Burrow's windowing into account. Thanks for the clarification.

@splee
Author

splee commented Apr 30, 2016

I updated Burrower to work with InfluxDB 0.9 and the latest version of Burrow. Hopefully someone will find that project useful again :)

@toddpalino
Contributor

This looks like it's all taken care of with the /lag endpoint, as well as the ability to emit to an HTTP server even when the consumer is in an OK state.

@sballal1

"maxlag" block always seems to have highest numbered partition, irrespective of whether it is a maximum lag or not.
Is this supposed to be like this? I was expecting it to be pointing to partition with maximum "end lag".

Example:

{"topic":"logs","partition":33,"status":"OK","start":{"offset":25686208,"timestamp":1484730645749,"lag":0},"end":{"offset":25695127,"timestamp":1484736045717,"lag":0}},
{"topic":"logs","partition":34,"status":"OK","start":{"offset":25997264,"timestamp":1484730645767,"lag":75177},"end":{"offset":26024174,"timestamp":1484736045718,"lag":57184}},
{"topic":"logs","partition":35,"status":"OK","start":{"offset":25997294,"timestamp":1484730645767,"lag":75105},"end":{"offset":26024148,"timestamp":1484736045719,"lag":57169}},

{"topic":"logs","partition":47,"status":"OK","start":{"offset":25998819,"timestamp":1484730645704,"lag":73583},"end":{"offset":26025871,"timestamp":1484736045718,"lag":55450}}],
"maxlag":{"topic":"logs","partition":47,"status":"OK","start":{"offset":25998819,"timestamp":1484730645704,"lag":73583},"end":{"offset":26025871,"timestamp":1484736045718,"lag":55450}}}}
