Graphing of lag over time #4

Closed · splee opened this issue Jun 15, 2015 · 28 comments

@splee

splee commented Jun 15, 2015

Hi,

Sorry for posting this as an issue - if there is a better way to get in contact with the team, I haven't been able to find it on the wiki so far :)

As an engineer I love to see graphs over time for our systems, and we've been using raw offset requests for this in our apps. I was hoping that Burrow would let us ingest consumer metrics (specifically lag, so we know whether we need to scale up our consumers) into our InfluxDB instance for this purpose, but it doesn't seem to be possible.

At the moment, from what I can tell, you can only have the HTTP Notifier or the HTTP endpoints return lag when the consumers are not in an "OK" state.

Is there a reason for that, before I start working on a pull request? It'd be great to be able to have some kind of "metrics bridge" where the notifier will trigger and return lag metrics regularly regardless of the state of the consumer group so lag metrics can be graphed at all times.

It's possible that I've missed the point of the project, but it does seem like it would be helpful to have continuous lag metrics in addition to alerts when consumer groups etc fall behind or go offline.

@splee
Author

splee commented Jun 16, 2015

After playing with the HTTP API I whipped up a little process which pulls the offsets for the topics and consumer groups from Burrow and pushes the calculated lag to Influx.

I still think that having this as part of the project might be beneficial, so if I get time I'll look into integrating the functionality if it makes sense to do so.

@toddpalino
Contributor

Thanks for the feedback, @splee. I'll see if we can get something else for discussion set up!

As far as the feature you're looking for goes, we do something like this internally, but as you've worked out, we use a separate app to pull the lag and push it into our metrics system. The original intent behind Burrow was to provide consumer lag checking, not to duplicate this. That said, I've also put generalized API features into it to start with. I don't think that what you're suggesting is a bad idea at all, especially as a configurable piece like the existing notifiers.

@pfischermx

@splee ... and I was just about to ask this. By any chance, can you share your code where you calculate the lag?

Thx!

@splee
Author

splee commented Jul 29, 2015

@pfischermx I would have to extract it from within the project where it currently resides, but I will try and do that soon.

@splee
Author

splee commented Jul 29, 2015

@pfischermx http://github.com/splee/burrower

Hope this helps, let me know if you have any questions about the code. I'm guessing there will be things about it that aren't useful for your use case, but happy to accept pull requests :)

@toddpalino
Contributor

As a note, I've added a new HTTP endpoint, /lag, to complement /status. It does the same thing as the status check, except it returns every partition rather than just the bad ones. This should make it easier to export the lag to another application (like Influx).

I'm also looking at adding some simple rendered HTML pages to Burrow that will allow you to display consumer group information, including a graph of lag over the period that Burrow has information for.
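
For anyone wanting to wire this into an external metrics system, here is a rough sketch of what a small polling bridge could look like. The URL path, port, cluster, and group names are placeholders, and the response fields shown are only the ones discussed in this thread; treat the exact schema as an assumption and verify it against your Burrow version.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Field names follow the response excerpts shown later in this thread;
// the exact schema is an assumption, not a documented contract.
type lagResponse struct {
	Status struct {
		Partitions []struct {
			Topic     string `json:"topic"`
			Partition int32  `json:"partition"`
			End       struct {
				Offset int64 `json:"offset"`
				Lag    int64 `json:"lag"`
			} `json:"end"`
		} `json:"partitions"`
	} `json:"status"`
}

func main() {
	// Placeholder URL: substitute your own Burrow host, cluster, and group.
	url := "http://localhost:8000/v2/kafka/local/consumer/my-group/lag"
	for {
		resp, err := http.Get(url)
		if err == nil {
			var body lagResponse
			if err := json.NewDecoder(resp.Body).Decode(&body); err == nil {
				for _, p := range body.Status.Partitions {
					// Forward the lag to InfluxDB, statsd, etc. here.
					fmt.Printf("%s[%d] lag=%d\n", p.Topic, p.Partition, p.End.Lag)
				}
			}
			resp.Body.Close()
		}
		time.Sleep(60 * time.Second)
	}
}
```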

@splee
Author

splee commented Oct 13, 2015

@toddpalino nice! Is this on a specific branch, or has it dropped into master already?

@toddpalino
Contributor

That is on master, and it is in the new 0.1.0 release.

@zzmoss

zzmoss commented Jan 19, 2016

Hi! I work on Analytics @wikimedia and we've been using Burrow to monitor consumer lag. We are interested in graphing lag over time too, and we use statsd and graphite for aggregation. I was wondering if periodically sending metrics to statsd is something that would belong in Burrow, since it seems to be a really common tool for this. I imagine it would work like the HttpNotifier, except it would report lag irrespective of whether consumers are in a bad state. If it's something that can be added to Burrow, I'd be happy to work on a pull request for a StatsdNotifier.

@toddpalino
Contributor

I think this would be interesting to see, but there are some things to be aware of (since we use something separate for lag monitoring, we've got some experience here)...

  1. The amount of traffic can be fairly overwhelming for larger clusters. In this case polling may be better because you can control which groups you are getting lag for. Another option to consider is emitting an aggregate lag number for the consumer group as a whole.

  2. Metrics typically attach to hostnames. Because the hosts in a consumer group can shift around and change, and because when the group is stopped there are technically no hosts, you may not be able to use actual hostnames. One of my solutions uses a synthetic hostname to emit the lag metrics under, so we don't have this problem.

@zzmoss

zzmoss commented Jan 20, 2016

@toddpalino

  1. Is the amount of traffic emitted still overwhelming if it can be configured to emit the metrics every 60 seconds or more? Aggregating lag numbers for the whole group would be interesting, and maybe it could be a configurable setting that can be turned on?

  2. I'm not understanding the hostname thing. In the statsd/graphite world, metrics are associated with a prefix, and it could be cluster.consumer_group or something like that. Could you explain how hostnames play into this? It looks like I'm missing something.

@toddpalino
Contributor

On the first one - yes, it can be significant at that interval. Consider a cluster that has 1000 consumers and more topics than that, with some of the consumers consuming multiple, or even all, topics. It's not necessarily a reason not to add this functionality, but it's something to keep in mind.

The second one is probably my own biases with the monitoring system I'm working with internally. As long as you tag the metrics to the group itself, and not any real hosts, there's no issue.
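
To make the hostname point concrete, here is a minimal, hypothetical sketch of emitting a group-level lag gauge over the plain statsd UDP protocol, keyed on cluster and consumer group rather than on any host. The metric naming scheme is made up for the example; Burrow does not prescribe one.

```go
package main

import (
	"fmt"
	"net"
)

// emitGroupLag sends one statsd gauge keyed on cluster and consumer group
// rather than on any host, per the suggestion above.
func emitGroupLag(conn net.Conn, cluster, group string, totalLag int64) error {
	metric := fmt.Sprintf("kafka.%s.%s.total_lag:%d|g", cluster, group, totalLag)
	_, err := conn.Write([]byte(metric))
	return err
}

func main() {
	// Placeholder statsd address.
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if err := emitGroupLag(conn, "local", "my-group", 12345); err != nil {
		fmt.Println("emit failed:", err)
	}
}
```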

@adamdubiel

I've been trying to implement some kind of metric store lag reporting, but I found one big blocker, which might be a deliberate design decision, so I would like to consult about it.

When operating in a normal state, the lag (in my clusters) is virtually zero on all partitions, since broker offsets are refreshed less often than consumer offsets. This is a bit inaccurate for lag reporting (in reality each partition has some small "natural" lag), but it is not a blocker.

What blocks me is that when a consumer goes down, the lag will no longer be calculated. Lag calculation is tied to receiving updates about consumer offsets. When the consumer is down, there are no more updates, so the lag is not updated. This leads to a situation in which a partition with a "real" lag of thousands of events still reports a few dozen events in the /lag endpoint (and the ConsumerGroupStatus struct). The data is still there, since I can access the stopped consumer group's offsets, but the lag is not refreshed.

Is it okay to change the implementation to refresh the lag even when there are no consumer offset updates?

@toddpalino
Contributor

This is an interesting question. How do you propose doing this? I'm not sure there's a good way to continually do this, as you do not want to modify the consumer offset ring unless there are offsets coming in. Otherwise, you will interfere with detecting a stopped consumer.

I think it would be more reasonable in what @madhuvishy is proposing to calculate the lag emitted using the last committed consumer offset and the current broker offset, ignoring the "Lag" field in the offset ring entirely. This will give a more current view of lag without causing problems with status evaluation.

@adamdubiel

I think it is possible without touching the existing data structures. I would propose moving the lag calculation from the consumer offset update to the broker offset update. In the current implementation, lag is always calculated from the most recent consumer offset and a (usually) stale broker offset. If we changed it, lag would always be calculated from the most recent values for both the broker and the consumer. This would give better accuracy and would also report correct lag even when the consumer is down.

This might have some implications in the form of computational complexity, as we would have to iterate over the whole consumer offsets structure each time broker offsets are updated.

@josselin-c

Do you suggest moving the lag computation and ring update to the addBrokerOffset function or doing the computation/update in both addConsumerOffset and addBrokerOffset?

@adamdubiel

I think it should be moved entirely. The data updated in addConsumerOffset has little value when observing lag, because in most cases it computes 0 even though there is a small, natural lag on the partition. Having the lag computed in both places would just cause noise (values jumping between 0 and the real value). A rough illustration of the idea follows.
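
As a simplified sketch of the proposal (hypothetical types, not Burrow's actual storage, ring, or locking code), moving the calculation into the broker offset update could look something like this: when a broker head offset arrives, lag is recomputed from the last committed consumer offset for every group consuming that partition.

```go
package main

import "fmt"

type partitionKey struct {
	topic     string
	partition int32
}

type store struct {
	consumerOffsets map[string]map[partitionKey]int64 // group -> last committed offset
	lag             map[string]map[partitionKey]int64 // group -> current lag
}

// addBrokerOffset refreshes the lag for every group consuming this partition,
// so the reported lag keeps growing even when a consumer has stopped committing.
func (s *store) addBrokerOffset(k partitionKey, headOffset int64) {
	for group, offsets := range s.consumerOffsets {
		if committed, ok := offsets[k]; ok {
			if s.lag[group] == nil {
				s.lag[group] = make(map[partitionKey]int64)
			}
			s.lag[group][k] = headOffset - committed
		}
	}
}

func main() {
	key := partitionKey{topic: "logs", partition: 0}
	s := &store{
		consumerOffsets: map[string]map[partitionKey]int64{
			"my-group": {key: 100},
		},
		lag: map[string]map[partitionKey]int64{},
	}
	// The consumer has stopped committing, but the broker head keeps moving.
	s.addBrokerOffset(key, 5100)
	fmt.Println(s.lag["my-group"][key]) // 5000
}
```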

@toddpalino
Contributor

I tend to agree with @adamdubiel here, however I think there's more work required to do that. Updating lag at that time requires interaction of two locks, the one for the broker offset storage and the one for the consumer offset storage. The addBrokerOffset routine is called individually for each partition offset that is retrieved. This is already an inefficient use of the broker offset lock, because it has to be locked and unlocked for each partition. I'm very nervous about locking the consumer offset lock in the same way.

I think the first thing to do is refactor broker offset fetching in kafka_client.go to deliver the offsets to storage in a batch. This will make the locking much more efficient, and I think there's less of a concern about locking the consumer lock for long enough to iterate over all the consumers and update lag.

If it does become a performance issue with large topic or consumer counts, what we can do is add a linkage for the topic in broker offsets to the consumers that consume that topic (that's a little vague - I know what it looks like in my head if it's needed). That way when you update an offset, you'll know exactly what consumers to go update lag for without iterating the entire tree. It might be premature to do that optimization now, though.
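
For what it's worth, the batching idea might look roughly like the following. The names are hypothetical and this is not the actual kafka_client.go or storage code: the point is simply that the fetcher hands storage a slice of offsets and the lock is taken once per batch rather than once per partition.

```go
package main

import (
	"fmt"
	"sync"
)

type brokerOffset struct {
	topic     string
	partition int32
	offset    int64
}

type offsetStore struct {
	mu      sync.Mutex
	offsets map[string]map[int32]int64 // topic -> partition -> head offset
}

// addBrokerOffsetBatch takes the lock once for a whole batch of partition
// offsets instead of locking and unlocking per partition.
func (s *offsetStore) addBrokerOffsetBatch(batch []brokerOffset) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, o := range batch {
		if s.offsets[o.topic] == nil {
			s.offsets[o.topic] = make(map[int32]int64)
		}
		s.offsets[o.topic][o.partition] = o.offset
		// While the lock is held, this is also a natural point to walk the
		// consumers of o.topic and refresh their lag, as discussed above.
	}
}

func main() {
	s := &offsetStore{offsets: map[string]map[int32]int64{}}
	s.addBrokerOffsetBatch([]brokerOffset{
		{topic: "logs", partition: 0, offset: 5100},
		{topic: "logs", partition: 1, offset: 4000},
	})
	fmt.Println(s.offsets["logs"])
}
```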

@adamdubiel

Thanks for the insights from the technical point of view. I have Burrow forked, and as soon as I find some more time I will try to propose a solution that incorporates your remarks.

@travisjeffery

I started working on a small/simple solution for sending stats to statsd here: https://github.com/travisjeffery/kafka-statsd. Working OK so far with the small number of topics/partitions we have, might need to drop in some concurrency/perf improvements talked about in the issue for larger setups.

@toddpalino
Contributor

I actually had something along these lines come up as a new requirement for something I wanted to do internally, so I've been working on it and I've committed most of the changes to trunk. It's not documented on the wiki yet. The idea is that I wanted to be able to continually emit a count of the number of good and bad (with the various levels of bad) partitions for each consumer group.

Essentially, I've added a configurable threshold for the severity level to send out notification POSTs for, and you can set it to "OK", which means send every time regardless of status. It currently only provides the maxlag partition (not a full list of all good partitions), but I did add a total partition count to what can be used in the template, and that way you can figure out how many partitions are good. I also have some uncommitted changes to do a total lag calculation for the consumer group. This all works pretty well even for a large number of partitions and consumers on the clusters.

What I haven't worked on yet is when the lag calculation is done (at broker update vs. offset commit). It's also probably worthwhile to consider a way to provide all partitions, even the good ones so you could do full lag graphing in an external system.

@cberry777

Todd,
This sounds like a wonderful addition.
Then external metrics agents, such as DataDog or collectd, can simply poll /lag on some known interval.
Letting them all do with the data as they wish...

@splee
Author

splee commented Apr 26, 2016

@toddpalino Thanks for adding lag tracking to the consumer status endpoint!

I have a question about the fields in the response, as it's not clear to me which I should be reporting on.

Each item within the "partitions" key contains two child objects with the same fields: "start" and "end".

What is the difference between these two values? It looks like the "maxlag" key is calculating the max lag using the value from "start.lag" in those objects, not "end.lag". I don't understand why there are two values, and so far the reason for the two keys is not clear to me from the source.

@toddpalino
Contributor

So the start and end keys are supposed to give you the first offset commit (and the lag number) that Burrow has for the partition, and the last one. The maxlag value in the status object is set to the partition, not a specific offset within that partition. Given that, it's up to you (in the template) to pull out the value you want to use.

@splee
Author

splee commented Apr 29, 2016

@toddpalino Ah, that makes sense now when I take Burrow's windowing into account. Thanks for the clarification.

@splee
Author

splee commented Apr 30, 2016

I updated Burrower to work with InfluxDB 0.9 and the latest version of Burrow. Hopefully someone will find that project useful again :)

@toddpalino
Contributor

This looks like it's all taken care of with the /lag endpoint, as well as the ability to emit to an HTTP server even when the consumer is in an OK state.

@sballal1

"maxlag" block always seems to have highest numbered partition, irrespective of whether it is a maximum lag or not.
Is this supposed to be like this? I was expecting it to be pointing to partition with maximum "end lag".

Example:

{"topic":"logs","partition":33,"status":"OK","start":{"offset":25686208,"timestamp":1484730645749,"lag":0},"end":{"offset":25695127,"timestamp":1484736045717,"lag":0}},
{"topic":"logs","partition":34,"status":"OK","start":{"offset":25997264,"timestamp":1484730645767,"lag":75177},"end":{"offset":26024174,"timestamp":1484736045718,"lag":57184}},
{"topic":"logs","partition":35,"status":"OK","start":{"offset":25997294,"timestamp":1484730645767,"lag":75105},"end":{"offset":26024148,"timestamp":1484736045719,"lag":57169}},

{"topic":"logs","partition":47,"status":"OK","start":{"offset":25998819,"timestamp":1484730645704,"lag":73583},"end":{"offset":26025871,"timestamp":1484736045718,"lag":55450}}],
"maxlag":{"topic":"logs","partition":47,"status":"OK","start":{"offset":25998819,"timestamp":1484730645704,"lag":73583},"end":{"offset":26025871,"timestamp":1484736045718,"lag":55450}}}}
