-
Notifications
You must be signed in to change notification settings - Fork 793
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graphing of lag over time #4
Comments
After playing with the HTTP API I whipped up a little process which pulls the offsets for the topics and consumer groups from Burrow and pushes the calculated lag to Influx. I still think that having this as part of the project might be beneficial, so if I get time I'll look into integrating the functionality if it makes sense to do so. |
Thanks for the feedback, @splee. I'll see if we can get something else for discussion set up! As far as the feature you're looking for goes, we do something like this internally, but as you put together, we use a separate app to pull the lag and push it into our metrics system. The original intent behind Burrow was to provide consumer lag checking, and not duplicate this. That said, I've also put generalized API features into it to start with. I don't think that what you're suggesting is a bad idea at all, especially as a configurable piece like the existing notifiers. |
@splee ... and I was just about to ask this. By any chance, can you share your code where you calculate the lag? Thx! |
@pfischermx I would have to extract it from within the project where it currently resides, but I will try and do that soon. |
@pfischermx http://github.com/splee/burrower Hope this helps, let me know if you have any questions about the code. I'm guessing there will be things about it that aren't useful for your use case, but happy to accept pull requests :) |
As a note, I've added a new HTTP endpoint to complement I'm also looking at adding some simple rendered HTML pages to Burrow that will allow you to display consumer group information, including a graph of lag over the period that Burrow has information for. |
@toddpalino nice! Is this on a specific branch, or has it dropped into master already? |
That is on master, and it is in the new 0.1.0 release. |
Hi! I work on Analytics @wikimedia and we've been using burrow to monitor consumer lag. We are interested in graphing lag over time too, and we use statsd and graphite for aggregation. I was wondering if periodically sending metrics to statsd is something that would belong as part of burrow - since it seems to be a really common tool for this. I imagine it would work like the HttpNotifier, except report lag irrespective of consumers being in a bad state. If it's something that can be added to burrow, I'd be happy to work on a pull request for a StatsdNotifier. |
I think this would be interesting to see, but there are some things to be aware of (since we use something separate for lag monitoring, we've got some experience here)...
|
|
On the first one - yes, it can be significant at that interval. Consider a cluster that has 1000 consumers and more topics than that, with some of the consumers consuming multiple, or even all, topics. It's not necessarily a reason not to add this functionality, but it's something to keep in mind. The second one is probably my own biases with the monitoring system I'm working with internally. As long as you tag the metrics to the group itself, and not any real hosts, there's no issue. |
I been trying to implement some kind of metric store lag reporting, but i found one big blocker, which might be a design thing, so i would like to consul it. When operating in normal state, the lag (in my clusters) is virtually zero on all partitions, since broker offsets are refreshed less often than partition offsets. This is a bit inaccurate for lag reporting (in reality each partition has some small "natural" lag), but not a blocker. What blocks me is when consumer goes down, the lag will no longer be calculated. Lag calculation is tied to receiving update about consumer offsets. When consumer is down, there are no more updates, thus lag is not updated. This leads to situation in which partition with "real" lag of thousands of events still reports few dozens of events in Is it okay to change the implementation to refresh the lag even when there are no consumer offset updates? |
This is an interesting question. How do you propose doing this? I'm not sure there's a good way to continually do this, as you do not want to modify the consumer offset ring unless there are offsets coming in. Otherwise, you will interfere with detecting a stopped consumer. I think it would be more reasonable in what @madhuvishy is proposing to calculate the lag emitted using the last committed consumer offset and the current broker offset, ignoring the "Lag" field in the offset ring entirely. This will give a more current view of lag without causing problems with status evaluation. |
I think it is possible without touching existing data structures. I would propose to move lag calculation from consumer offset update to broker offset update. In current implementation lag is always calculated based on the most recent consumer offset and (usually) stale broker offset. If we changed it, lag would be always calculated on most recent values for both broker and consumer. This would give better accuracy and would also report correct lag even when consumer is down. This might have some implications in form of calculation complexity, as we would have to iterate over whole consumer offsets structure each time broker offsets are updated. |
Do you suggest moving the lag computation and ring update to the addBrokerOffset function or doing the computation/update in both addConsumerOffset and addBrokerOffset? |
I think it should be moved entirely. Data updated in |
I tend to agree with @adamdubiel here, however I think there's more work required to do that. Updating lag at that time requires interaction of two locks, the one for the broker offset storage and the one for the consumer offset storage. The I think the first thing to do is refactor broker offset fetching in If it does become a performance issue with large topic or consumer counts, what we can do is add a linkage for the topic in broker offsets to the consumers that consume that topic (that's a little vague - I know what it looks like in my head if it's needed). That way when you update an offset, you'll know exactly what consumers to go update lag for without iterating the entire tree. It might be premature to do that optimization now, though. |
Thanks for the insights from the technical point of view. I have burrow forked and as soon as i find some more time i will try to propose a solution including your remarks. |
I started working on a small/simple solution for sending stats to statsd here: https://github.com/travisjeffery/kafka-statsd. Working OK so far with the small number of topics/partitions we have, might need to drop in some concurrency/perf improvements talked about in the issue for larger setups. |
I actually had something along these lines come up as a new requirement for something I wanted to do internally, so I've been working on it and I've committed most of the changes to trunk. It's not documented on the wiki yet. The idea is that I wanted to be able to continually emit a count of the number of good and bad (with the various levels of bad) partitions for each consumer group. Essentially, I've added a configurable threshold for the severity level to send out notification POSTs for, and you can set it to "OK", which means send every time regardless of status. It currently only provides the maxlag partition (not a full list of all good partitions), but I did add a total partition count to what can be used in the template, and that way you can figure out how many partitions are good. I also have some uncommitted changes to do a total lag calculation for the consumer group. This all works pretty well even for a large number of partitions and consumers on the clusters. What I haven't worked on yet is when the lag calculation is done (at broker update vs. offset commit). It's also probably worthwhile to consider a way to provide all partitions, even the good ones so you could do full lag graphing in an external system. |
Todd, |
@toddpalino Thanks for implementing lag tracking to the consumer status endpoint! I have a question about the fields in the response, as it's not clear to me which I should be reporting on. Each item within the "partitions" key contains two child objects with the same fields: "start" and "end". What is the difference between these two values? It looks like the "maxlag" key is calculating the max lag using the value from "start.lag" in those objects, not "end.lag". I don't understand why there are two values, and so far the reason for the two keys is not clear to me from the source. |
So the start and end keys are supposed to give you the first offset commit (and the lag number) that Burrow has for the partition, and the last one. The maxlag value in the status object is set to the partition, not a specific offset within that partition. Given that, it's up to you (in the template) to pull out the value you want to use. |
@toddpalino Ah, that makes sense now when I take Burrow's windowing into account. Thanks for the clarification. |
I updated Burrower to work with InfluxDB 0.9 and the latest version of Burrow. Hopefully someone will find that project useful again :) |
This looks like it's all taken care of with the |
Updates from linkedin master and fixes for when worker not running
"maxlag" block always seems to have highest numbered partition, irrespective of whether it is a maximum lag or not. Example : |
Hi,
Sorry for posting this as an issue - if there is a better way to get in contact with the team I can't find one on the wiki so far :)
As an engineer I love to see graphs over time for our systems, and we've been using raw offset requests for this in our apps. I was hoping that Burrow would be able to allow ingest of consumer metrics (specifically lag, so we know whether we need to scale up our consumers) into our InfluxDB instance for this purpose but it doesn't seem like it's possible.
At the moment from what I can tell you can only have the HTTP Notifier or the HTTP Endpoints return lag when the consumers are not in an "OK" state.
Is there a reason for that, before I start working on a pull request? It'd be great to be able to have some kind of "metrics bridge" where the notifier will trigger and return lag metrics regularly regardless of the state of the consumer group so lag metrics can be graphed at all times.
It's possible that I've missed the point of the project, but it does seem like it would be helpful to have continuous lag metrics in addition to alerts when consumer groups etc fall behind or go offline.
The text was updated successfully, but these errors were encountered: