
Incorrect kafka_consumergroup_group_offset Metric #30

Closed
abhishekjiitr opened this issue Jun 14, 2019 · 9 comments · Fixed by #33
Labels: bug (Something isn't working)

abhishekjiitr (Contributor) commented Jun 14, 2019

The kafka_consumergroup_group_offset metrics for all of our consumer groups are zero, which makes the lag metrics absurdly large (i.e. equal to the latest offsets). This is incorrect: kafka-consumer-groups.sh reports the same offsets correctly.

Helm chart used: kafka-lag-exporter-0.4.1
Kafka version used: 1.1.0
Kafka is deployed on Kubernetes and is not a Strimzi Kafka.

The debug logs also look fine; there are no errors:

2019-06-14 07:45:46,218 WARN  o.a.k.c.admin.AdminClientConfig  - The configuration 'sasl.jaas.config' was supplied but isn't a known config. 

2019-06-14 07:45:46,480 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka version: 2.2.1 
2019-06-14 07:45:46,480 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka commitId: 55783d3133a5a49a 
2019-06-14 07:45:46,529 INFO  org.apache.kafka.clients.Metadata  - Cluster ID: 3r4hHZncSxSz0vhHRSRuMA 
2019-06-14 07:45:46,567 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Updating lookup tables 
2019-06-14 07:45:46,602 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Reporting offsets 
2019-06-14 07:45:46,657 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Clearing evicted metrics 
2019-06-14 07:45:46,658 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Polling in 5 seconds 
2019-06-14 07:45:51,678 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Collecting offsets 
2019-06-14 07:45:51,776 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Updating lookup tables 
2019-06-14 07:45:51,779 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Reporting offsets 
2019-06-14 07:45:51,823 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Clearing evicted metrics 
2019-06-14 07:45:51,823 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Polling in 5 seconds 
2019-06-14 07:45:56,838 DEBUG c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-centrallogging - Collecting offsets 
seglo (Owner) commented Jun 15, 2019

Thanks for trying out the project @abhishekjiitr. When Kafka Lag Exporter reports group offsets, it defaults to 0 for any group partition for which no offset is provided. I've seen this occur when a consumer group has been created but has had nothing committed to it: for example, when offsets are managed by the user instead of being committed to Kafka, or when groups are used only for consumer group member balancing and not for offset management.
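To illustrate, here is a minimal sketch (assumed behaviour, not the exporter's actual code) of reading committed offsets with the Kafka AdminClient; the broker address and group name are placeholders taken from this thread. A partition the group has never committed for simply has no entry (or a null one), and defaulting it to 0 makes the computed lag equal the log-end offset:

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._ // on Scala 2.12, use scala.collection.JavaConverters

import org.apache.kafka.clients.admin.AdminClient
import org.apache.kafka.common.TopicPartition

object CommittedOffsets extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "kafka:9092") // placeholder broker

  val admin = AdminClient.create(props)

  // Committed offsets for a placeholder group. Partitions without a
  // committed offset are absent from the map or mapped to null.
  val committed: Map[TopicPartition, Long] =
    admin
      .listConsumerGroupOffsets("sysloggroup")
      .partitionsToOffsetAndMetadata()
      .get() // block on the KafkaFuture, for brevity
      .asScala
      .collect { case (tp, meta) if meta != null => tp -> meta.offset() }
      .toMap

  // Defaulting a missing offset to 0 is what makes the exported lag
  // equal to the partition's log-end offset.
  def groupOffset(tp: TopicPartition): Long = committed.getOrElse(tp, 0L)

  admin.close()
}
```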

Can you provide more information? Could you attach the Kafka Lag Exporter metrics for the consumer group that's being reported with group offsets of 0, along with your call to, and the output of, kafka-consumer-groups.sh?

smlgbl commented Jun 20, 2019

I have the same issue, even though our topics are in heavy use. Other lag checkers (Burrow, Remora, the command line, and Yahoo's kafka-manager) report the correct numbers.

kafka_consumergroup_group_offset{cluster_name="logging",group="inventorycrawler",topic="inventorycrawler",partition="9",member_host="/10.139.129.194",consumer_id="logstash-10-513ec4b2-0060-4159-bfea-0f6fe5fe307c",client_id="logstash-10",} 0.0
vs.
kafka_consumergroup_group_lag{cluster_name="logging",group="inventorycrawler",topic="inventorycrawler",partition="9",member_host="/10.139.129.194",consumer_id="logstash-10-513ec4b2-0060-4159-bfea-0f6fe5fe307c",client_id="logstash-10",} 29460.0

What is reported here as the lag is actually the consumer's current offset, which is also the latest offset produced. It's not possible that this partition isn't being consumed, and the group offset does get updated ...

Output of kafka-consumer-groups.sh:
TOPIC            PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID                                       HOST             CLIENT-ID
inventorycrawler 9          29460           29460           0    logstash-10-513ec4b2-0060-4159-bfea-0f6fe5fe307c  /10.139.129.194  logstash-10
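To make the failure mode concrete, the arithmetic implied by the numbers above can be sketched as follows (a worked example, not the exporter's code):

```scala
// lag = log-end offset - committed group offset
val logEndOffset     = 29460L // LOG-END-OFFSET from kafka-consumer-groups.sh
val currentOffset    = 29460L // CURRENT-OFFSET the CLI reports
val offsetAsExported = 0L     // kafka_consumergroup_group_offset as exported

val realLag     = logEndOffset - currentOffset    // 0, matching the CLI
val exportedLag = logEndOffset - offsetAsExported // 29460, matching the metric
```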

seglo (Owner) commented Jun 20, 2019

@smlgbl I misunderstood the original problem. That's indeed a regression bug. Looking into it now.

seglo added the bug label Jun 20, 2019
smlgbl commented Jun 20, 2019

@seglo Thank you very much. I'd really like this to work; Burrow and Remora have proven unreliable...

abhishekjiitr (Contributor, Author) commented Jun 20, 2019

Some sample metrics from the lag exporter (with the topic and cluster names obfuscated):


kafka_consumergroup_group_offset{cluster_name="CLUSTER",group="sysloggroup",topic="MY_TOPIC",partition="0",member_host="/172.16.9.1",consumer_id="sysloggroup-0-cea4fb38-30ce-4131-a46f-31b14c440769",client_id="sysloggroup-0",} 0.0
kafka_partition_latest_offset{cluster_name="CLUSTER",topic="MY_TOPIC",partition="0",} 3.22118322E8
kafka_consumergroup_group_lag{cluster_name="CLUSTER",group="sysloggroup",topic="MY_TOPIC",partition="0",member_host="/10.10.16.31",consumer_id="sysloggroup-0-219f7f96-0021-4567-80c1-06cae1346316",client_id="sysloggroup-0",} 1.752221812E9

Output of ./kafka-consumer-groups.sh --bootstrap-server kafka:9092 --describe --group CONSUMER-group:

TOPIC                          PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG        CONSUMER-ID                                       HOST                           CLIENT-ID

TOPIC-PREFIX-syslog            0          1752394699      1752394864      165        CONSUMER-group-0-219f7f96-0021-4567-80c1-06cae1346316  /10.10.16.31   CONSUMER-group-0
TOPIC-PREFIX-syslog            2          1752536132      1752536220      88         CONSUMER-group-0-2b56babd-cee4-40dc-8106-cf41ab126f57  /172.16.5.1    CONSUMER-group-0
TOPIC-PREFIX-syslog            1          1752502015      1752502179      164        CONSUMER-group-0-2649dda7-2932-4f81-a006-fa99715dc120  /172.16.7.1    CONSUMER-group-0
TOPIC-PREFIX-syslog            3          1752506376      1752506634      258        CONSUMER-group-1-54891801-d12c-4d6a-9039-603edc46682e  /172.16.5.1    CONSUMER-group-1

seglo (Owner) commented Jun 20, 2019

@abhishekjiitr @smlgbl Try 0.4.3 and let me know if it works for you.

seglo modified the milestones: 0.4.2, 0.4.4 on Jun 20, 2019
smlgbl commented Jun 21, 2019

It's working better, but it still doesn't report the same lag as the other tools ... the green line is the "correct" lag (all the other tools agree it's about 5M :)
[attached screenshot: chart comparing the reported lag across tools]

abhishekjiitr (Contributor, Author) commented Jun 21, 2019

Yes, it is much better than the last version; the metrics look correct now.
Thanks @seglo! 🥂

seglo (Owner) commented Jun 21, 2019

@smlgbl Thanks for checking. Are the peaks in the chart you're showing from kafka-lag-exporter? Can you confirm it's the kafka_consumergroup_group_lag metric, and that it's not being aggregated in any way (e.g. the max lag across a topic's partitions)?

EDIT: I'm partially colourblind so I'm not sure which line is which ;)
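For anyone reproducing this comparison, the distinction being asked about can be written as PromQL queries (illustrative only; the metric name is the exporter's, the label values come from this thread):

```promql
# Raw per-partition lag, as the exporter emits it:
kafka_consumergroup_group_lag{cluster_name="logging", group="inventorycrawler"}

# Aggregations a dashboard might apply, which change the chart's shape:
max by (group, topic) (kafka_consumergroup_group_lag)
sum by (group, topic) (kafka_consumergroup_group_lag)
```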
