This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

Kafka monitoring #4138

Merged
merged 16 commits into from
Apr 12, 2018

Conversation

nrmitchi
Contributor

What this PR does / why we need it:
Currently this chart does not expose JMX metrics to Prometheus. This PR is basically a helm-ification of the changes at Yolean/kubernetes-kafka@5a2b8c7, which adds an exporter.
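For readers skimming later: enabling this ends up being a values.yaml toggle. A minimal sketch, assuming the metrics.jmx layout discussed further down in this thread (exact key names may differ in the final chart):

metrics:
  jmx:
    enabled: true
    port: 5555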

@faraazkhan @h0tbird @benjigoldberg

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 13, 2018
Also changes the readiness probe to match the new one upstream
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 13, 2018
@paulczar
Collaborator

Can you bump the version number to 0.5.0? Given this is adding resources, we'd consider it a minor update rather than a patch under semver.
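For reference, the bump being asked for is just the version field in the chart's Chart.yaml, roughly:

name: kafka
version: 0.5.0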

@nrmitchi
Contributor Author

Not a problem, can totally do that.

That being said, it also seems like the standard in other charts is for metrics to be disabled by default, so I'll flip that too.

@nrmitchi
Contributor Author

After looking through some of the metrics that this is providing, I'm not 100% sure that these are necessarily the correct metrics to include in the chart (i.e., I think there are metrics that should be included but are not).

Based on that, I think we should put this on hold for right now. I'd hate to get something merged upstream only for it to not actually solve the problem it aimed to.

@bradenwright
Contributor

So I agree that metrics would be super nice, and I've been working on this as well. A few things to bring up:

(1) The whitelist/pattern is removing a lot of useful stats; I think it's too limiting.
(2) There should be a way to override the config map values or enable/disable it (related to #1).
(3) This works for a standard install of Prometheus but doesn't work with the prometheus-operator; both should probably be supported (I was actually working on it the other way, using prometheus-operator).
(4) I set it up as a sidecar; we already get stats about cpu/memory from Prometheus, so a sidecar seemed a little more appropriate to me, but I'm open to discussion.
(5) It would be nice to set up some generic alerts that every kafka cluster would want (although maybe that's a separate PR).

Also, I'd gladly contribute too, but I'm not quite ready for a PR; my code's working but still needs some love.
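On points (1) and (2): a rough sketch of what overriding the whitelist via values.yaml could look like, assuming the metrics.jmx.kafkaConfig structure that comes up later in this review (the entries here are illustrative, not the chart's exact defaults):

metrics:
  jmx:
    kafkaConfig:
      whitelistObjectNames:
        - "kafka.server:*"
        - "kafka.controller:*"
        - "kafka.network:*"
        - "java.lang:*"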

@nrmitchi
Contributor Author

@bradenwright
Completely agree on points 1 and 2. I actually pulled the rules out into the config and left what I had just as an example/template. I found that I was missing stuff, so for now I'm just collecting everything.

Regarding the other points:
3) I really have no experience with the prometheus-operator. I'm more than happy to make it work for both, but I'm really not sure what any of the differences are.
4) The foundation of this was taken from the commit referenced in the PR message. My understanding is that this chart was meant to follow that implementation, so I have it matching that.
5) Totally agree that would be pretty cool, but I think it would probably be best as a separate PR.

@bradenwright
Contributor

bradenwright commented Mar 13, 2018

@nrmitchi cool, that's where I started too. There ended up being a number of metrics I thought were overkill, so I was going to blacklist some stats but make it configurable.

(3) I more than happy to contribute that part, I'm hoping to have it all done by Monday, if not earlier
(4) I just think this probably deserves some discussion, and it may be more appropriate to run as a java agent. https://github.com/prometheus/jmx_exporter#jmx-exporter mentions running as a java agent as the preferred approach, but reading the reasoning I thought it fit better as a sidecar (it just seemed to fit the docker/k8s model better). I also felt it was a little easier to configure as a sidecar in certain ways, and I liked the idea that changing the exporter config shouldn't require a restart of kafka; that seemed a little extreme and better to keep de-coupled. But again, very open to discussion on this.
(5) cool!

I'm more than happy to collaborate however you want on this, I think it will work out well that you did it with Prometheus and I did it with Prometheus Operator.
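For anyone weighing the two approaches, a minimal sketch of the sidecar variant inside the statefulset's pod spec; the image, jar path, port, and volume names here are illustrative rather than the chart's actual ones:

containers:
  # ... the kafka broker container ...
  - name: metrics
    image: solsson/kafka-prometheus-jmx-exporter   # illustrative; any jmx_exporter httpserver image works
    command:
      - java
      - -jar
      - jmx_prometheus_httpserver.jar
      - "5556"                                     # HTTP port that Prometheus scrapes
      - /etc/jmx-exporter/jmx-kafka-prometheus.yml # exporter rules mounted from a configmap
    ports:
      - name: metrics
        containerPort: 5556
    volumeMounts:
      - name: jmx-config
        mountPath: /etc/jmx-exporter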

@bradenwright
Contributor

bradenwright commented Mar 14, 2018

Code-wise, if you or anyone wants to compare, I can push my code running as a sidecar instead of a java agent for comparison (if wanted). Again, I'll have a clean PR by Monday, maybe earlier.

@nrmitchi
Contributor Author

Updated to include both the jmx exporter and the kafka-exporter referenced at https://prometheus.io/docs/instrumenting/exporters/

Considering adding a configurable Burrow (https://github.com/linkedin/Burrow) deployment as well.

Also planning to swap out the default image for the base jmx_exporter (https://github.com/prometheus/jmx_exporter) rather than the custom one that is currently being used.

@t-d-d
Contributor

t-d-d commented Mar 21, 2018

@bradenwright I would be interested in seeing your sidecar approach, especially if it works with prometheus-operator

@NicolasTr

@nrmitchi What is the idea behind the change in the readiness probe?

I found this pull request after running into an issue with the readiness probe when enabling JMX on the original chart.

I added the following to the values file:

configurationOverrides:
  jmx.opts: "-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9096 -Djava.rmi.server.hostname=localhost -Dcom.sun.management.jmxremote.rmi.port=9096"

I got the following issue when running the readiness probe:

Error: JMX connector server communication error: service:jmx:rmi://pj-kafka-kafka-0:9096

And I fixed it by changing the readiness probe in the chart to this:

        readinessProbe:
          exec:
            command:
              - bash
              - -ec
              - KAFKA_JMX_OPTS= kafka-topics --zookeeper {{ template "zookeeper.url" . }} --list
          initialDelaySeconds: 30
          timeoutSeconds: 5

I don't know kafka well enough to know which one would be the most meaningful.

@nrmitchi
Contributor Author

@NicolasTr a member of my team did a lot of the initial work in getting the JMX exporter up and running, but my understanding is that the readiness probe change was taken from the upstream (non-Helm) kubernetes-kafka project, and the pre-existing probe didn't work due to a port conflict.

I'm honestly not entirely sure if it is the best readinessProbe, but again I'm trusting the change upstream (https://github.com/Yolean/kubernetes-kafka/blob/master/kafka/50kafka.yml#L63)

@bradenwright
Contributor

bradenwright commented Mar 21, 2018

@t-d-d @nrmitchi

I still need to do a little clean up, and I'm currently tweaking jmx-exporter-configmap.yaml to produce stats that are more appropriate for Prometheus, but it should be fine to discuss things above:

https://github.com/spothero/kubernetes-charts/pull/3/files

@benjigoldberg
Collaborator

@nrmitchi the readiness probe is a good one; we're going to merge another PR from @t-d-d that already has this change included. If possible, it would be nice to hone this PR down to just the changes around JMX and metrics export.

My initial thoughts are that I prefer the sidecar method to modifying the core container to export metrics as well. Sidecars are a fairly common paradigm in the community generally, and are typically recommended for this kind of metrics export operation from what I've seen.

full disclosure: @bradenwright and I are coworkers at SpotHero

The final thought I have -- I think the zookeeper metrics are great -- but my initial reaction is that they likely belong with zookeeper. Although many users will deploy a zookeeper cluster with a kafka cluster, that isn't universally true. I'd prefer to encourage people to think about and package metrics with the component to which they belong.

What do you think @nrmitchi? How about others on this PR @bradenwright @t-d-d do you have any opinions given your time using Kafka?

Collaborator

@benjigoldberg left a comment


Couple of questions/comments

port: 5555

# Rules to apply to the Kafka JMX Exporter
kafkaConfig:
Collaborator


do you think we could drop this key and tab everything over?

Contributor Author


I don't think so due to the way the key is used; ie, the blob is injected as {{ toYaml .Values.metrics.jmx.kafkaConfig | indent 4 }}. If we were to drop the key and unindent, we'd end up including all of the other .metrics.jmx values in the jmx-kafka-prometheus.yml configuration. I'm not sure if that will cause the jmx exporter to error on startup, but it seems unclean either way.
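For context, a rough sketch of how that injection works in the configmap template (the resource and template names here are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "kafka.fullname" . }}-metrics
data:
  jmx-kafka-prometheus.yml: |+
{{ toYaml .Values.metrics.jmx.kafkaConfig | indent 4 }}

Dropping the kafkaConfig key would mean toYaml serializes every sibling under .Values.metrics.jmx (e.g. the port) straight into the exporter's config file.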

Collaborator


ah yeah, of course 👍

jmxUrl: service:jmx:rmi:///jndi/rmi://127.0.0.1:5555/jmxrmi
ssl: false
whitelistObjectNames: ["kafka.server:*", "kafka.controller:*", "java.lang:*"]
# rules:
Collaborator


Perhaps these should just be set as defaults (i.e. not commented), what do you think?

Contributor Author


My reasoning for leaving them commented is that including them excludes a lot of metrics; I'm personally of the opinion that the default should include everything and let a user exclude things if necessary, rather than hide metrics by default and require a user to figure out why they are missing and how to expose them. Articles/help guides about monitoring typically reference specific metrics, and it can be confusing to follow them if metrics that should be there happen to be hidden by default.

I left them in, commented, mostly as a reference in case someone does want to write their own rules, but if the choice is between uncommenting them or removing them entirely, I would err on the side of "just remove them".
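For anyone reading later, this is the shape of a single jmx_exporter rule in that config, purely as an illustration (not necessarily one of the rules that were commented out here):

rules:
  - pattern: kafka.server<type=(.+), name=(.+)PerSec\w*><>Count
    name: kafka_server_$1_$2_total
    type: COUNTER

With no rules key at all, the exporter falls back to exporting everything the whitelist allows, which is the behaviour being argued for above.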

Collaborator


Seems reasonable to me 👍

@benjigoldberg
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 10, 2018
@josdotso
Contributor

related: #4931

Awesome stuff! I have a similar branch for kafka locally that I will PR after this merges. It adds JMX and Kafka exporters too, but also refactors similar to #4931. I'll hold back filing my kafka PR until this PR merges -- and I get feedback on #4931's pattern. Thanks!

@benjigoldberg
Collaborator

@nrmitchi if we clean up the jmx port definition I think we're good to go. Let's use the structure you've defined, remove the one at the top level, and update the docs accordingly.
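In other words, something like this in values.yaml (key names taken from the diff context above; the top-level key is the one being removed):

# before
jmxPort: 5555

# after
metrics:
  jmx:
    port: 5555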

@nrmitchi
Contributor Author

@benjigoldberg 👍

@benjigoldberg
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benjigoldberg, nrmitchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2018
@benjigoldberg
Collaborator

@nrmitchi thanks much for your contribution and working through all the questions/requests with us!

@k8s-ci-robot k8s-ci-robot merged commit a5f822e into helm:master Apr 12, 2018
@aruneli

aruneli commented Apr 12, 2018

@nrmitchi I am trying to install the kafka chart after this merge and now get this message: "Error: render error in "kafka/templates/statefulset.yaml": template: kafka/templates/statefulset.yaml:174:28: executing "kafka/templates/statefulset.yaml" at <{{template "kafka.co...>: template "kafka.configmap" not defined". I hard-coded the configmap name to "kafka-metrics" to unblock myself. Please test it.

@aruneli

aruneli commented Apr 12, 2018

@nrmitchi I run a prometheus/grafana cluster using the charts here: https://github.com/camilb/prometheus-kubernetes
After bringing up the kafka cluster, where do I configure Prometheus to collect metrics from kafka?

@nrmitchi
Contributor Author

@aruneli the conflict seems to be with this change, which was merged between when this PR was branched and when it was merged: ff9f02d#diff-e4513710c9e51c6f539000cb48ad85f0

Putting something up to fix it.

@aruneli

aruneli commented Apr 13, 2018

@nrmitchi Thanks. Could you also please check my other comment about how to configure Prometheus to scrape data from the kafka exporter?

@ebabani
Contributor

ebabani commented Apr 13, 2018

@aruneli It would be similar to how you have configured prometheus to scrape other pods. That seems like a better question for https://github.com/camilb/prometheus-kubernetes

@nrmitchi
Contributor Author

How you scrape the metrics would be entirely up to you; they are exposed, but I'm not sure how that particular chart determines its targets.

This chart now adds annotations to the statefulset if you enable the JMX metrics (and to the kafka-exporter if you enable that), so it should work out of the box if your Prometheus config uses service discovery.
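Concretely, for a Prometheus set up with kubernetes_sd_configs and the common prometheus.io annotation-based relabeling, the pod annotations added by the chart look roughly like this (the port value is illustrative):

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "5556"

A scrape job that keeps pods with prometheus.io/scrape=true will then pick the brokers up automatically.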

@bradenwright
Contributor

@nrmitchi thanks for putting this together; I'll make sure to open a new PR to add support for Prometheus Operator.

@aruneli

aruneli commented Apr 13, 2018

@bradenwright Thanks

@bradenwright
Contributor

In case anyone is following along here, I opened a PR for the Prometheus Operator changes: #5120

ichtar pushed a commit to Bestmile/charts that referenced this pull request May 15, 2018
* Helm-ify the JMX exporter that was added to the base project

Also changes the readiness probe to match the new one upstream

* Increment version in Chart.yaml

* Bump Charts.yaml version and make metrics off by default

* Template out jmx rules, thus allowing people to whitelist if they want

* Pull a bit more into the config

* lint

* Fix value from copypaste error

* Add metric export option using the recommended kafka metric exporter as well

* Remove zookeeper metrics

* Update kafka-exporter default

* Add back additionalPorts that I accidentally nixed in merge

* Parameterize jmx port and add new values to readme

* Fix readme formatting

* Remove old jmxPort from values.yaml
voron pushed a commit to dysnix/helm-charts that referenced this pull request Sep 5, 2018

Signed-off-by: voron <av@arilot.com>