This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

Kafka monitoring #4138

Merged
merged 16 commits into from
Apr 12, 2018

Conversation

nrmitchi
Contributor

What this PR does / why we need it:
Currently this chart does not expose JMX metrics to Prometheus. This PR is basically a helm-ification of the changes at Yolean/kubernetes-kafka@5a2b8c7, which adds an exporter.
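For readers skimming later: enabling this ends up being a values.yaml toggle. A minimal sketch, assuming the metrics.jmx layout discussed further down in this thread (exact key names may differ in the final chart):

metrics:
  jmx:
    enabled: true
    port: 5555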

@faraazkhan @h0tbird @benjigoldberg

@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 13, 2018
Also changes the readiness probe to match the new one upstream
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 13, 2018
@paulczar
Collaborator

Can you bump the version number to 0.5.0? Given this is adding resources, we'd consider it a minor update rather than a patch under semver.
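For reference, the bump being asked for is just the version field in the chart's Chart.yaml, roughly:

name: kafka
version: 0.5.0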

@nrmitchi
Contributor Author

Not a problem, can totally do that.

That being said, it also seems like the standard in other charts is for metrics to be disabled by default, so I'll flip that too.

@nrmitchi
Contributor Author

After looking through some of the metrics that this is providing, I'm not 100% sure that these are necessarily the correct metrics to include in the chart (i.e., I think there are metrics that should be included but are not).

Based on that, I think we should put this on hold for right now. I'd hate to get something merged upstream only for it to not actually solve the problem it aimed to.

@bradenwright
Contributor

So I agree that metrics would be super nice, and I've been working on this as well. A few things to bring up:

(1) The whitelist/pattern is removing a lot of useful stats; I think it's too limiting.
(2) There should be a way to override the config map values or enable/disable it (related to #1).
(3) This works for a standard install of Prometheus but doesn't work with the prometheus-operator; both should probably be supported (I was actually working on it the other way, using prometheus-operator).
(4) I set it up as a sidecar; we already get stats about cpu/memory from Prometheus, so a sidecar seemed a little more appropriate to me, but I'm open to discussion.
(5) It would be nice to set up some generic alerts that every kafka cluster would want (although maybe that's a separate PR).

Also, I'd gladly contribute too, but I'm not quite ready for a PR; my code's working but still needs some love.
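On points (1) and (2): a rough sketch of what overriding the whitelist via values.yaml could look like, assuming the metrics.jmx.kafkaConfig structure that comes up later in this review (the entries here are illustrative, not the chart's exact defaults):

metrics:
  jmx:
    kafkaConfig:
      whitelistObjectNames:
        - "kafka.server:*"
        - "kafka.controller:*"
        - "kafka.network:*"
        - "java.lang:*"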

@nrmitchi
Contributor Author

@bradenwright
Completely agree on points 1 and 2. I actually pulled the rules out into the config and left what I had just as an example/template. I found that I was missing stuff, so for now I'm just collecting everything.

Regarding the other points:
3) I really have no experience with the prometheus-operator. I'm more than happy to make it work for both, but I'm really not sure what any of the differences are.
4) The foundation of this was taken from the commit referenced in the PR message. My understanding is that this chart was meant to follow that implementation, so I have it matching that.
5) Totally agree that would be pretty cool, but I think it would probably be best as a separate PR.

@bradenwright
Contributor

bradenwright commented Mar 13, 2018

@nrmitchi cool, that's where I started too. There ended up being a number of metrics I thought were overkill, so I was going to blacklist some stats but make it configurable.

(3) I more than happy to contribute that part, I'm hoping to have it all done by Monday, if not earlier
(4) I just think this probably deserves some discussion, and it may be more appropriate to run as a java agent. https://github.com/prometheus/jmx_exporter#jmx-exporter mentions running as a java agent as the preferred approach, but reading the reasoning I thought it fit better as a sidecar (it just seemed to fit the docker/k8s model better). I also felt it was a little easier to configure as a sidecar in certain ways, and I liked the idea that changing the exporter config shouldn't require a restart of kafka; that seemed a little extreme and better to keep de-coupled. But again, very open to discussion on this.
(5) cool!

I'm more than happy to collaborate however you want on this, I think it will work out well that you did it with Prometheus and I did it with Prometheus Operator.
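For anyone weighing the two approaches, a minimal sketch of the sidecar variant inside the statefulset's pod spec; the image, jar path, port, and volume names here are illustrative rather than the chart's actual ones:

containers:
  # ... the kafka broker container ...
  - name: metrics
    image: solsson/kafka-prometheus-jmx-exporter   # illustrative; any jmx_exporter httpserver image works
    command:
      - java
      - -jar
      - jmx_prometheus_httpserver.jar
      - "5556"                                     # HTTP port that Prometheus scrapes
      - /etc/jmx-exporter/jmx-kafka-prometheus.yml # exporter rules mounted from a configmap
    ports:
      - name: metrics
        containerPort: 5556
    volumeMounts:
      - name: jmx-config
        mountPath: /etc/jmx-exporter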

@bradenwright
Contributor

bradenwright commented Mar 14, 2018

Code-wise, if you or anyone wants to compare, I can push my code running as a sidecar instead of a java agent for comparison (if wanted). Again, I'll have a clean PR by Monday, maybe earlier.

@nrmitchi
Contributor Author

Updated to include both the jmx exporter and the kafka-exporter referenced at https://prometheus.io/docs/instrumenting/exporters/

Considering adding a configurable Burrow (https://github.com/linkedin/Burrow) deployment as well.

Also planning to swap out the default image for the base jmx_exporter (https://github.com/prometheus/jmx_exporter) rather than the custom one that is currently being used.

@t-d-d
Contributor

t-d-d commented Mar 21, 2018

@bradenwright I would be interested in seeing your sidecar approach, especially if it works with prometheus-operator

@NicolasTr

@nrmitchi What is the idea behind the change in the readiness probe?

I found this pull request after running into an issue with the readiness probe when enabling JMX on the original chart.

I added the following to the values file:

configurationOverrides:
  jmx.opts: "-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=9096 -Djava.rmi.server.hostname=localhost -Dcom.sun.management.jmxremote.rmi.port=9096"

I got the following issue when running the readiness probe:

Error: JMX connector server communication error: service:jmx:rmi://pj-kafka-kafka-0:9096

And I fixed it by changing the readiness probe in the chart to this:

        readinessProbe:
          exec:
            command:
              - bash
              - -ec
              - KAFKA_JMX_OPTS= kafka-topics --zookeeper {{ template "zookeeper.url" . }} --list
          initialDelaySeconds: 30
          timeoutSeconds: 5

I don't know kafka well enough to know which one would be the most meaningful.

@nrmitchi
Contributor Author

@NicolasTr a member of my team did a lot of the initial work in getting the JMX exporter up and running, but my understanding is that the readiness probe change was taken from the upstream (non-Helm) kubernetes-kafka project, and the pre-existing probe didn't work due to a port conflict.

I'm honestly not entirely sure if it is the best readinessProbe, but again I'm trusting the change upstream (https://github.com/Yolean/kubernetes-kafka/blob/master/kafka/50kafka.yml#L63)

@bradenwright
Contributor

bradenwright commented Mar 21, 2018

@t-d-d @nrmitchi

I still need to do a little clean up, and I'm currently tweaking jmx-exporter-configmap.yaml to produce stats that are more appropriate for Prometheus, but it should be fine to discuss things above:

https://github.com/spothero/kubernetes-charts/pull/3/files

@benjigoldberg
Collaborator

@nrmitchi the readiness probe is a good one; we're going to merge another PR from @t-d-d that already has this change included. If possible, it would be nice to hone this PR down to just the changes around JMX and metrics export.

My initial thoughts are that I prefer the sidecar method to modifying the core container to export metrics as well. Sidecars are a fairly common paradigm in the community generally, and are typically recommended for this kind of metrics export operation from what I've seen.

full disclosure: @bradenwright and I are coworkers at SpotHero

The final thought I have -- I think the zookeeper metrics are great -- but my initial reaction is that they likely belong with zookeeper. Although many users will deploy a zookeeper cluster with a kafka cluster, that isn't universally true. I'd prefer to encourage people to think about and package metrics with the component to which they belong.

What do you think @nrmitchi? How about others on this PR @bradenwright @t-d-d do you have any opinions given your time using Kafka?

Collaborator

@benjigoldberg left a comment


Couple of questions/comments

port: 5555

# Rules to apply to the Kafka JMX Exporter
kafkaConfig:
Collaborator


do you think we could drop this key and tab everything over?

Contributor Author


I don't think so due to the way the key is used; ie, the blob is injected as {{ toYaml .Values.metrics.jmx.kafkaConfig | indent 4 }}. If we were to drop the key and unindent, we'd end up including all of the other .metrics.jmx values in the jmx-kafka-prometheus.yml configuration. I'm not sure if that will cause the jmx exporter to error on startup, but it seems unclean either way.
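For context, a rough sketch of how that injection works in the configmap template (the resource and template names here are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ template "kafka.fullname" . }}-metrics
data:
  jmx-kafka-prometheus.yml: |+
{{ toYaml .Values.metrics.jmx.kafkaConfig | indent 4 }}

Dropping the kafkaConfig key would mean toYaml serializes every sibling under .Values.metrics.jmx (e.g. the port) straight into the exporter's config file.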

Collaborator


ah yeah, of course 👍

jmxUrl: service:jmx:rmi:///jndi/rmi://127.0.0.1:5555/jmxrmi
ssl: false
whitelistObjectNames: ["kafka.server:*", "kafka.controller:*", "java.lang:*"]
# rules:
Collaborator


Perhaps these should just be set as defaults (i.e. not commented), what do you think?

Contributor Author


My reasoning for leaving them commented is that including them excludes a lot of metrics; I'm personally of the opinion that the default should include everything and let a user exclude things if necessary, rather than hide metrics by default and require a user to figure out why they are missing and how to expose them. Articles/help guides about monitoring typically reference specific metrics, and it can be confusing to follow them if metrics that should be there happen to be hidden by default.

I left them in, commented, mostly as a reference in case someone does want to write their own rules, but if the choice is between uncommenting them or removing them entirely, I would err on the side of "just remove them".
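For anyone reading later, this is the shape of a single jmx_exporter rule in that config, purely as an illustration (not necessarily one of the rules that were commented out here):

rules:
  - pattern: kafka.server<type=(.+), name=(.+)PerSec\w*><>Count
    name: kafka_server_$1_$2_total
    type: COUNTER

With no rules key at all, the exporter falls back to exporting everything the whitelist allows, which is the behaviour being argued for above.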

Collaborator


Seems reasonable to me 👍

@benjigoldberg
Collaborator

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 10, 2018
@josdotso
Contributor

related: #4931

Awesome stuff! I have a similar branch for kafka locally that I will PR after this merges. It adds JMX and Kafka exporters too, but also refactors similar to #4931. I'll hold back filing my kafka PR until this PR merges -- and I get feedback on #4931's pattern. Thanks!

@benjigoldberg
Collaborator

@nrmitchi if we clean up the jmx port definition I think we're good to go. Let's use the structure you've defined, remove the one at the top level, and update the docs accordingly.
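In other words, something like this in values.yaml (key names taken from the diff context above; the top-level key is the one being removed):

# before
jmxPort: 5555

# after
metrics:
  jmx:
    port: 5555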

@nrmitchi
Contributor Author

@benjigoldberg 👍

@benjigoldberg
Collaborator

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benjigoldberg, nrmitchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 12, 2018
@benjigoldberg
Collaborator

@nrmitchi thanks much for your contribution and working through all the questions/requests with us!

@k8s-ci-robot k8s-ci-robot merged commit a5f822e into helm:master Apr 12, 2018
@aruneli

aruneli commented Apr 12, 2018

@nrmitchi I am trying to install the kafka chart after this merge and now get this message: "Error: render error in "kafka/templates/statefulset.yaml": template: kafka/templates/statefulset.yaml:174:28: executing "kafka/templates/statefulset.yaml" at <{{template "kafka.co...>: template "kafka.configmap" not defined". I hard-coded the configmap name to "kafka-metrics" to unblock myself. Please test it.

@aruneli

aruneli commented Apr 12, 2018

@nrmitchi I run a prometheus/grafana cluster using the charts here: https://github.com/camilb/prometheus-kubernetes
After bringing up the kafka cluster, where do I configure Prometheus to collect metrics from kafka?

@nrmitchi
Contributor Author

@aruneli the conflict seems to be with this change, which was merged between when this PR was branched and when it was merged: ff9f02d#diff-e4513710c9e51c6f539000cb48ad85f0

Putting something up to fix it.

@aruneli

aruneli commented Apr 13, 2018

@nrmitchi Thanks. Could you also please check my other comment about how to configure Prometheus to scrape data from the kafka exporter?

@ebabani
Contributor

ebabani commented Apr 13, 2018

@aruneli It would be similar to how you have configured prometheus to scrape other pods. That seems like a better question for https://github.com/camilb/prometheus-kubernetes

@nrmitchi
Contributor Author

How you scrape the metrics would be entirely up to you; they are exposed, but I'm not sure how that particular chart determines its targets.

This chart now adds annotations to the statefulset if you enable the JMX metrics (and to the kafka-exporter if you enable that), so it should work out of the box if your Prometheus config uses service discovery.
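Concretely, for a Prometheus set up with kubernetes_sd_configs and the common prometheus.io annotation-based relabeling, the pod annotations added by the chart look roughly like this (the port value is illustrative):

annotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "5556"

A scrape job that keeps pods with prometheus.io/scrape=true will then pick the brokers up automatically.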

@bradenwright
Contributor

@nrmitchi thanks for putting this together; I'll make sure to open a new PR to add support for Prometheus Operator.

@aruneli

aruneli commented Apr 13, 2018

@bradenwright Thanks

@bradenwright
Contributor

In case anyone is following along here, I opened a PR for the Prometheus Operator changes: #5120

ichtar pushed a commit to Bestmile/charts that referenced this pull request May 15, 2018
* Helm-ify the JMX exporter that was added to the base project

Also changes the readiness probe to match the new one upstream

* Increment version in Chart.yaml

* Bump Charts.yaml version and make metrics off by default

* Template out jmx rules, thus allowing people to whitelist if they want

* Pull a bit more into the config

* lint

* Fix value from copypaste error

* Add metric export option using the recommended kafka metric exporter as well

* Remove zookeeper metrics

* Update kafka-exporter default

* Add back additionalPorts that I accidentally nixed in merge

* Parameterize jmx port and add new values to readme

* Fix readme formatting

* Remove old jmxPort from values.yaml
voron pushed a commit to dysnix/helm-charts that referenced this pull request Sep 5, 2018

Signed-off-by: voron <av@arilot.com>