Add prometheus cluster monitoring addon. #62195

serathius · 2018-04-06T14:01:35Z

This PR adds a new cluster monitoring addon based on Prometheus.
It adds a Prometheus deployment with e2e tests.
Additional components will be added iteratively in the future.
The manifests are based on the current Helm chart.
In its current state it is not intended for production use.

cc @piosz @kawych @miekg

Add prometheus cluster monitoring addon to kube-up

/sig instrumentation
/kind feature
/priority important-soon

dims · 2018-04-06T14:54:51Z

/ok-to-test

kawych · 2018-04-06T16:05:51Z

cluster/gce/gci/configure-helper.sh

@@ -2104,14 +2104,17 @@ EOF
    prepare-kube-proxy-manifest-variables "$src_dir/kube-proxy/kube-proxy-ds.yaml"
    setup-addon-manifests "addons" "kube-proxy"
  fi
+  if [[ "${ENABLE_CLUSTER_MONITORING:-}" != "none" ]]; then


I prefer if [[ "${ENABLE_CLUSTER_MONITORING:-}" == "prometheus" ]]. It makes sense to have fully separated conditions for prometheus and other monitoring systems, because this is the only one that doesn't use heapster. Please add some comments to make this separation clear, e.g. "set up cluster monitoring using prometheus" and "set up cluster monitoring using heapster"
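A minimal sketch of the separation suggested in the comment above, not the code that was merged. The function name `select_monitoring_setup` is hypothetical, `influxdb` is used only as an example of a non-prometheus value, and treating an unset value as `none` is an assumption. It is written with POSIX `[ ]` tests so it runs under plain `sh`; the helper script itself uses `[[ ]]`.

```shell
# Hypothetical helper (illustration only): prints which monitoring stack
# would be set up for a given ENABLE_CLUSTER_MONITORING value.
select_monitoring_setup() {
  # Assumption: an unset or empty value is treated like "none".
  local mode="${1:-none}"

  if [ "${mode}" = "prometheus" ]; then
    # Set up cluster monitoring using prometheus.
    echo "prometheus"
  elif [ "${mode}" != "none" ]; then
    # Set up cluster monitoring using heapster
    # (every other monitoring system relies on it).
    echo "heapster"
  else
    # Monitoring disabled: nothing to set up.
    echo "none"
  fi
}
```

Calling `select_monitoring_setup "${ENABLE_CLUSTER_MONITORING:-}"` would then pick exactly one branch, which keeps the prometheus path fully separate from the heapster-based ones.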

kawych · 2018-04-06T16:05:56Z

cluster/addons/cluster-monitoring/prometheus/prometheus-pvc.yaml

+  namespace: kube-system
+  labels:
+    kubernetes.io/cluster-service: "true"
+    addonmanager.kubernetes.io/mode: EnsureExists 


Do we actually intend users to modify this, i.e. anything other than the storage request?

kawych · 2018-04-06T16:06:02Z

cluster/addons/cluster-monitoring/prometheus/prometheus-configmap.yaml

@@ -0,0 +1,190 @@
+---
+apiVersion: v1
+kind: ConfigMap


Can you include some reference for the format of this config map?

kawych · 2018-04-06T16:06:03Z

cluster/addons/cluster-monitoring/prometheus/prometheus-configmap.yaml

@@ -0,0 +1,190 @@
+---


nit: please skip unnecessary separator lines like this (all first lines)

kawych · 2018-04-06T16:07:47Z

The deployments look fine, I'll take a look at the e2e tests on Monday. Can you split this PR into two commits: deployments and tests?

kawych · 2018-04-06T16:09:23Z

cluster/addons/cluster-monitoring/prometheus/prometheus-deployment.yaml

@@ -0,0 +1,86 @@
+---
+apiVersion: extensions/v1beta1
+kind: Deployment


Do you know what Prometheus's CPU/memory usage is and whether we can rely on defaults?

We cannot rely on defaults in kube-system, I will prepare them.

brancz · 2018-04-09T09:47:10Z

cluster/addons/cluster-monitoring/prometheus/prometheus-configmap.yaml

+      - replacement: kubernetes.default.svc:443
+        target_label: __address__
+      - regex: (.+)
+        replacement: /api/v1/nodes/${1}/proxy/metrics


I'm not sure this is a good idea to advocate to users. People will look at this and think it is the recommended way to run Prometheus, but in reality it gives the Prometheus pod close to root access to all kubelets, which doesn't seem like a good idea. I would prefer cert- or token-based authN + authZ from the kubelet.

We're discussing how to remove this from the example in the Prometheus repo. tl;dr people are asking this to stay as GKE doesn't have another possibility.

Looks like kube-up has HTTP endpoints enabled for the kubelet. I will use them as a temporary solution and work on authorization in the meantime.

@brancz Is using unencrypted metric endpoints acceptable for the first version? I plan to support this addon through changes to kubelet metrics. For this PR I wanted to move the current community solution for Prometheus into an addon, but enhance it with e2e tests.

kawych · 2018-04-10T14:35:31Z

test/utils/runners.go

@@ -127,6 +127,7 @@ type RCConfig struct {
 	ReadinessProbe    *v1.Probe
 	DNSPolicy         *v1.DNSPolicy
 	PriorityClassName string
+	PodAnnotations    map[string]string


nit: this is similar to labels, so probably it would fit better after Labels field

kawych · 2018-04-10T14:35:37Z

test/e2e/instrumentation/monitoring/prometheus.go

+	return fmt.Sprintf(`sum(QPS{kubernetes_namespace="%s",kubernetes_pod_name=~"%s.*"})`, namespace, podNamePrefix)
+}
+
+func retryUntil(predicate func() bool, timeout time.Duration) {


Please consider some logical ordering of functions, i.e. move helper methods below test logic.

kawych · 2018-04-12T12:15:46Z

/lgtm

serathius · 2018-04-12T14:19:46Z

Fixed typo in relabeling schema.

kawych · 2018-04-12T15:15:23Z

/lgtm

brancz

generally looks good, just two suggestions

brancz · 2018-04-12T15:38:53Z

cluster/addons/prometheus/prometheus-configmap.yaml

+    {}
+  prometheus.yml: |
+    rule_files:
+    - /etc/config/rules


why specify rules and alert files if they are then left empty?

brancz · 2018-04-12T15:39:57Z

cluster/addons/prometheus/prometheus-deployment.yaml

+              memory: 10Mi
+
+        - name: prometheus-server
+          image: "prom/prometheus:v2.1.0"


v2.1.0 had a variety of problems, I'd recommend v2.2.1

brancz · 2018-04-12T15:44:23Z

What I'm trying to understand is do we exclusively want to use this for the e2e tests of custom metrics or promote this as an official addon? If the latter then I'm not sure I'm comfortable with this. For testing purposes the insecure port is totally fine, but in production environments it's not what I would recommend.

Personally I'd of course like to see this done with the Prometheus Operator 😉 .

kawych · 2018-04-16T09:21:31Z

@brancz
We want to promote it as an official addon, an alternative to the other monitoring systems implemented by Heapster. We had a discussion about the insecure port; I don't think we came up with a good alternative, but it certainly has to be solved at some point. (Do you know how Prometheus Operator handles it?)

This is disabled by default. Can we comment it better to make users aware of the issues? What's your recommendation? I'd prefer to merge this to get e2e tests running.

brancz · 2018-04-16T17:35:58Z

The Prometheus Operator is soon going to implement the TokenRequest API, in order to use tokens for specific audiences. As far as I understand token impersonation is why token auth on kubelets is not enabled by GKE today (RE: #57997). So until TokenRequests are available it won't be possible on GKE. This is somewhat reasonable I guess (although personally I feel the kubelet is higher privileged than Prometheus, so impersonation would not be a security concern, but I don't want to start that discussion here, also I'm happy to be proven wrong, I admit I haven't analyzed the security situation to its fullest).

Eventually I'd prefer to see this Prometheus Operator based as it solves a lot of operational needs of Prometheus (the TokenRequest being only one example, which is unlikely to land in Prometheus itself). Also we maintain a rather exhaustive setup already to perform cluster monitoring, which we have productionized on top of OpenShift and are planning to add support for vanilla Kubernetes as well.

tl;dr I'm ok with this state for now, but I'd prefer if we don't commit to this in the long term as we already know of shortcomings of this and converge to a Prometheus Operator based setup.

(disclaimer I'm one of the maintainers of the Prometheus Operator)

serathius · 2018-04-17T08:33:24Z

/cc @gmarek @roberthbailey

kawych · 2018-04-17T09:42:52Z

/lgtm
@brancz thank you for the explanations. As far as I know, token auth is going to be enabled (see #58178); as discussed with @serathius, we can move away from the insecure port in a follow-up PR.

@piosz and @serathius should be able to contribute more to the discussion about using Prometheus Operator. I'm not fully aware of the benefits of Prometheus Operator that are already available; @serathius has been investigating this more, e.g. he raised a concern that we may need a wider review of the Prometheus Operator CRDs.

wojtek-t · 2018-04-18T08:24:18Z

I didn't carefully review either the e2e test or the YAML files.
I looked at the glue code and that looks fine.

/approve no-issue

k8s-ci-robot · 2018-04-18T08:24:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, serathius, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster/OWNERS~~ [wojtek-t]
~~hack/OWNERS~~ [wojtek-t]
~~test/OWNERS~~ [wojtek-t]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-github-robot · 2018-04-18T08:25:39Z

/test all [submit-queue is verifying that this PR is safe to merge]

k8s-github-robot · 2018-04-18T09:17:47Z

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

brancz · 2018-04-19T15:27:45Z

I would like us to discuss in more detail what we want to achieve here. From the sig-instrumentation meeting two weeks ago I was under the impression that all we wanted to do is a very simple setup purely to validate in e2e tests that Kubernetes SD and other integrations like the custom metrics monitoring pipeline are not totally broken. A fully fledged cluster monitoring addon is another story. I would like us to reconsider this.

k8s-ci-robot requested review from foxish, kawych and mwielgus April 6, 2018 14:02

k8s-ci-robot added the release-note label and removed the release-note-none label Apr 6, 2018

serathius force-pushed the prometheus branch 2 times, most recently from 3a42553 to 5d5e7f4 Compare April 6, 2018 14:15

k8s-ci-robot removed the needs-ok-to-test label Apr 6, 2018

kawych reviewed Apr 6, 2018

serathius force-pushed the prometheus branch from 5d5e7f4 to 315ac0d Compare April 6, 2018 16:25

piosz assigned kawych, piosz and brancz Apr 9, 2018

brancz reviewed Apr 9, 2018

serathius force-pushed the prometheus branch from 315ac0d to 68656db Compare April 9, 2018 14:12

kawych reviewed Apr 10, 2018

k8s-ci-robot added the lgtm label Apr 12, 2018

serathius force-pushed the prometheus branch from a388d86 to 93f5a0d Compare April 12, 2018 14:18

k8s-ci-robot removed the lgtm label Apr 12, 2018

k8s-ci-robot added the lgtm label Apr 12, 2018

brancz reviewed Apr 12, 2018

serathius force-pushed the prometheus branch from 93f5a0d to 69a6614 Compare April 13, 2018 08:27

k8s-ci-robot removed the lgtm label Apr 13, 2018

serathius added 2 commits April 13, 2018 11:12

Add prometheus addon

113987e

Test e2e prometheus addon

9544222

serathius force-pushed the prometheus branch from 69a6614 to 9544222 Compare April 13, 2018 09:12

k8s-ci-robot requested review from gmarek and roberthbailey April 17, 2018 08:33

k8s-ci-robot added the lgtm label Apr 17, 2018

k8s-ci-robot added the approved label Apr 18, 2018

k8s-github-robot merged commit bb8f58b into kubernetes:master Apr 18, 2018

serathius mentioned this pull request Oct 4, 2019

Remove prometheus addon #83442

Merged

serathius deleted the prometheus branch July 11, 2020 12:05
