Add prometheus cluster monitoring addon. #62195

Merged
merged 2 commits into kubernetes:master on Apr 18, 2018

Conversation

@serathius
Contributor

serathius commented Apr 6, 2018

This PR adds a new cluster monitoring addon based on Prometheus.
It adds a Prometheus deployment with e2e tests.
Additional components will be added iteratively in the future.
The manifests are based on the current Helm chart.
In its current state it's not intended for production use.

cc @piosz @kawych @miekg

Add prometheus cluster monitoring addon to kube-up

/sig instrumentation
/kind feature
/priority important-soon

@dims

Member

dims commented Apr 6, 2018

/ok-to-test

@@ -2104,14 +2104,17 @@ EOF
prepare-kube-proxy-manifest-variables "$src_dir/kube-proxy/kube-proxy-ds.yaml"
setup-addon-manifests "addons" "kube-proxy"
fi
if [[ "${ENABLE_CLUSTER_MONITORING:-}" != "none" ]]; then

@kawych

kawych Apr 6, 2018

Contributor

I prefer `if [[ "${ENABLE_CLUSTER_MONITORING:-}" == "prometheus" ]]`. It makes sense to have fully separated conditions for prometheus and other monitoring systems, because this is the only one that doesn't use heapster. Please add some comments to make this separation clear, e.g. "set up cluster monitoring using prometheus" and "set up cluster monitoring using heapster".

namespace: kube-system
labels:
kubernetes.io/cluster-service: "true"
addonmanager.kubernetes.io/mode: EnsureExists

@kawych

kawych Apr 6, 2018

Contributor

Do we actually intend users to modify this, i.e. other parts than storage request?

@@ -0,0 +1,190 @@
---
apiVersion: v1
kind: ConfigMap

@kawych

kawych Apr 6, 2018

Contributor

Can you include some reference for the format of this config map?

@@ -0,0 +1,190 @@
---

@kawych

kawych Apr 6, 2018

Contributor

nit: please skip unnecessary separator lines like this (all first lines)

@kawych

Contributor

kawych commented Apr 6, 2018

The deployments look fine, I'll take a look at the e2e tests on Monday. Can you split this PR into two commits: deployments and tests?

@@ -0,0 +1,86 @@
---
apiVersion: extensions/v1beta1
kind: Deployment

@kawych

kawych Apr 6, 2018

Contributor

Do you know what is prometheus cpu/memory usage and whether we can rely on defaults?

@serathius

serathius Apr 9, 2018

Contributor

We cannot rely on defaults in kube-system, I will prepare them.

- replacement: kubernetes.default.svc:443
target_label: __address__
- regex: (.+)
replacement: /api/v1/nodes/${1}/proxy/metrics

@brancz

brancz Apr 9, 2018

Member

I'm not sure this is a good idea to advocate to users. People will look at this and think it's the recommended way to run this, but in reality it gives the Prometheus pod close to root access on all kubelets, which doesn't seem like a good idea. I would prefer cert or token based authN + authZ from the kubelet.

We're discussing how to remove this from the example in the Prometheus repo. tl;dr people are asking this to stay as GKE doesn't have another possibility.

@serathius

serathius Apr 9, 2018

Contributor

Looks like kube-up has the kubelet's http endpoints enabled. I will use them as a temporary solution and work on authorization in the meantime.

@serathius

serathius Apr 12, 2018

Contributor

@brancz Is using unencrypted metric endpoints acceptable for a first version? I plan to support this addon through changes to kubelet metrics. For this PR I wanted to move the current community solution for Prometheus into an addon, but enhance it with e2e tests.

@@ -127,6 +127,7 @@ type RCConfig struct {
ReadinessProbe *v1.Probe
DNSPolicy *v1.DNSPolicy
PriorityClassName string
PodAnnotations map[string]string

@kawych

kawych Apr 10, 2018

Contributor

nit: this is similar to labels, so probably it would fit better after Labels field

return fmt.Sprintf(`sum(QPS{kubernetes_namespace="%s",kubernetes_pod_name=~"%s.*"})`, namespace, podNamePrefix)
}
func retryUntil(predicate func() bool, timeout time.Duration) {

@kawych

kawych Apr 10, 2018

Contributor

Please consider some logical ordering of functions, e.g. move helper methods below the test logic.

}
time.Sleep(prometheusSleepBetweenAttempts)
}
framework.Failf("monitoring using prometheus test failed")

@kawych

kawych Apr 10, 2018

Contributor

Can you make the error message more useful, e.g. number of nodes with metrics vs all nodes, actual metric value vs expected, etc?

return true
}
func validateQueryValues(c clientset.Interface, query string, expectedValue float64, samplesCount int, errorTolerance float64) bool {

@kawych

kawych Apr 10, 2018

Contributor

nit: please stick to some naming convention, e.g. use "validate" prefix in all functions that check if the result is expected or use some other name here ("metricHasCorrectValue"). Personally I'd prefer the latter and use validate.* for functions that do some validation and fail the test/return error if it doesn't pass.
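A sketch of the suggested `metricHasCorrectValue` naming, assuming `errorTolerance` is a relative tolerance around the expected value. The real helper queries Prometheus for the samples; here they are passed in directly to keep the example self-contained:

```go
package main

import (
	"fmt"
	"math"
)

// metricHasCorrectValue reports whether every sample lies within a
// relative errorTolerance of expectedValue.
func metricHasCorrectValue(samples []float64, expectedValue, errorTolerance float64) bool {
	for _, s := range samples {
		if math.Abs(s-expectedValue) > errorTolerance*expectedValue {
			return false
		}
	}
	return true
}

func main() {
	// All samples within 5% of 1000.
	fmt.Println(metricHasCorrectValue([]float64{990, 1005, 1010}, 1000, 0.05)) // prints: true
	// 900 is 10% off, outside the 5% tolerance.
	fmt.Println(metricHasCorrectValue([]float64{900}, 1000, 0.05)) // prints: false
}
```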

}
if len(discovery.ActiveTargets) == 0 {
framework.Logf("prometheus is not scraping any targets")
return false

@kawych

kawych Apr 10, 2018

Contributor

What targets are required?

return true
}
func metricAvailableOnNodes(c clientset.Interface, expectedNodesNames []string, metric string, samplesCount int) bool {

@kawych

kawych Apr 10, 2018

Contributor

s/metricAvailableOnNodes/metricAvailableOnAllNodes

}, prometheusTestTimeout)
})
It("should scrape metrics from annotated services.", func() {
var customMetricTargetValue float64 = 1000

@kawych

kawych Apr 11, 2018

Contributor

I think you missed my previous comment - can you use literals instead? (you can use bool constants for readability: doAnnotatePods = true, doAnnotateServices = true, dontAnnotatePods = false, dontAnnotateServices = false)

@serathius

serathius Apr 11, 2018

Contributor

I moved test parameterization to consts and split exposeCustomMetrics function based on annotation type.

@@ -2104,6 +2104,12 @@ EOF
prepare-kube-proxy-manifest-variables "$src_dir/kube-proxy/kube-proxy-ds.yaml"
setup-addon-manifests "addons" "kube-proxy"
fi
# Setup cluster monitoring using prometheus

@kawych

kawych Apr 11, 2018

Contributor

AFAIU we may still need to run a standalone Heapster in addition to this, for users that use older versions of kubectl.

@serathius

serathius Apr 11, 2018

Contributor

I will remove prometheus from the cluster monitoring addons to allow deploying heapster separately.

@@ -2104,6 +2104,11 @@ EOF
prepare-kube-proxy-manifest-variables "$src_dir/kube-proxy/kube-proxy-ds.yaml"
setup-addon-manifests "addons" "kube-proxy"
fi
# Setup prometheus stack for monitoring kubernetes cluster
if [[ "${ENABLE_PROMETHEUS_MONITORING:-}" == "true" ]]; then

@kawych

kawych Apr 12, 2018

Contributor

Have you tested it? If you add new env var, I think you need to add it here as well: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/util.sh#L678

@serathius

serathius Apr 12, 2018

Contributor

Just finished testing it from a fresh cluster. Moving prometheus out of cluster-monitoring took more changes.

@kawych

Contributor

kawych commented Apr 12, 2018

LGTM. Please fix presubmits and make sure that other reviewers approve (or uncc them).

@kawych

Contributor

kawych commented Apr 12, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added lgtm and removed lgtm labels Apr 12, 2018

@serathius

Contributor

serathius commented Apr 12, 2018

Fixed typo in relabeling schema.

@kawych

Contributor

kawych commented Apr 12, 2018

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 12, 2018

@brancz

generally looks good, just two suggestions

{}
prometheus.yml: |
rule_files:
- /etc/config/rules

@brancz

brancz Apr 12, 2018

Member

why specify rules and alert files if they are then left empty?

@serathius

serathius Apr 13, 2018

Contributor

Removed

memory: 10Mi
- name: prometheus-server
image: "prom/prometheus:v2.1.0"

@brancz

brancz Apr 12, 2018

Member

v2.1.0 had a variety of problems, I'd recommend v2.2.1

@serathius

serathius Apr 13, 2018

Contributor

Updated

@brancz

Member

brancz commented Apr 12, 2018

What I'm trying to understand is: do we want to use this exclusively for the e2e tests of custom metrics, or promote it as an official addon? If the latter, I'm not sure I'm comfortable with this. For testing purposes the insecure port is totally fine, but in production environments it's not what I would recommend.

Personally I'd of course like to see this done with the Prometheus Operator 😉 .

@k8s-ci-robot k8s-ci-robot removed the lgtm label Apr 13, 2018

@kawych

Contributor

kawych commented Apr 16, 2018

@brancz
We want to promote it as an official addon, an alternative to the other monitoring systems implemented by Heapster. We had a discussion about the insecure port; I don't think we came up with a good alternative, but it certainly has to be solved at some point. (E.g. do you know how Prometheus Operator handles it?)

This is disabled by default. Can we comment it better to make users aware of the issues? What's your recommendation? I'd prefer to merge this to get e2e tests running.

@brancz

Member

brancz commented Apr 16, 2018

The Prometheus Operator is soon going to implement the TokenRequest API, in order to use tokens for specific audiences. As far as I understand token impersonation is why token auth on kubelets is not enabled by GKE today (RE: #57997). So until TokenRequests are available it won't be possible on GKE. This is somewhat reasonable I guess (although personally I feel the kubelet is higher privileged than Prometheus, so impersonation would not be a security concern, but I don't want to start that discussion here, also I'm happy to be proven wrong, I admit I haven't analyzed the security situation to its fullest).

Eventually I'd prefer to see this Prometheus Operator based as it solves a lot of operational needs of Prometheus (the TokenRequest being only one example, which is unlikely to land in Prometheus itself). Also we maintain a rather exhaustive setup already to perform cluster monitoring, which we have productionized on top of OpenShift and are planning to add support for vanilla Kubernetes as well.

tl;dr I'm ok with this state for now, but I'd prefer if we don't commit to this in the long term as we already know of shortcomings of this and converge to a Prometheus Operator based setup.

(disclaimer I'm one of the maintainers of the Prometheus Operator)

@serathius

Contributor

serathius commented Apr 17, 2018

@k8s-ci-robot k8s-ci-robot requested review from gmarek and roberthbailey Apr 17, 2018

@kawych

Contributor

kawych commented Apr 17, 2018

/lgtm
@brancz thank you for explanations. From my knowledge, token auth is going to be enabled (see #58178), as discussed with @serathius we can move away from insecure port in a follow-up PR.

@piosz and @serathius should be able to contribute more to discussion about using Prometheus Operator. I'm not fully aware of the benefits of Prometheus Operator that are already available, @serathius has been investigating this more, i.e. he raised a concern that we may need some wider review of Prometheus Operator CRDs.

@k8s-ci-robot k8s-ci-robot added the lgtm label Apr 17, 2018

@wojtek-t

Member

wojtek-t commented Apr 18, 2018

I didn't carefully review either the e2e tests or those yamls.
I looked into the glue code and that looks fine.

/approve no-issue

@k8s-ci-robot

Contributor

k8s-ci-robot commented Apr 18, 2018

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, serathius, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-merge-robot

Contributor

k8s-merge-robot commented Apr 18, 2018

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-merge-robot

Contributor

k8s-merge-robot commented Apr 18, 2018

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-merge-robot k8s-merge-robot merged commit bb8f58b into kubernetes:master Apr 18, 2018

14 of 15 checks passed

Submit Queue: Required Github CI test is not green: pull-kubernetes-e2e-gce
cla/linuxfoundation: serathius authorized
pull-kubernetes-bazel-build: Job succeeded.
pull-kubernetes-bazel-test: Job succeeded.
pull-kubernetes-cross: Skipped
pull-kubernetes-e2e-gce: Job succeeded.
pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
pull-kubernetes-e2e-gke: Job succeeded.
pull-kubernetes-e2e-kops-aws: Job succeeded.
pull-kubernetes-integration: Job succeeded.
pull-kubernetes-kubemark-e2e-gce: Job succeeded.
pull-kubernetes-local-e2e: Skipped
pull-kubernetes-node-e2e: Job succeeded.
pull-kubernetes-typecheck: Job succeeded.
pull-kubernetes-verify: Job succeeded.
@brancz

Member

brancz commented Apr 19, 2018

I would like us to discuss in more detail what we want to achieve here. From the sig-instrumentation meeting two weeks ago I was under the impression that all we wanted to do was a very simple setup, purely to validate in e2e tests that Kubernetes SD and other integrations, like the custom metrics monitoring pipeline, are not totally broken. A fully fledged cluster monitoring addon is another story. I would like us to reconsider this.
