
Best practice prometheus monitoring #425

Closed
runningman84 opened this issue May 1, 2019 · 35 comments

@runningman84

Describe the bug
I would like to monitor a k3s system, so I installed the prometheus-operator Helm chart. Out of the box, a lot of alerts are in state FIRING.
A lot of rules that cover the apiserver and kubelet are not working. Should users just disable these rules, or are you going to provide your own default rules for a k3s setup?

To Reproduce
Install prometheus helm chart with default values

Expected behavior
Everything should look green if the k3s-specific instructions were followed.

Screenshots
KubeAPIDown (1 active)
KubeControllerManagerDown (1 active)
KubeDaemonSetRolloutStuck (1 active) kube-state-metrics
KubeSchedulerDown (1 active)
KubeletDown (1 active)
TargetDown (2 active) apiserver, kubelet

@deniseschannon added the kind/question (No code change, just asking/answering a question) label on May 4, 2019
@runningman84
Author

In order to remove target scrape errors I use this configuration:

    kubeApiServer:
      enabled: false
    kubeEtcd:
      enabled: false
    kubeControllerManager:
      enabled: false
    kubeScheduler:
      enabled: false

Unfortunately, core parts of k3s are not monitored using this config.

@JeffreyVdb

It should be possible to monitor the API server, or at least give an option to change the advertise address.

@szamuboy

szamuboy commented Jul 12, 2019

You can try my HelmChart CRD.

It's not perfect (the kubelet for some reason does not report certain labels), but it solves most of your issues.

@hlugt

hlugt commented Aug 13, 2019

I am also trying to get kube-prometheus to work on k3s (currently version 0.8.0). I am running my cluster on arm, which complicates things a bit: kube-state-metrics and kube-rbac-proxy, for example, are not readily available for arm. I made some images myself, but luckily carlosedp has made the necessary arm images available. You can have a look at his GitHub repo, cluster_monitoring.

The problem, though, is authorization for node-exporter and kube-state-metrics (and possibly more): it seems k3s uses a different authentication API version, as user phillebaba has found. See issue carlosedp/cluster-monitoring#13 (comment).

Can k3s developers or anyone else maybe shed some light or advise on this?

@carlosedp
Contributor

I've added a workaround in my cluster-monitoring stack to remove kube-rbac-proxy from node_exporter and kube-state-metrics.

Can you test-out the k3s branch from https://github.com/carlosedp/cluster-monitoring/tree/k3s and report back if it worked? It's a matter of applying the manifests from "manifests" dir. They are already generated from jsonnet.
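
For reference, "removing kube-rbac-proxy" in a kube-prometheus-style stack roughly means letting the ServiceMonitor scrape the plain HTTP metrics port of kube-state-metrics directly instead of going through the proxied HTTPS port. A minimal sketch of that idea; the namespace, labels, and port name below are illustrative assumptions, not taken from the cluster-monitoring repo:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-state-metrics
  namespace: monitoring                # assumed namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics   # assumed Service label
  endpoints:
    - port: http-metrics               # plain HTTP port, no kube-rbac-proxy in front
      interval: 30s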

@anarcher

anarcher commented Aug 30, 2019

With k3s (k3d) and kube-state-metrics (kube-rbac-proxy), I have the same problem. If the intention of k3s is to remove alpha and non-default features, I think kube-rbac-proxy should change to use authentication/v1, or we should remove kube-rbac-proxy from kube-state-metrics and node-exporter in our monitoring stack.
But I wish that k3s handled authentication/v1beta1 too. :->
kind works fine with kube-rbac-proxy.

The problem, though, is authorization for node-exporter and kube-state-metrics (and possibly more): it seems k3s uses a different authentication API version, as user phillebaba has found. See issue carlosedp/cluster-monitoring#13 (comment).

@carlosedp
Contributor

The problem with changing to authentication/v1 is that it would not be compatible with previous versions of k8s where the API was still beta.

@ramukima

You can try my HelmChart CRD.

It's not perfect (the kubelet for some reason does not report certain labels), but it solves most of your issues.

It fails at first and eventually succeeds with the following additional change after the failure:

  valuesContent: |-
    prometheusOperator:
      createCustomResource: false

It just disables the creation of CRDs after the first failed attempt.

@hlugt

hlugt commented Nov 14, 2019

I've added a workaround in my cluster-monitoring stack to remove kube-rbac-proxy from node_exporter and kube-state-metrics.

Can you test-out the k3s branch from https://github.com/carlosedp/cluster-monitoring/tree/k3s and report back if it worked? It's a matter of applying the manifests from "manifests" dir. They are already generated from jsonnet.

Hi, I now have node-exporter metrics, thanks, but cAdvisor and the k3s kubelet still give authentication errors.

Edit: I have changed prometheus-serviceMonitorKubelet.yaml to use HTTPS and include TLS settings, and now I can collect metrics with the carlosedp set of manifests (so without kube-rbac-proxy).
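
The exact edit isn't shown above, but the described change gives the kubelet ServiceMonitor endpoints roughly this shape; a sketch assuming the usual kube-prometheus layout (the names, namespace, and cAdvisor path are assumptions):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubelet
  namespace: monitoring
  labels:
    k8s-app: kubelet
spec:
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: kubelet
  namespaceSelector:
    matchNames:
      - kube-system
  endpoints:
    - port: https-metrics
      scheme: https                    # switched from plain http
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true       # the kubelet serving cert is self-signed
    - port: https-metrics
      scheme: https
      path: /metrics/cadvisor          # cAdvisor metrics on the same port
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true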

@carlosedp
Contributor

As I noted in the readme on the repo, with more details in carlosedp/cluster-monitoring#17, under K3s you need to use Docker as the runtime to get all cAdvisor metrics.

@onedr0p
Contributor

onedr0p commented Apr 5, 2020

Any update on this? It would be great to monitor k3s with the Prometheus Operator Helm chart. kubeApiServer is working just fine; it's only the following three that cannot be monitored:

    kubeControllerManager:
      enabled: false
    kubeScheduler:
      enabled: false
    kubeProxy:
      enabled: false

@larssb

larssb commented Jun 2, 2020

Yeah what is the latest on this?

@brandond
Contributor

brandond commented Jun 2, 2020

Is there an issue? I have a bog-standard prometheus install pointed at metrics-server and node-exporter. Literally copied the manifests over from an EKS cluster and didn't have to change anything.

@larssb

larssb commented Jun 3, 2020

Hi @brandond,

This issue just gave me the impression that Prometheus could be challenging to get up and running, so I was trying to ask for an update on best practices. But if it's simply a matter of throwing a Prometheus Helm chart at K3s, I'll just jump into it.

@brandond
Contributor

brandond commented Jun 3, 2020

You have to make sure you have things like metrics-server, kube-state-metrics, node-exporter, etc. deployed, but that's not unique to k3s. Nor is the Prometheus scraper configuration. None of these should require any configuration that wouldn't be necessary on any other k8s cluster.
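
As an illustration, in the kube-prometheus-stack chart (formerly prometheus-operator) those components are toggled with top-level values like the following; a sketch only, and note that k3s already ships metrics-server by default:

kubeStateMetrics:
  enabled: true        # deploy kube-state-metrics as a sub-chart
nodeExporter:
  enabled: true        # deploy node-exporter as a DaemonSet on every node
grafana:
  enabled: true        # bundled Grafana with the default dashboards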

@larssb

larssb commented Jun 5, 2020

Great stuff. Thank you Mr. @brandond

@isshwar

isshwar commented Jul 21, 2020

Hi,

I am new to k3s. I have a k3s installation set up and am trying to pull metrics from the cluster. My Prometheus is hosted outside the cluster.

It would be a great help if someone could shed some light on how to set this up. I have literally spent hours trying to find a solution.

Does the installation need to have metrics-server or kube-state-metrics running?

@ioagel

ioagel commented Aug 11, 2020

kubeControllerManager:
  endpoints:
    - <ip_of_your_master_node>   # e.g. 192.168.1.38
kubeScheduler:
  endpoints:
    - <ip_of_your_master_node>   # e.g. 192.168.1.38

This fixed my problems.

@cubic3d

cubic3d commented Aug 23, 2020

@ioagel @onedr0p did you find a way to get kubeProxy working or is it the only component without metrics access?

@lictw

lictw commented Sep 19, 2020

+1 for KubeProxy

@djhoese

djhoese commented Oct 4, 2020

Following @ioagel's advice I got the controller manager and scheduler to work for my K3s cluster. I ended up having to disable (enabled: false) etcd and proxy for my single-node test cluster. Thanks @ioagel.
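
Putting @ioagel's endpoint hint and this comment together, the relevant kube-prometheus-stack values for a single-node k3s cluster look roughly like this (the IP is a placeholder for your server node):

kubeControllerManager:
  endpoints:
    - 192.168.1.38     # assumed k3s server/master IP
kubeScheduler:
  endpoints:
    - 192.168.1.38
kubeEtcd:
  enabled: false       # default single-node k3s uses SQLite, so there is no etcd to scrape
kubeProxy:
  enabled: false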

@TiemenSch

I tried getting the https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack chart to run on my K3s cluster of three RPi4s, but sadly some of the images aren't proper multi-arch images (e.g. they fail with standard_init_linux.go:211: exec user process caused "exec format error"). I used the following HelmChart spec:

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: kube-prometheus-stack
  namespace: kube-system
spec:
  chart: kube-prometheus-stack
  repo: https://prometheus-community.github.io/helm-charts
  targetNamespace: monitoring

So, what would be the simplest (best practice) way to deploy a minimal installation of Prometheus and Grafana and perhaps point them to the metrics-server afterwards?

Some of the guides on the internet immediately start using all sorts of templated helper repositories, but that doesn't serve as an easy-to-understand minimal baseline installation. A tutorial installation IMHO shouldn't rely on any custom repos, but rather use the conventional ones where possible.

billimek added a commit to billimek/k8s-gitops that referenced this issue Nov 10, 2020
... After upgrading to k3s 1.19 (from 1.18), the prometheus target scraping for those two targets stopped working,

`Get "http://10.2.0.30:10252/metrics": dial tcp 10.2.0.30:10252: connect: connection refused`

Looking at k3s-io/k3s#425 (comment) suggests the endpoint approach should work.  Experimenting with removing the explicit endpoint callout to see if there is an improvement.

Signed-off-by: Jeff Billimek <jeff@billimek.com>
@billimek

billimek commented Nov 10, 2020

It appears, for me at least, that after upgrading from k3s 1.18 to 1.19 the explicit endpoint approach stopped working.

I suspect that there is now a firewall rule preventing connections to the endpoints on ports 10251 & 10252 from anywhere other than 127.0.0.1.

edit: This commit seems to be the culprit: 4808c4e#diff-c68274534954d72488196ca23f12cfb3ebe65998d9e7c4a43d7ba9acc9532574

@cablespaghetti

cablespaghetti commented Jan 30, 2021

This should help people a bit :)

prometheus-community/helm-charts#626

Also I try to keep this repo up-to-date which is a bit of a quick start: https://github.com/cablespaghetti/k3s-monitoring

@onedr0p
Contributor

onedr0p commented Apr 4, 2021

This should help people a bit :)

prometheus-community/helm-charts#626

Also I try to keep this repo up-to-date which is a bit of a quick start: https://github.com/cablespaghetti/k3s-monitoring

Are you able to monitor the kube controller manager, kube scheduler, or kube proxy? I've looked at your repo and saw your PR over at kube-prometheus-stack, but it seems like nothing works on k3s to get these monitored.

@brandond I would love to hear how this worked for you; it doesn't seem like I'm doing anything wrong in my Helm values. You can take a look at them over at:

https://github.com/onedr0p/home-cluster/blob/main/cluster/monitoring/kube-prometheus-stack/helm-release.yaml

As soon as I enable those metrics (with or without an endpoint) they will not be scraped and the target will appear as down in Prometheus.

It would be great to get more eyes on this, as anyone rolling out k3s in a production environment would be wise to have these metrics collected.

Let me know if you need any more information.

@cablespaghetti

The people maintaining kube-prometheus-stack unfortunately didn't like the PR due to the level of tweaking required to get k3s working. As such I'm not sure it's possible with the main chart right now.

The way things work with k3s is that the API server endpoint gives you metrics from the controller manager and scheduler as well. So you'll probably have all the metrics, but the Helm chart's rules and dashboards don't expect them to be tagged with job=apiserver.

I'm not sure how kube proxy works off the top of my head but it may well be the same.

The way I see it, there are two options: maintain a fork of the chart, or have an option in k3s to split out the metrics endpoints in a "more standard" way that is compatible with the chart as it stands.

@ThomasADavis

In a separate issue, the Rancher monitoring stack now uses a forked version of PushProx to get many of the stats bound to localhost from a single port.

To see it in action without loading up all of Rancher, try this manifest file (drop it in /var/lib/rancher/k3s/server/manifests). You'll get the operator ServiceMonitor and 4 or 5 of the sets of stats from a single exporter.

apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: pushprox
  namespace: monitoring
spec:
  chart: https://charts.rancher.io/assets/rancher-pushprox/rancher-pushprox-0.1.201.tgz
  metricsPort: 10249
  component: k3s-server
  valuesContent: |-
    serviceMonitor:
      enabled: true
    clients:
      port: 10013
      useLocalhost: true
      tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"

To see what stats this enables, do a 'curl -s http://localhost:10249/metrics'

@onedr0p
Contributor

onedr0p commented Apr 5, 2021

@ThomasADavis I had to update your config:

---
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: pushprox
  namespace: monitoring
spec:
  chart: https://charts.rancher.io/assets/rancher-pushprox/rancher-pushprox-0.1.201.tgz
  targetNamespace: monitoring
  valuesContent: |-
    metricsPort: 10249
    component: k3s-server
    serviceMonitor:
      enabled: true
    clients:
      port: 10013
      useLocalhost: true
      tolerations:
      - effect: "NoExecute"
        operator: "Exists"
      - effect: "NoSchedule"
        operator: "Exists"

However, while that does get these components monitored, they do not work out of the box with the default Prometheus rules or Grafana dashboards shipped with kube-prometheus-stack.

Prometheus: (screenshot)

Grafana: (screenshot)

@nvtkaszpir

Wild idea: kube-prometheus-stack + Prometheus relabeling of the k3s metrics to make them look like a standard k8s deployment? (Rough sketch below.)
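
A rough sketch of that relabeling idea, assuming kube-prometheus-stack's kubeApiServer.serviceMonitor.metricRelabelings value and the claim above that the k3s apiserver endpoint also serves scheduler metrics; the metric-name-prefix match is illustrative only:

kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      # Re-tag scheduler metrics scraped from the combined k3s endpoint so the
      # stock rules/dashboards, which expect job="kube-scheduler", can find them.
      - sourceLabels: [__name__]
        regex: scheduler_(.+)
        targetLabel: job
        replacement: kube-scheduler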

@RouNNdeL

I think that if there were an option to bind the controller manager and the scheduler to 0.0.0.0 in k3s, then it should work when combined with:

kubeControllerManager:
  endpoints:
    - <ip_of_your_master_node>   # e.g. 192.168.1.38
kubeScheduler:
  endpoints:
    - <ip_of_your_master_node>   # e.g. 192.168.1.38

Is there such an option or is the bind address hardcoded?

@onedr0p
Contributor

onedr0p commented Apr 18, 2021

@RouNNdeL see this commit 4808c4e

Before this change, k3s configured the scheduler and controller's insecure ports to listen on 0.0.0.0. Those ports include pprof, which provides a DoS vector at the very least.

These ports are only enabled for componentstatus checks in the first place, and componentstatus is hardcoded to only do the check on localhost anyway (see https://github.com/kubernetes/kubernetes/blob/v1.18.2/pkg/registry/core/rest/storage_core.go#L341-L344), so there shouldn't be any downside to switching them to listen only on localhost.

It is hardcoded.

@RouNNdeL

Is there any chance we will see this implemented as an option? We would have to accept the security risks of enabling it, but I'd be fine with that.

@onedr0p
Contributor

onedr0p commented Jul 12, 2021

I was able to get etcd monitored in kube-prometheus-stack in a standard way:

  • Set on the k3s servers: --etcd-expose-metrics=true
  • Set in the kube-prometheus-stack config:
    kubeEtcd:
      enabled: true
      endpoints:
      - IP of k3s master 1
      - IP of k3s master 2
      - IP of k3s master 3
      service:
        enabled: true
        port: 2381
        targetPort: 2381

Default dashboard shipped with kube-prometheus-stack:

(screenshot)
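
The --etcd-expose-metrics=true flag from the first step can also be set in the k3s config file instead of on the command line; a minimal sketch, assuming the standard /etc/rancher/k3s/config.yaml flag-to-key mapping:

# /etc/rancher/k3s/config.yaml on each server node
etcd-expose-metrics: true   # expose embedded etcd metrics beyond localhost (scraped on port 2381 above)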

@onedr0p
Contributor

onedr0p commented Jul 12, 2021

I have got all components monitored again:

#3619 (comment)

I believe this issue can be closed!

@cwayne18
Collaborator

Thanks!
