Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus [Input] plugin - Optimizing for bigger kubernetes clusters (500+ pods) when scraping thru 'monitor_kubernetes_pods' #8762

Merged
merged 26 commits into from
Mar 8, 2021

Conversation

gracewehner
Copy link
Contributor

Required for all PRs:

  • Associated README.md updated.
  • Has appropriate unit tests.

PR for the issue #8705.

If monitor_kubernetes_pods = true and the new monitor_kubernetes_pods_version = 2 is specified, the pod list is queried locally on the node every 60s instead of using the watch api. This allows for scalability for large clusters so the pods to scrape are distributed among the nodes, which requires telegraf to be run as a daemonset. The same annotations to choose which pods to scrape are used and the optional filtering by namespace and selectors is also the same, with this filtering done afterwards. The feature is backwards compatible and is used if and only if monitor_kubernetes_pods_version = 2 is specified in the config.

Copy link
Contributor

@telegraf-tiger telegraf-tiger bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤝 ✅ CLA has been signed. Thank you!

@lgtm-com
Copy link

lgtm-com bot commented Jan 27, 2021

This pull request introduces 1 alert when merging 49336fa into d415d9f - view on LGTM.com

new alerts:

  • 1 for Disabled TLS certificate check

@lgtm-com
Copy link

lgtm-com bot commented Jan 27, 2021

This pull request introduces 1 alert when merging 79e0e71 into d415d9f - view on LGTM.com

new alerts:

  • 1 for Disabled TLS certificate check

@lgtm-com
Copy link

lgtm-com bot commented Jan 28, 2021

This pull request introduces 1 alert when merging cdc439c into d415d9f - view on LGTM.com

new alerts:

  • 1 for Disabled TLS certificate check

@gracewehner
Copy link
Contributor Author

#8705

@vishiy
Copy link
Contributor

vishiy commented Feb 3, 2021

@ssoroka - can you please look for this to be merged ?

Copy link
Contributor

@ssoroka ssoroka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. I'd like to see more tests around the cAdvisor code if possible, as it looks as if it will panic when run.

@@ -33,6 +33,10 @@ in Prometheus format.
## - prometheus.io/path: If the metrics path is not /metrics, define it with this annotation.
## - prometheus.io/port: If port is not 9102 use this annotation
# monitor_kubernetes_pods = true
## Get the list of pods to scrape from either:
## - version 1 (default): the kubernetes watch api (cluster-wide)
## - version 2: the local cadvisor api (node-wide); for scalability. Note that the environment variable NODE_IP must be set to the host IP.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about accepting node_ip as a string parameter? I find environment variables aren't great for discoverability. I'm not against it also setting the value from env NODE_IP by default in init()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this will be solving scalability, there will be a large number of nodes. For this new way, telegraf will need to be run on every node, so having to put a different node ip value for every telegraf config could be impractical.

Also, if the environment variable does not exist, then we would need as many kube api calls as there are during startup which could overload the kube api server at scale. I also have instructions in the readme for how to get the environment variable.

Please let me know what you think.

plugins/inputs/prometheus/README.md Outdated Show resolved Hide resolved
plugins/inputs/prometheus/kubernetes.go Outdated Show resolved Hide resolved
plugins/inputs/prometheus/kubernetes.go Outdated Show resolved Hide resolved
plugins/inputs/prometheus/kubernetes.go Outdated Show resolved Hide resolved
plugins/inputs/prometheus/kubernetes.go Outdated Show resolved Hide resolved
plugins/inputs/prometheus/kubernetes.go Show resolved Hide resolved
plugins/inputs/prometheus/kubernetes.go Show resolved Hide resolved
plugins/inputs/prometheus/kubernetes.go Outdated Show resolved Hide resolved
plugins/inputs/prometheus/prometheus.go Outdated Show resolved Hide resolved
@gracewehner
Copy link
Contributor Author

Hi @ssoroka, I apologize for the delay. I have addressed and/or responded to your comments. Could please take another look? Thank you!

plugins/inputs/prometheus/kubernetes.go Show resolved Hide resolved
Comment on lines 161 to 164
p.nodeIP = os.Getenv("NODE_IP")
if p.nodeIP == "" {
return errors.New("The environment variable NODE_IP is not set. Cannot get pod list for monitor_kubernetes_pods using node scrape scope")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider defaulting this to os.Getenv("NODE_IP") and allowing it to be set from config. ... actually, I think it's a requirement that all settings need to be configurable from config settings, though env variable defaults are fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added this as a config setting. If the config for this is empty or the specified ip is invalid, it then checks if the env var exists and is valid

@ssoroka ssoroka merged commit d7df2c5 into influxdata:master Mar 8, 2021
@ssoroka
Copy link
Contributor

ssoroka commented Mar 8, 2021

Thanks for all the work here, much appreciated!

@rashmichandrashekar
Copy link

Thanks for all the work here, much appreciated!

Thanks @ssoroka, @vishiy and @gracewehner!

ssoroka pushed a commit that referenced this pull request Mar 10, 2021
… (500+ pods) when scraping thru 'monitor_kubernetes_pods' (#8762)

(cherry picked from commit d7df2c5)
arstercz pushed a commit to arstercz/telegraf that referenced this pull request Mar 5, 2023
… (500+ pods) when scraping thru 'monitor_kubernetes_pods' (influxdata#8762)

(cherry picked from commit d7df2c5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants