Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Advanced Configuration #46

Merged
merged 7 commits into from
Jun 27, 2018
Merged

Conversation

DirectXMan12
Copy link
Contributor

@DirectXMan12 DirectXMan12 commented Feb 16, 2018

This commit introduces advanced configuration. The rate-interval and
label-prefix flags are deprecated, and replaced by a configuration file
that allows you to specify series queries and the rules for transforming
those into metrics queries and API resources.

@DirectXMan12
Copy link
Contributor Author

cc @brancz

An example config can be found in the config package, as generated by the DefaultConfig function. I want to write a few more unit tests, and maybe simplify the config a bit, and/or make whitelisting easier/less verbose.

@DirectXMan12 DirectXMan12 changed the title Advanced Configuration [WIP] Advanced Configuration Feb 16, 2018
@DirectXMan12
Copy link
Contributor Author

(P.S. feel free to tell me if you find the configuration overly convoluted)

@brancz
Copy link
Contributor

brancz commented Feb 17, 2018

I’m on vacation until the 26., but I overflow this and generally this goes in the same direction as I was thinking, so if urgent feel free to move forward without a thorough review by me.

@DirectXMan12
Copy link
Contributor Author

We can wait a bit. I'd like to get a more detailed review from you, and it gives time to get end-user feedback.

# this also introduces an implicit filter on metric family names
naming:
preifx: "container_"
suffix: "_seconds_total"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be generalized to a regexp replace maybe?

@tcolgate
Copy link

This looks like a more useful approach. A couple of things that would be handy:

  • dynamic config reload, either watch the config and reload, or provide a web endpoint
  • potentially multiple config files to allow them to be written separately.
  • The prefix/suffic options might be better done as a regex match and replace

I need to play with it more to see how the resource mapping stuff works.
It's a shame that the custom metrics API seem to reclude the idea of a user dumping a query straight into a HPA config, but if we must preconfigure the metrics that are available, this seems like the most adptable approach.
(I'd rebuilding our prom sidecar at the moment and am probably going to include a customresource for providing chunks of config for k8s-prometheus-adapter, so users will be able to add custom metrics by creating these rules)

Copy link
Contributor

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to play with this in the wild a bit more, but broadly this is really awesome and looks good to me.

I think @tcolgate's comment on regexing is justified. Personally I think hot-reloading the config is not really necessary, I'd expect one to roll out a new deployment + config if there is a change, meaning a completely new pod. This makes failure scenarios a lot easier to deal with as they appear at start rather than at runtime.

@@ -85,6 +86,12 @@ func NewCommandStartPrometheusAdapterServer(out, errOut io.Writer, stopCh <-chan
flags.StringVar(&o.LabelPrefix, "label-prefix", o.LabelPrefix,
"Prefix to expect on labels referring to pod resources. For example, if the prefix is "+
"'kube_', any series with the 'kube_pod' label would be considered a pod metric")
flags.StringVar(&o.MetricsConfigFile, "metrics-config", o.MetricsConfigFile,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say this should just be --config, there's nothing about metrics of the adapter in here. Just a nit, feel free to ignore if you disagree.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the name of the flag

# specify that the `container_` and `_seconds_total` suffixes should be removed.
# this also introduces an implicit filter on metric family names
naming:
preifx: "container_"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

# attach only pod and namespace resources by mapping label names to group-resources
overrides:
namespace: {resource: "namespace"},
pod_name: {resource: "pod"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

honestly I think this is a change we should just make upstream, it's annoying that cadvisor labels are not following the official guidelines for instrumentation, it makes joining metrics unnecessarily hard

Copy link
Contributor Author

@DirectXMan12 DirectXMan12 Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, I've proposed it before. It'll take some doing though :-/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've heard "metrics are completely left out of stability guarantees for Kubernetes" often enough that I think it's a legit change for the better. Let's bring it up in the next sig meeting.

@DirectXMan12
Copy link
Contributor Author

That look better re the regex thing?

Copy link
Contributor

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick check in from my review, I'm not completely done yet.

// will be of the form `<prefix>${.Resource}$`, cadvisor series will be
// of the form `container_`, and have the label `pod_name`. Any series ending
// in total will be treated as a rate metric.
func DefaultConfig(rateInterval time.Duration, labelPrefix string) *MetricsDiscoveryConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why this can't just be a default config as in a yaml file? It seems odd that we have this "default" but double-parameterizable config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The in-program one makes the existing flags still work, so it's an easier switch-over. If we have no desire to do that, the default config can be moved to a YAML file. It also provides a nice out-of-the-box experience.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way people break is the same once we remove the flags, so I'd say let's break immediately and use the YAML file config option.

# This is a Go template where the `.Series` and `.LabelMatchers` string values
# are available, and the delimiters are `${` and `}$` to avoid conflicts with
# the prometheus query language
metricsQuery: "sum(rate(${.Series}${${.LabelMatchers}$,container_name!="POD"}[2m])) by (${.GroupBy}$)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to test this PR and tried the example config. I found this line should be metricsQueries and not metricsQuery based on the value in config.go#L34.

Similarly for the other rules in this config

@@ -302,17 +285,49 @@ func (l *cachingMetricsLister) RunUntil(stopChan <-chan struct{}) {
func (l *cachingMetricsLister) updateMetrics() error {
startTime := pmodel.Now().Add(-1 * l.updateInterval)

sels := l.Selectors()
// don't do duplicate queries when it's just the matchers that change
seriesCacheByQuery := make(map[prom.Selector][]prom.Series)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This map has concurrency issues. Getting concurrent map writes and concurrent map read and map write.

Would it be worth using a syncmap for this? I've built locally and wrapped all uses within a mutex locak and unlock cycle, seems to fix the issue.

@bradenwright
Copy link

Any update on this, I'd like to start using this. And don't mind contributing back if needed, but we need to scale on rabbitmq queue length which currently isn't possible. I think this PR should fix #42 though.

@lumeche
Copy link

lumeche commented May 24, 2018

Do you have any ETA on when this will fixed? Im having issues in my environment where the prometheus adapter is taking way too much RAM.

I could do a temporal fix in my side (fork) until this is merged.

This commit introduces advanced configuration.  The rate-interval and
label-prefix flags are removed, and replaced by a configuration file
that allows you to specify series queries and the rules for transforming
those into metrics queries and API resources.
@DirectXMan12
Copy link
Contributor Author

Hey, sorry this took so long to update. I think this should address remaining concerns (concurrent map access should be fixed, better docs in place, default config is just a YAML file, etc). Still have to run through final in-cluster tests, but if those pass, this should be good to merge monday :-).

Please take a look and let me know what you think!

NB: the delimiters in the templates in the config changed from ${ and }$ to << and >>. This should make things much easier for a human to read, while still not conflicting with PromQL.

I find it to be a bit faster in some cases, and easier to work with.
@brancz
Copy link
Contributor

brancz commented Jun 25, 2018

If as you said the in cluster tests succeed smoothly this lgtm

DirectXMan12 and others added 4 commits June 27, 2018 16:56
This moves the DefaultConfig method out into a helper to generate legacy
configuration.  Passing in a config file is now required.
This updates the documentation and README to have information on the
configuration file format.
This fixes asynchronous read/write issues to when performing series
discovery by pushing series results onto a channel, instead of trying to
write them directly to a map.
This makes the makefile's build target have actual dependencies, so that
it only rebuilds any given adapter if that adapter's actual go files
have changed (yes, this is mostly redundant with Go 1.10, but it makes
working on read-only filesystems a bit nicer).
This updates our dependencies to the Kubernetes 1.11 versions.
In the future, this will also allow us to support the external
metrics API.
@DirectXMan12
Copy link
Contributor Author

DirectXMan12 commented Jun 27, 2018

In-cluster testing has succeeded. Need to stand up integration tests for that at some point. Merging when tests complete.

@DirectXMan12 DirectXMan12 merged commit 7b606a7 into master Jun 27, 2018
@DirectXMan12 DirectXMan12 deleted the feature/advanced-config branch August 23, 2018 19:28
dgrisonnet pushed a commit to dgrisonnet/k8s-prometheus-adapter that referenced this pull request Mar 31, 2021
…ncy-openshift-4.8-ose-prometheus-adapter

Updating ose-prometheus-adapter builder & base images to be consistent with ART
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants