Metrics collection profiles #1298

Closed
26 commits
edf8e8e
Adds skeleton for monitoring/scrape-profiles
JoaoBraveCoding Dec 7, 2022
e948486
monitoring/scrape-profiles-v1
JoaoBraveCoding Dec 12, 2022
f2e2dc5
Re-writes some of the paragraphs, fixes typos and renames operational to
JoaoBraveCoding Dec 14, 2022
061c590
Added analysis to motivation, added more alternatives, added concrete
JoaoBraveCoding Dec 21, 2022
7d210b7
Apply Simon suggestions from code review
JoaoBraveCoding Jan 20, 2023
995f2de
Addressed Simon review
JoaoBraveCoding Jan 20, 2023
719b231
Adds reference to Hypershift implementation of scrape profiles
JoaoBraveCoding Feb 7, 2023
75f9e97
Apply suggestions from code review
JoaoBraveCoding Jul 19, 2023
9451562
Renamed feature to metrics collection profiles
JoaoBraveCoding Jul 19, 2023
b961378
Address rexagod review comments
JoaoBraveCoding Jul 19, 2023
70da408
Fix some indentation
JoaoBraveCoding Jul 19, 2023
61c414a
Commit Jan suggestion
JoaoBraveCoding Jul 19, 2023
c3910ee
Addressed Jan review comments
JoaoBraveCoding Jul 21, 2023
6b2963b
Apply suggestions from code review
JoaoBraveCoding Jul 24, 2023
b8ce949
Addressed Pranshu review comments
JoaoBraveCoding Jul 24, 2023
74e33fd
Address Pranshu review
JoaoBraveCoding Jul 27, 2023
650c20c
Address Simon review comments
JoaoBraveCoding Aug 2, 2023
7a07c6f
Fix typo
JoaoBraveCoding Aug 2, 2023
755ded3
add documentation for CPV
rexagod Oct 10, 2023
cf7bedd
Update enhancements/monitoring/metrics-collection-profiles.md
rexagod Oct 16, 2023
46aa3b9
Update enhancements/monitoring/metrics-collection-profiles.md
rexagod Oct 16, 2023
b8b25d4
Update enhancements/monitoring/metrics-collection-profiles.md
rexagod Oct 16, 2023
7e37545
Merge pull request #1 from rexagod/scrape-profiles-post-cpv
JoaoBraveCoding Oct 18, 2023
f7694e7
Apply suggestions from juzhao code review
JoaoBraveCoding Oct 20, 2023
38f58d2
Address some other review comments
JoaoBraveCoding Oct 20, 2023
48a5c98
Fixed markdown test
JoaoBraveCoding Nov 2, 2023
273 changes: 273 additions & 0 deletions enhancements/monitoring/scrape-profiles.md
@@ -0,0 +1,273 @@
---
title: scrape-profiles
authors:
- @JoaoBraveCoding
reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect"
- @openshift/openshift-team-monitoring
approvers:
- TBD
api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None"
- TBD
creation-date: 2022-12-06
last-updated: yyyy-mm-dd
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- https://issues.redhat.com/browse/MON-2483
see-also:
- "/enhancements/this-other-neat-thing.md"
replaces:
- "/enhancements/that-less-than-great-idea.md"
superseded-by:
- "/enhancements/our-past-effort.md"
---

# Scrape profiles

## Summary

The core OpenShift components ship a large number of metrics. A 4.12-nightly
cluster on AWS currently produces around 350K series by default, and enabling
additional add-ons increases that number. Users have repeatedly asked for a supported
method of making Prometheus consume less memory, either by increasing the scrape
interval (which isn't a direction we would like to follow) or by scraping fewer targets
-- for example, modifying the ServiceMonitors to drop metrics they do not need.

These modifications are currently not possible because the service monitors deployed to OCP
are managed by operators and cannot be modified. This proposal outlines a solution that allows
users to set a level of scraping aligned with their needs.

## Motivation

OpenShift is an opinionated platform, and the default set of metrics has of course
been crafted to match what we think the majority of users might need. Nevertheless,
users have repeatedly asked for the ability to reduce the amount of memory consumed by
Prometheus, either by lowering the Prometheus scrape frequency or by modifying ServiceMonitors.

Users currently cannot control the ServiceMonitors scraped by Prometheus, since some of the
metrics collected are essential for other parts of the system to function properly
(console, HPA and alerts). Users are also not allowed to tune the interval at which Prometheus
scrapes targets, as this again can have unforeseen results that hinder the platform: a very
short interval may overwhelm the platform Prometheus instance, while a very long interval may
render some of the default alerts ineffective.

The goal of this proposal is to allow users to pick their desired level of scraping while limiting
the impact this might have on the platform, via resources under the control of the
cluster-monitoring-operator and other platform operators.

### User Stories

- As an OpenShift user, I want a supported way to lower the amount of memory consumed by Prometheus by choosing between different scrape profiles, e.g. `full` or `operational`.
- As an OpenShift developer, I want a supported way to collect a subset of the metrics exported by my operator depending on the distribution.

### Goals

- Give users a supported and documented method to pick the amount of metrics collected by the platform, in a way that allows cluster-monitoring-operator and other operators to provide their own defaulting.
- Other components of the cluster (console, HPA, alerts) should not be negatively affected by this change.
- OpenShift developers may optionally adopt this change; developers that do not adopt it should not be impacted.

### Non-Goals

- Dynamically adjusting metrics scraped at runtime based on heuristics.

## Proposal

This enhancement leverages label selectors to allow cluster-monitoring-operator to select the appropriate monitors ([Pod|Service]Monitor) for a given scrape profile.

Given this, the following steps would be necessary to implement this enhancement:

- cluster-monitoring-operator has to be updated to support the logic of the new field;
- OCP components that would like to implement scrape profiles would have to provide one monitor per scrape profile that we support.

### Workflow Description

The cluster-monitoring-operator (CMO) will be extended with a new field, `scrapeProfile`, under `prometheusK8s`. This new field will allow users to configure their desired scrape profile while preserving the platform functionality (console, HPA and alerts).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      scrapeProfile: full
```

The different profiles would be pre-defined by us. Once a profile is selected, CMO populates the main Prometheus CR to select resources that implement the requested profile -- using [pod|service]MonitorSelector and the label `monitoring.openshift.io/scrape-profile` (profile label) -- this way Prometheus would select monitors from two sets:

- monitors with the profile label set to the requested value (profile)
- monitors without the profile label present (in addition to the current namespace selector)

This way, it would be up to OpenShift teams whether they want to adopt this feature. Without any change to a monitor, if a user picks a profile in the CMO config, things should work as they did before. When an OpenShift team wants to implement scrape profiles, they need to provide monitors for all profiles and make sure they do not ship monitors without the profile label. If a component does not provide a monitor labeled with a given profile, it will not be scraped at all when that profile is selected.

Initially, the goal is to support two profiles:

- `full` (same as today)
- `operational` (only collect metrics necessary for alerts, recording rules, telemetry and dashboards)

When the cluster admin enables the `operational` profile, the k8s Prometheus resource would be configured accordingly:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: openshift-monitoring
spec:
  serviceMonitorSelector:
    matchExpressions:
    - key: monitoring.openshift.io/scrape-profile
      operator: NotIn
      values:
      - "full"
```
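
For illustration only, the following Go sketch shows one way CMO could derive such a selector from the configured profile; the helper and the `knownProfiles` list are assumptions for this example, not the actual cluster-monitoring-operator code.

```go
package manifests

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const scrapeProfileLabel = "monitoring.openshift.io/scrape-profile"

// knownProfiles lists every profile CMO is aware of (hypothetical).
var knownProfiles = []string{"full", "operational"}

// monitorSelectorFor builds a selector that matches monitors labeled with
// the requested profile as well as monitors without the profile label,
// since NotIn also matches objects that lack the key entirely.
func monitorSelectorFor(profile string) *metav1.LabelSelector {
	var excluded []string
	for _, p := range knownProfiles {
		if p != profile {
			excluded = append(excluded, p)
		}
	}
	if len(excluded) == 0 {
		// Only one profile exists; select every monitor.
		return &metav1.LabelSelector{}
	}
	return &metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{{
			Key:      scrapeProfileLabel,
			Operator: metav1.LabelSelectorOpNotIn,
			Values:   excluded,
		}},
	}
}
```

With `profile = "operational"`, this produces the `NotIn ["full"]` expression shown in the manifest above.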

An OpenShift team that wants to support the scrape profiles feature would need to provide one monitor per profile (in this example, one ServiceMonitor per profile, two in total).

```yaml
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    monitoring.openshift.io/scrape-profile: full
  name: foo-full
  namespace: openshift-bar
spec:
  endpoints:
  - port: metrics
  selector:
    matchLabels:
      app.kubernetes.io/name: foo
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    monitoring.openshift.io/scrape-profile: operational
  name: foo-operational
  namespace: openshift-bar
spec:
  endpoints:
  - port: metrics
    metricRelabelings:
    - sourceLabels: [__name__]
      action: keep
      regex: "requests_total|requests_failed_total"
  selector:
    matchLabels:
      app.kubernetes.io/name: foo
```

#### Variation [optional]

NA

### API Extensions

- Addition of a new field to the `cluster-monitoring-config` ConfigMap

```go
type PrometheusK8sConfig struct {
	// Defines the scraping profile that will be enforced on the platform
	// prometheus instance. Possible values are full and operational.
	//
	// +kubebuilder:validation:Enum=full;operational
	// +optional
	ScrapeProfile string `json:"scrapeProfile,omitempty"`
}
```
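
As a rough sketch of how the new field could be consumed, the snippet below decodes the `config.yaml` key of the `cluster-monitoring-config` ConfigMap and falls back to `full` when no profile is set; the helper name and the defaulting behavior are assumptions for this example, not CMO's actual config handling.

```go
package config

import (
	"fmt"

	"sigs.k8s.io/yaml"
)

// ClusterMonitoringConfig models only the part of the
// cluster-monitoring-config ConfigMap relevant to this proposal.
type ClusterMonitoringConfig struct {
	PrometheusK8s PrometheusK8sConfig `json:"prometheusK8s,omitempty"`
}

type PrometheusK8sConfig struct {
	ScrapeProfile string `json:"scrapeProfile,omitempty"`
}

// parseScrapeProfile decodes the config.yaml payload, defaults the profile
// to "full" when unset (assumed behavior) and rejects unknown values.
func parseScrapeProfile(configYAML string) (string, error) {
	var cfg ClusterMonitoringConfig
	if err := yaml.Unmarshal([]byte(configYAML), &cfg); err != nil {
		return "", fmt.Errorf("invalid cluster-monitoring-config: %w", err)
	}
	switch p := cfg.PrometheusK8s.ScrapeProfile; p {
	case "":
		return "full", nil
	case "full", "operational":
		return p, nil
	default:
		return "", fmt.Errorf("unsupported scrapeProfile %q", p)
	}
}
```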

### Implementation Details/Notes/Constraints [optional]

Each OpenShift team that wants to adopt this feature will be responsible for providing the different monitors. This work is not trivial: dependencies between operators exist, which makes it unclear for developers whether it is really fine for their operator not to expose a given metric in the `operational` profile. To help teams with this problem, we would provide a tool that consumes a monitor and generates the `operational` monitor for that operator. However, the tool would need access to an up-to-date installation of OpenShift in order to query the Prometheus instance running in the cluster.
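
To illustrate the kind of output such a tool could produce, here is a small Go sketch that builds the `keep` regex for an `operational` monitor from a list of required metric names; how the tool discovers which metrics are actually required (alerts, recording rules, telemetry, dashboards) is left out, and the function is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// keepRegexFor joins the metric names required by alerts, recording rules,
// telemetry and dashboards into the regex used by a "keep" metric relabeling.
func keepRegexFor(requiredMetrics []string) string {
	names := append([]string(nil), requiredMetrics...)
	sort.Strings(names)
	for i, n := range names {
		names[i] = regexp.QuoteMeta(n) // match metric names literally
	}
	return strings.Join(names, "|")
}

func main() {
	// Hypothetical input gathered by querying the in-cluster Prometheus.
	required := []string{"requests_total", "requests_failed_total"}
	fmt.Println(keepRegexFor(required))
	// Prints: requests_failed_total|requests_total
}
```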

### Risks and Mitigations

- How are monitors supposed to be kept up to date? For example, in 4.12 a metric that wasn't previously used in an alert is now required; how does the monitor responsible for that metric get updated?
  - The tool we provide would run in CI and ensure all ServiceMonitors are up to date.

- A new profile is added: what will happen to operators that had implemented scrape profiles but did not implement the latest profile?
  - TBD. Currently, according to the description above, they would not be scraped when the new profile is selected.

### Drawbacks

- Extra CI cycles

## Design Details

### Open Questions [optional]

- Should we add more profiles in the future?
- Should developers have to comply with all profiles?

### Test Plan

- Unit tests in CMO to validate that the correct monitors are being selected (see the sketch below)
- E2E tests in CMO to validate that the feature works end to end
- Testing in openshift/CI to validate that every metric being used exists in the cluster? (unsure about this one)
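
As an example of the kind of unit test the first bullet describes, here is a minimal sketch built around the selector shown earlier; the package name and test structure are illustrative, not CMO's actual test code.

```go
package profiles_test

import (
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// TestOperationalSelector verifies that the selector generated for the
// "operational" profile keeps operational and unlabeled monitors while
// dropping monitors labeled for the "full" profile.
func TestOperationalSelector(t *testing.T) {
	sel, err := metav1.LabelSelectorAsSelector(&metav1.LabelSelector{
		MatchExpressions: []metav1.LabelSelectorRequirement{{
			Key:      "monitoring.openshift.io/scrape-profile",
			Operator: metav1.LabelSelectorOpNotIn,
			Values:   []string{"full"},
		}},
	})
	if err != nil {
		t.Fatalf("invalid selector: %v", err)
	}

	cases := []struct {
		name   string
		labels labels.Set
		want   bool
	}{
		{"operational monitor", labels.Set{"monitoring.openshift.io/scrape-profile": "operational"}, true},
		{"unlabeled monitor", labels.Set{}, true},
		{"full monitor", labels.Set{"monitoring.openshift.io/scrape-profile": "full"}, false},
	}
	for _, c := range cases {
		if got := sel.Matches(c.labels); got != c.want {
			t.Errorf("%s: expected match=%v, got %v", c.name, c.want, got)
		}
	}
}
```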

### Graduation Criteria

Plan to release as Dev Preview initially.

#### Dev Preview -> Tech Preview

- Ability to utilize the enhancement end to end
- End user documentation
- Sufficient test coverage
- Gather feedback from users rather than just developers

#### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing
- User facing documentation created in [openshift-docs](https://github.com/openshift/openshift-docs/)

**For non-optional features moving to GA, the graduation criteria must include
end to end tests.**

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

- Upgrade and downgrade are not expected to present a significant challenge. The new field is a small layer over existing stable APIs.

### Version Skew Strategy

TBD but I don't think it applies here

### Operational Aspects of API Extensions

TBD but I don't think it applies here

#### Failure Modes

TBD but I don't think it applies here

#### Support Procedures

TBD but I don't think it applies here

## Implementation History

Initial proofs-of-concept:

- https://github.com/openshift/cluster-monitoring-operator/pull/1785

## Alternatives

- Add a separate container to prometheus-operator (p-o) that would be used by p-o to modify the Prometheus configuration according to a scrape profile.
  - This container would perform an analysis of which metrics are being used and provide prometheus-operator with this list.
  - With that list, prometheus-operator would know which scrape targets it could change while still keeping certain metrics.

## Infrastructure Needed [optional]

None.