MON-2693: Scrape profiles #1785

Merged (12 commits, Mar 3, 2023)
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -2,6 +2,7 @@

## 4.13

- [#1785](https://github.com/openshift/cluster-monitoring-operator/pull/1785) Adds support for CollectionProfiles TechPreview
- [#1830](https://github.com/openshift/cluster-monitoring-operator/pull/1830) Add alert KubePodNotScheduled
- [#1843](https://github.com/openshift/cluster-monitoring-operator/pull/1843) Node Exporter ignores network interface under name "enP.*".
- [#1860](https://github.com/openshift/cluster-monitoring-operator/pull/1860) Adds runbook for PrometheusRuleFailures
1 change: 1 addition & 0 deletions Documentation/api.md
@@ -295,6 +295,7 @@ The `PrometheusK8sConfig` resource defines settings for the Prometheus component
| retentionSize | string | Defines the maximum amount of disk space used by data blocks plus the write-ahead log (WAL). Supported values are `B`, `KB`, `KiB`, `MB`, `MiB`, `GB`, `GiB`, `TB`, `TiB`, `PB`, `PiB`, `EB`, and `EiB`. By default, no limit is defined. |
| tolerations | [][v1.Toleration](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#toleration-v1-core) | Defines tolerations for the pods. |
| topologySpreadConstraints | []v1.TopologySpreadConstraint | Defines the pod's topology spread constraints. |
| collectionProfile | CollectionProfile | Defines the metrics collection profile that Prometheus uses to collect metrics from the platform components. Supported values are `full` or `minimal`. In the `full` profile (default), Prometheus collects all metrics exposed by the platform components. In the `minimal` profile, Prometheus collects only the metrics necessary for the default platform alerts, recording rules, telemetry, and console dashboards. |
| volumeClaimTemplate | *[monv1.EmbeddedPersistentVolumeClaim](https://github.com/prometheus-operator/prometheus-operator/blob/v0.62.0/Documentation/api.md#embeddedpersistentvolumeclaim) | Defines persistent storage for Prometheus. Use this setting to configure the persistent volume claim, including storage class, volume size and name. |
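
For context, enabling the new profile on a cluster would look roughly like the sketch below. The ConfigMap name and `config.yaml` key follow the cluster-monitoring-operator's existing configuration convention; because this ships as TechPreview, the assumption is that the cluster's TechPreview feature set must also be enabled:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      collectionProfile: minimal
```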

[Back to TOC](#table-of-contents)
2 changes: 2 additions & 0 deletions Documentation/openshiftdocs/modules/prometheusk8sconfig.adoc
@@ -42,6 +42,8 @@ Appears in: link:clustermonitoringconfiguration.adoc[ClusterMonitoringConfiguration]

|topologySpreadConstraints|[]v1.TopologySpreadConstraint|Defines the pod's topology spread constraints.

|collectionProfile|CollectionProfile|Defines the metrics collection profile that Prometheus uses to collect metrics from the platform components. Supported values are `full` or `minimal`. In the `full` profile (default), Prometheus collects all metrics exposed by the platform components. In the `minimal` profile, Prometheus collects only the metrics necessary for the default platform alerts, recording rules, telemetry, and console dashboards.

|volumeClaimTemplate|*monv1.EmbeddedPersistentVolumeClaim|Defines persistent storage for Prometheus. Use this setting to configure the persistent volume claim, including storage class, volume size and name.

|===
30 changes: 30 additions & 0 deletions assets/control-plane/minimal-service-monitor-etcd.yaml
@@ -0,0 +1,30 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: etcd
    k8s-app: etcd
    monitoring.openshift.io/collection-profile: minimal
  name: etcd-minimal
  namespace: openshift-monitoring
spec:
  endpoints:
  - interval: 30s
    metricRelabelings:
    - action: keep
      regex: (etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_mvcc_db_total_size_in_use_in_bytes|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_failures_total|etcd_server_has_leader|etcd_server_is_leader|etcd_server_proposals_failed_total|etcd_server_quota_backend_bytes|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|process_start_time_seconds)
      sourceLabels:
      - __name__
    port: etcd-metrics
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/secrets/kube-etcd-client-certs/etcd-client-ca.crt
      certFile: /etc/prometheus/secrets/kube-etcd-client-certs/etcd-client.crt
      keyFile: /etc/prometheus/secrets/kube-etcd-client-certs/etcd-client.key
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - openshift-etcd
  selector:
    matchLabels:
      k8s-app: etcd
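
For readers less familiar with ServiceMonitor semantics: prometheus-operator translates the `metricRelabelings` above into a `metric_relabel_configs` section of the generated Prometheus scrape configuration, so only series whose `__name__` matches the keep-regex survive ingestion. A rough sketch of the generated form (regex abridged here; the exact generated job naming is omitted):

```yaml
# Sketch of the scrape-config fragment prometheus-operator would generate
# from the metricRelabelings above (abridged regex).
metric_relabel_configs:
- source_labels: [__name__]
  regex: (etcd_server_has_leader|etcd_server_is_leader|...)
  action: keep
```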
106 changes: 106 additions & 0 deletions assets/control-plane/minimal-service-monitor-kubelet.yaml
@@ -0,0 +1,106 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: kubelet
    app.kubernetes.io/part-of: openshift-monitoring
    k8s-app: kubelet
    monitoring.openshift.io/collection-profile: minimal
  name: kubelet-minimal
  namespace: openshift-monitoring
spec:
  endpoints:
  - bearerTokenFile: ""
    honorLabels: true
    interval: 30s
    metricRelabelings:
    - action: keep
      regex: (apiserver_audit_event_total|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_usage_bytes|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|container_spec_cpu_shares|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_containers_per_pod_count_sum|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_bucket|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_free|kubelet_volume_stats_inodes_used|kubelet_volume_stats_used_bytes|machine_cpu_cores|machine_memory_bytes|process_start_time_seconds|rest_client_requests_total|storage_operation_duration_seconds_count)
      sourceLabels:
      - __name__
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    scrapeTimeout: 30s
    tlsConfig:
      caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      insecureSkipVerify: false
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  - bearerTokenFile: ""
    honorLabels: true
    honorTimestamps: false
    interval: 30s
    metricRelabelings:
    - action: labeldrop
      regex: __tmp_keep_metric
    - action: keep
      regex: (apiserver_audit_event_total|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_usage_bytes|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|container_spec_cpu_shares|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_containers_per_pod_count_sum|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_bucket|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_free|kubelet_volume_stats_inodes_used|kubelet_volume_stats_used_bytes|machine_cpu_cores|machine_memory_bytes|process_start_time_seconds|rest_client_requests_total|storage_operation_duration_seconds_count)
      sourceLabels:
      - __name__
    path: /metrics/cadvisor
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    scrapeTimeout: 30s
    tlsConfig:
      caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      insecureSkipVerify: false
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  - bearerTokenFile: ""
    honorLabels: true
    interval: 30s
    metricRelabelings:
    - action: keep
      regex: (apiserver_audit_event_total|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_usage_bytes|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|container_spec_cpu_shares|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_containers_per_pod_count_sum|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_bucket|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_free|kubelet_volume_stats_inodes_used|kubelet_volume_stats_used_bytes|machine_cpu_cores|machine_memory_bytes|process_start_time_seconds|rest_client_requests_total|storage_operation_duration_seconds_count)
      sourceLabels:
      - __name__
    path: /metrics/probes
    port: https-metrics
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
    scheme: https
    scrapeTimeout: 30s
    tlsConfig:
      caFile: /etc/prometheus/configmaps/kubelet-serving-ca-bundle/ca-bundle.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      insecureSkipVerify: false
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
  - interval: 30s
    metricRelabelings:
    - action: keep
      regex: (apiserver_audit_event_total|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_usage_bytes|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|container_spec_cpu_shares|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_containers_per_pod_count_sum|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_bucket|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_free|kubelet_volume_stats_inodes_used|kubelet_volume_stats_used_bytes|machine_cpu_cores|machine_memory_bytes|process_start_time_seconds|rest_client_requests_total|storage_operation_duration_seconds_count)
      sourceLabels:
      - __name__
    port: https-metrics
    relabelings:
    - action: replace
      regex: (.+)(?::\d+)
      replacement: $1:9537
      sourceLabels:
      - __address__
      targetLabel: __address__
    - action: replace
      replacement: crio
      sourceLabels:
      - endpoint
      targetLabel: endpoint
    - action: replace
      replacement: crio
      targetLabel: job
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: kubelet
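
Because both the full and minimal variants carry the new `monitoring.openshift.io/collection-profile` label, the operator can select one set or the other per profile. A quick way to inspect what is deployed (a sketch assuming `oc` access to the cluster):

```sh
# List the ServiceMonitors belonging to each collection profile.
oc -n openshift-monitoring get servicemonitors \
  -l monitoring.openshift.io/collection-profile=minimal
oc -n openshift-monitoring get servicemonitors \
  -l monitoring.openshift.io/collection-profile=full
```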
1 change: 1 addition & 0 deletions assets/control-plane/service-monitor-etcd.yaml
@@ -4,6 +4,7 @@ metadata:
  labels:
    app.kubernetes.io/name: etcd
    k8s-app: etcd
    monitoring.openshift.io/collection-profile: full
  name: etcd
  namespace: openshift-monitoring
spec:
@@ -5,6 +5,7 @@ metadata:
    app.kubernetes.io/name: kubelet
    app.kubernetes.io/part-of: openshift-monitoring
    k8s-app: kubelet
    monitoring.openshift.io/collection-profile: full
  name: kubelet-resource-metrics
  namespace: openshift-monitoring
spec:
1 change: 1 addition & 0 deletions assets/control-plane/service-monitor-kubelet.yaml
@@ -5,6 +5,7 @@ metadata:
    app.kubernetes.io/name: kubelet
    app.kubernetes.io/part-of: openshift-monitoring
    k8s-app: kubelet
    monitoring.openshift.io/collection-profile: full
  name: kubelet
  namespace: openshift-monitoring
spec:
57 changes: 57 additions & 0 deletions assets/kube-state-metrics/minimal-service-monitor.yaml
@@ -0,0 +1,57 @@
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.8.1
    monitoring.openshift.io/collection-profile: minimal
  name: kube-state-metrics-minimal
  namespace: openshift-monitoring
spec:
  endpoints:
  - bearerTokenFile: ""
    honorLabels: true
    interval: 1m
    metricRelabelings:
    - action: labeldrop
      regex: instance
    - action: keep
      regex: (kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_status_updated_number_scheduled|kube_deployment_metadata_generation|kube_deployment_spec_replicas|kube_deployment_status_observed_generation|kube_deployment_status_replicas_available|kube_deployment_status_replicas_updated|kube_horizontalpodautoscaler_spec_max_replicas|kube_horizontalpodautoscaler_spec_min_replicas|kube_horizontalpodautoscaler_status_current_replicas|kube_horizontalpodautoscaler_status_desired_replicas|kube_job_failed|kube_job_status_active|kube_job_status_start_time|kube_node_info|kube_node_labels|kube_node_role|kube_node_spec_taint|kube_node_spec_unschedulable|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolume_info|kube_persistentvolume_status_phase|kube_persistentvolumeclaim_access_mode|kube_persistentvolumeclaim_info|kube_persistentvolumeclaim_labels|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_last_terminated_reason|kube_pod_container_status_restarts_total|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_owner|kube_pod_status_phase|kube_pod_status_ready|kube_pod_status_unschedulable|kube_poddisruptionbudget_status_current_healthy|kube_poddisruptionbudget_status_desired_healthy|kube_poddisruptionbudget_status_expected_pods|kube_replicaset_owner|kube_replicationcontroller_owner|kube_resourcequota|kube_state_metrics_list_total|kube_state_metrics_watch_total|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_current_revision|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kube_statefulset_status_update_revision|kube_storageclass_info|process_start_time_seconds)
kube_node_labels is kept, but not:

kube_pod_labels
kube_namespace_labels
kube_poddisruptionbudget_labels
kube_persistentvolume_labels
kube_persistentvolumeclaim_labels

We have the following bugs asking to keep the above metrics:
https://bugzilla.redhat.com/show_bug.cgi?id=2011698
https://bugzilla.redhat.com/show_bug.cgi?id=2015386
https://bugzilla.redhat.com/show_bug.cgi?id=2018431

Contributor Author

I added these, but I'll have to investigate why the tool I developed didn't pick them up. I'll open an issue on the project and add a task to the GA epic.

Contributor Author

Okay, looking quickly at the bugs, these seem to be metrics that are not used in our default alerting, so it's normal that they were excluded from the list. The minimal profile is very restrictive and should only contain metrics that are essential to the default alerts, default rules, console, and telemetry.
Note that kube_persistentvolumeclaim_labels is already on the list.

Contributor Author

I've also searched the CMO repo for those metrics to double-check, and they are not used, so from my POV things are working correctly and those metrics should be excluded (except kube_persistentvolumeclaim_labels, which was already on the list). But do let me know if I missed something.
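
As an illustration, a spot check of that kind might look like the following hypothetical sketch; the searched directories are assumptions about the repository layout:

```sh
# Hypothetical: confirm none of the disputed metrics appear in CMO's
# generated assets or jsonnet sources.
for m in kube_pod_labels kube_namespace_labels kube_poddisruptionbudget_labels kube_persistentvolume_labels; do
  echo "== ${m} =="
  grep -rn "${m}" assets/ jsonnet/ || echo "no references"
done
```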

Thanks, it makes sense.

      sourceLabels:
      - __name__
    port: https-main
    relabelings:
    - action: labeldrop
      regex: pod
    scheme: https
    scrapeTimeout: 1m
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      insecureSkipVerify: false
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
      serverName: kube-state-metrics.openshift-monitoring.svc
  - bearerTokenFile: ""
    interval: 1m
    metricRelabelings:
    - action: keep
      regex: (kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_status_updated_number_scheduled|kube_deployment_metadata_generation|kube_deployment_spec_replicas|kube_deployment_status_observed_generation|kube_deployment_status_replicas_available|kube_deployment_status_replicas_updated|kube_horizontalpodautoscaler_spec_max_replicas|kube_horizontalpodautoscaler_spec_min_replicas|kube_horizontalpodautoscaler_status_current_replicas|kube_horizontalpodautoscaler_status_desired_replicas|kube_job_failed|kube_job_status_active|kube_job_status_start_time|kube_node_info|kube_node_labels|kube_node_role|kube_node_spec_taint|kube_node_spec_unschedulable|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolume_info|kube_persistentvolume_status_phase|kube_persistentvolumeclaim_access_mode|kube_persistentvolumeclaim_info|kube_persistentvolumeclaim_labels|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_last_terminated_reason|kube_pod_container_status_restarts_total|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_owner|kube_pod_status_phase|kube_pod_status_ready|kube_pod_status_unschedulable|kube_poddisruptionbudget_status_current_healthy|kube_poddisruptionbudget_status_desired_healthy|kube_poddisruptionbudget_status_expected_pods|kube_replicaset_owner|kube_replicationcontroller_owner|kube_resourcequota|kube_state_metrics_list_total|kube_state_metrics_watch_total|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_current_revision|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kube_statefulset_status_update_revision|kube_storageclass_info|process_start_time_seconds)
      sourceLabels:
      - __name__
    port: https-self
    scheme: https
    scrapeTimeout: 1m
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      certFile: /etc/prometheus/secrets/metrics-client-certs/tls.crt
      insecureSkipVerify: false
      keyFile: /etc/prometheus/secrets/metrics-client-certs/tls.key
      serverName: kube-state-metrics.openshift-monitoring.svc
  jobLabel: app.kubernetes.io/name
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kube-state-metrics
      app.kubernetes.io/part-of: openshift-monitoring
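
One way to sanity-check the minimal profile end to end is to count the distinct metric names coming from kube-state-metrics before and after switching profiles; under `minimal`, the result should shrink to roughly the keep-list above. A hedged sketch via the Prometheus HTTP API (the route name and the `job` label value are assumptions):

```sh
# Count distinct kube-state-metrics series names through the query API.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route prometheus-k8s -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "https://${HOST}/api/v1/query" \
  --data-urlencode 'query=count(count by (__name__) ({job="kube-state-metrics"}))'
```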
1 change: 1 addition & 0 deletions assets/kube-state-metrics/service-monitor.yaml
@@ -6,6 +6,7 @@ metadata:
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: openshift-monitoring
    app.kubernetes.io/version: 2.8.1
    monitoring.openshift.io/collection-profile: full
  name: kube-state-metrics
  namespace: openshift-monitoring
spec: