OCM: Add OpenShift Cluster Manager team
The OCM team relies on these queries for reporting metrics to customers
and would like to be notified when they change so that the cluster
service can adapt.
vkareh committed Apr 8, 2020
1 parent 97e2a67 commit 4e0f326
Showing 2 changed files with 108 additions and 76 deletions.
92 changes: 54 additions & 38 deletions Documentation/data-collection.md
@@ -32,20 +32,22 @@ data:
# to identify when an update causes a service to begin to crash-loop or
# flake.
- '{__name__="count:up1"}'
# cluster_version reports what payload and version the cluster is being
# configured to and is used to identify what versions are on a cluster
# that is experiencing problems.
# (@openshift/openshift-team-cluster-manager) cluster_version reports what
# payload and version the cluster is being configured to and is used to
# identify what versions are on a cluster that is experiencing problems.
- '{__name__="cluster_version"}'
# cluster_version_available_updates reports the channel and version
# server the cluster is configured to use and how many updates are
# available. This is used to ensure that updates are being properly
# served to clusters.
# (@openshift/openshift-team-cluster-manager)
# cluster_version_available_updates reports the channel and version server
# the cluster is configured to use and how many updates are available. This
# is used to ensure that updates are being properly served to clusters.
- '{__name__="cluster_version_available_updates"}'
# (@openshift/openshift-team-olm) cluster_operator_up reports the health status of the core cluster
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster_operator_up reports the health status of the core cluster
# operators - like up, an upgrade that fails due to a configuration value
# on the cluster will help narrow down which component is affected.
- '{__name__="cluster_operator_up"}'
# (@openshift/openshift-team-olm) cluster_operator_conditions exposes the status conditions cluster
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster_operator_conditions exposes the status conditions cluster
# operators report for debugging. The condition and status are reported.
- '{__name__="cluster_operator_conditions"}'
# cluster_version_payload captures how far through a payload the cluster
@@ -55,8 +57,9 @@ data:
# (@openshift/openshift-team-olm) cluster_installer reports what installed the cluster, along with its
# version number and invoker.
- '{__name__="cluster_installer"}'
# (@openshift/openshift-team-olm) cluster_infrastructure_provider reports the configured cloud provider
# if any, along with the infrastructure region when running in the public
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster_infrastructure_provider reports the configured cloud provider if
# any, along with the infrastructure region when running in the public
# cloud.
- '{__name__="cluster_infrastructure_provider"}'
# cluster_feature_set reports the configured cluster feature set and
@@ -66,29 +69,32 @@ data:
# - the rough size of the data stored in etcd and
# - the consistency between the etcd instances.
- '{__name__="instance:etcd_object_counts:sum"}'
# (@openshift/openshift-team-olm) alerts are the key summarization of the system state. They are
# reported via telemetry to assess their value in detecting
# upgrade failure causes and also to prevent the need to gather
# large sets of metrics that are already summarized on the cluster.
# Reporting alerts also creates an incentive to improve per
# cluster alerting for the purposes of preventing upgrades from
# failing for end users.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# alerts are the key summarization of the system state. They are reported
# via telemetry to assess their value in detecting upgrade failure causes
# and also to prevent the need to gather large sets of metrics that are
# already summarized on the cluster. Reporting alerts also creates an
# incentive to improve per cluster alerting for the purposes of preventing
# upgrades from failing for end users.
- '{__name__="ALERTS",alertstate="firing"}'
# the following three metrics will be used for SLA analysis reports.
# (@openshift/openshift-team-olm) code:apiserver_request_count:rate:sum identifies average of occurances
# of each http status code over 10 minutes
- '{__name__="code:apiserver_request_count:rate:sum"}'
# (@openshift/openshift-team-olm) cluster:capacity_cpu_cores:sum is the total number of CPU cores
# in the cluster labeled by node role and type.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:capacity_cpu_cores:sum is the total number of CPU cores in the
# cluster labeled by node role and type.
- '{__name__="cluster:capacity_cpu_cores:sum"}'
# cluster:capacity_memory_bytes:sum is the total bytes of memory
# in the cluster labeled by node role and type.
# (@openshift/openshift-team-cluster-manager) cluster:capacity_memory_bytes:sum is the
# total bytes of memory in the cluster labeled by node role and type.
- '{__name__="cluster:capacity_memory_bytes:sum"}'
# (@openshift/openshift-team-olm) cluster:cpu_usage_cores:sum is the current amount of CPU in
# use across the whole cluster.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:cpu_usage_cores:sum is the current amount of CPU in use across
# the whole cluster.
- '{__name__="cluster:cpu_usage_cores:sum"}'
# (@openshift/openshift-team-olm) cluster:memory_usage_bytes:sum is the current amount of memory in
# use across the whole cluster.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:memory_usage_bytes:sum is the current amount of memory in use
# across the whole cluster.
- '{__name__="cluster:memory_usage_bytes:sum"}'
# (@openshift/openshift-team-olm) openshift:cpu_usage_cores:sum is the current amount of CPU
# used by OpenShift components, including the control plane and
@@ -109,26 +115,35 @@ data:
# This metric helps identify issues specific to a virtualization
# type or bare metal.
- '{__name__="cluster:virt_platform_nodes:sum"}'
# (@openshift/openshift-team-olm) cluster:node_instance_type_count:sum is the number of nodes
# of each instance type and role.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:node_instance_type_count:sum is the number of nodes of each
# instance type and role.
- '{__name__="cluster:node_instance_type_count:sum"}'
# cnv:vmi_status_running:count is the total number of VM instances running in the cluster.
- '{__name__="cnv:vmi_status_running:count"}'
# (@openshift/openshift-team-olm) node_role_os_version_machine:cpu_capacity_cores:sum is the total number of CPU cores
# in the cluster labeled by master and/or infra node role, os, architecture, and hyperthreading state.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# node_role_os_version_machine:cpu_capacity_cores:sum is the total number
# of CPU cores in the cluster labeled by master and/or infra node role, os,
# architecture, and hyperthreading state.
- '{__name__="node_role_os_version_machine:cpu_capacity_cores:sum"}'
# node_role_os_version_machine:cpu_capacity_sockets:sum is the total number of CPU sockets
# in the cluster labeled by master and/or infra node role, os, architecture, and hyperthreading state.
# (@openshift/openshift-team-cluster-manager)
# node_role_os_version_machine:cpu_capacity_sockets:sum is the total number
# of CPU sockets in the cluster labeled by master and/or infra node role,
# os, architecture, and hyperthreading state.
- '{__name__="node_role_os_version_machine:cpu_capacity_sockets:sum"}'
# subscription_sync_total is the number of times an OLM operator
# Subscription has been synced, labelled by name and installed csv
- '{__name__="subscription_sync_total"}'
# csv_succeeded is unique to the namespace, name, version, and phase labels.
# The metrics is always present and can be equal to 0 or 1, where 0 represents that the
# csv is not in the succeeded state while 1 represents that the csv is in the succeeded state.
# (@openshift/openshift-team-cluster-manager) csv_succeeded is unique to the
# namespace, name, version, and phase labels. The metrics is always
# present and can be equal to 0 or 1, where 0 represents that the csv is
# not in the succeeded state while 1 represents that the csv is in the
# succeeded state.
- '{__name__="csv_succeeded"}'
# csv_abnormal represents the reason why a csv is not in the succeeded state and includes the
# namespace, name, version, phase, reason labels. When a csv is updated, the previous time series associated with the csv will be deleted.
# (@openshift/openshift-team-cluster-manager) csv_abnormal represents the reason why
# a csv is not in the succeeded state and includes the namespace, name,
# version, phase, reason labels. When a csv is updated, the previous time
# series associated with the csv will be deleted.
- '{__name__="csv_abnormal"}'
# OCS metrics to be collected:
# ceph_cluster_total_bytes gives the size of ceph cluster in bytes.
@@ -157,7 +172,8 @@ data:
- '{__name__="noobaa_accounts_num"}'
# noobaa_total_usage gives the total usage of noobaa's storage in bytes.
- '{__name__="noobaa_total_usage"}'
# console_url is the url of the console running on the cluster.
# (@openshift/openshift-team-cluster-manager) console_url is the url of the console
# running on the cluster.
- '{__name__="console_url"}'
# cluster:network_attachment_definition_instances:max" gives max no of instance
# in the cluster that are annotated with k8s.v1.cni.cncf.io/networks, labelled by networks.
92 changes: 54 additions & 38 deletions manifests/0000_50_cluster_monitoring_operator_04-config.yaml
@@ -24,20 +24,22 @@ data:
# to identify when an update causes a service to begin to crash-loop or
# flake.
- '{__name__="count:up1"}'
# cluster_version reports what payload and version the cluster is being
# configured to and is used to identify what versions are on a cluster
# that is experiencing problems.
# (@openshift/openshift-team-cluster-manager) cluster_version reports what
# payload and version the cluster is being configured to and is used to
# identify what versions are on a cluster that is experiencing problems.
- '{__name__="cluster_version"}'
# cluster_version_available_updates reports the channel and version
# server the cluster is configured to use and how many updates are
# available. This is used to ensure that updates are being properly
# served to clusters.
# (@openshift/openshift-team-cluster-manager)
# cluster_version_available_updates reports the channel and version server
# the cluster is configured to use and how many updates are available. This
# is used to ensure that updates are being properly served to clusters.
- '{__name__="cluster_version_available_updates"}'
# (@openshift/openshift-team-olm) cluster_operator_up reports the health status of the core cluster
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster_operator_up reports the health status of the core cluster
# operators - like up, an upgrade that fails due to a configuration value
# on the cluster will help narrow down which component is affected.
- '{__name__="cluster_operator_up"}'
# (@openshift/openshift-team-olm) cluster_operator_conditions exposes the status conditions cluster
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster_operator_conditions exposes the status conditions cluster
# operators report for debugging. The condition and status are reported.
- '{__name__="cluster_operator_conditions"}'
# cluster_version_payload captures how far through a payload the cluster
@@ -47,8 +49,9 @@ data:
# (@openshift/openshift-team-olm) cluster_installer reports what installed the cluster, along with its
# version number and invoker.
- '{__name__="cluster_installer"}'
# (@openshift/openshift-team-olm) cluster_infrastructure_provider reports the configured cloud provider
# if any, along with the infrastructure region when running in the public
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster_infrastructure_provider reports the configured cloud provider if
# any, along with the infrastructure region when running in the public
# cloud.
- '{__name__="cluster_infrastructure_provider"}'
# cluster_feature_set reports the configured cluster feature set and
@@ -58,29 +61,32 @@ data:
# - the rough size of the data stored in etcd and
# - the consistency between the etcd instances.
- '{__name__="instance:etcd_object_counts:sum"}'
# (@openshift/openshift-team-olm) alerts are the key summarization of the system state. They are
# reported via telemetry to assess their value in detecting
# upgrade failure causes and also to prevent the need to gather
# large sets of metrics that are already summarized on the cluster.
# Reporting alerts also creates an incentive to improve per
# cluster alerting for the purposes of preventing upgrades from
# failing for end users.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# alerts are the key summarization of the system state. They are reported
# via telemetry to assess their value in detecting upgrade failure causes
# and also to prevent the need to gather large sets of metrics that are
# already summarized on the cluster. Reporting alerts also creates an
# incentive to improve per cluster alerting for the purposes of preventing
# upgrades from failing for end users.
- '{__name__="ALERTS",alertstate="firing"}'
# the following three metrics will be used for SLA analysis reports.
# (@openshift/openshift-team-olm) code:apiserver_request_count:rate:sum identifies average of occurances
# of each http status code over 10 minutes
- '{__name__="code:apiserver_request_count:rate:sum"}'
# (@openshift/openshift-team-olm) cluster:capacity_cpu_cores:sum is the total number of CPU cores
# in the cluster labeled by node role and type.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:capacity_cpu_cores:sum is the total number of CPU cores in the
# cluster labeled by node role and type.
- '{__name__="cluster:capacity_cpu_cores:sum"}'
# cluster:capacity_memory_bytes:sum is the total bytes of memory
# in the cluster labeled by node role and type.
# (@openshift/openshift-team-cluster-manager) cluster:capacity_memory_bytes:sum is the
# total bytes of memory in the cluster labeled by node role and type.
- '{__name__="cluster:capacity_memory_bytes:sum"}'
# (@openshift/openshift-team-olm) cluster:cpu_usage_cores:sum is the current amount of CPU in
# use across the whole cluster.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:cpu_usage_cores:sum is the current amount of CPU in use across
# the whole cluster.
- '{__name__="cluster:cpu_usage_cores:sum"}'
# (@openshift/openshift-team-olm) cluster:memory_usage_bytes:sum is the current amount of memory in
# use across the whole cluster.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:memory_usage_bytes:sum is the current amount of memory in use
# across the whole cluster.
- '{__name__="cluster:memory_usage_bytes:sum"}'
# (@openshift/openshift-team-olm) openshift:cpu_usage_cores:sum is the current amount of CPU
# used by OpenShift components, including the control plane and
@@ -101,26 +107,35 @@ data:
# This metric helps identify issues specific to a virtualization
# type or bare metal.
- '{__name__="cluster:virt_platform_nodes:sum"}'
# (@openshift/openshift-team-olm) cluster:node_instance_type_count:sum is the number of nodes
# of each instance type and role.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster:node_instance_type_count:sum is the number of nodes of each
# instance type and role.
- '{__name__="cluster:node_instance_type_count:sum"}'
# cnv:vmi_status_running:count is the total number of VM instances running in the cluster.
- '{__name__="cnv:vmi_status_running:count"}'
# (@openshift/openshift-team-olm) node_role_os_version_machine:cpu_capacity_cores:sum is the total number of CPU cores
# in the cluster labeled by master and/or infra node role, os, architecture, and hyperthreading state.
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# node_role_os_version_machine:cpu_capacity_cores:sum is the total number
# of CPU cores in the cluster labeled by master and/or infra node role, os,
# architecture, and hyperthreading state.
- '{__name__="node_role_os_version_machine:cpu_capacity_cores:sum"}'
# node_role_os_version_machine:cpu_capacity_sockets:sum is the total number of CPU sockets
# in the cluster labeled by master and/or infra node role, os, architecture, and hyperthreading state.
# (@openshift/openshift-team-cluster-manager)
# node_role_os_version_machine:cpu_capacity_sockets:sum is the total number
# of CPU sockets in the cluster labeled by master and/or infra node role,
# os, architecture, and hyperthreading state.
- '{__name__="node_role_os_version_machine:cpu_capacity_sockets:sum"}'
# subscription_sync_total is the number of times an OLM operator
# Subscription has been synced, labelled by name and installed csv
- '{__name__="subscription_sync_total"}'
# csv_succeeded is unique to the namespace, name, version, and phase labels.
# The metrics is always present and can be equal to 0 or 1, where 0 represents that the
# csv is not in the succeeded state while 1 represents that the csv is in the succeeded state.
# (@openshift/openshift-team-cluster-manager) csv_succeeded is unique to the
# namespace, name, version, and phase labels. The metrics is always
# present and can be equal to 0 or 1, where 0 represents that the csv is
# not in the succeeded state while 1 represents that the csv is in the
# succeeded state.
- '{__name__="csv_succeeded"}'
# csv_abnormal represents the reason why a csv is not in the succeeded state and includes the
# namespace, name, version, phase, reason labels. When a csv is updated, the previous time series associated with the csv will be deleted.
# (@openshift/openshift-team-cluster-manager) csv_abnormal represents the reason why
# a csv is not in the succeeded state and includes the namespace, name,
# version, phase, reason labels. When a csv is updated, the previous time
# series associated with the csv will be deleted.
- '{__name__="csv_abnormal"}'
# OCS metrics to be collected:
# ceph_cluster_total_bytes gives the size of ceph cluster in bytes.
@@ -149,7 +164,8 @@ data:
- '{__name__="noobaa_accounts_num"}'
# noobaa_total_usage gives the total usage of noobaa's storage in bytes.
- '{__name__="noobaa_total_usage"}'
# console_url is the url of the console running on the cluster.
# (@openshift/openshift-team-cluster-manager) console_url is the url of the console
# running on the cluster.
- '{__name__="console_url"}'
# cluster:network_attachment_definition_instances:max" gives max no of instance
# in the cluster that are annotated with k8s.v1.cni.cncf.io/networks, labelled by networks.
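
The `(@org/team, ...)` annotations introduced in this commit are plain comments, so anything that acts on them has to parse them out of the allowlist text. A minimal sketch of that parsing follows, assuming each matcher is a `- '...'` list item and that its owners appear in one parenthesised group somewhere in the comment block directly above it. The function name `owners_by_matcher` and the `SAMPLE` text are hypothetical and exist only for illustration; this is not tooling from the repository.

```python
import re

# A team handle such as "@openshift/openshift-team-olm".
OWNER_RE = re.compile(r"@[\w-]+/[\w-]+")
# The first parenthesised group in a comment block that contains a handle.
ANNOTATION_RE = re.compile(r"\(([^)]*@[^)]*)\)")


def owners_by_matcher(text):
    """Map each matcher to the team handles annotated in the comment above it.

    Hypothetical helper: assumes comment lines start with '#' and matcher
    lines look like "- '<matcher>'". Matchers without an annotation map to
    an empty list.
    """
    owners = {}
    comment = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # Collect the comment block that precedes the next matcher.
            comment.append(line.lstrip("#").strip())
        elif line.startswith("- '") and line.endswith("'"):
            matcher = line[3:-1]
            found = ANNOTATION_RE.search(" ".join(comment))
            owners[matcher] = OWNER_RE.findall(found.group(1)) if found else []
            comment = []
        else:
            # Any other line ends the current comment block.
            comment = []
    return owners


# Illustrative input modelled on the annotated entries in this diff.
SAMPLE = """\
# (@openshift/openshift-team-cluster-manager) cluster_version reports what
# payload and version the cluster is being configured to.
- '{__name__="cluster_version"}'
# (@openshift/openshift-team-olm, @openshift/openshift-team-cluster-manager)
# cluster_operator_up reports the health status of the core cluster operators.
- '{__name__="cluster_operator_up"}'
# cnv:vmi_status_running:count is the total number of VM instances running.
- '{__name__="cnv:vmi_status_running:count"}'
"""

if __name__ == "__main__":
    for matcher, teams in owners_by_matcher(SAMPLE).items():
        print(matcher, teams)
```

Listing the teams per matcher this way would make it possible to work out who to notify when a given query changes, which is the motivation stated in the commit message.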