Bug 1904503: Add prometheus alerts for vsphere #126

Merged
25 changes: 25 additions & 0 deletions assets/vsphere_problem_detector/12_prometheusrules.yaml
@@ -0,0 +1,25 @@
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vsphere-problem-detector
  namespace: openshift-cluster-storage-operator
  labels:
    role: alert-rules
spec:
  groups:
  - name: vsphere-problem-detector.rules
    rules:
    - alert: VSphereOpenshiftNodeHealthFail
      expr: vsphere_node_check_errors == 1

What type is this metric - a counter or a gauge? A quick search could not find it in the operator.

Contributor

It's a gauge now, openshift/vsphere-problem-detector#24
(it was a counter until yesterday, which was hard to alert on).


One small comment to improve the alerting rule: in the current version, a failed scrape by Prometheus would resolve the alert if it was firing previously.
To protect against this, you can use min_over_time() like this:

Suggested change
expr: vsphere_node_check_errors == 1
expr: min_over_time(vsphere_node_check_errors[5m]) == 1

see https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0
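For context, the node alert with that suggestion applied would look roughly like this (an illustrative sketch only, not part of the committed file; the 5m window is taken from the suggestion above):

```yaml
- alert: VSphereOpenshiftNodeHealthFail
  # min_over_time() keeps the expression true across a short scrape gap: the alert
  # only resolves once the check actually reports healthy again (or the series has
  # been absent for the whole 5m window).
  expr: min_over_time(vsphere_node_check_errors[5m]) == 1
  # the `for:` clause still applies on top of the 5m look-back window
  for: 10m
  labels:
    severity: warning
  annotations:
    message: "VSphere health check {{ $labels.check }} is failing on {{ $labels.node }}."
```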


> (it was a counter until yesterday, which was hard to alert on).

Nice, thanks!

Agreed with Simon's suggestion, otherwise looks good to me.

Member Author

Why min_over_time - shouldn't this be max_over_time? I would think that if a scrape failed and a value is missing at time t1, it would be replaced with some kind of sentinel value (0? I don't know Prometheus very well or how it fills the holes in the data).


If the target is down (up == 0), Prometheus marks the vsphere_node_check_errors metric as stale (meaning it no longer exists). On the next evaluation of the alerting rule, the rule's expression returns "no data", so Prometheus considers the alert resolved.
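To make the min_over_time() vs. max_over_time() question concrete, here is a hedged sketch, assuming the gauge is 1 while a check fails and 0 once it passes:

```promql
# Samples of vsphere_node_check_errors over a 5m window with one failed scrape
# (missing samples are not replaced with a sentinel like 0, they are simply absent):
#   1, 1, <no sample>, 1, 1

min_over_time(vsphere_node_check_errors[5m]) == 1
# -> still true: every sample that is present is 1, so the alert keeps firing through
#    the gap and resolves as soon as a 0 (check passed) is scraped.

max_over_time(vsphere_node_check_errors[5m]) == 1
# -> would also bridge the gap, but would stay true for up to 5 minutes after the
#    check recovers and would fire on a single failing sample.
```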

      for: 10m
      labels:
        severity: warning
      annotations:
        message: "VSphere health check {{ $labels.check }} is failing on {{ $labels.node }}."
    - alert: VSphereOpenshiftClusterHealthFail
      expr: vsphere_cluster_check_errors == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        message: "VSphere cluster health checks are failing with {{ $labels.check }}"
1 change: 1 addition & 0 deletions go.mod
@@ -11,6 +11,7 @@ require (
github.com/openshift/client-go v0.0.0-20200827190008-3062137373b5
github.com/openshift/library-go v0.0.0-20200909144351-f29d76719396
github.com/prometheus-operator/prometheus-operator v0.44.1
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.44.1
github.com/prometheus/client_golang v1.8.0
github.com/spf13/cobra v1.0.0
github.com/spf13/pflag v1.0.5
45 changes: 45 additions & 0 deletions pkg/generated/bindata.go

Some generated files are not rendered by default.

96 changes: 85 additions & 11 deletions pkg/operator/vsphereproblemdetector/monitoring.go
@@ -2,6 +2,7 @@ package vsphereproblemdetector

import (
"context"
"fmt"
"time"

operatorapi "github.com/openshift/api/operator/v1"
@@ -10,39 +11,65 @@ import (
"github.com/openshift/library-go/pkg/controller/factory"
"github.com/openshift/library-go/pkg/operator/events"
"github.com/openshift/library-go/pkg/operator/resource/resourceapply"
"github.com/openshift/library-go/pkg/operator/resource/resourcemerge"
"github.com/openshift/library-go/pkg/operator/v1helpers"
promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
promclient "github.com/prometheus-operator/prometheus-operator/pkg/client/versioned"
promscheme "github.com/prometheus-operator/prometheus-operator/pkg/client/versioned/scheme"
"k8s.io/apimachinery/pkg/api/equality"
apierrors "k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/runtime/serializer"
"k8s.io/client-go/dynamic"
"k8s.io/client-go/kubernetes"
"k8s.io/klog/v2"
)

type monitoringController struct {
operatorClient v1helpers.OperatorClient
kubeClient kubernetes.Interface
dynamicClient dynamic.Interface
eventRecorder events.Recorder
operatorClient v1helpers.OperatorClient
kubeClient kubernetes.Interface
dynamicClient dynamic.Interface
monitoringClient promclient.Interface
eventRecorder events.Recorder
}

const (
monitoringControllerName = "VSphereProblemDetectorMonitoringController"
prometheusRuleFile = "vsphere_problem_detector/12_prometheusrules.yaml"
)

var (
genericScheme = runtime.NewScheme()
genericCodecs = serializer.NewCodecFactory(genericScheme)
genericCodec = genericCodecs.UniversalDeserializer()
)

func init() {
if err := promscheme.AddToScheme(genericScheme); err != nil {
panic(err)
}
}

func newMonitoringController(
clients *csoclients.Clients,
eventRecorder events.Recorder,
resyncInterval time.Duration) factory.Controller {

c := &monitoringController{
operatorClient: clients.OperatorClient,
kubeClient: clients.KubeClient,
dynamicClient: clients.DynamicClient,
eventRecorder: eventRecorder.WithComponentSuffix("vsphere-monitoring-controller"),
operatorClient: clients.OperatorClient,
kubeClient: clients.KubeClient,
dynamicClient: clients.DynamicClient,
eventRecorder: eventRecorder.WithComponentSuffix("vsphere-monitoring-controller"),
monitoringClient: clients.MonitoringClient,
}

return factory.New().
WithSync(c.sync).
WithInformers(
c.operatorClient.Informer(),
clients.MonitoringInformer.Monitoring().V1().ServiceMonitors().Informer()).
clients.MonitoringInformer.Monitoring().V1().ServiceMonitors().Informer(),
clients.MonitoringInformer.Monitoring().V1().PrometheusRules().Informer()).
ResyncEvery(resyncInterval).
WithSyncDegradedOnError(clients.OperatorClient).
ToController(monitoringControllerName, c.eventRecorder)
@@ -69,14 +96,61 @@ func (c *monitoringController) sync(ctx context.Context, syncContext factory.Syn
return err
}

serviceMonitorAvailable := operatorapi.OperatorCondition{
prometheusRuleBytes := generated.MustAsset(prometheusRuleFile)
_, _, err = c.syncPrometheusRule(ctx, prometheusRuleBytes)
if err != nil {
return err
}

monitoringCondition := operatorapi.OperatorCondition{
Type: monitoringControllerName + operatorapi.OperatorStatusTypeAvailable,
Status: operatorapi.ConditionTrue,
}
if _, _, err := v1helpers.UpdateStatus(c.operatorClient,
v1helpers.UpdateConditionFn(serviceMonitorAvailable),
v1helpers.UpdateConditionFn(monitoringCondition),
); err != nil {
return err
}
return nil
}

func (c *monitoringController) syncPrometheusRule(ctx context.Context, prometheusRuleBytes []byte) (*promv1.PrometheusRule, bool, error) {
requiredObj, _, err := genericCodec.Decode(prometheusRuleBytes, nil, nil)
if err != nil {
return nil, false, fmt.Errorf("cannot decode %q: %v", prometheusRuleFile, err)
}

prometheusRule, ok := requiredObj.(*promv1.PrometheusRule)
if !ok {
return nil, false, fmt.Errorf("invalid prometheusrule: %+v", requiredObj)
}

existingRule, err := c.monitoringClient.MonitoringV1().PrometheusRules(prometheusRule.Namespace).Get(ctx, prometheusRule.Name, metav1.GetOptions{})
if apierrors.IsNotFound(err) {
existingRule, err = c.monitoringClient.MonitoringV1().
PrometheusRules(prometheusRule.Namespace).Create(ctx, prometheusRule, metav1.CreateOptions{})
if err != nil {
return nil, false, fmt.Errorf("failed to create prometheus rule: %v", err)
}
return existingRule, true, nil
}

existingRuleCopy := existingRule.DeepCopy()
existingSpec := existingRuleCopy.Spec

modified := resourcemerge.BoolPtr(false)

resourcemerge.EnsureObjectMeta(modified, &existingRuleCopy.ObjectMeta, prometheusRule.ObjectMeta)
contentSame := equality.Semantic.DeepEqual(existingSpec, prometheusRule.Spec)
// no modifications are necessary, everything is the same
if contentSame && !*modified {
return existingRule, false, nil
}

prometheusRule.ObjectMeta = *existingRuleCopy.ObjectMeta.DeepCopy()
prometheusRule.TypeMeta = existingRuleCopy.TypeMeta

klog.V(4).Infof("prometheus rule %s is modified outside of openshift - updating", prometheusRuleFile)
updatedRule, err := c.monitoringClient.MonitoringV1().PrometheusRules(prometheusRule.Namespace).Update(ctx, prometheusRule, metav1.UpdateOptions{})
return updatedRule, true, err
}
91 changes: 91 additions & 0 deletions pkg/operator/vsphereproblemdetector/monitoring_test.go
@@ -0,0 +1,91 @@
package vsphereproblemdetector

import (
"context"
"testing"

"github.com/openshift/cluster-storage-operator/pkg/csoclients"
"github.com/openshift/library-go/pkg/operator/events"
promv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
"k8s.io/apimachinery/pkg/runtime"
)

func TestSyncPrometheusRule(t *testing.T) {
tests := []struct {
name string
initialRules []*promv1.PrometheusRule
// we merely use this field as a marker in the test to check whether the rule was applied properly
expectedAlertCountInRule int
modified bool
}{
{
name: "for new rule creation",
initialRules: []*promv1.PrometheusRule{},
expectedAlertCountInRule: 2,
modified: true,
},
}

for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
initialObjects := &csoclients.FakeTestObjects{}
for _, r := range test.initialRules {
initialObjects.MonitoringObjects = append(initialObjects.MonitoringObjects, runtime.Object(r))
}

client := csoclients.NewFakeClients(initialObjects)
eventRecorder := events.NewInMemoryRecorder("vsphere-client")
c := &monitoringController{
operatorClient: client.OperatorClient,
kubeClient: client.KubeClient,
dynamicClient: client.DynamicClient,
eventRecorder: eventRecorder,
monitoringClient: client.MonitoringClient,
}
ctx := context.TODO()
rule, modified, err := c.syncPrometheusRule(ctx, getPrometheusRuleRaw())
if err != nil {
t.Errorf("unexpected error: %v", err)
}

if modified != test.modified {
t.Errorf("expected rule modification to be %v got %v", test.modified, modified)
}
actualRules := rule.Spec.Groups[0].Rules
if len(actualRules) != test.expectedAlertCountInRule {
t.Errorf("expected alert count in rule to be %d got %d", test.expectedAlertCountInRule, len(actualRules))
}
})

}
}

func getPrometheusRuleRaw() []byte {
return []byte(`
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vsphere-problem-detector
  namespace: openshift-cluster-storage-operator
  labels:
    role: alert-rules
spec:
  groups:
  - name: vsphere-problem-detector.rules
    rules:
    - alert: VSphereOpenshiftNodeHealthFail
      expr: vsphere_node_check_errors == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        message: "VSphere node health checks are failing on {{ $labels.node }} with {{ $labels.check }}"
    - alert: VSphereOpenshiftClusterHealthFail
      expr: vsphere_cluster_check_errors == 1
      for: 10m
      labels:
        severity: critical
      annotations:
        message: "VSphere cluster health checks are failing with {{ $labels.check }}"
`)
}
1 change: 1 addition & 0 deletions vendor/modules.txt
@@ -234,6 +234,7 @@ github.com/prometheus-operator/prometheus-operator/pkg/client/versioned/typed/mo
github.com/prometheus-operator/prometheus-operator/pkg/client/versioned/typed/monitoring/v1alpha1
github.com/prometheus-operator/prometheus-operator/pkg/client/versioned/typed/monitoring/v1alpha1/fake
# github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring v0.44.1
## explicit
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1
github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1alpha1