
Proposal: Enhance the stability of log alerts #1411

@zdyj3170101136

background

We are heavy users of log alerts and have internally invested a lot of time and effort in them.

We studied Grafana's code and Datadog's feature set and used them as references to completely overhaul the alerting process.

enhancement

async execute alert

Currently, if any log alert takes too long to run, it will block the execution of other alerts.

export default async () => {
  const now = new Date();
  const alerts = await getAlerts();
  logger.info(`Going to process ${alerts.length} alerts`);
  // The whole batch is awaited here, so one slow alert delays the next scheduling round.
  await Promise.all(alerts.map(alert => processAlert(now, alert)));
};

Modified to:

  • Use runningAlerts to ensure that only one evaluation per alert ID runs at a time.

  • If the previous evaluation of the same alert ID has not finished, skip this tick.

Refer to Grafana: https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/schedule.go#L420

const runningAlerts = new Map<string, Date>();

export default async () => {
  const now = new Date();
  const alerts = await getAlerts();
  logger.info(`Going to process ${alerts.length} alerts`);
  alerts.forEach(alert => {
    const alertId = alert.id;
    if (runningAlerts.has(alertId)) {
      logger.error({
        message: 'Tick dropped because alert rule evaluation is too slow',
        alert_id: alertId,
        alertname: alert.name,
        time: now,
        lastEvaluation: runningAlerts.get(alertId),
      });
      EvaluationMissed.inc({
        alertname: alert.name ?? 'unknown',
      });
      return;
    }
    runningAlerts.set(alertId, now);
    void processAlert(now, alert).finally(() => runningAlerts.delete(alertId));
  });
};

add retry

During alert execution, if a ClickHouse query or a webhook request fails, the exception should be propagated upwards.

The retry is then initiated at the top level.

See https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/alert_rule.go#L285

  // Runs inside the scheduler loop, replacing the direct
  // `void processAlert(now, alert).finally(...)` call above.
  const evalStart = new Date();
  const attempt = async (retries: number): Promise<void> => {
    try {
      await processAlert(now, alert);
    } catch (err) {
      logger.error({
        message: 'Failed to evaluate rule',
        retries,
        error: serializeError(err),
        alertname: alert.name,
        alertId,
      });
      if (retries < MAX_ATTEMPTS) {
        // Back off before retrying the whole evaluation.
        await new Promise<void>(resolve =>
          setTimeout(resolve, RETRY_DELAY_MS),
        );
        return attempt(retries + 1);
      }
      // no more retries
      return;
    }
  };
  void attempt(1).finally(() => {
    EvalDuration.observe(
      (new Date().getTime() - evalStart.getTime()) / 1000,
    );
    runningAlerts.delete(alertId);
  });

execute alerts evenly

Spread alert evaluations evenly over a one-minute window to reduce database load.

Refer to https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/schedule.go#L439

    const alertId = alert.id;
    // Hash the alert ID into a deterministic 0-59 second offset.
    const delayMs = (fnv.hash(alertId) % 60) * 1000;
    setTimeout(() => {
      void processAlert(now, alert);
    }, delayMs);
    // Whether the alert runs at 00:01:30 or 00:01:59, it still queries the logs for [00:00:00, 00:01:00).
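
The fnv.hash helper above is assumed rather than an existing utility; a minimal FNV-1a sketch that maps an alert ID to a stable unsigned 32-bit integer (so every pod computes the same offset) could look like this:

const fnv = {
  hash(input: string): number {
    let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
    for (let i = 0; i < input.length; i++) {
      hash ^= input.charCodeAt(i);
      // Multiply by the FNV prime (16777619) using shifts to stay in 32-bit range.
      hash =
        (hash +
          (hash << 1) +
          (hash << 4) +
          (hash << 7) +
          (hash << 8) +
          (hash << 24)) >>>
        0;
    }
    return hash;
  },
};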

do not query historical data

In the old implementation, each alert queried data from its last execution time up to the current time.

For example, if a machine has been down for a month, the first evaluation after recovery would query a month of logs and trigger many meaningless alerts.

The alert state then flaps: OK -> alert -> recovery -> alert.

Following the Grafana implementation, each evaluation should only query the most recent window (the past one minute).
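
A minimal sketch of this fixed evaluation window, assuming a one-minute alert interval (the buildSearchRange name and ALERT_INTERVAL_MS constant are illustrative, not from the existing codebase):

const ALERT_INTERVAL_MS = 60 * 1000;

// Always evaluate the most recent full minute instead of [lastRunTime, now),
// so a long outage never produces a backlog of historical evaluations.
const buildSearchRange = (now: Date) => {
  const endTime = new Date(
    Math.floor(now.getTime() / ALERT_INTERVAL_MS) * ALERT_INTERVAL_MS,
  );
  const startTime = new Date(endTime.getTime() - ALERT_INTERVAL_MS);
  return { startTime, endTime }; // e.g. [00:00:00, 00:01:00)
};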

observability

Implemented according to https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/metrics/scheduler.go.


import * as Prometheus from 'prom-client'; // assuming the prom-client package

const EvaluationMissed = new Prometheus.Counter({
  name: 'schedule_rule_evaluations_missed_total',
  help: 'The total number of rule evaluations missed due to a slow rule evaluation or schedule problem.',
  labelNames: ['alertname'],
});

const EvalDuration = new Prometheus.Summary({
  name: 'rule_evaluation_duration_seconds',
  help: 'The time to evaluate a rule.',
  labelNames: ['type'],
  percentiles: [0.01, 0.05, 0.5, 0.9, 0.99],
});

const EvalRetry = new Prometheus.Counter({
  name: 'rule_evaluation_retry_total',
  help: 'The total number of rule retry.',
  labelNames: ['alertname'],
});

const EvalFailures = new Prometheus.Counter({
  name: 'rule_evaluation_failures_total',
  help: 'The total number of rule evaluation failures.',
  labelNames: ['alertname'],
});
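
For reference, a sketch of where EvalRetry and EvalFailures could be incremented inside the attempt() helper from the retry section above; the surrounding now, alert, MAX_ATTEMPTS and RETRY_DELAY_MS values are the same assumptions as in that snippet:

const attemptWithMetrics = async (retries: number): Promise<void> => {
  try {
    await processAlert(now, alert);
  } catch (err) {
    logger.error({
      message: 'Failed to evaluate rule',
      retries,
      error: serializeError(err),
      alertname: alert.name,
    });
    if (retries < MAX_ATTEMPTS) {
      // Count every retry attempt per alert.
      EvalRetry.inc({ alertname: alert.name ?? 'unknown' });
      await new Promise<void>(resolve => setTimeout(resolve, RETRY_DELAY_MS));
      return attemptWithMetrics(retries + 1);
    }
    // All retries exhausted: count a hard failure.
    EvalFailures.inc({ alertname: alert.name ?? 'unknown' });
  }
};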

run alert as service

Currently, the API, the UI, and the AlertTask all run in a single deployment. The AlertTask will be split into its own deployment to reduce interference.

The alerting service then runs as multiple pods, with a leader elected via MongoDB to ensure that only one pod actually evaluates alerts.

MongoDB lock document fields:

lockValue: string; // Unique identifier of the lock holder (pod name)

acquiredAt: Date; // Lock acquisition time

expiresAt: Date; // Lock expiration time, automatically cleared with TTL

Acquiring a Lock (a sketch follows after this list):

  • Query MongoDB to see whether anyone holds the lock.

    • If the current pod holds the lock, extend the lock's expiration time and return true.

    • If another pod holds the lock, return false.

    • If no pod holds the lock, insert a record with an expiration time 2 minutes in the future.

      • If MongoDB returns duplicate key error 11000, a concurrent pod acquired the lock first; return false.

      • Any other error is propagated upwards.
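
A minimal TypeScript sketch of this acquisition flow, assuming the official mongodb Node.js driver; the leaderLocks collection, LOCK_ID, LOCK_TTL_MS and podName names are illustrative, not taken from the existing codebase:

import { Collection, MongoServerError } from 'mongodb';

interface LeaderLock {
  _id: string;       // constant lock id shared by all pods
  lockValue: string; // pod name of the current holder
  acquiredAt: Date;
  expiresAt: Date;   // cleaned up by a TTL index:
                     //   createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 })
}

const LOCK_ID = 'alert-task-leader';
const LOCK_TTL_MS = 2 * 60 * 1000;

export const tryAcquireLeaderLock = async (
  leaderLocks: Collection<LeaderLock>,
  podName: string,
): Promise<boolean> => {
  const now = new Date();
  const expiresAt = new Date(now.getTime() + LOCK_TTL_MS);

  // 1. Check whether a live lock exists.
  const existing = await leaderLocks.findOne({
    _id: LOCK_ID,
    expiresAt: { $gt: now },
  });
  if (existing) {
    if (existing.lockValue !== podName) {
      return false; // another pod is the leader
    }
    // We already hold the lock: extend its expiration.
    await leaderLocks.updateOne(
      { _id: LOCK_ID, lockValue: podName },
      { $set: { expiresAt } },
    );
    return true;
  }

  // 2. No live lock: try to create one. The unique _id makes this race-safe.
  try {
    await leaderLocks.insertOne({
      _id: LOCK_ID,
      lockValue: podName,
      acquiredAt: now,
      expiresAt,
    });
    return true;
  } catch (err) {
    if (err instanceof MongoServerError && err.code === 11000) {
      return false; // a concurrent pod created the lock first
    }
    throw err; // any other error is propagated
  }
};

Only the pod for which tryAcquireLeaderLock returns true runs the scheduler tick; the other pods simply retry on the next interval.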
