
Proposal: Enhance the stability of log alerts #1411

@zdyj3170101136

background

We are heavy users of log alerts and have internally invested a lot of time and effort in them.

We studied Grafana's code and Datadog's feature set and used them as references to completely overhaul the alerting process.

enhancement

async execute alert

Currently, if any log alert takes too long to run, it will block the execution of other alerts.

export default async () => {
  const now = new Date();
  const alerts = await getAlerts();
  logger.info(`Going to process ${alerts.length} alerts`);
  // The whole batch is awaited here, so one slow alert delays the next scheduling round.
  await Promise.all(alerts.map(alert => processAlert(now, alert)));
};

Modified to:

  • Use runningAlerts to ensure that only one evaluation per alert ID runs at a time.

  • If the previous evaluation of the same alert ID has not finished, skip this tick.

Refer to Grafana: https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/schedule.go#L420

const runningAlerts = new Map<string, Date>();

export default async () => {
  const now = new Date();
  const alerts = await getAlerts();
  logger.info(`Going to process ${alerts.length} alerts`);
  alerts.forEach(alert => {
    const alertId = alert.id;
    if (runningAlerts.has(alertId)) {
      logger.error({
        message: 'Tick dropped because alert rule evaluation is too slow',
        alert_id: alertId,
        alertname: alert.name,
        time: now,
        lastEvaluation: runningAlerts.get(alertId),
      });
      EvaluationMissed.inc({
        alertname: alert.name ?? 'unknown',
      });
      return;
    }
    runningAlerts.set(alertId, now);
    void processAlert(now, alert).finally(() => runningAlerts.delete(alertId));
  });
};

add retry

During alert execution, if a ClickHouse query or a webhook request fails, the exception should be propagated upwards.

The retry is then initiated at the top level.

See https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/alert_rule.go#L285

  // Runs inside the scheduler loop, replacing the direct
  // `void processAlert(now, alert).finally(...)` call above.
  const evalStart = new Date();
  const attempt = async (retries: number): Promise<void> => {
    try {
      await processAlert(now, alert);
    } catch (err) {
      logger.error({
        message: 'Failed to evaluate rule',
        retries,
        error: serializeError(err),
        alertname: alert.name,
        alertId,
      });
      if (retries < MAX_ATTEMPTS) {
        // Back off before retrying the whole evaluation.
        await new Promise<void>(resolve =>
          setTimeout(resolve, RETRY_DELAY_MS),
        );
        return attempt(retries + 1);
      }
      // no more retries
      return;
    }
  };
  void attempt(1).finally(() => {
    EvalDuration.observe(
      (new Date().getTime() - evalStart.getTime()) / 1000,
    );
    runningAlerts.delete(alertId);
  });

execute alerts evenly

Spread alert evaluations evenly over a one-minute window to reduce database load.

Refer to https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/schedule/schedule.go#L439

    const alertId = alert.id;
    // Hash the alert ID into a deterministic 0-59 second offset.
    const delayMs = (fnv.hash(alertId) % 60) * 1000;
    setTimeout(() => {
      void processAlert(now, alert);
    }, delayMs);
    // Whether the alert runs at 00:01:30 or 00:01:59, it still queries the logs for [00:00:00, 00:01:00).
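
The fnv.hash helper above is assumed rather than an existing utility; a minimal FNV-1a sketch that maps an alert ID to a stable unsigned 32-bit integer (so every pod computes the same offset) could look like this:

const fnv = {
  hash(input: string): number {
    let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
    for (let i = 0; i < input.length; i++) {
      hash ^= input.charCodeAt(i);
      // Multiply by the FNV prime (16777619) using shifts to stay in 32-bit range.
      hash =
        (hash +
          (hash << 1) +
          (hash << 4) +
          (hash << 7) +
          (hash << 8) +
          (hash << 24)) >>>
        0;
    }
    return hash;
  },
};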

do not query historical data

In the old implementation, each alert queried data from its last execution time up to the current time.

For example, if a machine has been down for a month, the first evaluation after recovery would query a month of logs and trigger many meaningless alerts.

The alert state then flaps: OK -> alert -> recovery -> alert.

Following the Grafana implementation, each evaluation should only query the most recent window (the past one minute).
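
A minimal sketch of this fixed evaluation window, assuming a one-minute alert interval (the buildSearchRange name and ALERT_INTERVAL_MS constant are illustrative, not from the existing codebase):

const ALERT_INTERVAL_MS = 60 * 1000;

// Always evaluate the most recent full minute instead of [lastRunTime, now),
// so a long outage never produces a backlog of historical evaluations.
const buildSearchRange = (now: Date) => {
  const endTime = new Date(
    Math.floor(now.getTime() / ALERT_INTERVAL_MS) * ALERT_INTERVAL_MS,
  );
  const startTime = new Date(endTime.getTime() - ALERT_INTERVAL_MS);
  return { startTime, endTime }; // e.g. [00:00:00, 00:01:00)
};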

observability

Implemented according to https://github.com/grafana/grafana/blob/v12.1.1/pkg/services/ngalert/metrics/scheduler.go.


import * as Prometheus from 'prom-client'; // assuming the prom-client package

const EvaluationMissed = new Prometheus.Counter({
  name: 'schedule_rule_evaluations_missed_total',
  help: 'The total number of rule evaluations missed due to a slow rule evaluation or schedule problem.',
  labelNames: ['alertname'],
});

const EvalDuration = new Prometheus.Summary({
  name: 'rule_evaluation_duration_seconds',
  help: 'The time to evaluate a rule.',
  labelNames: ['type'],
  percentiles: [0.01, 0.05, 0.5, 0.9, 0.99],
});

const EvalRetry = new Prometheus.Counter({
  name: 'rule_evaluation_retry_total',
  help: 'The total number of rule retry.',
  labelNames: ['alertname'],
});

const EvalFailures = new Prometheus.Counter({
  name: 'rule_evaluation_failures_total',
  help: 'The total number of rule evaluation failures.',
  labelNames: ['alertname'],
});
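
For reference, a sketch of where EvalRetry and EvalFailures could be incremented inside the attempt() helper from the retry section above; the surrounding now, alert, MAX_ATTEMPTS and RETRY_DELAY_MS values are the same assumptions as in that snippet:

const attemptWithMetrics = async (retries: number): Promise<void> => {
  try {
    await processAlert(now, alert);
  } catch (err) {
    logger.error({
      message: 'Failed to evaluate rule',
      retries,
      error: serializeError(err),
      alertname: alert.name,
    });
    if (retries < MAX_ATTEMPTS) {
      // Count every retry attempt per alert.
      EvalRetry.inc({ alertname: alert.name ?? 'unknown' });
      await new Promise<void>(resolve => setTimeout(resolve, RETRY_DELAY_MS));
      return attemptWithMetrics(retries + 1);
    }
    // All retries exhausted: count a hard failure.
    EvalFailures.inc({ alertname: alert.name ?? 'unknown' });
  }
};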

run alert as service

Currently, the API, the UI, and the AlertTask all run in a single deployment. The AlertTask will be split into its own deployment to reduce interference.

The alerting service then runs as multiple pods, with a leader elected via MongoDB to ensure that only one pod actually evaluates alerts.

MongoDB lock document fields:

lockValue: string; // Unique identifier of the lock holder (pod name)

acquiredAt: Date; // Lock acquisition time

expiresAt: Date; // Lock expiration time, automatically cleared with TTL

Acquiring a Lock (a sketch follows after this list):

  • Query MongoDB to see whether anyone holds the lock.

    • If the current pod holds the lock, extend the lock's expiration time and return true.

    • If another pod holds the lock, return false.

    • If no pod holds the lock, insert a record with an expiration time 2 minutes in the future.

      • If MongoDB returns duplicate key error 11000, a concurrent pod acquired the lock first; return false.

      • Any other error is propagated upwards.
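
A minimal TypeScript sketch of this acquisition flow, assuming the official mongodb Node.js driver; the leaderLocks collection, LOCK_ID, LOCK_TTL_MS and podName names are illustrative, not taken from the existing codebase:

import { Collection, MongoServerError } from 'mongodb';

interface LeaderLock {
  _id: string;       // constant lock id shared by all pods
  lockValue: string; // pod name of the current holder
  acquiredAt: Date;
  expiresAt: Date;   // cleaned up by a TTL index:
                     //   createIndex({ expiresAt: 1 }, { expireAfterSeconds: 0 })
}

const LOCK_ID = 'alert-task-leader';
const LOCK_TTL_MS = 2 * 60 * 1000;

export const tryAcquireLeaderLock = async (
  leaderLocks: Collection<LeaderLock>,
  podName: string,
): Promise<boolean> => {
  const now = new Date();
  const expiresAt = new Date(now.getTime() + LOCK_TTL_MS);

  // 1. Check whether a live lock exists.
  const existing = await leaderLocks.findOne({
    _id: LOCK_ID,
    expiresAt: { $gt: now },
  });
  if (existing) {
    if (existing.lockValue !== podName) {
      return false; // another pod is the leader
    }
    // We already hold the lock: extend its expiration.
    await leaderLocks.updateOne(
      { _id: LOCK_ID, lockValue: podName },
      { $set: { expiresAt } },
    );
    return true;
  }

  // 2. No live lock: try to create one. The unique _id makes this race-safe.
  try {
    await leaderLocks.insertOne({
      _id: LOCK_ID,
      lockValue: podName,
      acquiredAt: now,
      expiresAt,
    });
    return true;
  } catch (err) {
    if (err instanceof MongoServerError && err.code === 11000) {
      return false; // a concurrent pod created the lock first
    }
    throw err; // any other error is propagated
  }
};

Only the pod for which tryAcquireLeaderLock returns true runs the scheduler tick; the other pods simply retry on the next interval.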
