feat: Show alert execution errors in the UI #2132

Merged
pulpdrew merged 1 commit into main from drew/alert-execution-errors
Apr 17, 2026

Conversation

@pulpdrew pulpdrew commented Apr 17, 2026

Summary

This PR persists and displays errors that occur during alert execution.

Previously, when an alert execution failed (due to bad SQL, an invalid webhook configuration, etc.), users had no way to debug the issue: the only sign of a failure was missing AlertHistory entries on the alert page. Now the user sees an error icon, with the option to view a more detailed error message.

Details

The errors are persisted on the Alert documents (rather than the AlertHistory documents) because:

  1. Each execution of a grouped alert may produce multiple AlertHistory documents, each of which would have to duplicate any query errors. That would mean more complicated de-dupe logic in the API and more database bandwidth when fetching histories (which is already an intensive aggregation query).
  2. The errors mostly matter for the most recent execution. If multiple executions fail, the error message is likely to be the same each time. Further, once an execution succeeds, the issue has likely been resolved, so we can clear the error state.
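
The second point can be illustrated with a minimal sketch. It assumes a hypothetical `executionErrors` field on the alert; the type, field, and function names below are illustrative and are not taken from this PR's code.

```typescript
// Hypothetical shapes for illustration only; the real schema may differ.
interface AlertExecutionError {
  timestamp: Date;
  type: string; // assumed labels such as "query" or "webhook"
  message: string;
}

interface AlertDoc {
  id: string;
  executionErrors: AlertExecutionError[];
}

// Each execution overwrites the stored errors, so the alert only ever
// reflects its most recent run. A successful run passes an empty array,
// which clears the error state.
function withLatestErrors(
  alertDoc: AlertDoc,
  errors: AlertExecutionError[],
): AlertDoc {
  return { ...alertDoc, executionErrors: errors };
}

const failingAlert = withLatestErrors(
  { id: "alert-1", executionErrors: [] },
  [{ timestamp: new Date(), type: "query", message: "bad SQL" }],
);
const recoveredAlert = withLatestErrors(failingAlert, []);
```

Because the errors are replaced rather than appended, nothing needs to be de-duplicated across AlertHistory documents in this model.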

Screenshots or video

[3 screenshots: 2026-04-17 at 9:30:42 AM, 9:30:29 AM, and 9:30:24 AM]

How to test locally or on Vercel

This must be tested locally, since alerts are not supported in the preview environment.

To see the notification content, run an echo server locally and create a webhook that targets it (http://localhost:3000):

npx http-echo-server

Interesting cases to test:

  • Missing webhook
  • Invalid webhook URL
  • Query error

References

  • Linear Issue: Closes HDX-3990 HDX-2868
  • Related PRs:

vercel Bot commented Apr 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
hyperdx-oss | Ready | Preview, Comment | Apr 17, 2026 2:21pm

changeset-bot Bot commented Apr 17, 2026

🦋 Changeset detected

Latest commit: 700bab2

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@hyperdx/common-utils Patch
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/otel-collector Patch

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Apr 17, 2026

github-actions Bot commented Apr 17, 2026

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (5):
    • packages/api/src/routers/external-api/v2/alerts.ts
    • packages/api/src/tasks/checkAlerts/index.ts
    • packages/api/src/tasks/checkAlerts/providers/default.ts
    • packages/api/src/tasks/checkAlerts/providers/index.ts
    • packages/api/src/tasks/checkAlerts/template.ts
  • Cross-layer change: touches frontend (packages/app) + backend (packages/api) + shared utils (packages/common-utils)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 14
  • Production lines changed: 455 (+ 1076 in test files, excluded from tier calculation)
  • Branch: drew/alert-execution-errors
  • Author: pulpdrew

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

github-actions Bot commented Apr 17, 2026

PR Review

✅ No critical issues found.

This is a well-implemented feature with comprehensive tests (unit, integration, and E2E). A few minor observations:

  • ⚠️ Comment/PR description says "10KB" message truncation but code does message.slice(0, 10000) (10,000 chars ≠ 10,240 bytes). Harmless in practice for ASCII errors but the stated limit is technically incorrect.
  • ⚠️ getPopulatedChannel behavior change (return undefined → throw) is correct for the new error-recording flow, but any code paths not covered by tests that relied on the old undefined return would silently change behavior. Test coverage appears thorough enough that this is low risk.
  • ℹ️ ClickHouse error messages (which may contain SQL/query details derived from user-defined filters) are persisted to MongoDB and exposed via API. Since users only see their own alerts, there's no cross-tenant concern — but worth noting as a tradeoff documented in the PR.
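
The first observation can be checked with a short sketch. Only `message.slice(0, 10000)` is quoted from the review; the helper names below are hypothetical. `slice` counts UTF-16 code units, so the size in bytes depends on which characters the message contains.

```typescript
const MAX_MESSAGE_CHARS = 10000; // the limit quoted in the review

// Hypothetical wrapper around the quoted truncation call.
function truncateMessage(message: string): string {
  return message.slice(0, MAX_MESSAGE_CHARS);
}

// UTF-8 size of a string, computed by hand so the sketch does not depend
// on any runtime-specific encoding API.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
    if (unit < 0x80) {
      bytes += 1;
    } else if (unit < 0x800) {
      bytes += 2;
    } else if (unit >= 0xd800 && unit <= 0xdbff && next >= 0xdc00 && next <= 0xdfff) {
      bytes += 4; // a surrogate pair encodes one 4-byte code point
      i++;
    } else {
      bytes += 3; // other BMP characters (a lone surrogate is also counted as 3)
    }
  }
  return bytes;
}

// Builds a string of n copies of ch.
function repeatChar(ch: string, n: number): string {
  let out = "";
  for (let i = 0; i < n; i++) out += ch;
  return out;
}

const asciiBytes = utf8ByteLength(truncateMessage(repeatChar("a", 12000)));
const accentedBytes = utf8ByteLength(truncateMessage(repeatChar("\u00e9", 12000)));
```

Under these assumptions an ASCII message is capped at 10,000 bytes, just under 10 KB (10,240 bytes), while a message made of two-byte characters can reach 20,000 bytes after truncation.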

The error accumulation flow (query errors → early return + recordAlertErrors, webhook errors → accumulate + pass to updateAlertState, success → pass empty array to clear) is logically sound and well-tested.
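
The flow described above could be sketched roughly as follows. `recordAlertErrors` and `updateAlertState` are named in the review, but their signatures here are assumptions, and the synchronous callbacks stand in for what is presumably asynchronous query and webhook code.

```typescript
type ExecutionError = { type: "query" | "webhook"; message: string };

interface AlertState {
  errors: ExecutionError[];
}

function messageOf(e: unknown): string {
  return e instanceof Error ? e.message : String(e);
}

// Assumed signature: persist the errors on the alert.
function recordAlertErrors(state: AlertState, errors: ExecutionError[]): void {
  state.errors = errors;
}

// Assumed signature: the real function presumably also updates the alert's
// firing state; only the error handling is modeled here.
function updateAlertState(state: AlertState, errors: ExecutionError[]): void {
  state.errors = errors;
}

function executeAlert(
  state: AlertState,
  runQuery: () => void,
  webhooks: Array<() => void>,
): void {
  try {
    runQuery();
  } catch (e) {
    // Query errors: record them and return early.
    recordAlertErrors(state, [{ type: "query", message: messageOf(e) }]);
    return;
  }

  // Webhook errors: accumulate them instead of aborting on the first failure.
  const errors: ExecutionError[] = [];
  for (const send of webhooks) {
    try {
      send();
    } catch (e) {
      errors.push({ type: "webhook", message: messageOf(e) });
    }
  }

  // On full success this is an empty array, which clears earlier errors.
  updateAlertState(state, errors);
}

const demoState: AlertState = { errors: [] };

executeAlert(demoState, () => { throw new Error("bad SQL"); }, []);
const afterQueryFailure = demoState.errors.slice();

executeAlert(demoState, () => {}, [() => { throw new Error("invalid URL"); }, () => {}]);
const afterWebhookFailure = demoState.errors.slice();

executeAlert(demoState, () => {}, [() => {}]);
const afterSuccess = demoState.errors.slice();
```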

@pulpdrew pulpdrew force-pushed the drew/alert-execution-errors branch from b4a599b to 251f25a Compare April 17, 2026 13:42

github-actions Bot commented Apr 17, 2026

E2E Test Results

All tests passed • 147 passed • 3 skipped • 1110s

Status Count
✅ Passed 147
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

@pulpdrew pulpdrew force-pushed the drew/alert-execution-errors branch from 4139b24 to 700bab2 Compare April 17, 2026 14:19
@pulpdrew pulpdrew requested review from a team and karl-power and removed request for a team April 17, 2026 14:29
@pulpdrew pulpdrew merged commit 3c05772 into main Apr 17, 2026
19 checks passed
Copilot AI pushed a commit that referenced this pull request Apr 20, 2026
Co-authored-by: peter-leonov-ch <209667683+peter-leonov-ch@users.noreply.github.com>
kodiakhq Bot pushed a commit that referenced this pull request Apr 20, 2026
## Summary

This PR updates the recent alert runner error persistence + display (#2132) to use hardcoded messages for webhook and unknown-type errors. The raw error messages could contain sensitive information, so we no longer persist them or show them in the UI.
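
A minimal sketch of the idea, assuming a hypothetical `toPersistedMessage` helper and placeholder message strings (neither is taken from the actual change). It also assumes that query errors keep their raw message, since only webhook and unknown-type errors are mentioned as hardcoded.

```typescript
// Assumed error-type labels; the real type names may differ.
type AlertErrorType = "query" | "webhook" | "unknown";

// Hypothetical helper: choose what gets persisted and shown in the UI.
function toPersistedMessage(type: AlertErrorType, rawMessage: string): string {
  switch (type) {
    case "query":
      // Assumption: query errors are still stored as-is.
      return rawMessage;
    case "webhook":
      // Placeholder text; the raw message is dropped because it may
      // contain sensitive details.
      return "Failed to send the webhook notification.";
    case "unknown":
      // Placeholder text for errors of unrecognized origin.
      return "An unknown error occurred while executing the alert.";
  }
}

const webhookMessage = toPersistedMessage(
  "webhook",
  "POST https://example.invalid/hook?token=secret failed",
);
const queryMessage = toPersistedMessage("query", "Syntax error near FROM");
```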

<img width="664" height="183" alt="Screenshot 2026-04-20 at 7 13 57 AM" src="https://github.com/user-attachments/assets/0f4f600b-2cdd-47e5-ba72-cec4dbc40423" />

### How to test locally or on Vercel

This can be tested locally by running an alert with an invalid webhook destination.

Labels

review/tier-4 Critical — deep review + domain expert sign-off

2 participants