feat(eventrecorder): add Kafka output #5246

Open
siavashs wants to merge 2 commits into prometheus:main from siavashs:feat/event-recorder-kafka-integration

@siavashs siavashs commented May 19, 2026

Add a third event recorder destination alongside file and webhook that
produces serialized events to a Kafka topic via franz-go.

Configuration is per-output under event_recorder.outputs:

  event_recorder:
    outputs:
      - type: kafka
        brokers: ["kafka-1:9093", "kafka-2:9093"]
        topic: alertmanager.events
        format: json          # or "protobuf"
        acks: leader          # "none" | "leader" | "all"
        compression: snappy   # "none" | "gzip" | "snappy" | "lz4" | "zstd"
        buffer_size: 1024
        tls_config: { ... }
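
The enumerated values in the example above lend themselves to a small validation step at config load time. The following is a minimal sketch only: `KafkaConfig`, `oneOf`, and `Validate` are hypothetical names chosen for illustration, not the PR's actual types, and the sketch requires every field to be set explicitly rather than applying defaults.

```go
package main

import (
	"errors"
	"fmt"
)

// KafkaConfig mirrors the YAML keys from the example above.
// Hypothetical type: the PR extends its own Output struct instead.
type KafkaConfig struct {
	Brokers     []string
	Topic       string
	Format      string
	Acks        string
	Compression string
	BufferSize  int
}

// oneOf returns an error unless value matches one of the allowed strings.
func oneOf(field, value string, allowed ...string) error {
	for _, a := range allowed {
		if value == a {
			return nil
		}
	}
	return fmt.Errorf("kafka output: %s must be one of %v, got %q", field, allowed, value)
}

// Validate checks required fields first, then the enumerated options.
// Assumption: no defaults are applied; empty enum fields are rejected.
func (c KafkaConfig) Validate() error {
	if len(c.Brokers) == 0 {
		return errors.New("kafka output: at least one broker is required")
	}
	if c.Topic == "" {
		return errors.New("kafka output: topic is required")
	}
	if c.BufferSize <= 0 {
		return errors.New("kafka output: buffer_size must be positive")
	}
	// errors.Join returns nil when every check passes.
	return errors.Join(
		oneOf("format", c.Format, "json", "protobuf"),
		oneOf("acks", c.Acks, "none", "leader", "all"),
		oneOf("compression", c.Compression, "none", "gzip", "snappy", "lz4", "zstd"),
	)
}
```

Joining the enum errors reports every invalid option in one pass, which tends to be friendlier for operators than failing on the first one.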

Implementation notes:

  • KafkaOutput buffers events in a bounded local channel and forwards them to franz-go's async producer. When the buffer is full, events are dropped (counted via alertmanager_event_output_drops_total) so a slow or unreachable broker cannot block the upstream pipeline.
  • Broker unreachability at startup is logged at warn level and does not prevent Alertmanager from starting; franz-go retries connections in the background.
  • Records use the producing instance's hostname as the message key, keeping per-instance ordering on the same partition.
  • A new optional ProtoDestination interface lets the Kafka output receive protobuf events directly, skipping JSON serialization when no JSON-mode destination is configured.
  • JSON marshalling in marshalAndSend is now lazy: it only happens when at least one non-proto destination needs it.
  • TLS is supported via prometheus/common's TLSConfig (mTLS or server-only). SASL is intentionally out of scope for this change and can be added later via franz-go's kgo.SASL options.
  • Idempotent writes are disabled unless acks=all is explicitly set, because franz-go only allows idempotent production with acks=all; this keeps the default leader-ack path working.
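
The drop-on-full behaviour from the first note can be illustrated with a non-blocking channel send. This is a sketch under assumptions: `boundedQueue` and `enqueue` are hypothetical names, and the atomic counter merely stands in for alertmanager_event_output_drops_total.

```go
package main

import "sync/atomic"

// boundedQueue is a hypothetical stand-in for the buffer inside KafkaOutput.
type boundedQueue struct {
	ch    chan []byte
	drops atomic.Int64 // stands in for the drops counter metric
}

// newBoundedQueue allocates a queue holding at most size pending events.
func newBoundedQueue(size int) *boundedQueue {
	return &boundedQueue{ch: make(chan []byte, size)}
}

// enqueue never blocks the caller. When the buffer is full the event is
// dropped, the drop counter is incremented, and false is returned.
func (q *boundedQueue) enqueue(event []byte) bool {
	select {
	case q.ch <- event:
		return true
	default:
		q.drops.Add(1)
		return false
	}
}
```

The `default` branch is what keeps a slow or unreachable broker from applying backpressure to the upstream pipeline: the cost of a full buffer is a lost event, not a stalled caller.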

Metric changes:

  • Rename alertmanager_event_webhook_drops_total -> alertmanager_event_output_drops_total{output}, shared by webhook and kafka outputs. This is a breaking metric rename; dashboards and alerts referencing the old name need to be updated.
  • Add alertmanager_event_kafka_produce_errors_total{output,error_type} populated from franz-go's produce callback.

Testing:

  • Unit tests use github.com/twmb/franz-go/pkg/kfake to spin up an in-process broker for JSON, protobuf, message-key, drop-on-full, close-flush, initial-ping-failure, name-stability, and config validation cases, plus a proto fast-path test against marshalAndSend.
  • A docker-compose example under examples/kafka/ provides a single-node Apache Kafka (KRaft) broker for manual end-to-end verification with a matching alertmanager.yml.

Dependencies added:

  • github.com/twmb/franz-go
  • github.com/twmb/franz-go/pkg/kfake (test)
  • github.com/twmb/franz-go/plugin/kslog

Pull Request Checklist

Please check all the applicable boxes.

  • Please list all open issue(s) discussed with maintainers related to this change
    • Fixes #
  • Is this a new Receiver integration?
  • Is this a bugfix?
    • I have added tests that can reproduce the bug which pass with this bugfix applied
  • Is this a new feature?
    • I have added tests that test the new feature's functionality
  • Does this change affect performance?
    • I have provided benchmarks comparison that shows performance is improved or is not degraded
      • You can use benchstat to compare benchmarks
    • I have added new benchmarks if required or requested by maintainers
  • Is this a breaking change?
    • My changes do not break the existing cluster messages
    • My changes do not break the existing api
  • I have added/updated the required documentation
  • I have signed-off my commits
  • I will follow best practices for contributing to this project

Which user-facing changes does this PR introduce?

[CHANGE] Rename alertmanager_event_webhook_drops_total to alertmanager_event_output_drops_total{output}

Summary by CodeRabbit

  • New Features

    • Kafka added as an event-recorder output supporting JSON and Protobuf payloads, with configurable brokers, topic, client ID, acks, compression, buffering, and optional TLS.
    • Metrics: new Kafka produce error metric and shared per-output dropped-events metric.
  • Documentation

    • New configuration docs describing the event recorder, its outputs (file, webhook, kafka) and related settings (timeouts, retries, backoff, log rotation).

@siavashs siavashs requested a review from a team as a code owner May 19, 2026 11:37

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between 590738d and 510008b.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • docs/configuration.md
  • eventrecorder/config.go
  • eventrecorder/eventrecorder.go
  • eventrecorder/eventrecorder_test.go
  • eventrecorder/kafka.go
  • eventrecorder/kafka_test.go
  • go.mod
✅ Files skipped from review due to trivial changes (1)
  • docs/configuration.md
🚧 Files skipped from review as they are similar to previous changes (6)
  • eventrecorder/eventrecorder_test.go
  • go.mod
  • eventrecorder/kafka_test.go
  • eventrecorder/eventrecorder.go
  • eventrecorder/kafka.go
  • eventrecorder/config.go

Walkthrough

Adds Kafka output to the event recorder: docs and config schema, YAML parsing/validation, KafkaOutput implemented with franz-go (bounded enqueue, async produce, error classification, Close draining), proto fast-path integration, metrics, and tests.

Changes

Kafka Event Recorder Output

| Layer | File(s) | Summary |
|---|---|---|
| Configuration, docs, and YAML parsing | docs/configuration.md, eventrecorder/config.go, eventrecorder/eventrecorder_test.go | Adds Kafka-related docs and constants, extends Output with Kafka fields (brokers, topic, client_id, format, acks, compression, buffer_size, tls_config), implements UnmarshalYAML/validateKafka and order-independent broker equality, and tests parsing/defaults/error cases. |
| Event recorder dispatcher and metrics | eventrecorder/eventrecorder.go, eventrecorder/eventrecorder_test.go | Adds OutputKafka and ProtoDestination interface; refactors metrics (shared output_drops and kafka_produce_errors), wires metrics/instance into buildOutputs, and rewrites marshalAndSend to use proto fast-path or lazily JSON-marshal once for fan-out. |
| Kafka producer implementation and tests | eventrecorder/kafka.go, eventrecorder/kafka_test.go | Implements KafkaOutput via franz-go with configurable acks/compression/TLS, best-effort startup ping, bounded channel with drop-on-overflow (counter+warn), single dispatcher producing asynchronously with classified error metrics, and Close draining+flush. Tests cover JSON/proto sends, keying, drops, Close flush, unreachable broker behavior, config validation, deterministic naming, and proto fast-path. |
| Go module updates | go.mod | Adds direct dependency on github.com/twmb/franz-go (including kfake and kslog) and github.com/prometheus/client_model, bumps golang.org/x/text, and updates several indirect pins required by franz-go. |
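
The "proto fast-path or lazily JSON-marshal once for fan-out" behaviour described for marshalAndSend can be sketched as follows. Everything here is an assumption made for illustration: `Event` stands in for the real protobuf message, the `Destination` and `ProtoDestination` method signatures are guesses, and `fanOut`, `jsonSink`, and `protoSink` are hypothetical names.

```go
package main

import "encoding/json"

// Event is a hypothetical stand-in for the recorder's protobuf event.
type Event struct {
	Type     string `json:"type"`
	Instance string `json:"instance"`
}

// Destination receives JSON-encoded events (assumed signature).
type Destination interface {
	Send(data []byte) error
}

// ProtoDestination is the optional fast-path interface (assumed signature).
type ProtoDestination interface {
	SendProto(ev *Event) error
}

// jsonSink and protoSink are in-memory destinations for demonstration.
type jsonSink struct{ got [][]byte }

func (s *jsonSink) Send(data []byte) error { s.got = append(s.got, data); return nil }

type protoSink struct{ got []*Event }

func (s *protoSink) Send(data []byte) error    { return nil }
func (s *protoSink) SendProto(ev *Event) error { s.got = append(s.got, ev); return nil }

// fanOut delivers ev to every destination. Proto-capable destinations get
// the event directly; JSON is marshalled at most once, and only if some
// destination needs it. It returns how many times JSON was marshalled.
func fanOut(ev *Event, dests []Destination) (int, error) {
	var payload []byte
	marshals := 0
	for _, d := range dests {
		if pd, ok := d.(ProtoDestination); ok {
			if err := pd.SendProto(ev); err != nil {
				return marshals, err
			}
			continue
		}
		if payload == nil {
			b, err := json.Marshal(ev)
			if err != nil {
				return marshals, err
			}
			payload = b
			marshals++
		}
		if err := d.Send(payload); err != nil {
			return marshals, err
		}
	}
	return marshals, nil
}
```

With only proto-capable destinations configured, no JSON encoding happens at all; with a mix, the JSON payload is built once and shared across the JSON destinations.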

Sequence Diagram(s)

sequenceDiagram
  participant EventRecorder
  participant KafkaOutput
  participant Dispatcher
  participant FranzGo as Franz-Go Client
  participant Kafka
  EventRecorder->>KafkaOutput: SendProto(event)
  KafkaOutput->>KafkaOutput: enqueue(marshaled bytes)
  alt buffer full
    KafkaOutput->>KafkaOutput: increment drops counter
  else buffer has space
    KafkaOutput->>Dispatcher: signal work available
  end
  Dispatcher->>FranzGo: Produce record keyed by instance
  FranzGo->>Kafka: produce to topic with acks/compression
  Kafka-->>FranzGo: callback (error or success)
  FranzGo->>Dispatcher: update metrics (errors or success)
  EventRecorder->>KafkaOutput: Close()
  KafkaOutput->>Dispatcher: stop signal
  Dispatcher->>FranzGo: drain queued records
  FranzGo->>Kafka: flush within budget
  FranzGo-->>KafkaOutput: flush result
  KafkaOutput->>EventRecorder: closed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • prometheus/alertmanager#5190: Overlaps changes to eventrecorder/config.go (Output.UnmarshalYAML, configEqual) and output handling constants.
  • prometheus/alertmanager#5189: Related edits to eventrecorder configuration/types and YAML unmarshalling that intersect with these Kafka additions.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.72% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'feat(eventrecorder): add Kafka output' directly and concisely summarizes the main change: adding Kafka support to the event recorder. |
| Description check | ✅ Passed | The description addresses all critical template items: identifies it as a new feature with added tests, documents changes, indicates signed-off commits, and includes release notes with a breaking change notice. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/configuration.md`:
- Line 2098: Replace the inconsistent placeholder `<secret_url>` with the
documented placeholder `<secret>` in the configuration examples (specifically
the webhook URL entry shown as `url: <secret_url>`); locate the occurrence of
`<secret_url>` in the docs/configuration.md snippet and update it to `url:
<secret>` so it matches the schema placeholder `<secret>` used elsewhere in the
document.

In `@eventrecorder/kafka.go`:
- Around line 151-156: The synchronous startup call to client.Ping using
pingCtx/defaultKafkaPingTimeout blocks initialization; remove the blocking ping
and instead spawn a background goroutine that performs the ping with its own
context.WithTimeout, calls client.Ping inside the goroutine, logs the same
warning (using logger.Warn, "output", name, "err", pingErr) on failure, and
ensures the context cancel is called inside the goroutine to avoid leaks; keep
the main init path non-blocking and do not call cancel() immediately from the
main thread.
ℹ️ Review info
📥 Commits

Reviewing files that changed from the base of the PR and between f493986 and 72da8ae.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (7)
  • docs/configuration.md
  • eventrecorder/config.go
  • eventrecorder/eventrecorder.go
  • eventrecorder/eventrecorder_test.go
  • eventrecorder/kafka.go
  • eventrecorder/kafka_test.go
  • go.mod

Comment thread docs/configuration.md Outdated
Comment thread eventrecorder/kafka.go Outdated
@siavashs siavashs force-pushed the feat/event-recorder-kafka-integration branch from 72da8ae to 590738d Compare May 19, 2026 11:47
siavashs added 2 commits May 19, 2026 13:52
Document the event recorder configuration including all supported outputs.

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
Add a third event recorder destination alongside file and webhook that
produces serialized events to a Kafka topic via franz-go.

Configuration is per-output under `event_recorder.outputs`:

  event_recorder:
    outputs:
      - type: kafka
        brokers: ["kafka-1:9093", "kafka-2:9093"]
        topic: alertmanager.events
        format: json          # or "protobuf"
        acks: leader          # "none" | "leader" | "all"
        compression: snappy   # "none" | "gzip" | "snappy" | "lz4" | "zstd"
        buffer_size: 1024
        tls_config: { ... }

Implementation notes:

- KafkaOutput buffers events in a bounded local channel and forwards
  them to franz-go's async producer.  When the buffer is full, events
  are dropped (counted via alertmanager_event_output_drops_total) so a
  slow or unreachable broker cannot block the upstream pipeline.
- Broker unreachability at startup is logged at warn level and does
  not prevent Alertmanager from starting; franz-go retries connections
  in the background.
- Records use the producing instance's hostname as the message key,
  keeping per-instance ordering on the same partition.
- A new optional ProtoDestination interface lets the Kafka output
  receive protobuf events directly, skipping JSON serialization when
  no JSON-mode destination is configured.
- JSON marshalling in marshalAndSend is now lazy: it only happens
  when at least one non-proto destination needs it.
- TLS is supported via prometheus/common's TLSConfig (mTLS or
  server-only).  SASL is intentionally out of scope for this change
  and can be added later via franz-go's kgo.SASL options.
- Idempotent writes are disabled unless acks=all is explicitly set,
  to keep the default leader-ack path compatible with franz-go.

Metric changes:

- Rename alertmanager_event_webhook_drops_total ->
  alertmanager_event_output_drops_total{output}, shared by webhook and
  kafka outputs.  This is a breaking metric rename; dashboards and
  alerts referencing the old name need to be updated.
- Add alertmanager_event_kafka_produce_errors_total{output,error_type}
  populated from franz-go's produce callback.

Testing:

- Unit tests use github.com/twmb/franz-go/pkg/kfake to spin an
  in-process broker for JSON, protobuf, message-key, drop-on-full,
  close-flush, initial-ping-failure, name-stability, and config
  validation cases, plus a proto fast-path test against marshalAndSend.

Dependencies added:
  github.com/twmb/franz-go
  github.com/twmb/franz-go/pkg/kfake (test)
  github.com/twmb/franz-go/plugin/kslog

Signed-off-by: Siavash Safi <siavash@cloudflare.com>
@siavashs siavashs force-pushed the feat/event-recorder-kafka-integration branch from 590738d to 510008b Compare May 19, 2026 11:54