api,metrics: add changefeed operation history #5095

Merged
ti-chi-bot[bot] merged 1 commit into pingcap:master from wlwilliamx:codex/changefeed-operation-history
May 19, 2026

Conversation

@wlwilliamx wlwilliamx commented May 19, 2026

What problem does this PR solve?

Issue Number: close #5087

What is changed and how it works?

  • Add structured audit logs for public changefeed mutation APIs: create, update, pause, resume, and delete.
  • Keep a bounded in-memory metric history for the latest 100 operations so the default Prometheus dashboard can show recent investigation context without requiring a log datasource.
  • Add a Changefeed Operation History table panel with operation time, result, username, non-sensitive details, and error summary.
  • Include operation-specific summaries such as changed update fields, resume checkpoint overwrite state, and delete pre-state/checkpoint context.
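
The bounded history described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `operationRecord`, `boundedHistory`, and the `onEvict` callback are hypothetical names, and the callback only stands in for the step where the real store would delete the evicted operation's metric series.

```go
package main

import (
	"fmt"
	"sync"
)

// operationRecord is a hypothetical, simplified record of one operation.
type operationRecord struct {
	EventID   uint64
	Operation string
	Result    string
}

// boundedHistory keeps only the most recent `capacity` records.
type boundedHistory struct {
	mu       sync.Mutex
	capacity int
	nextID   uint64
	records  []operationRecord
	// onEvict is an assumed hook; a real store could delete the
	// corresponding metric series here.
	onEvict func(operationRecord)
}

func newBoundedHistory(capacity int, onEvict func(operationRecord)) *boundedHistory {
	return &boundedHistory{capacity: capacity, onEvict: onEvict}
}

// Add appends a record and evicts the oldest ones while over capacity.
func (h *boundedHistory) Add(operation, result string) operationRecord {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.nextID++
	rec := operationRecord{EventID: h.nextID, Operation: operation, Result: result}
	h.records = append(h.records, rec)
	for len(h.records) > h.capacity {
		oldest := h.records[0]
		// Shift in place so the backing array does not grow without bound.
		copy(h.records, h.records[1:])
		h.records = h.records[:len(h.records)-1]
		if h.onEvict != nil {
			h.onEvict(oldest)
		}
	}
	return rec
}

// Len reports how many records are currently retained.
func (h *boundedHistory) Len() int {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.records)
}

func main() {
	evicted := 0
	h := newBoundedHistory(100, func(operationRecord) { evicted++ })
	for i := 0; i < 105; i++ {
		h.Add("pause", "success")
	}
	fmt.Println(h.Len(), evicted) // prints "100 5"
}
```

Evicting inside `Add` keeps the retained set and any derived metric series in step, at the cost of running the callback while the lock is held.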

Check List

Tests

  • Unit test
  • Manual test
    [Screenshot: CleanShot 2026-05-19 at 14 41 43@2x]

Questions

Will it cause performance regression or break compatibility?

No compatibility change. The dashboard-facing metric cache is bounded to the latest 100 operations to avoid unbounded cardinality growth.

Do you need to update user documentation, design documentation or monitoring documentation?

The Grafana dashboard is updated in this PR. No separate user or design documentation change is required.

Release note

Add a Changefeed Operation History panel to help investigate recent user-initiated changefeed operations.

Summary by CodeRabbit

Release Notes

  • New Features
    • Changefeed operations (create, update, pause, resume, delete) are now recorded and tracked with metrics, capturing success/failure status, authenticated user, operation timing, and relevant details.
    • Added "Changefeed Operation History" dashboard panels to Grafana for monitoring recent changefeed operations across all environments.

@ti-chi-bot ti-chi-bot Bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label May 19, 2026

coderabbitai Bot commented May 19, 2026

Walkthrough

This PR implements changefeed operation recording infrastructure for oncall investigation. A new middleware captures user-initiated changefeed mutations (create, update, pause, resume, delete), records timestamps and metadata via a bounded in-memory store and Prometheus gauge, and exposes the history in Grafana dashboard panels. All mutating API handlers are instrumented to populate operation-specific metadata.
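
As a rough illustration of the middleware pattern in this walkthrough, the sketch below uses only `net/http` rather than the project's actual web framework. Every name in it (`operationMiddleware`, `setOperationDetails`, `recordedOperation`, the status-code rule for "failed") is an assumption made for the example, not the PR's API.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type operationInfoKey struct{}

// operationInfo is filled in by downstream handlers (hypothetical shape).
type operationInfo struct {
	Changefeed string
	Details    string
}

// recordedOperation is what the middleware hands to the recorder.
type recordedOperation struct {
	Operation, Result, Changefeed, Details string
	Duration                               time.Duration
}

// statusRecorder remembers the status code written by the handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// setOperationDetails lets a handler attach target and details to the request.
func setOperationDetails(ctx context.Context, changefeed, details string) {
	if info, ok := ctx.Value(operationInfoKey{}).(*operationInfo); ok {
		info.Changefeed = changefeed
		info.Details = details
	}
}

// operationMiddleware times the request and records its outcome.
// Assumption: any status >= 400 counts as "failed".
func operationMiddleware(operation string, record func(recordedOperation), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		info := &operationInfo{}
		ctx := context.WithValue(r.Context(), operationInfoKey{}, info)
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r.WithContext(ctx))
		result := "success"
		if rec.status >= 400 {
			result = "failed"
		}
		record(recordedOperation{
			Operation: operation, Result: result,
			Changefeed: info.Changefeed, Details: info.Details,
			Duration: time.Since(start),
		})
	})
}

// runDemo sends one fake pause request whose handler replies with `status`.
func runDemo(status int) recordedOperation {
	var got recordedOperation
	handler := operationMiddleware("pause", func(op recordedOperation) { got = op },
		http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			setOperationDetails(r.Context(), "cf-1", "state=normal")
			w.WriteHeader(status)
		}))
	req := httptest.NewRequest(http.MethodPost, "/changefeeds/cf-1/pause", nil)
	handler.ServeHTTP(httptest.NewRecorder(), req)
	return got
}

func main() {
	op := runDemo(http.StatusInternalServerError)
	fmt.Println(op.Operation, op.Result, op.Changefeed, op.Details)
}
```

Passing a pointer through the request context lets the handler enrich the record without the middleware knowing anything about individual operations.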

Changes

Changefeed Operation Recording & Monitoring

| Layer | File(s) | Summary |
|---|---|---|
| Prometheus metric gauge definition and registration | pkg/metrics/changefeed.go | Introduces ChangefeedOperationTimeGauge with labels for keyspace, changefeed, operation, result, username, details, error, and event_id; registers the gauge during metrics initialization. |
| Core middleware handler, bounded operation store, and context helpers | api/middleware/changefeed_operation.go | Implements middleware that captures request timing, outcome ("success"/"failed"), and metadata; maintains a mutex-protected bounded store of recent operations with automatic oldest-entry eviction and metric cleanup; provides context setters for downstream handlers to attach operation target and details. |
| Middleware wiring into v1 and v2 API routes | api/v1/api.go, api/v2/api.go | Integrates ChangefeedOperationMiddleware into all mutating changefeed routes (create, update, pause, resume, delete) for both API versions, positioned before downstream handlers. |
| Handler instrumentation to populate operation metadata | api/v2/changefeed.go | Updates CreateChangefeed, DeleteChangefeed, PauseChangefeed, ResumeChangefeed, and UpdateChangefeed to call SetChangefeedOperationTarget and SetChangefeedOperationDetails, capturing keyspace/changefeed identity and operation-specific details (timestamps, state transitions, field change flags). |
| Unit tests for middleware, store behavior, and normalization | api/middleware/changefeed_operation_test.go | Tests verify successful and failed operation recording, bounded store eviction with Prometheus metric cleanup, and text normalization (whitespace compaction, length limits, ellipsis truncation); includes helpers to reset state and inspect gauge values. |
| Grafana dashboard panels for operation history visualization | metrics/grafana/ticdc_new_arch.json, metrics/nextgengrafana/ticdc_new_arch_next_gen.json, metrics/nextgengrafana/ticdc_new_arch_with_keyspace_name.json | Adds "Changefeed Operation History" table panels across three dashboard variants, each querying ticdc_owner_changefeed_operation_time and transforming Prometheus labels to display operation time, changefeed, result, username, and concise details. |
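
The text normalization that the tests above exercise (whitespace compaction, a length limit, ellipsis truncation) might look something like the sketch below. The function name `normalizeLabelText` and the limit of 32 runes are assumptions chosen for illustration; the PR's actual helper and limits are not shown in this conversation.

```go
package main

import (
	"fmt"
	"strings"
)

// maxLabelLen is an assumed limit, not the value used in the PR.
const maxLabelLen = 32

// normalizeLabelText collapses runs of whitespace into single spaces and
// truncates the result to at most maxLen runes, ending in "..." when cut.
func normalizeLabelText(s string, maxLen int) string {
	compact := strings.Join(strings.Fields(s), " ")
	runes := []rune(compact)
	if len(runes) <= maxLen {
		return compact
	}
	const ellipsis = "..."
	if maxLen <= len(ellipsis) {
		// Too short to fit an ellipsis; hard-cut instead.
		return string(runes[:maxLen])
	}
	return string(runes[:maxLen-len(ellipsis)]) + ellipsis
}

func main() {
	fmt.Println(normalizeLabelText("  sink-uri \n changed\t config  ", maxLabelLen))
	fmt.Println(normalizeLabelText(strings.Repeat("x", 50), maxLabelLen))
}
```

Truncating by runes rather than bytes avoids splitting a multi-byte character in the middle of a label value.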

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • pingcap/ticdc#4507: Modifies Grafana dashboard JSONs to fix panel layout and ID conflicts, which may overlap with the dashboard metadata added by this PR.

Suggested labels

lgtm, approved, size/XL

Suggested reviewers

  • lidezhu
  • wk989898
  • asddongmen

🐰 A middleware weaves through the routes so keen,
Recording every changefeed's dream,
Create, pause, resume, update, delete—
Now Grafana shows the complete feat,
Oncall sleuths can solve the mystery scene!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 56.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'api,metrics: add changefeed operation history' clearly and concisely describes the main changes, accurately reflecting the addition of changefeed operation tracking to APIs and metrics. |
| Linked Issues check | ✅ Passed | The PR successfully implements all coding requirements from issue #5087: middleware for audit logs, bounded metric history (100 operations), Grafana panels with operation details, operation-specific summaries, and avoids unbounded cardinality. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to the stated objectives: middleware implementation, metrics definition, API route integration, handler context enrichment, and Grafana dashboard updates; no unrelated modifications detected. |
| Description check | ✅ Passed | The PR description includes all required template sections: issue number, what changed and how it works, tests (unit test + manual test with screenshots), answers to compatibility and documentation questions, and a release note. |

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 19, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a new middleware to audit user-initiated changefeed operations, such as creation, updates, and deletions. It records these events in a bounded in-memory store and exports them via a new Prometheus metric, changefeed_operation_time, which is integrated into the TiCDC Grafana dashboards. Review feedback highlighted a likely compilation error due to a missing package prefix on ClientVersionHeader, suggested broader username detection beyond Basic Auth, and cautioned against potential Prometheus cardinality issues resulting from the inclusion of a unique event ID in metric labels.

zap.String("username", username),
zap.String("ip", c.ClientIP()),
zap.String("userAgent", c.Request.UserAgent()),
zap.String("clientVersion", c.Request.Header.Get(ClientVersionHeader)),

high

The constant ClientVersionHeader appears to be missing the api. prefix. Based on the imports and the usage of other constants like api.APIOpVarKeyspace in this file, it is likely defined in the github.com/pingcap/ticdc/pkg/api package.

Suggested change
- zap.String("clientVersion", c.Request.Header.Get(ClientVersionHeader)),
+ zap.String("clientVersion", c.Request.Header.Get(api.ClientVersionHeader)),

Comment on lines +127 to +130
username, _, _ := c.Request.BasicAuth()
if username == "" {
username = "anonymous"
}

medium

The middleware currently only attempts to retrieve the username via Basic Auth. If the API supports other authentication methods (e.g., token-based or certificate-based auth) that are handled by authenticateMiddleware, the username might be stored in the Gin context rather than the Authorization header. It is recommended to also check the context for a user object to ensure the audit log accurately identifies the requester.
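
One way to read this suggestion is a lookup that prefers an identity stored by an earlier authentication step and falls back to Basic Auth. The sketch below uses plain `net/http` and a hypothetical context key (`authenticatedUserKey`); how authenticateMiddleware actually stores the user, if it does at all, is not shown here and is purely an assumption.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// authenticatedUserKey is a hypothetical context key that an earlier
// authentication step is assumed to populate.
type authenticatedUserKey struct{}

// operationUsername prefers the context identity, then Basic Auth,
// and finally falls back to "anonymous".
func operationUsername(r *http.Request) string {
	if name, ok := r.Context().Value(authenticatedUserKey{}).(string); ok && name != "" {
		return name
	}
	if name, _, ok := r.BasicAuth(); ok && name != "" {
		return name
	}
	return "anonymous"
}

// demoRequest builds a request with optional Basic Auth and context users.
func demoRequest(basicUser, ctxUser string) *http.Request {
	req := httptest.NewRequest(http.MethodPost, "/changefeeds/cf-1/pause", nil)
	if basicUser != "" {
		req.SetBasicAuth(basicUser, "secret")
	}
	if ctxUser != "" {
		ctx := context.WithValue(req.Context(), authenticatedUserKey{}, ctxUser)
		req = req.WithContext(ctx)
	}
	return req
}

func main() {
	fmt.Println(operationUsername(demoRequest("", "")))
	fmt.Println(operationUsername(demoRequest("alice", "")))
	fmt.Println(operationUsername(demoRequest("alice", "token-user")))
}
```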

username: normalizeChangefeedOperationMetricText(username),
details: normalizeChangefeedOperationMetricText(info.details),
err: normalizeChangefeedOperationMetricError(operationErr),
eventID: fmt.Sprintf("%d", eventID),

medium

Using a unique event_id as a Prometheus label for every request is a known anti-pattern that leads to high cardinality. While the in-memory store in the coordinator is bounded to 100 entries and explicitly deletes old series from the exporter, Prometheus will still record every unique series in its index, which can cause memory pressure and slow down queries over time if the operation rate is high. Consider if the event_id is strictly necessary for the dashboard or if the history could be managed differently (e.g., using a fixed set of 'slot' labels to keep the number of series constant).
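
The "slot" idea mentioned in this comment could be sketched as below: a monotonically increasing event ID is mapped onto a fixed set of label values, so the slot label never takes more than `historySlots` distinct values. The names and the modulo scheme are assumptions for illustration only. Note that this bounds just the slot label; other per-operation labels such as details or error would still create new series unless they were handled separately.

```go
package main

import (
	"fmt"
	"strconv"
)

// historySlots mirrors the 100-entry bound mentioned in the review;
// using it as a label-value budget is an assumption of this sketch.
const historySlots = 100

// slotLabel maps an event ID onto one of historySlots reusable label values.
func slotLabel(eventID uint64) string {
	return strconv.FormatUint(eventID%historySlots, 10)
}

// distinctSlots counts how many different slot labels the first n events use.
func distinctSlots(n uint64) int {
	seen := make(map[string]struct{})
	for id := uint64(0); id < n; id++ {
		seen[slotLabel(id)] = struct{}{}
	}
	return len(seen)
}

func main() {
	fmt.Println(slotLabel(205))      // prints "5"
	fmt.Println(distinctSlots(1000)) // prints "100"
}
```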

@wlwilliamx
Collaborator Author

/test all

@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels May 19, 2026
@ti-chi-bot ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels May 19, 2026

ti-chi-bot Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lidezhu, wk989898

ti-chi-bot Bot commented May 19, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-19 06:57:44.103503822 +0000 UTC m=+247393.607634498: ☑️ agreed by lidezhu.
  • 2026-05-19 07:02:43.879679885 +0000 UTC m=+247693.383810561: ☑️ agreed by wk989898.

@ti-chi-bot ti-chi-bot Bot merged commit 3a652c1 into pingcap:master May 19, 2026
23 of 25 checks passed
@wlwilliamx wlwilliamx added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label May 20, 2026
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #5105.
But this PR has conflicts, please resolve them!

Labels

  • approved
  • lgtm
  • needs-cherry-pick-release-8.5: Should cherry pick this PR to release-8.5 branch.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Add changefeed operation history to Grafana dashboard

4 participants