api,metrics: add changefeed operation history#5095
Conversation
📝 WalkthroughWalkthroughThis PR implements changefeed operation recording infrastructure for oncall investigation. A new middleware captures user-initiated changefeed mutations (create, update, pause, resume, delete), records timestamps and metadata via a bounded in-memory store and Prometheus gauge, and exposes the history in Grafana dashboard panels. All mutating API handlers are instrumented to populate operation-specific metadata. ChangesChangefeed Operation Recording & Monitoring
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request implements a new middleware to audit user-initiated changefeed operations, such as creation, updates, and deletions. It records these events in a bounded in-memory store and exports them via a new Prometheus metric, changefeed_operation_time, which is integrated into the TiCDC Grafana dashboards. Review feedback highlighted a likely compilation error due to a missing package prefix on ClientVersionHeader, suggested broader username detection beyond Basic Auth, and cautioned against potential Prometheus cardinality issues resulting from the inclusion of a unique event ID in metric labels.
| zap.String("username", username), | ||
| zap.String("ip", c.ClientIP()), | ||
| zap.String("userAgent", c.Request.UserAgent()), | ||
| zap.String("clientVersion", c.Request.Header.Get(ClientVersionHeader)), |
There was a problem hiding this comment.
The constant ClientVersionHeader appears to be missing the api. prefix. Based on the imports and the usage of other constants like api.APIOpVarKeyspace in this file, it is likely defined in the github.com/pingcap/ticdc/pkg/api package.
| zap.String("clientVersion", c.Request.Header.Get(ClientVersionHeader)), | |
| zap.String("clientVersion", c.Request.Header.Get(api.ClientVersionHeader)), |
| username, _, _ := c.Request.BasicAuth() | ||
| if username == "" { | ||
| username = "anonymous" | ||
| } |
There was a problem hiding this comment.
The middleware currently only attempts to retrieve the username via Basic Auth. If the API supports other authentication methods (e.g., token-based or certificate-based auth) that are handled by authenticateMiddleware, the username might be stored in the Gin context rather than the Authorization header. It is recommended to also check the context for a user object to ensure the audit log accurately identifies the requester.
| username: normalizeChangefeedOperationMetricText(username), | ||
| details: normalizeChangefeedOperationMetricText(info.details), | ||
| err: normalizeChangefeedOperationMetricError(operationErr), | ||
| eventID: fmt.Sprintf("%d", eventID), |
There was a problem hiding this comment.
Using a unique event_id as a Prometheus label for every request is a known anti-pattern that leads to high cardinality. While the in-memory store in the coordinator is bounded to 100 entries and explicitly deletes old series from the exporter, Prometheus will still record every unique series in its index, which can cause memory pressure and slow down queries over time if the operation rate is high. Consider if the event_id is strictly necessary for the dashboard or if the history could be managed differently (e.g., using a fixed set of 'slot' labels to keep the number of series constant).
|
/test all |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: lidezhu, wk989898 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
In response to a cherrypick label: new pull request created to branch |
What problem does this PR solve?
Issue Number: close #5087
What is changed and how it works?
Changefeed Operation Historytable panel with operation time, result, username, non-sensitive details, and error summary.Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No compatibility change. The dashboard-facing metric cache is bounded to the latest 100 operations to avoid unbounded cardinality growth.
Do you need to update user documentation, design documentation or monitoring documentation?
The Grafana dashboard is updated in this PR. No separate user or design documentation change is required.
Release note
Summary by CodeRabbit
Release Notes