pkg/eventservice: use checkpointTs as dispatcher runtime lower bound #4448
AceShowHand wants to merge 4 commits into
Summary of Changes
This pull request focuses on improving the system's debuggability by adding more detailed logging statements throughout the event processing and dispatching pipeline. The changes capture richer contextual information, such as timestamps and state values, at critical points, to help diagnose panics or unexpected behavior within the event store and related components without altering the core logic.
📝 Walkthrough
Dispatcher reset and handshake logic now uses the checkpoint timestamp (checkpointTs) as the lower bound and as the reference for schema lookups, logging, and handshake emission. dispatcherStat initialization/reset was consolidated into resetLowerBound(checkpointTs); checkpoint propagation via copyStatistics was removed. Tests were updated and added to validate these behaviors.
Sequence Diagram(s)
sequenceDiagram
participant Client as Client
participant Broker as EventBroker\n(pkg/eventservice)
participant Dispatcher as DispatcherStat
participant TableInfo as TableInfoCache
participant Downstream as DownstreamConsumer
rect rgba(200,230,255,0.5)
Client->>Broker: request resetDispatcher(table, dispatcherInfo)
Broker->>Dispatcher: read oldCheckpointTs (oldStat.checkpointTs)\nread newCheckpointTs (dispatcherInfo.GetStartTs() -> newCheckpointTs)
end
rect rgba(200,255,200,0.5)
Broker->>TableInfo: lookup tableInfo(using newCheckpointTs)
TableInfo-->>Broker: schema/table metadata
Broker->>Dispatcher: resetLowerBound(newCheckpointTs)
Dispatcher-->>Broker: ack reset
end
rect rgba(255,230,200,0.5)
Broker->>Downstream: sendHandshakeIfNeed(checkpointTs)
Downstream-->>Broker: confirm handshake received
Broker->>Client: reset complete (logs include requestStartTs, oldCheckpointTs, newCheckpointTs)
end
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
/test all |
Code Review
This pull request adds several logging statements to aid in debugging the event store. The changes are generally helpful for diagnostics. I've identified a few areas for improvement related to logging efficiency and code duplication. Specifically, using zap.Stringer instead of zap.Any for DispatcherID would be more performant, and some duplicated logging code in event_broker.go could be refactored for better maintainability.
log.Info("before copy statistics when reset dispatcher",
	zap.Stringer("changefeedID", changefeedID),
	zap.Stringer("dispatcherID", dispatcherID),
	zap.Uint64("requestStartTs", dispatcherInfo.GetStartTs()),
	zap.Uint64("oldCheckpointTs", oldStat.checkpointTs.Load()),
	zap.Uint64("oldLastScannedCommitTs", oldStat.lastScannedCommitTs.Load()),
	zap.Uint64("newCheckpointTs", newStat.checkpointTs.Load()),
	zap.Uint64("newLastScannedCommitTs", newStat.lastScannedCommitTs.Load()),
	zap.Uint64("oldEpoch", oldStat.epoch),
	zap.Uint64("newEpoch", newStat.epoch))
newStat.copyStatistics(oldStat)
log.Info("after copy statistics when reset dispatcher",
	zap.Stringer("changefeedID", changefeedID),
	zap.Stringer("dispatcherID", dispatcherID),
	zap.Uint64("requestStartTs", dispatcherInfo.GetStartTs()),
	zap.Uint64("oldCheckpointTs", oldStat.checkpointTs.Load()),
	zap.Uint64("oldLastScannedCommitTs", oldStat.lastScannedCommitTs.Load()),
	zap.Uint64("newCheckpointTs", newStat.checkpointTs.Load()),
	zap.Uint64("newLastScannedCommitTs", newStat.lastScannedCommitTs.Load()),
	zap.Uint64("oldEpoch", oldStat.epoch),
	zap.Uint64("newEpoch", newStat.epoch))
The two logging blocks before and after newStat.copyStatistics(oldStat) are nearly identical, which introduces code duplication. This can be refactored to improve readability and maintainability by extracting the common logging fields into a shared slice.
logFields := []zap.Field{
	zap.Stringer("changefeedID", changefeedID),
	zap.Stringer("dispatcherID", dispatcherID),
	zap.Uint64("requestStartTs", dispatcherInfo.GetStartTs()),
	zap.Uint64("oldCheckpointTs", oldStat.checkpointTs.Load()),
	zap.Uint64("oldLastScannedCommitTs", oldStat.lastScannedCommitTs.Load()),
	zap.Uint64("oldEpoch", oldStat.epoch),
	zap.Uint64("newEpoch", newStat.epoch),
}
log.Info("before copy statistics when reset dispatcher",
	append(logFields,
		zap.Uint64("newCheckpointTs", newStat.checkpointTs.Load()),
		zap.Uint64("newLastScannedCommitTs", newStat.lastScannedCommitTs.Load()))...)
newStat.copyStatistics(oldStat)
log.Info("after copy statistics when reset dispatcher",
	append(logFields,
		zap.Uint64("newCheckpointTs", newStat.checkpointTs.Load()),
		zap.Uint64("newLastScannedCommitTs", newStat.lastScannedCommitTs.Load()))...)
if dispatcher.checkpointTs.Load() < dp.CheckpointTs {
	log.Info("update dispatcher checkpoint by the heartbeat",
		zap.Stringer("serverID", node.ID(heartbeat.serverID)),
		zap.Any("dispatcherID", dispatcher.id),
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@downstreamadapter/eventcollector/dispatcher_stat.go`:
- Around line 195-200: The log is recording d.lastEventSeq.Load() and
d.lastEventCommitTs.Load() after those fields were reset, so it always prints 0;
capture the pre-reset values into local variables (e.g., oldSeq :=
d.lastEventSeq.Load(), oldCommit := d.lastEventCommitTs.Load()) before the reset
operation (the place where lastEventSeq/lastEventCommitTs are zeroed), and then
use those local snapshots in the zap.Uint64(...) calls for "lastEventSeq" and
"lastEventCommitTs" so the diagnostic reflects the state prior to mutation.
In `@pkg/eventservice/event_broker.go`:
- Around line 1258-1266: The checkpoint update must be made monotonic using an
atomic compare-and-swap instead of a plain Load/Store; replace the current
Load()+Store() sequence around dispatcher.checkpointTs with a CAS loop that
reads old := dispatcher.checkpointTs.Load(), returns early if dp.CheckpointTs <=
old, otherwise attempts dispatcher.checkpointTs.CompareAndSwap(old,
dp.CheckpointTs) and retries on failure; only log the "update dispatcher
checkpoint" message after a successful CAS so a racing handler cannot overwrite
a newer checkpoint with an older dp.CheckpointTs.
📒 Files selected for processing (4)
- downstreamadapter/dispatcher/basic_dispatcher.go
- downstreamadapter/eventcollector/dispatcher_stat.go
- logservice/eventstore/event_store.go
- pkg/eventservice/event_broker.go
	zap.Uint64("resetTs", resetTs),
	zap.Uint64("checkpointTs", d.target.GetCheckpointTs()),
	zap.Uint64("resolvedTs", d.target.GetResolvedTs()),
	zap.Uint64("startTs", d.target.GetStartTs()),
	zap.Uint64("lastEventCommitTs", d.lastEventCommitTs.Load()),
	zap.Uint64("lastEventSeq", d.lastEventSeq.Load()))
Capture the pre-reset sequence before zeroing it.
lastEventSeq was already reset on Line 185, so this new field always logs 0. That makes the added reset diagnostics misleading right where they are meant to explain ordering/reset problems. Snapshot the old sequence/commit values before mutating state, then log those snapshots instead.
Proposed fix
func (d *dispatcherStat) doReset(serverID node.ID, resetTs uint64) {
+ lastEventSeq := d.lastEventSeq.Load()
+ lastEventCommitTs := d.lastEventCommitTs.Load()
epoch := d.epoch.Add(1)
d.lastEventSeq.Store(0)
// remove the dispatcher from the dynamic stream
resetRequest := d.newDispatcherResetRequest(d.eventCollector.getLocalServerID().String(), resetTs, epoch)
msg := messaging.NewSingleTargetMessage(serverID, messaging.EventServiceTopic, resetRequest)
@@
zap.Uint64("resetTs", resetTs),
zap.Uint64("checkpointTs", d.target.GetCheckpointTs()),
zap.Uint64("resolvedTs", d.target.GetResolvedTs()),
zap.Uint64("startTs", d.target.GetStartTs()),
- zap.Uint64("lastEventCommitTs", d.lastEventCommitTs.Load()),
- zap.Uint64("lastEventSeq", d.lastEventSeq.Load()))
+ zap.Uint64("lastEventCommitTs", lastEventCommitTs),
+ zap.Uint64("lastEventSeq", lastEventSeq))
}
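The snapshot-before-mutation pattern from this comment can be sketched in isolation. This is a minimal illustration using only the standard library; the `progress` and `resetSnapshot` types are hypothetical stand-ins, not the real collector-side `dispatcherStat`, and the sketch assumes (as the diff above suggests) that only `lastEventSeq` is zeroed during the reset.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// progress is a hypothetical stand-in for the state touched by doReset.
type progress struct {
	epoch             atomic.Uint64
	lastEventSeq      atomic.Uint64
	lastEventCommitTs atomic.Uint64
}

// resetSnapshot holds the values observed immediately before the reset.
type resetSnapshot struct {
	Epoch             uint64
	LastEventSeq      uint64
	LastEventCommitTs uint64
}

// doReset reads the counters first, then mutates state, and returns the
// pre-reset values so the caller can log them.
func (p *progress) doReset() resetSnapshot {
	snap := resetSnapshot{
		LastEventSeq:      p.lastEventSeq.Load(),
		LastEventCommitTs: p.lastEventCommitTs.Load(),
	}
	snap.Epoch = p.epoch.Add(1)
	p.lastEventSeq.Store(0)
	return snap
}

func main() {
	p := &progress{}
	p.lastEventSeq.Store(42)
	p.lastEventCommitTs.Store(1000)

	snap := p.doReset()
	fmt.Printf("reset dispatcher: epoch=%d lastEventSeq=%d lastEventCommitTs=%d\n",
		snap.Epoch, snap.LastEventSeq, snap.LastEventCommitTs)
}
```

Returning the snapshot from the mutating function keeps the read and the write next to each other, which makes it harder for a later edit to log the already-zeroed field again.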
log.Info("update dispatcher checkpoint by the heartbeat",
	zap.Stringer("serverID", node.ID(heartbeat.serverID)),
	zap.Any("dispatcherID", dispatcher.id),
	zap.Uint64("oldCheckpointTs", dispatcher.checkpointTs.Load()),
	zap.Uint64("newCheckpointTs", dp.CheckpointTs),
	zap.Uint64("sentResolvedTs", dispatcher.sentResolvedTs.Load()),
	zap.Uint64("lastScannedCommitTs", dispatcher.lastScannedCommitTs.Load()),
	zap.Uint64("dispatcherEpoch", dispatcher.epoch))
dispatcher.checkpointTs.Store(dp.CheckpointTs)
Make the checkpoint update monotonic.
This is still a load/compare/store sequence. If two heartbeat handlers race here, both can pass the < check and the slower goroutine can overwrite a newer checkpoint with an older one. That can regress checkpointTs and feed stale scan ranges back into the event path.
Proposed fix
- if dispatcher.checkpointTs.Load() < dp.CheckpointTs {
- log.Info("update dispatcher checkpoint by the heartbeat",
- zap.Stringer("serverID", node.ID(heartbeat.serverID)),
- zap.Any("dispatcherID", dispatcher.id),
- zap.Uint64("oldCheckpointTs", dispatcher.checkpointTs.Load()),
- zap.Uint64("newCheckpointTs", dp.CheckpointTs),
- zap.Uint64("sentResolvedTs", dispatcher.sentResolvedTs.Load()),
- zap.Uint64("lastScannedCommitTs", dispatcher.lastScannedCommitTs.Load()),
- zap.Uint64("dispatcherEpoch", dispatcher.epoch))
- dispatcher.checkpointTs.Store(dp.CheckpointTs)
- }
+ for {
+ oldCheckpointTs := dispatcher.checkpointTs.Load()
+ if oldCheckpointTs >= dp.CheckpointTs {
+ break
+ }
+ if dispatcher.checkpointTs.CompareAndSwap(oldCheckpointTs, dp.CheckpointTs) {
+ log.Info("update dispatcher checkpoint by the heartbeat",
+ zap.Stringer("serverID", node.ID(heartbeat.serverID)),
+ zap.Any("dispatcherID", dispatcher.id),
+ zap.Uint64("oldCheckpointTs", oldCheckpointTs),
+ zap.Uint64("newCheckpointTs", dp.CheckpointTs),
+ zap.Uint64("sentResolvedTs", dispatcher.sentResolvedTs.Load()),
+ zap.Uint64("lastScannedCommitTs", dispatcher.lastScannedCommitTs.Load()),
+ zap.Uint64("dispatcherEpoch", dispatcher.epoch))
+ break
+ }
+ }
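The monotonic update described in this comment can be exercised on its own with the standard library. The sketch below is an illustration under assumptions: the `checkpoint` type and its `AdvanceTo` method are hypothetical names, not part of the event broker, and logging is reduced to a `fmt.Printf` that runs only after a successful compare-and-swap.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// checkpoint is a hypothetical wrapper around a timestamp that may only grow.
type checkpoint struct {
	ts atomic.Uint64
}

// AdvanceTo raises the timestamp to candidate unless it is already at or past
// it. It returns the value that was replaced and whether this call won the CAS.
func (c *checkpoint) AdvanceTo(candidate uint64) (old uint64, advanced bool) {
	for {
		old = c.ts.Load()
		if old >= candidate {
			return old, false
		}
		if c.ts.CompareAndSwap(old, candidate) {
			return old, true
		}
		// Another goroutine changed the value; reload and re-check.
	}
}

// Load returns the current timestamp.
func (c *checkpoint) Load() uint64 { return c.ts.Load() }

func main() {
	c := &checkpoint{}
	c.AdvanceTo(100)

	var wg sync.WaitGroup
	for _, candidate := range []uint64{150, 120, 200, 90} {
		wg.Add(1)
		go func(candidate uint64) {
			defer wg.Done()
			// Log only after a successful CAS, using the value actually replaced.
			if old, ok := c.AdvanceTo(candidate); ok {
				fmt.Printf("update checkpoint: old=%d new=%d\n", old, candidate)
			}
		}(candidate)
	}
	wg.Wait()
	fmt.Println("final checkpoint:", c.Load())
}
```

Whatever order the goroutines run in, the final value is the largest candidate, because a smaller candidate can never replace a larger stored value.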
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/eventservice/event_broker.go (1)
1265-1274: ⚠️ Potential issue | 🔴 Critical
Keep heartbeat checkpoint updates monotonic.
Line 1265 is still a Load/Store sequence, so concurrent heartbeat handlers can both pass the guard and the slower one can overwrite a newer checkpoint with an older dp.CheckpointTs. Use a CAS loop and only emit the log after the CAS succeeds.
Suggested local fix
- if dispatcher.checkpointTs.Load() < dp.CheckpointTs {
- 	log.Info("update dispatcher checkpoint by the heartbeat",
- 		zap.Stringer("serverID", node.ID(heartbeat.serverID)),
- 		zap.Any("dispatcherID", dispatcher.id),
- 		zap.Uint64("oldCheckpointTs", dispatcher.checkpointTs.Load()),
- 		zap.Uint64("newCheckpointTs", dp.CheckpointTs),
- 		zap.Uint64("sentResolvedTs", dispatcher.sentResolvedTs.Load()),
- 		zap.Uint64("lastScannedCommitTs", dispatcher.lastScannedCommitTs.Load()),
- 		zap.Uint64("dispatcherEpoch", dispatcher.epoch))
- 	dispatcher.checkpointTs.Store(dp.CheckpointTs)
- }
+ for {
+ 	oldCheckpointTs := dispatcher.checkpointTs.Load()
+ 	if oldCheckpointTs >= dp.CheckpointTs {
+ 		break
+ 	}
+ 	if dispatcher.checkpointTs.CompareAndSwap(oldCheckpointTs, dp.CheckpointTs) {
+ 		log.Info("update dispatcher checkpoint by the heartbeat",
+ 			zap.Stringer("serverID", node.ID(heartbeat.serverID)),
+ 			zap.Any("dispatcherID", dispatcher.id),
+ 			zap.Uint64("oldCheckpointTs", oldCheckpointTs),
+ 			zap.Uint64("newCheckpointTs", dp.CheckpointTs),
+ 			zap.Uint64("sentResolvedTs", dispatcher.sentResolvedTs.Load()),
+ 			zap.Uint64("lastScannedCommitTs", dispatcher.lastScannedCommitTs.Load()),
+ 			zap.Uint64("dispatcherEpoch", dispatcher.epoch))
+ 		break
+ 	}
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/eventservice/event_broker.go` around lines 1265 - 1274, The current Load/Store sequence around dispatcher.checkpointTs with dp.CheckpointTs is racy: replace it with an atomic CAS loop that reads old := dispatcher.checkpointTs.Load(), compares old < dp.CheckpointTs, and attempts dispatcher.checkpointTs.CompareAndSwap(old, dp.CheckpointTs) until it succeeds (or until old >= dp.CheckpointTs), and move the log.Info call so it only runs after a successful CompareAndSwap; update references to dispatcher.checkpointTs, dp.CheckpointTs and the surrounding heartbeat handling logic to use this CAS pattern to ensure monotonic checkpoint updates.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/eventservice/event_broker.go`:
- Around line 1138-1140: The reset logic snapshots oldCheckpointTs once
(oldStat.checkpointTs.Load()) and then may perform a statPtr swap while
heartbeats update checkpointTs, causing regressions; fix by serializing reset
with heartbeat checkpoint updates: when building newStat (the code paths around
dispatcherInfo.GetStartTs(), newStat creation, and the statPtr swap retry loop),
always re-read checkpointTs, startTs, sentResolvedTs and lastScannedCommitTs
from the current oldStat immediately before attempting the CAS/swap and retry
the read on any CAS failure, or protect the whole reset-and-swap sequence with
the same mutex used by heartbeat updates so the replacement is built from the
latest atomic values and published only once consistent. Ensure references to
oldStat.checkpointTs.Load(), dispatcherInfo.GetStartTs(), newStat, and the
statPtr swap/CAS loop are updated accordingly.
📒 Files selected for processing (3)
- pkg/eventservice/dispatcher_stat.go
- pkg/eventservice/event_broker.go
- pkg/eventservice/event_broker_test.go
💤 Files with no reviewable changes (1)
- pkg/eventservice/dispatcher_stat.go
oldCheckpointTs := oldStat.checkpointTs.Load()
newCheckpointTs := max(dispatcherInfo.GetStartTs(), oldCheckpointTs)
Synchronize reset with checkpoint advancement.
Line 1138 snapshots oldCheckpointTs once, and Lines 1171-1229 keep reusing that snapshot even if the swap retries. More importantly, any heartbeat that updates oldStat.checkpointTs after that snapshot but before statPtr is swapped is lost when newStat replaces the old struct, so startTs, checkpointTs, sentResolvedTs, and lastScannedCommitTs can all move backwards and replay already-acked data. This needs serialization with heartbeat checkpoint updates, or a reset flow that rebuilds/publishes the new state from the latest checkpoint atomically.
Also applies to: 1171-1229
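One of the two options named in this comment, serializing reset with heartbeat checkpoint updates, can be sketched with a single mutex. This is a simplified model under assumptions: `stat`, `slot`, `onHeartbeat`, and `reset` are hypothetical names, the real code uses atomics and a pointer swap rather than a lock, and only `epoch` and `checkpointTs` are modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// stat is a hypothetical, simplified dispatcher state.
type stat struct {
	epoch        uint64
	checkpointTs uint64
}

// slot owns the current stat; mu serializes heartbeats and resets.
type slot struct {
	mu  sync.Mutex
	cur *stat
}

// onHeartbeat advances the checkpoint of the current stat and never moves it
// backwards.
func (s *slot) onHeartbeat(checkpointTs uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if checkpointTs > s.cur.checkpointTs {
		s.cur.checkpointTs = checkpointTs
	}
}

// reset replaces the stat under the same lock, so the new lower bound is built
// from the latest checkpoint. It rejects requests with a stale epoch.
func (s *slot) reset(epoch, startTs uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cur.epoch >= epoch {
		return false
	}
	lower := startTs
	if s.cur.checkpointTs > lower {
		lower = s.cur.checkpointTs
	}
	s.cur = &stat{epoch: epoch, checkpointTs: lower}
	return true
}

func main() {
	s := &slot{cur: &stat{epoch: 1, checkpointTs: 100}}
	s.onHeartbeat(180)      // checkpoint advances before the reset arrives
	ok := s.reset(2, 150)   // reset asks for a start below the checkpoint
	fmt.Println(ok, s.cur.epoch, s.cur.checkpointTs)
}
```

Because both paths take the same lock, no heartbeat can land between reading the old checkpoint and publishing the replacement. The trade-off is contention on the heartbeat path, which the lock-free approach in the PR avoids.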
This reverts commit 63e6dcf.
♻️ Duplicate comments (1)
pkg/eventservice/event_broker.go (1)
1135-1201: ⚠️ Potential issue | 🔴 Critical
Rebuild the reset state from the latest dispatcher snapshot before publishing it.
Lines 1135-1170 snapshot oldCheckpointTs, fetch tableInfo, and build newStat once before the CAS loop. A heartbeat can still advance oldStat.checkpointTs without changing the pointer, so the first CAS can succeed and publish a lower checkpointTs/sentResolvedTs/lastScannedCommitTs than the live dispatcher had already reached. If the pointer does change, the retry path still reuses the stale newStat and tableInfo from the first attempt. That can replay already-acked data and regress the checkpoint later reported back to the event store. Re-read the current stat and rebuild the replacement state on each retry, or serialize reset with checkpoint updates.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/eventservice/event_broker.go` around lines 1135 - 1201, The reset builds newStat once before the CAS loop which can publish stale checkpoint values; to fix, move the creation of newStat (calls to newDispatcherStat, newStat.copyStatistics, newStat.resetLowerBound) and the tableInfo lookup (c.schemaStore.GetTableInfo using newCheckpointTs) inside the for-loop after reloading oldStat := statPtr.Load(), and recompute oldCheckpointTs/newCheckpointTs from oldStat.checkpointTs.Load() and dispatcherInfo.GetStartTs() each iteration; preserve the existing epoch-staleness check (if oldStat.epoch >= dispatcherInfo.GetEpoch()) and oldStat.isRemoved.Store(true) logic, then attempt CompareAndSwap, retrying with rebuilt newStat on CAS failure.
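The "rebuild on each retry" option can be sketched with immutable snapshots behind an `atomic.Pointer`. This is an illustration under a loud assumption that differs from the code under review: here the heartbeat also publishes a new snapshot via pointer swap, so a concurrent checkpoint advance makes the reset's CAS fail and forces a rebuild. The `snapshot` and `holder` types are hypothetical.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// snapshot is an immutable, hypothetical view of dispatcher progress.
type snapshot struct {
	epoch        uint64
	checkpointTs uint64
}

// holder publishes the current snapshot through an atomic pointer.
type holder struct {
	ptr atomic.Pointer[snapshot]
}

func newHolder(epoch, checkpointTs uint64) *holder {
	h := &holder{}
	h.ptr.Store(&snapshot{epoch: epoch, checkpointTs: checkpointTs})
	return h
}

// heartbeat publishes a new snapshot when the checkpoint advances.
func (h *holder) heartbeat(checkpointTs uint64) {
	for {
		old := h.ptr.Load()
		if checkpointTs <= old.checkpointTs {
			return
		}
		next := &snapshot{epoch: old.epoch, checkpointTs: checkpointTs}
		if h.ptr.CompareAndSwap(old, next) {
			return
		}
	}
}

// reset re-reads the latest snapshot and rebuilds the replacement on every
// attempt, so a retry never reuses values computed from a stale snapshot.
func (h *holder) reset(epoch, startTs uint64) bool {
	for {
		old := h.ptr.Load()
		if old.epoch >= epoch {
			return false // stale reset request
		}
		lower := startTs
		if old.checkpointTs > lower {
			lower = old.checkpointTs
		}
		next := &snapshot{epoch: epoch, checkpointTs: lower}
		if h.ptr.CompareAndSwap(old, next) {
			return true
		}
	}
}

func main() {
	h := newHolder(1, 100)
	h.heartbeat(180)
	ok := h.reset(2, 150)
	cur := h.ptr.Load()
	fmt.Println(ok, cur.epoch, cur.checkpointTs)
}
```

With in-place atomics on a shared struct, as in the reviewed code, a pointer CAS alone cannot detect a heartbeat that mutates a field of the old struct, which is the gap this comment points at.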
📒 Files selected for processing (4)
- pkg/eventservice/dispatcher_stat.go
- pkg/eventservice/dispatcher_stat_test.go
- pkg/eventservice/event_broker.go
- pkg/eventservice/event_broker_test.go
/test all |
What problem does this PR solve?
Issue Number: close #4492
Event service dispatcher runtime state kept both startTs and checkpointTs as lower-bound candidates. After a reset, heartbeat could advance checkpoint, but handshake and the next scan could still read the stale runtime startTs.
That may reopen a range below the effective event store checkpoint and trigger:
What is changed and how it works?
- Remove startTs from pkg/eventservice/dispatcherStat
- Use checkpointTs as the single runtime lower bound in event service
- Use checkpointTs when sending handshake events
- Add a regression test for reset -> heartbeat advances checkpoint -> second reset uses resolvedTs - 1
- Add an analysis note on why checkpointTs as the only runtime lower bound is the correct fix
Check List
Tests
Questions
Will it cause performance regression or break compatibility?
No. The change only removes stale runtime state in event service and makes
reset/handshake/scan lower-bound handling consistent with the checkpoint
already enforced by event store.
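The sequence reset -> heartbeat advances checkpoint -> second reset can be walked through with a toy model. This is a sketch under assumptions, not the event service code: `legacyStat` and `singleBoundStat` are hypothetical types, the timestamps are made up, and the single-bound reset is assumed to take the larger of the requested start and the current checkpoint.

```go
package main

import "fmt"

// legacyStat models runtime state with two lower-bound candidates.
type legacyStat struct {
	startTs      uint64 // set on reset, never touched by heartbeats
	checkpointTs uint64 // advanced by heartbeats
}

func (s *legacyStat) reset(requestStartTs uint64) { s.startTs = requestStartTs }

func (s *legacyStat) heartbeat(ts uint64) {
	if ts > s.checkpointTs {
		s.checkpointTs = ts
	}
}

// scanLowerBound models handshake/scan reading the runtime startTs.
func (s *legacyStat) scanLowerBound() uint64 { return s.startTs }

// singleBoundStat keeps checkpointTs as the only runtime lower bound.
type singleBoundStat struct {
	checkpointTs uint64
}

// reset never lowers the bound below the checkpoint already reached.
func (s *singleBoundStat) reset(requestStartTs uint64) {
	if requestStartTs > s.checkpointTs {
		s.checkpointTs = requestStartTs
	}
}

func (s *singleBoundStat) heartbeat(ts uint64) {
	if ts > s.checkpointTs {
		s.checkpointTs = ts
	}
}

func (s *singleBoundStat) scanLowerBound() uint64 { return s.checkpointTs }

func main() {
	legacy, single := &legacyStat{}, &singleBoundStat{}

	// First reset at 100, heartbeat advances the checkpoint to 180,
	// second reset requests 150 (standing in for resolvedTs - 1).
	legacy.reset(100)
	legacy.heartbeat(180)
	legacy.reset(150)

	single.reset(100)
	single.heartbeat(180)
	single.reset(150)

	fmt.Printf("legacy: scan from %d, checkpoint %d\n", legacy.scanLowerBound(), legacy.checkpointTs)
	fmt.Printf("single: scan from %d, checkpoint %d\n", single.scanLowerBound(), single.checkpointTs)
}
```

In the legacy model the scan would start below the checkpoint, which is the stale-range condition described in the problem statement; with a single bound the scan start and the checkpoint cannot diverge.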
Do you need to update user documentation, design documentation or monitoring documentation?
Yes. Added an internal design/analysis note describing the panic sequence and the rationale for removing runtime startTs.
Release note