
[DO NOT REVIEW] Handle non-numeric PubSubPosition in leader produce and checkpoint paths#2605

Closed
haoxu07 wants to merge 5 commits into linkedin:main from haoxu07:fix-ng-getNumericOffset-crash


haoxu07 commented Mar 12, 2026

Summary

When RT topics migrate from Kafka to Northguard, PubSubPosition instances become NGRangePosition, which throws UnsupportedOperationException on getNumericOffset(). This crashes the leader produce path, and the offset checkpoint path on followers that consume VT messages carrying NG position bytes.

Crash sites fixed (5 production, 2 test infrastructure):

| # | File | Method | Fix |
|---|------|--------|-----|
| 1 | VeniceWriter.java | getKafkaMessageEnvelopeProvider (L1182) | getNumericOffsetOrDefault() helper |
| 2 | VeniceWriter.java | getHeartbeatKME (L2290) | Same helper |
| 3 | VeniceWriter.java | getKafkaValue (L2423) | Same helper |
| 4 | PubSubUtil.java | deserializePositionWithOffsetFallback (L387) | Split symbolic check; catch UnsupportedOperationException |
| 5 | OffsetRecord.java | checkpointRtPosition (L316) | Catch UnsupportedOperationException, store -1 in legacy map |
| 6 | InMemoryPubSubPosition.java | getNumericOffset | Add ofNonNumeric() factory for NG simulation |
| 7 | MockInMemoryConsumerAdapter.java | advancePosition | Use getInternalOffset() for InMemoryPubSubPosition |

Key design decisions:

  • The upstreamOffset long field in LeaderMetadata and upstreamOffsetMap in PartitionState are legacy fallbacks — the wire-format bytes in upstreamPubSubPosition / upstreamRealTimeTopicPubSubPositionMap are authoritative
  • Storing -1 for non-numeric positions is consistent with PubSubSymbolicPosition.EARLIEST.getNumericOffset()
  • Site #5 (OffsetRecord.checkpointRtPosition) is a latent crash unmasked by fix #4: fixing deserializePositionWithOffsetFallback allows NG positions to flow downstream to checkpointRtPosition, where they previously fell back to EARLIEST
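
The first two decisions can be illustrated with a minimal sketch. It is not the Venice code: `Position`, `NumericPosition`, `RangePosition`, `CheckpointSketch`, and the two map fields are hypothetical stand-ins for PubSubPosition, a Kafka-style position, NGRangePosition, OffsetRecord, and the legacy/authoritative maps named above. The point it shows is that the wire-format bytes are always written, while the legacy numeric map degrades to -1 when the position has no numeric offset.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

class CheckpointSketch {
  /** Hypothetical stand-in for PubSubPosition; not the Venice interface. */
  interface Position {
    long getNumericOffset();
    ByteBuffer toWireBytes();
  }

  /** Kafka-style position that has a numeric offset. */
  static final class NumericPosition implements Position {
    private final long offset;
    NumericPosition(long offset) { this.offset = offset; }
    public long getNumericOffset() { return offset; }
    public ByteBuffer toWireBytes() {
      return ByteBuffer.allocate(Long.BYTES).putLong(0, offset);
    }
  }

  /** Position with opaque bytes and no numeric offset (assumed NG-like behavior). */
  static final class RangePosition implements Position {
    private final byte[] opaque;
    RangePosition(byte[] opaque) { this.opaque = opaque.clone(); }
    public long getNumericOffset() {
      throw new UnsupportedOperationException("no numeric offset");
    }
    public ByteBuffer toWireBytes() { return ByteBuffer.wrap(opaque.clone()); }
  }

  /** Legacy sentinel stored when a position cannot report a numeric offset. */
  static final long LEGACY_UNSUPPORTED_OFFSET = -1L;

  final Map<String, Long> legacyOffsetMap = new HashMap<>();
  final Map<String, ByteBuffer> positionBytesMap = new HashMap<>();

  void checkpoint(String brokerUrl, Position position) {
    // Authoritative record: always store the wire-format bytes.
    positionBytesMap.put(brokerUrl, position.toWireBytes());
    long numericOffset;
    try {
      numericOffset = position.getNumericOffset();
    } catch (UnsupportedOperationException e) {
      // Legacy fallback only; readers are expected to prefer the bytes above.
      numericOffset = LEGACY_UNSUPPORTED_OFFSET;
    }
    legacyOffsetMap.put(brokerUrl, numericOffset);
  }
}
```

Under these assumptions, a reader that only understands the legacy map sees -1 for a non-numeric position, while a reader that understands the bytes loses nothing.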

Crash chain (before this fix):

VT message (NG wire bytes in upstreamPubSubPosition)
    │
    ▼
extractUpstreamPosition → deserializePositionWithOffsetFallback
    │ getNumericOffset() on NGRangePosition → UnsupportedOperationException
    │ caught by outer RuntimeException catch → falls back to EARLIEST
    ▼
updateOffsetsFromConsumerRecord: EARLIEST.equals(EARLIEST) → TRUE → SKIP
    ✅ No crash (but position data lost)

After PubSubUtil fix only (without OffsetRecord fix):

deserializePositionWithOffsetFallback → returns NGRangePosition ✅
    │
    ▼
updateOffsetsFromConsumerRecord: EARLIEST.equals(ngPos) → FALSE → ENTER IF
    │
    ▼
checkpointRtPosition → leaderPosition.getNumericOffset() → 💥 NEW CRASH

After both fixes:

deserializePositionWithOffsetFallback → returns NGRangePosition ✅
    │
    ▼
checkpointRtPosition → catch(UnsupportedOperationException) → store -1 ✅
    wire-format bytes stored as authoritative position ✅
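
The difference between the "before" and "after" deserialization behavior can be sketched as follows. This is an illustration under assumptions, not the PubSubUtil source: `FallbackSketch`, `Position`, and both methods are hypothetical, and the numeric comparison against the legacy offset is a placeholder for whatever check the real method performs. What it shows is how a broad RuntimeException catch silently turns a non-numeric position into EARLIEST, while a symbolic check plus a narrow catch returns the position as-is.

```java
import java.util.function.Supplier;

class FallbackSketch {
  /** Hypothetical stand-in for PubSubPosition. */
  interface Position {
    long getNumericOffset();
  }

  static final Position EARLIEST = () -> -1L;

  /** Simulates a position type that has no numeric offset. */
  static final Position NON_NUMERIC = () -> {
    throw new UnsupportedOperationException("no numeric offset");
  };

  static Position numeric(long offset) {
    return () -> offset;
  }

  /** "Before": one broad catch hides the exception and loses the position. */
  static Position deserializeBroadCatch(Supplier<Position> deserializer, long legacyOffset) {
    try {
      Position position = deserializer.get();
      // Placeholder numeric check (assumption); throws for non-numeric positions.
      if (position.getNumericOffset() < legacyOffset) {
        return numeric(legacyOffset);
      }
      return position;
    } catch (RuntimeException e) {
      return EARLIEST;
    }
  }

  /** "After": symbolic check split out, and only the unsupported case is caught. */
  static Position deserializeNarrowCatch(Supplier<Position> deserializer, long legacyOffset) {
    Position position;
    try {
      position = deserializer.get();
    } catch (RuntimeException e) {
      return EARLIEST; // genuine deserialization failure
    }
    if (position == EARLIEST) {
      return position; // symbolic positions skip the numeric comparison
    }
    try {
      if (position.getNumericOffset() < legacyOffset) {
        return numeric(legacyOffset);
      }
    } catch (UnsupportedOperationException e) {
      // Non-numeric position: nothing to compare, fall through and return it as-is.
    }
    return position;
  }
}
```

With `NON_NUMERIC` as input, the first method returns `EARLIEST` and the second returns the original position, which is what lets it reach the checkpoint path.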

Test plan

  • Unit test: VeniceWriterUnitTest.testGetNumericOffsetOrDefaultWithKafkaPosition — verifies Kafka position passthrough
  • Unit test: VeniceWriterUnitTest.testGetNumericOffsetOrDefaultWithUnsupportedPosition — verifies -1 fallback for NG position
  • Unit test: PubSubUtilTest.testDeserializePositionWithOffsetFallbackNonNumericPosition — verifies NG position returned as-is
  • Unit test: TestOffsetRecord.testCheckpointRtPositionWithNonNumericPosition — verifies legacy offset map gets -1
  • Test infra: InMemoryPubSubPosition.ofNonNumeric() enables future integration tests with NG position simulation
  • CI: existing unit and integration tests pass (no behavioral change for Kafka positions)
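
The two VeniceWriter helper tests in the plan check a passthrough case and a fallback case. A rough, framework-free sketch of those two checks is below; `NumericOffsetHelperSketch` and its `Position` interface are hypothetical stand-ins, and the real tests presumably use the project's test framework and mocks rather than a `main` method.

```java
class NumericOffsetHelperSketch {
  /** Hypothetical stand-in for PubSubPosition. */
  interface Position {
    long getNumericOffset();
  }

  /** Mirrors the helper described in this PR: -1 when numeric offsets are unsupported. */
  static long getNumericOffsetOrDefault(Position position) {
    try {
      return position.getNumericOffset();
    } catch (UnsupportedOperationException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    Position kafkaLike = () -> 100L;
    Position unsupported = () -> {
      throw new UnsupportedOperationException("no numeric offset");
    };

    // Passthrough: a numeric position keeps its offset.
    if (getNumericOffsetOrDefault(kafkaLike) != 100L) {
      throw new AssertionError("expected passthrough of numeric offset");
    }
    // Fallback: a non-numeric position maps to -1 instead of throwing.
    if (getNumericOffsetOrDefault(unsupported) != -1L) {
      throw new AssertionError("expected -1 fallback");
    }
    System.out.println("both checks passed");
  }
}
```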

🤖 Generated with Claude Code

… and checkpoint paths

When RT topics migrate from Kafka to Northguard, PubSubPosition instances
become NGRangePosition which throws UnsupportedOperationException on
getNumericOffset(). This change adds graceful handling at all crash sites
in the leader produce and offset checkpoint paths:

- VeniceWriter: wrap 3 sites (data PUT/DELETE, heartbeat, non-default
  leader metadata) with getNumericOffsetOrDefault() helper that returns
  -1 for non-numeric positions
- PubSubUtil.deserializePositionWithOffsetFallback: split symbolic
  position check from numeric comparison; catch UnsupportedOperationException
  and return the deserialized NG position as-is
- OffsetRecord.checkpointRtPosition: catch UnsupportedOperationException
  and store -1 in legacy upstreamOffsetMap (wire-format bytes in
  upstreamRealTimeTopicPubSubPositionMap are authoritative)
- InMemoryPubSubPosition: add ofNonNumeric() factory + getInternalOffset()
  to enable NG position simulation in tests
- MockInMemoryConsumerAdapter.advancePosition: use getInternalOffset()
  to avoid crash with non-numeric test positions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 12, 2026 08:26
…minology

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI left a comment

Pull request overview

Fixes crashes caused by non-Kafka PubSubPosition implementations (e.g., Northguard NGRangePosition) throwing UnsupportedOperationException on getNumericOffset() as RT topics migrate off Kafka, ensuring leader produce paths and follower checkpointing can safely handle non-numeric positions while preserving authoritative wire-format bytes.

Changes:

  • Add safe numeric-offset extraction in VeniceWriter to avoid crashes when upstream positions are non-numeric.
  • Update PubSubUtil.deserializePositionWithOffsetFallback and OffsetRecord.checkpointRtPosition to tolerate non-numeric positions and keep wire-format bytes authoritative.
  • Extend test infrastructure and unit tests to simulate/validate non-numeric position behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

| File | Description |
|------|-------------|
| internal/venice-common/src/main/java/com/linkedin/venice/writer/VeniceWriter.java | Use a helper to default numeric offsets to -1 when unsupported for leader metadata fields. |
| internal/venice-common/src/main/java/com/linkedin/venice/pubsub/PubSubUtil.java | Avoid numeric-offset comparisons for symbolic or non-numeric positions during deserialization fallback. |
| internal/venice-common/src/main/java/com/linkedin/venice/offsets/OffsetRecord.java | Catch UnsupportedOperationException when checkpointing RT positions and persist -1 in legacy offset map. |
| internal/venice-test-common/src/main/java/com/linkedin/venice/pubsub/mock/InMemoryPubSubPosition.java | Add ofNonNumeric() to simulate positions that don’t support numeric offsets; propagate behavior across next/prev helpers. |
| internal/venice-test-common/src/main/java/com/linkedin/venice/pubsub/mock/adapter/consumer/MockInMemoryConsumerAdapter.java | Adjust advancePosition to use getInternalOffset() for in-memory positions. |
| internal/venice-common/src/test/java/com/linkedin/venice/writer/VeniceWriterUnitTest.java | Add unit tests covering numeric-offset extraction helper for numeric and non-numeric positions. |
| internal/venice-common/src/test/java/com/linkedin/venice/pubsub/PubSubUtilTest.java | Add unit test asserting non-numeric deserialized positions are returned as-is. |
| internal/venice-common/src/test/java/com/linkedin/venice/offsets/TestOffsetRecord.java | Add unit test ensuring checkpointing non-numeric positions stores -1 in legacy offset map. |


Comment on lines +292 to 296
long startOffset = (startInclusive instanceof InMemoryPubSubPosition)
? ((InMemoryPubSubPosition) startInclusive).getInternalOffset()
: startInclusive.getNumericOffset();
long targetOffset = Math.addExact(startOffset, n);
return InMemoryPubSubPosition.of(targetOffset);

Copilot AI Mar 12, 2026

advancePosition() loses the non-numeric simulation: if startInclusive is an InMemoryPubSubPosition created via ofNonNumeric(), this method computes offsets via getInternalOffset() but always returns InMemoryPubSubPosition.of(targetOffset), which re-enables numeric offsets. Consider preserving the numericOffsetSupported flag (e.g., return ofNonNumeric(targetOffset) when the input was non-numeric) so callers can advance positions without silently changing position semantics.

Suggested change
- long startOffset = (startInclusive instanceof InMemoryPubSubPosition)
-     ? ((InMemoryPubSubPosition) startInclusive).getInternalOffset()
-     : startInclusive.getNumericOffset();
- long targetOffset = Math.addExact(startOffset, n);
- return InMemoryPubSubPosition.of(targetOffset);
+ long startOffset;
+ boolean numericOffsetSupported = true;
+ if (startInclusive instanceof InMemoryPubSubPosition) {
+   InMemoryPubSubPosition inMemoryPosition = (InMemoryPubSubPosition) startInclusive;
+   startOffset = inMemoryPosition.getInternalOffset();
+   numericOffsetSupported = inMemoryPosition.isNumericOffsetSupported();
+ } else {
+   startOffset = startInclusive.getNumericOffset();
+ }
+ long targetOffset = Math.addExact(startOffset, n);
+ return numericOffsetSupported
+     ? InMemoryPubSubPosition.of(targetOffset)
+     : InMemoryPubSubPosition.ofNonNumeric(targetOffset);

Comment on lines +50 to +58
/**
* Creates a position that simulates a pub sub system whose positions do not support numeric offsets.
* The internal offset is still used for mock broker indexing, but {@link #getNumericOffset()}
* throws {@link UnsupportedOperationException}.
*/
public static InMemoryPubSubPosition ofNonNumeric(long offset) {
return new InMemoryPubSubPosition(offset, false);
}


Copilot AI Mar 12, 2026

ofNonNumeric() currently only affects runtime behavior (getNumericOffset throws) but does not survive wire-format round-tripping: getPositionWireFormat()/toWireFormatBuffer() don’t encode numericOffsetSupported, so deserializing the bytes will produce a regular (numeric) InMemoryPubSubPosition. If this factory is intended for end-to-end simulation of non-Kafka positions, consider encoding the flag (or using a distinct typeId/factory) so the non-numeric behavior is preserved across serialization/deserialization.
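
One way to address this, sketched under assumptions, is to carry the flag in the serialized form. The class below is a hypothetical stand-in for InMemoryPubSubPosition, and its byte layout (an 8-byte offset followed by a 1-byte flag) is invented for illustration; it is not the project's actual wire format, and a distinct type id would be an equally valid approach.

```java
import java.nio.ByteBuffer;

class InMemoryPositionSketch {
  private final long internalOffset;
  private final boolean numericOffsetSupported;

  private InMemoryPositionSketch(long internalOffset, boolean numericOffsetSupported) {
    this.internalOffset = internalOffset;
    this.numericOffsetSupported = numericOffsetSupported;
  }

  static InMemoryPositionSketch of(long offset) {
    return new InMemoryPositionSketch(offset, true);
  }

  static InMemoryPositionSketch ofNonNumeric(long offset) {
    return new InMemoryPositionSketch(offset, false);
  }

  long getInternalOffset() {
    return internalOffset;
  }

  boolean isNumericOffsetSupported() {
    return numericOffsetSupported;
  }

  long getNumericOffset() {
    if (!numericOffsetSupported) {
      throw new UnsupportedOperationException("numeric offset not supported");
    }
    return internalOffset;
  }

  /** Assumed layout: 8-byte offset, then 1 byte holding the numeric-support flag. */
  ByteBuffer toWireFormat() {
    ByteBuffer buffer = ByteBuffer.allocate(Long.BYTES + 1);
    buffer.putLong(internalOffset);
    buffer.put((byte) (numericOffsetSupported ? 1 : 0));
    buffer.flip();
    return buffer;
  }

  static InMemoryPositionSketch fromWireFormat(ByteBuffer buffer) {
    ByteBuffer view = buffer.duplicate(); // leave the caller's read position untouched
    long offset = view.getLong();
    boolean numeric = view.get() != 0;
    return new InMemoryPositionSketch(offset, numeric);
  }
}
```

With the flag encoded, `fromWireFormat(ofNonNumeric(7).toWireFormat())` still throws on `getNumericOffset()`, so the simulated non-numeric behavior survives a serialize/deserialize hop.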

Comment on lines +8 to +12
import com.linkedin.venice.kafka.protocol.GUID;
import com.linkedin.venice.kafka.protocol.state.ProducerPartitionState;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;


Copilot AI Mar 12, 2026

Import ordering in this test file is inconsistent with other venice-common tests (static imports are typically grouped together at the top). Here, static Mockito imports appear after non-static imports, which may fail checkstyle/spotless. Please regroup imports so all static imports are together and separated from non-static imports.

haoxu07 and others added 2 commits March 12, 2026 01:34
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 12, 2026 08:39
Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.



Comment on lines +2306 to +2313
* Extract the numeric offset from a PubSubPosition, returning -1 if the position type
* does not support numeric offsets.
*/
static long getNumericOffsetOrDefault(PubSubPosition position) {
try {
return position.getNumericOffset();
} catch (UnsupportedOperationException e) {
return -1;

Copilot AI Mar 12, 2026

getNumericOffsetOrDefault() hard-codes -1 as the fallback. Since this value is semantically tied to EARLIEST, consider returning PubSubSymbolicPosition.EARLIEST.getNumericOffset() instead of a magic number to keep the meaning centralized (and consistent if the sentinel ever changes).

Suggested change
-  * Extract the numeric offset from a PubSubPosition, returning -1 if the position type
-  * does not support numeric offsets.
-  */
- static long getNumericOffsetOrDefault(PubSubPosition position) {
-   try {
-     return position.getNumericOffset();
-   } catch (UnsupportedOperationException e) {
-     return -1;
+  * Extract the numeric offset from a PubSubPosition, returning the numeric offset of
+  * {@link PubSubSymbolicPosition#EARLIEST} if the position type does not support
+  * numeric offsets.
+  */
+ static long getNumericOffsetOrDefault(PubSubPosition position) {
+   try {
+     return position.getNumericOffset();
+   } catch (UnsupportedOperationException e) {
+     return PubSubSymbolicPosition.EARLIEST.getNumericOffset();

Comment on lines +321 to +322
// Store -1 as legacy fallback; the wire-format bytes above are the authoritative position.
numericOffset = -1;

Copilot AI Mar 12, 2026

checkpointRtPosition() uses a hard-coded -1 sentinel when getNumericOffset() is unsupported. Consider using PubSubSymbolicPosition.EARLIEST.getNumericOffset() (or a shared constant) instead of repeating the magic number, so the legacy fallback semantics remain explicit and consistent.

Suggested change
- // Store -1 as legacy fallback; the wire-format bytes above are the authoritative position.
- numericOffset = -1;
+ // Store EARLIEST's numeric offset as legacy fallback; the wire-format bytes above are the authoritative position.
+ numericOffset = PubSubSymbolicPosition.EARLIEST.getNumericOffset();

Simplify test assertion to verify checkpointRtPosition doesn't throw
with a non-numeric position, rather than inspecting internal state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sushantmane (Contributor) left a comment

We need to handle this in internal code, not in OSS.

haoxu07 changed the title from "Handle non-numeric PubSubPosition in leader produce and checkpoint paths" to "[DO NOT REVIEW] Handle non-numeric PubSubPosition in leader produce and checkpoint paths" Mar 12, 2026
haoxu07 closed this Mar 12, 2026