
[protocol] Stage degradedDatacenters on AdminOperation v100 #2816

Open
mynameborat wants to merge 1 commit into linkedin:main from mynameborat:protocol/stage-degraded-dcs-on-add-version

@mynameborat
Contributor

Problem Statement

The degraded-mode batch push feature needs to propagate the set of currently-degraded datacenters from the parent controller to child controllers so that AdminExecutionTask on a child controller can decide whether to skip ingestion for its own region. Today, markDatacenterDegraded only writes to the parent's local state and produces no admin-topic message, so child controllers never learn about degraded DCs. The isDegradedDC check in AdminExecutionTask always evaluates to false on a child, which means a degraded DC silently ingests the version anyway when versionSwapDeferred=true is set (the auto-conversion path used by degraded-mode pushes).

The selected fix is to embed the degraded-DC set in the AddVersion admin message itself — atomic with version creation and consistent with how other version-creation parameters (e.g., targetedRegions, versionSwapDeferred) flow from parent to child.

Solution

Stage a new degradedDatacenters field on the AddVersion record in a new AdminOperation schema version (v100). Following the established pattern (see #2806, #2814), this PR introduces the schema only; Java wiring (parent populates the field, AdminExecutionTask reads it) will land in a follow-up PR.

  • New schema file services/venice-controller/src/main/resources/avro/AdminOperation/v100/AdminOperation.avsc — verbatim copy of v99 with one added field on AddVersion:
    degradedDatacenters : array<string>, default []
    
  • build.gradle versionOverrides comment extended to note v100 is also staged. The actual pin stays at v98 — the generated Java AdminOperation class is unchanged, so no producers or consumers can see the new field yet.
  • AvroProtocolDefinition.ADMIN_OPERATION stays at 98. AdminOperationSerializer.initProtocolMap() continues to load v1..v98 only; v99 and v100 remain inert until activated by a future PR.

This is a forward-compatible schema-only change with no behavioral impact.
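The staging described above can be pictured with a minimal sketch: only schema versions up to the pinned protocol version are registered, so files for later versions exist on disk but are never loaded. This is not the actual `AdminOperationSerializer.initProtocolMap()` code. The class name `StagedSchemaSketch`, the method `loadActiveSchemas`, and the resource-path format are hypothetical and used only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of "staged but inert" schema versions; not Venice code. */
final class StagedSchemaSketch {
  private StagedSchemaSketch() {}

  /**
   * Registers one schema resource per version in [1, pinnedVersion].
   * Versions above the pin (for example a staged v99 or v100) are skipped,
   * so no producer or consumer can observe their fields.
   */
  static Map<Integer, String> loadActiveSchemas(int pinnedVersion) {
    Map<Integer, String> protocolMap = new TreeMap<>();
    for (int version = 1; version <= pinnedVersion; version++) {
      // Assumed path layout, shown only as a placeholder string.
      protocolMap.put(version, "avro/AdminOperation/v" + version + "/AdminOperation.avsc");
    }
    return protocolMap;
  }

  public static void main(String[] args) {
    Map<Integer, String> active = loadActiveSchemas(98);
    System.out.println(active.containsKey(98));  // true: the pinned version is loaded
    System.out.println(active.containsKey(100)); // false: v100 is staged only
  }
}
```

Under this model, activating v100 later amounts to raising the pin, which is why the schema file can land ahead of the Java wiring.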

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

Schema-only change; no behavior change to test. Verified the existing protocol tests pass:

  • AdminOperationProtocolCompatibilityTest.testAdminOperationProtocolCompatibility — PASSED

  • AdminOperationSerializerTest (testAdminOperationSerializer, testDownloadAndSchemaIfNecessary, testGetSchema, testSerializeDeserializeWithDocChange, testValidateAdminOperation) — all PASSED

  • New unit tests added.

  • New integration tests added.

  • Modified or extended existing tests.

  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Add an array<string> field on the AddVersion admin operation
carrying the set of datacenters marked as degraded at version
creation time. Pinned via build.gradle versionOverrides until
the Java wiring (parent populates the field, AdminExecutionTask
reads it to enforce skipConsumption for degraded DCs even when
versionSwapDeferred=true) lands in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mynameborat force-pushed the protocol/stage-degraded-dcs-on-add-version branch from d99db19 to 7bd7420 on May 22, 2026 06:01
@mynameborat
Contributor Author

Updated the field shape from array<string> with empty-array default to ["null", array<string>] with null default — matches the existing targetedRegions convention on the same record.

Why: The avroutil vanilla code-gen does not initialize non-union defaults in the generated SpecificRecord constructor, so the field comes out null. Avro then throws an NPE when serializing, because a non-nullable array field can't be null on the wire. With the nullable union, null is a valid value and consumers handle it naturally via field != null && ....

Validated locally with AdminOperationSerializerTest and AdminOperationProtocolCompatibilityTest.
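For the follow-up wiring, a null-safe consumer check along these lines might be enough. This is a sketch under assumptions, not the code that will land in `AdminExecutionTask`: the class `DegradedDcCheck`, the method `isDegradedDc`, and the region names are hypothetical. The list is typed as `CharSequence` on the assumption that the generated record may expose Avro strings as `CharSequence` rather than `String`; if the build generates `String` fields, a plain `contains` call would also work.

```java
import java.util.List;

/** Hypothetical null-safe membership check for a nullable degradedDatacenters field. */
final class DegradedDcCheck {
  private DegradedDcCheck() {}

  /**
   * Returns true only when the field is present (non-null) and lists the local region.
   * A null field, the schema default, means "no degraded DCs were recorded".
   */
  static boolean isDegradedDc(String localRegion, List<? extends CharSequence> degradedDatacenters) {
    if (degradedDatacenters == null) {
      return false;
    }
    for (CharSequence dc : degradedDatacenters) {
      // contentEquals compares characters, so it works for any CharSequence implementation.
      if (localRegion.contentEquals(dc)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> degraded = List.of("dc-0", "dc-1");
    System.out.println(isDegradedDc("dc-1", degraded)); // true
    System.out.println(isDegradedDc("dc-2", degraded)); // false
    System.out.println(isDegradedDc("dc-1", null));     // false
  }
}
```

Treating null and an empty list the same way keeps messages written before the field existed indistinguishable from messages that carry no degraded DCs.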
