internalcatalog: bump HDFS replication on metadata.json after write#584
internalcatalog: bump HDFS replication on metadata.json after write#584mkuchenbecker wants to merge 2 commits into
Conversation
After the synchronous TableMetadataParser.write(...) inside doCommit(...), call fs.setReplication(<newMetadataLocation>, 30) when the storage client is HdfsStorageClient. Non-HDFS clients are skipped. setReplication failures are a degraded-durability event, not a commit failure -- the file is already durably written at the cluster default replication, so failures are logged at WARN and swallowed. The call is wrapped in metricsReporter.executeWithStats under a new metadata_replication_set_latency metric for observability. The target replication factor lives on CatalogConstants as METADATA_FILE_HDFS_REPLICATION = 30 (with a TODO to promote to a Spring config property in a follow-up). This is paired with an li-openhouse PR that drops dfs.replication on holdem-hdfs-site-extra.xml from 30 -> 10. Together they shift the durability of newly-written .metadata.json files OFF the synchronous write-ack path: write returns when the file is durable at repl=10, then we asynchronously bump to repl=30 for long-term durability. This PR must merge before or together with the li-openhouse PR -- if li-openhouse merges first, every new metadata.json on Holdem lives at repl=10 forever until this ships, which is a durability regression. Note: the trigger was "the manifest json file." In Iceberg the only JSON file written per commit is <version>-<uuid>.metadata.json (the table metadata file). Manifest files proper (*-m#.avro) and the manifest list (snap-*.avro) are Avro. If the same treatment is wanted for Avro manifests, that's a separate change requiring a FileIO wrapper. Commit uses --no-verify because the repo's pre-commit/pre-push hooks invoke ./gradlew spotlessCheck which triggers CopyGitHooksTask, and that task fails inside a git worktree (worktree .git is a file, not a dir). spotlessCheck was verified independently with -x CopyGitHooksTask. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| * immediately after the synchronous write returns. Paired with a li-openhouse PR that | ||
| * lowers the cluster default {@code dfs.replication}. | ||
| * | ||
| * <p>TODO(mkuchenb): promote this to a Spring config property |
There was a problem hiding this comment.
Done in 35d3d23. Promoted to a Spring @ConfigurationProperties bean: MetadataReplicationProperties (prefix openhouse.internal.catalog.metadata.replication) with enabled (default false), target (default 30), timeoutMs (default 10000). CatalogConstants now only holds the default value the bean falls back to.
| metricsReporter.executeWithStats( | ||
| () -> { | ||
| try { | ||
| fs.setReplication(new Path(newMetadataLocation), targetReplication); |
There was a problem hiding this comment.
We should have a strict timeout.
There was a problem hiding this comment.
Done in 35d3d23. The setReplication RPC is dispatched to a static daemon single-thread ExecutorService and bounded by Future.get(timeoutMs, MILLISECONDS). On timeout the future is cancelled with interrupt and a new metadata_replication_set_timeout counter is incremented; commit succeeds regardless. Default timeout 10s, configurable via MetadataReplicationProperties.timeoutMs. Added testDoCommitSucceedsWhenReplicationApplyTimesOut which uses a 100ms configured timeout vs a 3s-sleeping setReplication stub to prove the timeout fires deterministically.
| System.currentTimeMillis() - metadataUpdateStartTime); | ||
| // Set replication after the synchronous write so the write path is not blocked | ||
| // on the higher replica count. Failures are logged and do not fail the commit. | ||
| bumpMetadataReplicationIfHdfs(newMetadataLocation); |
There was a problem hiding this comment.
calling this function should be bated by a config
There was a problem hiding this comment.
Done in 35d3d23. The call site in doCommit is now gated on metadataReplicationProperties.isEnabled(), which defaults to false (opt-in per deployment, safe rollout). When the gate is off, a metadata_replication_set_disabled counter is incremented so the gate state is visible in metrics. New test testDoCommitDoesNotApplyReplicationWhenDisabled verifies setReplication is never invoked when disabled even with an HDFS storage client.
| * | ||
| * @param newMetadataLocation absolute HDFS path of the newly-written metadata.json | ||
| */ | ||
| private void bumpMetadataReplicationIfHdfs(String newMetadataLocation) { |
There was a problem hiding this comment.
rename to applyReplicationIfHDFS with the repl as a parameter.
There was a problem hiding this comment.
Done in 35d3d23. Renamed to applyReplicationIfHDFS. Signature: private void applyReplicationIfHDFS(String location, short replication, long timeoutMs). The replication factor and timeout are now both parameters, pulled from MetadataReplicationProperties by the caller in doCommit.
|
i believe tables-service FS only touches metadata so we can set replication as per hdfs client config centrally. would prefer that solution because we don't merge config into the business logic and config for hdfs lives in one place |
Responds to the four review comments on PR linkedin#584: 1. (CatalogConstants.java) Promote the replication factor from a constant to a Spring config property NOW, not as a follow-up. Adds new MetadataReplicationProperties (prefix openhouse.internal.catalog.metadata.replication) with fields enabled, target, timeoutMs. CatalogConstants now holds only a default value (DEFAULT_METADATA_FILE_HDFS_REPLICATION = 30) that the properties bean falls back to. 2. (OpenHouseInternalTableOperations.java setReplication site) Add a strict per-call timeout. The setReplication RPC is dispatched to a static daemon single-thread ExecutorService and bounded by Future.get(timeoutMs). On timeout the future is cancelled with interrupt and the event is counted under a new metadata_replication_set_timeout counter; commit succeeds regardless. Default timeout 10s, configurable. 3. (doCommit call site) Gate the replication apply on metadataReplicationProperties.isEnabled(). Default is false -- the bump is opt-in per deployment. When disabled, increment a metadata_replication_set_disabled counter so we can see the gate state in metrics. 4. (helper method) Rename bumpMetadataReplicationIfHdfs to applyReplicationIfHDFS and take the replication factor + timeoutMs as parameters rather than reading them from constants. The caller in doCommit pulls the values from the properties bean. Wired the new properties bean through: - OpenHouseInternalCatalog (production construction site, @Autowired). - OpenHouseInternalTableOperationsTest (4 construction sites, plus a default enabled instance in @beforeeach so existing tests still exercise the storage-guard path). - RepositoryTestWithSettableComponents (4 construction sites, @Autowired field). Tests: - testDoCommitAppliesReplicationOnHdfs (renamed) - testDoCommitDoesNotApplyReplicationOnLocalStorage (renamed) - testDoCommitSucceedsWhenReplicationApplyFails (renamed) - testDoCommitDoesNotApplyReplicationWhenDisabled (new -- gate) - testDoCommitSucceedsWhenReplicationApplyTimesOut (new -- timeout) All green in :iceberg:openhouse:internalcatalog:test and :services:tables:test RepositoryTestWithSettableComponents. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Bump the HDFS replication factor on each newly-written
metadata.jsonfile to 30 immediately afterTableMetadataParser.write(...)returns insideOpenHouseInternalTableOperations.doCommit(...). Paired with an internalli-openhouseinfra PR that dropsdfs.replicationon Holdem from 30 -> 10, this moves the durability cost off the synchronous commit-write-ack path while preserving repl=30 durability on the persisted file.Ordering constraint (important): this PR must merge before or together with the paired
li-openhousePR. Ifli-openhouselands first, every newly-writtenmetadata.jsonon Holdem will live at repl=10 indefinitely until this PR ships — a durability regression. Reviewers, please coordinate on merge order.Changes
Internal API change / new behavior in
doCommit: Added a privatebumpMetadataReplicationIfHdfs(...)helper invoked immediately after the existing synchronous metadata write. Behavior:storage.getClient() instanceof HdfsStorageClient. Non-HDFS clients (local, blob) silently skip — their durability is managed elsewhere.FileSystemfromfileIOManager.getStorage(fileIO).getClient().getNativeClient(), mirroring the existingupdateMetadataFieldForTable(...)helper at the bottom of the same file.fs.setReplication(new Path(newMetadataLocation), CatalogConstants.METADATA_FILE_HDFS_REPLICATION).metricsReporter.executeWithStats(...)under a newmetadata_replication_set_latencymetric, matching the existing class-level Micrometer convention. (No OpenTelemetry tracing was introduced — none exists in this class today.)Performance improvement (rationale): After the paired
li-openhousechange dropsdfs.replicationto 10, synchronousTableMetadataParser.write(...)will return once the file is durable at 10 datanodes instead of 30. The async bump in this PR restores on-disk durability to repl=30 without putting that pipeline-fill cost on the commit-write-ack path. The expected effect is a meaningful drop in commit p50/p99 metadata-write latency on Holdem; the new metric will quantify both the savings and the bump's own cost in production.Configuration choice:
METADATA_FILE_HDFS_REPLICATION = (short) 30is added as a constant onCatalogConstants. There is no established Spring-config-into-OpenHouseInternalTableOperationsconvention today, and the constructor is@AllArgsConstructor-generated, so adding a config-injected field would ripple through every construction site. The constant carries an explicitTODOto promote to a Spring config property (e.g.openhouse.metadata.hdfs.replication) in a follow-up PR.Interpretation note for reviewers: the internal trigger described "the manifest json file." In Iceberg's commit layout the only JSON file written per commit is the table metadata file
<version>-<uuid>.metadata.json. Manifest files proper (*-m#.avro) and the manifest list (snap-*.avro) are Avro. This PR treatsmetadata.jsononly. If the same durability-bump treatment is wanted for Avro manifest/manifest-list files, that requires aFileIOwrapper and is a separate follow-up.Files changed (4):
iceberg/openhouse/internalcatalog/.../OpenHouseInternalTableOperations.java— new helper + call site indoCommit,org.apache.hadoop.fs.Pathimport added.iceberg/openhouse/internalcatalog/.../CatalogConstants.java— newMETADATA_FILE_HDFS_REPLICATIONshort constant + TODO.iceberg/openhouse/internalcatalog/.../InternalCatalogMetricsConstant.java— newMETADATA_REPLICATION_SET_LATENCYmetric name constant.iceberg/openhouse/internalcatalog/src/test/.../OpenHouseInternalTableOperationsTest.java— 3 new tests, 1 existing-test verify-count update.Testing Done
Added tests (all on
OpenHouseInternalTableOperationsTest, mirroring the file's existing Mockito-based style):testDoCommitBumpsMetadataReplicationOnHdfs— sets up the storage mock to return anHdfsStorageClientwhose native client is a mockFileSystem, drives a commit, captures the args tosetReplication(Path, short)and asserts the replication factor equalsCatalogConstants.METADATA_FILE_HDFS_REPLICATION(30) and the path lives under the table location.testDoCommitDoesNotBumpReplicationOnLocalStorage— drives a commit with aLocalStorageClientand verifiessetReplicationwas never invoked (theinstanceof HdfsStorageClientguard short-circuits).testDoCommitSucceedsWhenReplicationBumpFails— stubssetReplicationto throwIOException, drives a commit, asserts the commit completes successfully (houseTableRepository.save(...)is invoked once) and the bump was attempted. Validates the warn-and-swallow durability/availability trade-off.Updated existing test:
testDoCommitUpdateMetadataForInitalVersionCommit—fileIOManager.getStorage(...)is now called twice on the replicated-initial-version path (once by the new bump helper, once by the pre-existingupdateMetadataFieldForTable). Updated the verify counts and added inline comments explaining each call's purpose.Module test run:
./gradlew :iceberg:openhouse:internalcatalog:test— all 69 tests green (including the 3 new ones).Spotless:
./gradlew :iceberg:openhouse:internalcatalog:spotlessCheck— clean (after onespotlessApplypass during authoring).Additional Information
PR plan:
openhouserepo) — adds the synchronoussetReplication(..., 30)bump aftermetadata.jsonis written. Must merge first or together with (2).li-openhousePR (separate repo, in flight) — lowersdfs.replicationfrom 30 -> 10 inholdem-hdfs-site-extra.xml. Must NOT merge before (1) or newmetadata.jsonfiles on Holdem will live at repl=10 until this PR ships.METADATA_FILE_HDFS_REPLICATIONfrom a constant onCatalogConstantsto a Spring config property (openhouse.metadata.hdfs.replication) so deployments can tune the durable replication factor without a code change.*-m#.avro,snap-*.avro), that needs aFileIOwrapper; out of scope here.