Skip to content

Skip metadata.json parse in drop path#589

Merged
abhisheknath2011 merged 3 commits into
linkedin:mainfrom
dushyantk1509:dushyantk1509/drop-table-skip-loadtable
May 26, 2026
Merged

Skip metadata.json parse in drop path#589
abhisheknath2011 merged 3 commits into
linkedin:mainfrom
dushyantk1509:dushyantk1509/drop-table-skip-loadtable

Conversation

@dushyantk1509
Copy link
Copy Markdown
Contributor

Summary

Dropping a table currently goes through catalog.loadTable twice:

  1. TablesServiceImpl.deleteTablefindById parses metadata.json to build a TableDto for the ACL check.
  2. OpenHouseInternalCatalog.dropTableloadTable(id).location() parses metadata.json again to get the table base directory for fileIO.deletePrefix on purge.

Both parses are redundant — everything drop needs (databaseId, tableId, tableUUID, the metadata.json path) is already on the HTS row. The redundant parses also make drop fragile: if metadata is semantically invalid (e.g. Cannot find schema with current-schema-id=6 from schemas), drop fails even though the HTS row and storage files are still well-formed.

This PR removes both loadTable calls from the drop path:

  • OpenHouseInternalCatalog#findHouseTable(TableIdentifier) — HTS-only lookup that returns the HouseTable row, no metadata parse.
  • OpenHouseInternalRepository#findStubById(TableDtoPrimaryKey) — returns a partial TableDto (databaseId, tableId, tableUUID, tableLocation) built from the HTS row. Enough for the OPA auth check, which only uses db + UUID.
  • TablesServiceImpl#deleteTable uses findStubById instead of findById.
  • OpenHouseInternalCatalog#dropTable uses findHouseTable and derives the storage base location as the parent directory of the metadata.json path — same derivation OpenHouseInternalRepositoryImpl#save already uses in its replace flow, since OpenHouse writes metadata.json directly under <base>/ (no /metadata/ subdir).

A side benefit: corrupted-metadata tables can now be dropped end-to-end through the normal API.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

For all the boxes checked, please include additional details of the changes made in this pull request.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

For all the boxes checked, include a detailed description of the testing done for the changes made in this pull request.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

For all the boxes checked, include additional details of the changes made in this pull request.

Dushyant Kumar and others added 2 commits May 15, 2026 16:54
…ropped

Previously, dropping a table went through catalog.loadTable twice:
1. TablesServiceImpl.deleteTable -> openHouseInternalRepository.findById
   for the auth-check TableDto, which parses metadata.json.
2. OpenHouseInternalCatalog.dropTable -> loadTable(id).location() to get
   the table base directory for fileIO.deletePrefix on purge.

Both fail when metadata.json is corrupted (e.g. "Cannot find schema with
current-schema-id=6 from schemas"), making it impossible to drop the
table through the normal flow.

This change avoids loadTable in both spots by going directly to HTS:

- OpenHouseInternalCatalog#findHouseTable: HTS-only lookup returning the
  HouseTable row, no metadata parse.
- OpenHouseInternalRepository#findStubById: returns a partial TableDto
  (databaseId, tableId, tableUUID, tableLocation) built from the HouseTable
  row. Enough for the auth check, which only needs db + UUID.
- TablesServiceImpl#deleteTable now uses findStubById.
- OpenHouseInternalCatalog#dropTable uses findHouseTable and derives the
  table base as the parent directory of the metadata.json path -- the
  same derivation OpenHouseInternalRepositoryImpl#save already uses in
  the replace flow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Unit tests:
- OpenHouseInternalCatalogTest: findHouseTable happy path + HouseTableNotFoundException
  handling; dropTable behavior across {row missing, purge=true, purge=false} including
  verifying the metadata.json -> base location derivation via deletePrefix arg.
- OpenHouseInternalRepositoryImplTest: findStubById happy path (asserting which fields
  are populated and which stay null), empty-Optional propagation, and the instanceof
  guard for non-OpenHouseInternalCatalog catalogs.

Integration test:
- TablesServiceTest (h2): testTableDeleteSucceedsWhenMetadataJsonIsCorrupted creates a
  table, overwrites metadata.json with garbage so TableMetadataParser would throw, then
  verifies deleteTable still succeeds and the HTS row is gone. Regression coverage for
  the corrupted-table scenario this change set fixes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dushyantk1509 dushyantk1509 marked this pull request as ready for review May 18, 2026 15:52
- Switch metadata.json -> table base derivation in OpenHouseInternalCatalog.dropTable
  from substring to org.apache.hadoop.fs.Path.getParent().toString() (URI-aware).
- Add .endsWith(".metadata.json") guard before deriving the base, so a directory
  value in metadata_location (from a bad migration / manual edit) fails loudly
  instead of letting deletePrefix wipe the parent directory.
- Extract derivation + guard into private getTableBaseLocation helper.
- Rename OpenHouseInternalRepository.findStubById -> findTableRefById to convey
  that the returned partial TableDto is a catalog-level reference (identifiers +
  tableUUID + metadata.json location), not a generic stub.
- Add unit test for the new guard
  (dropTableRefusesWhenMetadataLocationIsNotAMetadataJsonFile).
- Rename existing findStubById tests to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@abhisheknath2011 abhisheknath2011 merged commit 1ff529e into linkedin:main May 26, 2026
1 check passed
cbb330 pushed a commit that referenced this pull request May 28, 2026
## Summary

Reverts commit `637896219c53cad672a4a3d381c1dad302971854` — "Support
returning HTS fields in table list api (#579)".

The revert conflicted with #589 (merged after #579) in
`OpenHouseInternalCatalog.java`. To make the conflict visible to the
reviewer, this PR is split into two commits:

1. **Revert with conflict markers preserved** (`a15b0da5`) — raw `git
revert` output, `OpenHouseInternalCatalog.java` left with `<<<<<<<` /
`=======` / `>>>>>>>` markers. Committed with `--no-verify` because the
markers fail spotless.
2. **Resolve conflict** (`387dd773`) — keep `findHouseTable` (from #589,
still used by the drop path), drop `listHouseTables` (from #579, the new
addition being reverted).

The intent of the split is to let the reviewer inspect what conflicted,
then see exactly how it was resolved.

## Changes

- [X] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [X] Refactoring
- [ ] Documentation
- [X] Tests

Reverts the public list-table API additions and the supporting tests
from #579.

## Testing Done

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [X] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

Tests added by #579 are removed as part of the revert. Relying on CI to
confirm the remaining suite (including the #589 drop-path tests that
depend on `findHouseTable`) still passes.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: mkuchenbecker <mkuchenbecker@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maluchari added a commit that referenced this pull request May 29, 2026
Reverts the 17 commits that landed on main after v0.5.417, bringing the
tree back to exactly the v0.5.417 state. Squashed into a single revert
commit for reviewability and to allow reinstating everything as one unit
(revert this commit to bring all 17 changes back).

Reverted commits (v0.5.417..main, newest first):
- Revert #579 (HTS fields in table list api) (#610)
- feat(optimizer): [3/N] Analyzer (#533)
- [DataLoader] Handle Cast(Literal, TIMESTAMP/DATE/TIME) in scan
optimizer (#569 follow-up) (#583)
- Skip metadata.json parse in drop path (#589)
- feat(optimizer): [2/N] Optimizer REST Service and Controller (#531)
- [BDP-102028] feat(optimizer): [1/N] Optimizer Database (#530)
- [RTAS]: Fix bug - remove fs scheme from tableLocation in commit (cont)
(#594)
- Trigger ELR process (#593)
- [BDP-102028] feat(optimizer): [0/N] Optimizer API and internal model
(#527)
- Fail retention app when the columnPattern mismatch partition spec
(#552)
- [DataLoader] Drop OpenTelemetry minimum version to 1.38.0 (#590)
- [DataLoader] Emit OpenTelemetry metrics for read operations (#582)
- Cache iceberg metadata to reduce redundant requests to storage (#509)
- bump iceberg 1.2 version to 1.2.0.17 (#587)
- Support returning HTS fields in table list api (#579)
- [DataLoader] Add unique id property to OpenHouseDataLoader (#580)
- [DataLoader] Add OpenTelemetry metrics support (#575)

## Summary

<!--- HINT: Replace #nnn with corresponding Issue number, if you are
fixing an existing issue -->

[Issue](https://github.com/linkedin/openhouse/issues/#nnn)] Briefly
discuss the summary of the changes made in this
pull request in 2-3 lines.

## Changes

- [ ] Client-facing API Changes
- [ ] Internal API Changes
- [ ] Bug Fixes
- [ ] New Features
- [ ] Performance Improvements
- [ ] Code Style
- [ ] Refactoring
- [ ] Documentation
- [ ] Tests

For all the boxes checked, please include additional details of the
changes made in this pull request.

## Testing Done
<!--- Check any relevant boxes with "x" -->

- [ ] Manually Tested on local docker setup. Please include commands
ran, and their output.
- [ ] Added new tests for the changes made.
- [ ] Updated existing tests to reflect the changes made.
- [ ] No tests added or updated. Please explain why. If unsure, please
feel free to ask for help.
- [ ] Some other form of testing like staging or soak time in
production. Please explain.

For all the boxes checked, include a detailed description of the testing
done for the changes made in this pull request.

# Additional Information

- [ ] Breaking Changes
- [ ] Deprecations
- [ ] Large PR broken into smaller PRs, and PR plan linked in the
description.

For all the boxes checked, include additional details of the changes
made in this pull request.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants