Containers: batch container data-model column tag retrieval to avoid subtree fan-out#27836
Containers: batch container data-model column tag retrieval to avoid subtree fan-out#27836
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves how Container dataModel column tags are fetched to address performance problems and incorrect tag leakage in deeply nested container hierarchies. It replaces prefix-based tag retrieval with an exact-match (by target FQN hash) batched lookup, and adds integration tests to validate correct tag isolation and nested-struct tag handling.
Changes:
- Update
ContainerRepositoryto fetch dataModel column tags via flattened columns + exact target FQN hash lookups. - Add
TagUsageDAOquery/helper to retrieve tag usages by a list of target hashes with chunking. - Add integration tests to ensure parent containers don’t receive child container column tags and that nested struct child column tags are returned.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/ContainerRepository.java | Switch column-tag population from prefix scanning to flattened-column exact hash batching. |
| openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java | Add TagUsageDAO query + chunked helper for fetching tags by target FQN hashes. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/ContainerResourceIT.java | Add regression tests for tag leakage across container hierarchy and nested-struct column tag retrieval. |
populateDataModelColumnTags previously called addDerivedTagsGracefully once per flattened column, which internally batches across that column's own tags but issues a separate derived-tag DB lookup for every column. On data models with many columns (or struct types with deep nesting) this becomes an N+1 pattern. Refactor: - Pre-compute Map<String, Column> hashToColumn once (LinkedHashMap to preserve column order) so we no longer hash each FQN twice — once for the target-hash list and again on lookup. - After fetching tags by target hash, flatten all returned TagLabels into a single list and call TagLabelUtil.batchFetchDerivedTags(...) once for the whole data model. - Per column, use addDerivedTagsWithPreFetched(columnTags, derivedMap) to avoid further DB lookups. - Fall back to the per-column addDerivedTagsGracefully path if the batch derived-tag fetch raises, preserving existing semantics. Net effect: total derived-tag DB queries drop from O(N) to 1 regardless of column count or nesting depth. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fetchAndSetDataModelColumnTags is the bulk fieldFetcher used by setFieldsInBulk on every page of containers (search index rebuilds, list endpoints, PII masking, etc). It batched container-level tags in a single IN query, but then called addDerivedTagsGracefully once per container to enrich with derived tags. With pages of 50 and a service holding 100k containers, that's up to 100k extra derived-tag DB round-trips on container-level tags alone — separate from the per- column N+1 already addressed in populateDataModelColumnTags. Refactor: - Flatten all returned container TagLabels and call TagLabelUtil.batchFetchDerivedTags(allContainerTags) once per bulk batch. - Per container, use addDerivedTagsWithPreFetched(containerTags, derivedTagsMap) so no further DB lookup happens. - Fall back to per-container addDerivedTagsGracefully if the batch fetch raises, preserving existing semantics. Net effect: container-level derived-tag DB queries drop from O(N) per page to 1, regardless of page size. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ContainerRepository.listChildren paged through the relationship table and then called Entity.getEntity once per child reference to hydrate the entity. With the UI's default page size of 10 that's already 10 sub-queries per page; the API allows up to 1,000,000 per page, so the loop scales linearly with page size. Replace the per-ref loop with a single dao.findEntitiesByIds(ids, Include.ALL) call, then look each ref up in the resulting map to preserve the relationship-offset ordering returned by findToWithOffset. findEntitiesByIds chunks at 30k IDs internally, so callers requesting the maximum page are still safe. Refs whose entity was deleted between the relationship lookup and the bulk fetch are skipped rather than failing the whole page (mirrors the per-ref Entity.getEntity behaviour where a missing entity would have thrown). Net effect on a default-paginated UI request: ~11 queries → 2 (one findToWithOffset, one bulk findEntitiesByIds). The countFindTo total query is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removing `children` from the Container `fields=` allow-list (the slim payload
change for Tahoe-shaped buckets) broke two integration tests that fetched
the parent with `fields=["*"]` and walked `bucket.children.root`. With the
new contract that field is `None`, so those assertions raised AttributeError.
Migrate both tests to the dedicated `/v1/containers/name/{fqn}/children`
endpoint via a new `OpenMetadata.list_container_children(fqn, limit, offset)`
helper. The helper is the public migration path SDK consumers should follow
for the same reason; documenting it in code is cheaper than a release note.
Affected:
- tests/integration/auto_classification/containers/test_container_classification.py
- tests/integration/s3/test_s3_storage.py
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…xity The 30s in-memory count cache had two problems that outweighed its benefit: - Per-pod (Guava, not Redis) so two pods serve different paging.totals; paging gets visibly inconsistent during the TTL window. - Applied to every entity, including small tables (tag, glossary_term, …) where SELECT count(*) is microseconds and caching adds pure overhead. Drop the cache, the cachedListCount wrapper, and the buildListCountCacheKey helper; switch the three callers in listAfter / listAfterWithOffset / listBefore to call dao.listCount directly. If a specific entity's count is ever measured to be slow, the right fix is targeted (Postgres pg_class.reltuples estimate, or just dropping paging.total entirely for that endpoint), not a per-pod global cache. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five small Copilot/gitar findings on PR #27836: - OpenMetadataServerHealthCheckTest: comment said "intentionally low (1 ms)" while the assertion enforces 50 ms; updated comment to match. - ContainerRepository derived-tags fallback: pass the exception object to LOG.warn so the stack trace is preserved (was logging only ex.getMessage()). - HikariCPDataSourceFactory: drop the stale "default 0 (disabled)" inline comment that contradicts the 60s default we now apply. - DataAssetsHeader: include `entityType` in the breadcrumb fetch effect's dependency array so switching entity type while keeping FQN re-runs. - storageAPI / ContainerChildren: type the /children response as `PagingResponse<Container[]>` (matches the slim Container summaries the server now returns) and tighten the component's state + column generics to `Container[]` instead of `EntityReference[]`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
basedpyright runs in --baselinemode=discard for the static-checks CI job, so new uses of self.get_suffix / self._use_raw_data / dict-typed responses on self.client.get(...) are flagged even though the same patterns are baselined in older mixin code. Sidestep the new errors by hardcoding the /containers path (the REST client prepends /api/v1) and narrowing the response via isinstance(resp, dict) before calling .get(...). Drop the _use_raw_data branch — the helper has a typed return so callers don't need a raw-response escape hatch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
The Python checkstyle failed. Please run You can install the pre-commit hooks with |
Two gitar review findings on the container tag-fetch refactor: - The setTags=false branch of populateDataModelColumnTags ran a c.setTags(c.getTags()) loop over every flattened column — a literal no-op carried over from Entity.populateEntityFieldTags. Replace it with an early return and a comment explaining the prior behavior. - The batchFetchDerivedTags try/catch + warn-and-fallback existed verbatim in both fetchAndSetDataModelColumnTags (bulk path) and populateDataModelColumnTags (single path). Extract a private tryBatchFetchDerivedTags helper so the fallback behavior stays in lockstep across both paths. No behavior change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- ContainerChildren: include `t` in the columns useMemo dep array so
in-flight language switches re-render the column titles. The memo had
[] but referenced t('label.name'), so the title stayed in the prior
language until the component remounted.
- ContainerPage: eagerly fetch the children paging.total on container
navigation (limit=0, no row payload) so the Children tab badge shows
the correct count immediately. Without this, the badge stayed at 0
for containers that default to a non-Children tab (e.g. ones with a
dataModel that opens the Schema tab) until the user clicked Children
and ContainerChildren reported the count via its context setter.
- ContainerPage.test: add a default mockResolvedValue for the new
getContainerChildrenByName call so existing tests don't blow up on
an unmocked .then.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drop the dedicated slow-request log entirely and let those lines flow into the main application log alongside everything else. Three pieces removed: - conf/openmetadata.yaml: the org.openmetadata.slowrequest logger block that pinned level=OFF (additive=false) and routed records to a separate ./logs/slow-requests.log file. Without the override, the topic now inherits root (INFO + server.log + stdout). - openmetadata-service / integration-tests logback.xml: matching per-logger entries removed. - OpenMetadataApplication.routeSlowRequestLogToServerLog: the runtime hack that reached into the logback context to flip additivity on. With the YAML override gone, the workaround it was working around is gone too. RequestLatencyContext still uses @slf4j(topic = "org.openmetadata.slowrequest") so operators can grep / filter by name, but the slow-request emission is now LOG.info instead of LOG.warn — the user's request was "log to main logger in INFO mode". The other LOG.warn calls in the class stay (they're real config-error warnings, not slow-request emissions). Addresses Copilot review 4209338255 (don't override operator logging config) by removing the override entirely. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace `(container?.children ?? []) as unknown as Container[]` with a structural row type (`ContainerChildRow`) that only declares the fields the table actually renders. Both the slim Container summaries returned by the /children endpoint and the EntityReference[] the customize-page widget preview feeds in are structurally assignable to it, so no cast is needed and the compiler will catch a real shape mismatch (e.g., if `description` ever moved off Container) instead of letting it slip through. Suggested in Copilot review on ContainerChildren.tsx:130. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 29 out of 29 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
openmetadata-ui/src/main/resources/ui/src/components/DataAssets/DataAssetsHeader/DataAssetsHeader.component.tsx:355
fetchContainerAncestorsdoes not guard against stale in-flight requests. If the user navigates quickly between containers (or an earlier request resolves after a later navigation), the older response can callsetParentContainers(...)and overwrite the newer container’s breadcrumbs.
Consider tracking the latest requested FQN in a useRef (or using an AbortController) and only applying ancestors/setIsBreadcrumbLoading(false) when the resolving request still matches the current dataAsset.fullyQualifiedName.
const fetchContainerAncestors = async (containerFqn: string) => {
// Always reset state at the top so a navigation to a container without an FQN
// (or to one whose ancestor fetch fails) doesn't keep painting the previous
// container's breadcrumbs. Without this reset, switching between containers
// could leave stale ancestors visible after either an early return or an error.
setParentContainers([]);
if (isEmpty(containerFqn)) {
return;
}
setIsBreadcrumbLoading(true);
try {
// Single batched server call replaces what used to be one
// getContainerByName per ancestor level.
const ancestors = await getContainerAncestors(containerFqn);
setParentContainers(ancestors ?? []);
} catch (error) {
showErrorToast(error as AxiosError, t('server.unexpected-response'));
} finally {
setIsBreadcrumbLoading(false);
| # Logging settings. | ||
| # https://logback.qos.ch/manual/layouts.html#conversionWord | ||
| # Set LOG_FORMAT=json for structured logs. The default text format preserves legacy output. | ||
| logging: | ||
| level: ${LOG_LEVEL:-INFO} | ||
| loggers: | ||
| org.openmetadata.service.util.jdbi.OMSqlLogger: | ||
| level: ${SQL_LOG_LEVEL:-INFO} | ||
| additive: false | ||
| appenders: | ||
| - type: file | ||
| layout: | ||
| type: om-event-layout | ||
| format: ${LOG_FORMAT:-text} | ||
| pattern: "%level [%d{ISO8601,UTC}] [%t] %logger{5} - %msg%n" | ||
| appendLineSeparator: true | ||
| threshold: DEBUG | ||
| currentLogFilename: ./logs/queries.log | ||
| archivedLogFilenamePattern: ./logs/queries-%d{yyyy-MM-dd}-%i.log.gz | ||
| archivedFileCount: 7 | ||
| timeZone: UTC | ||
| maxFileSize: 50MB |
There was a problem hiding this comment.
With the dedicated org.openmetadata.slowrequest logger removed from the sample config, the RequestLatencyContext slow-request logs (topic org.openmetadata.slowrequest) will now inherit the root logger settings and flow into the main appenders by default. Combined with the switch to LOG.info(...), this can significantly increase log volume and also removes the previous ability to toggle/route these logs separately.
If the intent is to keep slow-request logging optional or isolated, consider re-adding an explicit org.openmetadata.slowrequest logger entry here (potentially with its own appender/level) or documenting the new recommended override mechanism for operators.
The container detail page no longer inlines `children` into the parent
payload (it was unbounded for object-store buckets), and added a server-side
ancestor resolver to skip the per-level breadcrumb fetch loop. Mirror both
in the Java and Python SDKs so external consumers have a clean migration
path off `fields=children`.
Java SDK
- ContainerService: listChildren(fqn, ListParams) → ListResponse<Container>
(slim projection, paginated via limit/offset) and listAncestors(fqn) →
List<EntityReference> (root → immediate parent, single call).
- Containers fluent: top-level `Containers.listChildren(fqn).limit(N).fetch()`
and `Containers.listAncestors(fqn)`, plus `FluentContainer.children()` /
`.ancestors()` for chaining off an already-fetched container.
Python SDK
- Containers.list_children(fqn, limit=, offset=) → EntityList[Container],
delegates to the existing `OMetaContainerMixin.list_container_children`
helper.
- Containers.list_ancestors(fqn) → list[EntityReference], hits
/v1/containers/name/{fqn}/ancestors directly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The line speaks for itself; the why is in the commit history and the ContainerResource Swagger description. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A 50ms wall-clock bound in a unit test is flaky under JVM warmup, GC, and CI scheduler noise — and it doesn't actually verify the "no I/O" contract, just that whatever the test happened to do finished fast. The remaining test covers what we actually care about: the probe returns healthy. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
With fields=children removed, /v1/containers/name/{fqn}/children is the
primary way to enumerate a container's children. The endpoint took a
SecurityContext but never called authorizer.authorize on it, so a caller
without VIEW_BASIC on the parent could still walk its subtree.
Mirror the gate listAncestors got: build a VIEW_BASIC OperationContext +
ResourceContextByName(fqn) and authorize before delegating to the repo.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
URLEncoder is form-encoding, not path-encoding: it emits + for spaces and leaves / unencoded, both of which the server treats differently in a path segment than a query parameter. EntityServiceBase already has the right helper (buildPathWithEncodedName) backed by okhttp3.HttpUrl.Builder for RFC-3986 path-segment encoding — but it was private. Promote it to protected and reuse for /children and /ancestors so a container FQN with a space (or any character that differs between form and path encoding) hits the right endpoint instead of 404'ing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Code Review ✅ Approved 11 resolved / 11 findingsOptimizes container tag retrieval by replacing unbounded prefix lookups with a batched exact-match strategy for column FQN hashes. All identified issues, including redundant hash computation, authorization gaps, and duplicate tag processing logic, have been resolved. ✅ 11 resolved✅ Performance: FQN hash computed twice per column in populateDataModelColumnTags
✅ Bug: Bulk-fetched children skip setFields/setDefaultFields hydration
✅ Security: listAncestors endpoint has no authorization check
✅ Quality: getAncestors returns mutable lists instead of unmodifiable
✅ Quality: EntityReferenceRow should be a Java 21 record
...and 6 more resolved from earlier reviews OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 33 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (2)
openmetadata-service/src/main/resources/logback.xml:27
RequestLatencyContextlogs to theorg.openmetadata.slowrequestlogger topic. With the dedicated logger removed here, slow-request diagnostics will inherit the root logger configuration (STDOUT + main file) and can no longer be routed to a separate file / disabled via a dedicated logger level. If this was not intentional, consider re-adding an explicitorg.openmetadata.slowrequestlogger configuration (level + optional file appender) so operators can control noise and retention for slow-request logs independently of the main application logs.
<logger name="org.openmetadata.service.Entity" level="ERROR" />
<!-- Add specific logger for SearchIndexApp and ReindexingJobLogger -->
<logger name="org.openmetadata.service.apps.bundles.searchIndex" level="INFO" />
<root level="${LOG_LEVEL:-INFO}">
<appender-ref ref="STDOUT" />
<appender-ref ref="file"/>
</root>
openmetadata-sdk/src/main/java/org/openmetadata/sdk/services/storages/ContainerService.java:79
encodeFqnusesURLEncoder.encode(...)directly for a URL path segment.URLEncoderis intended for query-string encoding and will encode spaces as+, which is not a valid path encoding and is inconsistent with other SDK services (e.g. PipelineService replaces+with%20). This can break requests for FQNs containing spaces (or other characters that need path encoding).
Consider switching to a shared path-segment encoder (or at minimum .replace("+", "%20")) to keep SDK URL construction correct and consistent across services.
|
|



Summary
Container retrieval was running a depth-unbounded LIKE 'hashPrefix.%' against tag_usage to fetch tags for a container's data-model columns. Because Containers can contain other
Containers indefinitely (S3/ADLS/GCS folder hierarchies), this prefix match returned the entire descendant subtree of every nested container, then filtered in memory by exact hash match.
For a instance with ~500K assets under one bucket, a single top-level container GET matched ~595K rows out of 7.4M.
This PR replaces the prefix LIKE with a batched exact-match IN over the columns' hashes — Container-only, since it's the only entity type with unbounded child nesting.
Background
Companion to #27745 / #27824, which restored the partial-index match by adding back the WHERE state = 1 predicate. Those PRs unblock the seq scan; this PR removes the unnecessary fan-out
so the indexed seek doesn't return 595K rows we don't need.
Why Container only
Surveyed all 8 callers of Entity.populateEntityFieldTags (Table, Container, File, Topic, DashboardDataModel, APIEndpoint, Worksheet, TestCase). Only Container's children can themselves
nest — the rest have schema-bounded child sets where the prefix LIKE is fine. No shared-helper change, no blast radius on the other 7 entity types.
Changes
CollectionDAO.TagUsageDAO
ContainerRepository
columns (including struct children), hashes their FQNs, and calls the new batched DAO.
Untouched:
Behavior
Same observable semantics (derived-tag wrapping, setTags=false skip, struct-child flattening). Difference is purely in how the data is fetched:
┌───────────────────────────────────────┬───────────────────────────────────────────────┬─────────────────────────────────────────────────┐
│ │ Before │ After │
├───────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────────────────────┤
│ Query │ WHERE targetfqnhash_lower LIKE 'hash.%' │ WHERE targetFQNHash IN (...) │
├───────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────────────────────┤
│ Match scope │ Entire descendant subtree │ Exact column FQNs only │
├───────────────────────────────────────┼───────────────────────────────────────────────┼─────────────────────────────────────────────────┤
│ Rows fetched on a top-level container │ All tags below it (595K in the reported case) │ Tag count of the data model's flattened columns │
└───────────────────────────────────────┴───────────────────────────────────────────────┴─────────────────────────────────────────────────┘
Tests
Added to ContainerResourceIT:
parent's column tags are correct and (b) the child container's tag does not leak into any of the parent's columns. Regression test for the over-fetch.
Verification
Summary by Gitar
ContainerServiceto use the base class methodbuildPathWithEncodedNameinstead of a manualURLEncoderimplementation.buildPathWithEncodedNameinEntityServiceBasetoprotectedto enable access by subclasses.This will update automatically on new commits.