feat(search): cluster fitness diagnostic — REST endpoint + ops CLI#28447
mohityadav766 wants to merge 4 commits into
Adds `GET /v1/system/search/fitness` and an `openmetadata-ops search-fitness` subcommand that produce a structured report on whether the configured Elasticsearch/OpenSearch cluster is sized for the data OpenMetadata is storing. Checks per-index data footprint (size + avg doc bytes), disk watermarks, heap/CPU, thread-pool rejections, circuit breakers, shard layout, and shard density vs heap, with AWS OpenSearch sizing guidance (1.45 storage overhead, ≤25 shards/GB heap, +1 data-node buffer, dedicated master at ≥10 data nodes) baked into the recommendations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agger
The class-level @Hidden already covers /v1/system/*, but adding @Hidden explicitly on /search/fitness plus hidden=true on @Operation guarantees the diagnostic stays out of any generated client SDK or Swagger UI even if the class-level annotation is ever removed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a search cluster “fitness” diagnostic to help operators evaluate whether the configured Elasticsearch/OpenSearch cluster is appropriately sized for the current OpenMetadata search footprint, exposed via an admin-only REST endpoint and an openmetadata-ops CLI command.
Changes:
- Introduces `SearchClusterFitnessAnalyzer`, which probes ES/OS via raw REST GETs and emits a structured `SearchClusterFitnessReport` (signals, node/index footprints, sizing guidance).
- Adds `GET /v1/system/search/fitness` (admin-only) and `openmetadata-ops search-fitness` (`--json` optional) for retrieving the report.
- Updates the default `elasticsearch.clusterAlias` in `conf/openmetadata.yaml` and adds an integration test for the endpoint.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java | Adds search-fitness ops subcommand and report printing/JSON output. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/SizingGuidance.java | New POJO for capacity guidance output. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/SearchRestProbe.java | New helper to issue raw GETs to ES/OS and parse responses to JsonNode. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/SearchClusterFitnessRules.java | Adds sizing/threshold constants (heap, shards, disk watermarks, etc.). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/SearchClusterFitnessReport.java | New top-level report model for the diagnostic output. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/SearchClusterFitnessAnalyzer.java | Core analyzer implementation that probes the cluster and generates signals + sizing guidance. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/NodeFootprint.java | New per-node footprint model (heap/disk/CPU/threadpool/breakers). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/IndexFootprint.java | New per-index footprint model (size/docs/shards/health). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/FitnessVerdict.java | New enum for overall verdict (READY/STRAINED/OVERLOADED/UNKNOWN). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/FitnessSignal.java | New model for individual fitness signals (severity/threshold/recommendation). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/FitnessSeverity.java | New enum for signal severity. |
| openmetadata-service/src/main/java/org/openmetadata/service/resources/system/SystemResource.java | Adds admin-only REST endpoint /v1/system/search/fitness. |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SearchClusterFitnessResourceIT.java | Adds integration test for the new endpoint (admin happy path). |
| conf/openmetadata.yaml | Changes default elasticsearch.clusterAlias to "openmetadata". |
| private void checkClusterStatus( | ||
| List<FitnessSignal> signals, JsonNode health, List<String> inaccessible) { | ||
| if (health == null) { | ||
| inaccessible.add("/_cluster/health"); | ||
| return; | ||
| } |
| String reason = text(explain, "unassigned_info"); | ||
| if (reason == null) { | ||
| JsonNode info = explain.get("unassigned_info"); | ||
| if (info != null) { | ||
| String r = text(info, "reason"); | ||
| String details = text(info, "details"); | ||
| if (r != null) { | ||
| result = details == null ? r : r + " (" + details + ")"; | ||
| } | ||
| } | ||
| } else { | ||
| result = reason; |
| nodes.stream() | ||
| .map(NodeFootprint::getHeapMaxBytes) | ||
| .filter(java.util.Objects::nonNull) | ||
| .mapToLong(Long::longValue) | ||
| .sum(); |
| int dataNodes = | ||
| nodes.isEmpty() | ||
| ? Math.max(1, report.getDataNodes() == null ? 1 : report.getDataNodes()) | ||
| : nodes.size(); |
| assertThat(sizing.has("recommendedDataNodes")).isTrue(); | ||
| assertThat(sizing.has("recommendedHeapPerNodeBytes")).isTrue(); | ||
| assertThat(sizing.has("rationale")).isTrue(); | ||
| assertThat(sizing.get("rationale").asText()).isNotBlank(); |
| @Test | ||
| void admin_can_fetch_fitness_report_with_signals_and_sizing() throws Exception { | ||
| final OpenMetadataClient client = SdkClients.adminClient(); | ||
|
|
||
| final String body = | ||
| client | ||
| .getHttpClient() | ||
| .executeForString( | ||
| HttpMethod.GET, | ||
| "/v1/system/search/fitness", | ||
| null, | ||
| RequestOptions.builder().build()); | ||
|
|
- conf/openmetadata.yaml: revert clusterAlias default to "" — the change
to "openmetadata" was a breaking change for any deployment upgrading
without setting ELASTICSEARCH_CLUSTER_ALIAS (indices would silently
appear missing). The fitness analyzer reads the configured alias
via SearchRepository.getClusterAlias() and works against any value.
- analyze(): split into collectClusterSnapshot() / initReport() /
runChecks() / finalizeReport() with a private ClusterSnapshot holder.
Each method now fits the 15-line guideline.
- Lazy-fetch /_cat/shards and /_cluster/allocation/explain — only
hit when unassigned_shards > 0. Avoids the guaranteed 400 from
allocation/explain on healthy clusters and the multi-MB shard list
on large clusters.
- extractAllocationExplainReason: drop the buggy asText("unassigned_info")
call (asText() on an object node returns ""). Walk into the object
directly via .isObject() check.
- Treat Jackson NullNode/MissingNode as inaccessible in checkClusterStatus
and checkDedicatedMaster via new isUsable() helper. Probe returns
NullNode on failure; previous null-only guard let degraded responses
through.
- checkShardsPerHeapGb and buildSizingGuidance now filter to data-role
nodes (data, data_*) before summing heap or counting nodes. Including
master/coordinator heap understated shard density and over-estimated
available heap per data node.
- SearchRestProbe.elasticGet: guard the Rest5Client cast with instanceof
so a future ES client swap surfaces a clear LOG.debug instead of
ClassCastException-swallowed-by-catch.
- SearchClusterFitnessResourceIT:
- Loosen sizing assertions so INSUFFICIENT_DATA verdict (fresh cluster,
no OM indices) does not fail the test on missing recommendedDataNodes
/ recommendedHeapPerNodeBytes — those are intentionally null in that
case.
- Add non_admin_cannot_fetch_fitness_report covering 401/403 rejection
for a DataConsumer token.
- Add unauthenticated_request_is_rejected covering the 401 path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
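The data-role filtering described in the commit message above can be sketched as follows. This is a minimal, stdlib-only illustration, not the analyzer's actual code: the `Node` record, the `ShardDensitySketch` class, and the `shardsPerHeapGb` method are hypothetical names, and only the role rule (`data` or `data_*`) is taken from the commit message.

```java
import java.util.List;
import java.util.Objects;

final class ShardDensitySketch {
  private static final double BYTES_PER_GB = 1024.0 * 1024.0 * 1024.0;

  /** Hypothetical stand-in for a per-node footprint: roles plus max JVM heap (may be null). */
  record Node(List<String> roles, Long heapMaxBytes) {}

  /** A node counts as a data node if it has the "data" role or any tiered "data_*" role. */
  static boolean isDataNode(Node n) {
    return n.roles() != null
        && n.roles().stream().anyMatch(r -> r.equals("data") || r.startsWith("data_"));
  }

  /** Shards per GB of heap, counting heap on data-role nodes only. NaN if no heap is known. */
  static double shardsPerHeapGb(int totalShards, List<Node> nodes) {
    long dataHeapBytes =
        nodes.stream()
            .filter(ShardDensitySketch::isDataNode)
            .map(Node::heapMaxBytes)
            .filter(Objects::nonNull)
            .mapToLong(Long::longValue)
            .sum();
    if (dataHeapBytes <= 0) {
      return Double.NaN;
    }
    return totalShards / (dataHeapBytes / BYTES_PER_GB);
  }
}
```

With one 2 GB data node and one 2 GB dedicated master holding 100 shards, this sketch reports 50 shards per GB of heap; summing the master's heap as well would have reported 25, which is the understatement the commit message describes.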
|
|
||
| private void checkPendingTasks( | ||
| List<FitnessSignal> signals, JsonNode health, List<String> inaccessible) { | ||
| if (health == null || !health.has("number_of_pending_tasks")) { |
💡 Quality: Inconsistent null-guard: isUsable() vs == null across check methods
The new isUsable() helper correctly handles the NullNode returned by SearchRestProbe.get() on failure, and it's applied in checkClusterStatus, checkUnassignedShards, and checkDedicatedMaster. However, checkPendingTasks (line 492) still uses the old health == null guard. While this doesn't cause a runtime bug (NullNode.has() returns false, so the method exits early anyway), it means a failed /_cluster/health probe won't contribute to inaccessibleMetrics from checkPendingTasks — unlike the consistent handling in checkClusterStatus. Since both methods share the same health node from the snapshot, this is cosmetic, but using isUsable() uniformly would be clearer.
Suggested fix: replace `health == null` with `!isUsable(health)` for consistency with the other check methods:
if (!isUsable(health) || !health.has("number_of_pending_tasks")) {
return;
}
| private int countDataNodes(List<NodeFootprint> nodes, SearchClusterFitnessReport report) { | ||
| long withDataRole = | ||
| nodes.stream() | ||
| .filter(n -> n.getRoles() != null) | ||
| .filter( | ||
| n -> n.getRoles().stream().anyMatch(r -> r.equals("data") || r.startsWith("data_"))) | ||
| .count(); | ||
| int result; | ||
| if (withDataRole > 0) { | ||
| result = (int) withDataRole; | ||
| } else if (report.getDataNodes() != null && report.getDataNodes() > 0) { | ||
| result = report.getDataNodes(); | ||
| } else { | ||
| result = Math.max(1, nodes.size()); | ||
| } |
💡 Quality: Duplicate data-node filtering logic in countDataNodes and filterDataNodes
The lambda n.getRoles().stream().anyMatch(r -> r.equals("data") || r.startsWith("data_")) is duplicated between countDataNodes (line 154) and filterDataNodes (line 1405). Extracting a private static boolean isDataNode(NodeFootprint n) predicate would reduce duplication and make it easier to update the role-matching logic in one place.
Suggested fix: extract a shared predicate to eliminate the duplicated role-matching logic:
private static boolean isDataNode(NodeFootprint n) {
return n.getRoles() != null
&& n.getRoles().stream().anyMatch(r -> r.equals("data") || r.startsWith("data_"));
}
private List<NodeFootprint> filterDataNodes(List<NodeFootprint> nodes) {
return nodes.stream().filter(SearchClusterFitnessAnalyzer::isDataNode).toList();
}
private int countDataNodes(List<NodeFootprint> nodes, SearchClusterFitnessReport report) {
long withDataRole = nodes.stream().filter(SearchClusterFitnessAnalyzer::isDataNode).count();
// ... rest unchanged
}
Code Review 👍 Approved with suggestions: 5 resolved / 7 findings
Introduces a search cluster fitness diagnostic tool for both REST API and CLI to identify sizing and performance bottlenecks. The review identified minor inconsistencies in null-guarding and duplicate data-node filtering logic that should be unified for improved maintainability.
💡 Quality: Inconsistent null-guard: isUsable() vs == null across check methods
Replace `health == null` with `isUsable(health)` for consistency with other check methods.
💡 Quality: Duplicate data-node filtering logic in countDataNodes and filterDataNodes
📄 openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/SearchClusterFitnessAnalyzer.java:149-163
📄 openmetadata-service/src/main/java/org/openmetadata/service/search/fitness/SearchClusterFitnessAnalyzer.java:1402-1406
Extract shared predicate to eliminate duplicated role-matching logic.
✅ 5 resolved
✅ Bug: Default clusterAlias change breaks existing deployments
✅ Bug: /_cluster/allocation/explain called unconditionally on every run
✅ Quality: SearchClusterFitnessAnalyzer.analyze() exceeds method length guidelines
✅ Performance: All cluster endpoints fetched eagerly even when most data is unused
✅ Edge Case: ElasticSearch low-level client cast may fail with newer ES versions
| } else if (used | ||
| >= lowWatermark * (1 - SearchClusterFitnessRules.WATERMARK_PROXIMITY_FRACTION)) { | ||
| severity = FitnessSeverity.WARN; | ||
| recommendation = | ||
| "Approaching low watermark. Provision more disk during the next maintenance window."; | ||
| } else if (used >= SearchClusterFitnessRules.DISK_USAGE_FAIL_PERCENT) { | ||
| severity = FitnessSeverity.FAIL; | ||
| } else if (used >= SearchClusterFitnessRules.DISK_USAGE_WARN_PERCENT) { | ||
| severity = FitnessSeverity.WARN; |
| } | ||
| Set<String> canonicalNames = openMetadataCanonicalNames(clusterAlias); | ||
| Set<String> openMetadataAliases = openMetadataAliases(clusterAlias); | ||
| Set<String> indicesWithOmAlias = indicesCarryingOmAlias(catAliases, openMetadataAliases); |
| List<FitnessSignal> signals, | ||
| List<NodeFootprint> nodes, | ||
| JsonNode clusterSettings, | ||
| List<String> inaccessible) { |
| s.rootInfo = probe.get("/"); | ||
| s.clusterHealth = probe.get("/_cluster/health"); | ||
| s.clusterStats = probe.get("/_cluster/stats"); | ||
| s.nodesStats = probe.get("/_nodes/stats"); |
| @Command( | ||
| name = "search-fitness", | ||
| description = | ||
| "Diagnose whether the configured Elasticsearch/OpenSearch cluster is sized for the " | ||
| + "current OpenMetadata data footprint. Reports per-index size + avg doc bytes, " | ||
| + "disk watermarks, heap/CPU, thread-pool rejections, circuit breakers, shard " | ||
| + "layout, and capacity recommendations.") |
|
🟡 Playwright Results: all passed (11 flaky)
✅ 4251 passed · ❌ 0 failed · 🟡 11 flaky · ⏭️ 88 skipped
🟡 11 flaky test(s) (passed on retry)
How to debug locally:
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace



Describe your changes:
Adds a search cluster fitness diagnostic so devops can quickly judge whether
the configured Elasticsearch/OpenSearch cluster is sized for the data
OpenMetadata is storing. Exposed two ways: `GET /v1/system/search/fitness`
(admin-only) and `openmetadata-ops search-fitness` (`--json` for raw output).
I built this because after an AWS-managed OpenSearch migration, sizing
issues (oversized shards, disk watermarks, undersized small instances being
bombarded by search/index traffic, fried reindex jobs) are hard to diagnose
in one shot — this consolidates the signals into a structured report with
recommendations.
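For readers who want to call the endpoint outside the CLI, here is a minimal sketch of building the request with the JDK's `java.net.http` API. The base URL (for example `http://localhost:8585/api`), the bearer-token scheme, and the `FitnessRequestSketch` class are assumptions for illustration; only the `/v1/system/search/fitness` path and the admin-only restriction come from this PR.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class FitnessRequestSketch {
  private FitnessRequestSketch() {}

  /**
   * Builds a GET request for the fitness report.
   * baseUrl is the API root of the deployment (assumed, e.g. "http://localhost:8585/api");
   * adminToken is a JWT for an admin user (assumed auth scheme: Bearer).
   */
  static HttpRequest build(String baseUrl, String adminToken) {
    // Drop a trailing slash so the joined URI never contains "//v1".
    String root = baseUrl.endsWith("/") ? baseUrl.substring(0, baseUrl.length() - 1) : baseUrl;
    return HttpRequest.newBuilder()
        .uri(URI.create(root + "/v1/system/search/fitness"))
        .timeout(Duration.ofSeconds(30))
        .header("Accept", "application/json")
        .header("Authorization", "Bearer " + adminToken)
        .GET()
        .build();
  }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`; a non-admin token is expected to be rejected, as the integration tests in this PR check.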
Type of change:
High-level design:
A single analyzer (`SearchClusterFitnessAnalyzer`) probes the cluster via raw
REST GETs through a small `SearchRestProbe` helper that works against both
ES `Rest5Client` and OS `OpenSearchGenericClient`. All analysis walks Jackson
`JsonNode`, so engine-specific typed responses aren't needed and missing
fields on managed clusters (AWS-restricted endpoints) degrade gracefully —
captured in `inaccessibleMetrics` rather than failing the run.
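The graceful-degradation idea can be illustrated with a small stdlib-only sketch. It is not the `SearchRestProbe` implementation: the `GracefulProbeSketch` class, the `Snapshot` record, and the `fetch` function are hypothetical, and raw strings stand in for the Jackson `JsonNode` payloads the real analyzer walks.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class GracefulProbeSketch {
  private GracefulProbeSketch() {}

  /** Payloads that were fetched, plus the endpoints that could not be read. */
  record Snapshot(Map<String, String> payloads, List<String> inaccessibleMetrics) {}

  /**
   * Calls fetch for every path. A failure (exception or empty result) is recorded
   * in inaccessibleMetrics instead of aborting the whole run.
   */
  static Snapshot collect(List<String> paths, Function<String, Optional<String>> fetch) {
    Map<String, String> payloads = new LinkedHashMap<>();
    List<String> inaccessible = new ArrayList<>();
    for (String path : paths) {
      Optional<String> body;
      try {
        body = fetch.apply(path);
      } catch (RuntimeException e) {
        // e.g. a managed cluster rejecting a restricted endpoint
        body = Optional.empty();
      }
      if (body.isPresent()) {
        payloads.put(path, body.get());
      } else {
        inaccessible.add(path);
      }
    }
    return new Snapshot(payloads, inaccessible);
  }
}
```

Each check can then skip work when its input is absent, and the report still lists which endpoints were unavailable.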
Signals produced (each carries severity + observed + threshold + rationale + recommendation):
cluster status, pending tasks, unassigned shards (primary vs replica
breakdown with `/_cluster/allocation/explain` reason inlined), shard budget,
shards-per-heap-GB density, dedicated-master recommendation at scale,
per-OpenMetadata-index data footprint (size + avg doc bytes — depth, not just
count), oversized/over-sharded indices, disk low/high/flood watermarks per
node, JVM heap / CPU pressure, write/search thread-pool queue depth and
rejections, circuit-breaker trips. Sizing guidance applies the AWS
OpenSearch storage formula `source × (1 + replicas) × 1.45` with the +1
data-node buffer and the ≤25 shards/GB-heap rule baked into
`SearchClusterFitnessRules` constants (each documented with the AWS source).
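As a worked illustration of the sizing arithmetic above, the sketch below applies the `source × (1 + replicas) × 1.45` formula, the 25 shards per GB of heap ceiling, and the +1 data-node buffer. The class, method names, and the `usableBytesPerNode` parameter are hypothetical; how the analyzer actually combines these rules is not shown in this description, so treat the node-count step as one plausible reading.

```java
final class SizingSketch {
  private SizingSketch() {}

  // 1.45 expressed as a percentage so the arithmetic stays in exact integers.
  static final long STORAGE_OVERHEAD_PERCENT = 145;
  static final long MAX_SHARDS_PER_GB_HEAP = 25;
  static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

  /** source x (1 + replicas) x 1.45, rounded up to a whole byte. */
  static long requiredStorageBytes(long sourceBytes, int replicas) {
    long scaled = sourceBytes * (1L + replicas) * STORAGE_OVERHEAD_PERCENT;
    return (scaled + 99) / 100;
  }

  /** Upper bound on shard count for the given total data-node heap. */
  static long maxShardsForHeap(long dataNodeHeapBytes) {
    return dataNodeHeapBytes * MAX_SHARDS_PER_GB_HEAP / BYTES_PER_GB;
  }

  /** Nodes needed to hold the required storage, plus one buffer node (assumed interpretation). */
  static int recommendedDataNodes(long requiredBytes, long usableBytesPerNode) {
    if (usableBytesPerNode <= 0) {
      throw new IllegalArgumentException("usableBytesPerNode must be positive");
    }
    long nodesForData = Math.max(1, (requiredBytes + usableBytesPerNode - 1) / usableBytesPerNode);
    return Math.toIntExact(nodesForData + 1);
  }
}
```

For example, 203,000,000 bytes of primary data with one replica yields `requiredStorageBytes(203_000_000L, 1) == 588_700_000L`, and a data tier with 4 GB of total heap is capped at 100 shards.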
OpenMetadata-managed indices are identified by canonical name, prefix
(catches versioned indices like `*rebuild`), and alias intersection
from `/_cat/aliases` — so the report works against reindex-alias-swap state.
When zero OM indices match, the report self-diagnoses: lists the top-20
indices actually present on the cluster, emits an `openmetadata.indices_missing`
signal pointing to clusterAlias config, and reports
sizing as `INSUFFICIENT_DATA` instead of a fake recommendation. Verdict
rolls up as READY / STRAINED / OVERLOADED / UNKNOWN.
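The three-way index identification described above (canonical name, prefix, alias intersection) could look roughly like this. The `OmIndexMatcherSketch` class and its parameters are hypothetical, and the underscore-delimited prefix rule is an assumption chosen for illustration; the real matching rules live in the analyzer.

```java
import java.util.Set;

final class OmIndexMatcherSketch {
  private OmIndexMatcherSketch() {}

  /**
   * Returns true when the index is plausibly OpenMetadata-managed:
   *  1. its name is one of the canonical index names,
   *  2. its name starts with a canonical name followed by "_" (assumed shape of versioned
   *     or rebuilt indices), or
   *  3. it carries an OpenMetadata alias according to the alias listing.
   */
  static boolean isOpenMetadataIndex(
      String indexName, Set<String> canonicalNames, Set<String> indicesWithOmAlias) {
    if (canonicalNames.contains(indexName) || indicesWithOmAlias.contains(indexName)) {
      return true;
    }
    for (String canonical : canonicalNames) {
      if (indexName.startsWith(canonical + "_")) {
        return true;
      }
    }
    return false;
  }
}
```

Checking the alias set as well as the name is what lets the report keep working while a reindex has swapped an alias onto a differently named backing index.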
Also bundles a one-line `conf/openmetadata.yaml` change to default
`clusterAlias` to `"openmetadata"` (matches the prefix actually used by the
docker-compose stack in this repo, and the prefix the fitness tool then
matches against by default).
Tests:
Use cases covered
receives a structured report with signals, indices, sizing guidance.
`openmetadata.indices_missing` signal (verified manually against a fresh
single-node OS 3.4 cluster).
Backend integration tests
Manual testing performed
cluster with 64 OpenMetadata indices, 245k docs, 203 MB primary data.
(>7× AWS shards/GB limit), 82.9% disk usage (~2% from low watermark),
and correctly classified 164 unassigned shards as expected single-node
replica state (INFO, not FAIL).
UI screen recording / screenshots:
Not applicable.
Checklist:
🤖 Generated with Claude Code