Skip to content

Add hybrid-rrf pipeline check and improve embedding error reporting#26936

Merged
pmbrull merged 8 commits intomainfrom
task/fix-status-validation-in-systemrepositor-c5927228
Apr 7, 2026
Merged

Add hybrid-rrf pipeline check and improve embedding error reporting#26936
pmbrull merged 8 commits intomainfrom
task/fix-status-validation-in-systemrepositor-c5927228

Conversation

@pmbrull
Copy link
Copy Markdown
Collaborator

@pmbrull pmbrull commented Apr 1, 2026

Summary

  • Adds validation of the hybrid-rrf search pipeline in the system health check when embeddings are enabled. If the pipeline is missing, the status reports an actionable error ("Run a reindex to create it"), differentiating 404 from 5xx/connectivity errors.
  • Improves error reporting when the vector search service fails to initialize: reads the stored init error to surface the actual failure (embedding client config vs OpenSearch vector service), replacing the generic "not configured properly" message.
  • Adds checkHybridSearchPipeline() to OpenSearchVectorService and SearchRepository, returning Optional<String> (empty = OK, present = error detail) to check pipeline existence via a GET request.
  • Skips expensive retry of initializeVectorSearchService() when a previous init error is already recorded.

Closes https://github.com/open-metadata/ai-platform/issues/473

Test plan

  • Unit tests added for checkHybridSearchPipeline (200 OK, 404, 5xx, and exception cases)
  • Unit tests added for SystemRepository embeddings validation paths (7 tests covering all branches)
  • Verify health check shows hybrid-rrf pipeline status when embeddings are enabled
  • Verify improved error messages when embedding client fails to initialize
  • Verify improved error messages when OpenSearch vector service fails
  • Run mvn spotless:apply - confirmed formatting is clean

…n status validation

- Add isHybridSearchPipelineAvailable() to OpenSearchVectorService and SearchRepository
  to verify the hybrid-rrf search pipeline exists in OpenSearch
- Update getEmbeddingsValidation in SystemRepository to also validate the pipeline
  when embeddings are enabled, failing with actionable message if missing
- Retry vector service initialization during validation to surface the actual
  failure reason (embedding client vs OpenSearch vector service) instead of
  the generic "not configured properly" message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 13:08
@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels Apr 1, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR enhances semantic search health validation for OpenSearch-backed embeddings by checking for the required hybrid-rrf search pipeline and by improving initialization error reporting to better distinguish embedding client vs. vector service failures.

Changes:

  • Add a hybrid-rrf pipeline existence check (via GET /_search/pipeline/hybrid-rrf) and surface an actionable “reindex to create it” message when missing.
  • Update embeddings system validation to retry vector service initialization during health checks and produce more specific failure messages.
  • Add unit tests for OpenSearchVectorService.isHybridSearchPipelineAvailable() covering success, 404, and exception cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java Extends embeddings health validation to retry init and validate hybrid pipeline availability with improved messaging.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java Exposes isHybridSearchPipelineAvailable() by delegating to OpenSearchVectorService when applicable.
openmetadata-service/src/main/java/org/openmetadata/service/search/vector/OpenSearchVectorService.java Adds isHybridSearchPipelineAvailable() that checks the hybrid-rrf search pipeline via the OpenSearch generic client.
openmetadata-service/src/test/java/org/openmetadata/service/search/vector/OpenSearchVectorServiceTest.java Adds unit tests for the new pipeline-availability method.

Comment on lines 578 to +597
if (searchRepository.getVectorIndexService() == null) {
return retryInitAndReportError(
searchRepository, embeddingsValidation, description, configMessage);
}

try {
StepValidation embeddingResult =
validateEmbeddingGeneration(
searchRepository.getEmbeddingClient(),
embeddingsValidation,
description,
configMessage);

if (Boolean.FALSE.equals(embeddingResult.getPassed())) {
return embeddingResult;
}

return validateHybridSearchPipeline(
searchRepository, embeddingResult, description, configMessage);
} catch (Exception e) {
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The embeddings health-check logic now includes additional branches (retrying initializeVectorSearchService(), differentiating embedding-client vs vector-service initialization, and validating hybrid-rrf pipeline presence). There don’t appear to be unit tests covering these new paths; adding tests for (1) pipeline exists, (2) pipeline missing (404), and (3) pipeline check failure (exception/401/5xx) would help prevent regressions in health-check reporting.

Copilot uses AI. Check for mistakes.
- Fix bug: retry path now validates embedding generation before checking
  the hybrid pipeline (gitar)
- Return Optional<String> from checkHybridSearchPipeline instead of
  boolean, differentiating 404 (missing) from 5xx/connectivity errors
  with specific messages (copilot)
- Include initialization exception message in StepValidation so
  operators don't need to hunt logs (copilot)
- Return Optional.empty() for non-OpenSearch implementations since
  hybrid pipeline is OpenSearch-specific (gitar)
- Add test for 5xx response case in pipeline check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
initializeVectorSearchService() catches exceptions internally and never
rethrows, so the try-catch in retryInitAndReportError never captured the
error. Store the exception message in vectorServiceInitError field and
read it in the validation to include in the health check message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 1, 2026 15:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

@Getter private EmbeddingClient embeddingClient;
@Getter private VectorIndexService vectorIndexService;
@Getter private VectorEmbeddingHandler vectorEmbeddingHandler;
@Getter private String vectorServiceInitError;
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectorServiceInitError is written inside a synchronized method (initializeVectorSearchService) but read via an unsynchronized getter. Without volatile (or synchronized access), other threads may not observe the updated value, which can lead to stale/missing error details in validations/logs. Consider making this field volatile (and optionally clearing it on successful init to avoid stale errors).

Suggested change
@Getter private String vectorServiceInitError;
@Getter private volatile String vectorServiceInitError;

Copilot uses AI. Check for mistakes.
StepValidation embeddingsValidation,
String description,
String configMessage) {
searchRepository.initializeVectorSearchService();
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryInitAndReportError() triggers initializeVectorSearchService() as part of the system validation endpoint. When initialization keeps failing (e.g., misconfigured embedding provider), repeated calls to /system/validate will keep retrying expensive init logic and may also perform side effects (e.g., pipeline creation/handler registration attempts), potentially spamming logs and impacting the embedding provider/OpenSearch. Consider throttling retries (e.g., retry once per process lifetime or with a backoff timestamp) and reporting the last captured init error instead of re-initializing on every validation call.

Suggested change
searchRepository.initializeVectorSearchService();
// Avoid repeatedly triggering expensive vector search initialization from the
// validation endpoint when we already know initialization has failed.
String previousInitError = searchRepository.getVectorServiceInitError();
if (previousInitError == null) {
// Either initialization has not been attempted yet or it has not recorded an error,
// so we allow a single retry attempt.
searchRepository.initializeVectorSearchService();
} else {
LOG.debug(
"Skipping vector search service initialization retry during system validation "
+ "because a previous attempt failed with error: {}",
previousInitError);
}

Copilot uses AI. Check for mistakes.
Comment on lines +490 to +495
public Optional<String> checkHybridSearchPipeline() {
if (vectorIndexService instanceof OpenSearchVectorService openSearchVectorService) {
return openSearchVectorService.checkHybridSearchPipeline();
}
return Optional.empty();
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description mentions adding isHybridSearchPipelineAvailable() on OpenSearchVectorService/SearchRepository, but the implementation adds checkHybridSearchPipeline() returning Optional<String> instead. To avoid confusion for reviewers and future maintainers, please either update the PR description (and any referenced docs/issues) or rename/adjust the method to match the intended API contract.

Copilot uses AI. Check for mistakes.
Cover the new validation branches: pipeline available, pipeline missing
(404), pipeline 5xx, embedding client init failure with error message,
vector service init failure with embedding client OK, retry with no
recovery, and Elasticsearch not supported.

Make getEmbeddingsValidation package-private for testability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 1, 2026

🟡 Playwright Results — all passed (19 flaky)

✅ 3448 passed · ❌ 0 failed · 🟡 19 flaky · ⏭️ 223 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 452 0 3 2
🟡 Shard 2 617 0 2 32
🟡 Shard 3 617 0 3 27
🟡 Shard 4 618 0 6 47
✅ Shard 5 587 0 0 67
🟡 Shard 6 557 0 5 48
🟡 19 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Dashboard Data Model - customization should work (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from welcome screen (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Bulk selection operations (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Flow/ExploreDiscovery.spec.ts › Should display deleted assets when showDeleted is checked and deleted is not present in queryFilter (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DescriptionVisibility.spec.ts › Customized Table detail page Description widget shows long description (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Comprehensive domain rename with ALL relationships preserved (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Multiple consecutive domain renames preserve all associations (shard 4, 1 retry)
  • Pages/DomainUIInteractions.spec.ts › Delete data product via UI (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Glossary Term Add, Update and Remove for child entities (shard 4, 1 retry)
  • Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6, 1 retry)
  • Pages/HyperlinkCustomProperty.spec.ts › should show No Data placeholder when hyperlink has no value (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

- Annotate getEmbeddingsValidation with @VisibleForTesting to document
  the package-private visibility is intentional for testing
- Use e.toString() instead of e.getMessage() for vectorServiceInitError
  and pipeline check errors to avoid null messages
- Mock Bedrock config in tests instead of null to eliminate noisy
  ERROR logs during test runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 2, 2026 07:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

- Remove guard that prevented init retries after a previous error,
  allowing recovery from transient failures (e.g., OpenSearch restart).
  initializeVectorSearchService() is already idempotent.
- Include truncated response body in non-404 pipeline check errors
  for actionable diagnostics (e.g., 403 permission details)
- Rename testEmbeddingsPassButPipelineMissing to
  testEmbeddingsFailWhenPipelineMissing for clarity
- Use imported Optional instead of fully qualified in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 2, 2026

Code Review ✅ Approved 4 resolved / 4 findings

Adds hybrid-RRF pipeline validation and improves embedding error reporting in status validation, addressing retry path validation gaps, hybrid pipeline availability checks, and error state management. No issues remain.

✅ 4 resolved
Bug: Retry path skips embedding generation validation

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:617-622
When the vector service is initially null and retryInitAndReportError successfully re-initializes it (line 617), the code jumps directly to validateHybridSearchPipeline without first calling validateEmbeddingGeneration. The happy path (lines 583-596) validates embedding generation first (dimension mismatch, all-zeros, null checks) before checking the hybrid pipeline. This means a misconfigured embedding client (e.g., wrong model dimensions) could pass validation after a retry, reporting "Embeddings and hybrid search pipeline are working correctly" when embeddings are actually broken.

Quality: isHybridSearchPipelineAvailable returns false for non-OpenSearch

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:487-492
In SearchRepository.isHybridSearchPipelineAvailable(), the method returns false when vectorIndexService is not an OpenSearchVectorService instance. This means the health check will always report the hybrid pipeline as missing for any future non-OpenSearch vector service implementations, causing a false-negative validation failure. Consider returning true (not applicable) for non-OpenSearch implementations, or making isHybridSearchPipelineAvailable() part of the VectorIndexService interface with a default true implementation.

Quality: Success message mentions hybrid pipeline for non-OpenSearch backends

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:490-495 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/SystemRepository.java:676-681
SearchRepository.checkHybridSearchPipeline() returns Optional.empty() when the vector service is not OpenSearch, which causes validateHybridSearchPipeline to report "Embeddings and hybrid search pipeline are working correctly" even for backends that don't have such a pipeline. This is harmless today since only OpenSearch is supported, but could mislead operators if other backends are added.

Bug: vectorServiceInitError not cleared on successful retry

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:439 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java:448
When initializeVectorSearchService() is called and fails, vectorServiceInitError is set to the exception message. On a subsequent retry that succeeds, vectorServiceInitialized is set to true but vectorServiceInitError is never cleared back to null. While this doesn't cause a bug in the current code paths (because the error is only read when getVectorIndexService() == null), it's a latent defect: any future code that reads vectorServiceInitError after a successful recovery will see a stale error message, which could lead to confusing diagnostics.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Apr 2, 2026

@pmbrull pmbrull merged commit 426081b into main Apr 7, 2026
51 checks passed
@pmbrull pmbrull deleted the task/fix-status-validation-in-systemrepositor-c5927228 branch April 7, 2026 08:09
SaaiAravindhRaja pushed a commit to SaaiAravindhRaja/OpenMetadata that referenced this pull request Apr 12, 2026
…pen-metadata#26936)

* Add hybrid-rrf pipeline check and improve embedding error reporting in status validation

- Add isHybridSearchPipelineAvailable() to OpenSearchVectorService and SearchRepository
  to verify the hybrid-rrf search pipeline exists in OpenSearch
- Update getEmbeddingsValidation in SystemRepository to also validate the pipeline
  when embeddings are enabled, failing with actionable message if missing
- Retry vector service initialization during validation to surface the actual
  failure reason (embedding client vs OpenSearch vector service) instead of
  the generic "not configured properly" message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address PR review feedback from Copilot and Gitar

- Fix bug: retry path now validates embedding generation before checking
  the hybrid pipeline (gitar)
- Return Optional<String> from checkHybridSearchPipeline instead of
  boolean, differentiating 404 (missing) from 5xx/connectivity errors
  with specific messages (copilot)
- Include initialization exception message in StepValidation so
  operators don't need to hunt logs (copilot)
- Return Optional.empty() for non-OpenSearch implementations since
  hybrid pipeline is OpenSearch-specific (gitar)
- Add test for 5xx response case in pipeline check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Store vector service init error for status reporting

initializeVectorSearchService() catches exceptions internally and never
rethrows, so the try-catch in retryInitAndReportError never captured the
error. Store the exception message in vectorServiceInitError field and
read it in the validation to include in the health check message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add unit tests for embeddings validation paths in SystemRepository

Cover the new validation branches: pipeline available, pipeline missing
(404), pipeline 5xx, embedding client init failure with error message,
vector service init failure with embedding client OK, retry with no
recovery, and Elasticsearch not supported.

Make getEmbeddingsValidation package-private for testability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Clear vectorServiceInitError on successful initialization

Prevents stale error state after a successful retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make vectorServiceInitError volatile and skip redundant init retries

- Add volatile to vectorServiceInitError for thread-safe reads from
  the unsynchronized getter
- Skip calling initializeVectorSearchService() during validation when
  a previous init error is already recorded, avoiding expensive retries
  and log spam on every /system/validate call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address review: @VisibleForTesting, e.toString(), mock Bedrock config

- Annotate getEmbeddingsValidation with @VisibleForTesting to document
  the package-private visibility is intentional for testing
- Use e.toString() instead of e.getMessage() for vectorServiceInitError
  and pipeline check errors to avoid null messages
- Mock Bedrock config in tests instead of null to eliminate noisy
  ERROR logs during test runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Allow transient failure recovery, include response body, fix test style

- Remove guard that prevented init retries after a previous error,
  allowing recovery from transient failures (e.g., OpenSearch restart).
  initializeVectorSearchService() is already idempotent.
- Include truncated response body in non-404 pipeline check errors
  for actionable diagnostics (e.g., 403 permission details)
- Rename testEmbeddingsPassButPipelineMissing to
  testEmbeddingsFailWhenPipelineMissing for clarity
- Use imported Optional instead of fully qualified in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
SaaiAravindhRaja pushed a commit to SaaiAravindhRaja/OpenMetadata that referenced this pull request Apr 12, 2026
…pen-metadata#26936)

* Add hybrid-rrf pipeline check and improve embedding error reporting in status validation

- Add isHybridSearchPipelineAvailable() to OpenSearchVectorService and SearchRepository
  to verify the hybrid-rrf search pipeline exists in OpenSearch
- Update getEmbeddingsValidation in SystemRepository to also validate the pipeline
  when embeddings are enabled, failing with actionable message if missing
- Retry vector service initialization during validation to surface the actual
  failure reason (embedding client vs OpenSearch vector service) instead of
  the generic "not configured properly" message

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address PR review feedback from Copilot and Gitar

- Fix bug: retry path now validates embedding generation before checking
  the hybrid pipeline (gitar)
- Return Optional<String> from checkHybridSearchPipeline instead of
  boolean, differentiating 404 (missing) from 5xx/connectivity errors
  with specific messages (copilot)
- Include initialization exception message in StepValidation so
  operators don't need to hunt logs (copilot)
- Return Optional.empty() for non-OpenSearch implementations since
  hybrid pipeline is OpenSearch-specific (gitar)
- Add test for 5xx response case in pipeline check

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Store vector service init error for status reporting

initializeVectorSearchService() catches exceptions internally and never
rethrows, so the try-catch in retryInitAndReportError never captured the
error. Store the exception message in vectorServiceInitError field and
read it in the validation to include in the health check message.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add unit tests for embeddings validation paths in SystemRepository

Cover the new validation branches: pipeline available, pipeline missing
(404), pipeline 5xx, embedding client init failure with error message,
vector service init failure with embedding client OK, retry with no
recovery, and Elasticsearch not supported.

Make getEmbeddingsValidation package-private for testability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Clear vectorServiceInitError on successful initialization

Prevents stale error state after a successful retry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Make vectorServiceInitError volatile and skip redundant init retries

- Add volatile to vectorServiceInitError for thread-safe reads from
  the unsynchronized getter
- Skip calling initializeVectorSearchService() during validation when
  a previous init error is already recorded, avoiding expensive retries
  and log spam on every /system/validate call

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Address review: @VisibleForTesting, e.toString(), mock Bedrock config

- Annotate getEmbeddingsValidation with @VisibleForTesting to document
  the package-private visibility is intentional for testing
- Use e.toString() instead of e.getMessage() for vectorServiceInitError
  and pipeline check errors to avoid null messages
- Mock Bedrock config in tests instead of null to eliminate noisy
  ERROR logs during test runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Allow transient failure recovery, include response body, fix test style

- Remove guard that prevented init retries after a previous error,
  allowing recovery from transient failures (e.g., OpenSearch restart).
  initializeVectorSearchService() is already idempotent.
- Include truncated response body in non-404 pipeline check errors
  for actionable diagnostics (e.g., 403 permission details)
- Rename testEmbeddingsPassButPipelineMissing to
  testEmbeddingsFailWhenPipelineMissing for clarity
- Use imported Optional instead of fully qualified in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Ingestion safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants