
Add K8s native scheduler #24940

Merged
pmbrull merged 1 commit into main from k8s-scheduler on Dec 21, 2025

Conversation


@harshach harshach commented Dec 21, 2025

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • New Kubernetes pipeline client:
    • K8sPipelineClient.java (1,512 lines) implements a native K8s scheduler using Jobs and CronJobs, eliminating the Apache Airflow dependency
  • Resource management:
    • ConfigMaps store pipeline configuration, Secrets hold sensitive data, CronJobs back scheduled pipelines, and Jobs handle on-demand runs
  • Enterprise reliability features:
    • Resilience4j retry logic with exponential backoff, optimistic locking with resourceVersion, automatic rollback on partial deployment failures
  • Comprehensive test coverage:
    • Unit tests with Mockito mocking K8s API, integration tests using K3s Testcontainers for real cluster validation




gitar-bot Bot commented Dec 21, 2025

🔍 CI failure analysis for 1fcdce6: Three unrelated CI failures detected: (1) maven-postgresql-rdf-ci has a CI workflow configuration error, (2) playwright-ci-postgresql has flaky E2E tests (99.7% pass rate), (3) py-run-tests has a pre-existing AttributeError bug in metadata/sampler/processor.py affecting Trino classifier tests. All failures are unrelated to the K8s pipeline client changes in this PR.

Combined CI Failure Analysis

This PR encountered three separate CI failures, all unrelated to the PR's code changes:

Failure 1: maven-postgresql-rdf-ci

Issue: CI workflow configuration error

Root Cause: Missing configuration file reference after Docker containers started

Relationship: UNRELATED - PR adds K8s client code, not CI workflows


Failure 2: playwright-ci-postgresql (6, 6)

Issue:

  • 1 failed: GlossaryImportExport.spec.ts (file upload timeout)
  • 3 flaky tests
  • 384 passed (99.7% pass rate)

Root Cause: Environmental flakiness (I/O timeouts, access token failures)

Relationship: UNRELATED - PR adds backend Java code, not Playwright/frontend


Failure 3: py-run-tests (3.11) ⚠️

Issue: Python integration tests failed with 7 errors in Trino classifier tests:

ERROR test_auto_classification_workflow[{database_service}.minio.my_schema.*]

Result: 483 passed, 21 skipped, 1 xfailed, 7 errors

Root Cause: Pre-existing bug in sampler processor at metadata/sampler/processor.py:169:

AttributeError: 'NoneType' object has no attribute 'name'

File "metadata/sampler/processor.py", line 169, in _copy_service_config
    config_copy.catalog = database.name.root  # type: ignore
                          ^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'name'

The Problem: Code assumes database.name exists but it's None for Trino/MinIO test entities. All 7 failures occur when processing tables from docker_test_trino0.minio.my_schema.*

Evidence:

  • Specific error location: metadata/sampler/processor.py line 169
  • Affected test: ingestion/tests/integration/trino/test_classifier.py
  • Same AttributeError for all 7 test cases
  • Trino client logs show "failed after 3 attempts"

Relationship to PR: UNRELATED - This PR introduces:

  • K8sPipelineClient.java (Java backend for K8s scheduling)
  • Unit/integration tests for K8s functionality
  • Maven dependencies (resilience4j, testcontainers)

The PR does NOT:

  • Modify Python ingestion code
  • Change metadata/sampler/processor.py
  • Affect Trino connector or classifier
  • Modify database entity handling

This is a pre-existing bug in the Python ingestion framework.


Assessment

All three failures are infrastructure/environment/pre-existing code issues:

  1. RDF CI: Configuration bug in CI workflow
  2. Playwright CI: Flaky tests (environmental)
  3. Python tests: Pre-existing null pointer bug in sampler

Why These Are Unrelated

The PR introduces Java backend code for K8s pipeline scheduling. None of these changes interact with:

  • RDF test workflows
  • Glossary/frontend E2E tests
  • Python ingestion sampler code
  • Trino connector functionality

Code Review 👍 Approved with suggestions

A well-designed K8s pipeline client with robust error handling, though it lacks consistent retry usage for automation jobs and has a potential security concern in kubeconfig handling.

⚠️ Security: Kubeconfig content potentially logged in error messages

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:539

The initializeK8sClient method accepts kubeconfig content directly from configuration parameters. If the K8s client initialization fails, the exception message could potentially include sensitive credential information.

While this specific implementation doesn't directly log the kubeconfig content, the configuration parameters map containing kubeConfigContent could be exposed in debug logs or error traces.

Impact: Potential credential exposure in logs if the kubeconfig content contains sensitive tokens or certificates.

Recommendation:

  1. Avoid storing/passing kubeconfig content through the params map - prefer kubeconfig paths or in-cluster auth
  2. If kubeconfig content is required, ensure it's never included in logs or exception messages
  3. Consider redacting sensitive fields from error messages in initializeK8sClient
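
Recommendation 3 can be sketched as a small redaction helper applied to the params map before it is logged or embedded in an exception message. This is a hypothetical illustration, not code from the PR; the key names (kubeConfigContent and friends) are assumptions based on the review above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: mask sensitive entries in a configuration map
// before it can reach logs or exception messages. Key names are
// assumptions, not the actual K8sPipelineClient parameter names.
public class ParamRedactor {
    private static final Set<String> SENSITIVE_KEYS =
        Set.of("kubeConfigContent", "token", "clientKeyData");

    public static Map<String, String> redact(Map<String, String> params) {
        Map<String, String> safe = new LinkedHashMap<>();
        params.forEach((k, v) ->
            safe.put(k, SENSITIVE_KEYS.contains(k) ? "***REDACTED***" : v));
        return safe;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("namespace", "openmetadata");
        params.put("kubeConfigContent", "apiVersion: v1\nclusters: ...");
        // Only the redacted view should ever be interpolated into logs.
        System.out.println(redact(params));
    }
}
```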

⚠️ Bug: Automation/app jobs don't use retry logic like other K8s operations

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:962 📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:987

The runAutomationsWorkflow and runApplicationFlow methods directly call batchApi.createNamespacedJob() without the retry wrapper (executeWithRetry) that other K8s operations use. This is inconsistent and could lead to transient failures when running automations or applications.

Impact: Automation workflows and application runs may fail unnecessarily on transient K8s API errors (429, 5xx), while other operations would succeed after retries.

Code patterns:

  • runPipeline uses: executeWithRetry(() -> batchApi.createNamespacedJob(...))
  • runAutomationsWorkflow uses: batchApi.createNamespacedJob(...).execute() (no retry)
  • runApplicationFlow uses: batchApi.createNamespacedJob(...).execute() (no retry)

Recommendation: Wrap the API calls in executeWithRetry() for consistency:

executeWithRetry(() -> batchApi.createNamespacedJob(namespace, job).execute());
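
The retry pattern being recommended can be sketched in plain Java as follows. This is an illustration of the exponential-backoff idea, not the PR's actual executeWithRetry (which uses Resilience4j); ApiStatusException here stands in for the Kubernetes client's ApiException.

```java
import java.util.concurrent.Callable;

// Sketch of an executeWithRetry-style wrapper: retry only on throttling
// (429) and server errors (5xx), doubling the delay between attempts.
public class RetryDemo {
    static class ApiStatusException extends Exception {
        final int code;
        ApiStatusException(int code) { this.code = code; }
    }

    static boolean isRetryable(int code) {
        return code == 429 || code >= 500; // transient K8s API failures
    }

    static <T> T executeWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs)
            throws Exception {
        long delay = baseDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (ApiStatusException e) {
                if (attempt >= maxAttempts || !isRetryable(e.code)) throw e;
                Thread.sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate a job-creation call that fails twice with 503, then succeeds.
        String result = executeWithRetry(() -> {
            if (++calls[0] < 3) throw new ApiStatusException(503);
            return "job-created";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Wrapping runAutomationsWorkflow and runApplicationFlow this way would give them the same resilience to 429/5xx responses that runPipeline already has.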

More details 💡 2 suggestions
💡 Edge Case: sanitizeName could produce duplicate resource names for distinct inputs

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:1573

The sanitizeName method performs aggressive sanitization by replacing all non-alphanumeric characters with hyphens and truncating to 53 characters. This could lead to name collisions where distinct pipeline names produce the same sanitized output.

Examples:

  • pipeline_one → pipeline-one
  • pipeline.one → pipeline-one
  • MyPipeline123!@#ABC → mypipeline123-abc
  • my-pipeline-123-abc → mypipeline123-abc (potential collision)

Impact: If two pipelines have names that sanitize to the same value, they would share K8s resources (ConfigMaps, Secrets, CronJobs), leading to conflicts or data overwrites.

Recommendation: Consider using a hash suffix for collision avoidance, or validate uniqueness at the pipeline creation level based on sanitized names. At minimum, log a warning when sanitization significantly alters the name.
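
The hash-suffix recommendation can be sketched like this. The 53-character budget is taken from the review above, but the exact sanitization rules and method name are assumptions, not the PR's actual sanitizeName.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Locale;

// Sketch: append a short hash of the ORIGINAL name, so two distinct
// inputs that sanitize to the same base still produce distinct
// K8s resource names.
public class NameSanitizer {
    public static String sanitizeWithHash(String name) throws Exception {
        String base = name.toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9]+", "-")   // non-alphanumerics -> hyphen
            .replaceAll("^-+|-+$", "");      // trim leading/trailing hyphens
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(name.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4; i++) hex.append(String.format("%02x", digest[i]));
        // Reserve 9 chars ("-" + 8-char hash) within the 53-char budget.
        int maxBase = 53 - 9;
        if (base.length() > maxBase) base = base.substring(0, maxBase);
        return base + "-" + hex;
    }

    public static void main(String[] args) throws Exception {
        // Same sanitized base, but distinct final names thanks to the hash.
        System.out.println(sanitizeWithHash("pipeline_one"));
        System.out.println(sanitizeWithHash("pipeline.one"));
    }
}
```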

💡 Code Quality: getQueuedPipelineStatusInternal silently swallows ApiException

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:920

The getQueuedPipelineStatusInternal method catches ApiException and only logs it, returning an empty list. Unlike other methods that propagate errors, this could mask connectivity issues or authorization problems.

Code:

} catch (ApiException e) {
  LOG.error("Failed to get pipeline status: {}", e.getResponseBody());
}
return statuses;

Impact: Callers cannot distinguish between "no jobs exist" and "failed to query K8s API". This could lead to incorrect UI states or operational confusion when the K8s API is unavailable.

Recommendation: Consider propagating the exception or returning a result object that indicates whether the query succeeded. At minimum, include more context in the log (status code, pipeline name, correlation ID).
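
One way to express the "result object" alternative is a small record that carries both the statuses and the outcome of the query. The names below are illustrative, not from the PR.

```java
import java.util.List;

// Sketch: lets callers distinguish "no jobs exist" (succeeded, empty list)
// from "failed to query the K8s API" (not succeeded, error populated),
// instead of returning an empty list in both cases.
public record StatusQueryResult(List<String> statuses, boolean succeeded, String error) {
    static StatusQueryResult ok(List<String> statuses) {
        return new StatusQueryResult(statuses, true, null);
    }

    static StatusQueryResult failed(String error) {
        return new StatusQueryResult(List.of(), false, error);
    }

    public static void main(String[] args) {
        // In the catch (ApiException e) block, the method would return
        // failed(...) rather than silently returning an empty list.
        System.out.println(failed("K8s API unavailable"));
    }
}
```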

What Works Well

  • Comprehensive K8s integration: Clean implementation of Jobs/CronJobs for both scheduled and on-demand pipelines
  • Robust error handling: Detailed error messages with helpful hints based on HTTP status codes, proper rollback on deployment failures
  • Thread-safe factory: PipelineServiceClientFactory now uses proper double-checked locking with volatile field
  • Observability: Good use of MDC for correlation IDs, structured logging throughout
  • Security hardening: Pod security context defaults (runAsNonRoot, runAsUser=1000, dropped capabilities, read-only filesystem)
  • Resilience: Retry logic with exponential backoff for transient K8s API failures (429, 5xx)
  • Excellent test coverage: Unit tests with mocks, integration tests with K3s testcontainers, and resource tests

Recommendations

  • Address the inconsistent retry usage in runAutomationsWorkflow and runApplicationFlow methods
  • Consider the potential for name collisions from aggressive sanitization
  • Review kubeconfig content handling to prevent credential leakage in logs



@pmbrull pmbrull merged commit 73464cc into main Dec 21, 2025
25 of 28 checks passed
@pmbrull pmbrull deleted the k8s-scheduler branch December 21, 2025 11:59
ShaileshParmar11 pushed a commit that referenced this pull request Dec 26, 2025

Labels

backend, safe to test (add this label to run secure GitHub workflows on PRs)
