
Add K8s native scheduler #24940

Merged
pmbrull merged 1 commit into main from k8s-scheduler on Dec 21, 2025

Conversation


@harshach harshach commented Dec 21, 2025

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • New Kubernetes pipeline client:
    • K8sPipelineClient.java (1,512 lines) implements a native K8s scheduler using Jobs and CronJobs, eliminating the Apache Airflow dependency
  • Resource management:
    • ConfigMaps store pipeline configuration, Secrets hold sensitive data, CronJobs back scheduled pipelines, and Jobs handle on-demand runs
  • Enterprise reliability features:
    • Resilience4j retry logic with exponential backoff, optimistic locking with resourceVersion, automatic rollback on partial deployment failures
  • Comprehensive test coverage:
    • Unit tests with Mockito mocking K8s API, integration tests using K3s Testcontainers for real cluster validation




gitar-bot Bot commented Dec 21, 2025

🔍 CI failure analysis for 1fcdce6: Three unrelated CI failures detected: (1) maven-postgresql-rdf-ci has a CI workflow configuration error, (2) playwright-ci-postgresql has flaky E2E tests (99.7% pass rate), (3) py-run-tests has a pre-existing AttributeError bug in metadata/sampler/processor.py affecting Trino classifier tests. All failures are unrelated to the K8s pipeline client changes in this PR.

Combined CI Failure Analysis

This PR encountered three separate CI failures, all unrelated to the PR's code changes:

Failure 1: maven-postgresql-rdf-ci

Issue: CI workflow configuration error

Root Cause: Missing configuration file reference after Docker containers started

Relationship: UNRELATED - PR adds K8s client code, not CI workflows


Failure 2: playwright-ci-postgresql (6, 6)

Issue:

  • 1 failed: GlossaryImportExport.spec.ts (file upload timeout)
  • 3 flaky tests
  • 384 passed (99.7% pass rate)

Root Cause: Environmental flakiness (I/O timeouts, access token failures)

Relationship: UNRELATED - PR adds backend Java code, not Playwright/frontend


Failure 3: py-run-tests (3.11) ⚠️

Issue: Python integration tests failed with 7 errors in Trino classifier tests:

ERROR test_auto_classification_workflow[{database_service}.minio.my_schema.*]

Result: 483 passed, 21 skipped, 1 xfailed, 7 errors

Root Cause: Pre-existing bug in sampler processor at metadata/sampler/processor.py:169:

AttributeError: 'NoneType' object has no attribute 'name'

File "metadata/sampler/processor.py", line 169, in _copy_service_config
    config_copy.catalog = database.name.root  # type: ignore
                          ^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'name'

The Problem: Code assumes database.name exists but it's None for Trino/MinIO test entities. All 7 failures occur when processing tables from docker_test_trino0.minio.my_schema.*

Evidence:

  • Specific error location: metadata/sampler/processor.py line 169
  • Affected test: ingestion/tests/integration/trino/test_classifier.py
  • Same AttributeError for all 7 test cases
  • Trino client logs show "failed after 3 attempts"

Relationship to PR: UNRELATED - This PR introduces:

  • K8sPipelineClient.java (Java backend for K8s scheduling)
  • Unit/integration tests for K8s functionality
  • Maven dependencies (resilience4j, testcontainers)

The PR does NOT:

  • Modify Python ingestion code
  • Change metadata/sampler/processor.py
  • Affect Trino connector or classifier
  • Modify database entity handling

This is a pre-existing bug in the Python ingestion framework.


Assessment

All three failures are infrastructure/environment/pre-existing code issues:

  1. RDF CI: Configuration bug in CI workflow
  2. Playwright CI: Flaky tests (environmental)
  3. Python tests: Pre-existing null pointer bug in sampler

Why These Are Unrelated

The PR introduces Java backend code for K8s pipeline scheduling. None of these changes interact with:

  • RDF test workflows
  • Glossary/frontend E2E tests
  • Python ingestion sampler code
  • Trino connector functionality

Code Review 👍 Approved with suggestions

A well-designed K8s pipeline client with robust error handling, though it lacks consistent retry usage for automation jobs and has a potential security concern in kubeconfig handling.

⚠️ Security: Kubeconfig content potentially logged in error messages

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:539

The initializeK8sClient method accepts kubeconfig content directly from configuration parameters. If the K8s client initialization fails, the exception message could potentially include sensitive credential information.

While this specific implementation doesn't directly log the kubeconfig content, the configuration parameters map containing kubeConfigContent could be exposed in debug logs or error traces.

Impact: Potential credential exposure in logs if the kubeconfig content contains sensitive tokens or certificates.

Recommendation:

  1. Avoid storing/passing kubeconfig content through the params map - prefer kubeconfig paths or in-cluster auth
  2. If kubeconfig content is required, ensure it's never included in logs or exception messages
  3. Consider redacting sensitive fields from error messages in initializeK8sClient
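
Recommendation 3 can be sketched as a small redaction helper applied to the params map before it is logged or embedded in an exception message. This is a hypothetical illustration, not code from the PR; the key names (kubeConfigContent and friends) are assumptions based on the review above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: mask sensitive entries in a configuration map
// before it can reach logs or exception messages. Key names are
// assumptions, not the actual K8sPipelineClient parameter names.
public class ParamRedactor {
    private static final Set<String> SENSITIVE_KEYS =
        Set.of("kubeConfigContent", "token", "clientKeyData");

    public static Map<String, String> redact(Map<String, String> params) {
        Map<String, String> safe = new LinkedHashMap<>();
        params.forEach((k, v) ->
            safe.put(k, SENSITIVE_KEYS.contains(k) ? "***REDACTED***" : v));
        return safe;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("namespace", "openmetadata");
        params.put("kubeConfigContent", "apiVersion: v1\nclusters: ...");
        // Only the redacted view should ever be interpolated into logs.
        System.out.println(redact(params));
    }
}
```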

⚠️ Bug: Automation/app jobs don't use retry logic like other K8s operations

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:962 📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:987

The runAutomationsWorkflow and runApplicationFlow methods directly call batchApi.createNamespacedJob() without the retry wrapper (executeWithRetry) that other K8s operations use. This is inconsistent and could lead to transient failures when running automations or applications.

Impact: Automation workflows and application runs may fail unnecessarily on transient K8s API errors (429, 5xx), while other operations would succeed after retries.

Code patterns:

  • runPipeline uses: executeWithRetry(() -> batchApi.createNamespacedJob(...))
  • runAutomationsWorkflow uses: batchApi.createNamespacedJob(...).execute() (no retry)
  • runApplicationFlow uses: batchApi.createNamespacedJob(...).execute() (no retry)

Recommendation: Wrap the API calls in executeWithRetry() for consistency:

executeWithRetry(() -> batchApi.createNamespacedJob(namespace, job).execute());
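
The retry pattern being recommended can be sketched in plain Java as follows. This is an illustration of the exponential-backoff idea, not the PR's actual executeWithRetry (which uses Resilience4j); ApiStatusException here stands in for the Kubernetes client's ApiException.

```java
import java.util.concurrent.Callable;

// Sketch of an executeWithRetry-style wrapper: retry only on throttling
// (429) and server errors (5xx), doubling the delay between attempts.
public class RetryDemo {
    static class ApiStatusException extends Exception {
        final int code;
        ApiStatusException(int code) { this.code = code; }
    }

    static boolean isRetryable(int code) {
        return code == 429 || code >= 500; // transient K8s API failures
    }

    static <T> T executeWithRetry(Callable<T> call, int maxAttempts, long baseDelayMs)
            throws Exception {
        long delay = baseDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (ApiStatusException e) {
                if (attempt >= maxAttempts || !isRetryable(e.code)) throw e;
                Thread.sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulate a job-creation call that fails twice with 503, then succeeds.
        String result = executeWithRetry(() -> {
            if (++calls[0] < 3) throw new ApiStatusException(503);
            return "job-created";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Wrapping runAutomationsWorkflow and runApplicationFlow this way would give them the same resilience to 429/5xx responses that runPipeline already has.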

More details 💡 2 suggestions
💡 Edge Case: sanitizeName could produce duplicate resource names for distinct inputs

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:1573

The sanitizeName method performs aggressive sanitization by replacing all non-alphanumeric characters with hyphens and truncating to 53 characters. This could lead to name collisions where distinct pipeline names produce the same sanitized output.

Examples:

  • pipeline_one → pipeline-one
  • pipeline.one → pipeline-one
  • MyPipeline123!@#ABC → mypipeline123-abc
  • my-pipeline-123-abc → mypipeline123-abc (potential collision)

Impact: If two pipelines have names that sanitize to the same value, they would share K8s resources (ConfigMaps, Secrets, CronJobs), leading to conflicts or data overwrites.

Recommendation: Consider using a hash suffix for collision avoidance, or validate uniqueness at the pipeline creation level based on sanitized names. At minimum, log a warning when sanitization significantly alters the name.
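
The hash-suffix recommendation can be sketched like this. The 53-character budget is taken from the review above, but the exact sanitization rules and method name are assumptions, not the PR's actual sanitizeName.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Locale;

// Sketch: append a short hash of the ORIGINAL name, so two distinct
// inputs that sanitize to the same base still produce distinct
// K8s resource names.
public class NameSanitizer {
    public static String sanitizeWithHash(String name) throws Exception {
        String base = name.toLowerCase(Locale.ROOT)
            .replaceAll("[^a-z0-9]+", "-")   // non-alphanumerics -> hyphen
            .replaceAll("^-+|-+$", "");      // trim leading/trailing hyphens
        byte[] digest = MessageDigest.getInstance("SHA-256")
            .digest(name.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (int i = 0; i < 4; i++) hex.append(String.format("%02x", digest[i]));
        // Reserve 9 chars ("-" + 8-char hash) within the 53-char budget.
        int maxBase = 53 - 9;
        if (base.length() > maxBase) base = base.substring(0, maxBase);
        return base + "-" + hex;
    }

    public static void main(String[] args) throws Exception {
        // Same sanitized base, but distinct final names thanks to the hash.
        System.out.println(sanitizeWithHash("pipeline_one"));
        System.out.println(sanitizeWithHash("pipeline.one"));
    }
}
```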

💡 Code Quality: getQueuedPipelineStatusInternal silently swallows ApiException

📄 openmetadata-service/src/main/java/org/openmetadata/service/clients/pipeline/k8s/K8sPipelineClient.java:920

The getQueuedPipelineStatusInternal method catches ApiException and only logs it, returning an empty list. Unlike other methods that propagate errors, this could mask connectivity issues or authorization problems.

Code:

} catch (ApiException e) {
  LOG.error("Failed to get pipeline status: {}", e.getResponseBody());
}
return statuses;

Impact: Callers cannot distinguish between "no jobs exist" and "failed to query K8s API". This could lead to incorrect UI states or operational confusion when the K8s API is unavailable.

Recommendation: Consider propagating the exception or returning a result object that indicates whether the query succeeded. At minimum, include more context in the log (status code, pipeline name, correlation ID).
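
One way to express the "result object" alternative is a small record that carries both the statuses and the outcome of the query. The names below are illustrative, not from the PR.

```java
import java.util.List;

// Sketch: lets callers distinguish "no jobs exist" (succeeded, empty list)
// from "failed to query the K8s API" (not succeeded, error populated),
// instead of returning an empty list in both cases.
public record StatusQueryResult(List<String> statuses, boolean succeeded, String error) {
    static StatusQueryResult ok(List<String> statuses) {
        return new StatusQueryResult(statuses, true, null);
    }

    static StatusQueryResult failed(String error) {
        return new StatusQueryResult(List.of(), false, error);
    }

    public static void main(String[] args) {
        // In the catch (ApiException e) block, the method would return
        // failed(...) rather than silently returning an empty list.
        System.out.println(failed("K8s API unavailable"));
    }
}
```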

What Works Well

  • Comprehensive K8s integration: Clean implementation of Jobs/CronJobs for both scheduled and on-demand pipelines
  • Robust error handling: Detailed error messages with helpful hints based on HTTP status codes, proper rollback on deployment failures
  • Thread-safe factory: PipelineServiceClientFactory now uses proper double-checked locking with volatile field
  • Observability: Good use of MDC for correlation IDs, structured logging throughout
  • Security hardening: Pod security context defaults (runAsNonRoot, runAsUser=1000, dropped capabilities, read-only filesystem)
  • Resilience: Retry logic with exponential backoff for transient K8s API failures (429, 5xx)
  • Excellent test coverage: Unit tests with mocks, integration tests with K3s testcontainers, and resource tests

Recommendations

  • Address the inconsistent retry usage in runAutomationsWorkflow and runApplicationFlow methods
  • Consider the potential for name collisions from aggressive sanitization
  • Review kubeconfig content handling to prevent credential leakage in logs



@pmbrull pmbrull merged commit 73464cc into main Dec 21, 2025
25 of 28 checks passed
@pmbrull pmbrull deleted the k8s-scheduler branch December 21, 2025 11:59
ShaileshParmar11 pushed a commit that referenced this pull request Dec 26, 2025

Labels

backend, safe to test (add this label to run secure GitHub workflows on PRs)
