Skip to content

feat: Kubernetes runner and S3 storage adapters#93

Merged
rorybyrne merged 6 commits into
mainfrom
092-feat-kubernetes-runner
Mar 22, 2026
Merged

feat: Kubernetes runner and S3 storage adapters#93
rorybyrne merged 6 commits into
mainfrom
092-feat-kubernetes-runner

Conversation

@rorybyrne

Copy link
Copy Markdown
Contributor

Summary

  • Add K8s Job-based hook and source runners (infrastructure/k8s/) with full security parity to Docker runners
  • Config-driven backend selection via runner.backend: "oci" | "k8s" using Dishka conditional activation (Marker/when=)
  • Harden FilesystemStorageAdapter for S3 CSI mounts (copy+delete fallback when rename() fails)
  • Extract shared runner utilities (runner_utils.py) used by both OCI and K8s runners
  • Thread identity through runner DTOs: HookInputs.deposition_srn, SourceInputs.convention_srn
  • Add kubernetes-asyncio as optional [k8s] dependency
  • Fix all type checker diagnostics: V1Job objects instead of raw dicts, CursorResult narrowing, TermConstraints isinstance check
  • Upgrade Dishka to 1.9.1 for conditional activation support

Test plan

  • just test — 901 unit tests pass
  • just lint — 0 errors, 0 warnings (fully clean)
  • K8s runner tests cover: Job spec generation, security context, scheduling watch, execution watch, orphan handling, cleanup, identity threading
  • Storage fallback tests cover: rename works locally, copy+delete on OSError, idempotent retry, error wrapping

Closes #92

…r backend

- Replace OciProvider with RunnerProvider supporting both OCI and K8s backends
- Add K8sConfig for Kubernetes-specific settings (namespace, PVC, etc.)
- Add RunnerConfig with backend selection ("oci" or "k8s")
- Remove runner field from SourceDefinition (now global config)
- Add deposition_srn to HookInputs and convention_srn to SourceInputs
- Extract shared runner utilities (memory parsing, progress parsing)
- Add K8s Job-based runners with security hardening and orphan handling
- Update storage adapter for cross-device compatibility (S3 CSI)
- Add kubernetes-asyncio as optional dependency with [k8s] extra
- Add proper error classification and health checks for K8s API

refactor: extract utility functions from OciHookRunner to shared module

Move parse_memory, parse_progress_file, and detect_rejection functions
to runner_utils module to improve code reusability and testability.
Update tests to use extracted functions directly and add missing
deposition_srn parameter to HookInputs test instances.
@greptile-apps

greptile-apps Bot commented Mar 21, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds Kubernetes Job-based runner infrastructure (K8sHookRunner, K8sSourceRunner) as a config-driven alternative to the existing OCI (Docker) runners, hardens FilesystemStorageAdapter for S3 CSI mounts, and extracts shared runner utilities into runner_utils.py. The backend is selected via runner.backend: "oci" | "k8s" using Dishka 1.9.1's conditional activation, and all concerns raised in the previous review round appear to have been addressed.

Key changes:

  • K8s runners implement the full two-phase lifecycle (scheduling watch → completion watch) with orphan detection, timeout safety nets, and diagnostic failure reporting, with security parity to the OCI runners (capabilities dropped, non-root UID/GID, read_only_root_filesystem for hooks)
  • label_value() and sanitize_label() in naming.py correctly sanitize SRN strings into K8s-compliant label values, resolving the colon-in-label-value issue from the previous review
  • FilesystemStorageAdapter.save_file fallback path now wraps copy2 OSError in InfrastructureError consistently with move_source_files_to_deposition
  • RunnerConfig adds a @model_validator that rejects backend=="k8s" with empty data_pvc_name at parse time, providing an actionable error before the first API call
  • _wait_for_completion in the source runner now has the final-poll safety net after the timeout loop, consistent with the hook runner

Remaining minor items:

  • _cleanup_job in source_runner.py omits the exception from the warning log (unlike the equivalent in runner.py), which would make debugging cleanup failures harder
  • sanitize_label truncates SHA-256 digest strings to 56 of 64 hex chars within the 63-char K8s label budget; functionally safe in practice but the sha256: prefix consumes budget that could be saved by stripping it before sanitizing

Confidence Score: 5/5

  • Safe to merge — all previously flagged critical issues are resolved, and the remaining items are non-blocking P2 suggestions.
  • All P0/P1 concerns from previous review rounds (invalid K8s label values, missing InfrastructureError wrapping, missing cross-field config validation, missing final-poll safety net in source runner) are addressed. The implementation passes 901 unit tests with zero lint errors. The only remaining findings are a missing error field in one log warning and a SHA-256 label truncation concern that is safe in practice.
  • server/osa/infrastructure/k8s/source_runner.py (minor cleanup warning log), server/osa/infrastructure/k8s/naming.py (digest truncation)

Important Files Changed

Filename Overview
server/osa/infrastructure/k8s/runner.py New K8s hook runner: well-structured Job lifecycle with orphan detection, scheduling/completion phases, final-poll timeout safety net, and full security parity (read-only rootfs, dropped caps, non-root UID, no network policy for hooks). Label sanitization via label_value() correctly addresses the previous SRN-colon issue.
server/osa/infrastructure/k8s/source_runner.py New K8s source runner mirrors hook runner structure. The final-poll safety net after the timeout loop is present (addresses previous thread concern). Minor: _cleanup_job warning log omits error details unlike runner.py. osa.io/digest label truncates SHA-256 hashes after 56 hex chars.
server/osa/infrastructure/k8s/naming.py sanitize_label and label_value correctly handle K8s label constraints (alphanumeric/dash/dot/underscore, 63-char max). Full SHA-256 digests are silently truncated to 56 hex chars within the budget; functionally fine in practice but worth noting.
server/osa/infrastructure/persistence/adapter/storage.py save_file S3 CSI fallback now wraps copy2 OSErrors in InfrastructureError (addresses previous thread concern). move_source_files_to_deposition already had consistent wrapping. Both atomic-write paths are covered by the new test suite.
server/osa/infrastructure/runner_utils.py Extracted shared utilities (progress parsing, rejection detection, memory conversion, PVC-relative path) used by both OCI and K8s runners. Deferred import logfire inside parse_records_file/parse_session_file remains (previously flagged).
server/osa/config.py Adds K8sConfig and RunnerConfig with a @model_validator that rejects backend=="k8s" when data_pvc_name is empty — addressing the previous thread's missing cross-field validation concern. Tests in test_config.py cover all four branches.
server/osa/infrastructure/k8s/di.py Dishka conditional activation (Marker/when=) cleanly routes HookRunner and SourceRunner resolution to either OCI or K8s factories based on runner.backend. Startup health check is performed during APP-scope K8s client initialization.
server/osa/infrastructure/k8s/health.py Startup health check validates namespace access (RBAC), namespace existence, and PVC availability with actionable error messages for each failure mode.
server/tests/unit/infrastructure/k8s/test_k8s_hook_runner.py Comprehensive unit tests covering Job spec generation, security context, scheduling/execution watch, orphan handling, cleanup, and identity threading. Label assertions now verify sanitized SRN values.
server/tests/unit/infrastructure/test_file_storage_move.py New test file covering both move_source_files_to_deposition and save_file fallback paths: rename works locally, copy+delete on OSError, idempotent retry, and InfrastructureError wrapping for copy failures.

Sequence Diagram

sequenceDiagram
    participant Svc as Service Layer
    participant DI as Dishka DI
    participant Runner as K8sHookRunner / K8sSourceRunner
    participant K8s as Kubernetes API
    participant PVC as PVC (S3 CSI / local)

    Svc->>DI: resolve HookRunner (backend="k8s")
    DI->>Runner: inject ApiClient + K8sConfig

    Svc->>Runner: run(hook, inputs, work_dir)
    Runner->>PVC: write input files (record.json, config.json)

    Runner->>K8s: list_namespaced_job (orphan check, label selector)
    K8s-->>Runner: [] (no existing job)

    Runner->>K8s: create_namespaced_job (V1Job spec)
    Note over Runner,K8s: labels: osa.io/role, osa.io/hook, osa.io/deposition (sanitized)

    loop Phase 1 — Scheduling (120s timeout)
        Runner->>K8s: list_namespaced_pod (job-name selector)
        K8s-->>Runner: pod phase
    end

    loop Phase 2 — Completion (timeout_seconds + 30)
        Runner->>K8s: read_namespaced_job
        K8s-->>Runner: status.succeeded / conditions
    end

    alt succeeded
        Runner->>PVC: parse progress.jsonl → HookResult
    else failed
        Runner->>K8s: list_namespaced_pod (diagnose OOM / exit code)
        Runner-->>Svc: HookResult(FAILED)
    end

    Runner->>K8s: delete_namespaced_job (cleanup, propagation=Background)
Loading

Comments Outside Diff (2)

  1. server/osa/infrastructure/k8s/source_runner.py, line 438-448 (link)

    Missing error details in cleanup warning log

    _cleanup_job in source_runner.py logs the warning without including the exception, making it harder to debug cleanup failures in production. The equivalent method in runner.py (line 503) correctly includes "error": str(exc) in the extra fields.

  2. server/osa/infrastructure/k8s/naming.py, line 9-17 (link)

    SHA-256 digest labels silently truncate the hash prefix

    sanitize_label truncates to 63 characters. A full digest like sha256:abc123... (71 chars after colon→dot substitution: sha256. + 64 hex chars = 71 chars) is truncated to sha256. + 56 hex chars. Only the first 56 of 64 hex characters are stored in osa.io/digest.

    While a SHA-256 collision on the truncated prefix is astronomically unlikely in practice, the label silently no longer uniquely identifies the full digest. Consider stripping the sha256: prefix before sanitizing to preserve more entropy within the 63-char budget:

    def label_value_for_digest(digest: str) -> str:
        # "sha256:abc..." → "abc..." (algorithm prefix is implicit)
        raw = digest.split(":", 1)[-1] if ":" in digest else digest
        return sanitize_label(raw)

    This is already the pattern used by job_name(), which extracts only the ID fragment from SRN strings.

Reviews (4): Last reviewed commit: "feat: add k8s memory quantity conversion..." | Re-trigger Greptile

Comment on lines +243 to +247
labels = {
"osa.io/role": "hook",
"osa.io/hook": hook.name,
"osa.io/deposition": deposition_srn,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Invalid Kubernetes label values — SRN strings contain colons

The deposition_srn field (e.g., "urn:osa:localhost:dep:abc123") is used directly as a Kubernetes label value. Kubernetes label values may only contain alphanumerics, -, _, and . — colons (:) are not permitted. When create_namespaced_job is called against a real cluster the API server will reject the request with a validation error like:

Invalid value: "urn:osa:localhost:dep:abc123": a valid label must be an empty string or
consist of alphanumeric characters, '-', '_' or '.', and must start and end with an
alphanumeric character (regex: (([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?)

The unit tests don't catch this because they use in-process Python objects, which don't apply K8s API validation.

The same problem affects the label selector in _check_existing_job:

label_selector = f"osa.io/hook={hook_name},osa.io/deposition={deposition_srn}"

Fix: extract only the ID fragment from the SRN (the same way job_name() already does with srn_parts[-1]), or sanitize the string before embedding it in labels:

# In _build_job_spec (and _check_existing_job):
dep_id = deposition_srn.split(":")[-1] if deposition_srn else ""
labels = {
    "osa.io/role": "hook",
    "osa.io/hook": hook.name,
    "osa.io/deposition": dep_id,
}

The identical issue exists in source_runner.py for the osa.io/convention label.

Comment on lines +173 to +176
label_parts = ["osa.io/role=source"]
if convention_srn:
label_parts.append(f"osa.io/convention={convention_srn}")
label_selector = ",".join(label_parts)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Invalid Kubernetes label values — SRN contains colons (same issue as runner.py)

convention_srn (e.g., "urn:osa:localhost:conv:xyz") is used directly as a label value and as part of a label selector. Both will be rejected by the K8s API server at runtime.

label_parts.append(f"osa.io/convention={convention_srn}")

and

labels["osa.io/convention"] = convention_srn

Fix: use only the trailing ID fragment, consistent with the job_name() helper:

conv_id = convention_srn.split(":")[-1] if convention_srn else ""
labels["osa.io/convention"] = conv_id
...
label_parts.append(f"osa.io/convention={conv_id}")

Comment on lines +170 to +188
async def _check_existing_job(
self, batch_api: BatchV1Api, namespace: str, convention_srn: str
) -> str | None:
label_parts = ["osa.io/role=source"]
if convention_srn:
label_parts.append(f"osa.io/convention={convention_srn}")
label_selector = ",".join(label_parts)

try:
job_list = await batch_api.list_namespaced_job(namespace, label_selector=label_selector)
except Exception as exc:
raise classify_api_error(exc) from exc

for job in job_list.items:
if job.status.succeeded:
return "succeeded"
if job.status.active:
return f"active:{job.metadata.name}"
return None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Orphan detection missing source-level discriminator

The label selector used to detect an already-running or completed source job filters only on osa.io/role=source (and optionally osa.io/convention). The _build_job_spec method also does not emit a label for the source identity (image or any other identifier).

If more than one distinct source image is ever run for the same convention — either currently or in a retry scenario — _check_existing_job will find the first match (any source job for that convention) and either short-circuit to its output or attach to a running job that belongs to a completely different source container. Contrast this with the hook runner, which includes osa.io/hook={hook.name} as an additional discriminator.

Even if the current domain model enforces one source per convention, making the labelling explicit prevents accidental cross-matching if that constraint is relaxed later, and makes debugging in kubectl much easier.

Consider adding the image reference (or a sanitised form of it) as an additional label:

# sanitise colon / slash / @ so the value is k8s-label-safe
image_label = re.sub(r"[^a-zA-Z0-9._-]", "-", source.image)[:63].strip("-")
labels["osa.io/source-image"] = image_label

and including it in the selector in _check_existing_job.

Comment on lines +383 to +384

return "failed:WatchTimeout"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing final-poll safety net on timeout

K8sHookRunner._wait_for_completion (in runner.py) does one extra read after the deadline loop to guard against a race where the job completes in the last millisecond:

# Timed out — poll once more
try:
    job = await batch_api.read_namespaced_job(job_name, namespace)
    if job.status.succeeded:
        return "succeeded"
except Exception:
    pass

This source-runner equivalent simply returns "failed:WatchTimeout" without the final poll, making it slightly more likely to report a false timeout under load. The two methods should be consistent.

Suggested change
return "failed:WatchTimeout"
# Timed out — poll once more to guard against a last-second completion
try:
job = await batch_api.read_namespaced_job(job_name, namespace)
if job.status.succeeded:
return "succeeded"
except Exception:
pass
return "failed:WatchTimeout"

labels = spec.spec.template.metadata.labels
assert labels["osa.io/role"] == "hook"
assert labels["osa.io/hook"] == "validate_dna"
assert labels["osa.io/deposition"] == "urn:osa:localhost:dep:abc123"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unit test doesn't catch invalid label value

The test at line 183 asserts that labels["osa.io/deposition"] == "urn:osa:localhost:dep:abc123", but this confirms the bug rather than catching it — Kubernetes Python client objects accept any string, so no validation error fires in the unit test. Once the label sanitization is fixed in the runner, this test assertion should be updated to verify the sanitized value (e.g., "abc123") is stored instead.

Install k8s extra dependencies across all CI jobs to ensure
Kubernetes-related functionality is available during testing,
type checking, and deployment processes.
Replace string-based SRN handling with proper SRN type objects in
source and validation runners to improve type safety and enable
better K8s label generation. Add K8s label utilities for converting
SRNs to DNS-compliant label values within 63-character limit.
@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

@github-actions

github-actions Bot commented Mar 21, 2026

Copy link
Copy Markdown

Code Coverage

Package Line Rate Complexity Health
. 93% 0
application 100% 0
application.api 100% 0
application.api.rest 82% 0
application.api.v1 82% 0
application.api.v1.routes 64% 0
application.event 100% 0
domain 100% 0
domain.auth 100% 0
domain.auth.command 90% 0
domain.auth.event 100% 0
domain.auth.model 93% 0
domain.auth.port 99% 0
domain.auth.query 93% 0
domain.auth.service 91% 0
domain.auth.util 100% 0
domain.auth.util.di 81% 0
domain.curation 100% 0
domain.curation.adapter 100% 0
domain.curation.command 100% 0
domain.curation.event 100% 0
domain.curation.handler 92% 0
domain.curation.model 100% 0
domain.curation.port 100% 0
domain.curation.query 100% 0
domain.curation.service 100% 0
domain.deposition 100% 0
domain.deposition.adapter 100% 0
domain.deposition.command 87% 0
domain.deposition.event 100% 0
domain.deposition.handler 100% 0
domain.deposition.model 98% 0
domain.deposition.port 100% 0
domain.deposition.query 86% 0
domain.deposition.service 97% 0
domain.deposition.util.di 94% 0
domain.discovery 100% 0
domain.discovery.model 100% 0
domain.discovery.port 100% 0
domain.discovery.query 100% 0
domain.discovery.service 94% 0
domain.discovery.util 100% 0
domain.discovery.util.di 94% 0
domain.export 100% 0
domain.export.adapter 100% 0
domain.export.command 100% 0
domain.export.event 100% 0
domain.export.model 100% 0
domain.export.port 100% 0
domain.export.query 100% 0
domain.export.service 100% 0
domain.feature 100% 0
domain.feature.event 100% 0
domain.feature.handler 100% 0
domain.feature.model 0% 0
domain.feature.port 100% 0
domain.feature.service 96% 0
domain.feature.util 100% 0
domain.feature.util.di 100% 0
domain.index 100% 0
domain.index.event 100% 0
domain.index.handler 75% 0
domain.index.model 84% 0
domain.index.service 100% 0
domain.record 100% 0
domain.record.adapter 100% 0
domain.record.command 100% 0
domain.record.event 100% 0
domain.record.handler 100% 0
domain.record.model 100% 0
domain.record.port 100% 0
domain.record.query 100% 0
domain.record.service 90% 0
domain.search 100% 0
domain.search.adapter 100% 0
domain.search.command 100% 0
domain.search.event 100% 0
domain.search.model 100% 0
domain.search.port 100% 0
domain.search.query 100% 0
domain.search.service 100% 0
domain.semantics 100% 0
domain.semantics.command 94% 0
domain.semantics.event 100% 0
domain.semantics.handler 100% 0
domain.semantics.model 100% 0
domain.semantics.port 100% 0
domain.semantics.query 90% 0
domain.semantics.service 100% 0
domain.semantics.util 100% 0
domain.semantics.util.di 93% 0
domain.shared 94% 0
domain.shared.authorization 87% 0
domain.shared.model 93% 0
domain.shared.port 100% 0
domain.source 100% 0
domain.source.event 100% 0
domain.source.handler 92% 0
domain.source.model 100% 0
domain.source.port 100% 0
domain.source.schedule 67% 0
domain.source.service 98% 0
domain.validation 100% 0
domain.validation.adapter 100% 0
domain.validation.command 0% 0
domain.validation.event 100% 0
domain.validation.handler 90% 0
domain.validation.model 85% 0
domain.validation.port 100% 0
domain.validation.query 100% 0
domain.validation.service 89% 0
domain.validation.util.di 92% 0
infrastructure 89% 0
infrastructure.auth 56% 0
infrastructure.event 79% 0
infrastructure.http 92% 0
infrastructure.index 82% 0
infrastructure.index.vector 73% 0
infrastructure.k8s 78% 0
infrastructure.messaging 100% 0
infrastructure.oci 54% 0
infrastructure.persistence 80% 0
infrastructure.persistence.adapter 63% 0
infrastructure.persistence.mappers 56% 0
infrastructure.persistence.repository 32% 0
infrastructure.source 91% 0
sdk 100% 0
sdk.index 100% 0
util 100% 0
util.di 62% 0
Summary 80% (5892 / 7350) 0

Comment on lines +92 to 99
try:
Path(tmp_path).rename(target)
except OSError:
shutil.copy2(tmp_path, target)
Path(tmp_path).unlink(missing_ok=True)
except Exception:
Path(tmp_path).unlink(missing_ok=True)
raise

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 save_file copy failure not wrapped in InfrastructureError

When rename() fails on an S3 CSI mount and the fallback shutil.copy2 itself raises, the raw OSError propagates uncaught out of save_file. The outer except Exception: block just cleans up the temp file and re-raises. This is inconsistent with move_source_files_to_deposition, which wraps the same failure in InfrastructureError.

Any caller that catches only InfrastructureError (the domain error type) for storage failures will silently miss this path, turning an expected infrastructure error into an unhandled exception. There is also no unit test for this scenario in TestSaveFileFallback, meaning the gap is invisible.

        try:
            Path(tmp_path).rename(target)
        except OSError:
            try:
                shutil.copy2(tmp_path, target)
                Path(tmp_path).unlink(missing_ok=True)
            except OSError as e:
                raise InfrastructureError(f"Failed to write file {filename}: {e}") from e

Comment thread server/osa/config.py
Comment on lines +86 to +101
class K8sConfig(BaseModel):
"""Kubernetes-specific runner settings, required when runner.backend == "k8s"."""

namespace: str = "osa"
service_account: str | None = None
data_pvc_name: str = ""
data_mount_path: str = "/data"
image_pull_secrets: list[str] = []
job_ttl_seconds: int = 300


class RunnerConfig(BaseModel):
"""Runner backend selection and Kubernetes configuration."""

backend: Literal["oci", "k8s"] = "oci"
k8s: K8sConfig = K8sConfig()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing cross-field validation: data_pvc_name required when backend == "k8s"

K8sConfig.data_pvc_name defaults to "". If an operator enables the K8s backend but forgets to set this field, the PVC claim name embedded in every Job spec will be an empty string. The error will only surface at startup via the health-check call to read_namespaced_persistent_volume_claim("", namespace), which returns a cryptic K8s 404. A @model_validator on RunnerConfig would surface the problem at config-parse time with a clear message:

class RunnerConfig(BaseModel):
    backend: Literal["oci", "k8s"] = "oci"
    k8s: K8sConfig = K8sConfig()

    @model_validator(mode="after")
    def validate_k8s_required_fields(self) -> "RunnerConfig":
        if self.backend == "k8s" and not self.k8s.data_pvc_name:
            raise ValueError(
                "runner.k8s.data_pvc_name is required when runner.backend == 'k8s'. "
                "Set OSA_RUNNER__K8S__DATA_PVC_NAME."
            )
        return self

Comment on lines +69 to +86
def parse_records_file(output_dir: Path) -> list[dict[str, Any]]:
"""Parse records.jsonl from source output directory."""
import logfire

records: list[dict[str, Any]] = []
records_file = output_dir / "records.jsonl"
if not records_file.exists():
return records

for line in records_file.read_text().strip().split("\n"):
if not line.strip():
continue
try:
records.append(json.loads(line))
except json.JSONDecodeError:
logfire.warn("Skipping invalid JSON line in records.jsonl")
continue
return records

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Deferred import logfire inside utility functions

logfire is imported lazily inside parse_records_file and parse_session_file. Since logfire is a declared runtime dependency used at module level throughout the rest of the codebase, there is no reason to defer the import here. Keeping the import at the top of the module makes the dependency explicit, avoids a small per-call overhead, and prevents surprising ImportErrors surfacing at parse time rather than module load time.

import logfire
import json
...

def parse_records_file(output_dir: Path) -> list[dict[str, Any]]:
    records: list[dict[str, Any]] = []
    ...

Comment thread server/osa/infrastructure/k8s/runner.py Outdated
Comment on lines +329 to +335
def _relative_path(self, path: Path) -> str:
"""Strip the data mount prefix to get a PVC-relative subpath."""
mount = self._config.data_mount_path.rstrip("/")
path_str = str(path)
if not path_str.startswith(mount):
raise ValueError(f"Path {path} is outside the data mount prefix {mount}")
return path_str[len(mount) :].lstrip("/")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 _relative_path duplicated verbatim in source_runner.py

K8sHookRunner._relative_path (lines 329–335) and K8sSourceRunner._relative_path (lines 313–318 of source_runner.py) are character-for-character identical. Extracting this to runner_utils.py or a shared base class would eliminate the duplication — a future change to the mount-path stripping logic would otherwise need to be applied in two places.

…method

Remove unused CoreV1Api parameter from _wait_for_completion method
in both K8sHookRunner and K8sSourceRunner classes to clean up the
method signature and improve code maintainability.
- Add model validator to ensure data_pvc_name is set when using K8s backend
- Extract relative_path utility function to reduce code duplication
- Improve file storage error handling with proper exception wrapping
- Add comprehensive tests for K8s config validation and file operations
@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

Convert Docker-style memory strings to K8s resource quantities to
ensure proper memory limit specification in Kubernetes jobs

fix: improve temp file cleanup in storage adapter

Move temp file cleanup outside exception handler and add warning
logging when cleanup fails to prevent masking original errors
@rorybyrne

Copy link
Copy Markdown
Contributor Author

@greptile

@rorybyrne rorybyrne merged commit 850703c into main Mar 22, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: Kubernetes runner and S3 storage adapters

1 participant