[WIP/PoC] Centralize KFP driver as standalone ml-pipeline-driver service #2
Open
droctothorpe wants to merge 1 commit into
Owner
Hey @droctothorpe, could you please open an MR from this branch to master? You can call me paranoid, but I'm not fully sure the new resources will be applied properly, so let's verify it :)
…ne-driver service Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: test <test@test.com> Signed-off-by: droctothorpe <mythicalsunlight@gmail.com>
Overview
This PoC explores moving the KFP driver from a per-pod Argo executor plugin
sidecar (injected into every workflow pod) to a dedicated, centrally running
Kubernetes Deployment (ml-pipeline-driver). Argo Workflows communicates with
the driver via HTTP address mode rather than injecting a sidecar into each
task pod.
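To make the HTTP address mode concrete, here is a minimal Go sketch of a central service exposing the two routes this PR lists for cmd/main.go (POST /api/v1/template.execute and GET /healthz). It is not the PR's handler: the request struct, the namespaceFromBody helper, and the placeholder JSON reply are all assumptions for illustration, and only the workflow.metadata.namespace field is modelled.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// executeRequest is a hypothetical, trimmed request body. Only the
// workflow namespace is modelled; the real payload carries more fields.
type executeRequest struct {
	Workflow struct {
		Metadata struct {
			Namespace string `json:"namespace"`
		} `json:"metadata"`
	} `json:"workflow"`
}

// namespaceFromBody extracts workflow.metadata.namespace from a JSON body.
func namespaceFromBody(body []byte) (string, error) {
	var req executeRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return "", err
	}
	if req.Workflow.Metadata.Namespace == "" {
		return "", errors.New("workflow.metadata.namespace is empty")
	}
	return req.Workflow.Metadata.Namespace, nil
}

// newMux wires the health check and the execute route.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/api/v1/template.execute", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		ns, err := namespaceFromBody(body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		// Placeholder reply; a real driver would run its logic here.
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]string{"namespace": ns})
	})
	return mux
}

func main() {
	// Exercise the mux in-process so the sketch terminates; a deployment
	// would instead call http.ListenAndServe(":8080", newMux()).
	srv := httptest.NewServer(newMux())
	defer srv.Close()

	resp, err := http.Get(srv.URL + "/healthz")
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
	fmt.Println("healthz status:", resp.StatusCode)

	payload := strings.NewReader(`{"workflow":{"metadata":{"namespace":"kubeflow"}}}`)
	resp, err = http.Post(srv.URL+"/api/v1/template.execute", "application/json", payload)
	if err != nil {
		log.Fatal(err)
	}
	out, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("execute status: %d body: %s", resp.StatusCode, out)
}
```

Because one long-running process serves every workflow, its output lands in a single pod's log stream rather than being spread across per-task sidecars.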
Motivation
/kfp/log)
kubectl logs
ml-pipeline SA
What changed
backend/src/driver/
  package main to importable package driver.
  rpc_handler.go + helpers into handler.go; removed standalone server files (main.go, execution_paths.go).
  workflow.metadata.namespace/objectMeta.namespace).
  os.TempDir() instead of the Argo shared /kfp/log volume.
  handler_test.go.
backend/src/driver/cmd/main.go (new)
  ml-pipeline-driver.
  POST /api/v1/template.execute and GET /healthz on :8080.
  flag.Set("logtostderr","true") before flag.Parse() so glog emits to stderr.
backend/Dockerfile.driver (new)
backend/Makefile
  image_driver target; included in image_all.
backend/src/apiserver/main.go
backend/src/v2/compiler/argocompiler/plugin.go
  driverEndpointURL → http://ml-pipeline-driver.kubeflow.svc.cluster.local:8080/api/v1/template.execute
backend/src/v2/compiler/argocompiler/{argo,container,dag}.go
  ValueFrom.Expression to taskResultExtract() (jsonpath over outputs.result), which is the only output field Argo HTTP templates actually populate.
manifests/kustomize/base/pipeline/
  ml-pipeline-driver-{sa,role,rolebinding,deployment,service}.yaml.
  ml-pipeline-driver-plugin-cm.yaml (sidecar config no longer needed).
  kustomization.yaml.
  reader) are now bound to the ml-pipeline SA.
manifests/kustomize/env/cert-manager/platform-agnostic-standalone-tls/kustomization.yaml
manifests/kustomize/base/installs/multi-user/pipelines-profile-controller/sync.py
  ml-pipeline SA instead.
.github/resources/manifests/
  driver-plugin-cm-path.yaml patches (no longer applicable).
  kubernetes-native, tls-enabled).
.github/workflows/image-builds*.yml
  ml-pipeline-driver image build targets.
test_data/compiled-workflows/
Manual testing notes
root-driver, add-driver) succeed and driver logs are now visible via kubectl logs.
kfp module) in the manual test environment; this is a test-script issue (install_kfp_package=False) and is unrelated to the driver changes.
ScheduledWorkflow CRD (scheduled-workflow-crd.yaml) must be present in the cluster; without it the persistence agent fails to watch ScheduledWorkflow resources and runs remain stuck in "pending execution".
Known gaps / TODO
ml-pipeline-driver image.
actionlint CI check flags a type mismatch in image-builds-master.yml introduced by the new image target; needs a follow-up fix.
ml-pipeline-driver should run as a sidecar to ml-pipeline (same pod, different container) vs. a fully independent Deployment.
ml-pipeline-driver not yet addressed.
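For reviewers unfamiliar with the compiler change listed under "What changed" (ValueFrom.Expression built from a jsonpath over outputs.result), the following Go sketch shows roughly what such an expression string could look like. The helper name resultFieldExpr, its signature, and the field name "condition" are hypothetical; the PR's taskResultExtract() may be shaped differently. The sketch also assumes task and field names contain no single quotes.

```go
package main

import "fmt"

// resultFieldExpr builds an Argo expression string that reads one field
// from the JSON a task returned in outputs.result.
// Hypothetical helper; not the PR's taskResultExtract().
func resultFieldExpr(taskName, field string) string {
	return fmt.Sprintf("jsonpath(tasks['%s'].outputs.result, '$.%s')", taskName, field)
}

func main() {
	// "root-driver" is a task name from the manual testing notes;
	// "condition" is an illustrative field name only.
	fmt.Println(resultFieldExpr("root-driver", "condition"))
}
```

Routing every driver output through outputs.result means the driver's reply has to carry all downstream values in one JSON document, which the compiler then picks apart per parameter.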