
[feature] documentation for production grade deployment of kubeflow pipelines #6204

Closed
darthsuogles opened this issue Aug 1, 2021 · 17 comments
Assignees
Labels
kind/feature, lifecycle/stale

Comments

@darthsuogles

darthsuogles commented Aug 1, 2021

Feature Area

/area documentation
/area samples
/area deployment

What feature would you like to see?

Documentation for production-grade deployment of kubeflow pipelines.

What is the use case or pain point?

Is there a workaround currently?

Unaware


Love this idea? Give it a 👍. We prioritize fulfilling features with the most 👍.

@Bobgy Bobgy self-assigned this Aug 6, 2021
@zijianjoy zijianjoy added this to Needs triage in Project Health via automation Aug 6, 2021
@Bobgy Bobgy added this to Needs triage in KFP Runtime Triage via automation Aug 6, 2021
@Bobgy Bobgy removed this from Needs triage in Project Health Aug 6, 2021
@Bobgy Bobgy moved this from Needs triage to P1 in KFP Runtime Triage Aug 6, 2021
@Bobgy
Contributor

Bobgy commented Aug 6, 2021

I have some personal notes on the topic, will try to document them.

@darthsuogles
Author

Thank you!
Any chance you've had time to work on this in the past couple of weeks?

@vinayan3

@Bobgy in terms of production, some guidance on which components can have > 1 replica would be very useful. Initially, I'm planning to try increasing the replica count to 2 for ml-pipeline-ui. This should allow users to see something even if other things are down.

The other things that I think could have a replica count > 1 are:

  • ml-pipeline
  • metadata-grpc-service
  • ml-pipeline-visualizationserver

Things I'm not sure about are:

  • controller-manager-service

@Bobgy
Contributor

Bobgy commented Aug 20, 2021

Posting my unedited notes first, will try to revisit. Looking forward to any feedback.

Some of these tips are Google Cloud specific, but most of them are general advice.

  • Deploy to a regional cluster, even if your workloads run on zonal node pools. Regional clusters run multiple instances of the K8s API server, so the K8s API stays highly available; in a zonal cluster, the API server becomes unresponsive during scaling, upgrades, and many maintenance operations.

  • For KFP on GCP, configure the node pool's default Google Service Account (GSA) with minimal permissions. You can grant the serviceAccountUser role on this GSA to users/GSAs that need access to the proxy.

  • Recommend enabling node pool autoscaling so the cluster can absorb bursts of workloads (a gcloud sketch of these GCP points follows after this list).

  • Set memory/CPU requests/limits on pipeline steps to guarantee they are not evicted when the cluster is under resource constraints. Kubernetes also uses resource requests as the signal for node pool scaling, so when autoscaling is enabled you should always set resource requests so that Kubernetes can properly decide when to scale up/down.
    Resource requests/limits can be set using the KFP DSL (see the example pipeline, and the DSL sketch after this list). Reference: https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.dsl.html#kfp.dsl.Sidecar.set_memory_limit.

  • Set memory/CPU requests/limits on system services; the latest KFP release already has sane default values. However, the memory/CPU needs of the KFP API server (ml-pipeline deployment), KFP persistence agent (ml-pipeline-persistenceagent deployment) and Argo workflow controller (workflow-controller deployment) grow roughly linearly with the number of workflows (even completed ones). Therefore:

  • Reduce the TTL of completed workflows to match your use case; the default is 1 day (see the kustomize sketch after this list).

  • Monitor these deployments and set requests/limits based on real usage + some buffer.

  • Set up retry strategies for steps that end in an error state. There are two failure types: error and failure. Error refers to orchestration-system problems, while failure refers to user-container failures. It's therefore recommended to specify a retryStrategy at least for errors and, depending on your use case, for failures as well.
    Example: set_retry(policy="Always")  # or "OnError" (see the DSL sketch after this list)

  • If you need to customize the deployment, pull the KFP manifests as an upstream base and follow kustomize's off-the-shelf application workflow (see the kustomize sketch after this list). This enables infrastructure as code and easy upgrades.

  • A bonus point is to use GitOps (there are many tools for this purpose): keep your infrastructure as code in a repo and use a GitOps tool to sync it to production. That way you get version control, rollbacks, etc.

  • Use managed storage (Cloud SQL & Cloud Storage) to simplify lifecycle management: https://github.com/kubeflow/pipelines/tree/master/manifests/kustomize/sample .

  • Configure a lifecycle policy (e.g. clean up intermediate artifacts after 7 days) for the object store you are using, e.g. for MinIO or for GCS (see the lifecycle sketch after this list). Note: in the default MinIO bucket, intermediate artifacts are stored in minio://mlpipeline/artifacts and pipeline templates in minio://mlpipeline/pipelines, so do not set a lifecycle rule on pipeline templates; they should be kept.
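
To make the GCP-specific points above concrete, here is a rough gcloud sketch (not an official recipe; the project, cluster, node pool, and account names are placeholders, and flags may vary between gcloud versions):

```sh
# Regional cluster: the control plane is replicated across zones, so the K8s
# API stays available during upgrades and maintenance.
gcloud container clusters create kfp-cluster \
  --region us-central1 \
  --num-nodes 1

# Minimal-permission GSA used as the node pool's default identity.
gcloud iam service-accounts create kfp-nodepool-sa \
  --display-name "KFP node pool service account"

# Node pool bound to that GSA, with autoscaling enabled so the resource
# requests on pipeline steps can drive scale-up/down.
gcloud container node-pools create kfp-workloads \
  --cluster kfp-cluster \
  --region us-central1 \
  --service-account kfp-nodepool-sa@MY_PROJECT.iam.gserviceaccount.com \
  --enable-autoscaling --min-nodes 0 --max-nodes 10

# Let a user (or another GSA) act as the node pool GSA, e.g. to reach the proxy.
gcloud iam service-accounts add-iam-policy-binding \
  kfp-nodepool-sa@MY_PROJECT.iam.gserviceaccount.com \
  --member "user:alice@example.com" \
  --role "roles/iam.serviceAccountUser"
```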
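
For the resource-request and retry points, a minimal DSL sketch using the KFP v1 Python SDK (current at the time of this thread; the op name, image, and values are made up, and set_retry's policy argument assumes a reasonably recent 1.x release):

```python
import kfp
from kfp import dsl


@dsl.pipeline(name="resource-and-retry-demo")
def demo_pipeline():
    train = dsl.ContainerOp(
        name="train",
        image="python:3.9",
        command=["python", "-c", "print('training...')"],
    )
    # Requests are the signal the cluster autoscaler acts on; limits cap usage
    # so the step is less likely to be evicted under resource pressure.
    train.container.set_cpu_request("500m")
    train.container.set_cpu_limit("1")
    train.container.set_memory_request("512Mi")
    train.container.set_memory_limit("1Gi")
    # Retry orchestration errors; use policy="Always" to also retry
    # user-container failures.
    train.set_retry(num_retries=3, policy="OnError")


if __name__ == "__main__":
    kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```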
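
For the kustomize and TTL points, a sketch of an overlay that pulls the KFP manifests as an upstream base and patches the persistence agent. The ref, paths, deployment name, and the TTL environment variable name are assumptions to verify against your KFP version (cluster-scoped resources are applied separately):

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: kubeflow
resources:
  # Off-the-shelf KFP manifests as the upstream base (pin a released ref).
  - github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic?ref=1.7.1
patchesStrategicMerge:
  - persistence-agent-patch.yaml
```

```yaml
# persistence-agent-patch.yaml
# Shorten the TTL of completed workflows and pin explicit resources,
# assuming the agent still reads TTL_SECONDS_AFTER_WORKFLOW_FINISH.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline-persistenceagent
spec:
  template:
    spec:
      containers:
        - name: ml-pipeline-persistenceagent
          env:
            - name: TTL_SECONDS_AFTER_WORKFLOW_FINISH
              value: "43200"   # keep completed workflows for 12h instead of 1 day
          resources:
            requests:
              cpu: 120m
              memory: 500Mi
```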
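
And for the object-store lifecycle point, a GCS sketch (the bucket name is a placeholder). With the default MinIO deployment, scope any equivalent rule to the artifacts/ prefix so the pipeline templates under mlpipeline/pipelines are kept:

```sh
# Delete intermediate artifacts older than 7 days.
cat > lifecycle.json <<'EOF'
{"rule": [{"action": {"type": "Delete"}, "condition": {"age": 7}}]}
EOF
gsutil lifecycle set lifecycle.json gs://my-kfp-artifact-bucket
```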

@Bobgy
Contributor

Bobgy commented Aug 21, 2021

@Bobgy in terms of production, some guidance on which components can have > 1 replica would be very useful. Initially, I'm planning to try increasing the replica count to 2 for ml-pipeline-ui. This should allow users to see something even if other things are down.

The other things that I think could have a replica count > 1 are:

  • ml-pipeline
  • metadata-grpc-service
  • ml-pipeline-visualizationserver

Things I'm not sure about are:

  • controller-manager-service

This is something I haven't experimented with much, but from my understanding:

  • ml-pipeline-ui
  • ml-pipeline*
  • metadata-grpc-service*
  • ml-pipeline-visualizationserver

can all be made multi-replica right now.

There is a caveat: ml-pipeline and metadata-grpc-service upgrade the DB schema on startup, so if you are doing an upgrade, I recommend scaling back to 1 replica first (see the kubectl sketch below).

The controllers should be able to run in leader-election mode: one instance is the leader and one is a standby; whenever the leader dies, the standby takes over. However, I believe the KFP controllers would need some dependency upgrades, and we'd need to expose the relevant flags.
The Argo workflow controller can already be set up this way: https://argoproj.github.io/argo-workflows/high-availability/
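
A quick way to try the replica change, assuming the default namespace and the deployment names discussed above (the metadata gRPC Deployment may be called metadata-grpc-deployment in your manifests):

```sh
kubectl -n kubeflow scale deployment ml-pipeline-ui --replicas=2
kubectl -n kubeflow scale deployment ml-pipeline --replicas=2
kubectl -n kubeflow scale deployment ml-pipeline-visualizationserver --replicas=2

# Before a KFP upgrade, scale the schema-migrating services back down first.
kubectl -n kubeflow scale deployment ml-pipeline --replicas=1
```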

@vinayan3

@Bobgy I've applied the suggestions above for the components that can have a replica count greater than one. I've also added PodDisruptionBudgets and Pod Topology Spread Constraints to keep all the replicas from landing on a single node (sketch below).
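
For reference, here is roughly what I mean for ml-pipeline-ui: a PodDisruptionBudget plus a strategic-merge patch on the Deployment. The app label and namespace are assumptions, so check the labels your manifests actually use (PodDisruptionBudget is policy/v1 on Kubernetes 1.21+):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-pipeline-ui
  namespace: kubeflow
spec:
  minAvailable: 1          # keep at least one UI pod up during voluntary disruptions
  selector:
    matchLabels:
      app: ml-pipeline-ui
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline-ui
  namespace: kubeflow
spec:
  replicas: 2
  template:
    spec:
      # Spread the replicas across nodes so a single node failure
      # doesn't take out both pods.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: ml-pipeline-ui
```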

I'll have to look into getting the argo workflow controller to have an active / passive mode.

Thanks for the suggestions and advice. It's really appreciated.

@Bobgy
Contributor

Bobgy commented Aug 22, 2021

Cool, interested to see how that plays out.

@rubenaranamorera

@Bobgy Is there any easy way to integrate Kubeflow Pipelines directly with GitOps? Currently we just convert our pipelines to Argo Workflows. We can run and schedule those pipelines, but we lose all the fancy Kubeflow capabilities in the UI, which complicates things for data scientists. Any ideas on this?

@Bobgy
Contributor

Bobgy commented Aug 25, 2021

@rubenaranamorera There's a feature request in #6001.

@Bobgy
Contributor

Bobgy commented Aug 25, 2021

Minor update: I added a final point to my comment above about configuring a lifecycle policy for the object store.

@NikeNano
Member

@Bobgy Is there any easy way to integrate Kubeflow Pipelines directly with GitOps? Currently we just convert our pipelines to Argo Workflows. We can run and schedule those pipelines, but we lose all the fancy Kubeflow capabilities in the UI, which complicates things for data scientists. Any ideas on this?

You (@rubenaranamorera) can use the SDK if you like. I did some work on this for GitHub Actions (it hasn't been updated in quite some time, so it might need some love to work for you): https://github.com/NikeNano/kubeflow-github-action
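
For anyone rolling their own, a minimal sketch of driving KFP from CI with the v1 SDK; the host URL, file name, and pipeline names are placeholders:

```python
import kfp

client = kfp.Client(host="https://my-kfp-endpoint.example.com")

# Upload the pipeline package compiled in CI; for subsequent revisions of an
# existing pipeline, use client.upload_pipeline_version() instead.
client.upload_pipeline(
    pipeline_package_path="demo_pipeline.yaml",
    pipeline_name="demo-pipeline",
)

# Optionally kick off a run so every merge to main is exercised end to end.
client.create_run_from_pipeline_package(
    "demo_pipeline.yaml",
    arguments={},
    run_name="ci-smoke-test",
)
```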

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@zijianjoy zijianjoy removed the lifecycle/stale label Mar 2, 2022
@Bobgy
Contributor

Bobgy commented Mar 2, 2022

/lifecycle freeze

@vinayan3

vinayan3 commented Mar 3, 2022

After more than 6 months of running the configuration with replicas > 1, there haven't been any issues.

Also, the Argo workflow controller may not need to run with more than one replica or be sharded unless there is a huge number of workflows. The pod restarts gracefully on other nodes and picks up work where it left off.

Would there be interest in creating an overlay for HA?

@daro1337

daro1337 commented Mar 1, 2024

@vinayan3 could you please sum up which components could be scaled easily without causing any malfunction in your deployment? Thanks in advance.


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the lifecycle/stale label Jun 18, 2024

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

KFP Runtime Triage automation moved this from P1 to Closed Jul 10, 2024