
Build a bridge between MLflow and Kubeflow #6647

Closed

jagane-infinstor opened this issue Sep 13, 2022 · 43 comments

@jagane-infinstor

/kind feature

Why you need this feature:
We want to run AI workloads in Kubernetes and use MLflow for Experiment Tracking and Model Management.

Describe the solution you'd like:
We would like to design and run a DAG in Kubernetes, with each node of the DAG being an MLflow Project.

MLflow is very popular among Data Scientists, Data Engineers and MLOps staff. Its strength is ML Experiment Tracking and ML Model Management. However, MLflow does not include any compute capability. Kubeflow, on the other hand, is very strong in managing compute via Kubernetes. It would be useful for Kubeflow to include functionality to build a DAG out of MLflow Projects (a packaged reproducible piece of ML code) and run it in Kubernetes.

We believe that for this project to be successful:

  • It is necessary for this system to support parallelization of any of the nodes in the DAG
  • Data Partitioning in order to feed the parallel compute instances (pods) is integral
  • Reuse of existing Kubeflow concepts and existing MLflow concepts, wherever possible, is desirable

Anything else you would like to add:
We have been working on a proof of concept, MLflow Parallels, an Apache-licensed open source project: https://mlflow-parallels.org
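To make the parallelization idea concrete, here is a minimal sketch (not taken from MLflow Parallels itself) of fanning a single DAG node out over data partitions by launching one MLflow Project run per partition on MLflow's Kubernetes project backend. The project URI, the backend config file, the partition list and the `input_path` parameter are all placeholders for illustration.

```python
# Hypothetical sketch: fan one DAG node out over data partitions by starting
# one MLflow Project run per partition on MLflow's Kubernetes backend.
# The project URI, backend config and "input_path" parameter are placeholders.
import mlflow

PARTITIONS = ["s3://bucket/data/part-0", "s3://bucket/data/part-1"]

submitted = [
    mlflow.projects.run(
        uri="https://github.com/example/preprocess-project",  # Docker-based MLflow Project
        backend="kubernetes",
        backend_config="kubernetes_config.json",  # kube-context, repository-uri, job template path
        parameters={"input_path": partition},     # each Kubernetes Job gets its own partition
        synchronous=False,                        # launch everything, then wait
    )
    for partition in PARTITIONS
]

for run in submitted:
    run.wait()  # block until every parallel job has finished
```

A real bridge would presumably generate these launches from a DAG specification rather than a hard-coded list.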

@jagane-infinstor
Author

MLflow Projects Specification is available here:
https://www.mlflow.org/docs/latest/projects.html

Note that we are only interested in MLflow Projects that specify the environment for the software by means of a Dockerfile/Docker container. We are not interested in supporting MLflow Projects that specify their environment as a conda.yaml file.

@jbottum
Contributor

jbottum commented Sep 13, 2022

@jagane-infinstor Great timing. The Kubeflow user survey identified that a good percentage of Kubeflow users (43%) also leverage MLflow. I believe that Kubeflow needs a Model Registry component, and we need to consider integrating an existing one or building our own. I am interested in bringing this idea to the contributors and users to see if they have opinions on requirements and timing. I believe this would be a good discussion topic for the Sept 27 Kubeflow Community Meeting.
[Screenshot: Kubeflow user survey results]

@amolsr

amolsr commented Sep 14, 2022

We tried integrating it into the Kubeflow dashboard through an iframe, but it was not able to compare experiments. If we instead add it as an external URI, then we need to think about the authentication part, i.e. how we can use Kubeflow credentials to authenticate with MLflow.

@amolsr

amolsr commented Sep 14, 2022

#6564 needs to be merged for MLFlow to work properly in iframe.

@jagane-infinstor
Author

MLflow open source does not have any authentication built in. The commercial offerings such as Databricks, AzureML and InfinStor (I work for InfinStor) include authentication. Databricks, for example, uses a bearer token.
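For the hosted offerings mentioned above, the client side typically boils down to pointing the standard MLflow client at the external server with a token. A rough sketch, assuming the client honors the usual MLFLOW_TRACKING_URI / MLFLOW_TRACKING_TOKEN environment variables; the URI and token are placeholders and would normally come from a Kubernetes Secret:

```python
# Sketch: user code in a Kubeflow-launched pod logging to an external,
# bearer-token-authenticated MLflow tracking server. Values are placeholders.
import os
import mlflow

os.environ["MLFLOW_TRACKING_URI"] = "https://mlflow.example.com"    # external service
os.environ["MLFLOW_TRACKING_TOKEN"] = "<bearer-token-from-secret>"  # sent as Authorization: Bearer ...

mlflow.set_experiment("kubeflow-bridge-demo")
with mlflow.start_run():
    mlflow.log_param("source", "kubeflow-pod")
    mlflow.log_metric("accuracy", 0.93)
```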

@jagane-infinstor
Author

In another context, we utilized an MIT-licensed piece of software called single-spa - https://single-spa.js.org/ - to integrate the UIs of multiple disparate projects. That may be an option for us here.

@Madaditya

Deploy MLflow on Kubernetes using the Helm chart and proxy it on a subpath of the kubeflow-gateway.
That way you have MLflow deployed on the same cluster, behind the same Istio gateway, and you can push metrics to it via its cluster IP.
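For anyone trying this, a rough sketch of the subpath routing, expressed as an Istio VirtualService created with the Kubernetes Python client; the namespace, service name, port and URI prefix are assumptions, not a tested manifest:

```python
# Sketch: route /mlflow/ on the existing Kubeflow Istio gateway to an
# in-cluster MLflow service. Namespace, service name, port and prefix are
# placeholders for illustration.
from kubernetes import client, config

virtual_service = {
    "apiVersion": "networking.istio.io/v1beta1",
    "kind": "VirtualService",
    "metadata": {"name": "mlflow", "namespace": "mlflow"},
    "spec": {
        "gateways": ["kubeflow/kubeflow-gateway"],
        "hosts": ["*"],
        "http": [{
            "match": [{"uri": {"prefix": "/mlflow/"}}],
            "rewrite": {"uri": "/"},
            "route": [{"destination": {"host": "mlflow.mlflow.svc.cluster.local",
                                       "port": {"number": 5000}}}],
        }],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="networking.istio.io", version="v1beta1",
    namespace="mlflow", plural="virtualservices", body=virtual_service,
)
```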

@amolsr

amolsr commented Sep 15, 2022

MLflow open source does not have any authentication built in. The commercial offerings such as Databricks, AzureML and Infinstor (I work for InfinStor) include authentication. Databricks for example, uses a bearer token.

Grafana has something called an auth-proxy configuration, which lets you use the kubeflow-userid header to create and authenticate users in Grafana. I think for bridging MLflow to Kubeflow we should configure something like that to keep the multi-tenant aspect of Kubeflow intact. Or we can configure it with an Istio AuthorizationPolicy.
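A sketch of the Istio AuthorizationPolicy variant, assuming MLflow is only reachable through the mesh and that the Kubeflow ingress injects the kubeflow-userid header; the names and the allowed user are placeholders, and it could be applied the same way as the VirtualService sketched above (CustomObjectsApi, group security.istio.io, plural authorizationpolicies):

```python
# Sketch: only admit requests whose kubeflow-userid header (set by the
# Kubeflow ingress) matches an allowed user. Names and user are placeholders.
authorization_policy = {
    "apiVersion": "security.istio.io/v1beta1",
    "kind": "AuthorizationPolicy",
    "metadata": {"name": "mlflow-userid", "namespace": "mlflow"},
    "spec": {
        "selector": {"matchLabels": {"app": "mlflow"}},
        "action": "ALLOW",
        "rules": [{
            "when": [{
                "key": "request.headers[kubeflow-userid]",
                "values": ["alice@example.com"],
            }],
        }],
    },
}
```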

@jagane-infinstor
Author

@Madaditya @amolsr - there are two different usage models for MLflow integrated with Kubeflow. The first is what you have outlined - using a Helm chart to create an MLflow instance within the K8s cluster. The other is when the user has an external MLflow service such as Databricks, AzureML or InfinStor. In that case, the Kubeflow components, and the user code in pods created by Kubeflow components, need to be able to access the external MLflow service. It is important to make this work as well.

@vinaydel

vinaydel commented Sep 20, 2022

Two deployment models for MLflow make sense: embedded MLflow vs. managed MLflow.

@jbottum @jagane-infinstor maybe it's a data point / point of view on how MLflow and KF can play nicely with each other.

The way we are currently deploying/integrating with MLflow is the former model (as highlighted by @amolsr), and it is positioned as our experiment tracking tool of choice for the training step/stage of an end-to-end Kubeflow pipeline, primarily since MLflow has better UX around capturing and querying model stats.
With autologging support it is easier too, since we do not need to write custom KFP outputs to record obvious metrics/params with Kubeflow. As you well know, by virtue of its model registry, MLflow can also track lineage better between your run and your model. To make it easier for our end users, we have a poor man's way of tying a Kubeflow experiment to an MLflow experiment, and a Kubeflow pipeline run to an MLflow run, via hyperlinks we add during the KF run. This allows users to navigate back and forth between the two experiment tracking tools relatively seamlessly.

If one needs to serve, we have a custom KFP component that looks at the MLflow model registry artifact path and registers the KServe endpoint as well. So key touch points between MLflow and a KF pipeline could look like:
KF feature engineering step -> ... other steps ... -> KF training step running under an MLflow run context -> ... other steps ... -> KServe step that registers the model managed by MLflow
Naturally, we set up the MLflow context through PodDefaults so that any notebook or KFP component pod has all the env vars set up appropriately.
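As a rough illustration of the "training step running under an MLflow run context" pattern described above (not the exact component used here), a KFP v2 lightweight component with MLflow autologging might look like the sketch below; it assumes MLFLOW_TRACKING_URI is injected into the pod (e.g. via a PodDefault) and uses a toy scikit-learn model:

```python
# Sketch of a KFP training step that runs under an MLflow run context with
# autologging. Assumes MLFLOW_TRACKING_URI is injected into the pod (e.g.
# via a PodDefault); the model and dataset are toy placeholders.
from kfp import dsl


@dsl.component(base_image="python:3.10", packages_to_install=["mlflow", "scikit-learn"])
def train(alpha: float) -> str:
    import mlflow
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge

    mlflow.sklearn.autolog()                  # params, metrics and model logged automatically
    X, y = load_diabetes(return_X_y=True)
    with mlflow.start_run() as run:
        Ridge(alpha=alpha).fit(X, y)
        return run.info.run_id                # hand the MLflow run id to downstream steps


@dsl.pipeline(name="mlflow-tracked-training")
def pipeline(alpha: float = 0.1):
    train(alpha=alpha)
```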

@jbottum
Contributor

jbottum commented Sep 21, 2022

I appreciate the discussion and options. If we are going to have a Phase 1 for Kubeflow 1.7, I believe we will need to set up a review meeting soon (perhaps Thu, Sept 29) and include folks in this thread along with others, i.e. @benjamintanweihao @thesuperzapper @DomFleischmann @kimwnasptd @james-jwu @zijianjoy @richardsliu. Before that review, I think we should raise the topic in the Tuesday, Sept 27 Community Meeting, 8am PT. I would like to gather an initial view on:

  1. Is there significant user / distribution interest, especially for KF 1.7 Phase 1?
  2. Do we need to support multiple architectures, and can those customizations be developed and supported effectively, perhaps semi-independently (like KFP-Argo and KFP-Tekton)?
  3. Do we have teams who can sustain a strategic commitment, as I expect this is an XXL-sized feature that will take multiple releases?
  4. Can a team show a 5-10 minute prototype in the Community meeting (to help show the functional vision)?

@jagane-infinstor
Author

@jbottum - appreciate your setting the requirements for this. We are happy to make the presentation at the Sep 27, 2022 Community Meeting and to show a demonstration of some parts of this capability. Do you folks have a specific template for the presentation? I realize that we are probably going to be time-constrained, and I want to make the best use of everybody's time.

@aronchick
Contributor

I'm really excited for this - I've been hoping for the two communities to collaborate for some time :)

@jagane-infinstor
Author

I am attaching a project proposal pdf. Please note that this project has been renamed 'Concurrent' and the new website is available at https://concurrent-ai.org/

@jagane-infinstor
Author

Build a bridge between MLflow and Kubeflow #6647.pdf

@ca-scribner
Contributor

This looks really promising! I wonder how this interacts/overlaps with kubeflow pipelines, but in general this looks really nice and the demo was excellent. I'd love to contribute to the conversation going forward

@jagane-infinstor
Author

Thanks, Andrew. I will invite you to the 1 hour session that we discussed at the Community meeting this morning.

Off the top of my head, I would say the differences between KFP and Concurrent are the following:

  1. Concurrent uses MLflow wherever MLflow features are well proven in the field. This includes artifact management, experiment tracking and model management. KFP seems to have built-in solutions for these aspects. This means Concurrent necessarily has to make sure MLflow credentials are carried around everywhere, MLflow-related env vars are properly passed around, etc.
  2. Concurrent is focused on parallelization - dividing data up between parallel instances of a DAG node, making sure that the data is accessible to the parallel instances, making sure the DAG controller can create and manage the parallel instances, etc.
  3. Concurrent is designed to deal with multiple k8s clusters in a DAG.
  4. KFP may be better suited for live inference use cases. Concurrent's design center is not fleets of containers that sit behind a load balancer - it is pre-processing and running batch inference.

@terrytangyuan
Member

terrytangyuan commented Sep 29, 2022

There seems to be some overlap with KFP. KFP already uses Argo Workflows under the hood, which constructs K8s-native pipelines out of the box. Is there an FAQ page that illustrates the differences or relationships?

@jagane-infinstor
Author

We do have a FAQ page with some comparison with KFP here: https://docs.concurrent-ai.org/files/faq/
I would like to add the following details:

  • Concurrent uses MLflow in the areas where MLflow has strong user acceptance - experiment tracking and model management
  • Concurrent is designed for multi-kubernetes, i.e. running parts of the DAG in one k8s and other parts in another k8s, possibly located in a different region
  • Concurrent takes on the responsibility of data management; since we want to run parts of the DAG in different k8s clusters, it becomes our responsibility to ensure that the data is available in each of these locations
  • Concurrent is designed for users who are not experts in k8s, docker, etc. - Python knowledge is the only prerequisite. For example, one can use Concurrent without ever running kubectl - all logs are stashed away in MLflow as artifacts.

In summary, while there is overlap in the end goal, there are philosophical differences that result in a very different looking component and a very different end-user profile. I believe that KFP and Concurrent can co-exist in Kubeflow and serve users of different profiles.

@jagane-infinstor
Author

There seems to be some overlap with KFP. KFP already uses Argo Workflows under the hood, which constructs K8s-native pipelines out of the box. Is there an FAQ page that illustrates the differences or relationships?

Hello @terrytangyuan - to directly speak to your comment re. Argo Workflows, Concurrent is designed to use multiple kubernetes clusters, possibly distributed across the WAN. Argo Workflows is limited to a single k8s cluster, and a single k8s cluster cannot be distributed across the WAN since etcd uses the raft protocol for consensus, which is not suitable for use across the WAN.

Concurrent's design center is multiple k8s clusters across WAN links - stepping outside the bounds of a single k8s cluster enables us to use consensus algorithms that allow us to do this.

@terrytangyuan
Member

terrytangyuan commented Sep 30, 2022

Argo Workflows is limited to a single k8s cluster.

Multi-cluster support is on our roadmap. It's the top-voted issue, and we already have a working POC. argoproj/argo-workflows#3523

@jagane-infinstor
Author

Argo Workflows is limited to a single k8s cluster.

Multi-cluster support is on our roadmap. It's the top-voted issue, and we already have a working POC. argoproj/argo-workflows#3523

@terrytangyuan thanks for that pointer. This is within a single Region, i.e. no WAN links between clusters?

@juliusvonkohout
Member

@jagane-infinstor I am the annoying one from the meeting with all the security questions ;-)

@juliusvonkohout
Member

juliusvonkohout commented Sep 30, 2022

I think the main requirements are:

  1. standalone integration into Kubeflow, no external dependencies
  2. We need to be able to integrate it with the current testing infrastructure
  3. Integration into the Istio authorization layer with the kubeflow-userid header and/or ServiceAccount tokens
  4. Strict separation between different Kubeflow user namespaces
  5. The MLflow containers should reside only in the kubeflow namespace and be secured with Istio authorization policies.
  6. Only the Concurrent / KFP pipelines should run in the user namespaces (rootless of course)
  7. No root containers anywhere. Instead of a fuse mount you can just use an emptyDir or PVC and download the data to it. KServe can serve models that way, and KFP can run all pipelines, even those with lots of data processing, rootless.
  8. There should be a menu entry (iframe) in the Kubeflow sidebar that shows only the items allowed for the current namespace
  9. The API should of course also have the same authorization and namespace isolation
  10. You need to support the integrated MinIO as artifact storage. There are ongoing efforts to secure it: feat(backend): isolate artifacts per namespace/profile/user using only one bucket (pipelines#7725)

You also need to decide whether you want to support KFP v1 and/or v2. I would start with integrating MLflow for lineage, parameters and model tracking first - just as an alternative to the current Google ml-metadata. There should be a switch to select either MLflow or ml-metadata as the metadata backend in the cluster; KF pipelines executed via Argo should write the appropriate information to the selected endpoint only. This is where you would need to extend KFP to support MLflow too. This also implies some effort to integrate all the MLflow information into the KFP run pages.

After this is done you might think about implementing "concurrent" as a KFP component where you just input the mlflow specification, so this preprocessing can become a regular part of a pipeline. Please have a look at https://www.kubeflow.org/docs/components/pipelines/v2/author-a-pipeline/components/#3-custom-container-components
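To sketch what that last suggestion could look like with the KFP v2 custom container component API (the image name, entrypoint and parameters below are hypothetical, not an existing launcher):

```python
# Sketch of the suggestion above: wrap the launch of an MLflow Project as a
# KFP v2 custom container component. Image, CLI and parameters are
# placeholders; a real integration would define its own contract.
from kfp import dsl


@dsl.container_component
def run_mlflow_project(project_uri: str, experiment_name: str):
    return dsl.ContainerSpec(
        image="ghcr.io/example/mlflow-project-launcher:latest",  # placeholder image
        command=["python", "-m", "launcher"],                    # placeholder entrypoint
        args=["--project-uri", project_uri,
              "--experiment-name", experiment_name],
    )


@dsl.pipeline(name="mlflow-project-step")
def pipeline(project_uri: str, experiment_name: str = "default"):
    run_mlflow_project(project_uri=project_uri, experiment_name=experiment_name)
```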

@jagane-infinstor
Author

Here's the recording of the deep dive meeting held this morning:

https://youtu.be/mdTi9AVVTrc

@terrytangyuan
Member

There should be a switch to select either mlflow or ml-metadata as metadata backend in the cluster

ML-metadata already works well as a metadata backend. What are the additional benefits MLFlow brings (as metadata backend) that ml-metadata does not cover? MLFlow has a lot of dependencies. I'd imagine introducing a new metadata backend will add a lot of complexity and maintenance overhead.

From the meeting, you mentioned:

Make Concurrent an engine for KFP

It seems like a new project without much adoption and traction. I am not sure if it's worth adding the complexity to the codebase. Who owns that project? Is it vendor-neutral?

@juliusvonkohout
Member

juliusvonkohout commented Oct 4, 2022

I have an important excerpt from the slack channel:

"I am just wondering if it would be possible for KFP to open up more the initcontainer approach discussed by
@Julius von Kohout
for KFP sdk, so that we can have an easy way to do some root related
stuff inside pod directly, such as mount a s3fs-fuse partition without
using a privileged container."

That is already the case in KFP. Instead of ugly NON-rootless fuse, KFP uses proper PVCs and emptyDirs for S3/GCP data import and export. There is no need to change KFP stuff. Fuse needs root, so it is NOT allowed in serious enterprise environments in any kind of container, no matter whether that is an initcontainer or a sidecar. I spent a lot of time with a former Google KFP developer (@Bobgy on GitHub) to get rid of exactly such unnecessary root stuff by using a proper architecture. If you check the Kubeflow architecture you will understand that giving an initcontainer or sidecar root permissions means giving any user root permissions. And even if that were not the case, any serious company security policy would not allow this. I can only repeat myself: many other projects have done this successfully rootless, and MLflow can do so too.
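For illustration, the rootless pattern described above can be as simple as downloading the S3 inputs into an emptyDir- or PVC-backed path from inside the (non-root) step container; the bucket, key and mount path below are placeholders:

```python
# Sketch of the rootless pattern described above: no fuse mount, just
# download the S3 object into a path backed by an emptyDir or PVC that is
# mounted into the non-root container. Bucket, key and path are placeholders.
import os
import boto3

WORKSPACE = "/workspace"            # emptyDir or PVC mount, writable by a non-root user
BUCKET, KEY = "my-bucket", "datasets/train.parquet"

local_path = os.path.join(WORKSPACE, os.path.basename(KEY))
boto3.client("s3").download_file(BUCKET, KEY, local_path)
print(f"downloaded s3://{BUCKET}/{KEY} to {local_path}")
```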

"The customer image for KFP component (step in a pipeline) would be an issue. Very less DS people in our org are able to write Docker Files, even we have CI/CD pipeline for building custom image with GitLab setup.
We observed that the most DS team uses either custom image from Docker hub or Python Function Component to do the work. This approach allows us to export and import the pipeline everywhere in a KF system, on-prem or cloud."

Customer-specific images are not an issue. Having to build them yourself might be an issue. That is what I am describing above. MLflow must be compatible with arbitrary images from the depths of the internet (Docker Hub) as long as they have Python 3 installed and can install Python packages as non-root. That would be the same contract that KFP currently demands. If MLflow uses something other than Python, it can just inject a binary.

"The more custom image components involved in pipeline, means we also need to migrate the image into an image registry provided by the cloud vendor for cloud migration later on. We will also need to migrate all the secrets for image repositories."

I think you are using the wrong term. A custom image component just means a component that uses a custom image, i.e. specifying a non-default base image in your pipeline. It does not say anything about building it or managing registries. You want custom base images in your pipeline. You just do NOT want to build them at runtime, provide a Dockerfile for them, or manage a registry including secrets. I know that you mean the same as I do, but we really need to use the right and precise terms, otherwise it is too confusing for the other people following this thread. We really need to help InfinStor understand the problem precisely.

From the KFP documentation:"base_image – Optional. Specify a CUSTOM OCI container IMAGE to use in the component. For lightweight components, the image needs to have python 3.5+. Default is the python image corresponding to the current python environment."

Maybe you meant https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.containers.html by "custom image component". I would call this an "image builder component" in KFP. This is something old and ugly that is not usable in serious enterprise environments. All other Kubeflow components have moved away from stuff like this. Actually, the next step discussed with the kubeflow/manifests maintainers is getting rid of the remaining root in Istio by switching to istio-cni.

@juliusvonkohout
Member

juliusvonkohout commented Oct 4, 2022

There should be a switch to select either mlflow or ml-metadata as metadata backend in the cluster

ML-metadata already works well as a metadata backend. What are the additional benefits MLFlow brings (as metadata backend) that ml-metadata does not cover? MLFlow has a lot of dependencies. I'd imagine introducing a new metadata backend will add a lot of complexity and maintenance overhead.

From the meeting, you mentioned:

Make Concurrent an engine for KFP

It seems like a new project without much adoption and traction. I am not sure if it's worth adding the complexity to the codebase. Who owns that project? Is it vendor-neutral?

ML-metadata does not work well as a metadata backend. Please have a look at all the bugs here. The most important issues are: 1. it is not namespace-isolated, and 2. you cannot delete database entries.

If MLflow is willing to solve that, it would be a clear benefit. Also, outside of Kubeflow, MLflow is the dominant metadata backend. I would also appreciate it if ml-metadata got fixed, but so far that is blocked by upstream issues.

InfinStor asked me how to integrate Concurrent, so I gave them a proposal in the comments above. And yes, I think the same: InfinStor should implement Concurrent as a KFP component or compile it to a KFP pipeline. Adding a new backend is overkill. MLflow metadata tracking is a different story - that must be integrated directly into KFP as an alternative to ml-metadata. The good thing is that you can work separately on both tasks. There is no need to integrate all MLflow components at once.

@fvde

fvde commented Oct 26, 2022

We would absolutely LOVE this. Almost started doing something home cooked already....

@revolutionisme

Is there any further traction, or any design documents that may have been created? Would love to contribute if possible!

@AlexandreBrown

AlexandreBrown commented Feb 8, 2023

Hello @jbottum, do you know who could give an update on this?
Thanks

@jbottum
Contributor

jbottum commented Feb 8, 2023

@jagane-infinstor Hey Jagane - could you please provide a status on Concurrent and related activities? Thanks!

@juliusvonkohout
Member

I hope to upstream a multi-user-isolation MLflow implementation this year. Not Concurrent, just the normal MLflow stuff. But no guarantees at all.

@sofsms

sofsms commented Aug 9, 2023

Hello :)
Any update on this ?

Thanks.

@jagane-infinstor
Author

jagane-infinstor commented Aug 9, 2023 via email

@juliusvonkohout
Member

Hello :) Any update on this ?

Thanks.

Do you need help contributing?

@jagane-infinstor
Author

jagane-infinstor commented Aug 11, 2023 via email

@revolutionisme

Hey Jagane,

Authentication has recently been introduced in the latest version of open source MLflow (still experimental).

And at least from an initial look at the docs, the permissions can also be fine-grained for different APIs.

Maybe that helps somehow?
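For reference, a sketch of what the client side of that experimental auth looks like, assuming a server started with the built-in basic-auth app and the standard username/password environment variables; all values below are placeholders, and the feature may change since it is experimental:

```python
# Sketch: talking to an MLflow tracking server that runs the experimental
# built-in auth (roughly `mlflow server --app-name basic-auth`). The
# username/password are placeholders and would normally come from a Secret.
import os
import mlflow

os.environ["MLFLOW_TRACKING_USERNAME"] = "kubeflow-user"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<password-from-secret>"

mlflow.set_tracking_uri("https://mlflow.example.com")
with mlflow.start_run():
    mlflow.log_metric("auth_smoke_test", 1.0)
```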

@kobiche

kobiche commented Nov 11, 2023

Deploy MlFlow on Kubernetes using the helm chart and proxy it on the kubeflow-gateway's subpath. That way you have mlflow deployed on the same cluster and gateway as istio and can push metrics to it via its cluster ip

I recently found this tutorial: https://medium.com/dkatalis/kubeflow-with-mlflow-702cf2ebf3bf
To make the deployment easier, I used the MLflow image directly (ghcr.io/mlflow/mlflow:v2.7.1), with no dependency on an external DB as the backend store. The pod is running normally.

When I finished the tutorial, the Kubeflow dashboard could not load properly (the elements in the 'Quick shortcuts' section were missing). Any idea where the tutorial is failing?

@juliusvonkohout
Member

@jagane-infinstor I now have MLflow available, with a central database in the kubeflow namespace and separate credentials per namespace. We can start the per-namespace MLflow server via the Workbench/Workspace UI, so zero-overhead namespaces are still possible. But so far there has not been enough time to upstream it. Concurrent is another topic. Probably with the Workspace 2.0 overhaul we can take a closer look at upstreaming.

@rareddy

rareddy commented Dec 13, 2023

Another relevant effort on the subject here #7396

@juliusvonkohout
Member

/close

this belongs to kubeflow/manifests

Please reopen there if necessary and model registry is not enough


@juliusvonkohout: Closing this issue.

In response to this:

/close

this belongs to kubeflow/manifests

Please reopen there if necessary and model registry is not enough

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
