
Implement a Reusable E2E Kubeflow ML Lifecycle #3728

Merged (15 commits) on Jun 11, 2024
The following files are removed in this pull request (binary previews and large diffs are not rendered):

- content/en/docs/images/argo-cd-partial-sync-ui.png
- content/en/docs/images/aws/OWNERS (5 deletions)
- content/en/docs/images/aws/alb-listener-rule.png
- content/en/docs/images/aws/alb-login.png
- content/en/docs/images/aws/auth0-callback-url.png
- content/en/docs/images/aws/auth0-github-setup.png
- content/en/docs/images/aws/auth0-login.png
- content/en/docs/images/aws/auth0-welcome-page.png
- content/en/docs/images/aws/authentication.png
- content/en/docs/images/aws/cognito-alb-domain.png
- content/en/docs/images/aws/cognito-appclient.png
- content/en/docs/images/aws/cognito-certarn.png
- content/en/docs/images/aws/cognito-custom-domain.png
- content/en/docs/images/aws/cognito-domain-error.jpg
- content/en/docs/images/aws/cognito-domain.png
- content/en/docs/images/aws/custom-domain-cname.png
- content/en/docs/images/aws/efs-create.png
- content/en/docs/images/aws/efs-volume.png
- content/en/docs/images/aws/external-mysql-rds.png
- content/en/docs/images/aws/fsx-assets.png
- content/en/docs/images/aws/fsx-create.png
- content/en/docs/images/aws/fsx-network.png
- content/en/docs/images/aws/kubeflow-main-page.png
- content/en/docs/images/aws/reference_architecture.svg (1 deletion)
- content/en/docs/images/aws/route53-a-record-auth.png
- content/en/docs/images/aws/route53-a-record.png
- content/en/docs/images/aws/route53-hosted-zone.png
- content/en/docs/images/aws/route53-record-sets.png
- content/en/docs/images/consent-screen.png
- content/en/docs/images/delete-deployment.png
- content/en/docs/images/deployments.png
- content/en/docs/images/gcp-e2e-ui-connect.png
- content/en/docs/images/gcp-e2e-ui-prediction.png
- content/en/docs/images/gke/full-kf-home.png
- content/en/docs/images/gsoc-icon-192.png
- content/en/docs/images/ibm-e2e-kubeflow.png
- content/en/docs/images/ibm/notebook-custom-image.png
- content/en/docs/images/jupyter-dashboard.png
- content/en/docs/images/kubeflow-deployment.png
- content/en/docs/images/kubeflow-gcp-e2e-tutorial.svg (1 deletion)
- content/en/docs/images/metadata-artifacts-list.png
- content/en/docs/images/metadata-dataset.png
- content/en/docs/images/metadata-metrics.png
- content/en/docs/images/metadata-model.png
- content/en/docs/images/metadata-ui-option.png
- content/en/docs/images/minikf-aws/OWNERS (4 deletions)
- content/en/docs/images/minikf-dashboard.png
- content/en/docs/images/minikf-deploy.png
- content/en/docs/images/minikf-info.png
- content/en/docs/images/minikf-kubeflow.png
- content/en/docs/images/minikf-launch.png
- content/en/docs/images/minikf-login.png
- content/en/docs/images/minikf-ssh.png
- content/en/docs/images/minikf-up.png
- content/en/docs/images/nutanix/OWNERS (8 deletions)
- content/en/docs/images/nutanix/objects_browser.png
- content/en/docs/images/oauth-credential.png
- content/en/docs/images/oauth-edit.png
- content/en/docs/images/pipelines-mnist-graph.png
- content/en/docs/images/pipelines-mnist-logs.png
- content/en/docs/images/pipelines-mnist-run-list.png
- content/en/docs/images/pipelines-mnist-running.png
- content/en/docs/images/pipelines-mnist-uploaded.png
- content/en/docs/images/pipelines-sdk-lightweight.svg (1 deletion)
- content/en/docs/images/pipelines-sdk-outside-app.svg (1 deletion)
- content/en/docs/images/pipelines-sdk-reusable.svg (1 deletion)
- content/en/docs/images/pipelines-sdk-within-app.svg (1 deletion)
- content/en/docs/images/pipelines-start-mnist-run.png
- content/en/docs/images/pipelines-upload.png
- content/en/docs/images/pipelines-uploading.png
- content/en/docs/images/version-dropdown.jpg
- content/en/docs/images/view-contributors.png
194 changes: 98 additions & 96 deletions content/en/docs/started/architecture.md
@vikas-saxena02 (Contributor) commented on Jun 9, 2024:

In line 40, the definition for Data Preparation can be reworded to say:

> In the Data Preparation step you ingest/raw data and transfer it to perform feature engineering to extract ML features for the offline feature store, and prepare training data for model development. Usually, this step is associated with data processing tools such as Spark, Dask, Flink, or Ray.

Member Author:

What do you mean by "you ingest/raw data"?

Contributor:

Sorry, that was a typo.

Member Author:

I guess the idea of this statement is to say that you use Spark to ingest raw data and process it.

@@ -4,142 +4,144 @@
description = "An overview of Kubeflow's architecture"
weight = 10
+++

<!--
Note for authors: The source of the diagrams is held in Google Slides decks,
in the "Doc diagrams" folder in the public Kubeflow shared drive.
-->

This guide introduces Kubeflow as a platform for developing and deploying a
machine learning (ML) system.

Kubeflow is a platform for data scientists who want to build and experiment with
ML pipelines. Kubeflow is also for ML engineers and operational teams who want
to deploy ML systems to various environments for development, testing, and
production-level serving.
This guide introduces the Kubeflow ecosystem and explains how Kubeflow components fit into the ML lifecycle.

## Conceptual overview
Read [the introduction guide](/docs/started/introduction) to learn more about Kubeflow, standalone
Kubeflow components, and the Kubeflow Platform.

Kubeflow is *the ML toolkit for Kubernetes*.
## Kubeflow Ecosystem

The following diagram shows Kubeflow as a platform for arranging the
components of your ML system on top of Kubernetes:
The following diagram gives an overview of the Kubeflow Ecosystem and how it relates to the wider
Kubernetes and AI/ML landscapes.

<img src="/docs/started/images/kubeflow-architecture.drawio.svg"
alt="An architectural overview of Kubeflow on Kubernetes"
class="mt-3 mb-3 border border-info rounded">

Kubeflow builds on [Kubernetes](https://kubernetes.io/) as a system for
deploying, scaling, and managing complex systems.
class="mt-3 mb-3">

Using the Kubeflow configuration interfaces (see [below](#interfaces)) you can
specify the ML tools required for your workflow. Then you can deploy the
workflow to various clouds, local, and on-premises platforms for experimentation and
for production use.
Kubeflow builds on [Kubernetes](https://kubernetes.io/) as a system for
deploying, scaling, and managing AI/ML infrastructure.

## Introducing the ML workflow
## Introducing the ML Lifecycle

When you develop and deploy an ML system, the ML workflow typically consists of
several stages. Developing an ML system is an iterative process.
You need to evaluate the output of various stages of the ML workflow, and apply
changes to the model and parameters when necessary to ensure the model keeps
When you develop and deploy an AI application, the ML lifecycle typically consists of
several stages. Developing an ML system is an iterative process.
You need to evaluate the output of various stages of the ML lifecycle, and apply
changes to the model and parameters when necessary to ensure the model keeps
producing the results you need.

For the sake of simplicity, the following diagram
shows the workflow stages in sequence. The arrow at the end of the workflow
points back into the flow to indicate the iterative nature of the process:
The following diagram shows the ML lifecycle stages in sequence:

<img src="/docs/images/kubeflow-overview-workflow-diagram-1.svg"
alt="A typical machine learning workflow"
class="mt-3 mb-3 border border-info rounded">
<img src="/docs/started/images/ml-lifecycle.drawio.svg"
alt="ML Lifecycle"
class="mt-3 mb-3">

Looking at the stages in more detail:

* In the experimental phase, you develop your model based on initial
assumptions, and test and update the model iteratively to produce the
results you're looking for:

* Identify the problem you want the ML system to solve.
* Collect and analyze the data you need to train your ML model.
* Choose an ML framework and algorithm, and code the initial version of your
model.
* Experiment with the data and with training your model.
* Tune the model hyperparameters to ensure the most efficient processing and the
most accurate results possible.

* In the production phase, you deploy a system that performs the following
processes:

* Transform the data into the format that your training system needs.
To ensure that your model behaves consistently during training and
prediction, the transformation process must be the same in the experimental
and production phases.
* Train the ML model.
* Serve the model for online prediction or for running in batch mode.
* Monitor the model's performance, and feed the results into your processes
for tuning or retraining the model.

## Kubeflow components in the ML workflow

The next diagram adds Kubeflow to the workflow, showing which Kubeflow
components are useful at each stage:

<img src="/docs/images/kubeflow-overview-workflow-diagram-2.svg"
alt="Where Kubeflow fits into a typical machine learning workflow"
class="mt-3 mb-3 border border-info rounded">
- In the _Data Preparation_ step you ingest raw data, perform feature engineering to extract ML
features for the offline feature store, and prepare training data for model development.
Usually, this step is associated with data processing tools such as Spark, Dask, Flink, or Ray.

To learn more, read the following guides to the Kubeflow components:
- In the _Model Development_ step you choose an ML framework, develop your model architecture, and
explore existing pre-trained models, such as BERT or Llama, for fine-tuning.

* Kubeflow includes services for spawning and managing
[Jupyter notebooks](/docs/components/notebooks/). Use notebooks for interactive data
science and experimenting with ML workflows.
- In the _Model Optimization_ step you can tune your model hyperparameters and optimize your model
with various AutoML algorithms, such as neural architecture search and model compression.
During model optimization you can store ML metadata in the _Model Registry_.

* [Kubeflow Pipelines](/docs/components/pipelines/) is a platform for
building, deploying, and managing multi-step ML workflows based on Docker
containers.
- In the _Model Training_ step you train or fine-tune your model in a large-scale compute
environment. You should use distributed training if a single GPU can't handle your model size.
The result of model training is a trained model artifact that you can store in the
_Model Registry_.

* Kubeflow offers several [components](/docs/components/) that you can use
to build your ML training, hyperparameter tuning, and serving workloads across
multiple platforms.
- In the _Model Serving_ step you serve your model artifact for online or batch inference. Your
model may perform predictive or generative AI tasks depending on the use case. During the model
serving step you may use an online feature store to extract features. You monitor the model's
performance and feed the results back into previous steps of the ML lifecycle.
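As an illustration of the _Data Preparation_ step described above, here is a minimal PySpark feature engineering sketch (Spark being one of the tools mentioned); the storage paths and column names are hypothetical placeholders, not taken from the Kubeflow documentation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; on Kubernetes this would typically be launched
# through the Kubeflow Spark Operator rather than locally.
spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

# Ingest raw data (hypothetical path) and derive simple per-customer features.
raw = spark.read.parquet("s3a://example-bucket/raw/orders/")
features = raw.groupBy("customer_id").agg(
    F.count("*").alias("order_count"),
    F.avg("order_total").alias("avg_order_total"),
)

# Persist the features for the offline feature store and training data preparation.
features.write.mode("overwrite").parquet("s3a://example-bucket/features/customers/")
spark.stop()
```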

## Example of a specific ML workflow
### ML Lifecycle for Production and Development Phases

The following diagram shows a simple example of a specific ML workflow that you
can use to train and serve a model trained on the MNIST dataset:
The ML lifecycle for AI applications may be conceptually split between _development_ and
_production_ phases. The following diagram shows which stages fit into each phase:

<img src="/docs/images/kubeflow-gcp-e2e-tutorial-simplified.svg"
alt="ML workflow for training and serving an MNIST model"
class="mt-3 mb-3 border border-info rounded">
<img src="/docs/started/images/ml-lifecycle-dev-prod.drawio.svg"
alt="ML Lifecycle with Development and Production"
class="mt-3 mb-3">

### Kubeflow Components in the ML Lifecycle

The next diagram shows how Kubeflow components are used for each stage in the ML lifecycle:

<img src="/docs/started/images/ml-lifecycle-kubeflow.drawio.svg"
alt="Kubeflow Components in ML Lifecycle"
class="mt-3 mb-3">

See the following links for more information about each Kubeflow component:

For details of the workflow and to run the system yourself, see the
[end-to-end tutorial for Kubeflow on GCP](https://github.com/kubeflow/examples/tree/master/mnist#mnist-on-kubeflow-on-gcp).
- [Kubeflow Spark Operator](https://github.com/kubeflow/spark-operator) can be used for the data
preparation and feature engineering steps.

<a id="interfaces"></a>
## Kubeflow interfaces
- [Kubeflow Notebooks](/docs/components/notebooks/) can be used for model development and interactive
data science to experiment with your ML workflows.

- [Kubeflow Katib](/docs/components/katib/) can be used for model optimization and hyperparameter
tuning using various AutoML algorithms.

- [Kubeflow Training Operator](/docs/components/training/) can be used for large-scale distributed
training or fine-tuning.

- [Kubeflow Model Registry](/docs/components/model-registry/) can be used to store ML metadata and
model artifacts, and to prepare models for production serving.

- [KServe](https://kserve.github.io/website/master/) can be used for online and batch inference
in the model serving step.

- [Feast](https://feast.dev/) can be used as a feature store and to manage offline and online
features.

- [Kubeflow Pipelines](/docs/components/pipelines/) can be used to build, deploy, and manage each
step in the ML lifecycle.
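To make the KServe entry in the list above more concrete, here is a hedged sketch of creating a KServe `InferenceService` with the KServe Python client; the namespace and storage URI are illustrative assumptions, and field names may vary between KServe versions.

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1SKLearnSpec,
)

# Describe a simple scikit-learn InferenceService (illustrative values only).
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(
        name="sklearn-iris",
        namespace="kubeflow-user-example-com",  # assumed user namespace
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"
            )
        )
    ),
)

# Submit the InferenceService to the cluster.
KServeClient().create(isvc)
```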

You can use most Kubeflow components as
[standalone tools](/docs/started/introduction/#what-are-standalone-kubeflow-components) and
integrate them into your existing AI/ML Platform, or you can deploy the full
[Kubeflow Platform](/docs/started/introduction/#what-is-kubeflow-platform) to get all Kubeflow
components for an end-to-end ML lifecycle.
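As a sketch of how Kubeflow Pipelines (listed above) can chain lifecycle steps together, the following minimal KFP v2 pipeline wires a placeholder data preparation component into a placeholder training component; the component bodies, names, and artifact URIs are illustrative assumptions, not taken from the Kubeflow docs.

```python
from kfp import compiler, dsl


@dsl.component
def prepare_data() -> str:
    # Placeholder for the data preparation step.
    return "s3://example-bucket/datasets/train"


@dsl.component
def train_model(dataset_uri: str) -> str:
    # Placeholder for the model training step.
    print(f"Training on {dataset_uri}")
    return "s3://example-bucket/models/demo"


@dsl.pipeline(name="ml-lifecycle-sketch")
def ml_lifecycle_pipeline():
    data_task = prepare_data()
    train_model(dataset_uri=data_task.output)


if __name__ == "__main__":
    # Compile to a pipeline package that can be uploaded to Kubeflow Pipelines.
    compiler.Compiler().compile(ml_lifecycle_pipeline, "ml_lifecycle_pipeline.yaml")
```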

## Kubeflow Interfaces

This section introduces the interfaces that you can use to interact with
Kubeflow and to build and run your ML workflows on Kubeflow.

### Kubeflow user interface (UI)
### Kubeflow User Interface (UI)

The Kubeflow UI looks like this:

<img src="/docs/images/central-ui.png"
alt="The Kubeflow UI"
class="mt-3 mb-3 border border-info rounded">

The UI offers a central dashboard that you can use to access the components
of your Kubeflow deployment. Read
[how to access the central dashboard](/docs/components/central-dash/overview/).
The Kubeflow Platform includes the [Kubeflow Central Dashboard](/docs/components/central-dash/overview/),
which acts as a hub for your ML platform and tools by exposing the UIs of components running in the
cluster.

### Kubeflow APIs and SDKs

<!--
TODO (andreyvelich): Add reference docs once this issue is implemented: https://github.com/kubeflow/katib/issues/2081
-->

## Kubeflow APIs and SDKs
Various components of Kubeflow offer APIs and Python SDKs.

Various components of Kubeflow offer APIs and Python SDKs. See the following
sets of reference documentation:
See the following sets of reference documentation:

* [Pipelines reference docs](/docs/components/pipelines/reference/) for the Kubeflow
- [Pipelines reference docs](/docs/components/pipelines/reference/) for the Kubeflow
Pipelines API and SDK, including the Kubeflow Pipelines domain-specific
language (DSL).
- [Training Operator Python SDK](https://github.com/kubeflow/training-operator/blob/86e0df17db715543b366e885c9ae659aa1342c8e/sdk/python/kubeflow/training/api/training_client.py)
to manage Training Operator jobs using Python APIs.
- [Katib Python SDK](https://github.com/kubeflow/katib/blob/086093fed72610c227e3ae1b4044f27afa940852/sdk/python/v1beta1/kubeflow/katib/api/katib_client.py)
to manage Katib hyperparameter tuning Experiments using Python APIs.
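For orientation, here is a hedged sketch of using the Training Operator and Katib Python SDKs referenced above to list existing resources; client method names and arguments can differ between SDK releases, and the namespace is an illustrative assumption.

```python
from kubeflow.training import TrainingClient
from kubeflow.katib import KatibClient

# List training jobs managed by the Training Operator
# (argument names such as job_kind may vary by SDK version).
training_client = TrainingClient()
for job in training_client.list_jobs(
    namespace="kubeflow-user-example-com", job_kind="PyTorchJob"
):
    print(job.metadata.name)

# List Katib hyperparameter tuning Experiments in the same namespace.
katib_client = KatibClient()
for experiment in katib_client.list_experiments(namespace="kubeflow-user-example-com"):
    print(experiment.metadata.name)
```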

## Next steps

* Follow [Installing Kubeflow](/docs/started/installing-kubeflow/) to set up your environment and install Kubeflow.
- Follow [Installing Kubeflow](/docs/started/installing-kubeflow/) to set up your environment and install Kubeflow.
Contributor:

The only persona shown here is ML Engineer, which in my opinion is not correct, as Data Preparation can be done by a Data Engineer. Similarly, Model Development, Hyperparameter Tuning, and Model Training can/will be done by a Data Scientist.
My suggestion would be to remove the ML Engineer persona or show other personas as well.
Also, I would suggest splitting the Model Serving box in two, i.e. Model Serving and Model Monitoring/Drift Detection, as KServe has components to do that.

Contributor:

> The only persona shown here is ML Engineer, which in my opinion is not correct, as Data Preparation can be done by a Data Engineer. Similarly, Model Development, Hyperparameter Tuning, and Model Training can/will be done by a Data Scientist.

This varies heavily by company. I've worked at many places where an MLE does this, fwiw.

I added the persona to highlight explicitly how an ideal user should think about this workflow. Maybe this could be amended to add more personas, though I worry about clarity.

Contributor:

@franciscojavierarceo these are my thoughts as well; this gets political about who does what, as there is no simple answer, hence I was wondering whether we should get into personas at all.

Contributor:

Yeah, that makes sense. I definitely understand how it can be a rabbit hole. I am generally customer-centric, so my goal was really just to elicit the value proposition for people who are quickly thinking "why should I, as someone who builds models, care about Kubeflow?"

Member Author:

Yes, the main goal and motivation of this page is to explain the value of the Kubeflow ecosystem to our users.


Contributor:

The only persona shown here is ML Engineer, which in my opinion is not correct, as Data Preparation can be done by a Data Engineer. Similarly, Model Development, Hyperparameter Tuning, and Model Training can/will be done by a Data Scientist.
My suggestion would be to remove the ML Engineer persona or show other personas as well.
Also, I would suggest splitting the Model Serving box in two, i.e. Model Serving and Model Monitoring/Drift Detection, as KServe has components to do that.

Comment:

I think the issue here is that the lines are blurred, and there is no prescriptive authority on how this works. What I would do is call that out. "To scale, you have to specialize," but right now MLOps (and Kubeflow) are incubating, so the average user wears many hats. If an MLE, a data engineer, or a computer engineer wants to do data prep, nothing stops them as long as they aren't leaving other work untouched. Ultimately this is a business and engineering management conversation.

Contributor:

+1 @chasecadet

As mentioned in another comment, I've worked at several places where the MLE was responsible for all of this.

Contributor:

@chasecadet @franciscojavierarceo the question is not who does what, as that is very subjective; the question is whether we should get into personas at all.

Contributor:

Yeah, that makes sense.

Really, I just wanted to provide high-level clarity about the value proposition of Kubeflow for MLEs or data scientists or whatever they're called this week.

Comment:

@vikas-saxena02 @franciscojavierarceo THIS IS GREAT. So here is the philosophical/KF-values question. My biggest power as a solutions architect is saying "my customers commonly do XYZ". So we need to decide: are we doing this textbook-style ("this is the world we live in"), where we need to point to an authority (@andreyvelich and I were discussing "whose ML lifecycle are we referencing"), or do we make this more community- and experience-based, where we say "we commonly see MLEs within the Kubeflow community leverage these tools, aligned to what we have defined as the ML lifecycle based on community feedback", etc.? Andrey mentioned that the ML lifecycle we are using was sourced from the CNCF white paper by other professionals who worked to define it. That is totally fine, but we need to give the lineage of our information, call out when it can be considered subjective, and frame what we are defining as something based on what we have seen and agreed upon in our community (something that is powerful but not necessarily the be-all and end-all), and explain how new users can align themselves to it.

We can also provide a place to discuss and challenge our ML lifecycle opinions, but if we say "we commonly see data engineers using X", then it's not necessarily us telling you what to do, but mentioning what we have seen so far and opening the door to new perspectives. This also helps us stay out of people's scopes if they say "well, the KF community said this is an MLE tool, so I didn't use it for data engineering and/or told off my data engineer". We have to be careful when we are being prescriptive, because we could be liable and lose credibility as a community.

If this is our "current world view, open for discussion/growth", we invite discussion and contribution instead of enforcing our world view. That being said, we can 1000% defend our viewpoint as we continue to gather data and understand how organizations do MLOps with KF, and not just let anyone reinvent the lifecycle, while still keeping the door open in case someone has a perspective that makes sense for the community to discuss or adopt.

Contributor:

Yeah, I think it's a great idea to call out that, in practice, the lines end up blurry between DE/MLE/DS at some orgs versus others.

I definitely welcome feedback and iteration on this! I think having this guidance is very useful, though, as it can provide a lot more clarity to the end user about why an MLOps team may be recommending Kubeflow.

Andrej and I drafted this based on the CNCF diagram and modified it a little bit, but, again, the language around personas across the industry is pretty fuzzy, so I think sharing it with an asterisk is very helpful. It would also be valuable to hiring managers/executives that are trying to make staffing decisions but may not have a nuanced view of things.

Member Author:

I generally agree with these points @chasecadet, but again, it is out of scope for this PR. This PR just explains the value of Kubeflow components in the ML lifecycle, and of course you can integrate other components from the AI/ML landscape into your AI/ML infra.

We can always iterate and improve our architecture page if we agree on it with the Kubeflow community.


4 changes: 4 additions & 0 deletions content/en/docs/started/images/ml-lifecycle.drawio.svg
Contributor:

The only persona shown here is ML Engineer, which in my opinion is not correct, as Data Preparation can be done by a Data Engineer. Similarly, Model Development, Hyperparameter Tuning, and Model Training can/will be done by a Data Scientist.
My suggestion would be to remove the ML Engineer persona.

Member Author:

That's right, but in different use cases Data Processing can be done by ML Engineers, especially when Spark is integrated with Jupyter Notebooks.
This is just an example of the ML lifecycle; I am not sure we can cover all use cases and personas here.
WDYT @StefanoFioravanzo @franciscojavierarceo @hbelmiro?

Member:

I understand that data preparation is done by data engineers, but considering we need to show an e2e flow that covers all Kubeflow components, and we just brought the Spark Operator into the ecosystem, we should cover data preparation too.

Contributor:

> The only persona shown here is ML Engineer, which in my opinion is not correct, as Data Preparation can be done by a Data Engineer. Similarly, Model Development, Hyperparameter Tuning, and Model Training can/will be done by a Data Scientist.

This varies heavily by company. I've worked at many places where an MLE does this, fwiw.

I added the persona to highlight explicitly how an ideal user should think about this workflow. Maybe this could be amended to add more personas, though I worry about clarity.

#3728 (comment)

Contributor:

> I understand that data preparation is done by data engineers, but considering we need to show an e2e flow that covers all Kubeflow components, and we just brought the Spark Operator into the ecosystem, we should cover data preparation too.

@rimolive @andreyvelich I am 100% with you on that, and the answer to this depends on the org structure or the MLOps literature one follows. My question really is whether, from a tool/platform perspective, we should be putting personas in the documentation, as a lot of this is a grey area. Also, given the Spark Operator is fully onboarded into Kubeflow, should we put it in the main architecture diagram or not? I have put this as a comment on the main PR as well.

Member Author:

From my perspective this is out of scope for this PR. This PR is an initial change to the architecture page to make sure our lifecycle diagrams represent an up-to-date view of the Kubeflow components.

Also, the CNCF white paper already has a personas explanation, which might be useful for orgs that are looking at Kubernetes as the primary platform for AI/ML infra: https://www.cncf.io/wp-content/uploads/2024/03/cloud_native_ai24_031424a-2.pdf
cc @zanetworker @ronaldpetty @raravena80

@vikas-saxena02 (Contributor) commented on Jun 9, 2024:

I would also suggest splitting the Model Serving box in two, i.e. Model Serving and Model Monitoring/Drift Detection, as KServe has components to do that.

Member Author:

Model Monitoring and Drift Detection are part of model serving from my point of view. If we want to split this block, we should say Online Inference vs. Batch Inference, but I am not sure we need to explain such details.
It's like with Spark: you can do Data Ingestion, Data Processing, Feature Engineering, etc., but we haven't explained everything in this lifecycle diagram.

I hope that more detailed diagrams can be shown in the KServe docs.

Contributor:

@andreyvelich as a consultant, I can vouch that not many people know that KServe has drift detection capabilities, hence my request to put it there.

Member Author:

That's right, and that is why they should explore the individual components' docs for it.
E.g., if you know that you need a model serving component for your AI/ML infra, you will explore the KServe docs.

It is just impossible to show everything in this end-to-end ML lifecycle diagram.


12 changes: 5 additions & 7 deletions content/en/docs/started/introduction.md
@@ -41,21 +41,19 @@
The Kubeflow Platform can be installed via
[Packaged Distributions](/docs/started/installing-kubeflow/#packaged-distributions) or
[Kubeflow Manifests](/docs/started/installing-kubeflow/#kubeflow-manifests).

## Getting started with Kubeflow
## Kubeflow Overview Diagram

The following diagram shows the main Kubeflow components to cover each step of ML lifecycle
The following diagram shows the main Kubeflow components to cover each stage of the ML lifecycle
on top of Kubernetes.

<img src="/docs/started/images/kubeflow-intro-diagram.drawio.svg"
alt="Kubeflow overview"
class="mt-3 mb-3">

Read the [architecture overview](/docs/started/architecture/) for an
introduction to the architecture of Kubeflow and to see how you can use Kubeflow
to manage your ML workflow.
Read the [architecture overview](/docs/started/architecture/) to learn about the Kubeflow ecosystem
and to see how Kubeflow components fit into the ML lifecycle.

Follow [Installing Kubeflow](/docs/started/installing-kubeflow/) to set up
your environment and install Kubeflow.
## Kubeflow Video Introduction

Watch the following video, which provides an introduction to Kubeflow.
