Improve docs for Training Operator 1.8 #1998

andreyvelich · 2024-01-25T21:06:40Z

On the recent AutoML and Training WG call, we discussed how we can improve the documentation for Training Operator and the onboarding for new contributors.

We identified several action items that we can work on before the release:

- Add a section "Why use Kubeflow Training Operator?" where we can explain user stories and how Training Operator can manage distributed training for various ML frameworks in a single place, so ML engineers can easily train their models using a unified operator.
- Add a detailed architecture diagram for Training Operator in addition to this one.
- Identify which docs should live on GitHub and which on the Kubeflow website.
- Automate SDK doc generation for TrainingClient; see the related issue in the Katib repo: [SDK] Generate Docs for Katib and Training Operator SDKs katib#2081

Please let me know if we should add something else @kubeflow/release-managers @kubeflow/wg-training-leads @tenzen-y @shashank-iitbhu.

tenzen-y · 2024-01-29T13:12:40Z

Thank you for raising this great issue!
Describing all features in the doc would be great.
For example, we don't have any doc for TFJob with enableDynamicWorker.

So, as a first iteration, we should identify which features have no documentation.

andreyvelich · 2024-02-05T14:11:04Z

cc @andreeamun

StefanoFioravanzo · 2024-04-18T11:28:38Z

@andreyvelich @tenzen-y As discussed, I looked into the training operator docs and I want to propose an initial refactoring to better align with best practices in how technical docs are organized.

A little premise to my proposal: in general, you want tech docs to be organized into macro sections that roughly address the following:

"Overview/Installation/GettingStarted"
"HowTOs/UserGuides"
"Reference" (anything from autogenerated API docs to architecture diagrams, implementation details, etc.)
"Explanation" (anything that explains in free form why the project made certain decisions, or that discusses the ecosystem, integrations, etc.)

In our case we may also want to consider a "Developer" section, particularly useful for OSS projects.

Now, I can see clear ways to improve the current doc structure to better align with that model. Here are some suggestions:

Split "Overview" into
- "Overview" - trimmed down to only contain an intro to the project, how it fits within the ecosystem, who should care and why
- "Getting Started" - one or two simple examples to experiment with the training operator. No explanation required, just something that works end to end
- "Installation" - particularly important for those who want to install without Kubeflow Platform
- Move the Architecture part to a new section "Reference"
Move "Job Scheduling" under a new section called "User Guides", with the name "Advanced Scheduling". The main page provides an overview, and then we have two child pages called "Volcano" and "Scheduler Plugins", respectively
Revisit each framework page with the following process:
1. Create a “<framework_name> Training” page under “User Guides” -> all the “how do I do something” content goes here
2. Create a “<framework_name>” under “Reference” -> all the CRD reference + implementation details go here.

This doesn't have to happen all in one PR, which is why I split it into sequential steps. Let me know what you think. We can start iterating on some of these points in draft PRs, and I am happy to get this started.

andreyvelich · 2024-04-20T00:55:50Z

Thank you so much for this @StefanoFioravanzo, I really like your ideas.
A few questions:

Should we order the Installation page before the Getting Started page, like in the Model Registry docs?
Do we want to separate guides for Users, Administrators, and Developers, like in the KServe docs or Jupyter docs, or can we do that in the next iteration?
- For example, initially we can move all guides to the User Guides.

> all the CRD reference + implementation details go here.

We don't have CRD reference right now, how should we split these sections?

@kubeflow/wg-training-leads what are your thoughts?

StefanoFioravanzo · 2024-04-22T13:09:23Z

@andreyvelich

> Should we order Installation before Getting Started page?

Yes, let's keep installation before getting started. It makes sense for folks who need to go through the installation before getting hands-on.

> Do we want to separate guides between Users, Administrators, and Developers

I am in favour of having additional grouping based on the persona. But, as a first step, I recommend limiting the amount of change. So, as you suggest, let's move all how-tos/guides to a generic "user guides" section. Once we go through this initial restructuring exercise, we can further refine.

> We don't have CRD reference right now, how should we split these sections?

I think we do. I think I saw some generic CRD reference for some of the frameworks. If we don't have enough details, we can still add a "TBD" under a framework's reference/API guide.

andreyvelich · 2024-04-22T22:28:25Z

@StefanoFioravanzo I think we have only this one: https://github.com/kubeflow/training-operator/blob/master/docs/api/kubeflow.org_v1_generated.asciidoc, but I am not sure if we keep this doc updated.
Is that right, @kubeflow/wg-training-leads?

StefanoFioravanzo · 2024-05-06T16:42:16Z

@andreyvelich since we merged kubeflow/website#3719, can we revisit the first comment of this issue? What do we want to address for training operator 1.8 (Kubeflow 1.9)?

andreyvelich · 2024-05-06T18:01:52Z

I think we completed all items as part of Kubeflow 1.9.
Let me close this issue.

andreyvelich added the release/1.8 label Jan 25, 2024

andreyvelich added this to the v0.8.0 Release milestone Jan 25, 2024

andreyvelich mentioned this issue Feb 14, 2024

Training: Add Distributed Training Diagrams kubeflow/website#3678

Merged

StefanoFioravanzo mentioned this issue Apr 16, 2024

Documentation for KF 1.9 kubeflow/website#3711

Open

10 tasks

StefanoFioravanzo mentioned this issue Apr 22, 2024

Documentation Improvements for Katib 0.17 kubeflow/katib#2314

Closed

2 tasks

This was referenced Apr 22, 2024

[Release] Training Operator 1.8 Roadmap #1994

Open

Training: Reorganized Training Operator Docs kubeflow/website#3719

Merged

StefanoFioravanzo mentioned this issue May 2, 2024

Improvements for Kubeflow Pipelines documentation kubeflow/website#3712

Open

17 tasks

andreyvelich closed this as completed May 6, 2024

StefanoFioravanzo mentioned this issue May 14, 2024

Update Kubeflow Installation with Standalone Mode kubeflow/website#3724

Merged
