
[SDK] Get Job Pods Events #1863

Closed
andreyvelich opened this issue Jul 18, 2023 · 12 comments · Fixed by #1975

Comments

@andreyvelich
Member

Our SDK should provide better visibility for debugging and monitoring Training Operator Jobs for our users.
Users should not have to use kubectl to get information about their Training Jobs.

For example, a Pod might get stuck in the Pending/ContainerCreating status for a few minutes (especially if the image is huge or the Pod can't be scheduled to a Node), and the user has to use kubectl to understand why.
Therefore, the SDK should provide an API that exposes the Kubernetes Events of a Training Operator Job's Pods to give users better visibility.

I propose extending the get_job_logs() API to return Pod Events in addition to Pod logs.

For example, the output might look as follows:

$ client.get_job_logs(name, container="pytorch", is_master=False)

The events of Pod train-pytorch-master-0:
2023-07-14T13:21:11INFO Age: 0:04:20 Reason: Pulling Message: Pulling image "docker.io/pytorch/pytorch"
2023-07-14T13:21:11INFO Age: 0:00:20 Reason: Message: Successfully pulled image "docker.io/pytorch/pytorch" in 4m58.407916634s (4m58.407923916s including waiting) 
2023-07-14T13:21:11INFO Age: 0:00:18 Reason: Created Message: Created container pytorch
2023-07-14T13:21:11INFO Age: 0:00:18 Reason: Started Message: Started container pytorch
 
The logs of pod train-pytorch-master-0:
2023-07-14T13:21:11Z INFO     Start training for RANK: 0. WORLD_SIZE: 3
2023-07-14T13:21:11Z INFO     Train Epoch: 0 [0/60000 (0%)]	loss=2.3084
2023-07-14T13:21:12Z INFO     Train Epoch: 0 [420/60000 (1%)]	loss=2.2995

The events of Pod train-pytorch-worker-1:
2023-07-14T13:21:11INFO Age: 0:04:20 Reason: Pulled Message: Container image "docker.io/alpine:3.10" already present on machine
.....

The logs of pod train-pytorch-worker-1:
2023-07-14T13:21:11Z INFO     Start training for RANK: 0. WORLD_SIZE: 3
2023-07-14T13:21:11Z INFO     Train Epoch: 0 [0/60000 (0%)]	loss=2.3084
2023-07-14T13:21:12Z INFO     Train Epoch: 0 [420/60000 (1%)]	loss=2.2995
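
For illustration, here is a minimal sketch of how the SDK could collect such events with the Kubernetes Python client (the pod name, namespace, and helper name are only examples, not the final implementation):

from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

def print_pod_events(pod_name: str, namespace: str = "default"):
    # List the Kubernetes Events whose involved object is the given Pod.
    events = core_api.list_namespaced_event(
        namespace=namespace,
        field_selector=f"involvedObject.name={pod_name}",
    )
    for event in events.items:
        print(f"{event.last_timestamp} Reason: {event.reason} Message: {event.message}")

print_pod_events("train-pytorch-master-0")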

What do you think @kubeflow/wg-training-leads @tenzen-y @kuizhiqing ?

@terrytangyuan
Member

Good idea. +1

@tenzen-y
Member

It sounds good!

Also, at the same time, showing the events of the FrameworkJob (e.g., TFJob) at the top might be helpful.
@andreyvelich WDYT?

@johnugeorge
Member

Should this be a different API, for more clarity?

@kuizhiqing
Member

It would be helpful!

Maybe we should add a new API, get_job_events, to print the events of the Job and its Pods. And we could add a new argument with_events=False to the get_job_logs API to make it possible to get all of the information from the same API.
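
A rough sketch of what these two entry points might look like (the signatures below are only a suggestion; the existing get_job_logs arguments are elided as **kwargs):

class TrainingClient:
    # Hypothetical new API: return/print the events of the Job and its Pods.
    def get_job_events(self, name: str, namespace: str = "default"):
        ...

    # Hypothetical flag on the existing API: include events alongside the logs.
    def get_job_logs(self, name: str, namespace: str = "default",
                     with_events: bool = False, **kwargs):
        ...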

@andreyvelich
Member Author

Also, at the same time, showing the events of the FrameworkJob (e.g., TFJob) at the top might be helpful.

@tenzen-y That sounds good. The question is: how do we identify which Job the user created? get_job_logs doesn't have a job_type input argument; we just check which Pods have these labels:

training.kubeflow.org/job-name=my-job
training.kubeflow.org/job-role=master

The same labels could match Pods from multiple Job kinds (e.g. PyTorchJob, XGBoostJob).
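
For illustration, a sketch of this label-based Pod lookup with the Kubernetes Python client (namespace and job name are placeholders); nothing in the selector identifies the Job kind:

from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

# Select the master Pods of "my-job"; a PyTorchJob and an XGBoostJob with the
# same name would both match this selector.
pods = core_api.list_namespaced_pod(
    namespace="default",
    label_selector=(
        "training.kubeflow.org/job-name=my-job,"
        "training.kubeflow.org/job-role=master"
    ),
)
pod_names = [pod.metadata.name for pod in pods.items]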

That ties into my other question: if we are going to introduce a mandatory job_type argument to our get_job_logs API, should we follow the same pattern for all APIs?
E.g. instead of get_pytorchjob and get_tfjob, we would have a single API called get_job which takes job_type as a mandatory argument and returns the appropriate job based on this type.

We could do the same for all other APIs: create_job, create_job_from_func, get_job, delete_job, etc.
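
For illustration, such a unified call might look roughly like this (get_job with a job_type argument is only the proposal here, not an existing method):

from kubeflow.training import TrainingClient

client = TrainingClient()
# Hypothetical unified getter that would replace get_pytorchjob, get_tfjob, etc.
job = client.get_job(name="my-job", namespace="default", job_type="PyTorchJob")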

After refactoring our SDK in #1719, I noticed that it is very confusing for users that some of our CRUD operations are job-specific (e.g. create_tfjob) while others are not (e.g. get_job_pod_names, get_job_logs).
I can create a separate issue to discuss this, and I am going to provide more feedback from users soon.
cc @kubeflow/wg-training-leads

@andreyvelich
Member Author

andreyvelich commented Jul 19, 2023

Should this be a different API, for more clarity?

I am not sure that users who are not familiar with Kubernetes should need to know the difference between events and logs.
Usually, when Data Scientists create an ML Job, they want to check the logs from that job directly (e.g. run the get_job_logs API).
Otherwise, we would have to somehow explain to them that if the get_job_logs API fails, they should run the get_job_events API.
WDYT @johnugeorge?

And we could add a new argument with_events=False to the get_job_logs API to make it possible to get all of the information from the same API.

I like the idea @kuizhiqing. Which would be easier to understand: a with_events or a verbose flag?

@johnugeorge
Member

+1 Agree with you @andreyvelich

@tenzen-y
Member

The same labels could match Pods from multiple Job kinds (e.g. PyTorchJob, XGBoostJob).

That ties into my other question: if we are going to introduce a mandatory job_type argument to our get_job_logs API, should we follow the same pattern for all APIs?
E.g. instead of get_pytorchjob and get_tfjob, we would have a single API called get_job which takes job_type as a mandatory argument and returns the appropriate job based on this type.

We could do the same for all other APIs: create_job, create_job_from_func, get_job, delete_job, etc.

After refactoring our SDK in #1719, I noticed that it is very confusing for users that some of our CRUD operations are job-specific (e.g. create_tfjob) while others are not (e.g. get_job_pod_names, get_job_logs).
I can create a separate issue to discuss this, and I am going to provide more feedback from users soon.

@andreyvelich Thanks for the clarification.
I agree with you. Let's work on events for XXXJob in another issue.

@kuizhiqing
Member

Which would be easier to understand: a with_events or a verbose flag?

@andreyvelich Well, you're right, it would be better to use a verbose option without exposing the notion of Kubernetes events to the user.
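
For example, the eventual call could then look roughly like this (verbose is the proposed argument, not implemented yet):

$ client.get_job_logs(name, container="pytorch", is_master=False, verbose=True)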

@andreyvelich
Member Author

/assign @andreyvelich


github-actions bot commented Dec 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member Author

/assign @andreyvelich
