
[SDK] Get Job Pods Events #1863

Closed
andreyvelich opened this issue Jul 18, 2023 · 12 comments · Fixed by #1975

Comments

@andreyvelich
Member

Our SDK should provide better visibility for debugging and monitoring Training Operator Jobs for our users.
Users should not have to use kubectl to get information about their Training Jobs.

For example, a Pod might get stuck in the Pending/ContainerCreating status for a few minutes (especially if the image is huge or the Pod can't be scheduled to a Node), and the user has to use kubectl to understand why.
Therefore, the SDK should provide an API that exposes the Kubernetes Events of a Training Operator Job's Pods to give users better visibility.

I propose extending the get_job_logs() API to return Pod Events in addition to Pod logs.

For example, the output might look as follows:

$ client.get_job_logs(name, container="pytorch", is_master=False)

The events of Pod train-pytorch-master-0:
2023-07-14T13:21:11INFO Age: 0:04:20 Reason: Pulling Message: Pulling image "docker.io/pytorch/pytorch"
2023-07-14T13:21:11INFO Age: 0:00:20 Reason: Message: Successfully pulled image "docker.io/pytorch/pytorch" in 4m58.407916634s (4m58.407923916s including waiting) 
2023-07-14T13:21:11INFO Age: 0:00:18 Reason: Created Message: Created container pytorch
2023-07-14T13:21:11INFO Age: 0:00:18 Reason: Started Message: Started container pytorch
 
The logs of pod train-pytorch-master-0:
2023-07-14T13:21:11Z INFO     Start training for RANK: 0. WORLD_SIZE: 3
2023-07-14T13:21:11Z INFO     Train Epoch: 0 [0/60000 (0%)]	loss=2.3084
2023-07-14T13:21:12Z INFO     Train Epoch: 0 [420/60000 (1%)]	loss=2.2995

The events of Pod train-pytorch-worker-1:
2023-07-14T13:21:11INFO Age: 0:04:20 Reason: Pulled Message: Container image "docker.io/alpine:3.10" already present on machine
.....

The logs of pod train-pytorch-worker-1:
2023-07-14T13:21:11Z INFO     Start training for RANK: 0. WORLD_SIZE: 3
2023-07-14T13:21:11Z INFO     Train Epoch: 0 [0/60000 (0%)]	loss=2.3084
2023-07-14T13:21:12Z INFO     Train Epoch: 0 [420/60000 (1%)]	loss=2.2995
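
For illustration, here is a minimal sketch of how the SDK could collect such events with the Kubernetes Python client (the pod name, namespace, and helper name are only examples, not the final implementation):

from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

def print_pod_events(pod_name: str, namespace: str = "default"):
    # List the Kubernetes Events whose involved object is the given Pod.
    events = core_api.list_namespaced_event(
        namespace=namespace,
        field_selector=f"involvedObject.name={pod_name}",
    )
    for event in events.items:
        print(f"{event.last_timestamp} Reason: {event.reason} Message: {event.message}")

print_pod_events("train-pytorch-master-0")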

What do you think @kubeflow/wg-training-leads @tenzen-y @kuizhiqing ?

@terrytangyuan
Member

Good idea. +1

@tenzen-y
Member

It sounds good!

Also, at the same time, showing the events of the FrameworkJob (e.g., TFJob) at the top might be helpful.
@andreyvelich WDYT?

@johnugeorge
Member

Should this be a different API, for more clarity?

@kuizhiqing
Member

It would be helpful!

Maybe we should add a new API, get_job_events, to print the events of the Job and its Pods. And we could add a new argument with_events=False to the get_job_logs API to make it possible to get all of the information from the same API.
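
A rough sketch of what these two entry points might look like (the signatures below are only a suggestion; the existing get_job_logs arguments are elided as **kwargs):

class TrainingClient:
    # Hypothetical new API: return/print the events of the Job and its Pods.
    def get_job_events(self, name: str, namespace: str = "default"):
        ...

    # Hypothetical flag on the existing API: include events alongside the logs.
    def get_job_logs(self, name: str, namespace: str = "default",
                     with_events: bool = False, **kwargs):
        ...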

@andreyvelich
Member Author

Also, at the same time, showing the events of the FrameworkJob (e.g., TFJob) at the top might be helpful.

@tenzen-y That sounds good. The question is: how do we identify which Job the user created? get_job_logs doesn't have a job_type input argument; we just check which Pods have these labels:

training.kubeflow.org/job-name=my-job
training.kubeflow.org/job-role=master

The same labels could match Pods from multiple Job kinds (e.g. PyTorchJob, XGBoostJob).
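
For illustration, a sketch of this label-based Pod lookup with the Kubernetes Python client (namespace and job name are placeholders); nothing in the selector identifies the Job kind:

from kubernetes import client, config

config.load_kube_config()
core_api = client.CoreV1Api()

# Select the master Pods of "my-job"; a PyTorchJob and an XGBoostJob with the
# same name would both match this selector.
pods = core_api.list_namespaced_pod(
    namespace="default",
    label_selector=(
        "training.kubeflow.org/job-name=my-job,"
        "training.kubeflow.org/job-role=master"
    ),
)
pod_names = [pod.metadata.name for pod in pods.items]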

That ties into my other question: if we are going to introduce a mandatory job_type argument to our get_job_logs API, should we follow the same pattern for all APIs?
E.g. instead of get_pytorchjob and get_tfjob, we would have a single API called get_job which takes job_type as a mandatory argument and returns the appropriate job based on this type.

We could do the same for all other APIs: create_job, create_job_from_func, get_job, delete_job, etc.
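
For illustration, such a unified call might look roughly like this (get_job with a job_type argument is only the proposal here, not an existing method):

from kubeflow.training import TrainingClient

client = TrainingClient()
# Hypothetical unified getter that would replace get_pytorchjob, get_tfjob, etc.
job = client.get_job(name="my-job", namespace="default", job_type="PyTorchJob")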

After refactoring our SDK in #1719, I noticed that it is very confusing for users that some of our CRUD operations are job-specific (e.g. create_tfjob) while others are not (e.g. get_job_pod_names, get_job_logs).
I can create a separate issue to discuss this, and I am going to provide more feedback from users soon.
cc @kubeflow/wg-training-leads

@andreyvelich
Member Author

andreyvelich commented Jul 19, 2023

Should this be a different API, for more clarity?

I am not sure that users who are not familiar with Kubernetes should need to know the difference between events and logs.
Usually, when Data Scientists create an ML Job, they want to check the logs from that job directly (e.g. run the get_job_logs API).
Otherwise, we would have to somehow explain to them that if the get_job_logs API fails, they should run the get_job_events API.
WDYT @johnugeorge?

And we could add a new argument with_events=False to the get_job_logs API to make it possible to get all of the information from the same API.

I like the idea @kuizhiqing. Which would be easier to understand: a with_events or a verbose flag?

@johnugeorge
Member

+1 Agree with you @andreyvelich

@tenzen-y
Member

The same labels could match Pods from multiple Job kinds (e.g. PyTorchJob, XGBoostJob).

That ties into my other question: if we are going to introduce a mandatory job_type argument to our get_job_logs API, should we follow the same pattern for all APIs?
E.g. instead of get_pytorchjob and get_tfjob, we would have a single API called get_job which takes job_type as a mandatory argument and returns the appropriate job based on this type.

We could do the same for all other APIs: create_job, create_job_from_func, get_job, delete_job, etc.

After refactoring our SDK in #1719, I noticed that it is very confusing for users that some of our CRUD operations are job-specific (e.g. create_tfjob) while others are not (e.g. get_job_pod_names, get_job_logs).
I can create a separate issue to discuss this, and I am going to provide more feedback from users soon.

@andreyvelich Thanks for the clarification.
I agree with you. Let's work on events for XXXJob in another issue.

@kuizhiqing
Member

Which would be easier to understand: a with_events or a verbose flag?

@andreyvelich Well, you're right, it would be better to use a verbose option without exposing the notion of Kubernetes events to the user.
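
For example, the eventual call could then look roughly like this (verbose is the proposed argument, not implemented yet):

$ client.get_job_logs(name, container="pytorch", is_master=False, verbose=True)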

@andreyvelich
Member Author

/assign @andreyvelich


github-actions bot commented Dec 3, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member Author

/assign @andreyvelich
