[UI] Fix Trial Logs when Kubernetes Job Fails #2164

Merged · 2 commits into kubeflow:master on Jun 20, 2023

Conversation

@andreyvelich (Member) commented on Jun 19, 2023:

I fixed the UI backend so that it returns the proper value when a Kubernetes Job fails. When a Job fails, it spawns multiple Pods, and our UI currently shows: More than one master replica found.

To fix this, the selection works as follows (a minimal sketch is shown after the list):

  1. If a Succeeded or Running Pod exists, we print its logs.
  2. Otherwise, we print the logs from one of the Failed Pods.
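
For illustration, here is a minimal sketch of that selection in Go, assuming a hypothetical helper selectMasterPod and the standard k8s.io/api/core/v1 Pod phases; this is not the exact Katib implementation:

```go
package logsutil // hypothetical package for this sketch

import (
	"errors"

	corev1 "k8s.io/api/core/v1"
)

// selectMasterPod picks which Pod's logs to show for a Trial:
// prefer a Succeeded or Running Pod, otherwise fall back to a Failed one.
func selectMasterPod(podList *corev1.PodList) (string, error) {
	if len(podList.Items) == 0 {
		return "", errors.New("no Pods found for this Trial")
	}
	// 1. If a Succeeded or Running Pod exists, use it.
	for _, pod := range podList.Items {
		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodRunning {
			return pod.ObjectMeta.Name, nil
		}
	}
	// 2. Otherwise, use one of the Failed Pods.
	for _, pod := range podList.Items {
		if pod.Status.Phase == corev1.PodFailed {
			return pod.ObjectMeta.Name, nil
		}
	}
	// No Succeeded, Running, or Failed Pod (e.g. everything is Pending):
	// fall back to the first Pod in the list.
	return podList.Items[0].ObjectMeta.Name, nil
}
```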

/assign @tenzen-y @kimwnasptd @johnugeorge @apo-ger @d-gol

@google-oss-prow (bot) commented:

@andreyvelich: GitHub didn't allow me to assign the following users: d-gol.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this (the PR description quoted above).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tenzen-y (Member) left a comment:

@andreyvelich Thank you for fixing this bug!

Review comment from @tenzen-y (Member) on the diff line // Otherwise, return the first Failed Pod.:

Suggested change:
- // Otherwise, return the first Failed Pod.
+ // Otherwise, return the first Failed or Pending Pod.

This includes Pending Pods too, right?
So, should we return a Failed Pod if there are both Pending and Failed Pods in podList?

@andreyvelich (Member, Author) replied on Jun 19, 2023:

> This includes Pending Pods too, right?

Yes, that's right. In the case of a Pending Pod, with my change the UI shows a similar error:

One of the Pod containers is in the ContainerCreating stage

@tenzen-y Do you think we should process only the Failed Pods after we have processed Succeeded and Running (as you mentioned above: "So, should we return a Failed Pod if there are both Pending and Failed Pods in podList?")?

And then, if there are no Failed Pods left, just return podList.Items[0].Name (a Pending Pod), and the UI should show the above error?

@tenzen-y (Member) replied:

> Do you think we should process only the Failed Pods after we have processed Succeeded and Running?

Yes, I do. I think users would prefer to see logs from a Failed Pod rather than from a Pending Pod. WDYT?

Also, in the current implementation, if there are both Pending and Failed Pods, the selected Pod may differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).List....

Hence, whether it is a Pending or a Failed Pod, we should guarantee which phase this function returns (a sketch of one way to do this follows this comment).

> And then, if there are no Failed Pods left, just return podList.Items[0].Name (a Pending Pod), and the UI should show the above error?

Hmm, I'm not sure that message is appropriate. I don't think Pending necessarily means ContainerCreating. For example, when a custom scheduler holds Pods pending based on priority or requested resources, it can take quite a while until a Pod is scheduled onto a Node and starts.
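
To illustrate the point about guaranteeing which phase is returned regardless of List ordering, here is a sketch that ranks Pods by phase priority (Succeeded/Running, then Failed, then Pending) and breaks ties by name so the result is stable across calls; the function names are assumptions, not the actual Katib code:

```go
package logsutil // hypothetical package for this sketch

import (
	"sort"

	corev1 "k8s.io/api/core/v1"
)

// phaseRank assigns a priority to each Pod phase; lower is preferred.
func phaseRank(phase corev1.PodPhase) int {
	switch phase {
	case corev1.PodSucceeded, corev1.PodRunning:
		return 0
	case corev1.PodFailed:
		return 1
	case corev1.PodPending:
		return 2
	default:
		return 3
	}
}

// pickPodDeterministically returns the same Pod name for the same set of
// Pods on every call, independent of the order returned by List.
func pickPodDeterministically(pods []corev1.Pod) string {
	if len(pods) == 0 {
		return ""
	}
	sorted := make([]corev1.Pod, len(pods))
	copy(sorted, pods)
	sort.Slice(sorted, func(i, j int) bool {
		ri, rj := phaseRank(sorted[i].Status.Phase), phaseRank(sorted[j].Status.Phase)
		if ri != rj {
			return ri < rj
		}
		// Tie-break by name so the choice is stable.
		return sorted[i].ObjectMeta.Name < sorted[j].ObjectMeta.Name
	})
	return sorted[0].ObjectMeta.Name
}
```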

@andreyvelich (Member, Author) replied on Jun 19, 2023:

> I think users would prefer to see logs from a Failed Pod rather than from a Pending Pod. WDYT?

That makes sense.

> The selected Pod may differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).List....

From my testing, however, the [0] element is always the last created Pod (even though List in client-go doesn't guarantee ordering).
I guess that is how the caching behind the List operation works. (See the sketch after this comment for sorting explicitly instead of relying on this.)

> I don't think Pending necessarily means ContainerCreating.

That's right, it can be a different message; I just added it as an example. We could always show this message to the user when the Pod is in the Pending state: Failed to get logs for this Trial. Pod is in the Pending state. WDYT @tenzen-y?
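
Since client-go's List does not guarantee ordering, one way to make the "last created Pod" choice explicit is to sort by creation timestamp after listing. A minimal sketch, assuming a hypothetical label selector and function name; this is not the Katib implementation:

```go
package logsutil // hypothetical package for this sketch

import (
	"context"
	"sort"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// newestTrialPod lists the Trial's Pods and explicitly sorts them newest
// first by CreationTimestamp instead of relying on List's ordering.
func newestTrialPod(ctx context.Context, clientset kubernetes.Interface, namespace, labelSelector string) (*corev1.Pod, error) {
	podList, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
	if err != nil {
		return nil, err
	}
	if len(podList.Items) == 0 {
		return nil, nil
	}
	// Sort newest first: Pod i comes before Pod j if j was created before i.
	sort.Slice(podList.Items, func(i, j int) bool {
		return podList.Items[j].ObjectMeta.CreationTimestamp.Before(&podList.Items[i].ObjectMeta.CreationTimestamp)
	})
	return &podList.Items[0], nil
}
```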

@tenzen-y (Member) replied on Jun 19, 2023:

> From my testing, however, the [0] element is always the last created Pod (even though List in client-go doesn't guarantee ordering).
> I guess that is how the caching behind the List operation works.

That's interesting. Thanks for sharing the result of your investigation!

FYI: I guess the informer cache might affect the result. The informer cache is generally resynced at an interval; IIRC, the default interval is 10 hours.

> Failed to get logs for this Trial. Pod is in the Pending state.

Looks great :)

@andreyvelich (Member, Author) replied:

@tenzen-y @johnugeorge I've changed it.

google-oss-prow bot added the size/S label and removed the size/XS label on Jun 19, 2023.
@tenzen-y (Member) left a comment:

Thanks!
/lgtm
/approve

google-oss-prow bot added the lgtm label on Jun 20, 2023.
google-oss-prow bot merged commit ede6e74 into kubeflow:master on Jun 20, 2023.
60 of 63 checks passed
@google-oss-prow (bot) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich deleted the fix-ui-trial-log branch on June 20, 2023 at 13:13.