Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UI] Fix Trial Logs when Kubernetes Job Fails #2164

Merged
merged 2 commits into from
Jun 20, 2023
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 7 additions & 2 deletions pkg/new-ui/v1beta1/backend.go
Original file line number Diff line number Diff line change
Expand Up @@ -733,10 +733,15 @@ func fetchMasterPodName(clientset *kubernetes.Clientset, trial *trialsv1beta1.Tr
field to "true" in the Experiment definition. If this error persists then the Pod's logs are not currently
persisted in the cluster.`)
}
if len(podList.Items) > 1 {
return "", errors.New("More than one master replica found")

// If Pod is Running or Succeeded Pod, return it.
for _, pod := range podList.Items {
if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodRunning {
return pod.Name, nil
}
}

// Otherwise, return the first Failed Pod.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Otherwise, return the first Failed Pod.
// Otherwise, return the first Failed or Pending Pod.
}

Is this included in Pending Pods too, right?
So, should we return a failed pod if there are pending pods and failed pods in podList?

Copy link
Member Author

@andreyvelich andreyvelich Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this included in Pending Pods too, right?

Yes, that's right. In case of Pending pod, UI shows the similar error with my change:

One of the Pod containers is is ContainerCreating stage

@tenzen-y Do you think we should process only Failed pod after we processed Succeeded and Running (like you mentioned above: So, should we return a failed pod if there are pending pods and failed pods in podList?

And then, if there are no more Failed pods, just return podList.Items[0].Name (Pending Pod) and UI should show the above error ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should process only Failed pod after we processed Succeeded and Running ?

Yes, I do. I think users may prefer that show logs for a failed pod rather than a pending pod. wdyt?

Also in the current implementation, If there are pending pods and failed pods, maybe, the selected pods will differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).List....

Hence, for either a pending or failed pod, we should guarantee which phase this function returns.

And then, if there are no more Failed pods, just return podList.Items[0].Name (Pending Pod) and UI should show the above error ?

Uhm. I'm not sure the above message is appropriate. However, I don't think that pending means ContainerCreating. For example, when using a custom scheduler to enforce pending pods based on priority or requested resources, it will take a bit much time until is scheduled into a Node and starts.

Copy link
Member Author

@andreyvelich andreyvelich Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logs for a failed pod rather than a pending pod. wdyt

That makes sense.

The selected pods will differ each time we run clientset.CoreV1().Pods(trial.ObjectMeta.Namespace).

From my testing however, the [0] element is always the last created Pod (despite that list in client-go doesn't guarantee ordering).
I guess, that is how caching works behind list operation.

However, I don't think that pending means ContainerCreating

That is right, it can be different message I just added this as an example. We can always log this message to the user (when Pod is the Pending state): Failed to get logs for this Trial. Pod is in the Pending state. WDYT @tenzen-y ?

Copy link
Member

@tenzen-y tenzen-y Jun 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my testing however, the [0] element is always the last created Pod (despite that list in client-go doesn't guarantee ordering).
I guess, that is how caching works behind list operation.

It's interesting. Thanks for sharing the result of the investigation!

FYI: I guess the informer cache might affect the result. And the informer cache generally is rotated in intervals. IIRC, an interval is 10 hours as a default.

Failed to get logs for this Trial. Pod is in the Pending state.

Looks great :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y @johnugeorge I've changed it.

return podList.Items[0].Name, nil
}

Expand Down