How should we handle the case where there are multiple pods per TFJob/PyTorchJob?
In the current PyTorch v1beta1 API, there must be exactly one Master pod. So one solution is to collect logs from the pod that has the labels "pytorch_job_name": Wid and "pytorch-replica-type": "master".
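A minimal sketch of how such a selector could be built. The function name and the example job name are hypothetical; the label keys are the ones mentioned above:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// masterPodSelector builds a Kubernetes label selector string that
// matches the single master pod of a PyTorchJob. This is an
// illustrative sketch, not Katib's actual implementation.
func masterPodSelector(jobName string) string {
	labels := map[string]string{
		"pytorch_job_name":     jobName,
		"pytorch-replica-type": "master",
	}
	// Sort keys so the selector string is deterministic.
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%s", k, labels[k]))
	}
	return strings.Join(parts, ",")
}

func main() {
	// Hypothetical job name used only for illustration.
	fmt.Println(masterPodSelector("wid-example"))
}
```

The resulting string can be passed as the `LabelSelector` field of a pod list call, so only the master pod's logs are collected.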
A new master role label has been added to the pod (in TFJob/PyTorchJob) that acts as the master. If there is no master in a TFJob, the label is added to the first worker.
For logging, the default metric collector can look for the pod that carries this master label.
The default metric collector collects and parses metrics emitted on stdout.
For TFJob and PyTorchJob, the default metric collector looks for pods that have the labels "tf_job_name" and "pytorch_job_name" respectively:
https://github.com/kubeflow/katib/blob/master/pkg/manager/metricscollector/meticscollector.go#L42
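For context, stdout metric parsing along these lines could look as follows. This is a hedged sketch that assumes metrics are printed as `name=value` pairs; the exact format the linked collector expects may differ, and `parseMetricLine` is a hypothetical helper, not Katib's API:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMetricLine extracts the requested metrics from one line of pod
// stdout, assuming "name=value" pairs separated by whitespace.
// Illustrative only; not the project's actual parser.
func parseMetricLine(line string, metricNames []string) map[string]float64 {
	out := map[string]float64{}
	for _, field := range strings.Fields(line) {
		kv := strings.SplitN(field, "=", 2)
		if len(kv) != 2 {
			continue
		}
		for _, name := range metricNames {
			if kv[0] == name {
				if v, err := strconv.ParseFloat(kv[1], 64); err == nil {
					out[name] = v
				}
			}
		}
	}
	return out
}

func main() {
	// Example stdout line from a (hypothetical) training job.
	metrics := parseMetricLine("epoch=3 accuracy=0.91 loss=0.27",
		[]string{"accuracy", "loss"})
	fmt.Println(metrics)
}
```

With a single labeled master pod, running such a parser over that pod's log stream avoids the ambiguity of merging metrics from multiple workers.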
@richardsliu @YujiOshima
Related: #283