
Better support for distributed tfjob/pytorch job #294

Closed
johnugeorge opened this issue Dec 14, 2018 · 2 comments

johnugeorge (Member) commented Dec 14, 2018

The default metric collector collects and parses metrics emitted to stdout.
For TFJobs and PyTorchJobs, it looks for pods that carry the labels "tf_job_name" and "pytorch_job_name" respectively:
https://github.com/kubeflow/katib/blob/master/pkg/manager/metricscollector/meticscollector.go#L42
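As a rough illustration of that flow (not the collector's actual parsing rules, which live in the file linked above), here is a minimal Go sketch that assumes metrics are emitted to stdout one per line as `name=value`:

```go
// Hypothetical sketch: parse "name=value" metric lines out of a pod's stdout.
// The real format and parsing rules are defined in pkg/manager/metricscollector.
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

func parseMetrics(logs string) map[string]float64 {
	metrics := map[string]float64{}
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		// Expect lines like "accuracy=0.93"; skip anything that doesn't match.
		parts := strings.SplitN(scanner.Text(), "=", 2)
		if len(parts) != 2 {
			continue
		}
		v, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
		if err != nil {
			continue
		}
		metrics[strings.TrimSpace(parts[0])] = v
	}
	return metrics
}

func main() {
	fmt.Println(parseMetrics("epoch=3\naccuracy=0.93\nsome unrelated log line"))
}
```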

How should we handle the case where there are multiple pods per TFJob/PyTorchJob?

With the current PyTorch v1beta1 version, there must be exactly one master pod. So one solution is to collect logs from the pod that has the labels "pytorch_job_name": Wid and "pytorch-replica-type": master, as sketched below.
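A minimal sketch of that selection, assuming a recent client-go; the label keys are taken from this comment, while `masterPodName`, its signature, and the `jobName` parameter are only illustrative, not the collector's actual API:

```go
// Hypothetical sketch: list the pods of a trial's PyTorchJob and keep only
// the master replica. jobName stands in for the trial/worker ID the
// collector already knows.
package collector

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// masterPodName returns the name of the single master pod of a PyTorchJob.
func masterPodName(ctx context.Context, cs kubernetes.Interface, namespace, jobName string) (string, error) {
	selector := fmt.Sprintf("pytorch_job_name=%s,pytorch-replica-type=master", jobName)
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return "", err
	}
	// v1beta1 PyTorchJobs must have exactly one master, so anything else is an error.
	if len(pods.Items) != 1 {
		return "", fmt.Errorf("expected exactly one master pod for %q, found %d", jobName, len(pods.Items))
	}
	return pods.Items[0].Name, nil
}
```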

@richardsliu @YujiOshima

Related: #283

johnugeorge (Member, Author) commented

/area 0.4.0

johnugeorge (Member, Author) commented

A new master role label is now added to the pod (in TFJob/PyTorchJob) that acts as the master. If a TFJob has no master, the label is added to the first worker.

For logging, the default metric collector can look for the pod that carries this master label.
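A hedged sketch of how the collector could use that label to pick the pod and read its stdout. The label key `job-role=master` is an assumption here (the exact key/value is whatever the operator sets), and the function name and parameters are illustrative:

```go
// Hypothetical sketch: select the pod carrying the master role label and
// stream its stdout, which is where the metrics lines are printed.
package collector

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// masterPodLogs returns the stdout of the master replica of a TFJob/PyTorchJob.
// jobNameLabel is e.g. "tf_job_name" or "pytorch_job_name".
func masterPodLogs(ctx context.Context, cs kubernetes.Interface, namespace, jobNameLabel, jobName string) (string, error) {
	// Find the single pod labelled as the master replica of this job.
	selector := fmt.Sprintf("%s=%s,job-role=master", jobNameLabel, jobName)
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return "", err
	}
	if len(pods.Items) != 1 {
		return "", fmt.Errorf("expected one master pod, found %d", len(pods.Items))
	}

	// Stream stdout from the master pod; metrics are parsed from these lines.
	stream, err := cs.CoreV1().Pods(namespace).GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{}).Stream(ctx)
	if err != nil {
		return "", err
	}
	defer stream.Close()
	raw, err := io.ReadAll(stream)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}
```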
