
Better support for distributed tfjob/pytorch job #294

Closed
johnugeorge opened this issue Dec 14, 2018 · 2 comments

johnugeorge (Member) commented Dec 14, 2018

The default metric collector collects and parses metrics emitted to stdout.
For TFJobs and PyTorchJobs, it looks for pods that carry the labels "tf_job_name" and "pytorch_job_name" respectively:
https://github.com/kubeflow/katib/blob/master/pkg/manager/metricscollector/meticscollector.go#L42
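As a rough illustration of that flow (not the collector's actual parsing rules, which live in the file linked above), here is a minimal Go sketch that assumes metrics are emitted to stdout one per line as `name=value`:

```go
// Hypothetical sketch: parse "name=value" metric lines out of a pod's stdout.
// The real format and parsing rules are defined in pkg/manager/metricscollector.
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

func parseMetrics(logs string) map[string]float64 {
	metrics := map[string]float64{}
	scanner := bufio.NewScanner(strings.NewReader(logs))
	for scanner.Scan() {
		// Expect lines like "accuracy=0.93"; skip anything that doesn't match.
		parts := strings.SplitN(scanner.Text(), "=", 2)
		if len(parts) != 2 {
			continue
		}
		v, err := strconv.ParseFloat(strings.TrimSpace(parts[1]), 64)
		if err != nil {
			continue
		}
		metrics[strings.TrimSpace(parts[0])] = v
	}
	return metrics
}

func main() {
	fmt.Println(parseMetrics("epoch=3\naccuracy=0.93\nsome unrelated log line"))
}
```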

How should we handle the case where there are multiple pods per TFJob/PyTorchJob?

With the current PyTorch v1beta1 version, there must be exactly one master pod. So one solution is to collect logs from the pod that has the labels "pytorch_job_name": Wid and "pytorch-replica-type": master, as sketched below.
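A minimal sketch of that selection, assuming a recent client-go; the label keys are taken from this comment, while `masterPodName`, its signature, and the `jobName` parameter are only illustrative, not the collector's actual API:

```go
// Hypothetical sketch: list the pods of a trial's PyTorchJob and keep only
// the master replica. jobName stands in for the trial/worker ID the
// collector already knows.
package collector

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// masterPodName returns the name of the single master pod of a PyTorchJob.
func masterPodName(ctx context.Context, cs kubernetes.Interface, namespace, jobName string) (string, error) {
	selector := fmt.Sprintf("pytorch_job_name=%s,pytorch-replica-type=master", jobName)
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return "", err
	}
	// v1beta1 PyTorchJobs must have exactly one master, so anything else is an error.
	if len(pods.Items) != 1 {
		return "", fmt.Errorf("expected exactly one master pod for %q, found %d", jobName, len(pods.Items))
	}
	return pods.Items[0].Name, nil
}
```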

@richardsliu @YujiOshima

Related: #283

johnugeorge (Member, Author) commented

/area 0.4.0

johnugeorge (Member, Author) commented

A new master role label is now added to the pod (in TFJob/PyTorchJob) that acts as the master. If a TFJob has no master, the label is added to the first worker.

For logging, the default metric collector can look for the pod that carries this master label.
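A hedged sketch of how the collector could use that label to pick the pod and read its stdout. The label key `job-role=master` is an assumption here (the exact key/value is whatever the operator sets), and the function name and parameters are illustrative:

```go
// Hypothetical sketch: select the pod carrying the master role label and
// stream its stdout, which is where the metrics lines are printed.
package collector

import (
	"context"
	"fmt"
	"io"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// masterPodLogs returns the stdout of the master replica of a TFJob/PyTorchJob.
// jobNameLabel is e.g. "tf_job_name" or "pytorch_job_name".
func masterPodLogs(ctx context.Context, cs kubernetes.Interface, namespace, jobNameLabel, jobName string) (string, error) {
	// Find the single pod labelled as the master replica of this job.
	selector := fmt.Sprintf("%s=%s,job-role=master", jobNameLabel, jobName)
	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return "", err
	}
	if len(pods.Items) != 1 {
		return "", fmt.Errorf("expected one master pod, found %d", len(pods.Items))
	}

	// Stream stdout from the master pod; metrics are parsed from these lines.
	stream, err := cs.CoreV1().Pods(namespace).GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{}).Stream(ctx)
	if err != nil {
		return "", err
	}
	defer stream.Close()
	raw, err := io.ReadAll(stream)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}
```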
