
Support Pytorch job in Katib #283

Merged · 6 commits · Dec 13, 2018

Conversation

@johnugeorge (Member) commented Dec 10, 2018:

This PR tracks:

  1. Integration of the PyTorch job into the Katib StudyJob
  2. Support for multiple worker kinds in the default metrics collector (see the sketch below)

Related: #39
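For context, a minimal sketch of what item 2 amounts to on the metrics-collector side, assuming the worker-kind constants added in this PR (with the PyTorchJob constant named as suggested in review below) and the pod label keys discussed later in this thread; workerLabelKey is an illustrative helper name, not code from this PR:

package studyjob

// Worker kinds supported by the StudyJob controller.
const (
	DefaultJobWorker = "Job"
	TFJobWorker      = "TFJob"
	PyTorchJobWorker = "PyTorchJob"
)

// workerLabelKey (hypothetical helper) returns the pod label key the default
// metrics collector would use to find a worker's pods, per worker kind.
func workerLabelKey(kind string) string {
	switch kind {
	case TFJobWorker:
		return "tf_job_name" // set on pods by the tf-operator job controller
	case PyTorchJobWorker:
		return "pytorch_job_name" // set on pods by the pytorch-operator job controller
	default:
		return "job-name" // plain batch/v1 Job workers
	}
}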



@johnugeorge (Member Author):
/retest

1 similar comment

@johnugeorge (Member Author):
/retest

spec:
  containers:
  - name: pytorch
    image: johnugeorge/pytorch-mnist-with-summary:0.4
Contributor:

Can you check in the source code for this? Also this should be on gcr.io.

Contributor:

Can you update the image?

Member Author:

@richardsliu Can you help upload the image to gcr.io? I don't have access.

Contributor:

@johnugeorge Done. Pushed as gcr.io/kubeflow-ci/pytorch-mnist-with-summary:0.4.

Member Author:

updated

serviceAccountName: metrics-collector
containers:
- name: {{.WorkerID}}
  image: johnugeorge/metrics-collector
Contributor:

Same with this.

Member Author:

A new metrics-collector image has to be created once this PR is merged.

Member Author:

I will change the image name once the new image is created after this PR is merged.

@johnugeorge (Member Author):
/hold

/assign @richardsliu
/assign @YujiOshima

- "{{.Name}}={{.Value}}"
{{- end}}
{{- end}}
metricsCollectorSpec:
Contributor:

Do you also need to configure PV here?

@johnugeorge (Member Author) commented Dec 12, 2018:

@richardsliu Currently, this example uses the default metrics collector, which parses the stdout logs; it doesn't need a PV. I will add one more example that uses the TF event metrics collector.
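For reference, a minimal sketch of the kind of stdout parsing meant here, assuming a simple "name=value" line format for illustration; parseStdoutMetrics and the sample input are illustrative, not the collector's actual code (the log stream itself would be fetched with the same CoreV1 clientset used to list the worker pods):

package main

import (
	"bufio"
	"fmt"
	"io"
	"regexp"
	"strings"
)

// metricLine matches lines such as "accuracy=0.93" printed by the training script.
var metricLine = regexp.MustCompile(`^([A-Za-z][\w-]*)=([0-9][0-9.eE+-]*)$`)

// parseStdoutMetrics scans a worker's log stream and collects metric name/value pairs.
func parseStdoutMetrics(logs io.Reader) map[string]string {
	metrics := map[string]string{}
	scanner := bufio.NewScanner(logs)
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if m := metricLine.FindStringSubmatch(line); m != nil {
			metrics[m[1]] = m[2]
		}
	}
	return metrics
}

func main() {
	sample := strings.NewReader("epoch 1\naccuracy=0.93\nloss=0.21\n")
	fmt.Println(parseStdoutMetrics(sample)) // map[accuracy:0.93 loss:0.21]
}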

@richardsliu (Contributor) left a comment:

Generally LGTM.


@@ -13,5 +13,6 @@ package studyjob

const (
	DefaultJobWorker = "Job"
	TFJobWorker      = "TFJob"
	PytorchJobWorker = "PyTorchJob"
Contributor:

PyTorchJobWorker?

Member Author:

updated

@richardsliu (Contributor):
/retest

@richardsliu (Contributor):
/test all

@johnugeorge (Member Author):
/retest

@richardsliu (Contributor):
/retest

} else {
	labelName = "job-name"
}
pl, _ := d.clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: labelName + "=" + wID, IncludeUninitialized: true})
Contributor:

Which pod will be watched for a PyTorch job and a TFJob when selecting pods by pytorch_job_name = wID and tf_job_name = wID?

Member Author:

Pods spawned by the PyTorch job and the TF job will have this label key, with its value set to the job name: https://github.com/kubeflow/tf-operator/blob/master/pkg/common/jobcontroller/jobcontroller.go#L190

Contributor:

Yes, my question is: the PyTorch job and the TF job will create several pods, and the metrics collector will get logs from only one pod.
Which of the pods created by a PyTorch job does the metrics collector get logs from?

@johnugeorge (Member Author) commented Dec 13, 2018:

Currently, it uses only one worker. If there are more workers, the master can take responsibility for emitting the logs. We need to discuss separately how best to handle distributed jobs. @richardsliu
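A rough sketch of the behaviour discussed in this thread, under the assumption (stated above) that the collector lists worker pods by the per-kind label and then scrapes a single pod's stdout; the type and method names below are illustrative, not the PR's actual code, and the List call mirrors the snippet quoted earlier:

package collector

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type defaultCollector struct {
	clientset kubernetes.Interface
}

// targetPod picks the single pod whose stdout is scraped for metrics.
// labelName is "job-name", "tf_job_name" or "pytorch_job_name" depending on
// the worker kind, and wID is the worker (job) name.
func (d *defaultCollector) targetPod(namespace, wID, labelName string) (*apiv1.Pod, error) {
	pl, err := d.clientset.CoreV1().Pods(namespace).List(metav1.ListOptions{
		LabelSelector: labelName + "=" + wID,
	})
	if err != nil {
		return nil, err
	}
	if len(pl.Items) == 0 {
		return nil, fmt.Errorf("no pods found for worker %s", wID)
	}
	// A non-distributed trial has exactly one pod here; a distributed
	// PyTorchJob/TFJob would have several, and only one of them (for example
	// the master) can be scraped, which is the open question for distributed jobs.
	return &pl.Items[0], nil
}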

@richardsliu (Contributor):
@johnugeorge Need to rebase the PR.

@johnugeorge (Member Author):
/retest

@johnugeorge (Member Author):
@richardsliu rebased

@richardsliu (Contributor):
/lgtm
Will wait for @YujiOshima for approval.

@johnugeorge (Member Author):
/hold cancel

@YujiOshima (Contributor):
I wrote a comment on the metrics collector.
Though the PyTorch job and the TF job will create several pods, the default metrics collector will collect from only one of them.
I wonder which pod we should choose: master, parameter server, worker, or is any of them OK?
If that is difficult to define, the default metrics collector may have to be deprecated when you use a TF job or PyTorch job.
WDYT @johnugeorge @richardsliu

@johnugeorge (Member Author):
@YujiOshima Shall we merge this and start a separate discussion for distributed jobs? I will create an issue.

@YujiOshima (Contributor):
@johnugeorge OK, thanks.
/approve

@k8s-ci-robot:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: YujiOshima

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
