Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix https://github.com/kubeflow/training-operator/issues/1706 #1707

Merged
merged 1 commit into from
Dec 26, 2022

Conversation

HeGaoYuan
Copy link
Contributor

Which issue(s) this PR fixes : #1706

Checklist:

  • Docs included if any changes are user facing

@@ -74,6 +75,7 @@ func main() {
"If set, it only monitors kubeflow jobs in the given namespace.")
flag.IntVar(&monitoringPort, "monitoring-port", 9443, "Endpoint port for displaying monitoring metrics. "+
"It can be set to \"0\" to disable the metrics serving.")
flag.IntVar(&controllerThreads, "controller-threads", 10, "Number of worker threads used by the controller.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we keep current defaults here(1)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most users are too "lazy" to change the default values, or they wish the default values can handle most situations, I think the default value set to 10 is better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This value is highly dependent on deployment environment and user needs. I would recommend to keep it default 1 and user can override it based on their needs. (Ref: https://docs.okd.io/latest/operators/operator_sdk/golang/osdk-golang-tutorial.html#osdk-golang-controller-configs_osdk-golang-tutorial)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeGaoYuan I agree with @johnugeorge. We should keep it default 1 since it would be best to set values as the minimal requirement one as default.

Copy link
Contributor Author

@HeGaoYuan HeGaoYuan Dec 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far I am still stick on my opinion.

spark-on-k8s-operator is a great project which is similar to training-operator. It use the default value 10 for controllerThreads.( https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/c261df66a00635509f7d8cb2a7fba4c602c9228e/main.go#L56 )

And I think the robustness is more important than minimal requirement for a production-grade project.

Welcome more programmers to convince me of 1 is better than 10. 😁😁

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was totally agree that the default value should be 1 here, regards as following reasons and more,

  • the default value of controller.Options.MaxConcurrentReconciles is 1, cf
  • if we do not make sure that EVERYTHING now and future are completely thread safe, it's better to keep the minimal working configuration as default.

Anyway, keep default as DEFAULT would be better in my personal point of view.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeGaoYuan Can you update the PR?

Copy link
Contributor Author

@HeGaoYuan HeGaoYuan Dec 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HeGaoYuan Can you update the PR?

@johnugeorge done

@coveralls
Copy link

coveralls commented Dec 23, 2022

Pull Request Test Coverage Report for Build 3781033970

  • 12 of 30 (40.0%) changed or added relevant lines in 7 files are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.05%) to 39.243%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controller.v1/mxnet/mxjob_controller.go 0 3 0.0%
pkg/controller.v1/xgboost/xgboostjob_controller.go 0 3 0.0%
pkg/controller.v1/register_controller.go 0 12 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/register_controller.go 1 50.0%
pkg/controller.v1/mpi/mpijob_controller.go 2 77.46%
Totals Coverage Status
Change from base Build 3759228064: 0.05%
Covered Lines: 2663
Relevant Lines: 6786

💛 - Coveralls

@johnugeorge
Copy link
Member

Related: #1708

@@ -74,6 +75,7 @@ func main() {
"If set, it only monitors kubeflow jobs in the given namespace.")
flag.IntVar(&monitoringPort, "monitoring-port", 9443, "Endpoint port for displaying monitoring metrics. "+
"It can be set to \"0\" to disable the metrics serving.")
flag.IntVar(&controllerThreads, "controller-threads", 10, "Number of worker threads used by the controller.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we set the controllerThreads per each framework since I don't think users create many Jobs using all frameworks simultaneously?

This means we introduce --tfjob-threads, pytorchjob-threads, and so on here instead of the --controller-threads.

@johnugeorge wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users now can enable and disable schemes that supported by the operator instance.

I think we can keep the command line options simple at first and update it to a more sophisticated version when real cases pump in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zw0610 I see; we can select schemes in the following. Thanks for clarifying!

flag.Var(&enabledSchemes, "enable-scheme", "Enable scheme(s) as --enable-scheme=tfjob --enable-scheme=pytorchjob, case insensitive."+
" Now supporting TFJob, PyTorchJob, MXNetJob, XGBoostJob, PaddleJob. By default, all supported schemes will be enabled.")

@@ -176,9 +176,10 @@ func (jc *MPIJobReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ct
}

// SetupWithManager sets up the controller with the Manager.
func (jc *MPIJobReconciler) SetupWithManager(mgr ctrl.Manager) error {
func (jc *MPIJobReconciler) SetupWithManager(mgr ctrl.Manager, controllerThreads int) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we or should we introduce a way to keep the SetupWithManager API, may be extending the configuration struct is better than extending/changing API.

Is that make sense to introduce MaxConcurrentReconciles config to common.JobController ?

What's your opinion? @johnugeorge

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kuizhiqing Yes. But I think, we need more refactoring. eg: NewReconciler can take common.JobControllerConfiguration as argument and we can add MaxConcurrentReconciles to common.JobControllerConfiguration.

We can do this in a separate PR as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnugeorge yes, I agree with that.

@johnugeorge
Copy link
Member

Thanks @HeGaoYuan

/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HeGaoYuan, johnugeorge

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit e13d262 into kubeflow:master Dec 26, 2022
google-oss-prow bot pushed a commit that referenced this pull request Feb 16, 2023
* Update sdk to v1.6.0rc0

* Update sdk to v1.6.0rc0

* Update sdk to v1.6.0rc0

* Update to latest Image

* [Bugfix][SDK] pod has no metadata attr anymore in the get_job_logs() … (#1760)

---------

Co-authored-by: Hongzhi Chen <chzyaobaiwei@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants