Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OnDependentUpdateFunc for Job will panic when enable volcano scheduler #1678

Closed
D0m021ng opened this issue Oct 31, 2022 · 9 comments
Closed

Comments

@D0m021ng
Copy link

D0m021ng commented Oct 31, 2022

OnDependentUpdateFunc for TF Job will panic when enable volcano, because of job controller inject watching for job related podGroup.

OnDependentUpdateFunc

var logger *logrus.Entry
if _, ok := newObj.(*corev1.Pod); ok {
	logger = commonutil.LoggerForPod(newObj.(*corev1.Pod), jc.Controller.GetAPIGroupVersionKind().Kind)
}

if _, ok := newObj.(*corev1.Service); ok {
	logger = commonutil.LoggerForService(newObj.(*corev1.Service), jc.Controller.GetAPIGroupVersionKind().Kind)
}

...

// If it has a controller ref, that's all that matters.
if newControllerRef != nil {
	job := resolveControllerRef(jc, newObj.GetNamespace(), newControllerRef)
	if job == nil {
		return false
	}
	logger.Debugf("pod/service has a controller ref: %v, %v", newObj, oldObj)
	return true
}

logger might be null when newObj is v1beta1.PodGroup

TF Job controller

	// skip watching podgroup if podgroup is not installed
	_, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: v1beta1.SchemeGroupVersion.Group, Kind: "PodGroup"},
		v1beta1.SchemeGroupVersion.Version)
	if err == nil {
		// inject watching for job related podgroup
		if err = c.Watch(&source.Kind{Type: &v1beta1.PodGroup{}}, &handler.EnqueueRequestForOwner{
			IsController: true,
			OwnerType:    &kubeflowv1.TFJob{},
		}, predicate.Funcs{
			CreateFunc: util.OnDependentCreateFunc(r.Expectations),
			UpdateFunc: util.OnDependentUpdateFunc(&r.JobController),
			DeleteFunc: util.OnDependentDeleteFunc(r.Expectations),
		}); err != nil {
			return err
		}
	}

panic
image

@Crazybean-lwb
Copy link

Crazybean-lwb commented Nov 2, 2022

not only tf, pytorch、mxnet... met the same problem :
#1679
I can only delete logger to avoid this problem

@johnugeorge
Copy link
Member

which version of training operator are you using?

@D0m021ng
Copy link
Author

D0m021ng commented Nov 2, 2022

k8s version: 1.20.15
training-operator version: v1-74655a (commit id 74655a177a7d886d6e2d9c86c800271d4d44687b)

@D0m021ng D0m021ng changed the title OnDependentUpdateFunc for TF Job will panic when enable volcano scheduler OnDependentUpdateFunc for Job will panic when enable volcano scheduler Nov 2, 2022
@johnugeorge
Copy link
Member

Seems like, bug got introduced as part of #1666

/cc @ggaaooppeenngg

@ggaaooppeenngg
Copy link
Contributor

I only tested it with MPIJob. I made a temporary fix for this issue. @D0m021ng @johnugeorge

@johnugeorge
Copy link
Member

@ggaaooppeenngg How did it work for MPIJob? Can you add a test case as well?

@ggaaooppeenngg
Copy link
Contributor

@johnugeorge Only MPIJob controller uses OnDependentXXXFuncGeneric which initializes the generic logger at the beginning. I am trying to commit to a test for it.

@johnugeorge
Copy link
Member

@johnugeorge Only MPIJob controller uses OnDependentXXXFuncGeneric which initializes the generic logger at the beginning. I am trying to commit to a test for it.

I see that pod watch uses OnDependentXXXFunc

CreateFunc: util.OnDependentCreateFunc(jc.Expectations),

@ggaaooppeenngg
Copy link
Contributor

@johnugeorge It is for the pod. For other objects, it uses the generic function.

@D0m021ng D0m021ng closed this as completed Nov 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants