Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race between OV and I creation in tests #630

Merged
merged 6 commits into from
Jul 25, 2019
Merged

Conversation

alenkacz
Copy link
Contributor

What type of PR is this?
/kind bug

What this PR does / why we need it:
When OV is created AFTER instance, this is how we look for existing instances https://github.com/kudobuilder/kudo/blob/master/pkg/controller/instance/instance_controller.go#L87 so the labels have to be in place

Which issue(s) this PR fixes:

Fixes #626

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@alenkacz
Copy link
Contributor Author

alenkacz commented Jul 23, 2019

@justinbarrick this label HAS TO be present on all instances right now. For normal operations, CLI ensures that otherwise we'll always end up with flaky test. Would be good to have some automatic check or ... not sure how to approach that.

Other option would be to adjust the controller code to select instances differently, maybe we can go over owners or something like that.

@kensipe
Copy link
Member

kensipe commented Jul 23, 2019

@alenkacz the change makes sense... hence the approval... I don't understand that the current yaml is flaky. why would it be flaky and not always failing?

@jbarrick-mesosphere
Copy link
Member

I've triggered 60 runs of this branch here: https://circleci.com/gh/kudobuilder/workflows/kudo/tree/av%2Ffix-ov-i-race?

Let's see what happens :)

@jbarrick-mesosphere
Copy link
Member

@alenkacz the change makes sense... hence the approval... I don't understand that the current yaml is flaky. why would it be flaky and not always failing?

My guess here would be that it requires two conditions:

  • The Instance creation event to be processed before the OperatorVersion.
  • The Instance has missing labels.

And that it's an issue with the code here for handling that race: https://github.com/kudobuilder/kudo/blob/master/pkg/controller/instance/instance_controller.go#L90

@jbarrick-mesosphere
Copy link
Member

Reading https://github.com/kudobuilder/kudo/blob/master/pkg/controller/instance/instance_controller.go#L90 it seems like we're actually checking the OperatorVersion and not Operator name?

@alenkacz
Copy link
Contributor Author

@jbarrick-mesosphere the reason why is it flaky and not failing: we have two different codepaths for instance created after OV and OV created after instance. This listing of instances based on label is done only in case of OV after I, because in the other case you already have instance in hand

@alenkacz
Copy link
Contributor Author

@jbarrick-mesosphere the operator label is really operator https://github.com/kudobuilder/kudo/blob/master/pkg/kudoctl/bundle/package.go#L183 I don't know why are we not filtering based on OV, that would make more sense to me

@jbarrick-mesosphere
Copy link
Member

Yeah, I think the check I pasted is looking for OV in the operator label, so we'll need to fix that to actually fix these flakes. But maybe we could do a field selector instead of a label selector?

@jbarrick-mesosphere
Copy link
Member

Bummer, field selectors won't work:

➜  kudo git:(av/fix-ov-i-race) ✗ kubectl get instance -o yaml --field-selector spec.operatorVersion.name=flink-demo-0.1.0
apiVersion: v1
items: []
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
Error from server (BadRequest): Unable to find "kudo.k8s.io/v1alpha1, Resource=instances" that match label selector "", field selector "spec.operatorVersion.name=flink-demo-0.1.0": field label not supported: spec.operatorVersion.name
➜  kudo git:(av/fix-ov-i-race) ✗ 

@jbarrick-mesosphere
Copy link
Member

What if we just drop the label selector entirely? We already filter when we iterate:

            for _, instance := range instances.Items {
                // Sanity check - lets make sure that this instance references the operatorVersion
                if instance.Spec.OperatorVersion.Name == a.Meta.GetName() &&
                    instance.GetOperatorVersionNamespace() == a.Meta.GetNamespace() &&
                    instance.Status.ActivePlan.Name == "" {

So I think it would work correctly, the only risk there might be performance issues?

@alenkacz
Copy link
Contributor Author

Reading https://github.com/kudobuilder/kudo/blob/master/pkg/controller/instance/instance_controller.go#L90 it seems like we're actually checking the OperatorVersion and not Operator name?

OOOOH! Yesterday I did not get what you mean but now I see it and I cannot unsee it! :D Nice catch. Yeah I'll go with your suggestion of droping filtering by that label. I think the only reason is really performance. Not sure if having a ton of instances in one namespace is actually a real world use case... At the same time I feel weird about hurting performance just because of convenience. Well I'll think about it a bit more over the day :)

@alenkacz alenkacz added the hold label Jul 24, 2019
@alenkacz alenkacz requested a review from runyontr as a code owner July 24, 2019 12:26
@alenkacz
Copy link
Contributor Author

@justinbarrick okay I removed it for now and left a comment there. Ideal would be to filter by operator version but we cannot do that right now because we removed OV from labels to prevent the statefulset update bug :D I guess we can introduce it when we re-introduce the label

@alenkacz alenkacz removed the hold label Jul 24, 2019
@jbarrick-mesosphere
Copy link
Member

Looks good! I'm going to do the circle ci loop again and see what falls out.

@jbarrick-mesosphere
Copy link
Member

@alenkacz
Copy link
Contributor Author

@jbarrick-mesosphere nooo, no more flakes! :D you're killing me. Okay I am going to take a look. If this PR is not making things worse and those failures are unrelated I am going to merge otherwise I'll iterate on this

@alenkacz
Copy link
Contributor Author

alenkacz commented Jul 25, 2019

@jbarrick-mesosphere I looked into the params failures, but did not figure out what is wrong. I see that there is big delay between creating the inner instance and that plan actually creating the inner deployment and then before nginx could have been started, there is timeout. I added some logs in #648 ... I would say it's probably not caused by this PR so I would merge it but I'll wait on your assesment. Actually since I don't know what's wrong it's hard to say what caused it :)

@oliviabarrick
Copy link
Contributor

nooo, no more flakes! :D you're killing me.

Gerred created a monster when he told me we have unlimited build minutes. 😂

I definitely agree we should move forward with merging this given that:

  1. I do not believe this edge case is new.
  2. This PR makes things better.

@alenkacz alenkacz merged commit d24e84e into master Jul 25, 2019
@alenkacz alenkacz deleted the av/fix-ov-i-race branch October 30, 2019 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When Instance is created before OperatorVersion, Instance may never be run
4 participants