Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-2793: deployment: ignore defaultmode for volumes, accept unsolicited containers #176

Merged

Conversation

alebedev87
Copy link
Contributor

@alebedev87 alebedev87 commented Oct 18, 2022

Some lessons learnt from the other operators:

  • defaultMode for volumes is not part of the expected specification however it's added by the defaulting mechanism which leads to the fight between the operator and API server
  • accept unsolicited containers to enable use cases like side car container injection (service mesh or others)
  • Operator type: unused fields removed

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 18, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alebedev87

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 18, 2022
@alebedev87
Copy link
Contributor Author

failed to connect to the bootstrap machine: dial tcp 35.223.24.74:22: connect: connection refused

/retest

@candita
Copy link

candita commented Oct 19, 2022

/assign
/assign @Miciah

@candita
Copy link

candita commented Oct 20, 2022

@alebedev87 can you add a Jira link for this? @Miciah is going to contact you about this informal story on this one too.

@alebedev87
Copy link
Contributor Author

@candita : I was wondering whether we need a formal Jira bug for things we noticed ourselves which weren't reported by the customer and don't have real bugs (performance improvement at best). Maybe I missed this, is this a requirement now?

@candita
Copy link

candita commented Oct 21, 2022

@candita : I was wondering whether we need a formal Jira bug for things we noticed ourselves which weren't reported by the customer and don't have real bugs (performance improvement at best). Maybe I missed this, is this a requirement now?

Yes, we all create our own Jira bugs for things we've noticed ourselves. Without it, there is no qe, docs, release note, etc. On all the other repos there is a requirement for a valid bugzilla in the tide output. Not sure why it wasn't set up on this repo. But it does require docs-approved, px-approved, and qe-approved labels, which you won't get without a Jira.

Update - I notice you usually add those labels yourself!

for currName, currCont := range currentContMap {
// if the current container is expected: check its fields
if expCont, found := expectedContMap[currName]; found {
// ensure all expected containers are present,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, but have you considered using a deep copy instead of assignment in line 368, where you reset them all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could do a deep copy, that would be a little more cpu cycles but a bit safer in terms of the expected containers modifications done via updated variable by mistake. But here, we were safe as 1) no other function touches containers and 2) we return immediately so the same function doesn't have a chance to modify updated variable.

However your comment made me think about the if condition which you pointed into. Actually it prevents the extra containers from being kept in case the lengths of the current pod.spec.containers is not the same as expected, so in a lot of cases. I removed this condition to enable the injection of the extra container(s) in any case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

externalDNSDeploymentChanged already makes a deep copy of current to make update, so it seems to me that it would be superfluous to do another deep copy here.

testConfigMapVolume("extravolume", "extra", "files", "/etc"),
),
mutate: func(dep *appsv1.Deployment) {
dep.Spec.Template.Spec.Volumes = []corev1.Volume{testSecretVolume("testcreds", "testsecret", "creds", "/run/secrets")}
Copy link

@candita candita Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to append a new volume here? Or is this a test of adding a duplicate volume?

Copy link
Contributor Author

@alebedev87 alebedev87 Oct 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testcase is weirdly designed. It makes my head turn a little every time I approach it. Which is a sign to rewrite it.

originalDeployment is current state, mutate function changes the current state into the desired deployment. Then the current and desired deployments are passed to externalDNSDeploymentChanged which says if the current state needs to be updated and to which state (expectedDeployment).

So here, we have 2 volumes in the current (cluster) state and 1 volume as desired by the operator and we expect no changes to the current state as we now allow unsolicited volumes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, you now allow unsolicited containers. Is it unsolicited volumes are allowed as well as unsoliciated containers? And neither of them are marked as changes.

Copy link

@candita candita Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we expect no changes to the current state as we now allow unsolicited volumes.

But shouldn't you expect the expectedDeployment to only have one volume?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Miciah could you take a quick look at this test case and let me know if you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You take the current, mutate it, then compare it to the expected state after the mutation

No. The mutation creates the desired (by the operator) state.

but if you have applied the mutate function to the current set of volumes, it should have removed that volume. The expected state should only have the single volume you set in the mutate function.

Desired deployment != expected deployment. The desired deployment is what the operator desires to see. The expected deployment is what we expect to be in the cluster after the operator made the reconciliation. The enforcement of the desired state is done in the soft manner - only the volumes mentioned in the desired deployment has to be present, the rest is left intact. This test case demonstrates exactly this: we have 2 volumes in the cluster and one of them is exactly as desired, so we expect no change (false). expectedDepolyment field here is set but it's not even used as we only check it if expect is true.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Andrey's explanations in #176 (comment) and #176 (comment) make sense to me. original represents whatever is in etcd, tc.mutate(original.DeepCopy()) represents what the operator expects, and tc.expect is the result of reconciling those two things, and since the goal is for the operator to ignore volumes and sidecar containers that some external agent may inject, tc.expect should include any additional sidecars or volumes.

The unsolicited volumes were accepted even before this PR, that was done to cover the cases when, for instance, the kube api token is added by the api server.

Those only get injected into the pods, not into the deployment's pod template spec.

The testcase is weirdly designed. It makes my head turn a little every time I approach it.

This whole discussion reminds me of another: openshift/cluster-ingress-operator#611 (comment)

Copy link
Contributor Author

@alebedev87 alebedev87 Nov 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unsolicited volumes were accepted even before this PR, that was done to cover the cases when, for instance, the kube api token is added by the api server.

Those only get injected into the pods, not into the deployment's pod template spec.

Yes, you are right, I saw the api token volume in the POD, not in the deployment. Do you think we don't need to be that soft for the volumes?
I personally like the concept of the hard enforcement only of those things which can change the behavior of the operated PODs like args command, image etc. The volumes, in my opinion, can be approached softer - letting the cluster user to add some if it's needed.

The testcase is weirdly designed. It makes my head turn a little every time I approach it.

This whole discussion reminds me of another: openshift/cluster-ingress-operator#611 (comment)

I agree more with Steve's ordering but that's more religious than factual. Let me remove the TODO comment though and add your explanation to the test case, I think it helps understand the idea behind.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right, I saw the api token volume in the POD, not in the deployment. Do you think we don't need to be that soft for the volumes?

I don't think we need to be that permissive about volumes, but it's fine to leave it permissive in this PR (which is focused on allowing sidecars).

I personally like the concept of the hard enforcement only of those things which can change the behavior of the operated PODs like args command, image etc. The volumes, in my opinion, can be approached softer - letting the cluster user to add some if it's needed.

The risk of loose enforcement is that someone might start modifying the deployment in unexpected ways, and if some future change in external-dns-operator breaks that customization (for example, adding a conflicting volume with the same name), then that person will complain. However, we can leave things permissive for the purposes of getting this PR merged.

expect: false,
originalDeployment: testDeploymentWithVolumes(testSecretVolume("testcreds", "testsecret", "creds", "/run/secrets")),
mutate: func(dep *appsv1.Deployment) {
dep.Spec.Template.Spec.Volumes[0].VolumeSource.Secret.DefaultMode = nil
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dep.Spec.Template.Spec.Volumes[0].VolumeSource.Secret.DefaultMode = nil
dep.Spec.Template.Spec.Volumes[0].VolumeSource.Secret.DefaultMode = somethingthat'snotnil

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it comes form the misleading test design. Here we are saying that the desired deployment doesn't have the defaultMode set.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test where the defaultMode is set to something other than nil?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

expect: false,
originalDeployment: testDeploymentWithVolumes(testConfigMapVolume("testcerts", "testcerts", "key", "/etc/pki/trust")),
mutate: func(dep *appsv1.Deployment) {
dep.Spec.Template.Spec.Volumes[0].VolumeSource.ConfigMap.DefaultMode = nil
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dep.Spec.Template.Spec.Volumes[0].VolumeSource.ConfigMap.DefaultMode = nil
dep.Spec.Template.Spec.Volumes[0].VolumeSource.ConfigMap.DefaultMode = somethingnotnil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nil seems correct to me. Maybe the test case would be clearer if it were named something like "if the API sets a default mode on a configmap volume".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the name to the suggested.

Copy link

@candita candita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a few comments.

@alebedev87
Copy link
Contributor Author

Failed GCP installation.

/retest

@alebedev87 alebedev87 changed the title deployment: ignore defaultmode for volumes, accept unsolicited containers OCPBUGS-2793: deployment: ignore defaultmode for volumes, accept unsolicited containers Oct 24, 2022
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 24, 2022
@openshift-ci-robot
Copy link

@alebedev87: This pull request references Jira Issue OCPBUGS-2793, which is invalid:

  • expected the bug to target the "4.12.0" version, but it targets "4.12" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Some lessons learnt from the other operators:

  • defaultMode for volumes is not part of the expected specification however it's added by the defaulting mechanism which leads to the fight between the operator and API server
  • accept unsolicited containers to enable use cases like side car container injection (service mesh or others)
  • Operator type: unused fields removed

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alebedev87
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 24, 2022
@openshift-ci-robot
Copy link

@alebedev87: This pull request references Jira Issue OCPBUGS-2793, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from lihongan October 24, 2022 14:21
updated.Spec.Template.Spec.Containers = expected.Spec.Template.Spec.Containers
return true
}

currentContMap := buildIndexedContainerMap(current.Spec.Template.Spec.Containers)
expectedContMap := buildIndexedContainerMap(expected.Spec.Template.Spec.Containers)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to not have a container map for the updated.Spec.Template.Spec.Containers? You are mixing assignment by index with append in L397, within the same loop - are you sure that's going to work for every scenario?

Copy link
Contributor Author

@alebedev87 alebedev87 Nov 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to not have a container map for the updated.Spec.Template.Spec.Containers?

Yes, the current and the updated are the same in terms of the containers (what this function is supposed to check) at the beginning of the function. That's actually the whole point: updated == current + updated things.

You are mixing assignment by index with append in L397, within the same loop - are you sure that's going to work for every scenario?

That's an interesting finding, didn't think of it indeed. Good enough it doesn't impact the algorithm: the indexes are still pointing to the right containers as we append to the end of the container list and the indexed container map is not refreshed after each iteration.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

@alebedev87
Copy link
Contributor Author

alebedev87 commented Nov 7, 2022

Infoblox license renewed (3 months have flied by).

/retest

@@ -15,8 +15,8 @@ limitations under the License.
*/

// Package v1alpha1 contains API Schema definitions for the operator v1alpha1 API group
//+kubebuilder:object:generate=true
//+groupName=externaldns.olm.openshift.io
// +kubebuilder:object:generate=true
Copy link

@candita candita Nov 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to regenerate generated code after the api changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was just golang 1.19 new formatting of the comment markers, now they need a whitespace after //. No need to regenerate anything, if there would be anything we would catch it in the CI: it runs the markers to code and code to manifests generators to see if there is anything not committed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already in a935fda#diff-dab150f060822d42b5ee452169a22a98dca6669fef3b58f23a116caf26084955. Presumably this change and the one in api/v1beta1/groupversion_info.go would vanish with a rebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, rebased from main.

@lihongan
Copy link

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 10, 2022
@xenolinux
Copy link

/label docs-approved

@openshift-ci openshift-ci bot added the docs-approved Signifies that Docs has signed off on this PR label Nov 14, 2022
// deepequal is fine here as we don't have more than 1 item
// neither in the secret nor in the configmap
if !reflect.DeepEqual(currVol.Volume, expVol.Volume) {
if !cmp.Equal(currVol.Volume, expVol.Volume, ignoreFieldsConfigMap, ignoreFieldsSecret) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we changed cluster-ingress-operator to hash the deployment, cluster-ingress-operator similarly used go-cmp to compare volumes—it might be worth a look to see whether there are any logic or unit tests worth adapting to external-dns-operator:

https://github.com/openshift/cluster-ingress-operator/blob/5a82b658c4952e8af5d4658d13e8802b93ded870/pkg/operator/controller/ingress/deployment.go#L359
openshift/cluster-ingress-operator@48c17ff
openshift/cluster-ingress-operator@d32578a
openshift/cluster-ingress-operator@5a82b65

For example, the "if the volumes change ordering" test case might be a good one to add to external-dns-operator.

Cluster-ingress-operator's cmpSecretVolumeSource and cmpConfigMapVolumeSource also explicitly tested each field in the volume source types, which would reduce the likelihood of problems resulting from defaulting if upstream added new fields in these types. Explicitly testing each field might not be worth the bother though.

Copy link
Contributor Author

@alebedev87 alebedev87 Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I see where Steve was inspired from :)

For example, the "if the volumes change ordering" test case might be a good one to add to external-dns-operator.

The order of the volumes doesn't matter in this comparison as we construct maps from the volume slices and compare by keys, added the test case to confirm that though.

Cluster-ingress-operator's cmpSecretVolumeSource and cmpConfigMapVolumeSource also explicitly tested each field in the volume source types, which would reduce the likelihood of problems resulting from defaulting if upstream added new fields in these types. Explicitly testing each field might not be worth the bother though.

To me, the real added value of cmpSecretVolumeSource and cmpConfigMapVolumeSource is the check of the defaultMode if it's set in the desired deployment. EquateEmpty options works well even if specified on the top level of the comparison chain, so I added it on the Volume level. I'll keep defaultMode ignored for the moment, let's revisit the comparison once we'll need to set the desired value for it.

Comment on lines 2820 to 2821
// TODO: refactor this test case to remove the mutation logic as it's misleading.
// This test case should have current/desired/expected fields.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, it makes the test cases easier to read when each test case only specifies what changes, which is what the mutate func shows. I'm biased though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is weird, though, that some test cases also specify the original deployment and the expected deployment.

Copy link
Contributor Author

@alebedev87 alebedev87 Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think mutate() indeed may keep test cases shorter and clearer (clearer what changed, I use the test case name for this but it's less accurate). However I expected the mutate function to modify the desired (by the operator) state to get the current state while it's the opposite now.

The expected deployment may be different from the desired (e.g. unsolicited containers, volumes). I think it makes sense to verify what externalDNSDeploymentChanged sends to the API server at the end, that's why we need expectedDeployment.

Comment on lines 2881 to 2886
expect: true,
expect: false,
expectedDeployment: testDeploymentWithContainers([]corev1.Container{
testContainer(),
testContainerWithName("extra"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If expect is false, why specify expectedDeployment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason, removed it in all the test cases to avoid confusion.

@alebedev87 alebedev87 force-pushed the lessons-from-other-operators branch 2 times, most recently from f5ccc50 to 77bdc43 Compare November 22, 2022 17:23
@Miciah
Copy link
Contributor

Miciah commented Nov 23, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 23, 2022
@alebedev87
Copy link
Contributor Author

/label px-approved

As per Chris Fields' statement for this release of ExtDNS Operator - the docs are enough.

@openshift-ci openshift-ci bot added the px-approved Signifies that Product Support has signed off on this PR label Nov 24, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 24, 2022

@alebedev87: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit ac59633 into openshift:main Nov 24, 2022
@openshift-ci-robot
Copy link

@alebedev87: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-2793 has been moved to the MODIFIED state.

In response to this:

Some lessons learnt from the other operators:

  • defaultMode for volumes is not part of the expected specification however it's added by the defaulting mechanism which leads to the fight between the operator and API server
  • accept unsolicited containers to enable use cases like side car container injection (service mesh or others)
  • Operator type: unused fields removed

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. docs-approved Signifies that Docs has signed off on this PR jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. px-approved Signifies that Product Support has signed off on this PR qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants