Revision reconciler always updates the Kubernetes Deployment #13204

Open
SaschaSchwarze0 opened this issue Aug 12, 2022 · 15 comments
@SaschaSchwarze0
Contributor

SaschaSchwarze0 commented Aug 12, 2022

I was hunting down a performance issue that we are seeing when Knative reconciles all revisions, for example due to a controller restart or a global resync.

The problem is in checkAndUpdateDeployment. The code there uses MakeDeployment, just like the create case, to build the desired deployment. It then checks whether anything changed compared to the existing version by running this code:

	// If the spec we want is the spec we have, then we're good.
	if equality.Semantic.DeepEqual(have.Spec, deployment.Spec) {
		return have, nil
	}

This check never returns true, for the following reasons:

  1. Defaulting. Knative builds the deployment without setting fields in various places that Kubernetes defaults to some value, so those differences are always present. Examples: the Protocol field in ports (defaulted to TCP), TimeoutSeconds/PeriodSeconds/FailureThreshold in the queue-proxy probe, maxSurge in the rolling update strategy (defaulted to 25%), RevisionHistoryLimit (defaulted to 10), the APIVersion in an env.valueFrom.FieldRef (defaulted to v1), the restart policy (defaulted to Always), ...
  2. Mutations by others. Istio, for example, injects the labels service.istio.io/canonical-name and service.istio.io/canonical-revision into the pod template.

Why is this a problem? Assume there are just 500 revisions in the system. If every revision reconcile causes one Update call, then with a QPS of 40 (the default of 5 multiplied by a controller count of 8) these update calls alone need 12.5 seconds. With 3000 revisions, we are talking about 1m15s. And it is not just about the time; it is also the unnecessary hammering of the API server.

How can this be fixed? I currently have two ideas:

  1. Pass the previous version into MakeDeployment, where we then need to follow two strategies:

    a) Set defaults where applicable. If you define a TCP probe, specify that explicitly. Kubernetes is unlikely to change its default, and specifying the explicit value does not hurt. (A sketch of this is shown after this list.)
    b) Use the previous version to fill in fields that Knative does not care about but Kubernetes could change, such as maxSurge or the revision history limit.

    A sample of that can be found in SaschaSchwarze0@e3fe7f5. That commit also adds code in checkAndUpdateDeployment that dumps the differences, which one can apply separately to see the problem. This part is also explained below in the reproduction steps.

    But this approach has two disadvantages:

    a) it requires regular verification and extension as Kubernetes adds new fields with defaulting
    b) it is a large effort to capture every field that others could mutate, and it can never handle situations where others mutate changes on top of what Knative sets

  2. Pass a target object into MakeDeployment. This would be an empty object for the create case and the current version for an update. All the existing functions that build the deployment details, the pod spec, and its container details would need a rewrite to handle either an empty object or an object to be updated.

    This will imo be cleaner, but its complexity and the amount of changes grew beyond what I wanted to try out in code I am not familiar with, so I have no spike code to show here.

    This will also not solve situations where webhooks mutate things that Knative sets.
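For illustration, a minimal sketch of what option 1a could look like (the helper name and the exact set of fields are made up for this example; the real code in the linked commit may differ):

```go
package resources

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"knative.dev/pkg/ptr"
)

// setKubernetesDefaults is a hypothetical helper for option 1a: explicitly
// fill in the values the API server would default anyway, so a later
// DeepEqual against the stored object does not report spurious differences.
func setKubernetesDefaults(deploy *appsv1.Deployment) {
	if deploy.Spec.RevisionHistoryLimit == nil {
		deploy.Spec.RevisionHistoryLimit = ptr.Int32(10)
	}
	podSpec := &deploy.Spec.Template.Spec
	if podSpec.RestartPolicy == "" {
		podSpec.RestartPolicy = corev1.RestartPolicyAlways
	}
	if podSpec.DNSPolicy == "" {
		podSpec.DNSPolicy = corev1.DNSClusterFirst
	}
	for i := range podSpec.Containers {
		c := &podSpec.Containers[i]
		for j := range c.Ports {
			if c.Ports[j].Protocol == "" {
				c.Ports[j].Protocol = corev1.ProtocolTCP
			}
		}
		if c.TerminationMessagePath == "" {
			c.TerminationMessagePath = corev1.TerminationMessagePathDefault
		}
		if c.ImagePullPolicy == "" {
			// Assumed safe here because Knative resolves images to digests.
			c.ImagePullPolicy = corev1.PullIfNotPresent
		}
	}
}
```

Strategy 1b would instead copy fields like spec.strategy.rollingUpdate.maxSurge from the previously stored object rather than hard-coding a value.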

I am happy to discuss this. You can reach me in the Knative Slack (@sascha). I am also happy to help with the implementation.

--

One more thing I would like to mention: creating an object from scratch and then applying it to the cluster for both the create and the update case could be a general pattern. I only looked at how the revision controller configures the deployment, but I would not be surprised if that pattern (which I consider broken because of the explanations above) is used elsewhere. Whether that is a pain or not depends on how many objects of that kind are present and how often all of them get reconciled.

What version of Knative?

Current

Expected Behavior

Knative only updates Kubernetes deployments when there is a need to do so.

Actual Behavior

Every reconcile of a revision triggers a deployment update.

Steps to Reproduce the Problem

Add the following code before https://github.com/knative/serving/blob/v0.33.0/pkg/reconciler/revision/cruds.go#L93:

	debugDiffs, err := kmp.SafeDiff(have.Spec, deployment.Spec)
	if err != nil {
		return nil, err
	}
	// Log the (spurious) diff between the live spec and the freshly built spec.
	logger.Infof("Updating deployment %s/%s with diff: %s", deployment.Namespace, deployment.Name, debugDiffs)

Then leave Knative running while you have revisions in the system. You can observe the unnecessary updates in the container logs of the controller.

@SaschaSchwarze0 SaschaSchwarze0 added the kind/bug Categorizes issue or PR as related to a bug. label Aug 12, 2022
@psschwei
Contributor

For reference, here's the output of one of those logs after a reconcile

  v1.DeploymentSpec{
    Replicas: &0,
    Selector: &{MatchLabels: {"serving.knative.dev/revisionUID": "f7a53897-8006-48a9-b38f-57617fc8a892"}},
    Template: v1.PodTemplateSpec{
      ObjectMeta: {Labels: {"app": "hello-00001", "serving.knative.dev/configuration": "hello", "serving.knative.dev/configurationGeneration": "1", "serving.knative.dev/configurationUID": "819b6139-fde5-40fb-9c9d-30265e7f7663", ...}, Annotations: {"serving.knative.dev/creator": "minikube-user"}},
      Spec: v1.PodSpec{
        Volumes:        nil,
        InitContainers: nil,
        Containers: []v1.Container{
          {
            ... // 3 identical fields
            Args:       nil,
            WorkingDir: "",
            Ports: []v1.ContainerPort{
              {
                Name:          "user-port",
                HostPort:      0,
                ContainerPort: 8080,
-               Protocol:      "TCP",
+               Protocol:      "",
                HostIP:        "",
              },
            },
            EnvFrom: nil,
            Env:     {{Name: "TARGET", Value: "Go Sample v2"}, {Name: "PORT", Value: "8080"}, {Name: "K_REVISION", Value: "hello-00001"}, {Name: "K_CONFIGURATION", Value: "hello"}, ...},
            Resources: v1.ResourceRequirements{
-             Limits:   nil,
+             Limits:   v1.ResourceList{},
-             Requests: nil,
+             Requests: v1.ResourceList{},
            },
            VolumeMounts:   nil,
            VolumeDevices:  nil,
            LivenessProbe:  nil,
            ReadinessProbe: nil,
            StartupProbe:   nil,
            Lifecycle: &v1.Lifecycle{
              PostStart: nil,
              PreStop: &v1.LifecycleHandler{
                Exec: nil,
                HTTPGet: &v1.HTTPGetAction{
                  Path:        "/wait-for-drain",
                  Port:        {IntVal: 8022},
                  Host:        "",
-                 Scheme:      "HTTP",
+                 Scheme:      "",
                  HTTPHeaders: nil,
                },
                TCPSocket: nil,
              },
            },
-           TerminationMessagePath:   "/dev/termination-log",
+           TerminationMessagePath:   "",
            TerminationMessagePolicy: "FallbackToLogsOnError",
-           ImagePullPolicy:          "IfNotPresent",
+           ImagePullPolicy:          "",
            SecurityContext:          nil,
            Stdin:                    false,
            ... // 2 identical fields
          },
          {
            ... // 3 identical fields
            Args:       nil,
            WorkingDir: "",
            Ports: []v1.ContainerPort{
              {
                Name:          "http-queueadm",
                HostPort:      0,
                ContainerPort: 8022,
-               Protocol:      "TCP",
+               Protocol:      "",
                HostIP:        "",
              },
              {
                Name:          "http-autometric",
                HostPort:      0,
                ContainerPort: 9090,
-               Protocol:      "TCP",
+               Protocol:      "",
                HostIP:        "",
              },
              {
                Name:          "http-usermetric",
                HostPort:      0,
                ContainerPort: 9091,
-               Protocol:      "TCP",
+               Protocol:      "",
                HostIP:        "",
              },
              {
                Name:          "queue-port",
                HostPort:      0,
                ContainerPort: 8012,
-               Protocol:      "TCP",
+               Protocol:      "",
                HostIP:        "",
              },
              {
                Name:          "https-port",
                HostPort:      0,
                ContainerPort: 8112,
-               Protocol:      "TCP",
+               Protocol:      "",
                HostIP:        "",
              },
            },
            EnvFrom: nil,
            Env: []v1.EnvVar{
              ... // 8 identical elements
              {Name: "REVISION_RESPONSE_START_TIMEOUT_SECONDS", Value: "0"},
              {Name: "REVISION_IDLE_TIMEOUT_SECONDS", Value: "0"},
              {
                Name:  "SERVING_POD",
                Value: "",
                ValueFrom: &v1.EnvVarSource{
                  FieldRef: &v1.ObjectFieldSelector{
-                   APIVersion: "v1",
+                   APIVersion: "",
                    FieldPath:  "metadata.name",
                  },
                  ResourceFieldRef: nil,
                  ConfigMapKeyRef:  nil,
                  SecretKeyRef:     nil,
                },
              },
              {
                Name:  "SERVING_POD_IP",
                Value: "",
                ValueFrom: &v1.EnvVarSource{
                  FieldRef: &v1.ObjectFieldSelector{
-                   APIVersion: "v1",
+                   APIVersion: "",
                    FieldPath:  "status.podIP",
                  },
                  ResourceFieldRef: nil,
                  ConfigMapKeyRef:  nil,
                  SecretKeyRef:     nil,
                },
              },
              {Name: "SERVING_LOGGING_CONFIG"},
              {Name: "SERVING_LOGGING_LEVEL"},
              ... // 18 identical elements
            },
            Resources:     {Requests: {s"cpu": {i: {...}, s: "25m", Format: "DecimalSI"}}},
            VolumeMounts:  nil,
            VolumeDevices: nil,
            LivenessProbe: nil,
            ReadinessProbe: &v1.Probe{
              ProbeHandler: v1.ProbeHandler{
                Exec: nil,
                HTTPGet: &v1.HTTPGetAction{
-                 Path:        "/",
+                 Path:        "",
                  Port:        {IntVal: 8012},
                  Host:        "",
-                 Scheme:      "HTTP",
+                 Scheme:      "",
                  HTTPHeaders: {{Name: "K-Network-Probe", Value: "queue"}},
                },
                TCPSocket: nil,
                GRPC:      nil,
              },
              InitialDelaySeconds:           0,
-             TimeoutSeconds:                1,
+             TimeoutSeconds:                0,
-             PeriodSeconds:                 10,
+             PeriodSeconds:                 0,
              SuccessThreshold:              1,
-             FailureThreshold:              3,
+             FailureThreshold:              0,
              TerminationGracePeriodSeconds: nil,
            },
            StartupProbe:             nil,
            Lifecycle:                nil,
-           TerminationMessagePath:   "/dev/termination-log",
+           TerminationMessagePath:   "",
-           TerminationMessagePolicy: "File",
+           TerminationMessagePolicy: "",
-           ImagePullPolicy:          "IfNotPresent",
+           ImagePullPolicy:          "",
            SecurityContext:          &{Capabilities: &{Drop: {"all"}}, RunAsNonRoot: &true, ReadOnlyRootFilesystem: &true, AllowPrivilegeEscalation: &false, ...},
            Stdin:                    false,
            ... // 2 identical fields
          },
        },
        EphemeralContainers:           nil,
-       RestartPolicy:                 "Always",
+       RestartPolicy:                 "",
        TerminationGracePeriodSeconds: &300,
        ActiveDeadlineSeconds:         nil,
-       DNSPolicy:                     "ClusterFirst",
+       DNSPolicy:                     "",
        NodeSelector:                  nil,
        ServiceAccountName:            "",
        ... // 5 identical fields
        HostIPC:               false,
        ShareProcessNamespace: nil,
-       SecurityContext:       s"&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,FSGroupChangePolicy:nil,SeccompProfile:nil,}",
+       SecurityContext:       nil,
        ImagePullSecrets:      nil,
        Hostname:              "",
        Subdomain:             "",
        Affinity:              nil,
-       SchedulerName:         "default-scheduler",
+       SchedulerName:         "",
        Tolerations:           nil,
        HostAliases:           nil,
        ... // 11 identical fields
      },
    },
    Strategy: v1.DeploymentStrategy{
      Type: "RollingUpdate",
      RollingUpdate: &v1.RollingUpdateDeployment{
        MaxUnavailable: &{},
-       MaxSurge:       s"25%",
+       MaxSurge:       nil,
      },
    },
    MinReadySeconds:         0,
-   RevisionHistoryLimit:    &10,
+   RevisionHistoryLimit:    nil,
    Paused:                  false,
    ProgressDeadlineSeconds: &600,
  }

Not sure of the best way to do so, but this seems like something we should fix if possible.

/triage accepted

@knative-prow knative-prow bot added the triage/accepted Issues which should be fixed (post-triage) label Aug 16, 2022
@SaschaSchwarze0
Contributor Author

SaschaSchwarze0 commented Oct 4, 2022

Hi @psschwei, as I said in the description, I would be happy to help with the implementation. But I also mentioned that I do not know what the preferred design here would be. If I had enough self-confidence, I would have opened a PR and not an issue. So, what is your expectation after the above action?

@SaschaSchwarze0
Contributor Author

/assign psschwei

@skonto
Contributor

skonto commented Nov 6, 2022

@SaschaSchwarze0 @psschwei Hi, maybe something similar to this could help here. I haven't checked the details yet, i.e. whether defaulting first via K8s, then by Knative, and then passing the dep obj covers all cases, but it could be a start.

@SaschaSchwarze0
Contributor Author

SaschaSchwarze0 commented Nov 8, 2022

@SaschaSchwarze0 @psschwei Hi, maybe something similar to this could help here. I haven't checked the details yet, i.e. whether defaulting first via K8s, then by Knative, and then passing the dep obj covers all cases, but it could be a start.

@skonto this idea sounds interesting. But I do not know how this can work. Knative is a Kubernetes client and only has the client code available. The scheme's registry for defaulting functions does not contain one for Deployments. That makes sense, given that SetObjectDefaults_Deployment is part of the Kubernetes server code that Knative does not pick up - and which one is not supposed to import.

Even if it worked and we imported this function somehow, we would be importing the defaulting code of one specific Kubernetes version. I do not know how much this has changed between Kubernetes versions, but we might not be fully happy with that behavior.
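To make the concern concrete, here is a small illustration of why scheme-based defaulting does not help from the client side (my assumption based on how client-go is packaged; the example program is made up):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes/scheme"
)

func main() {
	d := &appsv1.Deployment{
		Spec: appsv1.DeploymentSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "user-container"}},
				},
			},
		},
	}

	// The client-go scheme only registers the types themselves; the defaulting
	// functions (SetObjectDefaults_Deployment and friends) live in the
	// k8s.io/kubernetes server packages and are not wired in here, so this
	// call leaves RestartPolicy, DNSPolicy, the port Protocol, etc. untouched.
	scheme.Scheme.Default(d)

	fmt.Printf("restartPolicy=%q dnsPolicy=%q\n",
		d.Spec.Template.Spec.RestartPolicy, d.Spec.Template.Spec.DNSPolicy)
	// Expected output (my assumption): restartPolicy="" dnsPolicy=""
}
```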

@SaschaSchwarze0
Contributor Author

SaschaSchwarze0 commented Dec 13, 2022

I opened a pull request with a different solution: a hash of the client-side-built Deployment spec is stored as a Deployment annotation so that a comparison can be performed later. This approach even works when mutating webhooks modify the Deployment in arbitrary ways, on top of the "known updates" that Kubernetes defaulting performs.
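Roughly, the idea looks like this (a simplified sketch, not the actual code from the pull request; the annotation key and helper names are invented here):

```go
package resources

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	appsv1 "k8s.io/api/apps/v1"
)

// Hypothetical annotation key; the real PR may use a different name.
const depSpecHashAnnotation = "serving.knative.dev/deployment-spec-hash"

// specHash hashes the Deployment spec exactly as the reconciler built it,
// before the API server or any webhook has a chance to mutate it.
func specHash(spec appsv1.DeploymentSpec) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsUpdate compares the hash stored on the live object with the hash of
// the freshly built desired spec; only a changed hash triggers an Update call.
func needsUpdate(have, desired *appsv1.Deployment) (bool, error) {
	want, err := specHash(desired.Spec)
	if err != nil {
		return false, err
	}
	return have.Annotations[depSpecHashAnnotation] != want, nil
}
```

Because the hash is computed over the spec exactly as the reconciler built it, later server-side defaulting or webhook mutation of the stored object has no influence on the comparison.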

@dprotaso dprotaso added this to the v1.10.0 milestone Jan 18, 2023
@dprotaso
Member

dprotaso commented Jan 18, 2023

I started to look at server-side apply and I believe it could solve the excess updates described in this issue as well as allow two controllers to coordinate reconciling a single resource (tracked here: knative/pkg#2128).

The 1.9 release is next week, but I think this is worth digging into for Knative 1.10 (something I can do).

@SaschaSchwarze0
Contributor Author

I started to look at server-side apply and I believe it could solve the excess updates described in this issue as well as allow two controllers to coordinate reconciling a single resource (tracked here: knative/pkg#2128).

The 1.9 release is next week, but I think this is worth digging into for Knative 1.10 (something I can do).

@dprotaso for me to understand ... can you explain how server-side apply helps to determine on the client side whether an update call to the server should be made or not? Just to make sure it is clear: I named the issue "Revision reconciler always updates the Kubernetes Deployment". What I really meant is that the reconciler always triggers an update call to the server without actually changing the object. And the call itself is the problem, as it puts unnecessary load on the API server and consumes Knative's "quota" (QPS).

@dprotaso
Member

can you explain how server-side apply helps to determine on the client side whether an update call to the server should be made or not?

Server-side apply makes use of managedFields on the resource. We can filter out on the client side all the fields that aren't managed by our controllers and then do a comparison of our desired spec. This in theory should remove the diffs caused by defaulting and potentially other controllers interacting with the resource.

Thus we would skip the SSA call if we know that all the fields the controller manages are set to their desired values.
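For reference, a minimal sketch of what the apply side could look like with the typed client-go apply configurations (illustrative only; the field-manager string, names, and image are made up, and the client-side managedFields comparison needed to skip the call is not shown):

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	appsv1ac "k8s.io/client-go/applyconfigurations/apps/v1"
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
	metav1ac "k8s.io/client-go/applyconfigurations/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// applyDeployment declares only the fields this controller cares about; the
// API server records them under the given field manager in managedFields and
// leaves fields owned by other managers (autoscaler, webhooks) alone.
func applyDeployment(ctx context.Context, kube kubernetes.Interface) error {
	labels := map[string]string{"app": "hello-00001"}

	desired := appsv1ac.Deployment("hello-00001-deployment", "default").
		WithSpec(appsv1ac.DeploymentSpec().
			WithSelector(metav1ac.LabelSelector().WithMatchLabels(labels)).
			WithTemplate(corev1ac.PodTemplateSpec().
				WithLabels(labels).
				WithSpec(corev1ac.PodSpec().
					WithContainers(corev1ac.Container().
						WithName("user-container").
						WithImage("gcr.io/knative-samples/helloworld-go")))))

	_, err := kube.AppsV1().Deployments("default").Apply(ctx, desired,
		metav1.ApplyOptions{FieldManager: "controller.knative.dev", Force: true})
	return err
}
```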

@SaschaSchwarze0
Contributor Author

can you explain how server-side apply helps to determine on the client side whether an update call to the server should be made or not?

Server-side apply makes use of managedFields on the resource. We can filter out on the client side all the fields that aren't managed by our controllers and then do a comparison of our desired spec. This in theory should remove the diffs caused by defaulting and potentially other controllers interacting with the resource.

Thus we would skip the SSA call if we know that all the fields the controller manages are set to their desired values.

Okay, I do not have much practical experience with server-side apply. Anyway, what about these scenarios:

(1) You do not always know what "all the fields the controller manages" are, I think. Let me try to explain:

The Knative controller creates a deployment. For simplicity, assume it sets the fields a, b, and c. You can determine from the managed fields that it set these fields. Now the next Knative version comes out and sets fields a, b, c, and d. The comparison logic will compare the old and new objects and determine that d is different. Given that d, based on the managed fields, has not previously been set by the Knative controller, you do not update the deployment as long as a, b, and c are unchanged.

But in this case, the hash of the first object version that you calculate locally differs from the hash of the new version. My proposal would call the update, which is correct.

(2) Even fields that you manage can change without you changing them, I think, once mutating admission is in the picture.

The Knative controller creates a deployment. It sets fields a, b, and c. A mutating admission webhook mutates field b. When the Knative controller reconciles the object again, it determines that b has changed and that it owns this field. It runs an update, and the same mutating admission webhook changes it back, making it a costly no-op. Basically, every reconciliation of the revision triggers an update call.

But in this case, the hash of the first object version that you calculate locally is the same as on the next reconciliations. My proposal would not call the update, which is correct.

@skonto
Contributor

skonto commented May 4, 2023

@SaschaSchwarze0 @dprotaso @psschwei

I did more digging into this. This has been discussed here: kubernetes-sigs/kubebuilder#592.
Possible solutions I see:
a) Use hashing for the part of the deployment we care about, as in here.
b) Use semantic.DeepDerivative(expected.Spec, found.Spec) as in kubernetes-sigs/kubebuilder#592 (comment), but only for the empty stuff (as in ray-project/kuberay#945). We would still need to set the non-empty default values, I would guess (see the sketch after this comment).
c) SSA is the goal, see: kubernetes-sigs/kubebuilder#592 (comment).
d) Use some other library that does the work for us: https://github.com/banzaicloud/k8s-objectmatcher

I would vote for c) if possible, and then check a) (hashing is an already-used approach) or b). I am not sure about depending on an external project like in d).
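For option b, the comparison would roughly look like this (illustrative only; as noted, fields we set to non-empty values still have to match what the API server stores):

```go
package resources

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// deepDerivativeEqual treats empty/zero fields in the desired spec as
// "unspecified", so values the API server defaulted (Protocol, DNSPolicy,
// RevisionHistoryLimit, ...) no longer show up as differences, while fields
// we explicitly set are still compared.
func deepDerivativeEqual(desired, have *appsv1.Deployment) bool {
	return equality.Semantic.DeepDerivative(desired.Spec, have.Spec)
}
```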

@SaschaSchwarze0
Contributor Author

SaschaSchwarze0 commented May 4, 2023

@skonto I do NOT consider (c) an option. Server-side apply always goes to the API server. Even if the call is cheap, it is still an API call. And with a (reasonable) QPS like 50 but 4000 revisions in the system, you do not do anything other than no-op updates on deployments for 80 seconds.

EDIT: in case you are not aware, we have been running successfully with option (a) for half a year now.

@skonto
Contributor

skonto commented May 4, 2023

Regarding the load, it depends on the use case. However, I thought the idea was to only get a reconciliation trigger when a field we manage is touched, so defaults for fields we don't set would not trigger anything.
If we care about a field, we should set it properly with our values. It would be nice to understand whether that works as assumed here. Anyway, we still have options; hashing is a valid option we could adopt as well.

@SaschaSchwarze0
Contributor Author

However, I thought the idea was to only get a reconciliation trigger when a field we manage is touched, so defaults for fields we don't set would not trigger anything.

Sounds interesting. I personally have not spent time figuring out if and how the number of revision reconciles can be reduced. It's mainly these groups:

  1. Creation of the revision. Definitely a reasonable reconcile.
  2. Knative configuration changes that cause a reconcile of everything (like changing the queue-proxy image in config-deployment). Those reconciles are necessary because the deployment really has to be updated.
  3. The controller leader changed. Those reconciles are necessary on startup, but they should not cause a deployment update if nothing actually changed.
  4. An owned object was updated, in particular the deployment itself, be it its replicas field changed by the autoscaler or its status. I am not sure there is any scenario where this should actually cause the controller to also update the deployment.

So yeah, if there are ways to reduce the number of revision reconciles, that would also be cool.

@dprotaso
Member

dprotaso commented May 4, 2023

I think SSA is the right approach - you can perform diffs locally, only on the fields the Knative controller manages, so that would prevent the excess API calls you are worried about.

The other benefit is that it should reduce update conflicts between the controller and the autoscaler, because they care about different properties of the deployment.

I will focus on this for v1.11

/assign @dprotaso
/unassign @SaschaSchwarze0

@knative-prow knative-prow bot assigned dprotaso and unassigned SaschaSchwarze0 May 4, 2023
@dprotaso dprotaso removed this from the v1.11.0 milestone Aug 16, 2023