
Old revisions running forever when using minScale #2720

Closed
nicolaferraro opened this issue Dec 14, 2018 · 24 comments

Labels: area/autoscale, kind/bug

@nicolaferraro

/area autoscale
/kind bug

Expected Behavior

When running a kservice with a runLatest configuration and an autoscaling.knative.dev/minScale annotation set to 1, Knative should apply the minScale configuration only to active revisions, not to "retired" revisions; otherwise old revisions will keep running forever.

I've seen a similar discussion in #755, but that wasn't a real problem since apparently it wasn't possible to deactivate scaling to 0. But the minScale=1 annotation actually deactivates scaling to 0 for a specific service.

Actual Behavior

Each time a kservice is updated, a new revision is added and all revisions are kept running forever.

Steps to Reproduce the Problem

echo '
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: helloworld-go
spec:
  runLatest:
    configuration:
      revisionTemplate:
        metadata:
          annotations:
            autoscaling.knative.dev/minScale: "1"
        spec:
          container:
            image: gcr.io/knative-samples/helloworld-go
            env:
            - name: TARGET
              value: "Go Sample v1"
' | kubectl apply -f -

echo '
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: helloworld-go
spec:
  runLatest:
    configuration:
      revisionTemplate:
        metadata:
          annotations:
            autoscaling.knative.dev/minScale: "1"
        spec:
          container:
            image: gcr.io/knative-samples/helloworld-go
            env:
            - name: TARGET
              value: "Go Sample v2"
' | kubectl apply -f -

Revision v1 is never recycled or scaled down.

Additional Info

Question related to this: is there a way to limit the number of total revisions to reduce the impact of the problem?

@knative-prow-robot knative-prow-robot added area/autoscale kind/bug Categorizes issue or PR as related to a bug. labels Dec 14, 2018
@carusology commented Dec 17, 2018

I ran into this last night when monkeying with minScale. As one would expect, the traffic only goes to the newest revision, even though the old revision has an active deployment. I'm able to work around this right now by deleting old revisions.
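
For reference, a minimal sketch of that workaround, assuming the helloworld-go revision names from the repro steps above (actual names will differ per cluster):

# find revisions that no route references anymore
kubectl get revisions

# delete the stale one; its deployment should go away with it
kubectl delete revision helloworld-go-00001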

@josephburnett (Contributor)

You can delete the revision and it will go away. Or you can just wait and it should get garbage collected after a while (if no route is referencing it). If that doesn't happen, it's a bug.

Also, you can just edit the pod autoscaler and delete the minScale annotation.
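
A rough sketch of that edit, assuming the resource in question is the PodAutoscaler that Knative creates per revision and that it carries the same name as the revision (the revision name here is hypothetical):

# drop the minScale annotation from the old revision's PodAutoscaler
kubectl annotate podautoscaler helloworld-go-00001 autoscaling.knative.dev/minScale-

# or edit it interactively and remove the annotation by hand
kubectl edit podautoscaler helloworld-go-00001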

@carusology commented Dec 17, 2018

you can just wait and it should get garbage collected after a while (if no route is referencing it). If that doesn't happen, it's a bug.

The current behavior is that it does not get garbage collected. The replicas property on the old revision's Kubernetes deployment doesn't get updated and it sits idle but still consuming resources.

@carusology commented Dec 20, 2018

Hey, @josephburnett:

Also, you can just edit the pod autoscaler and delete the minScale annotation.

Which Kubernetes resource for the pod autoscaler stores the minScale annotation? That sounds like a better solution than deleting the old revisions. Apologies in advance if this is obvious.

@mattmoor (Member) commented Jan 2, 2019

cc @greghaynes

Do they get GC'd?

@yuzisun commented Jan 2, 2019

I had the exact same issue. Manually deleting annotations on old revisions doesn't sound like a viable solution, so for now I delete the old revisions by hand, but I'll need to double-check whether revision GC works in this case. I guess with the default settings they will go away after 24h?

@greghaynes (Contributor)

Yes, the configuration controller should GC revisions not referenced by a route after stale-revision-create-delay (default 24h) - @carusology do you mean that it wasn't GC'd quickly or did you verify somehow that this 24h GC wasn't happening?

It seems like the desired behavior here is for minScale to not apply for non-routed revisions in order to more quickly delete the deployment for a revision? If so I could also see arguments for minScale being enforced for the full lifetime of a revision (maybe I have a reason I never want my app to scale to 0).

@greghaynes (Contributor)

I think if there's a desire for faster deletion of deployments (to free up the resources held by minScale), we could try to come up with a more responsive GC system for revisions...

@carusology commented Jan 2, 2019

@greghaynes

It isn't getting GC'd at all. I still have an old revision with an active deployment from December 17th (~2 weeks ago). The version deployed after it was using the runLatest container configuration, so this lingering revision behavior surprised me.

If you follow the repro steps from @nicolaferraro, the Kubernetes deployment that gets created to run a container of helloworld-go with the TARGET environment variable of "Go Sample v1" will never get scaled to 0 replicas.

The behavior I expected, right or wrong, was that when I deployed a service with runLatest and a minScale, any old revisions would be scaled to zero in the same way they would be without a minScale setting defined, presumably using the scale-to-zero-grace-period. I would definitely be surprised if I had to wait 24 hours rather than a few minutes for this scaling to occur, though that is still preferable to the current workarounds.
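
For context, a sketch of where that grace period is configured, assuming the stock knative-serving install and the key names used around the 0.2/0.3 releases (values shown are illustrative defaults):

kubectl -n knative-serving get configmap config-autoscaler -o yaml
# data:
#   enable-scale-to-zero: "true"
#   scale-to-zero-grace-period: "30s"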

@carusology commented Jan 2, 2019

It seems like the desired behavior here is for minScale to not apply for non-routed revisions in order to more quickly delete the deployment for a revision?

Yes, exactly. Though I think you're talking about less than 24h and the reporters of this issue are talking about less than never.

There's an interesting conversation here about this potentially causing issues if a Knative service is not intended to receive traffic but instead to be the source of it. If something were scaled to zero because it never receives traffic, then that use case of minScale would no longer be possible. I personally don't think this is something that should be done using Knative, but if it's an intended use case, it's something to consider.

@mattmoor (Member) commented Jan 3, 2019

Putting in v1 because they should at least be subject to GC.

@greghaynes (Contributor)

Agreed. I'm on holiday until Jan 9 but if no one has looked at it by then I'll have a look.

@mattmoor (Member) commented Jan 8, 2019

@josephburnett if it's cool with you I'm going to mark this as 0.4 scope, assuming @greghaynes has cycles to tackle it?

/milestone Serving 0.4

@greghaynes (Contributor)

I do!

/assign greghaynes

@josephburnett (Contributor)

Good that it's in milestone 0.4.

There are two issues here:

Issue 1: The revision should definitely be garbage collected normally, whether it has running pods or not. This is a bug outright and should be fixed as part of this issue.

Issue 2: Should minScale not apply when a revision is not routable? I don't think so. Consider this scenario: what if you have an application that needs 10 minutes to preprocess data before it can handle traffic? For this reason, you set minScale=1 on your service. And suppose you are using release mode to roll out new revisions. You start by referencing the old and new revisions with a 0% rollout. When you ramp up to 99%, you can still fall back to the old revision immediately because that old revision is still referenced in the traffic split. But once you go to 100%, the door slams behind you. Here is why: release mode expresses 100% as a list containing a single revision with a 0% rollout, so you must remove the reference to the old revision to finish. At that point, we would (hypothetically) scale the deployment to 0, despite the minScale=1 directive. If you realize 30 minutes later that the new revision is causing problems, a rollback takes 10 minutes at minimum because you have to warm that old revision up again. 😞
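
To make the 99% vs. 100% step concrete, a rough sketch of the two release-mode states in the v1alpha1 API (revision names are hypothetical, and the configuration block is elided):

# ramping: the old revision is still referenced, so rollback is immediate
spec:
  release:
    revisions: ["helloworld-go-00001", "helloworld-go-00002"]
    rolloutPercent: 99
    configuration: ...

# done: only the new revision is referenced; the old one is no longer routable
spec:
  release:
    revisions: ["helloworld-go-00002"]
    configuration: ...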

@greghaynes (Contributor)

@nicolaferraro Sorry it took a bit, but I just attempted to reproduce the issue where your revision 1 was not getting GC'd after 24h, and I was unable to. A couple of shots in the dark:

  • Are you sure there are no other routes which refer to revision-00001 of your service?
  • Do you know what version of knative you were using? GC has been around since 0.2.1.

If you are able to run from source, here's how you can turn the GC time span down, which should demonstrate working GC without having to wait 24h: master...greghaynes:test/quick-gc

Also, if you are able to run with the above changes and still see GC not occurring, could you do a kubectl get revision helloworld-go-00001 -oyaml and check whether metadata.annotations.serving.knative.dev/lastPinned is updating (it should update once per minute with this change)? Additionally, any errors in the logs of the controller pod would be very helpful.
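
A quick way to watch just that annotation (a sketch; the jsonpath escaping of the dots in the key may need tweaking depending on your kubectl version):

kubectl get revision helloworld-go-00001 \
  -o jsonpath='{.metadata.annotations.serving\.knative\.dev/lastPinned}'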

Thanks!

@AnthonyPoschen

Hi @greghaynes, I am pretty new to Kubernetes and Knative. I'm running on a small 4-node cluster, set up my first service in a CI/CD pipeline, and quickly noticed my resources disappearing as I pushed revisions out. This thread has been great for working out what is causing it and what the solution is. It would be great to have this highlighted more in the documentation if possible, potentially in a FAQ. It might help others out, as it caught me out fairly quickly.

@vagababov (Contributor) commented Jan 29, 2019

There's also the configuration item "StaleRevisionMinimumGenerations", which keeps N revisions around even if they are stale. Perhaps it's set > 1?
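
That knob, along with the 24h stale-revision-create-delay discussed earlier, should be visible in the GC ConfigMap; a sketch of what to check, assuming it is named config-gc in the knative-serving namespace (values are illustrative):

kubectl -n knative-serving get configmap config-gc -o yaml
# data:
#   stale-revision-create-delay: "24h"
#   stale-revision-timeout: "15h"
#   stale-revision-minimum-generations: "1"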

@carusology

@greghaynes

Do you know what version of knative you were using? GC has been around since 0.2.1.

Ah, nice catch. I was running v0.2.0 in the instance where it never got cleaned up (or at least, not after 2+ weeks). I'm glad to know that upgrading to v0.2.1+ will at least set a 24h cap on the old revision finally scaling back to 0.

@mattmoor (Member) commented Feb 2, 2019

Skimming the updates in this thread, it looks like things are working as intended (WAI), but our docs need work.

@greghaynes since you're assigned, do you want to take a crack at it?

cc @josephburnett as this is also related to minScale annotations, and not really a GC problem.

@yuzisun commented Feb 3, 2019

@greghaynes

Do you know what version of knative you were using? GC has been around since 0.2.1.

Ah, nice catch. I was running v0.2.0 in the instance where it never got cleaned up (or at least, not after 2+ weeks). I'm glad to know that upgrading to v0.2.1+ will at least set a 24h cap on the old revision finally scaling back to 0.

I am running v0.2.3 and the service has the minScale annotation, but the old revision has stayed around for more than 2 days and there is no traffic assigned to it.

kubectl get pods -n default
iaas-00001-deployment-6d75b6bc66-hc28w        3/3     Running                      0          2d19h
iaas-00002-deployment-6c56dc88d8-kdl69        3/3     Running                      0          2d15h
iaas-00002-deployment-6c56dc88d8-l2x74        3/3     Running                      0          2d15h
kubectl get routes iaas -n default -o yaml
spec:
  generation: 2
  traffic:
  - name: current
    percent: 100
    revisionName: iaas-00002
  - name: candidate
    percent: 0
    revisionName: iaas-00001
  - configurationName: iaas
    name: latest
    percent: 0

@mattmoor (Member) commented Feb 6, 2019

@yuzisun that looks right to me.

iaas-00001 won't go away until 24h after it's been removed from that block (it looks like it is being put there by a release mode Service)
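
In other words, the 24h GC clock for iaas-00001 only starts once the 0% "candidate" entry is dropped from whatever produces that traffic block (the release-mode Service here); a sketch of the resulting traffic section, keeping the names from the output above:

spec:
  traffic:
  - name: current
    percent: 100
    revisionName: iaas-00002
  - configurationName: iaas
    name: latest
    percent: 0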

@mattmoor (Member) commented Feb 8, 2019

/unassign @greghaynes
/assign @josephburnett
/milestone Serving 0.5

As this isn't a GC bug but a minScale documentation gap, I'm moving this to 0.5 and reassigning.

@yanweiguo (Contributor)

/unassign @josephburnett
/assign @yanweiguo
