
Old revisions running forever when using minScale #2720

Closed
nicolaferraro opened this issue Dec 14, 2018 · 24 comments

Labels: area/autoscale, kind/bug

@nicolaferraro

/area autoscale
/kind bug

Expected Behavior

When running a kservice with a runLatest configuration and an autoscaling.knative.dev/minScale annotation set to 1, Knative should apply the minScale configuration only to active revisions, not to "retired" revisions; otherwise old revisions will keep running forever.

I've seen a similar discussion in #755, but that wasn't a real problem since apparently it wasn't possible to deactivate scaling to 0. But the minScale=1 annotation actually deactivates scaling to 0 for a specific service.

Actual Behavior

Each time a kservice is updated, a new revision is added and all revisions are kept running forever.

Steps to Reproduce the Problem

echo '
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: helloworld-go
spec:
  runLatest:
    configuration:
      revisionTemplate:
        metadata:
          annotations:
            autoscaling.knative.dev/minScale: "1"
        spec:
          container:
            image: gcr.io/knative-samples/helloworld-go
            env:
            - name: TARGET
              value: "Go Sample v1"
' | kubectl apply -f -

echo '
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: helloworld-go
spec:
  runLatest:
    configuration:
      revisionTemplate:
        metadata:
          annotations:
            autoscaling.knative.dev/minScale: "1"
        spec:
          container:
            image: gcr.io/knative-samples/helloworld-go
            env:
            - name: TARGET
              value: "Go Sample v2"
' | kubectl apply -f -

Revision v1 is never recycled or scaled down.

Additional Info

Question related to this: is there a way to limit the number of total revisions to reduce the impact of the problem?

@knative-prow-robot knative-prow-robot added area/autoscale kind/bug Categorizes issue or PR as related to a bug. labels Dec 14, 2018
@carusology commented Dec 17, 2018

I ran into this last night when monkeying with minScale. As one would expect, the traffic only goes to the newest revision, even though the old revision has an active deployment. I'm able to work around this right now by deleting old revisions.
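
For reference, a minimal sketch of that workaround, assuming the helloworld-go revision names from the repro steps above (actual names will differ per cluster):

# find revisions that no route references anymore
kubectl get revisions

# delete the stale one; its deployment should go away with it
kubectl delete revision helloworld-go-00001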

@josephburnett (Contributor)

You can delete the revision and it will go away. Or you can just wait and it should get garbage collected after a while (if no route is referencing it). If that doesn't happen, it's a bug.

Also, you can just edit the pod autoscaler and delete the minScale annotation.
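
A rough sketch of that edit, assuming the resource in question is the PodAutoscaler that Knative creates per revision and that it carries the same name as the revision (the revision name here is hypothetical):

# drop the minScale annotation from the old revision's PodAutoscaler
kubectl annotate podautoscaler helloworld-go-00001 autoscaling.knative.dev/minScale-

# or edit it interactively and remove the annotation by hand
kubectl edit podautoscaler helloworld-go-00001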

@carusology commented Dec 17, 2018

you can just wait and it should get garbage collected after a while (if no route is referencing it). If that doesn't happen, it's a bug.

The current behavior is that it does not get garbage collected. The replicas property on the old revision's Kubernetes deployment doesn't get updated and it sits idle but still consuming resources.

@carusology commented Dec 20, 2018

Hey, @josephburnett:

Also, you can just edit the pod autoscaler and delete the minScale annotation.

Which Kubernetes resource for the pod autoscaler stores the minScale annotation? That sounds like a better solution than deleting the old revisions. Apologies in advance if this is obvious.

@mattmoor (Member) commented Jan 2, 2019

cc @greghaynes

Do they get GC'd?

@yuzisun commented Jan 2, 2019

I had the exact same issue. Manually deleting annotations on old revisions doesn't sound like a viable solution, so for now I delete the old revisions by hand, but I'll need to double-check whether revision GC works in this case. I guess with the default settings they will go away after 24h?

@greghaynes (Contributor)

Yes, the configuration controller should GC revisions not referenced by a route after stale-revision-create-delay (default 24h) - @carusology do you mean that it wasn't GC'd quickly or did you verify somehow that this 24h GC wasn't happening?

It seems like the desired behavior here is for minScale to not apply for non-routed revisions in order to more quickly delete the deployment for a revision? If so I could also see arguments for minScale being enforced for the full lifetime of a revision (maybe I have a reason I never want my app to scale to 0).

@greghaynes (Contributor)

I think if there's a desire for faster deletion of deployments (to free up the resources held by minScale), we could try to come up with a more responsive GC system for revisions...

@carusology commented Jan 2, 2019

@greghaynes

It isn't getting GC'd at all. I still have an old revision with an active deployment from December 17th (~2 weeks ago). The version deployed after it was using the runLatest container configuration, so this lingering revision behavior surprised me.

If you follow the repro steps from @nicolaferraro, the Kubernetes deployment that gets created to run a container of helloworld-go with the TARGET environment variable of "Go Sample v1" will never get scaled to 0 replicas.

The behavior I expected, right or wrong, was that when I deployed a service with runLatest and a minScale, any old revisions would be scaled to zero in the same way they would be without a minScale setting defined, presumably using the scale-to-zero-grace-period. I would definitely be surprised if I had to wait 24 hours rather than a few minutes for this scaling to occur, though that is still preferable to the current workarounds.
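
For context, a sketch of where that grace period is configured, assuming the stock knative-serving install and the key names used around the 0.2/0.3 releases (values shown are illustrative defaults):

kubectl -n knative-serving get configmap config-autoscaler -o yaml
# data:
#   enable-scale-to-zero: "true"
#   scale-to-zero-grace-period: "30s"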

@carusology commented Jan 2, 2019

It seems like the desired behavior here is for minScale to not apply for non-routed revisions in order to more quickly delete the deployment for a revision?

Yes, exactly. Though I think you're talking about less than 24h and the reporters of this issue are talking about less than never.

There's an interesting conversation here about this potentially causing issues if a Knative service is not intended to receive traffic but instead to be the source of it. If something were scaled to zero because it never receives traffic, then that use case of minScale would no longer be possible. I personally don't think this is something that should be done using Knative, but if it's an intended use case, it's something to consider.

@mattmoor (Member) commented Jan 3, 2019

Putting in v1 because they should at least be subject to GC.

@greghaynes (Contributor)

Agreed. I'm on holiday until Jan 9 but if no one has looked at it by then I'll have a look.

@mattmoor (Member) commented Jan 8, 2019

@josephburnett if it's cool with you I'm going to mark this as 0.4 scope, assuming @greghaynes has cycles to tackle it?

/milestone Serving 0.4

@greghaynes (Contributor)

I do!

/assign greghaynes

@josephburnett (Contributor)

Good that it's in milestone 0.4.

There are two issues here:

Issue 1: The revision should definitely be garbage collected normally, whether it has running pods or not. This is a bug outright and should be fixed as part of this issue.

Issue 2: Should minScale not apply when a revision is not routable? I don't think so. Consider this scenario: what if you have an application that needs 10 minutes to preprocess data before it can handle traffic? For this reason, you set minScale=1 on your service. And suppose you are using release mode to roll out new revisions. You start by referencing the old and new revisions with a 0% rollout. When you ramp up to 99%, you can still fall back to the old revision immediately because that old revision is still referenced in the traffic split. But once you go to 100%, the door slams behind you. Here is why: release mode expresses 100% as a list containing a single revision with a 0% rollout, so you must remove the reference to the old revision to finish. At that point, we would (hypothetically) scale the deployment to 0, despite the minScale=1 directive. If you realize 30 minutes later that the new revision is causing problems, a rollback takes 10 minutes at minimum because you have to warm that old revision up again. 😞
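
To make the 99% vs. 100% step concrete, a rough sketch of the two release-mode states in the v1alpha1 API (revision names are hypothetical, and the configuration block is elided):

# ramping: the old revision is still referenced, so rollback is immediate
spec:
  release:
    revisions: ["helloworld-go-00001", "helloworld-go-00002"]
    rolloutPercent: 99
    configuration: ...

# done: only the new revision is referenced; the old one is no longer routable
spec:
  release:
    revisions: ["helloworld-go-00002"]
    configuration: ...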

@greghaynes (Contributor)

@nicolaferraro Sorry it took a bit, but I just attempted to reproduce the issue where your revision 1 was not getting GC'd after 24h, and I was unable to. A couple of shots in the dark:

  • Are you sure there are no other routes which refer to revision-00001 of your service?
  • Do you know what version of knative you were using? GC has been around since 0.2.1.

If you are able to run from source, here's how you can turn the GC time span down, which should demonstrate working GC without having to wait 24h: master...greghaynes:test/quick-gc

Also, if you are able to run with the above changes and still see GC not occurring, could you do a kubectl get revision helloworld-go-00001 -oyaml and check whether metadata.annotations.serving.knative.dev/lastPinned is updating (it should update once per minute with this change)? Additionally, any errors in the logs of the controller pod would be very helpful.
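
A quick way to watch just that annotation (a sketch; the jsonpath escaping of the dots in the key may need tweaking depending on your kubectl version):

kubectl get revision helloworld-go-00001 \
  -o jsonpath='{.metadata.annotations.serving\.knative\.dev/lastPinned}'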

Thanks!

@AnthonyPoschen

Hi @greghaynes, I am pretty new to Kubernetes and Knative. I'm running on a small 4-node cluster, set up my first service in a CI/CD pipeline, and quickly noticed my resources disappearing as I pushed revisions out. This thread has been great for working out what is causing it and what the solution is. It would be great to have this highlighted more in the documentation if possible, potentially in a FAQ. It might help others out, as it caught me out fairly quickly.

@vagababov (Contributor) commented Jan 29, 2019

There's also the configuration item "StaleRevisionMinimumGenerations", which keeps N revisions around even if they are stale. Perhaps it's set > 1?
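
That knob, along with the 24h stale-revision-create-delay discussed earlier, should be visible in the GC ConfigMap; a sketch of what to check, assuming it is named config-gc in the knative-serving namespace (values are illustrative):

kubectl -n knative-serving get configmap config-gc -o yaml
# data:
#   stale-revision-create-delay: "24h"
#   stale-revision-timeout: "15h"
#   stale-revision-minimum-generations: "1"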

@carusology

@greghaynes

Do you know what version of knative you were using? GC has been around since 0.2.1.

Ah, nice catch. I was running v0.2.0 in the instance where it never got cleaned up (or at least, not after 2+ weeks). I'm glad to know that upgrading to v0.2.1+ will at least set a 24h cap on the old revision finally scaling back to 0.

@mattmoor (Member) commented Feb 2, 2019

Skimming the updates in this thread, it looks like things are working as intended (WAI), but our docs need work.

@greghaynes since you're assigned, do you want to take a crack at it?

cc @josephburnett as this is also related to minScale annotations, and not really a GC problem.

@yuzisun commented Feb 3, 2019

@greghaynes

Do you know what version of knative you were using? GC has been around since 0.2.1.

Ah, nice catch. I was running v0.2.0 in the instance where it never got cleaned up (or at least, not after 2+ weeks). I'm glad to know that upgrading to v0.2.1+ will at least set a 24h cap on the old revision finally scaling back to 0.

I am running v0.2.3 and the service has the minScale annotation, but the old revision has stayed around for more than 2 days and there is no traffic assigned to it.

kubectl get pods -n default
iaas-00001-deployment-6d75b6bc66-hc28w        3/3     Running                      0          2d19h
iaas-00002-deployment-6c56dc88d8-kdl69        3/3     Running                      0          2d15h
iaas-00002-deployment-6c56dc88d8-l2x74        3/3     Running                      0          2d15h
kubectl get routes iaas -n default -o yaml
spec:
  generation: 2
  traffic:
  - name: current
    percent: 100
    revisionName: iaas-00002
  - name: candidate
    percent: 0
    revisionName: iaas-00001
  - configurationName: iaas
    name: latest
    percent: 0

@mattmoor (Member) commented Feb 6, 2019

@yuzisun that looks right to me.

iaas-00001 won't go away until 24h after it's been removed from that block (it looks like it is being put there by a release mode Service)
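
In other words, the 24h GC clock for iaas-00001 only starts once the 0% "candidate" entry is dropped from whatever produces that traffic block (the release-mode Service here); a sketch of the resulting traffic section, keeping the names from the output above:

spec:
  traffic:
  - name: current
    percent: 100
    revisionName: iaas-00002
  - configurationName: iaas
    name: latest
    percent: 0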

@mattmoor (Member) commented Feb 8, 2019

/unassign @greghaynes
/assign @josephburnett
/milestone Serving 0.5

As this isn't a GC bug but a minScale documentation gap, I'm moving this to 0.5 and reassigning.

@yanweiguo (Contributor)

/unassign @josephburnett
/assign @yanweiguo
