
Don't scale to 1 upon deploy #4098

Closed
duglin opened this issue May 15, 2019 · 20 comments
Labels: area/API (API objects and controllers), area/autoscale, kind/feature (Well-understood/specified features, ready for coding), kind/spec (Discussion of how a feature should be exposed to customers)

Comments

@duglin commented May 15, 2019

In what area(s)?

/area API
/area autoscale

Other classifications:
/kind good-first-issue
/kind spec

Describe the feature

Right now, when a new KnService is deployed it is automatically scaled up to one instance. I believe this is incorrect; it should be scaled to zero by default, for several reasons:

  • when not being used, and without minScale > 0, services scale to zero; a freshly deployed service has not been used yet
  • when money is tied to running the pods associated with a service, end-users will not expect to be charged when no one has called their service; starting with a scale of 1 means they are charged even though no one has called it, violating the "only pay for what you use" contract
  • the current behavior treats "deploy" time differently from any other time the service is alive; I see no reason to think that the 90 seconds immediately after a service is created are any different from any other time
  • if people really think "deploy time" is special, then we should call it out explicitly by allowing the user to specify a default scale value (either on the Service itself or via some Kn-wide config) rather than assuming the same needs apply to all users. Then they can also choose something other than "1" if they want.

/cc @nimakaviani

Edit (proposal): provide a config knob so people can choose whether they want to scale to 1 on deploy.
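
To make the proposed knob concrete, here is a rough sketch of how it might surface as a per-Revision annotation on a Knative Service. The initial-scale annotation shown is hypothetical at the time of this discussion (it is the knob being asked for); minScale is the existing autoscaling annotation mentioned above, and the image name is a placeholder.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        # Existing knob: minimum number of replicas to keep for this Revision.
        autoscaling.knative.dev/minScale: "0"
        # Hypothetical knob from this proposal: replicas to create at deploy time.
        autoscaling.knative.dev/initial-scale: "0"
    spec:
      containers:
        - image: gcr.io/example/hello  # placeholder image
```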

@duglin added the kind/feature label May 15, 2019
@knative-prow-robot added the area/API, area/autoscale, kind/good-first-issue, and kind/spec labels May 15, 2019
@markusthoemmes (Contributor)

This is referred to in the runtime contract as a deployment probe. It's there to make sure your app is even viable and can start up in the first place.

We could enhance it and force a scale-down to 0 as soon as we see it become ready successfully; it has served its purpose at that point.

@duglin (Author) commented May 15, 2019

Maybe... I can certainly see why it's done the way it is, but technically there could be lots of reasons why it won't start up later on too, due to things changing in the environment, so I still think that treating "deploy time" as special is probably not appropriate.

But, as I mentioned, giving the user the choice when money might be involved would be nice.

@rgregg (Contributor) commented May 29, 2019

Without scaling the service to 1 during deployment, there are a number of failure conditions that we won't be able to detect at deployment time, postponing their discovery to runtime. That seems like a bad user experience.

What's the downside of creating an instance of the app initially?

@duglin (Author) commented May 30, 2019

In an environment where a customer only wants to be charged when their service is actually invoked, they're now going to be charged just because it was deployed, not invoked.

As I mentioned above, I think having some kind of initialScale type of flag (and we can bike-shed on whether its default is 1 or 0) would solve this. At least then the user (or cloud provider) can choose.

@mbehrendt

Adding to what @duglin wrote: if a customer wants to test whether what they deployed works, they'd just curl it. That would scale it from 0 to 1, and then immediately back to 0.

@mattmoor (Member)

If a customer wants to test whether what they deployed works, they'd just curl it.

They could if they are manually rolling revisions out, but this elides a fairly fundamental check for "run latest" style services.

@duglin (Author) commented Jun 27, 2019

The runtime contract says:


On the initial deployment, platform providers SHOULD start an instance of the container to validate that the container is valid and will become ready. This startup SHOULD occur even if the container would not serve any user requests.


Notice the SHOULDs in there. They are not MUSTs, so this is optional, and we should allow the user to leverage that optionality. I'm not sure I agree with the SHOULDs in the contract; I think even that is too strong and I'd prefer a MAY, but SHOULD still gives us the flexibility that we need.

@mattmoor (Member) commented Jun 27, 2019

ok, I'll elaborate on "fundamental".

I'm reminded of the time that someone on the Cloud Dataflow team checked in a Python syntax error that caused their container to crash on startup. You might not think that's such a big deal, but their integration testing environment decided to let 1000 or so VMs sit in this state overnight with imagePullPolicy: Always (as in, pull the image on every container restart). All was good on the GCR side, but I was pretty stressed out for a bit because GCR's QPS went up 10x for about 20 hours (for them it was devel, so who cares?). This was also in the fun days before the kubelet had rate limits.

I'm also reminded of a time before GCR was public that a PR was merged into the github repo backing the google/docker-registry image, which at the time was configured with automated builds to :latest. At the time this was used by a fledgling App Engine Flex product (before it was called that), which noticed the change and proceeded to pull it across the fleet. Turns out this was in the days of docker registry v1, and the OSS docker-registry was Python. Another syntax error, but this one actually managed to take down DockerHub for a few hours. This outage was effectively the reason imagePullPolicy: exists in Kubernetes, and has such a terrible default value (that makes me sad).

This type of error is really real and is all too common in dynamic languages, which are quite popular in the serverless space. However, statically typed languages aren't immune to boot-time failures either. You could omit a required flag, listen on the wrong port, or even do something as simple as the notorious multiple definitions of flag "log_dir".

We can even attack this from a higher level than language: what if I request a new revision with resource limits larger than can possibly be scheduled on any node in my cluster? What if I deploy a Windows container to a Linux cluster?

I could probably go on all night. The point of all of these is that if you don't boot the container to the point of a successful readiness probe before declaring it Ready and rolling it out, you'd miss every last one of these errors, and if no traffic ever comes in then the prior working versions might get GC'd before the user even realizes something is amiss.


MAY clearly sends the wrong message here; frankly, SHOULD sends the wrong message too (and it does surprise me that we've written SHOULD vs. MUST). However, in my experience you should take this SHOULD to mean "you'd be totally crazy not to", and I think we should seriously consider making it a MUST.

That said, changing SHOULD -> MUST is easier (breaking for vendor conformance), but MUST -> SHOULD is harder (breaking for customer applications).

@duglin (Author) commented Jun 27, 2019

I'm not questioning the need for this type of check in some cases. What I'm questioning is the need for this type of check in all cases, which is what Kn currently does. I think having an option to allow people to choose the behavior they want is the most appropriate. As stated above, in cases where time is money, people need to be given the choice of whether they want the cost of this check.

@mattmoor (Member) commented Jul 3, 2019

My point is that the cost to users of not having this check is outages.

The cost of having this check varies wildly by how a platform vendor decides to charge end-users. If I pay for K8s and run this on top, then there is effectively zero billing impact. If I pay per request, then there is effectively zero billing impact (unless you meter and charge for probes). If I pay by resource usage over time, then yes, deployments have a nominal 90s charge today (which I think we can bring down considerably at initial deployment). There are probably other models I haven't enumerated, but those are top of mind.

Serverless is a business of operators eating costs to provide a pay-for-use abstraction. Just because this cost exists doesn't mean it has to cost end-users money.

Frankly, "SHOULD" in the runtime contract also doesn't mean we need a knob here; it just means you can have an implementation that is conformant without it. What we let in here, we have to support, and I'm saying I don't want to support this foot cannon.

I think we should change this wording to "MUST" and pivot this discussion to how we reduce the cost of this check. Eliding this check is user-hostile: you'd save users a little bit of money while setting them up for uncaught failures.

@duglin (Author) commented Jul 3, 2019

What you stated above is a fine set of choices, but they are just choices, and based on what's out there today not all serverless platforms are making them (none in my quick/narrow check are). See #4522 (comment).

re: charging model and who pays those costs... Kn should not be in the business of telling companies what their charging model should be.

If Google wishes to eat the cost of the httpd server (and of any other threads the user might choose to run while not processing an incoming request, or to force the user to deal with any side effects of those threads despite the function never being invoked), that's a choice they are free to make. But not everyone else will want to make that same choice.

Note, I'm not asking to force everyone to scale to zero on deploy; all I'm asking for is the choice to opt out of it via a config knob, and given that there are existing, very popular platforms out there today that do not mandate this behavior, I think the request is reasonable.

@mbehrendt

I agree with @duglin.

For this to be a truly vibrant and open community effort, I think it is important to accept other valid positions and not assume my own position is the only true one on this planet, particularly when those positions can be enabled with a simple config parameter. Also, hand-coughing other vendors into implementing a given cost model by changing the runtime contract to a single prescriptive value doesn't send an inviting signal to other parties who might consider joining.

@mattmoor (Member) commented Jul 3, 2019

Nobody is "hand-coughing" here. Billing model came up because I'm trying to rationalize your position in order to find compromise and accommodate your perspective. My issues with this proposal have nothing to do with billing models.

I completely agree that you've highlighted a fantastic problem: 90s to scale to zero after initial deployment is far too high; let's reduce it, a lot. I'd bet that we can reduce this to <5s if we allow it to immediately hook in the activator and begin scaling down once the readiness check passes (see @markusthoemmes's comment above).

I appreciate your calling attention to this problem.


With regard to the runtime contract...

We have something in our RevisionSpec called a "readiness" probe. I cannot get past us calling a Revision "Ready" without even attempting its "readiness" probe. It is a simple contradiction in terms.

I can live with leaving SHOULD for executing a TCP probe by default, but I strongly believe that a user-specified readiness probe MUST be evaluated prior to declaring the Revision ready. That said, I believe allowing users to specify readiness probes is also a SHOULD.

Yes, this leaves the door open to an implementation disallowing readiness probes, opting out of default TCP probes, and always starting the deployment at scale zero. Furthermore, I'd wager that a majority of the systems @duglin surveyed also don't allow user-specified readiness probes.


With regard to implementing the above in our upstream repo...

I am opposed to having the upstream implementation support a knob for this. Knobs are a support and testing nightmare, of which we are already far too guilty, and this one worries me more than any we have today. It is not so "simple". What's more, we have also already seen poor traction getting fixes for other knobs you've added, which are "simple".


My opinions on this have nothing to do with my employer. I am advocating to get the best experience for our collective users that I can manage, and I'd do that even if I worked for IBM; that's just how I'm built. That works both ways, and I'm sure I've fought Googlers on topics that have benefited IBM and the rest of the community.

I'm sorry if you think I've discounted your perspective, but please stop discounting mine. I'm actually trying to find a solution to this that gets you what you want, and I've outlined a number of solutions that, as a steward of this repo and our specs, I can abide.

I can abide the terms of the runtime contract outlined above that allow implementations to do what you want. I can abide improvements to how quickly we scale down from the initial 1 pod. I cannot abide the added complexity and debt of keeping both of these paths working in our upstream implementation in perpetuity.
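
For reference on the readiness-probe point above, here is a minimal sketch of what a user-specified readinessProbe on a Knative Service looks like, assuming the standard corev1 probe fields that Serving exposes (the /healthz path is an app-specific example, the image is a placeholder, and exact probe support varies by release). The argument is that a Revision should not be declared Ready without this probe ever having run, which requires booting at least one pod.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: gcr.io/example/hello   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz            # app-specific health endpoint (example)
            periodSeconds: 1
```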

@duglin (Author) commented Jul 17, 2019

For reference - a previous discussion on this topic:
https://github.com/knative/serving/pull/1471/files#r200371570

poy added a commit to google/kf that referenced this issue Jul 18, 2019
The commands simply toggle the app.Spec.Instances.Stopped value. The
controller has been updated to delete the knative service if the app is
stopped. This is required because knative will otherwise delete the pods
and bring back a single pod for a while (even if scaling is set to 0):
knative/serving#4098

fixes #282
poy added a commit to google/kf that referenced this issue Jul 19, 2019
* pkg/kf/commands/apps: adds start and stop commands

The commands simply toggle the app.Spec.Instances.Stopped value. The
controller has been updated to delete the knative service if the app is
stopped. This is required because knative will otherwise delete the pods
and bring back a single pod for a while (even if scaling is set to 0):
knative/serving#4098

fixes #282

* adds integration test
@nimakaviani (Contributor)

Here is the link to the feature proposal document for this issue: https://docs.google.com/document/d/1RHaYgMUJ-a5ibhMMwmnpLQ3DHxdzvVdtGBwVwWERyio/edit

@dgerd added this to "To do" in the "Don't scale on deploy" project Jan 28, 2020
@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Mar 3, 2020
@duglin (Author) commented Mar 3, 2020

/remove-lifecycle stale

@markusthoemmes (Contributor)

This is being worked on in #7682. Pulling into the milestone.

@taragu (Contributor) commented Aug 6, 2020

Closed via #8613
/close
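
For readers arriving later: below is a sketch of how the behavior discussed here is typically configured in subsequent Knative Serving releases, as I understand the outcome of #8613, via a cluster-level ConfigMap entry plus a per-Revision annotation. The key and annotation names are from memory and may differ by version, so check the release notes; the image is a placeholder.

```yaml
# Cluster-wide defaults, set by the platform operator.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Permit Revisions to start at zero pods when they explicitly ask for it.
  allow-zero-initial-scale: "true"
  # Default initial scale for Revisions that do not set the annotation.
  initial-scale: "1"
---
# Per-Revision opt-in on a Service.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/initial-scale: "0"
    spec:
      containers:
        - image: gcr.io/example/hello  # placeholder image
```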

@knative-prow-robot (Contributor)

@taragu: Closing this issue.

In response to this:

Closed via #8613
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
