
Don't scale to 1 upon deploy #4098

Closed
duglin opened this issue May 15, 2019 · 20 comments
Labels: area/API (API objects and controllers), area/autoscale, kind/feature (Well-understood/specified features, ready for coding), kind/spec (Discussion of how a feature should be exposed to customers)

Comments

@duglin commented May 15, 2019

In what area(s)?

/area API
/area autoscale

Other classifications:
/kind good-first-issue
/kind spec

Describe the feature

Right now, when a new KnService is deployed it is automatically scaled up to one instance. I believe this is incorrect; it should be scaled to zero by default, for several reasons:

  • when not being used, and without minScale > 0, services scale to zero; a freshly deployed service has not been used yet
  • when money is tied to running the pods associated with a service, end-users will not expect to be charged when no one has called their service; starting with a scale of 1 means they are charged even though no one has called it, violating the "only pay for what you use" contract
  • the current behavior treats "deploy" time differently from any other time the service is alive; I see no reason to think that the 90 seconds immediately after a service is created are any different from any other time
  • if people really think "deploy time" is special, then we should call it out explicitly by allowing the user to specify a default scale value (either on the Service itself or via some Kn-wide config) rather than assuming the same needs apply to all users. Then they can also choose something other than "1" if they want.

/cc @nimakaviani

Edit (proposal): provide a config knob so people can choose whether they want to scale to 1 on deploy.
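
To make the proposed knob concrete, here is a rough sketch of how it might surface as a per-Revision annotation on a Knative Service. The initial-scale annotation shown is hypothetical at the time of this discussion (it is the knob being asked for); minScale is the existing autoscaling annotation mentioned above, and the image name is a placeholder.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        # Existing knob: minimum number of replicas to keep for this Revision.
        autoscaling.knative.dev/minScale: "0"
        # Hypothetical knob from this proposal: replicas to create at deploy time.
        autoscaling.knative.dev/initial-scale: "0"
    spec:
      containers:
        - image: gcr.io/example/hello  # placeholder image
```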

@duglin added the kind/feature label May 15, 2019
@knative-prow-robot added the area/API, area/autoscale, kind/good-first-issue, and kind/spec labels May 15, 2019
@markusthoemmes (Contributor)

This is referred to in the runtime contract as a deployment probe. It's there to make sure your app is even viable and can start up in the first place.

We could enhance it and force a scale-down to 0 as soon as we see it become ready successfully; it has served its purpose at that point.

@duglin (Author) commented May 15, 2019

Maybe... I can certainly see why it's done the way it is, but technically there could be lots of reasons why it won't start up later on too, due to things changing in the environment, so I still think that treating "deploy time" as special is probably not appropriate.

But, as I mentioned, giving the user the choice when money might be involved would be nice.

@rgregg (Contributor) commented May 29, 2019

Without scaling the service to 1 during deployment, there are a number of failure conditions that we won't be able to detect at deployment time, postponing their discovery to runtime. That seems like a bad user experience.

What's the downside of creating an instance of the app initially?

@duglin (Author) commented May 30, 2019

In an environment where a customer only wants to be charged when their service is actually invoked, they're now going to be charged just because it was deployed, not invoked.

As I mentioned above, I think having some kind of initialScale type of flag (and we can bike-shed on whether its default is 1 or 0) would solve this. At least then the user (or cloud provider) can choose.

@mbehrendt

Adding to what @duglin wrote: if a customer wants to test whether what they deployed works, they'd just curl it. That would scale it from 0 to 1, and then immediately back to 0.

@mattmoor (Member)

If a customer wants to test whether what they deployed works, they'd just curl it.

They could if they are manually rolling revisions out, but this elides a fairly fundamental check for "run latest" style services.

@duglin (Author) commented Jun 27, 2019

The runtime contract says:


On the initial deployment, platform providers SHOULD start an instance of the container to validate that the container is valid and will become ready. This startup SHOULD occur even if the container would not serve any user requests.


Notice the SHOULDs in there. They are not MUSTs, so this is optional, and we should allow the user to leverage that optionality. I'm not sure I agree with the SHOULDs in the contract; I think even that is too strong and I'd prefer a MAY, but SHOULD still gives us the flexibility that we need.

@mattmoor (Member) commented Jun 27, 2019

ok, I'll elaborate on "fundamental".

I'm reminded of the time that someone on the Cloud Dataflow team checked in a Python syntax error that caused their container to crash on startup. You might not think that's such a big deal, but their integration testing environment decided to let 1000 or so VMs sit in this state overnight with imagePullPolicy: Always (as in, pull the image on every container restart). All was good on the GCR side, but I was pretty stressed out for a bit because GCR's QPS went up 10x for about 20 hours (for them it was devel, so who cares?). This was also in the fun days before the kubelet had rate limits.

I'm also reminded of a time before GCR was public that a PR was merged into the github repo backing the google/docker-registry image, which at the time was configured with automated builds to :latest. At the time this was used by a fledgling App Engine Flex product (before it was called that), which noticed the change and proceeded to pull it across the fleet. Turns out this was in the days of docker registry v1, and the OSS docker-registry was Python. Another syntax error, but this one actually managed to take down DockerHub for a few hours. This outage was effectively the reason imagePullPolicy: exists in Kubernetes, and has such a terrible default value (that makes me sad).

This type of error is really real and is all too common in dynamic languages, which are quite popular in the serverless space. However, statically typed languages aren't immune to boot-time failures either. You could omit a required flag, listen on the wrong port, or even do something as simple as the notorious multiple definitions of flag "log_dir".

We can even attack this from a higher level than language: what if I request a new revision with resource limits larger than can possibly be scheduled on any node in my cluster? What if I deploy a Windows container to a Linux cluster?

I could probably go on all night. The point of all of these is that if you don't boot the container to the point of a successful readiness probe before declaring it Ready and rolling it out, you'd miss every last one of these errors, and if no traffic ever comes in then the prior working versions might get GC'd before the user even realizes something is amiss.


MAY clearly sends the wrong message here; frankly, SHOULD sends the wrong message too (and it does surprise me that we've written SHOULD vs. MUST). However, in my experience you should take this SHOULD to mean "you'd be totally crazy not to", and I think we should seriously consider making it a MUST.

That said, changing SHOULD -> MUST is easier (breaking for vendor conformance), but MUST -> SHOULD is harder (breaking for customer applications).

@duglin (Author) commented Jun 27, 2019

I'm not questioning the need for this type of check in some cases. What I'm questioning is the need for this type of check in all cases, which is what Kn currently does. I think having an option to allow people to choose the behavior they want is the most appropriate. As stated above, in cases where time is money, people need to be given the choice of whether they want the cost of this check.

@mattmoor (Member) commented Jul 3, 2019

My point is that the cost to users of not having this check is outages.

The cost of having this check varies wildly by how a platform vendor decides to charge end-users. If I pay for K8s and run this on top, then there is effectively zero billing impact. If I pay per request, then there is effectively zero billing impact (unless you meter and charge for probes). If I pay by resource usage over time, then yes, deployments have a nominal 90s charge today (which I think we can bring down considerably at initial deployment). There are probably other models I haven't enumerated, but those are top of mind.

Serverless is a business of operators eating costs to provide a pay-for-use abstraction. Just because this cost exists doesn't mean it has to cost end-users money.

Frankly, "SHOULD" in the runtime contract also doesn't mean we need a knob here; it just means you can have an implementation that is conformant without it. What we let in here, we have to support, and I'm saying I don't want to support this foot cannon.

I think we should change this wording to "MUST" and pivot this discussion to how we reduce the cost of this check. Eliding this check is user-hostile: you'd save users a little bit of money while setting them up for uncaught failures.

@duglin (Author) commented Jul 3, 2019

What you stated above is a fine set of choices, but they are just choices, and based on what's out there today not all serverless platforms are making them (none in my quick/narrow check are). See #4522 (comment).

re: charging model and who pays those costs... Kn should not be in the business of telling companies what their charging model should be.

If Google wishes to eat the cost of the httpd server (and of any other threads the user might choose to run while not processing an incoming request, or to force the user to deal with any side effects of those threads despite the function never being invoked), that's a choice they are free to make. But not everyone else will want to make that same choice.

Note, I'm not asking to force everyone to scale to zero on deploy; all I'm asking for is the choice to opt out of it via a config knob, and given that there are existing, very popular platforms out there today that do not mandate this behavior, I think the request is reasonable.

@mbehrendt

I agree with @duglin.

For this to be a truly vibrant and open community effort, I think it is important to accept other valid positions and not assume my own position is the only true one on this planet, particularly when those positions can be enabled with a simple config parameter. Also, hand-coughing other vendors into implementing a given cost model by changing the runtime contract to a single prescriptive value doesn't send an inviting signal to other parties who might consider joining.

@mattmoor (Member) commented Jul 3, 2019

Nobody is "hand-coughing" here. Billing model came up because I'm trying to rationalize your position in order to find compromise and accommodate your perspective. My issues with this proposal have nothing to do with billing models.

I completely agree that you've highlighted a fantastic problem: 90s to scale to zero after initial deployment is far too high; let's reduce it, a lot. I'd bet that we can reduce this to <5s if we allow it to immediately hook in the activator and begin scaling down once the readiness check passes (see @markusthoemmes's comment above).

I appreciate your calling attention to this problem.


With regard to the runtime contract...

We have something in our RevisionSpec called a "readiness" probe. I cannot get past us calling a Revision "Ready" without even attempting its "readiness" probe. It is a simple contradiction in terms.

I can live with leaving SHOULD for executing a TCP probe by default, but I strongly believe that a user-specified readiness probe MUST be evaluated prior to declaring the Revision ready. That said, I believe allowing users to specify readiness probes is also a SHOULD.

Yes, this leaves the door open to an implementation disallowing readiness probes, opting out of default TCP probes, and always starting the deployment at scale zero. Furthermore, I'd wager that a majority of the systems @duglin surveyed also don't allow user-specified readiness probes.


With regard to implementing the above in our upstream repo...

I am opposed to having the upstream implementation support a knob for this. Knobs are a support and testing nightmare, of which we are already far too guilty, and this one worries me more than any we have today. It is not so "simple". What's more, we have also already seen poor traction getting fixes for other knobs you've added, which are "simple".


My opinions on this have nothing to do with my employer. I am advocating to get the best experience for our collective users that I can manage, and I'd do that even if I worked for IBM; that's just how I'm built. That works both ways, and I'm sure I've fought Googlers on topics that have benefited IBM and the rest of the community.

I'm sorry if you think I've discounted your perspective, but please stop discounting mine. I'm actually trying to find a solution to this that gets you what you want, and I've outlined a number of solutions that, as a steward of this repo and our specs, I can abide.

I can abide the terms of the runtime contract outlined above that allow implementations to do what you want. I can abide improvements to how quickly we scale down from the initial 1 pod. I cannot abide the added complexity and debt of keeping both of these paths working in our upstream implementation in perpetuity.
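
For reference on the readiness-probe point above, here is a minimal sketch of what a user-specified readinessProbe on a Knative Service looks like, assuming the standard corev1 probe fields that Serving exposes (the /healthz path is an app-specific example, the image is a placeholder, and exact probe support varies by release). The argument is that a Revision should not be declared Ready without this probe ever having run, which requires booting at least one pod.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: gcr.io/example/hello   # placeholder image
          readinessProbe:
            httpGet:
              path: /healthz            # app-specific health endpoint (example)
            periodSeconds: 1
```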

@duglin (Author) commented Jul 17, 2019

For reference - a previous discussion on this topic:
https://github.com/knative/serving/pull/1471/files#r200371570

poy added a commit to google/kf that referenced this issue Jul 18, 2019
The commands simply toggle the app.Spec.Instances.Stopped value. The
controller has been updated to delete the knative service if the app is
stopped. This is required because knative will otherwise delete the pods
and bring back a single pod for a while (even if scaling is set to 0):
knative/serving#4098

fixes #282
poy added a commit to google/kf that referenced this issue Jul 19, 2019
* pkg/kf/commands/apps: adds start and stop commands

The commands simply toggle the app.Spec.Instances.Stopped value. The
controller has been updated to delete the knative service if the app is
stopped. This is required because knative will otherwise delete the pods
and bring back a single pod for a while (even if scaling is set to 0):
knative/serving#4098

fixes #282

* adds integration test
@nimakaviani (Contributor)

Here is the link to the feature proposal document for this issue: https://docs.google.com/document/d/1RHaYgMUJ-a5ibhMMwmnpLQ3DHxdzvVdtGBwVwWERyio/edit

@dgerd added this to "To do" in the "Don't scale on deploy" project Jan 28, 2020
@knative-housekeeping-robot

Issues go stale after 90 days of inactivity.
Mark the issue as fresh by adding the comment /remove-lifecycle stale.
Stale issues rot after an additional 30 days of inactivity and eventually close.
If this issue is safe to close now please do so by adding the comment /close.

Send feedback to Knative Productivity Slack channel or file an issue in knative/test-infra.

/lifecycle stale

@knative-prow-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) Mar 3, 2020
@duglin (Author) commented Mar 3, 2020

/remove-lifecycle stale

@markusthoemmes (Contributor)

This is being worked on in #7682. Pulling into the milestone.

@taragu (Contributor) commented Aug 6, 2020

Closed via #8613
/close
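
For readers arriving later: below is a sketch of how the behavior discussed here is typically configured in subsequent Knative Serving releases, as I understand the outcome of #8613, via a cluster-level ConfigMap entry plus a per-Revision annotation. The key and annotation names are from memory and may differ by version, so check the release notes; the image is a placeholder.

```yaml
# Cluster-wide defaults, set by the platform operator.
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  # Permit Revisions to start at zero pods when they explicitly ask for it.
  allow-zero-initial-scale: "true"
  # Default initial scale for Revisions that do not set the annotation.
  initial-scale: "1"
---
# Per-Revision opt-in on a Service.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/initial-scale: "0"
    spec:
      containers:
        - image: gcr.io/example/hello  # placeholder image
```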

@knative-prow-robot (Contributor)

@taragu: Closing this issue.

In response to this:

Closed via #8613
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
