
Immediately scale up autoscaler for cold-start #2963

Conversation

greghaynes (Contributor):

When the autoscaler gets a stat to scale a service up from 0, we should act on it immediately in order to decrease cold-start time.

Closes #2961

Release Note

Immediately scale from zero in autoscaler.
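For illustration, a minimal self-contained sketch of the idea in Go - the names (scalerRunner, multiScaler, tickNow, recordStat) are illustrative stand-ins, not the actual pkg/autoscaler API: when a stat arrives for a revision whose desired scale is 0, run a scale evaluation immediately instead of waiting up to one tick interval.

```go
package main

import (
	"fmt"
	"sync"
)

// scalerRunner is an illustrative stand-in for the per-revision runner;
// it guards the last computed desired scale with a mutex.
type scalerRunner struct {
	mu           sync.Mutex
	desiredScale int32
}

func (r *scalerRunner) getLatestScale() int32 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.desiredScale
}

// multiScaler is a toy version holding one runner per revision key and a
// hypothetical tickNow hook that runs a scale evaluation immediately.
type multiScaler struct {
	runners map[string]*scalerRunner
	tickNow func(key string)
}

// recordStat models the hot path: after recording the stat (elided), a
// revision sitting at zero gets an immediate tick, shaving up to one
// tick interval off the cold-start latency.
func (ms *multiScaler) recordStat(key string) {
	r, ok := ms.runners[key]
	if !ok {
		return
	}
	if r.getLatestScale() == 0 {
		ms.tickNow(key)
	}
}

func main() {
	ms := &multiScaler{
		runners: map[string]*scalerRunner{"rev-1": {}},
		tickNow: func(key string) { fmt.Println("immediate tick for", key) },
	}
	ms.recordStat("rev-1") // desiredScale is 0, so this ticks right away
}
```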

@knative-prow-robot knative-prow-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jan 23, 2019
knative-prow-robot (Contributor) left a comment:

@greghaynes: 0 warnings.


markusthoemmes (Contributor) left a comment:

Approach is LGTM, but I think there's a bug here w.r.t. how we find out whether it's the activator sending the metrics.

On that note: is it even important that the activator is the one sending the metric in the end? Is there any harm in dropping that requirement and forcing a tick on any incoming metric when at 0?

```go
scaler.scaler.Record(ctx, stat)
if scaler.metric.Status.DesiredScale == 0 && stat.PodName == ActivatorPodName {
```
markusthoemmes (Contributor):

Since we're scaling the activator, this won't work anymore, I believe. You presumably need to do the same check the autoscaler does, which is: if strings.HasPrefix(stat.PodName, ActivatorPodName) {

Has this worked locally?

Can the first half of this be scaler.getLatestScale() == 0?
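For illustration, a self-contained sketch of the suggested prefix check - the constant's value and the isActivator helper are assumptions for the sketch; only the strings.HasPrefix(stat.PodName, ActivatorPodName) condition comes from the comment above:

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed value for the sketch; the real constant lives in the
// autoscaler package.
const ActivatorPodName = "activator"

// isActivator reports whether a stat came from an activator pod. An
// exact equality check breaks once the activator itself is scaled out,
// because each replica gets a suffixed pod name; matching on the
// prefix covers every replica.
func isActivator(podName string) bool {
	return strings.HasPrefix(podName, ActivatorPodName)
}

func main() {
	fmt.Println(isActivator("activator-7d9b645d6-x2swl")) // true
	fmt.Println(isActivator("my-app-00001-deployment-0")) // false
}
```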

greghaynes (Contributor, Author):

Good find - I removed the pod name check entirely, since it really shouldn't matter where the stat came from (as you point out).

@mattmoor mattmoor added this to the Serving 0.4 milestone Jan 28, 2019
mattmoor (Member):

+1 to @markusthoemmes's comment, but once we reverse the flow of metrics (@yanweiguo is working on this), any push will come from the activator anyway - so will it still matter?

@greghaynes I put this in 0.4, so LMK if that timeframe is a problem.

@greghaynes greghaynes force-pushed the fix/autoscaler-immediate-scale-from-zero branch from 4be0da9 to fcaf0c1 Compare January 30, 2019 21:32
markusthoemmes (Contributor):

Thanks for making the change. Can you add a test for this to bump coverage up again?

mattmoor (Member):

/retest

mattmoor (Member):

@greghaynes I'm curious what speedup you see with this?

greghaynes (Contributor, Author):

@mattmoor In theory it should be ~0.5 s on average (0-1 s, depending on when a stat is sent). I've been trying to test this out over the past day on my shared k8s cluster, and unfortunately my k8s control plane is way too noisy to get useful data out of (another interesting argument for us doing local scheduling). I'll try to set up a local private cluster to check, but if anyone has one and wants to try, we now have a perf test for this: `go test -v -tags performance ./test/performance -run ScaleFromZero1 | grep Average:`

greghaynes (Contributor, Author):

@mattmoor Using #3059 I got the following results (times in seconds):

| | min | max | avg |
| --- | --- | --- | --- |
| without this patch | 6.3 | 8.1 | 6.9 |
| with this patch | 5.9 | 6.5 | 6.2 |

mattmoor (Member):

mattmoor commented Feb 5, 2019

Ping @greghaynes - I think there's one outstanding comment to address before this can merge.

markusthoemmes (Contributor):

For clarity, that comment is:

Can you add a test for this to bump coverage up again?

markusthoemmes (Contributor):

/assign @markusthoemmes

So it appears in my review list and I can keep track of it. We should land this in 0.4 (as scheduled).

greghaynes (Contributor, Author):

Thanks for the reminder - I'll get to this today.

@greghaynes greghaynes force-pushed the fix/autoscaler-immediate-scale-from-zero branch from fcaf0c1 to 17e9996 Compare February 7, 2019 20:28
@knative-prow-robot knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 7, 2019
mattmoor (Member):

/test pull-knative-serving-go-coverage

mattmoor (Member):

mattmoor commented Feb 12, 2019

I think not commenting may mean same coverage? If so, we should file a bug that it doesn't remove the old comment. I reran it to test.

markusthoemmes (Contributor) left a comment:

I think we should refine the test a bit to prevent regression in the future.

```go
// We got the signal!
case <-time.After(30 * time.Millisecond):
	t.Fatalf("timed out waiting for Watch()")
}
```
markusthoemmes (Contributor):

I think this test would also pass without the changes made? I get coverage on the newly added path only intermittently. The multiscaler runs on a tick interval of 1ms, so it will be quick either way. Should we increase the tick interval of the multiscaler and assert that we get a tick in a substantially shorter time? (For example: a tickInterval of 100ms, make the other waits 200ms and this one 30ms.)

mattmoor (Member):

+1, let's add a parallel test with a tick interval larger than the overall test timeout. Then we should be able to test this properly, right?

greghaynes (Contributor, Author):

Hrm, I'm having a hard time coming up with a way to do this without chopping things up or adding some multiscaler access. When we create a metric, which in turn creates a scaler, we need to go through one tickScaler cycle for the scaler's desiredScale to be set to 0. If I raise the tick interval, we'll have to wait for that time period. AFAICT there's no good interface for me to inject this tick cycle - do you have any thoughts on this, @markusthoemmes?

markusthoemmes (Contributor):

@greghaynes yeah, I realize it's not as easy. The following adjustments trigger the path we want in the test pretty reliably:

```diff
diff --git a/pkg/autoscaler/multiscaler_test.go b/pkg/autoscaler/multiscaler_test.go
index a39ba0d7..840220d5 100644
--- a/pkg/autoscaler/multiscaler_test.go
+++ b/pkg/autoscaler/multiscaler_test.go
@@ -110,9 +110,11 @@ func TestMultiScalerScaling(t *testing.T) {
 }
 
 func TestMultiScalerScaleToZero(t *testing.T) {
+	tickInterval := 50 * time.Millisecond
+
 	ctx := context.TODO()
 	ms, stopCh, uniScaler := createMultiScaler(t, &autoscaler.Config{
-		TickInterval:      time.Millisecond * 1,
+		TickInterval:      tickInterval,
 		EnableScaleToZero: true,
 	})
 	defer close(stopCh)
@@ -160,18 +162,10 @@ func TestMultiScalerScaleToZero(t *testing.T) {
 	select {
 	case <-done:
 		// We got the signal!
-	case <-time.After(30 * time.Millisecond):
+	case <-time.After(tickInterval * 2):
 		t.Fatalf("timed out waiting for Watch()")
 	}
 
-	// Verify that we stop seeing "ticks"
-	select {
-	case <-done:
-		t.Fatalf("Got unexpected tick")
-	case <-time.After(30 * time.Millisecond):
-		// We got nothing!
-	}
-
 	desiredScaleMutex.Lock()
 	desiredScale = 1
 	desiredScaleMutex.Unlock()
@@ -188,7 +182,7 @@ func TestMultiScalerScaleToZero(t *testing.T) {
 	select {
 	case <-done:
 		// We got the signal!
-	case <-time.After(30 * time.Millisecond):
+	case <-time.After(tickInterval / 2):
 		t.Fatalf("timed out waiting for Watch()")
 	}

```

Does that make sense?

greghaynes (Contributor, Author):

@markusthoemmes hrm, check out what I just pushed - I added a method to directly set the desiredScale of a scalerRunner, which lets us actually test this code path 100% of the time.
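For illustration, a sketch of that kind of test hook, assuming it is a plain mutex-guarded setter - the names below are illustrative and not necessarily the PR's actual helper: forcing desiredScale to 0 directly removes the dependence on a prior tick cycle, so the immediate scale-from-zero path can be exercised deterministically.

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative runner type; the real scalerRunner lives in
// pkg/autoscaler/multiscaler.go.
type scalerRunner struct {
	mu           sync.Mutex
	desiredScale int32
}

func (r *scalerRunner) getLatestScale() int32 {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.desiredScale
}

// setLatestScale lets a test force the "scaled to zero" state without
// waiting for a tickScaler cycle to compute it.
func (r *scalerRunner) setLatestScale(s int32) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.desiredScale = s
}

func main() {
	r := &scalerRunner{}
	r.setLatestScale(0)
	// A test would now send a stat and assert an immediate tick.
	fmt.Println("at zero:", r.getLatestScale() == 0)
}
```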

mattmoor (Member):

Ping, I'd like to see this land in 0.4

@greghaynes greghaynes force-pushed the fix/autoscaler-immediate-scale-from-zero branch from 17e9996 to 000e179 Compare February 14, 2019 05:51
@knative-prow-robot knative-prow-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 14, 2019
@greghaynes greghaynes force-pushed the fix/autoscaler-immediate-scale-from-zero branch 2 times, most recently from 9616881 to b691dff Compare February 14, 2019 06:49
@knative-prow-robot knative-prow-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 14, 2019
knative-metrics-robot:

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report.

| File | Old Coverage | New Coverage | Delta |
| --- | --- | --- | --- |
| pkg/autoscaler/multiscaler.go | 92.3% | 94.7% | 2.4 |

@greghaynes greghaynes force-pushed the fix/autoscaler-immediate-scale-from-zero branch 3 times, most recently from 723019a to 3b777b3 Compare February 14, 2019 06:54
@greghaynes greghaynes force-pushed the fix/autoscaler-immediate-scale-from-zero branch from 3b777b3 to ebc0a58 Compare February 14, 2019 06:55
markusthoemmes (Contributor) left a comment:

Thank you for making these changes.

/lgtm
/approve

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2019
knative-prow-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: greghaynes, markusthoemmes

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2019
@knative-prow-robot knative-prow-robot merged commit 5b7e155 into knative:master Feb 14, 2019