
Many services in a single namespace leads to assorted problems. #8498

Closed
mattmoor opened this issue Jun 27, 2020 · 12 comments
Labels: area/API (API objects and controllers), kind/bug (Categorizes issue or PR as related to a bug.)
Milestone: 0.17.x

@mattmoor (Member) commented Jun 27, 2020

/area API
/area autoscale

What version of Knative?

HEAD

Description of the problem.

I wanted to push on our limits a bit, so I wrote the very innovative (patent pending 🤣) script below and plotted the latency between creationTimestamp and status.conditions[Ready].lastTransitionTime.

A few observations in this context:

  1. Revision creation latency creeps up over time.
  2. (anecdotally, by curling) The cold-start latency of ALL ksvcs creeps up over time (from 2-3s to 10-12s, with all of the added time spent between "Container create" and "Container start").
  3. After 1198 services were deployed, I started seeing deployments fail with:
     RevisionFailed: Revision "foo-1200-sjwyy-1" failed with message: Container failed with: standard_init_linux.go:211: exec user process caused "argument list too long"
  4. Once that happened, we stopped being able to cold start new services (they crash-loop with the same message)!

Here's where it gets interesting... On a whim, I tried picking back up in a second namespace, and things work! Not only do they work, but cold start latency for the new services is back down!

Steps to Reproduce the Problem

I needed a GKE cluster with at least 10 nodes (post-master resize) to tolerate the number of services this creates. I was playing with this in the context of mink, but there's no reason that would affect what I'm seeing.

#!/bin/bash -e

for i in $(seq 1 1500); do
  kn service create foo-$i --image=gcr.io/knative-samples/autoscale-go:0.1
  sleep 10
done

I gathered the latencies as CSV with:

kubectl get ksvc -ojsonpath='{range .items[*]}{.metadata.name},{.metadata.creationTimestamp},{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}{end}' | pbcopy
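A minimal sketch (not part of the original report) for turning that CSV into per-service readiness latency in seconds, assuming the output was saved to a file named latencies.csv (a hypothetical name) and GNU date is available:

#!/bin/bash
# Compute seconds between creationTimestamp and the Ready condition's
# lastTransitionTime for each service; latencies.csv is a hypothetical filename.
while IFS=, read -r name created ready; do
  echo "${name},$(( $(date -d "${ready}" +%s) - $(date -d "${created}" +%s) ))"
done < latencies.csv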
@mattmoor mattmoor added the kind/bug Categorizes issue or PR as related to a bug. label Jun 27, 2020
@mattmoor mattmoor added this to the 0.17.x milestone Jun 27, 2020
@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Jun 27, 2020
@mattmoor (Member Author)

Given the sheer volume of services we create, I have the sneaking suspicion that this is related to the service link environment variables injected by k8s. @dprotaso has a change to allow us to use enableServiceLinks: false here.

I am going to try pulling that in and see if it alleviates these problems. If it does, then I think we have our compelling reason to make this default configurable.

cc @vagababov @markusthoemmes @duglin
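For context on the suspected mechanism: every Service in a namespace causes Kubernetes to inject *_SERVICE_HOST / *_PORT* environment variables into each new pod unless enableServiceLinks is false, so ~1200 ksvcs means thousands of injected variables per container. A quick way to see the blow-up on one of the pods above (a sketch; it assumes the pods carry Knative's serving.knative.dev/service label and the default user-container container name):

# Count the service-link env vars injected into the pod backing foo-1.
POD="$(kubectl get pods -l serving.knative.dev/service=foo-1 -o name | head -n 1)"
kubectl exec "${POD}" -c user-container -- env | grep -c '_SERVICE_HOST='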

@vagababov (Contributor)

Yeah, Dave's change should help with this.

@tshak commented Jun 27, 2020

enableServiceLinks: false still feels like the right default to me, especially if it's causing issues like this. I really can't imagine anyone using Knative relying on Docker-links behavior. If they do, we now have a switch in the PodSpec where it can be enabled. Lest we forget...

Built by codifying the best practices shared by successful real-world implementations...
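A sketch of what that per-Service switch could look like, assuming enableServiceLinks is surfaced on the Revision template as the comment above describes (the service name here is hypothetical; the image is the one used elsewhere in this thread):

kubectl apply -f - <<EOF
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: foo-no-links
spec:
  template:
    spec:
      enableServiceLinks: false
      containers:
        - image: gcr.io/knative-samples/autoscale-go:0.1
EOF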

@mattmoor (Member Author)

@tshak Ack, when this was raised previously we didn't really have tangible evidence that this could cause correctness or performance problems at scale. I am running a test now that should confirm/deny that impact, and then we can revisit in the WG call. However, I think the sentiment from the previous WG call stands: we need to handle this like a change to "stable" behavior.

Assuming we have a smoking gun...

I'd suggest:

  1. Make the default configurable (see the sketch after this list).
  2. Announce the intent to change the default (and why), and advise folks interested to specify an explicit default (and how).
  3. Wait until the release specified in 2.
  4. Change the default and advise folks how to restore the previous behavior.
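To make step 1 concrete, here is a purely hypothetical sketch of an operator-level default driven by a ConfigMap; the ConfigMap and key names are illustrative only, not necessarily what would actually land:

kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults            # hypothetical home for the knob
  namespace: knative-serving
data:
  enable-service-links: "false"    # hypothetical key: cluster-wide default, overridable per Revision
EOF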

I don't see a super compelling reason to remove the configurable default (since it enables "K8s compat mode"), but I think these issues would be enough to sway my view on the default behavior.

I'll report back in a few hundred more services 😉

@mattmoor (Member Author)

Alright, I still need to confirm that we can push beyond 1200 services without the standard_init_linux.go problem, but I think we have enough to declare a smoking gun.

This is the before picture (effectively enableServiceLinks: true):
[latency chart: enableServiceLinks: true]

This is the after picture (enableServiceLinks: false):
[latency chart: enableServiceLinks: false]

On top of that, not only is the cold-start latency for these services better, it seems like a noticeable improvement over where we are at HEAD today with nothing on the cluster 🤩

time curl http://foo-xdsps.default.ducktagon.io?sleep=100
Slept for 100.13 milliseconds.
real    0m2.016s
user    0m0.007s
sys     0m0.011s


time curl http://foo-z74zf.default.ducktagon.io?sleep=100
Slept for 100.25 milliseconds.
real    0m1.895s
user    0m0.008s
sys     0m0.009s

I deployed a ksvc with the original method on top of these 1000 ksvcs (it was slow!):

kn service create foo-svc-links --image=gcr.io/knative-samples/autoscale-go:0.1
Creating service 'foo-svc-links' in namespace 'default':

  0.119s Configuration "foo-svc-links" is waiting for a Revision to become ready.
  0.131s unsuccessfully observed a new generation
  0.154s Configuration "foo-svc-links" is waiting for a Revision to become ready.
 22.734s ...
 22.861s Ingress has not yet been reconciled.
 23.549s Waiting for load balancer to be ready
 25.077s Ready to serve.

Service 'foo-svc-links' created with latest revision 'foo-svc-links-fxfmm-1' and URL:
http://foo-svc-links.default.ducktagon.io

I waited for this to scale to zero and then curled it:

time curl http://foo-svc-links.default.ducktagon.io?sleep=100
Slept for 100.14 milliseconds.

real    0m13.042s
user    0m0.008s
sys     0m0.009s

My next step is to keep cranking out services, push past 1200 (where foo-svc-links should fail), and see whether the ksvcs without service links keep working.

@mattmoor (Member Author)

Alright, I have confirmed the last piece of this. Above 1200 services, we fail to cold start with service links, but without them we still see pretty exceptional cold starts...

time curl http://foo-z9gs2.default.ducktagon.io?sleep=100
Slept for 100.12 milliseconds.

real    0m1.698s
user    0m0.008s
sys     0m0.009s

@mattmoor (Member Author)

cc @markusthoemmes

@mattmoor (Member Author)

Overnight I did a fresh run creating 1500 ksvcs (with kn as above, but with service links disabled by default thanks to @vagababov), and the Revision creation latency is completely flat:
[latency chart: 1500 ksvcs, service links disabled]

... bear in mind that the original graph covered only 1000 ksvcs.

So I think we should start a discussion around how we want to change this default behavior. Thoughts?

@mattmoor (Member Author)

Closing in favor of #8563

@duglin commented Oct 17, 2022

@mattmoor do you happen to remember whether this issue was strictly due to the env vars K8s injects for Services, or whether it would have happened even if the user had defined the same number of env vars manually?

@mattmoor (Member Author)

@duglin I'd guess it'd happen with either.
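One hypothetical way to check that (nothing like this was actually run in this thread) would be to create a service with a comparable number of user-defined env vars and see whether the same exec failure appears; kn accepts repeated --env flags:

#!/bin/bash -e
# Hypothetical experiment: approximate the injected-service-link volume with
# user-defined env vars instead. The names and the count are illustrative.
ENV_FLAGS=()
for i in $(seq 1 4000); do
  ENV_FLAGS+=(--env "FAKE_LINK_${i}=value-${i}")
done
kn service create foo-manual-env --image=gcr.io/knative-samples/autoscale-go:0.1 "${ENV_FLAGS[@]}"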

@duglin commented Oct 17, 2022

ok thanks
