Distinguish zero concurrency from slow/failed scraping when bucketing #8610

Open
julz opened this issue Jul 13, 2020 · 19 comments

Labels: area/autoscale, help wanted, kind/feature, triage/accepted
Milestone: Icebox

Comments

@julz (Member) commented Jul 13, 2020

Describe the feature

Currently we do not differentiate between a scrape that actually reports zero concurrency from a replica and simply not having data for a particular bucket. This is fine when the network is fast and the autoscaler is not overloaded, because we will have data roughly every second; but on a slow or overloaded network (or, for example, a resource-constrained host where the queue-proxy is slow to respond to scrapes) it can cause issues: when we average over the bucket we may think we have lower load than we actually do, and scale replicas down (or fail to scale them up) incorrectly.

(This is somewhat related to #8377 in that if we introduce a work pool there's a greater danger of things backed up in the queue not getting stats every second).
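
To make the failure mode concrete, here is a minimal, hypothetical sketch (plain Go, not the actual autoscaler code) of how counting missed scrapes as zero deflates a bucket average:

package main

import "fmt"

func main() {
	// One concurrency sample per second over a 6-second bucket; a negative
	// value marks a second in which no scrape result arrived.
	samples := []float64{10, 10, -1, -1, 10, 10}

	var sum float64
	var seen int
	for _, s := range samples {
		if s < 0 {
			// Today's effective behaviour: a missing sample contributes zero.
			continue
		}
		sum += s
		seen++
	}
	// Averaging over the whole bucket understates the real load (~6.67 vs 10),
	// which is what can trigger an unwarranted scale-down.
	fmt.Println("average treating gaps as zero:", sum/float64(len(samples)))
	fmt.Println("average over received scrapes:", sum/float64(seen))
}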

@julz added the kind/feature label on Jul 13, 2020
@mattmoor (Member)

cc @markusthoemmes @vagababov for thoughts

@vagababov (Contributor)

Actually it's the same as the @duglin issue we revisited earlier :)

@mattmoor (Member)

Want to find it and dupe?

@vagababov (Contributor)

Done.

@julz (Member, Author) commented Aug 18, 2020

FWIW I think this isn't totally the same as #8390. In #8390 @duglin is scraping at the correct rate, but the pockets of zero concurrency lead us to end up with fewer replicas than we actually need for peak load (because the simulated workload is a GitHub trigger firing multiple parallel events every 10 seconds or so, and we average over the full window). That one can potentially be fixed by the max-vs-average flag we've been informally chatting about: I'll pull out a top-level issue for that now.

This one, I think, is slightly different. When the network is slow, or blips, we can get zero scaling data for a few seconds (or longer), and our current behaviour is to treat any gap in the data as if we'd actually seen zero concurrency. This means that if the scraper loses connectivity to the pods, or the network is temporarily congested, we can start rapidly scaling down the workload as fast as max-scale-down-rate will let us. The 'max' switch described above, which would potentially help bursty loads, would cope with this slightly better, but I think it's a cross-cutting problem we should solve in both cases: for example, by assuming the rolling average rather than 0 when we miss a scrape.

Edit: Spun out #9092.
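
A rough sketch of that last idea, assuming (hypothetically) that the aggregator could tell "no scrape arrived" apart from "scrape reported zero"; the function and variable names are illustrative, not Knative APIs:

package main

import "fmt"

// fillGaps back-fills seconds with no scrape data using the rolling average of
// the samples received so far, instead of treating them as zero concurrency.
func fillGaps(samples []float64, missing []bool) []float64 {
	filled := make([]float64, len(samples))
	var sum float64
	var n int
	for i, s := range samples {
		if missing[i] {
			if n > 0 {
				// A network blip should not look like idle pods.
				filled[i] = sum / float64(n)
			}
			continue
		}
		filled[i] = s
		sum += s
		n++
	}
	return filled
}

func main() {
	samples := []float64{10, 10, 0, 0, 10, 10} // the zeros here are really lost scrapes
	missing := []bool{false, false, true, true, false, false}
	fmt.Println(fillGaps(samples, missing)) // [10 10 10 10 10 10]
}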

@julz changed the title from "Distinguish zero concurrency from slow scraping when bucketing" to "Distinguish zero concurrency from slow/failed scraping when bucketing" on Aug 18, 2020
@duglin commented Oct 3, 2020

Since #8390 was closed, I want to add my test case to this one, because I'm still seeing odd behavior even now that #9092 has been merged.

Script:

#!/bin/bash

set -e
PAUSE=${1:-10}
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=1 | tail -1`
echo "Service: $URL"
echo "Pause: $PAUSE"
for i in `seq 1 20` ; do
  echo -n "$i : Running 50... "
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep echo.*Running | wc -l
  sleep $PAUSE
done
kn service delete echo

And output I see today:

$ ./bug 
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
Pause: 10
1 : Running 50... real  0m19.686s # pods: 72
2 : Running 50... real  0m10.168s # pods: 72
3 : Running 50... real  0m10.177s # pods: 72
4 : Running 50... real  0m10.181s # pods: 48
5 : Running 50... real  0m10.186s # pods: 72
6 : Running 50... real  0m10.162s # pods: 72
7 : Running 50... real  0m10.183s # pods: 36
8 : Running 50... real  0m10.190s # pods: 72
9 : Running 50... real  0m10.174s # pods: 72
10 : Running 50... real 0m10.192s # pods: 36
11 : Running 50... real 0m10.204s # pods: 72
12 : Running 50... real 0m10.171s # pods: 72
13 : Running 50... real 0m10.191s # pods: 36
14 : Running 50... real 0m20.176s # pods: 72
15 : Running 50... real 0m10.192s # pods: 72
16 : Running 50... real 0m10.189s # pods: 72
17 : Running 50... real 0m20.139s # pods: 72
18 : Running 50... real 0m10.186s # pods: 72
19 : Running 50... real 0m10.213s # pods: 72
20 : Running 50... real 0m10.188s # pods: 39
Service 'echo' successfully deleted in namespace 'default'.

Notice how the number of pods isn't consistent, and its dropping below 50 doesn't seem right. But the occasional 2x latency is obviously the biggest concern.

@duglin commented Oct 17, 2020

Using:

URL=`kn service create echo --image duglin/echo --concurrency-limit=1 \
  --annotation-revision autoscaling.knative.dev/scaleDownDelay=120s \
  --annotation-revision autoscaling.knative.dev/window=6s \
  | tail -1`

helped w.r.t. latency: it was around 10 seconds consistently. However, I had 72 pods the entire time, which just doesn't seem right when I only have 50 requests. Yes, I know that TU (70%) is probably why I get an extra 22 pods, but from a user's POV it's hard to explain. I wonder if we need to make it clearer that this "utilization" isn't just per pod but across all pods, and that it really should be looked at as some kind of "over-provisioning" flag. Then it's clear that anything other than 100% means they're asking for "extra" unused space, and that this space is calculated across all pods, not just within one.

@duglin commented Oct 17, 2020

Meaning, (# of requests) / (CC * TU%) == # of pods they should see.
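
For concreteness, a small worked example of that relationship, assuming the usual shape desiredPods = ceil(concurrentRequests / (CC * TU)); with 50 requests, CC=1 and TU=70% it reproduces the 72 pods reported above:

package main

import (
	"fmt"
	"math"
)

func main() {
	requests := 50.0 // concurrent requests in flight
	cc := 1.0        // containerConcurrency (hard limit per pod)
	tu := 0.70       // targetUtilizationPercentage as a fraction
	// 50 / (1 * 0.7) = 71.4..., rounded up to 72 pods.
	fmt.Println(math.Ceil(requests / (cc * tu)))
}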

@vagababov (Contributor) commented Oct 17, 2020 via email

@duglin commented Oct 17, 2020

Just FYI:

$ cat bug
#!/bin/bash

set -e
PAUSE=${PAUSE:-20}
COUNT=${COUNT:-20}
SDD=${SDD:-6s}
TU=${TU:-100}
WIN=${WIN:-6s}
CC=${CC:-1}
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=$CC \
  --annotation-revision autoscaling.knative.dev/scaleDownDelay=$SDD \
  --annotation-revision autoscaling.knative.dev/targetUtilizationPercentage=$TU \
  --annotation-revision autoscaling.knative.dev/window=$WIN \
  | tail -1`
echo "Service: $URL"
echo "PAUSE: $PAUSE"
echo "CC: $CC"
echo "SDD: $SDD"
echo "TU: $TU"
echo "WIN: $WIN"

for i in `seq 1 $COUNT` ; do
  echo -n "$i : Running 50... "
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep echo.*Running | wc -l
  sleep $PAUSE
done
kn service delete echo
$ PAUSE=20 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 20
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real  0m18.245s # pods: 50
2 : Running 50... real  0m20.091s # pods: 49
3 : Running 50... real  0m20.108s # pods: 50
4 : Running 50... real  0m20.143s # pods: 50
5 : Running 50... real  0m19.347s # pods: 50
6 : Running 50... real  0m20.105s # pods: 49
7 : Running 50... real  0m20.155s # pods: 50
8 : Running 50... real  0m20.155s # pods: 50
9 : Running 50... real  0m20.163s # pods: 50
10 : Running 50... real 0m19.596s # pods: 50
11 : Running 50... real 0m20.126s # pods: 50
12 : Running 50... real 0m19.307s # pods: 50
13 : Running 50... real 0m20.131s # pods: 50
14 : Running 50... real 0m20.097s # pods: 50
15 : Running 50... real 0m19.604s # pods: 50
16 : Running 50... real 0m20.141s # pods: 50
17 : Running 50... real 0m20.132s # pods: 50
18 : Running 50... real 0m20.116s # pods: 49
19 : Running 50... real 0m20.181s # pods: 50
20 : Running 50... real 0m20.179s # pods: 49
$ PAUSE=10 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 10
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real  0m19.904s # pods: 50
2 : Running 50... real  0m10.185s # pods: 29
3 : Running 50... real  0m10.182s # pods: 50
4 : Running 50... real  0m20.176s # pods: 50
5 : Running 50... real  0m10.180s # pods: 50
6 : Running 50... real  0m10.168s # pods: 25
7 : Running 50... real  0m20.164s # pods: 34
8 : Running 50... real  0m19.300s # pods: 50
9 : Running 50... real  0m10.193s # pods: 50
10 : Running 50... real 0m10.180s # pods: 50
11 : Running 50... real 0m20.177s # pods: 50
12 : Running 50... real 0m10.174s # pods: 49
13 : Running 50... real 0m10.187s # pods: 26
14 : Running 50... real 0m10.167s # pods: 50
15 : Running 50... real 0m20.191s # pods: 32
16 : Running 50... real 0m18.220s # pods: 50
17 : Running 50... real 0m10.186s # pods: 50
18 : Running 50... real 0m10.174s # pods: 50
19 : Running 50... real 0m20.141s # pods: 50
20 : Running 50... real 0m10.190s # pods: 35
Service 'echo' successfully deleted in namespace 'default'.

@github-actions (bot)

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions bot added the lifecycle/stale label on Jan 16, 2021
@duglin commented Feb 16, 2021

/reopen
/remove-lifecycle stale

@duglin reopened this on Feb 16, 2021
@knative-prow-robot removed the lifecycle/stale label on Feb 16, 2021
@evankanderson (Member)

@vagababov @julz

Is this still an issue? Would this be a "good first issue" in the autoscaling area?

/triage needs-user-input

(I'll also point out that this bug timed out, so if it's a major issue, we may need to reconsider our priorities. If it's not a major issue, we may want to consider allowing it to time out again.)

@knative-prow-robot added the triage/needs-user-input label on Mar 22, 2021
@julz (Member, Author) commented Mar 22, 2021

Is this still an issue? Would this be a "good first issue" in the autoscaling area?

Unfortunately not. What sounds easy is actually a bit tricky because of how we do metric aggregation. Having said that, it's possible the new pluggable aggregation support @vagababov has added may make this more tractable 🤔.

@evankanderson (Member)

/remove-triage needs-user-input
/triage accepted

I'm not sure that Victor is going to land anything here; is this still an issue, and what priority?

@knative-prow-robot added the triage/accepted label and removed the triage/needs-user-input label on Jun 23, 2021
@julz (Member, Author) commented Jun 23, 2021

I do think it's a legit issue ("we do not distinguish failed/slow scrapes from zero concurrency, and we should, or we'll potentially scale down due to network blips") that needs more work to progress than a "good first issue" should. If we had a 'this is something for someone who wants something meaty to work on' tag, I'd add it to this.

@evankanderson (Member)

/help

@knative-prow-robot (Contributor)

@evankanderson:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot added the help wanted label on Jun 24, 2021
@nader-ziada (Member)

/assign

@dprotaso added this to the Icebox milestone on Mar 15, 2022