Distinguish zero concurrency from slow/failed scraping when bucketing #8610

Open
julz opened this issue Jul 13, 2020 · 19 comments

Labels: area/autoscale, help wanted, kind/feature, triage/accepted
Milestone: Icebox

Comments

@julz (Member) commented Jul 13, 2020

Describe the feature

Currently we do not differentiate between a scrape that actually reports zero concurrency from a replica and simply not having data for a particular bucket. This is fine when the network is fast and the autoscaler is not overloaded, because we will have data roughly every second; but on a slow or overloaded network (or, for example, a resource-constrained host where the queue-proxy is slow to respond to scrapes) it can cause issues: when we average over the bucket we may think we have lower load than we actually do, and scale replicas down (or fail to scale them up) incorrectly.

(This is somewhat related to #8377 in that if we introduce a work pool there's a greater danger of things backed up in the queue not getting stats every second).
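
To make the failure mode concrete, here is a minimal, hypothetical sketch (plain Go, not the actual autoscaler code) of how counting missed scrapes as zero deflates a bucket average:

package main

import "fmt"

func main() {
	// One concurrency sample per second over a 6-second bucket; a negative
	// value marks a second in which no scrape result arrived.
	samples := []float64{10, 10, -1, -1, 10, 10}

	var sum float64
	var seen int
	for _, s := range samples {
		if s < 0 {
			// Today's effective behaviour: a missing sample contributes zero.
			continue
		}
		sum += s
		seen++
	}
	// Averaging over the whole bucket understates the real load (~6.67 vs 10),
	// which is what can trigger an unwarranted scale-down.
	fmt.Println("average treating gaps as zero:", sum/float64(len(samples)))
	fmt.Println("average over received scrapes:", sum/float64(seen))
}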

@julz added the kind/feature label on Jul 13, 2020
@mattmoor (Member)

cc @markusthoemmes @vagababov for thoughts

@vagababov (Contributor)

Actually it's the same as the @duglin issue we revisited earlier :)

@mattmoor (Member)

Want to find it and dupe?

@vagababov (Contributor)

Done.

@julz (Member, Author) commented Aug 18, 2020

FWIW I think this isn't totally the same as #8390. In #8390 @duglin is scraping at the correct rate, but the pockets of zero concurrency lead us to end up with fewer replicas than we actually need for peak load (because the simulated workload is a GitHub trigger firing multiple parallel events every 10 seconds or so, and we average over the full window). That one can potentially be fixed by the max-vs-average flag we've been informally chatting about: I'll pull out a top-level issue for that now.

This one, I think, is slightly different. When the network is slow, or blips, we can get zero scaling data for a few seconds (or longer), and our current behaviour is to treat any gap in the data as if we'd actually seen zero concurrency. This means that if the scraper loses connectivity to the pods, or the network is temporarily congested, we can start rapidly scaling down the workload as fast as max-scale-down-rate will let us. The 'max' switch described above, which would potentially help bursty loads, would cope with this slightly better, but I think it's a cross-cutting problem we should solve in both cases: for example, by assuming the rolling average rather than 0 when we miss a scrape.

Edit: Spun out #9092.
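
A rough sketch of that last idea, assuming (hypothetically) that the aggregator could tell "no scrape arrived" apart from "scrape reported zero"; the function and variable names are illustrative, not Knative APIs:

package main

import "fmt"

// fillGaps back-fills seconds with no scrape data using the rolling average of
// the samples received so far, instead of treating them as zero concurrency.
func fillGaps(samples []float64, missing []bool) []float64 {
	filled := make([]float64, len(samples))
	var sum float64
	var n int
	for i, s := range samples {
		if missing[i] {
			if n > 0 {
				// A network blip should not look like idle pods.
				filled[i] = sum / float64(n)
			}
			continue
		}
		filled[i] = s
		sum += s
		n++
	}
	return filled
}

func main() {
	samples := []float64{10, 10, 0, 0, 10, 10} // the zeros here are really lost scrapes
	missing := []bool{false, false, true, true, false, false}
	fmt.Println(fillGaps(samples, missing)) // [10 10 10 10 10 10]
}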

@julz changed the title from "Distinguish zero concurrency from slow scraping when bucketing" to "Distinguish zero concurrency from slow/failed scraping when bucketing" on Aug 18, 2020
@duglin commented Oct 3, 2020

Since #8390 was closed, I want to add my test case to this one, because I'm still seeing odd behavior even now that #9092 has been merged.

Script:

#!/bin/bash

set -e
PAUSE=${1:-10}
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=1 | tail -1`
echo "Service: $URL"
echo "Pause: $PAUSE"
for i in `seq 1 20` ; do
  echo -n "$i : Running 50... "
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep echo.*Running | wc -l
  sleep $PAUSE
done
kn service delete echo

And output I see today:

$ ./bug 
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
Pause: 10
1 : Running 50... real  0m19.686s # pods: 72
2 : Running 50... real  0m10.168s # pods: 72
3 : Running 50... real  0m10.177s # pods: 72
4 : Running 50... real  0m10.181s # pods: 48
5 : Running 50... real  0m10.186s # pods: 72
6 : Running 50... real  0m10.162s # pods: 72
7 : Running 50... real  0m10.183s # pods: 36
8 : Running 50... real  0m10.190s # pods: 72
9 : Running 50... real  0m10.174s # pods: 72
10 : Running 50... real 0m10.192s # pods: 36
11 : Running 50... real 0m10.204s # pods: 72
12 : Running 50... real 0m10.171s # pods: 72
13 : Running 50... real 0m10.191s # pods: 36
14 : Running 50... real 0m20.176s # pods: 72
15 : Running 50... real 0m10.192s # pods: 72
16 : Running 50... real 0m10.189s # pods: 72
17 : Running 50... real 0m20.139s # pods: 72
18 : Running 50... real 0m10.186s # pods: 72
19 : Running 50... real 0m10.213s # pods: 72
20 : Running 50... real 0m10.188s # pods: 39
Service 'echo' successfully deleted in namespace 'default'.

Notice how the number of pods isn't consistent, and its dropping below 50 doesn't seem right. But the occasional 2x latency is obviously the biggest concern.

@duglin commented Oct 17, 2020

Using:

URL=`kn service create echo --image duglin/echo --concurrency-limit=1 \
  --annotation-revision autoscaling.knative.dev/scaleDownDelay=120s \
  --annotation-revision autoscaling.knative.dev/window=6s \
  | tail -1`

helped w.r.t. latency: it was around 10 seconds consistently. However, I had 72 pods the entire time, which just doesn't seem right when I only have 50 requests. Yes, I know that TU (70%) is probably why I get an extra 22 pods, but from a user's POV it's hard to explain. I wonder if we need to make it clearer that this "utilization" isn't just per pod but across all pods, and that it really should be looked at as some kind of "over-provisioning" flag. Then it's clear that anything other than 100% means they're asking for "extra" unused space, and that this space is calculated across all pods, not just within one.

@duglin commented Oct 17, 2020

Meaning, (# of requests) / (CC * TU%) == # of pods they should see.
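
For concreteness, a small worked example of that relationship, assuming the usual shape desiredPods = ceil(concurrentRequests / (CC * TU)); with 50 requests, CC=1 and TU=70% it reproduces the 72 pods reported above:

package main

import (
	"fmt"
	"math"
)

func main() {
	requests := 50.0 // concurrent requests in flight
	cc := 1.0        // containerConcurrency (hard limit per pod)
	tu := 0.70       // targetUtilizationPercentage as a fraction
	// 50 / (1 * 0.7) = 71.4..., rounded up to 72 pods.
	fmt.Println(math.Ceil(requests / (cc * tu)))
}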

@vagababov (Contributor) commented Oct 17, 2020 via email

@duglin commented Oct 17, 2020

Just FYI:

$ cat bug
#!/bin/bash

set -e
PAUSE=${PAUSE:-20}
COUNT=${COUNT:-20}
SDD=${SDD:-6s}
TU=${TU:-100}
WIN=${WIN:-6s}
CC=${CC:-1}
echo "Erasing old service"
kn service delete echo > /dev/null 2>&1 && sleep 5
URL=`kn service create echo --image duglin/echo --concurrency-limit=$CC \
  --annotation-revision autoscaling.knative.dev/scaleDownDelay=$SDD \
  --annotation-revision autoscaling.knative.dev/targetUtilizationPercentage=$TU \
  --annotation-revision autoscaling.knative.dev/window=$WIN \
  | tail -1`
echo "Service: $URL"
echo "PAUSE: $PAUSE"
echo "CC: $CC"
echo "SDD: $SDD"
echo "TU: $TU"
echo "WIN: $WIN"

for i in `seq 1 $COUNT` ; do
  echo -n "$i : Running 50... "
  (time (
    for i in `seq 1 50` ; do
      curl -s ${URL}?sleep=10 > /dev/null &
    done
    wait )
  ) 2>&1 | grep real | tr '\n' '\0'
  echo -n " # pods: "
  kubectl get pods | grep echo.*Running | wc -l
  sleep $PAUSE
done
kn service delete echo
$ PAUSE=20 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 20
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real  0m18.245s # pods: 50
2 : Running 50... real  0m20.091s # pods: 49
3 : Running 50... real  0m20.108s # pods: 50
4 : Running 50... real  0m20.143s # pods: 50
5 : Running 50... real  0m19.347s # pods: 50
6 : Running 50... real  0m20.105s # pods: 49
7 : Running 50... real  0m20.155s # pods: 50
8 : Running 50... real  0m20.155s # pods: 50
9 : Running 50... real  0m20.163s # pods: 50
10 : Running 50... real 0m19.596s # pods: 50
11 : Running 50... real 0m20.126s # pods: 50
12 : Running 50... real 0m19.307s # pods: 50
13 : Running 50... real 0m20.131s # pods: 50
14 : Running 50... real 0m20.097s # pods: 50
15 : Running 50... real 0m19.604s # pods: 50
16 : Running 50... real 0m20.141s # pods: 50
17 : Running 50... real 0m20.132s # pods: 50
18 : Running 50... real 0m20.116s # pods: 49
19 : Running 50... real 0m20.181s # pods: 50
20 : Running 50... real 0m20.179s # pods: 49
$ PAUSE=10 ./bug
Erasing old service
Service: http://echo-default.kndev.us-south.containers.appdomain.cloud
PAUSE: 10
CC: 1
SDD: 6s
TU: 100
WIN: 6s
1 : Running 50... real  0m19.904s # pods: 50
2 : Running 50... real  0m10.185s # pods: 29
3 : Running 50... real  0m10.182s # pods: 50
4 : Running 50... real  0m20.176s # pods: 50
5 : Running 50... real  0m10.180s # pods: 50
6 : Running 50... real  0m10.168s # pods: 25
7 : Running 50... real  0m20.164s # pods: 34
8 : Running 50... real  0m19.300s # pods: 50
9 : Running 50... real  0m10.193s # pods: 50
10 : Running 50... real 0m10.180s # pods: 50
11 : Running 50... real 0m20.177s # pods: 50
12 : Running 50... real 0m10.174s # pods: 49
13 : Running 50... real 0m10.187s # pods: 26
14 : Running 50... real 0m10.167s # pods: 50
15 : Running 50... real 0m20.191s # pods: 32
16 : Running 50... real 0m18.220s # pods: 50
17 : Running 50... real 0m10.186s # pods: 50
18 : Running 50... real 0m10.174s # pods: 50
19 : Running 50... real 0m20.141s # pods: 50
20 : Running 50... real 0m10.190s # pods: 35
Service 'echo' successfully deleted in namespace 'default'.

@github-actions (bot)

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions bot added the lifecycle/stale label on Jan 16, 2021
@duglin commented Feb 16, 2021

/reopen
/remove-lifecycle stale

@duglin reopened this on Feb 16, 2021
@knative-prow-robot removed the lifecycle/stale label on Feb 16, 2021
@evankanderson (Member)

@vagababov @julz

Is this still an issue? Would this be a "good first issue" in the autoscaling area?

/triage needs-user-input

(I'll also point out that this bug timed out, so if it's a major issue, we may need to reconsider our priorities. If it's not a major issue, we may want to consider allowing it to time out again.)

@knative-prow-robot added the triage/needs-user-input label on Mar 22, 2021
@julz (Member, Author) commented Mar 22, 2021

Is this still an issue? Would this be a "good first issue" in the autoscaling area?

Unfortunately not. What sounds easy is actually a bit tricky because of how we do metric aggregation. Having said that, it's possible the new pluggable aggregation support @vagababov has added may make this more tractable 🤔.

@evankanderson (Member)

/remove-triage needs-user-input
/triage accepted

I'm not sure that Victor is going to land anything here; is this still an issue, and what priority?

@knative-prow-robot added the triage/accepted label and removed the triage/needs-user-input label on Jun 23, 2021
@julz (Member, Author) commented Jun 23, 2021

I do think it's a legit issue ("we do not distinguish failed/slow scrapes from zero concurrency, and we should, or we'll potentially scale down due to network blips") that needs more work to progress than a "good first issue" should. If we had a 'this is something for someone who wants something meaty to work on' tag, I'd add it to this.

@evankanderson (Member)

/help

@knative-prow-robot (Contributor)

@evankanderson:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot added the help wanted label on Jun 24, 2021
@nader-ziada (Member)

/assign

@dprotaso added this to the Icebox milestone on Mar 15, 2022