
Actual implementation of direct pod scraping #7804

Merged

Conversation

vagababov
Contributor

This is the actual scraping implementation for #5978

/assign @markusthoemmes @yanweiguo @mattmoor

@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 1, 2020
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vagababov

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 1, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label May 1, 2020
// Scrape!
target := "http://" + pods[myIdx] + ":" + portAndPath
stat, err := s.sClient.Scrape(target)
if err == nil {
Member

Kind of sad that the real error gets swallowed here... is there any non-awful way to return one of the actual errors instead of errPodsExhausted? Maybe a chan error we can write the error to, and pull the first one out to log/wrap from the block on L258 or something? It would help debugging if we ever find a system falling back to mesh scraping unexpectedly, or failing to scrape.

Contributor Author

There might be more than one error, which makes packaging them hard. HttpScraper should log these (or perhaps right here, though we have no logging in this file right now).
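
For readers of this thread, here is a minimal sketch of the chan-error idea floated above, assuming an errgroup-style fan-out like the one in this PR; scrapeAll, scrapePod, and the exact wrapping are illustrative stand-ins, not the merged code (the actual logging landed in a follow-up, see the commits referenced at the end of this page).

package scraper

import (
	"errors"
	"fmt"
	"sync/atomic"

	"golang.org/x/sync/errgroup"
)

var errPodsExhausted = errors.New("all pod scrapes failed")

// scrapeAll fans out over pods and, instead of swallowing failures,
// remembers the first concrete scrape error in a buffered channel so it
// can be wrapped into errPodsExhausted for logging/debugging.
func scrapeAll(pods []string, scrapePod func(pod string) error) error {
	var grp errgroup.Group
	var successes int32
	firstErr := make(chan error, 1) // capacity 1: keep only the first failure

	for _, pod := range pods {
		pod := pod
		grp.Go(func() error {
			if err := scrapePod(pod); err != nil {
				select {
				case firstErr <- err: // first failure wins
				default: // a failure is already recorded; drop the rest
				}
				return nil
			}
			atomic.AddInt32(&successes, 1)
			return nil
		})
	}
	_ = grp.Wait() // the goroutines above never return a non-nil error

	if atomic.LoadInt32(&successes) == 0 {
		select {
		case err := <-firstErr:
			// Surface the underlying cause instead of dropping it.
			return fmt.Errorf("%w: first failure: %v", errPodsExhausted, err)
		default:
			return errPodsExhausted // there were no pods to scrape at all
		}
	}
	return nil
}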

@vagababov
Contributor Author

OK, fixed all but the logging one. I'd prefer to thread loggers into stats_scraper and add observability to this file in a separate PR; this one is already too big for my taste.
Graphs after coffee :)

@julz
Member

julz commented May 1, 2020

fair enough, polishing the error logging later seems legit - looking forward to the graphs!

@vagababov
Contributor Author

/test pull-knative-serving-unit-tests

@vagababov
Contributor Author

Before:
[graph omitted]

@vagababov
Contributor Author

After:
[graph omitted]

@vagababov
Contributor Author

In sustained mode: p95 ~5ms vs ~45ms.
In panic mode (lots of new pods): p95 ~16ms vs ~400ms.
😓

@vagababov
Contributor Author

For posterity: this is running a sustained load of 50 concurrent requests for 200s against a svc with CC=2.
I'll try 100, but it kills Prometheus.
(ノಠ益ಠ)ノ彡┻━┻

@vagababov
Contributor Author

New, @100, scaled to 40 pods:
[graph omitted]

Before:
[graph omitted]

At some scale this matters a lot, since even at infinite pods I think our sample is still just 16 pods (but this might actually permit us to use a confidence interval tighter than 95%).

@vagababov
Contributor Author

@googlebot rescan

// Got some successful pods.
// TODO(vagababov): perhaps separate |pods| == 1 case here as well?
if len(results) > 0 {
return emptyStat, errPodsExhausted
Contributor

Can we use the results from the successful pods? The current behavior is that an empty result and errPodsExhausted are returned, and we won't fall back to scraping the service. Is that intended?

Contributor Author

As per the design doc, we decided not to. Though we can always change that decision later.
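
To illustrate the decision described here, a small sketch of the caller side: errPodsExhausted is treated as terminal, so we do not fall back to the service scrape. The Stat type and the scrapePods/scrapeService functions are stand-ins for this illustration, not the merged API.

package scraper

import "errors"

var errPodsExhausted = errors.New("pods exhausted")

type Stat struct{}

var emptyStat Stat

// scrape honors the design-doc decision: when the direct pod scrape returns
// errPodsExhausted, we return the empty stat and the error as-is rather than
// retrying via the service (mesh) scrape.
func scrape(scrapePods, scrapeService func() (Stat, error)) (Stat, error) {
	stat, err := scrapePods()
	switch {
	case err == nil:
		return stat, nil
	case errors.Is(err, errPodsExhausted):
		// Direct scraping was attempted but could not build a full sample:
		// per the design doc, do not fall back to scraping the service.
		return emptyStat, err
	default:
		// Direct scraping was not possible at all; use the service path.
		return scrapeService()
	}
}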

idx := int32(-1)
for i := 0; i < sampleSize; i++ {
grp.Go(func() error {
for {
Contributor

Could we add a comment here explaining why we use two for loops? At first I was confused by the myIdx >= len(pods) check and by there being no return when the error from s.sClient.Scrape(target) is not nil.

Contributor Author

Done.
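
For readers of this thread, a hedged reconstruction of the two-loop structure with the comments the reviewer asked for; the identifiers (pods, portAndPath, sampleSize, results, s.sClient, errPodsExhausted) mirror the snippets quoted in this review, but the merged code may differ in details.

idx := int32(-1)
// Outer loop: launch exactly sampleSize scraper goroutines.
for i := 0; i < sampleSize; i++ {
	grp.Go(func() error {
		// Inner loop: each goroutine keeps claiming the next unclaimed pod
		// until it scrapes one successfully or the pod list runs out, so a
		// single unreachable pod does not cost us a whole sample.
		for {
			// Acquire the next pod; AddInt32 hands out each index exactly once.
			myIdx := int(atomic.AddInt32(&idx, 1))
			if myIdx >= len(pods) {
				// Every pod has already been claimed; give up on this sample.
				return errPodsExhausted
			}
			// Scrape!
			target := "http://" + pods[myIdx] + ":" + portAndPath
			stat, err := s.sClient.Scrape(target)
			if err == nil {
				results <- stat
				return nil
			}
			// Scrape failed: loop around and try the next pod.
		}
	})
}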

grp.Go(func() error {
for {
// Acquire next pod.
myIdx := int(atomic.AddInt32(&idx, 1))
Contributor

I don't get why we +1 for each iteration of the inner for loop.

Contributor Author

We're just going down the list of pods, and we need each pod to be scraped only once, hence each iteration selects the next available one.
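
As a standalone illustration of that answer (a hypothetical demo, not code from this PR): atomic.AddInt32 on a counter starting at -1 hands each caller a distinct index 0, 1, 2, ..., so concurrent goroutines never claim the same pod twice.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	idx := int32(-1)
	var wg sync.WaitGroup
	claimed := make(chan int, 8)

	// Four workers each claim two indices; every AddInt32 call returns a
	// distinct value, so no index is handed to two goroutines.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := 0; n < 2; n++ {
				claimed <- int(atomic.AddInt32(&idx, 1))
			}
		}()
	}
	wg.Wait()
	close(claimed)

	for i := range claimed {
		fmt.Println("claimed index", i) // indices 0..7, each exactly once
	}
}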

pkg/autoscaler/metrics/stats_scraper_test.go — four review threads marked outdated and resolved
@vagababov
Contributor Author

All done, thanks!

@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File                                      Old Coverage   New Coverage   Delta
pkg/autoscaler/metrics/stats_scraper.go   89.9%          94.5%          4.7

@yanweiguo
Contributor

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2020
@knative-test-reporter-robot

The following jobs failed:

Test name                                Triggers                                 Retries
pull-knative-serving-integration-tests   pull-knative-serving-integration-tests   1/3

Automatically retrying due to test flakiness...
/test pull-knative-serving-integration-tests

@vagababov
Contributor Author

/test pull-knative-serving-integration-tests

@knative-prow-robot knative-prow-robot merged commit 00775f2 into knative:master May 2, 2020
vagababov added a commit to vagababov/serving that referenced this pull request May 2, 2020
This was requested in the knative#7804 review, especially to log the errors about pod scrape failures.
This finishes knative#5978
knative-prow-robot pushed a commit that referenced this pull request May 2, 2020
This was requested in the #7804 review, especially to log the errors about pod scrape failures.
This finishes #5978