
Actual implementation of direct pod scraping #7804

Merged

Conversation

vagababov
Contributor

This is the actual scraping implementation for #5978

/assign @markusthoemmes @yanweiguo @mattmoor

@knative-prow-robot knative-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 1, 2020
@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vagababov

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 1, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label May 1, 2020
// Scrape!
target := "http://" + pods[myIdx] + ":" + portAndPath
stat, err := s.sClient.Scrape(target)
if err == nil {
Member

Kind of sad that the real error gets swallowed here... is there any non-awful way to return one of the actual errors instead of errPodsExhausted? Maybe a chan error we can write the error to, and pull the first one out to log/wrap from the block on L258 or something? It would help debugging if we ever find a system falling back to mesh scraping unexpectedly, or failing to scrape.

Contributor Author

There might be more than one error, which makes packaging them hard. HttpScraper should log these (or perhaps right here, though we have no logging in this file right now).
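
For readers of this thread, here is a minimal sketch of the chan-error idea floated above, assuming an errgroup-style fan-out like the one in this PR; scrapeAll, scrapePod, and the exact wrapping are illustrative stand-ins, not the merged code (the actual logging landed in a follow-up, see the commits referenced at the end of this page).

package scraper

import (
	"errors"
	"fmt"
	"sync/atomic"

	"golang.org/x/sync/errgroup"
)

var errPodsExhausted = errors.New("all pod scrapes failed")

// scrapeAll fans out over pods and, instead of swallowing failures,
// remembers the first concrete scrape error in a buffered channel so it
// can be wrapped into errPodsExhausted for logging/debugging.
func scrapeAll(pods []string, scrapePod func(pod string) error) error {
	var grp errgroup.Group
	var successes int32
	firstErr := make(chan error, 1) // capacity 1: keep only the first failure

	for _, pod := range pods {
		pod := pod
		grp.Go(func() error {
			if err := scrapePod(pod); err != nil {
				select {
				case firstErr <- err: // first failure wins
				default: // a failure is already recorded; drop the rest
				}
				return nil
			}
			atomic.AddInt32(&successes, 1)
			return nil
		})
	}
	_ = grp.Wait() // the goroutines above never return a non-nil error

	if atomic.LoadInt32(&successes) == 0 {
		select {
		case err := <-firstErr:
			// Surface the underlying cause instead of dropping it.
			return fmt.Errorf("%w: first failure: %v", errPodsExhausted, err)
		default:
			return errPodsExhausted // there were no pods to scrape at all
		}
	}
	return nil
}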

@vagababov
Contributor Author

OK, fixed all but the logging one. I'd prefer to thread loggers into stats_scraper and add observability to this file in a separate PR; this one is already too big for my taste.
Graphs after coffee :)

@julz
Member

julz commented May 1, 2020

fair enough, polishing the error logging later seems legit - looking forward to the graphs!

@vagababov
Contributor Author

/test pull-knative-serving-unit-tests

@vagababov
Contributor Author

Before:
[graph omitted]

@vagababov
Contributor Author

After:
[graph omitted]

@vagababov
Contributor Author

In sustained mode: p95 ~5ms vs ~45ms.
In panic mode (lots of new pods): p95 ~16ms vs ~400ms.
😓

@vagababov
Contributor Author

For posterity: this is running a sustained load of 50 concurrent requests for 200s against a svc with CC=2.
I'll try 100, but it kills Prometheus.
(ノಠ益ಠ)ノ彡┻━┻

@vagababov
Contributor Author

New, @100, scaled to 40 pods:
[graph omitted]

Before:
[graph omitted]

At some scale this matters a lot, since even at infinite pods I think our sample is still just 16 pods (but this might actually permit us to use a confidence interval tighter than 95%).

@vagababov
Contributor Author

@googlebot rescan

// Got some successful pods.
// TODO(vagababov): perhaps separate |pods| == 1 case here as well?
if len(results) > 0 {
return emptyStat, errPodsExhausted
Contributor

Can we use the results from the successful pods? The current behavior is that an empty result and errPodsExhausted are returned, and we won't fall back to scraping the service. Is that intended?

Contributor Author

As per the design doc, we decided not to. Though we can always change that decision later.
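
To illustrate the decision described here, a small sketch of the caller side: errPodsExhausted is treated as terminal, so we do not fall back to the service scrape. The Stat type and the scrapePods/scrapeService functions are stand-ins for this illustration, not the merged API.

package scraper

import "errors"

var errPodsExhausted = errors.New("pods exhausted")

type Stat struct{}

var emptyStat Stat

// scrape honors the design-doc decision: when the direct pod scrape returns
// errPodsExhausted, we return the empty stat and the error as-is rather than
// retrying via the service (mesh) scrape.
func scrape(scrapePods, scrapeService func() (Stat, error)) (Stat, error) {
	stat, err := scrapePods()
	switch {
	case err == nil:
		return stat, nil
	case errors.Is(err, errPodsExhausted):
		// Direct scraping was attempted but could not build a full sample:
		// per the design doc, do not fall back to scraping the service.
		return emptyStat, err
	default:
		// Direct scraping was not possible at all; use the service path.
		return scrapeService()
	}
}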

idx := int32(-1)
for i := 0; i < sampleSize; i++ {
grp.Go(func() error {
for {
Contributor

Could we add a comment here explaining why we use two for loops? At first I was confused by the myIdx >= len(pods) check and by there being no return when the error from s.sClient.Scrape(target) is not nil.

Contributor Author

Done.
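
For readers of this thread, a hedged reconstruction of the two-loop structure with the comments the reviewer asked for; the identifiers (pods, portAndPath, sampleSize, results, s.sClient, errPodsExhausted) mirror the snippets quoted in this review, but the merged code may differ in details.

idx := int32(-1)
// Outer loop: launch exactly sampleSize scraper goroutines.
for i := 0; i < sampleSize; i++ {
	grp.Go(func() error {
		// Inner loop: each goroutine keeps claiming the next unclaimed pod
		// until it scrapes one successfully or the pod list runs out, so a
		// single unreachable pod does not cost us a whole sample.
		for {
			// Acquire the next pod; AddInt32 hands out each index exactly once.
			myIdx := int(atomic.AddInt32(&idx, 1))
			if myIdx >= len(pods) {
				// Every pod has already been claimed; give up on this sample.
				return errPodsExhausted
			}
			// Scrape!
			target := "http://" + pods[myIdx] + ":" + portAndPath
			stat, err := s.sClient.Scrape(target)
			if err == nil {
				results <- stat
				return nil
			}
			// Scrape failed: loop around and try the next pod.
		}
	})
}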

grp.Go(func() error {
for {
// Acquire next pod.
myIdx := int(atomic.AddInt32(&idx, 1))
Contributor

I don't get why we +1 for each iteration of the inner for loop.

Contributor Author

We're just going down the list of pods, and we need each pod to be scraped only once, hence each iteration selects the next available one.
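
As a standalone illustration of that answer (a hypothetical demo, not code from this PR): atomic.AddInt32 on a counter starting at -1 hands each caller a distinct index 0, 1, 2, ..., so concurrent goroutines never claim the same pod twice.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	idx := int32(-1)
	var wg sync.WaitGroup
	claimed := make(chan int, 8)

	// Four workers each claim two indices; every AddInt32 call returns a
	// distinct value, so no index is handed to two goroutines.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := 0; n < 2; n++ {
				claimed <- int(atomic.AddInt32(&idx, 1))
			}
		}()
	}
	wg.Wait()
	close(claimed)

	for i := range claimed {
		fmt.Println("claimed index", i) // indices 0..7, each exactly once
	}
}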

pkg/autoscaler/metrics/stats_scraper_test.go — four review threads marked outdated and resolved
@vagababov
Contributor Author

All done, thanks!

@knative-metrics-robot

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File                                      Old Coverage   New Coverage   Delta
pkg/autoscaler/metrics/stats_scraper.go   89.9%          94.5%          4.7

@yanweiguo
Contributor

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label May 1, 2020
@knative-test-reporter-robot

The following jobs failed:

Test name                                Triggers                                 Retries
pull-knative-serving-integration-tests   pull-knative-serving-integration-tests   1/3

Automatically retrying due to test flakiness...
/test pull-knative-serving-integration-tests

@vagababov
Contributor Author

/test pull-knative-serving-integration-tests

@knative-prow-robot knative-prow-robot merged commit 00775f2 into knative:master May 2, 2020
vagababov added a commit to vagababov/serving that referenced this pull request May 2, 2020
This was requested in the knative#7804 review, especially to log the errors about pod scrape failures.
This finishes knative#5978
knative-prow-robot pushed a commit that referenced this pull request May 2, 2020
This was requested in the #7804 review, especially to log the errors about pod scrape failures.
This finishes #5978