Actual implementation of direct pod scraping #7804
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vagababov. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
// Scrape!
target := "http://" + pods[myIdx] + ":" + portAndPath
stat, err := s.sClient.Scrape(target)
if err == nil {
Kind of sad that the real error gets swallowed here. Is there any non-awful way to return one of the actual errors instead of errPodsExhausted - maybe a chan error we can write the errors to, then pull the first one out to log/wrap from the block on L258 or something? It would help debugging if we ever find a system falling back to mesh scraping unexpectedly or failing to scrape.
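To make the suggestion concrete, here is a rough, self-contained sketch of the chan error idea (scrapeAll, the toy scrape callback, and the error wrapping are invented for illustration and are not the PR's actual code):

```go
package main

import (
	"errors"
	"fmt"

	"golang.org/x/sync/errgroup"
)

var errPodsExhausted = errors.New("pods exhausted")

// scrapeAll is a toy stand-in for the scraper: it runs one goroutine per
// target, records the first concrete scrape error in a buffered channel,
// and wraps it when everything fails, instead of returning only the
// errPodsExhausted sentinel.
func scrapeAll(targets []string, scrape func(string) error) error {
	var grp errgroup.Group
	firstErr := make(chan error, 1)

	for _, t := range targets {
		t := t
		grp.Go(func() error {
			if err := scrape(t); err != nil {
				select {
				case firstErr <- err: // keep only the first real error
				default:
				}
				return errPodsExhausted
			}
			return nil
		})
	}

	if err := grp.Wait(); err != nil {
		select {
		case real := <-firstErr:
			return fmt.Errorf("%w: %v", errPodsExhausted, real)
		default:
			return err
		}
	}
	return nil
}
```

The size-one buffered channel keeps just the first concrete error, which is enough to make an unexpected fallback to mesh scraping debuggable without changing the errPodsExhausted sentinel that callers match on.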
There might be more than one error, which makes packaging them hard. HttpScraper should log (or perhaps right here, though we have no logging right now).
OK, fixed all but the logging one. I'd prefer to thread loggers into stats_scraper separately and add observability to this file in a separate PR. This one is already too big for my taste.
Fair enough, polishing the error logging later seems legit - looking forward to the graphs!
/test pull-knative-serving-unit-tests
In sustained mode: p95 latency is ~5ms vs ~45ms.
For posterity, this is running a sustained load of 50 concurrent requests to a svc with CC=2 for 200s.
New run @100 scaled to 40 pods. At some scale this matters a lot, since at infinite pods I think our sample is just 16 pods (though this might actually permit us to use a tighter than 95% confidence interval).
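For context on the 16-pod figure, sampling a mean from a finite population is usually sized with the standard finite-population correction, where the asymptotic sample size n0 is only reached as the pod count grows. A small illustrative sketch (the n0 = 16 constant and the pod counts below are assumptions for this example, not the scraper's actual configuration):

```go
package main

import (
	"fmt"
	"math"
)

// sampleSize applies the standard finite-population correction to a base
// sample size n0 (the size needed for an effectively infinite population).
func sampleSize(n0, population float64) float64 {
	if population <= 0 {
		return 0
	}
	return math.Ceil(n0 / (1 + (n0-1)/population))
}

func main() {
	const n0 = 16 // assumed asymptotic sample size mentioned in the thread
	for _, pods := range []float64{5, 40, 100, 1000} {
		fmt.Printf("pods=%4.0f -> sample=%2.0f\n", pods, sampleSize(n0, pods))
	}
}
```

With these assumed numbers, 40 pods gives a sample of 12, and the sample approaches 16 as the pod count grows.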
@googlebot rescan
// Got some successful pods.
// TODO(vagababov): perhaps separate |pods| == 1 case here as well?
if len(results) > 0 {
	return emptyStat, errPodsExhausted
Can we use the results from the pods that were scraped successfully? The current behavior is that an empty result and errPodsExhausted are returned, and we won't fall back to scraping the service. Is this intended?
As per the design doc, we decided not to. Though we can always change that decision.
idx := int32(-1)
for i := 0; i < sampleSize; i++ {
	grp.Go(func() error {
		for {
Could we add a comment here on why we use two for loops? At first I was confused by the myIdx >= len(pods) check and by the lack of a return when the error from s.sClient.Scrape(target) is not nil.
Done.
grp.Go(func() error {
	for {
		// Acquire next pod.
		myIdx := int(atomic.AddInt32(&idx, 1))
I don't get why we increment by 1 on every iteration of the second for loop.
We're just going down the list of pods, and we need each pod to be scraped only once, hence we select the next available one.
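To illustrate the pattern being discussed, here is a self-contained toy version (fakeScrape and the pod names are invented; only the idx / atomic.AddInt32 claiming scheme mirrors the diff): the outer loop launches one goroutine per desired sample, and the inner loop lets each goroutine keep claiming the next unclaimed pod index until a scrape succeeds or the list runs out.

```go
package main

import (
	"fmt"
	"sync/atomic"

	"golang.org/x/sync/errgroup"
)

func main() {
	pods := []string{"pod-0", "pod-1", "pod-2", "pod-3", "pod-4"}
	sampleSize := 3

	// Shared cursor: each AddInt32 hands the caller a pod index nobody
	// else will ever get, so every pod is scraped at most once.
	idx := int32(-1)

	var grp errgroup.Group
	// Outer loop: one goroutine per sample we want.
	for i := 0; i < sampleSize; i++ {
		grp.Go(func() error {
			// Inner loop: keep claiming the next pod until a scrape
			// succeeds or we run out of pods.
			for {
				myIdx := int(atomic.AddInt32(&idx, 1))
				if myIdx >= len(pods) {
					return fmt.Errorf("pods exhausted")
				}
				if err := fakeScrape(pods[myIdx]); err != nil {
					continue // try the next unclaimed pod
				}
				return nil
			}
		})
	}
	if err := grp.Wait(); err != nil {
		fmt.Println("scrape failed:", err)
	}
}

// fakeScrape stands in for the real pod scrape; it fails for pod-1 only.
func fakeScrape(pod string) error {
	if pod == "pod-1" {
		return fmt.Errorf("scrape of %s failed", pod)
	}
	fmt.Println("scraped", pod)
	return nil
}
```

In this toy run three pods end up scraped: the goroutine that draws pod-1 fails and simply moves on to the next unclaimed index.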
All done, thanks!
The following is the coverage report on the affected files.
/lgtm
The following jobs failed:
Automatically retrying due to test flakiness...
/test pull-knative-serving-integration-tests
This was requested in the knative#7804 review, especially to log the errors about pod scrape failures. This finishes knative#5978.
This is the actual scraping implementation for #5978
/assign @markusthoemmes @yanweiguo @mattmoor