Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(profiler): Force gax to retry in case of certificate errors #3178
fix(profiler): Force gax to retry in case of certificate errors #3178
Changes from 10 commits
b4bc13d
e5f42d1
b94cacf
3469efd
a42dd12
c9bb5e3
ecee895
e07097d
d627d00
7acc10c
1fa8405
aae6c09
08816b7
ba22df5
ea017b1
d5e7e8a
a6958be
797c090
5abeb75
fd60f31
113e641
4308d15
85790f1
092feb1
6131922
db1dfa8
e768510
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this got too complicated for the problem in hand. I'm not sure we even need to return err from createProfile, I think simply
should be enough to fix the original problem. This is the only error that is not retried by gax.Invoke - see
https://github.com/googleapis/gax-go/blob/21fd469b63bc1a2bed9b99b614b4b9dc54a8aeae/v2/invoke.go#L83.
Sure, this might change tomorrow, but it's unlikely and I don't think this is worth the complexity of crafting another retry utility here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a false dichotomy. There are solutions which are simple and robust. For example, something as simple as a
time.Sleep(time.Second)
in the loop would go a long way towards addressing my concerns while being 24 times simpler than what's already proposed here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'd challenge "not worth it". For you, this is an annoying bug that you just want to close. For me, it's a bug that very nearly brought down a production website, and could do so again if any one of a very large number of things outside my control go wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"not worth it" == "not a lot of benefit over the proposed fix", not "the bug is unimportant".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what's the benefit of your proposal over my proposed fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be happy if you wanted to do both. The important thing to me is that by reading nothing more than the code in profiler.go, and ideally just one or two functions in there, I can convince myself that every iteration of the loop will have some non-zero amount of idle time. There are a lot of ways that could be accomplished.
But so far this PR guarantees some idle time only under some conditions. I can't predict when those conditions will occur since they depend on gax, the go libraries, and the code running on Google servers, among other things. Since I can't verify the correctness of the code I can't be sure there isn't a possibility that the agent won't consume 100% CPU, and so I can't be comfortable running it in production.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies for the delay. I will refactor this PR to what aalexand@ suggests above as it will address the root cause while keeping the code changes minimal and simple.
Having agents self manage delays is not something we are considering at this time to ensure the backend can appropriately control the sampling rate as a whole.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me be very clear.
My patch does not put the agent in control of the delay.
You do not seem to understand this point, so let me put it another way:
My patch retains the ability of the server to control the delay between profiles.
Profiles are always 10 seconds long, right? And only one profile can be taken at a time, right? So it should be impossible to profile more than 6 times per minute, right?
So if the agent finds that the main loop takes only 20 microseconds to execute, could this be anything other than a failure of the profiling implementation (and that could be the part that talks to the API, or the part that collects the profiling data)?
I don't see any reason to avoid checking the parameters to the agent for sanity. This is a normal thing to do, especially when those parameters are coming from a remote service.