fix(profiler): Force gax to retry in case of certificate errors #3178
I don't really like this solution.
Firstly, certificate errors aren't really permanent. An administrator, on seeing these errors, could install the appropriate certificates. There could be a certificate misconfiguration on the server side. Something funny with DNS or IP routing could be causing the request to go to the wrong host (MITM attack, either malicious or part of a corporate firewall, administrator error in using iptables, or some screw-up with BGP on the internet...). While these errors should be uncommon they shouldn't be unrecoverable. In fact they should be expected because TLS exists to mitigate them: that implies they can and will happen.
What this patch does is shut down profiling whenever TLS does its job. Some of these failure modes could shut down profiling for every Google customer until they restart their processes. Unless they've built retry behavior of their own (which they won't). And since "keep profiling until the process exits" is by far the most common use case, why would that retry behavior not be implemented in the library?
Secondly, I don't think the CPU spinning bug is because of the certificate error behavior in gax. The intention when a user calls .Start() is pretty clear:
Looking at the implementation of
A failure to anticipate that the call may return immediately if certificate validation fails is just one way this might go wrong. Is it the only way? Are the people making changes to gax aware of the implicit contract that's been created here and mindful of how they might break that contract, causing similar failures in the future?
Admittedly, exiting the loop when createProfile encounters an error is better -- given the choice between stopping profiling and pegging a CPU with and endless loop I'd choose the former. But again, this is relying on createProfile returning an error if it ever returns immediately, which is probably true but difficult to prove. And it's opening the door for a new class of bugs that will probably be along the lines of "profiling just randomly stops working sometimes and I have no idea why".
In case you didn't see it, I've already proposed a solution here: #3159
A further improvement might be to retain the propagation of the error introduced in this patch, but instead of exiting the loop, log those errors or otherwise make them available. Right now they just go to the debug log which is probably disabled, but these errors certainly rise to a level of importance above "debug". And with my patch included it's guaranteed these errors won't be logged at a rate of more than 1 per minute (which is more than I can say for the previous implementation, which generated 2 days worth of log volume over the span of 10 minutes)
@bitglue I will address your concerns. I am implementing an exponential backoff on errors.
However, I think the application will need to be restarted in case of bad certificates. I ran few simple experiments:
I did not investigate this further. I assume some components maybe caching the certificates.
I don't think backoff on errors is quite the right solution, either. The problem is there's a loop which calls a function as fast as it can, and if that function returns immediately, the result is consuming 100% of a CPU.
That could be a catastrophic failure for many programs, so a little defensive programming seems appropriate here.
Now, it's probably the case that if the function is going to return immediately, it's going to return an error. But can you be 100% certain it's so? Evidently someone already thought so in the past, and they've been wrong at least once. What's to say the function won't return immediately without an error?
Currently, in the happy path, the
Now think about that from my perspective: with one mistake, Google could change the behavior of the API to be much faster, instantly raising the CPU usage of my processes. This could easily bring down my application. Do I want to trust Google with the availability of my application? No, I do not.
The solution here is to have some simple code which very directly limits the rate of that loop under any circumstances. That's what the timer accomplishes.
In normal operating conditions, the agent cannot decide on its own how long to wait between createProfile calls. The backend controls the agent collection and backoff behavior to maintain the required sampling rate. We document the behavior at https://cloud.google.com/profiler/docs/concepts-profiling#collection
In cases where agent is unable to connect to the backend, this PR will ensure the agent won't monopolize the resources.
Please let us know if you find any other instances where the agent is misbehaving and we will fix it promptly.
That is not what my patch does. My patch instructs the agent to wait at a minimum 1 minute. Maybe that's a bit too long: in the case of just 1 process and say 6 profile types, 10 seconds per iteration should be sufficient.
The agent certainly can check what its being told to do for reasonableness. Under no circumstances does it make sense to attempt to profile at a rate of millions per second.
If the server should desire a longer pause, my patch respects that. The call to
That's not the point. I do not want this agent blowing up a second time. I was very surprised when it blew up the first time, so I looked at the code and found it to be brittle because the implementation provides no guaranteed lower bound on iteration time. Your patch does not fully address this issue because it only addresses the case where
Can I audit all the code to be reasonably confident there are no cases where
And like I said, I can think of a case where it would return too quickly: when the API is not behaving as currently intended. And while I hope there is nothing in this package, in its dependencies, or in all the code running on Google's servers, or all the networking between, that is simply too much surface area to simply trust on the weak guarantee that "it was designed that way and we hope it works".
gax retries and uses exponential backoff. Now we have exponential backoff to cover cases where gax does not retry. I believe this sufficiently address the root cause of this issue and similar cases.
Profile collection is done over a 10 second interval. So in case of only one agent, the maximum number of times it can call createProfile is ~6. So in case of successful API calls, the number of calls are limited by the number of profile types enabled.
To make this more robust and defensive against unexpected API behavior, we could limit the number of iterations in
@aalexand WDYT? I can update the PR to enforce an upper limit on number of iterations over a one minute duration.