[5.x] Fix supervisor reprovisioning #1288
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
When a supervisor gets killed and the status code is not 0, 2 or 13, the supervisor tries to restart by calling the reprovision method. That restarting always fails, because the reprovision method puts the AddSupervisor on the queue, which results in the SupervisorCommand being executed. One of the first things that are done there is to check if there are already supervisors running with the same name by calling the ensureNoDuplicateSupervisors method, which checks if a supervisor with that name already exists.
At the point of restarting, that key will already exist, so reprovisioning doesn't work with the current code.
Why we were running into this issue specifically: We are running horizon on multiple k8s pods. We noticed a bunch of pods with only a few of the supervisors running after a certain amount of time. With the help of #1284 we found that we got a bunch of SuperVisorDiedEvents with exit code 137 (Indicating the pod memory limit was reached), immediately followed by the same event with exit code 13(Indicating that a supervisor with the same name already exists).
We have been running this fixed code in production, and the second event with status code 13 is now not triggered anymore in this scenario, and our supervisors gracefully restart after exiting with code 137.
This PR makes sure that the key is removed before attempting to restart the supervisor.
Also fixes the underlying issue in #1273