Use two replicas for Kubeapps-apis service in CI #3867
Merged
Signed-off-by: Michael Nelson minelson@vmware.com
Description of the change
Updates the CI deployment to use two replicas for the kubeapps-apis service.
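A minimal sketch of the kind of override involved, assuming the CI pipeline passes a Helm values file when deploying the chart (the exact key path and file name here are illustrative, not necessarily what the chart uses):

```yaml
# Hypothetical CI values override: run two replicas of the kubeapps-apis
# service so a single pod failure (e.g. an OOM restart) is routed around
# by the Service instead of surfacing to the browser as a 502.
kubeappsapis:
  replicaCount: 2
```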
We've been experiencing intermittent issues with the last e2e test, 08-rollback.js, lately, and they appear to be getting more frequent. I first added the same retries as Antonio had done in #3866 so that I could continue debugging, and found that the first screenshot (first failure) shows an error:
while the second and third retries just reload the same page without error (though note that the update to the replicas is lost, so it shows 1 again):
So an error is occurring, but the retry then simply re-loads the page (without re-clicking on the deploy), which works fine but leaves the test in a different state from the one expected.
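The retry-the-whole-action approach I considered could be sketched as a small helper like the one below. This is a hypothetical illustration, not code from the test suite; `retryAction` is an invented name, and in the real e2e test the wrapped action would be the full deploy interaction (navigate, fill the form, click deploy) rather than a bare page reload:

```javascript
// Hypothetical sketch: retry the *full* action on failure, rather than only
// reloading the page, so a transient 502 during the deploy click is actually
// retried with the click included.
async function retryAction(action, retries = 3) {
  let lastError;
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // Re-run the whole action from scratch on each attempt.
      return await action(attempt);
    } catch (err) {
      lastError = err;
    }
  }
  // All attempts failed; surface the last error to the test runner.
  throw lastError;
}
```

In the end this felt like working around the symptom, which is why the PR opts for a second replica instead.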
I then retried the e2e test with SSH enabled 3 times, and of course, it passed three times straight (perhaps they allocate more memory when SSH is enabled; not sure, as I didn't check).
But I can reproduce the exact same error locally by getting a package ready to deploy, then scaling the apis server to 0, then deploying (same screenshot as first error above).
So I'm confident that the issue is a 502 at that point, perhaps due to OOM or similar.
I started investigating how we could work around this in the CI test by retrying with the deploy click included, or by handling a 502, but realised that we'd be working around a problem which Kubernetes already solves by avoiding a single point of failure. The only reason we're hitting this is that we scale all deployments down to a single replica in CI to keep resource usage down. A better solution here is to not run the kubeapps-apis service with a single point of failure and let Kubernetes route the requests across replicas.
Of course, this doesn't mean we shouldn't continue to learn more about the memory usage of the service, but it does give us a more realistic e2e test and, IMO, solves the current intermittent CI issue.
Benefits
More stable CI.
Possible drawbacks
Extra resources required in CI (not a big deal).
Applicable issues
Additional information