Nomad autoscaler deployment loses canary status and gets stuck #20295
Labels: stage/needs-investigation, theme/autoscaling, theme/deployments, type/bug
Nomad version
1.7.3
Operating system and Environment details
Ubuntu 22.04.2 LTS
Issue
From time to time, autoscaling produces a deployment that partially fails. The deployment never completes and hangs in a state that appears corrupt.
In the example below, the autoscaler is reducing the instance count from 5 to 2. We came across this case an hour after the deployment started.
Given that the deployment is auto promoting and auto reverting, and that it has a tight progress deadline of 31 seconds, I don't understand why the deployment is still running and stuck hours later. The deployment starts at 08:54:13, but its events and deadlines are smeared across about 4 minutes, which also doesn't make sense to me. I'm not sure what it was doing.
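To make the expectation concrete, here is a rough sketch of the check I would expect to hold: a deployment that is still `running` should not have a task group whose progress deadline has already passed. It assumes the deployment JSON from `GET /v1/deployment/:deployment_id` exposes `Status` and a per-group `RequireProgressBy` timestamp. The helper names and the sample payload are hypothetical and are not taken from this deployment.

```python
import re
from datetime import datetime, timezone
from typing import List

# Matches RFC 3339 timestamps as Go marshals them, e.g. "2024-01-01T08:54:44.123456789Z".
_GO_TIME = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def parse_go_time(ts: str) -> datetime:
    """Parse a Go-style RFC 3339 timestamp into an aware datetime."""
    m = _GO_TIME.match(ts)
    if m is None:
        raise ValueError("unrecognised timestamp: %r" % ts)
    base, frac, tz = m.groups()
    # Truncate nanoseconds to microseconds and pad to exactly six digits.
    frac = (frac or ".0")[:7].ljust(7, "0")
    tz = "+00:00" if tz == "Z" else tz
    return datetime.fromisoformat(base + frac + tz)


def overdue_groups(deployment: dict, now: datetime) -> List[str]:
    """Return task groups of a running deployment whose deadline has passed.

    Assumption: the payload has "Status" and "TaskGroups", and each group
    carries "RequireProgressBy" as a timestamp string.
    """
    if deployment.get("Status") != "running":
        return []
    overdue = []
    for name, group in (deployment.get("TaskGroups") or {}).items():
        deadline = group.get("RequireProgressBy")
        # Skip missing values and Go's zero time (year 0001), as a precaution.
        if not deadline or deadline.startswith("0001-"):
            continue
        if parse_go_time(deadline) < now:
            overdue.append(name)
    return sorted(overdue)


# Hypothetical payload: the values are made up for illustration only.
sample = {
    "Status": "running",
    "TaskGroups": {
        "web": {"RequireProgressBy": "2024-01-01T08:54:44.5Z"},
        "idle": {"RequireProgressBy": "0001-01-01T00:00:00Z"},
    },
}
checked_at = datetime(2024, 1, 1, 10, 0, 0, tzinfo=timezone.utc)
stuck = overdue_groups(sample, checked_at)
```

If a check like this returns any group for a deployment that is still `running`, that matches the stuck state described here: the deadline has lapsed, yet the deployment was neither promoted nor reverted.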
The UI overview tab shows a summary that does not include canary information:
The UI deployment tab does show canary information:
The chart showing auto scaling behavior:
The change recorded for the deployment:
A request to the API to get the deployment returns:
A request to the API to get the canary allocation returns:
Consul shows all 3 allocations are healthy.
Reproduction steps
We have hundreds of successful autoscaling deployments every day across many services. I haven't been able to reproduce this on demand.
Expected Result
Actual Result
Nomad Client logs
Consul logs for canary alloc:
Nomad logs don't turn up anything interesting on that alloc.