[stable/concourse] Upgrade to concourse 3.8.0 #3203
Conversation
65dda5c to 1e9b3b7 Compare
lgtm
- /bin/sh
- -c
- |-
  while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
Is there a condition where retire-worker never succeeds, leaving this stuck in the loop?
Yes, I've seen this happen when the ATC can't or won't clean up the worker in response to retire-worker. On termination the terminationGracePeriod takes effect and the container is killed. When it comes back it can get stuck in this loop: the pod is live, but the worker won't register because the worker process hasn't started yet; it's still calling retire-worker because the old worker still exists in Concourse.
The only resolution is to intervene manually with the fly CLI using fly prune-worker. I've requested that the concourse command add an option to forcefully delete a worker: concourse/concourse#1457 (comment), which the startup script could call instead of looping over retire-worker.
This deserves a paragraph in the readme, but I'm not sure if we can make this any better at the moment, unless we can find another way to send that delete request.
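For context, a minimal sketch of how that pre-start loop fits into the worker container's command, assuming it lives in the worker StatefulSet template; the sleep interval, the work-dir path, and the final worker invocation are illustrative, not the chart's exact script:

    command:
      - /bin/sh
      - -c
      - |-
        # Retire any previous incarnation of this worker before starting.
        # If the ATC never reports worker-not-found, this loop spins forever
        # and the only way out is a manual `fly prune-worker --worker ${HOSTNAME}`.
        while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
          sleep 5
        done
        # Start from a clean slate, then run the worker.
        rm -rf /concourse-work-dir/*
        concourse worker --name=${HOSTNAME} --work-dir /concourse-work-dir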
Actually, there is a way to make this better: have the liveness probe ensure the concourse process is up in addition to checking for fatal errors, and make the livenessProbeDelay tuneable. Fail if we're still trying to retire the worker at startup after livenessProbeDelaySeconds has passed. This will trigger a CrashLoopBackOff that should be obvious enough to signal manual intervention.
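Something along these lines, as a rough sketch only; the pgrep check, the fatal-error flag file, and the worker.livenessProbeInitialDelaySeconds value name are assumptions for illustration, not the chart's current API:

    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - |-
            # Fails while the pre-start retire loop is still running (the
            # concourse worker process hasn't been started yet) or when the
            # wrapper script has recorded a known fatal error.
            pgrep -f "concourse worker" > /dev/null && \
              ! test -f /concourse-work-dir/.fatal-error
      initialDelaySeconds: {{ .Values.worker.livenessProbeInitialDelaySeconds }}
      periodSeconds: 10
      failureThreshold: 3

The probe only starts after initialDelaySeconds, so a worker still stuck retiring at that point fails the probe, gets restarted by the kubelet, and should surface as CrashLoopBackOff.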
stable/concourse/Chart.yaml (Outdated)
-version: 0.10.7
-appVersion: 3.6.0
+version: 0.11.0
+appVersion: 3.8.0
The image tag is 3.5.0, so the app version isn't actually 3.8.0. This needs to be cleaned up.
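Roughly, the two files need to move together; a sketch assuming the chart exposes the image via image/imageTag values:

    # Chart.yaml
    version: 0.11.0
    appVersion: 3.8.0

    # values.yaml -- keep imageTag in lockstep with appVersion
    image: concourse/concourse
    imageTag: "3.8.0"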
4dbb1bf to b72c88c Compare
6a34d6e to 0c188a6 Compare
/ok-to-test
Leaving for @viglesiasce or one of the other maintainers of this chart to approve
Make necessary improvements to Concourse worker lifecycle management
2852149 to 1918e0f Compare
Rebased onto master and squashed into a single commit
By the way, I've implemented this and have been running it in production for a while now: https://github.com/EugenMayer/docker-image-concourseci-worker-solid. It uses a trap to do what you are doing here, which makes it a little less intrusive: https://github.com/EugenMayer/docker-image-concourseci-worker-solid/blob/master/worker_wrapper.sh
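For reference, a rough sketch of that trap-based shape, translated into the chart's command style; this is not the linked wrapper script verbatim, and the background/wait structure is just one way to do it:

    command:
      - /bin/sh
      - -c
      - |-
        # Retire the worker when the pod is asked to terminate, instead of
        # looping over retire-worker before startup.
        retire() {
          concourse retire-worker --name=${HOSTNAME}
        }
        trap 'retire; kill ${WORKER_PID}' TERM INT
        concourse worker --name=${HOSTNAME} --work-dir /concourse-work-dir &
        WORKER_PID=$!
        wait ${WORKER_PID}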
/hold
/hold cancel
/test all
Would love to see this merged, if it's ready (gentle nudge :) ). Waiting for v3.8.0 to test a move from our BOSH deployment to Kubernetes.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: viglesiasce, william-tran. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing
Make necessary improvements to Concourse worker lifecycle management.
Add additional fatal errors emitted as of Concourse 3.8.0 that should
trigger a restart, and remove "unkown volume" as one such error as this
will happen normally when running multiple concourse-web pods.
Try to start workers with a clean slate by cleaning up previous
incarnations of a worker. Call retire-worker before starting. Also
clear the concourse-work-dir before starting.
Call retire-worker in a loop and don't exit that loop until the old
worker is gone. This allows us to remove the fixed worker.postStopDelaySeconds
duration.
Add a note about persistent volumes being necessary.
Add container placement strategy value and default it to random to better spread load across workers. This is a new feature as of 3.7. See vmware-archive/atc#219 for background.
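For illustration, the new setting might be surfaced in values.yaml along these lines; the exact key name and nesting are assumptions, while the semantics come from the container placement strategy option introduced in Concourse 3.7:

    ## values.yaml (sketch)
    concourse:
      ## How the ATC places containers on workers.
      ## "random" spreads load; "volume-locality" is Concourse's upstream default.
      containerPlacementStrategy: random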