Skip to content
This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/concourse] Upgrade to concourse 3.8.0 #3203

Merged
merged 1 commit into from
Jan 16, 2018

Conversation

william-tran
Copy link
Collaborator

@william-tran william-tran commented Jan 3, 2018

Make necessary improvements to Concourse worker lifecycle management.

Add additional fatal errors emitted as of Concourse 3.8.0 that should
trigger a restart, and remove "unkown volume" as one such error as this
will happen normally when running multiple concourse-web pods.

Try to start workers with a clean slate by cleaning up previous
incarnations of a worker. Call retire-worker before starting. Also
clear the concourse-work-dir before starting.

Call retire-worker in a loop and don't exit that loop until the old
worker is gone. This allows us to remove the fixed worker.postStopDelaySeconds
duration.

Add a note about persistent volumes being necessary.

Add container placement strategy value and default it to random to better spread load across workers. This is a new feature as of 3.7. See vmware-archive/atc#219 for background.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 3, 2018
@nexeck
Copy link
Contributor

nexeck commented Jan 3, 2018

lgtm

- /bin/sh
- -c
- |-
while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a condition where retire-worker will never come back good and this will be stuck in the loop?

Copy link
Collaborator Author

@william-tran william-tran Jan 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've seen this happen where ATC can't/won't clean up the worker as a result of calling retire-worker. On termination the terminationGracePeriod will take affect and the container will be killed. When it comes back it could get stuck in this loop. The pod will be live, but the worker won't register because the worker process hasn't started yet, it's still trying to retire-worker because the old worker still exists in concourse.

The only resolution is to manually intervene in the fly cli with fly prune-worker. I've requested that the concourse command add the option to forcefully delete: concourse/concourse#1457 (comment), which would be called in the startup script instead of looping over retire-worker.

This deserves a paragraph in the readme, but I'm not sure if we can make this any better at the moment, unless we can find another way to send that delete request.

Copy link
Collaborator Author

@william-tran william-tran Jan 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is a way to make this better, have the liveness probe ensure the concourse process is up in addition to checking for fatal errors, and have the livenessProbeDelay tuneable. fail if we're still trying to retire the worker at startup after the livenessProbeDelaySeconds has passed. This will trigger a crashloopbackoff that should be obvious enough to signal manual intervention.

version: 0.10.7
appVersion: 3.6.0
version: 0.11.0
appVersion: 3.8.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The image tag is 3.5.0. The apps version isn't 3.8.0. This needs to be cleaned up

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@william-tran william-tran force-pushed the concourse-3.8 branch 2 times, most recently from 4dbb1bf to b72c88c Compare January 3, 2018 16:16
@viglesiasce
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 4, 2018
@mattfarina
Copy link
Contributor

Leaving for @viglesiasce or one of the other maintainers of this chart to approve

Make necessary improvements to Concourse worker lifecycle management.

Add additional fatal errors emitted as of Concourse 3.8.0 that should
trigger a restart, and remove "unkown volume" as one such error as this
will happen normally when running multiple concourse-web pods.

Try to start workers with a clean slate by cleaning up previous
incarnations of a worker. Call retire-worker before starting. Also
clear the concourse-work-dir before starting.

Call retire-worker in a loop and don't exit that loop until the old
worker is gone. This allows us to remove the fixed worker.postStopDelaySeconds
duration.

Add a note about persistent volumes being necessary.

Add containerPlacementStrategy, default random to better spread load
across workers.
@william-tran
Copy link
Collaborator Author

Rebased onto master and squashed into a single commit

@EugenMayer
Copy link

by the way, implemented and running in production for a while now https://github.com/EugenMayer/docker-image-concourseci-worker-solid

using a trap to do what you are doing here, makes it a little less intrusive: https://github.com/EugenMayer/docker-image-concourseci-worker-solid/blob/master/worker_wrapper.sh

@william-tran
Copy link
Collaborator Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 13, 2018
@william-tran
Copy link
Collaborator Author

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 13, 2018
@william-tran
Copy link
Collaborator Author

/test all

@ahume
Copy link
Contributor

ahume commented Jan 16, 2018

Would love to see this merged, if it's ready (gentle nudge :) ). Waiting for v3.8.0 to test a move from our BOSH deployment to Kubernetes.

@viglesiasce
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: viglesiasce, william-tran

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2018
@k8s-ci-robot k8s-ci-robot merged commit ac5f935 into helm:master Jan 16, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants