[stable/concourse] Upgrade to concourse 3.8.0 #3203

william-tran · 2018-01-03T04:19:51Z

Make necessary improvements to Concourse worker lifecycle management.

Add additional fatal errors emitted as of Concourse 3.8.0 that should
trigger a restart, and remove "unknown volume" as one such error, since it
occurs normally when running multiple concourse-web pods.

Try to start workers with a clean slate by cleaning up previous
incarnations of a worker. Call retire-worker before starting. Also
clear the concourse-work-dir before starting.

Call retire-worker in a loop and don't exit that loop until the old
worker is gone. This allows us to remove the fixed worker.postStopDelaySeconds
duration.
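
As a rough illustration of the clean-slate startup described above (not the chart's actual template), the two steps could be sketched in shell as follows. The function names and the RETIRE_CMD and RETRY_SECONDS variables are hypothetical and exist only for this sketch; `concourse retire-worker --name` and the `worker-not-found` check are taken from this PR's diff.

```shell
#!/bin/sh
# Sketch only. RETIRE_CMD and RETRY_SECONDS are hypothetical knobs for this
# example; they are not values exposed by the chart.

# Ask Concourse to retire the previous incarnation of this worker and keep
# asking until the command output reports "worker-not-found".
retire_until_gone() {
  name="$1"
  until ${RETIRE_CMD:-concourse retire-worker} --name="$name" 2>&1 \
      | grep -q worker-not-found; do
    sleep "${RETRY_SECONDS:-5}"
  done
}

# Empty the work dir (including dotfiles) but keep the directory itself,
# since it may be a volume mount point.
clear_work_dir() {
  dir="$1"
  [ -d "$dir" ] || return 0
  rm -rf "${dir:?}"/* "${dir:?}"/.[!.]* "${dir:?}"/..?*
}
```

A startup script would call `retire_until_gone "$HOSTNAME"` and then `clear_work_dir` on the worker's work dir before launching the worker process. Note that the loop has no exit condition of its own, so something outside it has to detect a stuck retirement.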

Add a note about persistent volumes being necessary.

Add container placement strategy value and default it to random to better spread load across workers. This is a new feature as of 3.7. See vmware-archive/atc#219 for background.

nexeck · 2018-01-03T09:34:55Z

lgtm

mattfarina · 2018-01-03T16:06:42Z

stable/concourse/templates/worker-statefulset.yaml

+                  - /bin/sh
+                  - -c
+                  - |-
+                    while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do


Is there a condition where retire-worker will never come back good and this will be stuck in the loop?

Yes, I've seen this happen where ATC can't/won't clean up the worker as a result of calling retire-worker. On termination the terminationGracePeriod will take effect and the container will be killed. When it comes back it could get stuck in this loop. The pod will be live, but the worker won't register because the worker process hasn't started yet; it's still trying to retire-worker because the old worker still exists in Concourse.

The only resolution is to intervene manually with the fly CLI, using fly prune-worker. I've requested that the concourse command add an option to forcefully delete a worker: concourse/concourse#1457 (comment). That option would be called in the startup script instead of looping over retire-worker.

This deserves a paragraph in the readme, but I'm not sure if we can make this any better at the moment, unless we can find another way to send that delete request.

Actually, there is a way to make this better: have the liveness probe ~~ensure the concourse process is up in addition to checking for fatal errors, and have the livenessProbeDelay tuneable.~~ fail if we're still trying to retire the worker at startup after the livenessProbeDelaySeconds has passed. This will trigger a CrashLoopBackOff that should be obvious enough to signal that manual intervention is needed.
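
A minimal sketch of that probe logic, assuming the startup script creates a marker file while it is still retiring the old worker and removes it once the loop exits. The marker file, the log file argument, and the `liveness_check` function are all hypothetical; only the idea of failing during a stuck cleanup and on fatal errors comes from this thread.

```shell
#!/bin/sh
# Sketch only: liveness_check MARKER LOG PATTERNS
#   MARKER    file that exists while startup cleanup is still running (assumed)
#   LOG       worker log file to scan (assumed)
#   PATTERNS  fatal-error strings, one per line, no blank lines
# Returns non-zero when the probe should fail.
liveness_check() {
  marker="$1"
  log="$2"
  patterns="$3"

  # Kubernetes only starts probing after the configured initial delay, so a
  # marker that is still present here means the retire loop has not finished.
  if [ -f "$marker" ]; then
    echo "still trying to retire the previous worker" >&2
    return 1
  fi

  # Fail if any fatal-error string appears in the log (fixed-string match;
  # newline-separated patterns are treated as alternatives by grep).
  if [ -n "$patterns" ] && [ -f "$log" ] && grep -qF -- "$patterns" "$log"; then
    echo "fatal error found in worker log" >&2
    return 1
  fi

  return 0
}
```

Used as an exec probe, a failing check after the initial delay would cause the container to be restarted repeatedly, which is the visible CrashLoopBackOff signal described above.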

mattfarina · 2018-01-03T16:08:11Z

stable/concourse/Chart.yaml

-version: 0.10.7
-appVersion: 3.6.0
+version: 0.11.0
+appVersion: 3.8.0


The image tag is 3.5.0. The app's version isn't 3.8.0. This needs to be cleaned up.

Addressed in https://github.com/kubernetes/charts/pull/3203/files#diff-969838d261afed87173f82e43073caefR71 and https://github.com/kubernetes/charts/pull/3203/files#diff-95cc28250f44d26990fee96ad2fbd63dR16

viglesiasce · 2018-01-04T04:20:38Z

/ok-to-test

mattfarina · 2018-01-04T16:20:28Z

Leaving for @viglesiasce or one of the other maintainers of this chart to approve

william-tran · 2018-01-07T17:21:38Z

Rebased onto master and squashed into a single commit

EugenMayer · 2018-01-09T21:48:46Z

by the way, implemented and running in production for a while now https://github.com/EugenMayer/docker-image-concourseci-worker-solid

using a trap to do what you are doing here, makes it a little less intrusive: https://github.com/EugenMayer/docker-image-concourseci-worker-solid/blob/master/worker_wrapper.sh

william-tran · 2018-01-13T00:33:13Z

/hold

william-tran · 2018-01-13T02:11:05Z

/hold cancel

william-tran · 2018-01-13T15:50:25Z

/test all

ahume · 2018-01-16T19:12:51Z

Would love to see this merged, if it's ready (gentle nudge :) ). Waiting for v3.8.0 to test a move from our BOSH deployment to Kubernetes.

viglesiasce · 2018-01-16T19:20:09Z

/lgtm

k8s-ci-robot · 2018-01-16T19:20:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: viglesiasce, william-tran

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~OWNERS~~ [viglesiasce]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

k8s-ci-robot added labels Jan 3, 2018: cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test), size/M (denotes a PR that changes 30-99 lines, ignoring generated files)

william-tran force-pushed the concourse-3.8 branch from 65dda5c to 1e9b3b7 Compare January 3, 2018 04:20

william-tran mentioned this pull request Jan 3, 2018

[stable/concourse] Upgrade to 3.8.0 #3082

Closed

mattfarina reviewed Jan 3, 2018

mattfarina suggested changes Jan 3, 2018

william-tran force-pushed the concourse-3.8 branch 2 times, most recently from 4dbb1bf to b72c88c Compare January 3, 2018 16:16

egeland approved these changes Jan 3, 2018

william-tran force-pushed the concourse-3.8 branch from 6a34d6e to 0c188a6 Compare January 3, 2018 22:13

k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 4, 2018

matthope approved these changes Jan 4, 2018

mattfarina approved these changes Jan 4, 2018

william-tran mentioned this pull request Jan 7, 2018

[stable/concourse] Refactor values and make secrets optional #3254

Merged

william-tran force-pushed the concourse-3.8 branch from 2852149 to 1918e0f Compare January 7, 2018 17:20

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 13, 2018

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 13, 2018

k8s-ci-robot assigned viglesiasce Jan 16, 2018

k8s-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2018

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2018

k8s-ci-robot merged commit ac5f935 into helm:master Jan 16, 2018
