
[stable/concourse] Upgrade to concourse 3.8.0 #3203

Merged: 1 commit, Jan 16, 2018
4 changes: 2 additions & 2 deletions stable/concourse/Chart.yaml
@@ -1,6 +1,6 @@
name: concourse
version: 0.10.8
appVersion: 3.6.0
version: 0.11.0
appVersion: 3.8.0
description: Concourse is a simple and scalable CI system.
icon: https://avatars1.githubusercontent.com/u/7809479
keywords:
14 changes: 8 additions & 6 deletions stable/concourse/README.md
@@ -55,7 +55,7 @@ $ kubectl scale statefulset my-release-worker --replicas=3

### Restarting workers

If a worker isn't taking on work, you can restart the worker with `kubectl delete pod`. This will initiate a graceful shutdown by "retiring" the worker, with some waiting time before the worker starts up again to ensure concourse doesn't try looking for old volumes on the new worker. The values `worker.postStopDelaySeconds` and `worker.terminationGracePeriodSeconds` can be used to tune this.
If a worker isn't taking on work, you can restart the worker with `kubectl delete pod`. This will initiate a graceful shutdown by "retiring" the worker, to ensure Concourse doesn't try looking for old volumes on the new worker. The value `worker.terminationGracePeriodSeconds` can be used to provide an upper limit on graceful shutdown time before the container is forcefully terminated.
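
For example, with a release named `my-release` (a placeholder), a single worker can be restarted, and the shutdown window widened, roughly like this:

```console
$ kubectl delete pod my-release-worker-0
$ helm upgrade my-release stable/concourse --set worker.terminationGracePeriodSeconds=120
```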

### Worker Liveness Probe

@@ -68,7 +68,7 @@ The following table lists the configurable parameters of the Concourse chart and their default values.
| Parameter | Description | Default |
| ----------------------- | ---------------------------------- | ---------------------------------------------------------- |
| `image` | Concourse image | `concourse/concourse` |
| `imageTag` | Concourse image version | `3.3.2` |
| `imageTag` | Concourse image version | `3.8.0` |
| `imagePullPolicy` |Concourse image pull policy | `Always` if `imageTag` is `latest`, else `IfNotPresent` |
| `concourse.username` | Concourse Basic Authentication Username | `concourse` |
| `concourse.password` | Concourse Basic Authentication Password | `concourse` |
@@ -85,6 +85,7 @@ The following table lists the configurable parameters of the Concourse chart and their default values.
| `concourse.oldResourceGracePeriod` | How long to cache the result of a get step after a newer version of the resource is found | `5m` |
| `concourse.resourceCacheCleanupInterval` | The interval on which to check for and release old caches of resource versions | `30s` |
| `concourse.baggageclaimDriver` | The filesystem driver used by baggageclaim | `naive` |
| `concourse.containerPlacementStrategy` | The selection strategy for placing containers onto workers | `random` |
| `concourse.externalURL` | URL used to reach any ATC from the outside world | `nil` |
| `concourse.dockerRegistry` | A URL pointing to the Docker registry to use to fetch Docker images | `nil` |
| `concourse.insecureDockerRegistry` | Docker registry(ies) (comma separated) to allow connecting to even if not secure | `nil` |
@@ -126,8 +127,7 @@ The following table lists the configurable parameters of the Concourse chart and their default values.
| `worker.minAvailable` | Minimum number of workers available after an eviction | `1` |
| `worker.resources` | Concourse Worker resource requests and limits | `{requests: {cpu: "100m", memory: "512Mi"}}` |
| `worker.additionalAffinities` | Additional affinities to apply to worker pods. E.g: node affinity | `nil` |
| `worker.postStopDelaySeconds` | Time to wait after graceful shutdown of worker before starting up again | `60` |
| `worker.terminationGracePeriodSeconds` | Upper bound for graceful shutdown, including `worker.postStopDelaySeconds` | `120` |
| `worker.terminationGracePeriodSeconds` | Upper bound for graceful shutdown to allow the worker to drain its tasks | `60` |
| `worker.fatalErrors` | Newline delimited strings which, when logged, should trigger a restart of the worker | *See [values.yaml](values.yaml)* |
| `worker.updateStrategy` | `OnDelete` or `RollingUpdate` (requires Kubernetes >= 1.7) | `RollingUpdate` |
| `worker.podManagementPolicy` | `OrderedReady` or `Parallel` (requires Kubernetes >= 1.7) | `Parallel` |
@@ -205,7 +205,7 @@ concourse:
< Insert the contents of your concourse-keys/worker_key.pub file >
```

Alternativelly, you can provide those keys to `helm install` via parameters:
Alternatively, you can provide those keys to `helm install` via parameters:
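
A minimal sketch of that approach, assuming the key material belongs under the `concourse.*` values shown above (the parameter names `concourse.hostKey` and `concourse.workerKeyPub` are inferred for illustration, not verified against this chart version; multi-line values passed via `--set` may need extra escaping in practice):

```console
$ helm install stable/concourse \
    --set "concourse.hostKey=$(cat concourse-keys/host_key)" \
    --set "concourse.workerKeyPub=$(cat concourse-keys/worker_key.pub)"
```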


@@ -243,6 +243,8 @@ persistence:
size: "20Gi"
```

It is highly recommended to use Persistent Volumes for Concourse Workers; otherwise, container images managed by the Worker are stored in an `emptyDir` volume on the node's disk. This interferes with k8s ImageGC, and the node's disk will fill up as a result. This will be fixed in a future release of k8s: https://github.com/kubernetes/kubernetes/pull/57020
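
For example, the worker volume size can be overridden at install time; a sketch, assuming the `persistence.worker.size` value path implied by the snippet above:

```console
$ helm install stable/concourse --set persistence.worker.size=30Gi
```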

### Ingress TLS

If your cluster allows automatic creation/retrieval of TLS certificates (e.g. [kube-lego](https://github.com/jetstack/kube-lego)), please refer to the documentation for that mechanism.
@@ -328,5 +330,5 @@ credentialManager:
## initial periodic token issued for concourse
## ref: https://www.vaultproject.io/docs/concepts/tokens.html#periodic-tokens
##
clientToken: PERIODIC_VAULT_TOKEN
```
4 changes: 2 additions & 2 deletions stable/concourse/templates/configmap.yaml
@@ -20,6 +20,7 @@ data:
concourse-resource-cache-cleanup-interval: {{ .Values.concourse.resourceCacheCleanupInterval | quote }}
concourse-external-url: {{ default "" .Values.concourse.externalURL | quote }}
concourse-baggageclaim-driver: {{ .Values.concourse.baggageclaimDriver | quote }}
container-placement-strategy: {{ .Values.concourse.containerPlacementStrategy | quote }}
garden-docker-registry: {{ default "" .Values.concourse.dockerRegistry | quote }}
garden-insecure-docker-registry: {{ default "" .Values.concourse.insecureDockerRegistry | quote }}
github-auth-organization: {{ default "" .Values.concourse.githubAuthOrganization | quote }}
@@ -37,10 +38,9 @@ data:
generic-oauth-auth-url-param: {{ default "" .Values.concourse.genericOauthAuthUrlParam | quote }}
generic-oauth-scope: {{ default "" .Values.concourse.genericOauthScope | quote }}
generic-oauth-token-url: {{ default "" .Values.concourse.genericOauthTokenUrl | quote }}
worker-post-stop-delay-seconds: {{ .Values.worker.postStopDelaySeconds | quote }}
worker-fatal-errors: {{ default "" .Values.worker.fatalErrors | quote }}
{{ if .Values.credentialManager.vault }}
vault-url: {{ default "" .Values.credentialManager.vault.url | quote }}
vault-path-prefix: {{ default "/concourse" .Values.credentialManager.vault.pathPrefix | quote }}
vault-auth-backend: {{ default "" .Values.credentialManager.vault.authBackend | quote }}
{{ end }}
5 changes: 5 additions & 0 deletions stable/concourse/templates/web-deployment.yaml
@@ -302,6 +302,11 @@ spec:
- name: CONCOURSE_PROMETHEUS_BIND_PORT
value: {{ .Values.web.metrics.prometheus.port | quote }}
{{- end }}
- name: CONCOURSE_CONTAINER_PLACEMENT_STRATEGY
valueFrom:
configMapKeyRef:
name: {{ template "concourse.concourse.fullname" . }}
key: container-placement-strategy
ports:
- name: atc
containerPort: {{ .Values.concourse.atcPort }}
25 changes: 16 additions & 9 deletions stable/concourse/templates/worker-statefulset.yaml
@@ -34,8 +34,13 @@ spec:
- -c
- |-
cp /dev/null /concourse-work-dir/.liveness_probe
rm -rf /concourse-work-dir/*
while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
touch /concourse-work-dir/.pre_start_cleanup
sleep 5
done
rm -f /concourse-work-dir/.pre_start_cleanup
concourse worker --name=${HOSTNAME} | tee -a /concourse-work-dir/.liveness_probe
sleep ${POST_STOP_DELAY_SECONDS}
livenessProbe:
exec:
command:
@@ -49,16 +54,23 @@ spec:
>&2 echo "Fatal error detected: ${FATAL_ERRORS}"
exit 1
fi
if [ -f /concourse-work-dir/.pre_start_cleanup ]; then
>&2 echo "Still trying to clean up before starting concourse. 'fly prune-worker -w ${HOSTNAME}' might need to be called to force cleanup."
exit 1
fi
failureThreshold: 1
initialDelaySeconds: 10
periodSeconds: 10
lifecycle:
preStop:
exec:
command:
- "/bin/sh"
- "-c"
- "concourse retire-worker --name=${HOSTNAME}"
- /bin/sh
- -c
- |-
while ! concourse retire-worker --name=${HOSTNAME} | grep -q worker-not-found; do
sleep 5
done

Contributor:

Is there a condition where retire-worker will never come back good and this will be stuck in the loop?

@william-tran (Collaborator, Author), Jan 3, 2018:

Yes, I've seen this happen where the ATC can't or won't clean up the worker as a result of calling retire-worker. On termination the terminationGracePeriod will take effect and the container will be killed. When it comes back, it can get stuck in this loop: the pod will be live, but the worker won't register because the worker process hasn't started yet; it's still trying to retire-worker while the old worker still exists in Concourse.

The only resolution is to intervene manually with the fly CLI using fly prune-worker. I've requested that the concourse command add an option to forcefully delete a worker (concourse/concourse#1457 (comment)), which would be called in the startup script instead of looping over retire-worker.

This deserves a paragraph in the README, but I'm not sure we can make this any better at the moment, unless we can find another way to send that delete request.

@william-tran (Collaborator, Author), Jan 3, 2018:

Actually, there is a way to make this better: have the liveness probe ensure the concourse process is up in addition to checking for fatal errors, and make the livenessProbeDelay tuneable. Fail if we're still trying to retire the worker at startup after livenessProbeDelaySeconds has passed. That will trigger a CrashLoopBackOff that should be obvious enough to signal manual intervention.
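
The manual intervention described in the thread uses the fly CLI; for example (the target name `my-target` and the worker name are placeholders):

```console
$ fly -t my-target workers
$ fly -t my-target prune-worker -w my-release-worker-0
```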
env:
- name: CONCOURSE_TSA_HOST
valueFrom:
@@ -91,11 +103,6 @@ spec:
configMapKeyRef:
name: {{ template "concourse.concourse.fullname" . }}
key: concourse-baggageclaim-driver
- name: POST_STOP_DELAY_SECONDS
valueFrom:
configMapKeyRef:
name: {{ template "concourse.concourse.fullname" . }}
key: worker-post-stop-delay-seconds
- name: LIVENESS_PROBE_FATAL_ERRORS
valueFrom:
configMapKeyRef:
26 changes: 13 additions & 13 deletions stable/concourse/values.yaml
@@ -13,7 +13,7 @@ image: concourse/concourse
## Concourse image version.
## ref: https://hub.docker.com/r/concourse/concourse/tags/
##
imageTag: "3.5.0"
imageTag: "3.8.0"

## Specify a imagePullPolicy: 'Always' if imageTag is 'latest', else set to 'IfNotPresent'.
## ref: https://kubernetes.io/docs/user-guide/images/#pre-pulling-images
@@ -184,6 +184,12 @@ concourse:
##
baggageclaimDriver: naive

## The selection strategy for placing containers onto workers; as of Concourse 3.7 this can be
## "volume-locality" or "random". Random can better spread load across workers; see
## https://github.com/concourse/atc/pull/219 for background.
##
containerPlacementStrategy: random
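
## For example, to instead place builds on workers that already hold the
## relevant volumes (a sketch of an override, using the alternative strategy
## documented above):
# containerPlacementStrategy: volume-locality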

## A URL pointing to the Docker registry to use to fetch Docker images.
## If unset, this will default to the Docker default
##
@@ -449,24 +455,18 @@ worker:
# value: "value"
# effect: "NoSchedule"

## Time to delay after the worker process shuts down. This inserts time between shutdown and startup
## to avoid errors caused by a worker restart.
postStopDelaySeconds: 60

## Time to allow the pod to terminate before being forcefully terminated. This should include
## postStopDelaySeconds, and should additionally provide time for the worker to retire, e.g.
## = postStopDelaySeconds + max time to allow the worker to drain its tasks. See
## https://concourse.ci/worker-internals.html for worker lifecycle semantics.
terminationGracePeriodSeconds: 120
## Time to allow the pod to terminate before being forcefully terminated. This should provide time for
## the worker to retire, i.e. drain its tasks. See https://concourse.ci/worker-internals.html for worker
## lifecycle semantics.
terminationGracePeriodSeconds: 60

## If any of the strings are found in logs, the worker's livenessProbe will fail and trigger a pod restart.
## Specify one string per line, exact matching is used.
##
## "guardian.api.garden-server.create.failed" appears when the worker's filesystem has issues.
## "unknown handle" appears if a worker didn't cleanly restart.
fatalErrors: |-
guardian.api.garden-server.create.failed
unknown handle
guardian.api.garden-server.run.failed
baggageclaim.api.volume-server.create-volume-async.failed-to-create
## Strategy for StatefulSet updates (requires Kubernetes 1.6+)
## Ref: https://kubernetes.io/docs/concepts/workloads/controllers/statefulset