concourse-worker fails to connect to concourse-web #1

Open
mthaddon opened this issue Jun 18, 2021 · 9 comments

@mthaddon
Owner

Currently there's an issue with concourse-worker connecting to concourse-web as follows:

2021-06-16T09:36:36.701Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.701384755Z","level":"info","source":"baggageclaim","message":"baggageclaim.using-driver","data":{"driver":"overlay"}}
2021-06-16T09:36:36.702Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.702172404Z","level":"info","source":"worker","message":"worker.garden.dns-proxy.started","data":{"session":"1.2"}}
2021-06-16T09:36:36.702Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.702423522Z","level":"info","source":"baggageclaim","message":"baggageclaim.listening","data":{"addr":"127.0.0.1:7788"}}
2021-06-16T09:36:36.702Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.702585418Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-connect-to-tsa","data":{"error":"dial tcp 10.1.234.43:2222: connect: connection refused","session":"4.1"}}
2021-06-16T09:36:36.702Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.702626980Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.dial.failed-to-connect-to-any-tsa","data":{"error":"all worker SSH gateways unreachable","session":"4.1.1"}}
2021-06-16T09:36:36.702Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.702642926Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-dial","data":{"error":"all worker SSH gateways unreachable","session":"4.1"}}
2021-06-16T09:36:36.702Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.702695666Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.exited-with-error","data":{"error":"all worker SSH gateways unreachable","session":"4.1"}}
2021-06-16T09:36:36.702Z [concourse-worker] {"timestamp":"2021-06-16T09:36:36.702769481Z","level":"error","source":"worker","message":"worker.beacon-runner.failed","data":{"error":"all worker SSH gateways unreachable","session":"4"}}
2021-06-16T09:36:37.630Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.629884822Z","level":"info","source":"guardian","message":"guardian.no-port-pool-state-to-recover-starting-clean","data":{}}
2021-06-16T09:36:37.630Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.630515329Z","level":"info","source":"guardian","message":"guardian.metrics-notifier.starting","data":{"interval":"1m0s","session":"5"}}
2021-06-16T09:36:37.630Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.630539336Z","level":"info","source":"guardian","message":"guardian.start.starting","data":{"session":"6"}}
2021-06-16T09:36:37.630Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.630692380Z","level":"info","source":"guardian","message":"guardian.metrics-notifier.started","data":{"interval":"1m0s","session":"5","time":"2021-06-16T09:36:37.630690055Z"}}
2021-06-16T09:36:37.632Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.632314304Z","level":"info","source":"guardian","message":"guardian.cgroups-tmpfs-already-mounted","data":{"path":"/sys/fs/cgroup"}}
2021-06-16T09:36:37.632Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.632424321Z","level":"info","source":"guardian","message":"guardian.mount-cgroup.started","data":{"path":"/sys/fs/cgroup/cpuset","session":"7","subsystem":"cpuset"}}
2021-06-16T09:36:37.632Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.632510869Z","level":"info","source":"guardian","message":"guardian.start.completed","data":{"session":"6"}}
2021-06-16T09:36:37.632Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.632527035Z","level":"error","source":"guardian","message":"guardian.starting-guardian-backend","data":{"error":"bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': permission denied"}}
2021-06-16T09:36:37.632Z [concourse-worker] bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': permission denied
2021-06-16T09:36:37.632Z [concourse-worker] bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': permission denied
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636289358Z","level":"error","source":"worker","message":"worker.garden.gdn-runner.logging-runner-exited","data":{"error":"exit status 1","session":"1.4"}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636382165Z","level":"info","source":"worker","message":"worker.garden.dns-proxy-runner.logging-runner-exited","data":{"session":"1.3"}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636410046Z","level":"error","source":"worker","message":"worker.garden-runner.logging-runner-exited","data":{"error":"Exit trace for group:\ngdn exited with error: exit status 1\ndns-proxy exited with nil
\n","session":"8"}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636516052Z","level":"info","source":"worker","message":"worker.debug-runner.logging-runner-exited","data":{"session":"10"}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636496248Z","level":"info","source":"worker","message":"worker.container-sweeper.sweep-cancelled-by-signal","data":{"session":"6","signal":2}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636557137Z","level":"info","source":"worker","message":"worker.container-sweeper.logging-runner-exited","data":{"session":"13"}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636556472Z","level":"info","source":"worker","message":"worker.healthcheck-runner.logging-runner-exited","data":{"session":"11"}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636536991Z","level":"info","source":"worker","message":"worker.volume-sweeper.sweep-cancelled-by-signal","data":{"session":"7","signal":2}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636595064Z","level":"info","source":"worker","message":"worker.volume-sweeper.logging-runner-exited","data":{"session":"14"}}
2021-06-16T09:36:37.636Z [concourse-worker] {"timestamp":"2021-06-16T09:36:37.636548015Z","level":"info","source":"worker","message":"worker.baggageclaim-runner.logging-runner-exited","data":{"session":"9"}}
2021-06-16T09:36:37.825Z [pebble] POST /v1/files 6.612782ms 200
2021-06-16T09:36:37.887Z [pebble] GET /v1/plan?format=yaml 327.034µs 200
2021-06-16T09:36:41.703Z [concourse-worker] {"timestamp":"2021-06-16T09:36:41.703348868Z","level":"info","source":"worker","message":"worker.beacon-runner.restarting","data":{"session":"4"}}
2021-06-16T09:36:41.703Z [concourse-worker] {"timestamp":"2021-06-16T09:36:41.703670068Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-connect-to-tsa","data":{"error":"dial tcp 10.1.234.43:2222: connect: connection refused","session":"4.1"}}
2021-06-16T09:36:41.703Z [concourse-worker] {"timestamp":"2021-06-16T09:36:41.703751573Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.dial.failed-to-connect-to-any-tsa","data":{"error":"all worker SSH gateways unreachable","session":"4.1.2"}}
2021-06-16T09:36:41.703Z [concourse-worker] {"timestamp":"2021-06-16T09:36:41.703765585Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-dial","data":{"error":"all worker SSH gateways unreachable","session":"4.1"}}
2021-06-16T09:36:41.703Z [concourse-worker] {"timestamp":"2021-06-16T09:36:41.703797415Z","level":"info","source":"worker","message":"worker.beacon-runner.beacon.signal.signalled","data":{"session":"4.1.3"}}
2021-06-16T09:36:41.703Z [concourse-worker] {"timestamp":"2021-06-16T09:36:41.703825752Z","level":"info","source":"worker","message":"worker.beacon-runner.logging-runner-exited","data":{"session":"12"}}
2021-06-16T09:36:41.706Z [concourse-worker] error: Exit trace for group:
2021-06-16T09:36:41.706Z [concourse-worker] garden exited with error: Exit trace for group:
2021-06-16T09:36:41.706Z [concourse-worker] gdn exited with error: exit status 1
2021-06-16T09:36:41.706Z [concourse-worker] dns-proxy exited with nil
2021-06-16T09:36:41.706Z [concourse-worker] 
2021-06-16T09:36:41.706Z [concourse-worker] debug exited with nil
2021-06-16T09:36:41.706Z [concourse-worker] container-sweeper exited with nil
2021-06-16T09:36:41.706Z [concourse-worker] healthcheck exited with nil
2021-06-16T09:36:41.706Z [concourse-worker] volume-sweeper exited with nil
2021-06-16T09:36:41.706Z [concourse-worker] baggageclaim exited with nil
2021-06-16T09:36:41.706Z [concourse-worker] beacon exited with nil
2021-06-16T09:36:41.706Z [concourse-worker] 

I believe this is simply a case of figuring out how to expose port 2222 on the concourse-web instance so the worker can reach it, but I haven't yet had a chance to look into this further.
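
A quick way to confirm nothing is answering on that port yet (a sketch; the namespace here is an assumption, and 10.1.234.43 is the address the worker dials in the log above):

# List services/endpoints in the model's namespace and probe the TSA port directly.
microk8s kubectl get svc,endpoints -n concourse-test
nc -vz 10.1.234.43 2222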

@ycheng

ycheng commented Jul 8, 2021

I tried running it manually: ssh into the container with "juju ssh --container concourse-worker concourse-worker/0", then run the following:

set -x

export CONCOURSE_BAGGAGECLAIM_DRIVER=overlay
export CONCOURSE_TSA_HOST=TSA_HOST_IP:2222
export CONCOURSE_TSA_PUBLIC_KEY=/concourse-keys/tsa_host_key.pub
export CONCOURSE_TSA_WORKER_PRIVATE_KEY=/concourse-keys/worker_key
export CONCOURSE_WORK_DIR=/opt/concourse/worker

/usr/local/concourse/bin/concourse worker

I didn't see any errors connecting to the TSA; however, the local gdn, which should be listening on port 7777, isn't running. Checking the log, I think it's because of:
bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': permission denied
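
A quick way to double-check which ports are actually listening inside the container (a sketch; it assumes nc is available in the worker image):

# Inside the container ("juju ssh --container concourse-worker concourse-worker/0"):
# baggageclaim should be on 7788, gdn (garden) on 7777.
nc -z 127.0.0.1 7788 && echo "baggageclaim listening" || echo "baggageclaim NOT listening"
nc -z 127.0.0.1 7777 && echo "gdn listening" || echo "gdn NOT listening"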

@ycheng

ycheng commented Jul 8, 2021

I did more testing; here is the theory and the verification:

  1. I think the web won't re-read the worker public key after it has started.
  2. The worker won't keep retrying to connect to web via port 2222.

Given that, here is what I did:

  1. Start the worker first, and modify /usr/local/bin/entrypoint.sh so that it keeps re-running "/usr/local/concourse/bin/concourse $@" every 10 seconds (see the sketch after the log excerpt below).
  2. Then start web, and relate web and worker so that the worker public key is copied to web.
  3. Start Postgres and relate it to web; web then starts with the worker public key in place.
  4. Check the worker log: the worker gives us endless logs like:

2021-07-08T09:47:21.189Z [concourse-worker] {"timestamp":"2021-07-08T09:47:21.189242695Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.failed-to-dial","data":{"addr":"127.0.0.1:7788","error":"dial tcp 127.0.0.1:7788: connect: connection refused","network":"tcp","session":"4.1.5"}}
2021-07-08T09:47:21.421Z [concourse-worker] {"timestamp":"2021-07-08T09:47:21.421690479Z","level":"info","source":"worker","message":"worker.beacon-runner.beacon.forward-conn.retrying","data":{"addr":"127.0.0.1:7777","network":"tcp","session":"4.1.4"}}

but it does not exit like it did previously, when it couldn't connect to web/TSA via port 2222.
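
For reference, the retry wrapper in entrypoint.sh looks roughly like this (a sketch only; the real entrypoint.sh in the image does more setup):

#!/bin/bash
# Sketch of the modified /usr/local/bin/entrypoint.sh: instead of exiting when
# the worker process dies, keep re-running it every 10 seconds.
while true; do
    /usr/local/concourse/bin/concourse "$@"
    echo "concourse exited with status $?; retrying in 10 seconds" >&2
    sleep 10
done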

I think the direction is:

  1. Restart web when the worker public key changes.
  2. Retry the worker.
  3. Fix the cgroup problem. (I think this is because the worker pod is not privileged, agreed? What's our policy on privileged pods?)

@mthaddon
Owner Author

mthaddon commented Aug 13, 2021

I've made a few small updates to the charm to create a k8s service for the concourse-web application on port 2222. There's still a race condition that needs fixing: as noted above, the worker won't keep retrying to connect to web via port 2222, so if the relation is established too early it will still fail to connect to that port.
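
For reference, confirming the new service from the host looks something like this (a sketch; the namespace is from my test deployment and the service name is assumed to match the application name):

# Check the k8s service created for concourse-web and probe port 2222.
microk8s kubectl get svc -n concourse-test concourse-web
nc -vz <concourse-web service IP> 2222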

However, if, after everything is up and I've confirmed the concourse-web unit is responding on port 2222, I then add a new unit to concourse-worker (or likely if I wait to do this before relating concourse-worker to concourse-web), I get the following errors:

$ microk8s kubectl logs -n concourse-test concourse-worker-1 -c concourse-worker
2021-08-13T10:26:57.683Z [pebble] Started daemon.
2021-08-13T10:27:04.238Z [pebble] GET /v1/files?action=read&path=%2Fusr%2Flocal%2Fconcourse%2Fbin%2Fconcourse 1.690774341s 200
2021-08-13T10:27:09.407Z [pebble] POST /v1/files 5.300573ms 200
2021-08-13T10:27:09.433Z [pebble] GET /v1/plan?format=yaml 114.617µs 200
2021-08-13T10:27:09.434Z [pebble] POST /v1/layers 223.203µs 200
2021-08-13T10:27:09.440Z [pebble] GET /v1/services?names=concourse-worker 72.854µs 200
2021-08-13T10:27:09.448Z [pebble] POST /v1/services 6.797955ms 202
2021-08-13T10:27:09.502Z [concourse-worker] {"timestamp":"2021-08-13T10:27:09.502051404Z","level":"info","source":"baggageclaim","message":"baggageclaim.using-driver","data":{"driver":"overlay"}}
2021-08-13T10:27:09.502Z [concourse-worker] {"timestamp":"2021-08-13T10:27:09.502691208Z","level":"info","source":"worker","message":"worker.garden.dns-proxy.started","data":{"session":"1.2"}}
2021-08-13T10:27:09.502Z [concourse-worker] {"timestamp":"2021-08-13T10:27:09.502835042Z","level":"info","source":"baggageclaim","message":"baggageclaim.listening","data":{"addr":"127.0.0.1:7788"}}
2021-08-13T10:27:09.538Z [concourse-worker] {"timestamp":"2021-08-13T10:27:09.538531821Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-dial","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to
 authenticate, attempted methods [none publickey], no supported methods remain","session":"4.1"}}
2021-08-13T10:27:09.538Z [concourse-worker] {"timestamp":"2021-08-13T10:27:09.538590178Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.exited-with-error","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable
 to authenticate, attempted methods [none publickey], no supported methods remain","session":"4.1"}}
2021-08-13T10:27:09.538Z [concourse-worker] {"timestamp":"2021-08-13T10:27:09.538611817Z","level":"error","source":"worker","message":"worker.beacon-runner.failed","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to authenticate, 
attempted methods [none publickey], no supported methods remain","session":"4"}}
2021-08-13T10:27:10.839Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.839827686Z","level":"info","source":"guardian","message":"guardian.no-port-pool-state-to-recover-starting-clean","data":{}}
2021-08-13T10:27:10.840Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.840518375Z","level":"info","source":"guardian","message":"guardian.metrics-notifier.starting","data":{"interval":"1m0s","session":"5"}}
2021-08-13T10:27:10.840Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.840544561Z","level":"info","source":"guardian","message":"guardian.start.starting","data":{"session":"6"}}
2021-08-13T10:27:10.841Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.840740787Z","level":"info","source":"guardian","message":"guardian.metrics-notifier.started","data":{"interval":"1m0s","session":"5","time":"2021-08-13T10:27:10.840738723Z"}}
2021-08-13T10:27:10.842Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.842309381Z","level":"info","source":"guardian","message":"guardian.cgroups-tmpfs-already-mounted","data":{"path":"/sys/fs/cgroup"}}
2021-08-13T10:27:10.842Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.842429978Z","level":"info","source":"guardian","message":"guardian.mount-cgroup.started","data":{"path":"/sys/fs/cgroup/cpuset","session":"7","subsystem":"cpuset"}}
2021-08-13T10:27:10.842Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.842504846Z","level":"info","source":"guardian","message":"guardian.start.completed","data":{"session":"6"}}
2021-08-13T10:27:10.842Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.842519511Z","level":"error","source":"guardian","message":"guardian.starting-guardian-backend","data":{"error":"bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': permission denied"}}
2021-08-13T10:27:10.842Z [concourse-worker] bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': permission denied
2021-08-13T10:27:10.842Z [concourse-worker] bulk starter: mounting subsystem 'cpuset' in '/sys/fs/cgroup/cpuset': permission denied
2021-08-13T10:27:10.850Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850597736Z","level":"error","source":"worker","message":"worker.garden.gdn-runner.logging-runner-exited","data":{"error":"exit status 1","session":"1.4"}}
2021-08-13T10:27:10.850Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850707053Z","level":"info","source":"worker","message":"worker.garden.dns-proxy-runner.logging-runner-exited","data":{"session":"1.3"}}
2021-08-13T10:27:10.850Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850736697Z","level":"error","source":"worker","message":"worker.garden-runner.logging-runner-exited","data":{"error":"Exit trace for group:\ngdn exited with error: exit status 1\ndns-proxy exited with nil
\n","session":"8"}}
2021-08-13T10:27:10.850Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850861618Z","level":"info","source":"worker","message":"worker.container-sweeper.sweep-cancelled-by-signal","data":{"session":"6","signal":2}}
2021-08-13T10:27:10.850Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850911539Z","level":"info","source":"worker","message":"worker.container-sweeper.logging-runner-exited","data":{"session":"13"}}
2021-08-13T10:27:10.850Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850910622Z","level":"info","source":"worker","message":"worker.volume-sweeper.sweep-cancelled-by-signal","data":{"session":"7","signal":2}}
2021-08-13T10:27:10.851Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850988472Z","level":"info","source":"worker","message":"worker.volume-sweeper.logging-runner-exited","data":{"session":"14"}}
2021-08-13T10:27:10.851Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.851000149Z","level":"info","source":"worker","message":"worker.debug-runner.logging-runner-exited","data":{"session":"10"}}
2021-08-13T10:27:10.851Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850974540Z","level":"info","source":"worker","message":"worker.baggageclaim-runner.logging-runner-exited","data":{"session":"9"}}
2021-08-13T10:27:10.851Z [concourse-worker] {"timestamp":"2021-08-13T10:27:10.850944525Z","level":"info","source":"worker","message":"worker.healthcheck-runner.logging-runner-exited","data":{"session":"11"}}
2021-08-13T10:27:12.040Z [pebble] POST /v1/files 11.934703ms 200
2021-08-13T10:27:12.097Z [pebble] GET /v1/plan?format=yaml 289.48µs 200
2021-08-13T10:27:14.539Z [concourse-worker] {"timestamp":"2021-08-13T10:27:14.538997203Z","level":"info","source":"worker","message":"worker.beacon-runner.restarting","data":{"session":"4"}}
2021-08-13T10:27:14.576Z [concourse-worker] {"timestamp":"2021-08-13T10:27:14.576459867Z","level":"error","source":"worker","message":"worker.beacon-runner.beacon.failed-to-dial","data":{"error":"failed to establish SSH connection with gateway: ssh: handshake failed: ssh: unable to
 authenticate, attempted methods [none publickey], no supported methods remain","session":"4.1"}}
2021-08-13T10:27:14.576Z [concourse-worker] {"timestamp":"2021-08-13T10:27:14.576528550Z","level":"info","source":"worker","message":"worker.beacon-runner.beacon.signal.signalled","data":{"session":"4.1.3"}}
2021-08-13T10:27:14.576Z [concourse-worker] {"timestamp":"2021-08-13T10:27:14.576562299Z","level":"info","source":"worker","message":"worker.beacon-runner.logging-runner-exited","data":{"session":"12"}}
2021-08-13T10:27:14.577Z [concourse-worker] error: Exit trace for group:
2021-08-13T10:27:14.577Z [concourse-worker] garden exited with error: Exit trace for group:
2021-08-13T10:27:14.577Z [concourse-worker] gdn exited with error: exit status 1
2021-08-13T10:27:14.577Z [concourse-worker] dns-proxy exited with nil
2021-08-13T10:27:14.577Z [concourse-worker] 
2021-08-13T10:27:14.577Z [concourse-worker] container-sweeper exited with nil
2021-08-13T10:27:14.577Z [concourse-worker] volume-sweeper exited with nil
2021-08-13T10:27:14.577Z [concourse-worker] debug exited with nil
2021-08-13T10:27:14.577Z [concourse-worker] healthcheck exited with nil
2021-08-13T10:27:14.577Z [concourse-worker] baggageclaim exited with nil
2021-08-13T10:27:14.577Z [concourse-worker] beacon exited with nil
2021-08-13T10:27:14.577Z [concourse-worker] 

So, it's failing to authenticate, but is at least able to connect. This needs further investigation.

The cgroup problem is still there, and you're correct that it's because the worker pod isn't privileged. I'm not actually sure how to do that in k8s charms (yet), and will need to look into it. Having said that, it seems like something that would be good to avoid if possible. Per https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privileged: "By default a container is not allowed to access any devices on the host, but a "privileged" container is given access to all devices on the host. This allows the container nearly all the same access as processes running on the host. This is useful for containers that want to use linux capabilities like manipulating the network stack and accessing devices." It would be good to understand why Concourse needs this, as it might restrict the places we're prepared to run it.

@mthaddon
Owner Author

mthaddon commented Aug 13, 2021

I think the direction is:

  1. Restart web when the worker public key changes.

This can be done via pebble. I've put together a PR; can you take a look? mthaddon/concourse-web-operator#1

  2. Retry the worker.
  3. Fix the cgroup problem. (I think this is because the worker pod is not privileged, agreed? What's our policy on privileged pods?)

@ycheng

ycheng commented Aug 13, 2021

From my understanding, running a Concourse CI worker inside k8s or Docker means running a container in a container, and that needs privileged mode.

Per the Workers section of the architecture docs [1], "Workers are machines running Garden and Baggageclaim servers and registering themselves via the TSA." And Garden is "a platform-agnostic Go API for container creation and management" [2].

Ref: [1] https://concourse-ci.org/internals.html
Ref: [2] https://github.com/cloudfoundry/garden

In my test using plain YAML to run Concourse CI on k8s, the worker does indeed need privileged mode.
Checking https://github.com/concourse/concourse-chart, they do the same thing for their workers:

$ grep -r privileged .
./concourse-chart/values.yaml: ## Disable remapping of user/group IDs in unprivileged volumes.
./concourse-chart/templates/worker-deployment.yaml: privileged: true
./concourse-chart/templates/worker-podsecuritypolicy.yaml: privileged: true
./concourse-chart/templates/web-podsecuritypolicy.yaml: privileged: false
./concourse-chart/templates/worker-statefulset.yaml: privileged: true
./concourse-chart/templates/worker-statefulset.yaml: privileged: true
$ grep -r allowPrivilegeEscalation .
./concourse-chart/templates/worker-podsecuritypolicy.yaml: allowPrivilegeEscalation: true
./concourse-chart/templates/web-podsecuritypolicy.yaml: allowPrivilegeEscalation: false

@ycheng

ycheng commented Aug 24, 2021

I tested the PR, and the result is positive: the web restarts when the worker public key changes.

@jason-lo

Not sure if concourse-remote-worker would help [1].

I think we might need to define a remote Concourse worker as a K8s cluster. The reason behind it, in my naive opinion, is that a Concourse worker is a collection of Docker containers when we run it locally [2], while a K8s Pod basically maps to a single Docker container (or at most a couple of containers in that one Pod).

When we deploy concourse-worker we actually deploy it into one Pod, which means we end up trying to create lots of containers in that Pod. I also agree with @ycheng that we are creating containers inside that container (the K8s Pod).

[1] https://tanzu.vmware.com/developer/guides/concourse-remote-workers/
[2] https://concourse-ci.org/concourse-worker.html

@jason-lo

jason-lo commented Jan 18, 2022

To investigate the issue a bit further, I have tried deploying Concourse CI on microk8s following this guide [1].

When running a pipeline I do hit a cgroup issue:
run check: find or create container on worker concourse-worker-1: runc run: exit status 1: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: rootfs_linux.go:76: mounting "cgroup" to rootfs at "/sys/fs/cgroup" caused: invalid argument

On the other hand, I have also tried the deployment on GKE [2], which works well.

My best guess is that microk8s is using a K8s node based on an LXD container, whereas GKE creates nodes based on VMs. The difference then is container-level vs. VM-level virtualization. My guess is that the charms should work when deployed to a K8s cluster with VM-level virtualization.

To sum up, maybe we can try deploying the charms (concourse-web and concourse-worker) on a Kubernetes cluster with VM-level virtualization.

Do we have any such environment handy for testing? Or could we configure microk8s to run on an LXD VM instead? (A sketch of that follows the references below.)

[1] https://tanzu.vmware.com/developer/guides/concourse-gs/
[2] https://cloud.google.com/kubernetes-engine
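
If we want to try the LXD VM route, something like this should work (a sketch; the image alias and resource limits are assumptions):

# Launch an LXD virtual machine (not a container) and install microk8s inside it.
lxc launch ubuntu:20.04 microk8s-vm --vm -c limits.cpu=4 -c limits.memory=8GiB
lxc exec microk8s-vm -- snap install microk8s --classic
lxc exec microk8s-vm -- microk8s status --wait-ready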

@jason-lo

jason-lo commented Jan 20, 2022

Juju deployed concourse-web/worker [1][2] on GKE, configured using this doc [3].

$ juju status

concourse-web active 1 concourse-web charmhub stable 2 kubernetes 10.48.10.158
concourse-worker active 1 concourse-worker charmhub stable 2 kubernetes 10.48.3.70
nginx-ingress-integrator active 1 nginx-ingress-integrator charmhub stable 24 kubernetes 10.48.3.197
postgresql-k8s .../postgresql@ed0e37f active 1 postgresql-k8s charmhub stable 3 kubernetes 10.48.5.149

concourse-web/0* active idle 10.44.2.6
concourse-worker/0* active idle 10.44.2.7
nginx-ingress-integrator/0* active idle 10.44.2.8 Ingress with service IP(s): 10.48.9.162
postgresql-k8s/0* active idle 10.44.1.10 5432/TCP Pod configured

$ juju debug-log

unit-nginx-ingress-integrator-0: 15:50:13 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-concourse-web-0: 15:51:27 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-postgresql-k8s: 15:52:56 INFO juju.worker.caasoperator.uniter.postgresql-k8s/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-nginx-ingress-integrator-0: 15:54:34 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-concourse-worker-0: 15:54:52 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-concourse-web-0: 15:55:49 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
application-postgresql-k8s: 15:58:18 INFO juju.worker.caasoperator.uniter.postgresql-k8s/0.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-nginx-ingress-integrator-0: 15:59:15 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-concourse-worker-0: 16:00:38 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)
unit-concourse-web-0: 16:01:25 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)

http://concourse-web is not working.
The ingress IP 10.48.9.162 is not working either.
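
One thing worth checking (a sketch; it assumes the ingress is routing on the hostname "concourse-web"):

# nginx ingress routes by Host header, so hit the ingress IP with the expected hostname.
curl -v -H "Host: concourse-web" http://10.48.9.162/
# Or map the hostname locally so http://concourse-web resolves to the ingress IP.
echo "10.48.9.162 concourse-web" | sudo tee -a /etc/hosts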

[1] https://charmhub.io/concourse-web
[2] https://charmhub.io/concourse-worker
[3] https://juju.is/docs/olm/google-kubernetes-engine-(gke)
