Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks being killed off due to rpc error #9950

Closed
thetooth opened this issue Feb 3, 2021 · 3 comments
Closed

Tasks being killed off due to rpc error #9950

thetooth opened this issue Feb 3, 2021 · 3 comments

Comments

@thetooth
Copy link

thetooth commented Feb 3, 2021

Nomad version

Nomad v1.0.1 (c9c68aa)

Operating system and Environment details

Linux magic 5.5.13-arch2-1 #1 SMP PREEMPT Mon, 30 Mar 2020 20:42:41 +0000 x86_64 GNU/Linux

Issue

All tasks are currently being killed off on a particular node shortly after they start, anything scheduled here that needs to run on this node ends up in a permanent pending state.

Restarting nomad allows the task to be rescheduled after purging it but it will quickly fail again. The logs suggest the garbage collector is destroying everything on the node the moment it's catalogued???

    2021-02-03T09:02:30.304+1030 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stdout.fifo timestamp=2021-02-03T09:02:30.304+1030
 client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stdout.fifo timestamp=2021-02-03T09:02:30.304+1030
    2021-02-03T09:02:30.304+1030 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stderr.fifo timestamp=2021-02-03T09:02:30.304+1030
 client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder @module=logmon path=/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/alloc/logs/.recorder.stderr.fifo timestamp=2021-02-03T09:02:30.304+1030
    2021/02/03 09:02:30.360886 [INFO] (runner) creating new runner (dry: false, once: false)
    2021/02/03 09:02:30.360999 [INFO] (runner) creating watcher
(runner) creating new runner (dry: false, once: false)
(runner) creating watcher
    2021/02/03 09:02:30.361067 [INFO] (runner) starting
(runner) starting
    2021/02/03 09:02:30.478630 [INFO] (runner) rendered "(dynamic)" => "/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/recorder/secrets/file.env"
(runner) rendered "(dynamic)" => "/var/nomad/alloc/03dbeb0b-5931-6f28-bc22-4eec77745f50/recorder/secrets/file.env"
    2021-02-03T09:02:30.483+1030 [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c cd /opt && /opt/recorder -nats nats://192.168.0.244:4222]}"
 client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-c cd /opt && /opt/recorder -nats nats://192.168.0.244:4222]}"
    2021-02-03T09:06:48.469+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=fb42bec2-b846-ee41-2d34-9b6d82593ba7 reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=fb42bec2-b846-ee41-2d34-9b6d82593ba7 reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: marking allocation for GC: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=b9711ce8-80de-879f-b86a-767366574fae reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=abf24944-980e-e088-c558-9f43c02857ab reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=ebf18f69-9f3d-1fba-02c2-0cd5137855e1 reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 reason="forced collection"
    2021-02-03T09:42:42.530+1030 [INFO]  client.gc: garbage collecting allocation: alloc_id=46a1a840-def3-108d-eff8-2f3d41f00a5f reason="forced collection"
 client.gc: marking allocation for GC: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50
 client.gc: garbage collecting allocation: alloc_id=b9711ce8-80de-879f-b86a-767366574fae reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=abf24944-980e-e088-c558-9f43c02857ab reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=ebf18f69-9f3d-1fba-02c2-0cd5137855e1 reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 reason="forced collection"
 client.gc: garbage collecting allocation: alloc_id=46a1a840-def3-108d-eff8-2f3d41f00a5f reason="forced collection"
    2021-02-03T09:42:42.595+1030 [ERROR] client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 driver=raw_exec task_name=recorder error="rpc error: code = Unavailable desc = transport is closing"
    2021-02-03T09:42:42.595+1030 [ERROR] client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder error="rpc error: code = Canceled desc = grpc: the client connection is closing"
client.driver_mgr.raw_exec: error receiving stream from Stats executor RPC, closing stream: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 driver=raw_exec task_name=recorder error="rpc error: code = Unavailable desc = transport is closing"
client.alloc_runner.task_runner.task_hook.stats_hook: failed to start stats collection for task: alloc_id=03dbeb0b-5931-6f28-bc22-4eec77745f50 task=recorder error="rpc error: code = Canceled desc = grpc: the client connection is closing"
    2021/02/03 09:42:42.602868 [INFO] (runner) stopping
    2021/02/03 09:42:42.602919 [INFO] (runner) received finish
(runner) stopping
(runner) received finish
@tgross
Copy link
Member

tgross commented Feb 3, 2021

Hi @thetooth! The "forced collection" message you're seeing is because the client is full enough that it needs to GC allocations that have failed. ~4 minutes passed between start and finish, and the GC can't reap a running task. The RPC error you're seeing is coming from the stats collector not having anything to collect stats from; it shouldn't be throwing an error on a task stop and that's probably a bug but that's unrelated to why the task has stopped in the first place.

A few things we could use here:

  • Take a look at your client GC config so as to delay the GC of allocations: gc_max_allocs, gc_disk_usage_threshold, etc.
  • Once you've done that, you should be able to get the nomad alloc logs :alloc_id for the allocation (don't forget to check with the -stderr flag too)
  • Turn up the client's logs to debug so that we can see if the client is giving us any more clues as to what's going on.

@tgross tgross added this to In Progress in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross self-assigned this Feb 12, 2021
@tgross
Copy link
Member

tgross commented Feb 26, 2021

Without additional information there's not much more we can do with this issue. Going to close this one out. Please feel free to re-open or open a new issue if you have additional information.

@tgross tgross closed this as completed Feb 26, 2021
Nomad - Community Issues Triage automation moved this from In Progress to Done Feb 26, 2021
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

3 participants