
Clean up allocs that previously failed cleanup #1497

Closed
jshaw86 opened this issue Aug 2, 2016 · 11 comments

jshaw86 commented Aug 2, 2016

Sometimes allocs fail to clean up due to "too many open files" errors. These allocs then get orphaned by Nomad and have to be cleaned up by hand. See the logs below:

Jul 30 01:31:02 nomad-agent1 2016/07/30 01:31:02 http: Accept error: accept tcp [::]:4646: accept4: too many open files; retrying in 5ms
Jul 30 01:31:02 nomad-agent1 nomad[1149]: 4646: accept4: too many open files; retrying in 10ms
Jul 30 01:31:02 nomad-agent1 2016/07/30 01:31:02 http: Accept error: accept tcp [::]:4646: accept4: too many open files; retrying in 10ms
Jul 30 01:31:02 nomad-agent1 nomad[1149]: 4646: accept4: too many open files; retrying in 20ms
Jul 30 01:31:02 nomad-agent1 2016/07/30 01:31:02 http: Accept error: accept tcp [::]:4646: accept4: too many open files; retrying in 20ms
Jul 30 01:31:02 nomad-agent1  2016/07/30 01:31:02 [DEBUG] http: Request /v1/agent/servers (263.892µs)
Jul 30 01:31:02 nomad-agent1 nomad[1149]: http: Request /v1/agent/servers (263.892µs)
Jul 30 01:31:04 nomad-agent1 nomad[1149]: driver.exec: cgroups unavailable, disabling
Jul 30 01:31:04 nomad-agent1 2016/07/30 01:31:04.373534 [DEBUG] driver.exec: cgroups unavailable, disabling
Jul 30 01:31:05 nomad-agent1 2016/07/30 01:31:05.654566 [ERR] client: failed to destroy context for alloc 'c7a4f543-2148-98e6-dc91-9ad045c1bf88': 1 error(s) occurred:
Jul 30 01:31:05 nomad-agent1 nomad[1149]: client: failed to destroy context for alloc 'c7a4f543-2148-98e6-dc91-9ad045c1bf88': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/c7a4f543-2148-98e6-dc91-9ad045c1bf88/event_handler_243_0/usr/sbin: too many open files

It would be nice if these allocs were not orphaned, and Nomad still cleaned them up.
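The manual cleanup described above can be sketched roughly as follows. This is a hypothetical illustration, not a procedure from the thread: the sandbox directory stands in for the client's alloc mount (e.g. `/opt/nomad/mnt`), and the `KEEP` list stands in for the set of alloc IDs the client still tracks.

```shell
# Sandbox standing in for the client's alloc mount (e.g. /opt/nomad/mnt).
MNT=$(mktemp -d)
mkdir -p "$MNT/alloc-live" "$MNT/alloc-orphaned"

# Alloc IDs the client still tracks; in practice you would build this list
# from the client's view of running allocs, not hard-code it.
KEEP="alloc-live"

for dir in "$MNT"/*/; do
  name=$(basename "$dir")
  case " $KEEP " in
    *" $name "*) ;;        # still tracked: keep
    *) rm -rf "$dir" ;;    # orphaned on disk: remove
  esac
done

ls "$MNT"    # only alloc-live remains
```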

@justinwalz (Contributor)

I've also seen "too many open files" errors a few separate times on different hosts. All occur close to a large run of repeated client: dropping update to alloc '<alloc-id>' messages.

Nomad version

0.4.1

Operating system and Environment details

Both Ubuntu 14.04 and 16.04

Nomad Client logs (if appropriate)

    2016/11/09 00:43:19.802156 [ERR] client: failed to save state of Task Runner for task "<job-name>": failed to rename tmp to path:rename /var/lib/ssi/nomad/client/alloc/29da229f-d967-b89b-5c6e-02ef4dd53645/task-ad3697de13202d8e459e620ed7f1e43c/state.json.tmp /var/lib/ssi/nomad/client/alloc/29da229f-d967-b89b-5c6e-02ef4dd53645/task-ad3697de13202d8e459e620ed7f1e43c/state.json: no such file or directory
    2016/11/09 00:44:22.445119 [WARN] client: error fetching stats of task <job-name>: stats collection hasn't started yet
    2016/11/09 00:45:45.519462 [WARN] client: error fetching stats of task <job-name>: stats collection hasn't started yet
    2016/11/09 00:46:01.040493 [ERR] http: Request /v1/client/fs/logs/01e35657-dfd7-7e06-c5c2-5d3f2f517724?origin=start&offset=0&task=<job-name>&type=stderr&follow=true, error: alloc not found
    2016/11/09 00:46:01.041102 [ERR] http: Request /v1/client/fs/logs/01e35657-dfd7-7e06-c5c2-5d3f2f517724?origin=start&offset=0&task=<job-name>&type=stdout&follow=true, error: alloc not found
    2016/11/09 00:46:29.515124 [WARN] client: error fetching stats of task <job-name>: stats collection hasn't started yet
    2016/11/09 00:47:43.901390 [WARN] client: error fetching stats of task <job-name>: stats collection hasn't started yet
    2016/11/09 00:50:11.955573 [ERR] driver.docker: error removing container: API error (500): {"message":"Driver devicemapper failed to remove root filesystem e0507542b2c9fd4799538076ca9b151813920385d31fde217b983bf49f06e901: Device is Busy"}
    2016/11/09 00:50:29.016048 [ERR] http: Request /v1/client/fs/logs/6ce53a9f-6d0a-1fad-3a95-a6b13d79872a?origin=start&offset=0&task=<job-name>&type=stderr&follow=true, error: failed to stream "alloc/logs/<job-name>.stderr.0": failed to list entries: open /var/lib/ssi/nomad/alloc/6ce53a9f-6d0a-1fad-3a95-a6b13d79872a/alloc/logs: too many open files
    2016/11/09 00:50:29 http: multiple response.WriteHeader calls
    2016/11/09 00:50:36.664419 [ERR] driver.docker: error removing container: API error (500): {"message":"Driver devicemapper failed to remove root filesystem d463fe61c12fee1628cfabe5cc835333512638045eb8674566e54c74178f5378: Device is Busy"}
    2016/11/09 00:55:48.421233 [ERR] http: Request /v1/client/fs/logs/01e35657-dfd7-7e06-c5c2-5d3f2f517724?origin=start&offset=0&task=<job-name>&type=stdout&follow=true, error: failed to stream "alloc/logs/<job-name>.stdout.0": failed to list entries: open /var/lib/ssi/nomad/alloc/01e35657-dfd7-7e06-c5c2-5d3f2f517724/alloc/logs: too many open files
    2016/11/09 00:55:48 http: multiple response.WriteHeader calls
2016/11/09 01:10:01 [ERR] yamux: Failed to write header: write unix @->/tmp/plugin943940569: write: broken pipe
2016/11/09 01:21:30 [ERR] yamux: Failed to write header: write unix @->/tmp/plugin109911821: write: broken pipe
    2016/11/09 01:28:44 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:30:15 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:31:00.234340 [ERR] http: Request /v1/client/fs/logs/01e35657-dfd7-7e06-c5c2-5d3f2f517724?origin=start&offset=0&task=<job-name>&type=stderr&follow=true, error: failed to stream "alloc/logs/<job-name>.stderr.0": failed to list entries: open /var/lib/ssi/nomad/alloc/01e35657-dfd7-7e06-c5c2-5d3f2f517724/alloc/logs: too many open files
    2016/11/09 01:31:00 http: multiple response.WriteHeader calls
    2016/11/09 01:32:29 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:34:54 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:40:10 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:42:04 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:44:17 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:45:13.742411 [ERR] client: failed to save state for alloc e4678ee0-81f4-d0c6-46db-9ea87a915801: failed to rename tmp to path: rename /var/lib/ssi/nomad/client/alloc/e4678ee0-81f4-d0c6-46db-9ea87a915801/state.json.tmp /var/lib/ssi/nomad/client/alloc/e4678ee0-81f4-d0c6-46db-9ea87a915801/state.json: no such file or directory
    2016/11/09 01:45:13.829017 [ERR] client: failed to save state: 1 error(s) occurred:

* failed to rename tmp to path: rename /var/lib/ssi/nomad/client/alloc/e4678ee0-81f4-d0c6-46db-9ea87a915801/state.json.tmp /var/lib/ssi/nomad/client/alloc/e4678ee0-81f4-d0c6-46db-9ea87a915801/state.json: no such file or directory
    2016/11/09 01:46:56 http: Accept error: accept tcp <ip-address>:4646: accept4: too many open files; retrying in 5ms
    2016/11/09 01:47:30.287513 [ERR] http: Request /v1/client/fs/logs/6ce53a9f-6d0a-1fad-3a95-a6b13d79872a?origin=start&offset=0&task=<job-name>&type=stderr&follow=true, error: failed to list entries: readdirent: no such file or directory
    2016/11/09 01:47:30 http: multiple response.WriteHeader calls
    2016/11/09 01:51:35.094599 [ERR] client: dropping update to alloc '7afaecb0-9c5d-993b-f3e5-37a04f4ad576'
    2016/11/09 01:51:49.786363 [ERR] client: dropping update to alloc '7afaecb0-9c5d-993b-f3e5-37a04f4ad576'


dadgar commented Nov 9, 2016

Have you all upped the file descriptor limit for the Nomad process?
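A minimal sketch of checking the limit dadgar asks about. The agent is unlikely to be running where this executes, so it falls back to inspecting the current shell; the 65536 value is illustrative, not a recommendation from the thread.

```shell
# Inspect the "Max open files" limit of the nomad agent process, falling
# back to this shell's own PID if nomad isn't running.
PID=$(pidof nomad 2>/dev/null || echo $$)
grep "Max open files" "/proc/$PID/limits"

# Current soft limit for this session. To raise it persistently for the
# agent, set LimitNOFILE=65536 in a systemd drop-in for the nomad unit
# (or in /etc/security/limits.conf for non-systemd setups).
ulimit -n
```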


jshaw86 commented Jan 26, 2017

@dadgar we have raised the max file descriptors and are still seeing this on 0.4.2. We are trying to get up to 0.5.2 to check. In your opinion, should this be fixed in 0.5.2?


dadgar commented Jan 28, 2017

@jshaw86 0.5.3 should help, as it does client-side garbage collection.


jshaw86 commented Jan 28, 2017

@dadgar For us this is the #1 priority, as we keep having to manually rotate Nomad clients and clean up disk. So will we see no benefit if we upgrade from 0.4.2 to 0.5.2 right now?

It seems Nomad is exhausting FDs when it tries to GC, or something along those lines; again, we are still on 0.4.2. We are trying to expedite a 0.5.2 upgrade, but due to the breaking job spec change it is pretty difficult.

My current running theory: Nomad tries to clean up too many files at once, the GC process crashes and loses track of the allocs that were marked for cleanup, and those allocs remain on the filesystem forever while Nomad keeps provisioning new chroots, creating more inodes and eventually crashing the system. Our monitoring also shows a correlation between a Nomad client reporting "too many open files" and inode exhaustion.

FD exhaustion logs below.

Jan 28 06:48:06 nomad-agent21 nomad[1281]: client: failed to destroy context for alloc 'da4c8550-4367-919a-a4a7-b7978fe05feb': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/da4c8550-4367-919a-a4a7-b7978fe05feb/event_handler_4857_9_fra/usr/lib/pn-blocks/src/node_modules/cassandra-driver/node_modules: too many open files
Jan 28 06:48:06 nomad-agent21 nomad_agent_startup: * open /opt/nomad/mnt/a9a172d9-62e4-b295-ad3e-b8596b371c8e/event_handler_3308_1_fra/etc: too many open files
Jan 28 06:48:06 nomad-agent21 nomad[1281]: client: failed to destroy context for alloc 'a9a172d9-62e4-b295-ad3e-b8596b371c8e': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/a9a172d9-62e4-b295-ad3e-b8596b371c8e/event_handler_3308_1_fra/etc: too many open files
Jan 28 06:48:06 nomad-agent21 nomad[1281]: client: failed to destroy context for alloc 'd7ff5b3f-7a94-2c30-d953-a07f0cb1fe1a': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/d7ff5b3f-7a94-2c30-d953-a07f0cb1fe1a/event_handler_2556_10_fra/etc/ca-certificates/update.d: too many open files
Jan 28 06:48:06 nomad-agent21 nomad_agent_startup: * open /opt/nomad/mnt/d7ff5b3f-7a94-2c30-d953-a07f0cb1fe1a/event_handler_2556_10_fra/etc/ca-certificates/update.d: too many open files
Jan 28 06:48:06 nomad-agent21 nomad_agent_startup: * open /opt/nomad/mnt/c7de2028-ca57-dd25-c3a6-2759d33b4702/event_handler_4859_1_fra/usr/lib/node_modules/npm/node_modules/read-package-json/node_modules/json-parse-helpfulerror/node_modules/jju/lib: too many open files
Jan 28 06:48:06 nomad-agent21 nomad[1281]: client: failed to destroy context for alloc 'c7de2028-ca57-dd25-c3a6-2759d33b4702': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/c7de2028-ca57-dd25-c3a6-2759d33b4702/event_handler_4859_1_fra/usr/lib/node_modules/npm/node_modules/read-package-json/node_modules/json-parse-helpfulerror/node_modules/jju/lib: too many open files
Jan 28 06:48:06 nomad-agent21 nomad[1281]: client: failed to destroy context for alloc 'ea487d55-58b6-ba6e-0cde-90c96c9101ee': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/ea487d55-58b6-ba6e-0cde-90c96c9101ee/event_handler_4934_1_fra/etc/ld.so.conf.d: too many open files
Jan 28 06:48:06 nomad-agent21 nomad_agent_startup: * open /opt/nomad/mnt/ea487d55-58b6-ba6e-0cde-90c96c9101ee/event_handler_4934_1_fra/etc/ld.so.conf.d: too many open files
Jan 28 06:48:06 nomad-agent21 nomad_agent_startup: * open /opt/nomad/mnt/6e4adef8-5b7c-c32d-e6b9-91654c6f5e15/event_handler_940_39_fra/usr/lib/pn-blocks: too many open files
Jan 28 06:48:06 nomad-agent21 nomad[1281]: client: failed to destroy context for alloc '6e4adef8-5b7c-c32d-e6b9-91654c6f5e15': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/6e4adef8-5b7c-c32d-e6b9-91654c6f5e15/event_handler_940_39_fra/usr/lib/pn-blocks: too many open files
Jan 28 06:48:06 nomad-agent21 nomad[1281]: client: failed to destroy context for alloc '7c1f1634-73c8-d194-94a0-4b3470de1450': 1 error(s) occurred:#012#012* open /opt/nomad/mnt/7c1f1634-73c8-d194-94a0-4b3470de1450/event_handler_4930_40_fra/etc/ssl/certs: too many open files
Jan 28 06:48:06 nomad-agent21 nomad_agent_startup: * open /opt/nomad/mnt/7c1f1634-73c8-d194-94a0-4b3470de1450/event_handler_4930_40_fra/etc/ssl/certs: too many open files
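The FD/inode correlation described above can be sampled with a sketch like this. The `/opt/nomad` mount point is taken from the logs; the PID falls back to the current shell so the sketch runs anywhere, since the agent may not be present.

```shell
# Sample the agent's open-FD count alongside inode usage on the alloc
# filesystem, to correlate "too many open files" with inode exhaustion.
PID=$(pidof nomad 2>/dev/null || echo $$)
FD_COUNT=$(ls "/proc/$PID/fd" | wc -l)
echo "pid=$PID open_fds=$FD_COUNT"

# Inode usage on the mount backing the alloc dirs (/opt/nomad in this
# thread); fall back to / so the command works on any host.
df -i /opt/nomad 2>/dev/null || df -i /
```

Run periodically (e.g. from cron), this gives a time series to line up against the "failed to destroy context" log bursts.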


dadgar commented Jan 31, 2017

@jshaw86 I would upgrade to Nomad 0.5.3. It does garbage collection client-side, which should avoid the thundering herd of GCs that causes so many FDs to be opened.
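For reference, later Nomad releases expose tuning knobs for this client-side GC in the `client` stanza of the agent configuration. The option names below come from later Nomad documentation and may not all exist in 0.5.3 exactly; the values are illustrative only.

```hcl
client {
  enabled = true

  # Client-side garbage collection tuning (values are examples, not
  # recommendations from this thread).
  gc_interval              = "1m"  # how often the client looks for allocs to GC
  gc_disk_usage_threshold  = 80    # % disk usage that triggers GC
  gc_inode_usage_threshold = 70    # % inode usage that triggers GC
  gc_parallel_destroys     = 2     # cap on concurrent alloc destroys
}
```

Capping `gc_parallel_destroys` is what bounds the number of simultaneous chroot teardowns, and hence the FDs held open at once.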

@dadgar dadgar added this to the near-term milestone Jan 31, 2017

jshaw86 commented Jan 31, 2017

@dadgar OK, we will try to get to 0.5.3 ASAP. I was also able to catch this thundering herd on one of the boxes and have a detailed syslog, attached below, if you want to validate that this is the same issue.
error-log-nomad-event.txt


dadgar commented Jan 31, 2017

@jshaw86 Thanks! Let us know how 0.5.3 works out for GCing. By the way, please do not do an in-place update of Nomad on the clients.

@justinwalz (Contributor)

Is that because of #2256?

If so, can you please update https://www.nomadproject.io/docs/upgrade/upgrade-specific.html?


tgross commented Jan 9, 2020

Doing some issue cleanup; this one was closed by the addition of GC.

@tgross tgross closed this as completed Jan 9, 2020
@tgross tgross removed this from the near-term milestone Jan 9, 2020
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 15, 2022
@tgross tgross removed their assignment Jun 4, 2024