[critical] K or kweb or something opens up too many files on fslweb #19

@pdaian

From Joel in IT:

Grigore,

Fslweb is back up and serving the website. We are, unfortunately, not in control of the network update schedule, but we do apologize for the outage. That said, here is some technical information regarding what happened.

The logs were recording many messages of this form:

Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error parsing timestamps file '/var/lib/NetworkManager/timestamps': Too many open files
Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error saving timestamp: Failed to create file '/var/lib/NetworkManager/timestamps.F7ZEOX': Too many open files

These messages appeared days, and perhaps months, prior to the outage. This leads me to believe there was an issue already present that the network

Nov  1 05:40:40 fslweb NetworkManager[1795]: <warn> sysctl: failed to open '/proc/sys/net/ipv6/conf/eth0/accept_ra': (24) Too many open files
Nov  1 05:40:40 fslweb NetworkManager[1795]: <error> [1414838440.4364] [nm-device.c:3486] nm_device_update_ip4_address(): couldn't open control socket.
Nov  1 05:40:40 fslweb NetworkManager[1795]: <error> [1414838440.4476] [nm-system.c:771] nm_system_device_is_up_with_iface(): couldn't open control socket.
Nov  1 05:40:40 fslweb NetworkManager[1795]: <info> (eth0): bringing up device.

The entries above mark the time of the actual network outage. You can see that it fails to come back up due to a lack of available file handles. It then spammed the same line repeatedly until I restarted the machine this morning:

Nov  2 03:45:52 fslweb NetworkManager[1795]: <error> [1414921552.22183] [nm-system.c:771] nm_system_device_is_up_with_iface(): couldn't open control socket.
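
To get a rough timeline of when the descriptor exhaustion started, the log excerpts above could be tallied per day and per process. The following is only a sketch: the regular expression is an assumption based on the syslog prefix visible in these excerpts, and `tally_fd_exhaustion` and `SAMPLE` are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Assumed syslog prefix, modelled on the excerpts in this issue:
#   "Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> ..."
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<proc>[^\[\s:]+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)


def tally_fd_exhaustion(lines):
    """Count 'Too many open files' messages per (month, day, process, pid)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "Too many open files" in m.group("msg"):
            key = (m.group("month"), int(m.group("day")),
                   m.group("proc"), int(m.group("pid")))
            counts[key] += 1
    return counts


# Sample lines copied from the excerpts quoted in this issue.
SAMPLE = [
    "Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error parsing timestamps file '/var/lib/NetworkManager/timestamps': Too many open files",
    "Oct 26 03:30:33 fslweb NetworkManager[1795]: <warn> error saving timestamp: Failed to create file '/var/lib/NetworkManager/timestamps.F7ZEOX': Too many open files",
    "Nov  1 05:40:40 fslweb NetworkManager[1795]: <warn> sysctl: failed to open '/proc/sys/net/ipv6/conf/eth0/accept_ra': (24) Too many open files",
    "Nov  1 05:40:40 fslweb NetworkManager[1795]: <info> (eth0): bringing up device.",
]

if __name__ == "__main__":
    for key, n in sorted(tally_fd_exhaustion(SAMPLE).items()):
        print(key, n)
```

Keying on the PID as well as the process name makes it easy to see whether the same long-running process (PID 1795 in every excerpt here) is the one reporting the error throughout.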

Checking the maximum number of open file handles: the system will open 1,620,366 files concurrently. That's more than a million and a half open files. From checking the backup stats, it looks like the machine itself has almost 7 million files in just 115 GB of space. This leads me to believe that what kept the machine from coming back up after the network outage was the open files, not something directly related to the network. I need to run to a meeting, but I'll provide additional information this afternoon.

Joel
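
One way to narrow down which process is holding the files is to compare the system-wide handle usage with per-process descriptor counts. Note that the `(24)` in the Nov 1 log line is, on Linux, `EMFILE`, the per-process limit, whereas the system-wide limit is reported as `ENFILE` (23), so checking individual processes may be worthwhile. The sketch below assumes a Linux `/proc` filesystem; `parse_file_nr`, `fd_counts`, and `top_fd_holders` are hypothetical helper names, not part of any existing tool on fslweb.

```python
import os


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three integers: allocated file handles,
    allocated-but-unused handles, and the system-wide maximum.
    """
    allocated, unused, maximum = (int(x) for x in text.split())
    return allocated, unused, maximum


def fd_counts(proc_root="/proc"):
    """Return {pid: number of open descriptors} for inspectable processes."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            counts[int(entry)] = len(os.listdir(os.path.join(proc_root, entry, "fd")))
        except OSError:
            continue  # process exited, or we lack permission to look
    return counts


def top_fd_holders(counts, n=5):
    """Return the n (pid, count) pairs with the most open descriptors."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    file_nr = "/proc/sys/fs/file-nr"
    if os.path.exists(file_nr):  # only meaningful on Linux
        with open(file_nr) as fh:
            allocated, unused, maximum = parse_file_nr(fh.read())
        print(f"system-wide: {allocated - unused} in use of {maximum}")
        for pid, count in top_fd_holders(fd_counts()):
            print(f"pid {pid}: {count} open descriptors")
```

Run as root on the affected machine, this would list the processes with the most open descriptors; without root, processes owned by other users are silently skipped.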
