Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Netdata orphaned log files #10539

Closed
fayak opened this issue Jan 21, 2021 · 11 comments
Closed

Netdata orphaned log files #10539

fayak opened this issue Jan 21, 2021 · 11 comments
Assignees
Labels
bug needs triage Issues which need to be manually labelled

Comments

@fayak
Copy link
Contributor

fayak commented Jan 21, 2021

Bug report summary

Netdata keeps the log files opened even after log rotation, leading to lost space.

$ lsof -nP +L1
COMMAND     PID    USER   FD   TYPE DEVICE SIZE/OFF NLINK      NODE NAME
netdata   20142 netdata    2w   REG   0,20   431992     0 560040116 /var/log/netdata/error.log.1 (deleted)
go.d.plug 20471 netdata    2w   REG   0,20   431992     0 560040116 /var/log/netdata/error.log.1 (deleted)
python    20475 netdata    2w   REG   0,20   431992     0 560040116 /var/log/netdata/error.log.1 (deleted)
apps.plug 20477 netdata    2w   REG   0,20   431992     0 560040116 /var/log/netdata/error.log.1 (deleted)
OS / Environment
/etc/logrotate.d $ uname -a
Linux node01 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux
/etc/logrotate.d $ cat netdata
/var/log/netdata/*.log {
        daily
        missingok
        rotate 14
        compress
        delaycompress
        notifempty
        sharedscripts
        postrotate
                /bin/kill -HUP `cat /var/run/netdata/netdata.pid 2>/dev/null` 2>/dev/null || true
        endscript
}
Netdata version
$ netdata -V
netdata v1.28.0-68-nightly
Steps To Reproduce

I don't know for sure. Netdata has been installed with apt and has been running for a while. The issue has been reported on multiple servers, but not on all of them.

Expected behavior

Having the log files rotated properly and deleted

Please let me know if I can be more helpful !

@fayak fayak added bug needs triage Issues which need to be manually labelled labels Jan 21, 2021
@stelfrag
Copy link
Collaborator

Hi @fayak

Thank you for the report. While we investigate this could you please check that you
do see a line

MAIN : SIGNAL: Received SIGHUP. Reopening all log files

in the netdata's error.log file.

@fayak
Copy link
Contributor Author

fayak commented Jan 25, 2021

There's one in every error.log file I have:

grep -rni "Received SIGHUP. Reopening all log files"
error.log.9:1369:2021-01-17 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.8:1244:2021-01-18 06:25:04: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.7:4931:2021-01-19 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.6:1015:2021-01-20 06:25:04: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.5:1041:2021-01-21 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.4:756:2021-01-22 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.3:8153:2021-01-23 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.2:785:2021-01-24 06:25:02: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.14:978:2021-01-12 06:25:02: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.13:1334:2021-01-13 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.12:1261:2021-01-14 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.11:1277:2021-01-15 06:25:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.10:1815:2021-01-16 06:25:02: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
error.log.1:1117:2021-01-25 06:25:02: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...

@MrZammler
Copy link
Contributor

Hi! I'm trying to replicate the issue (albeit not through logrotate, but manually). By manually I mean trying to move or delete e.g. the error log, and then send a HUP kill to see if it leaves anything behind. I will try a clean install, on a Debian system with logrotate scripts and see what happens. Will let you know!

@stelfrag
Copy link
Collaborator

stelfrag commented Apr 2, 2021

@MrZammler Can you recheck?

@illes
Copy link
Contributor

illes commented Jun 25, 2021

this is still an issue with netdata 1.31.0 on Debian

# apt-cache policy netdata
netdata:
  Installed: 1.31.0
  Candidate: 1.31.0
  Version table:
 *** 1.31.0 500
        500 https://packagecloud.io/netdata/netdata/debian stretch/main amd64 Packages
        100 /var/lib/dpkg/status

After sending HUP to netdata:

2021-06-25 18:56:28: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
2021-06-25 18:56:28: netdata INFO  : MAIN : COMMAND: Reopening all log files.

signal handled, but error.log is not reopened:

# lsof +L1
COMMAND     PID    USER   FD   TYPE DEVICE    SIZE/OFF NLINK    NODE NAME
netdata   16897 netdata    2w   REG    8,2 18359594002     0 2228692 /var/log/netdata/error.log.1 (deleted)
java      17022 netdata    2w   REG    8,2 18359594002     0 2228692 /var/log/netdata/error.log.1 (deleted)
fping     17027 netdata    2w   REG    8,2 18359594002     0 2228692 /var/log/netdata/error.log.1 (deleted)
apps.plug 17029 netdata    2w   REG    8,2 18359594002     0 2228692 /var/log/netdata/error.log.1 (deleted)
go.d.plug 17031 netdata    2w   REG    8,2 18359594002     0 2228692 /var/log/netdata/error.log.1 (deleted)
python    17032 netdata    2w   REG    8,2 18359594002     0 2228692 /var/log/netdata/error.log.1 (deleted)

@stelfrag
Copy link
Collaborator

@MrZammler ☝️

@MrZammler
Copy link
Contributor

Hi @illes ! Thanks for your report... Last time didn't get far with this...

Can you please share the output of ps aux | grep 16897 ? (Assuming netdata hasn't been restarted). I'd like to see if --special-spawn-server is what holding it....

Thanks!

@illes
Copy link
Contributor

illes commented Jun 28, 2021

It has been restarted there, here's a snapshot from another 1.31.0 instance:

systemctl status netdata 
● netdata.service - Real time performance monitoring
   Loaded: loaded (/etc/systemd/system/netdata.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-06-04 11:06:21 CEST; 3 weeks 3 days ago
  Process: 27971 ExecStartPre=/bin/chown -R netdata:netdata /var/cache/netdata (code=exited, status=0/SUCCESS)
  Process: 27969 ExecStartPre=/bin/mkdir -p /var/cache/netdata (code=exited, status=0/SUCCESS)
 Main PID: 27986 (netdata)
    Tasks: 135 (limit: 6144)
   CGroup: /system.slice/netdata.service
           ├─15402 /usr/libexec/netdata/plugins.d/nfacct.plugin 1
           ├─27986 /usr/sbin/netdata -P /run/netdata/netdata.pid -D
           ├─27990 /usr/sbin/netdata --special-spawn-server
           ├─28117 /usr/bin/python /usr/libexec/netdata/plugins.d/python.d.plugin 1
           ├─28121 /usr/local/bin/fping # [REDACTED]
           ├─28124 /usr/libexec/netdata/plugins.d/apps.plugin 1
           ├─28127 /usr/libexec/netdata/plugins.d/go.d.plugin 1
           ├─28128 java -agentlib:hprof=cpu=samples,depth=100,thread=y,lineno=y,file=/var/log/netdata/java.hprof.txt.1622797582 -Djava.util.logging.SimpleFormatter.format=%1
           └─28989 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
# ps aux | grep ^netdata
netdata  27986  2.7  0.3 575496 232656 ?       Ssl  Jun04 966:57 /usr/sbin/netdata -P /run/netdata/netdata.pid -D
netdata  27990  0.0  0.0  64428  7656 ?        Sl   Jun04   0:00 /usr/sbin/netdata --special-spawn-server
...
# lsof +L1 | grep netdata
netdata   27990 netdata    2w   REG    8,1 13888292710     0    2100621 /var/log/netdata/error.log.1 (deleted)
python    28117 netdata    2w   REG    8,1 13888292710     0    2100621 /var/log/netdata/error.log.1 (deleted)
fping     28121 netdata    2w   REG    8,1 13888292710     0    2100621 /var/log/netdata/error.log.1 (deleted)
apps.plug 28124 netdata    2w   REG    8,1 13888292710     0    2100621 /var/log/netdata/error.log.1 (deleted)
go.d.plug 28127 netdata    2w   REG    8,1 13888292710     0    2100621 /var/log/netdata/error.log.1 (deleted)
java      28128 netdata    2w   REG    8,1 13888292710     0    2100621 /var/log/netdata/error.log.1 (deleted)
# grep HUP /var/log/netdata/error.log
2021-06-28 16:17:01: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...

@illes
Copy link
Contributor

illes commented Jun 28, 2021

I could also reproduce this in vanilla docker as follows:

  1. Start a fresh Debian Buster:

    docker run --rm --name netdata -it --entrypoint /bin/bash debian:buster
  2. Install and start netdata 1.31.0 inside the container:

    apt update
    apt install -y curl ca-certificates lsof procps
    curl -s https://packagecloud.io/install/repositories/netdata/netdata/script.deb.sh | bash
    apt install -y netdata=1.31.0
    /etc/init.d/netdata start
  3. Simulate logrotate:

    mv /var/log/netdata/error.log /var/log/netdata/error.log.old
    /bin/kill -HUP `cat /var/run/netdata/netdata.pid 2>/dev/null`
  4. Verify HUP received:

    # grep -A1 HUP /var/log/netdata/error.log.old
    2021-06-28 16:27:08: netdata INFO  : MAIN : SIGNAL: Received SIGHUP. Reopening all log files...
    2021-06-28 16:27:08: netdata INFO  : MAIN : COMMAND: Reopening all log files.
  5. Check if logs are reopened at /var/log/netdata/error.log:

    # su -s /bin/bash -c 'lsof -n' - netdata | grep error.log
    netdata   4965                netdata    2w      REG               0,49   236968 894874446 /var/log/netdata/error.log.old
    netdata   4965 4981 netdata   netdata    2w      REG               0,49   236968 894874446 /var/log/netdata/error.log.old
    ...
    # ps aux | grep ^netdata
    netdata   4961 14.4  0.0 297304 25788 ?        Sl   16:26   0:40 /usr/sbin/netdata -P /var/run/netdata/netdata.pid
    netdata   4965  0.6  0.0  35744  6368 ?        Sl   16:26   0:01 /usr/sbin/netdata --special-spawn-server
    netdata   5095  0.0  0.0   3868  3036 ?        S    16:26   0:00 bash /usr/libexec/netdata/plugins.d/tc-qos-helper.sh 1
    netdata   5109  0.2  0.0 724940 27388 ?        Sl   16:26   0:00 /usr/libexec/netdata/plugins.d/go.d.plugin 1

This workflow could help someone close in on the revision/commit that is causing the issue.

@MrZammler
Copy link
Contributor

Hi @illes thanks for your detailed report! Ok, so the plugins and the special-spawn-server are those that keep the files... We'll have a look how we can proceed to fix this.

Thanks again!

@cpipilas
Copy link

Fixed in the latest releases

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs triage Issues which need to be manually labelled
Projects
None yet
Development

No branches or pull requests

6 participants