[Bug]: netdata stops working when disk is full #12324

Open
ktsaou opened this issue Mar 4, 2022 · 2 comments

Comments

@ktsaou
Member

ktsaou commented Mar 4, 2022

Bug description

On a test Raspberry Pi, I experienced this issue:

  1. netdata was running
  2. disk got almost full
  3. netdata agent properly raised disk full alarm to WARNING, which was sent to the cloud
  4. disk got almost 100% full
  5. netdata agent tried to commit dbengine data to disk, which failed
  6. 10min_dbengine_global_io_errors was raised and the alarm was sent to the cloud
  7. disk got 100% full
  8. netdata agent properly raised disk full alarm to CRITICAL, which was NOT sent to the cloud

At this point, the netdata agent was still running. I didn't check if it was responsive or not.
I removed a file to make disk space and tried to see if netdata could recover from the situation.
I realized that netdata could not respond to dashboard queries. All queries were just hanging.
I tried to restart netdata. systemd killed it after some time (90 secs I think).

After the netdata restart, the cloud didn't sync alarms properly. Certain alarms are still raised in the cloud while none are raised at the agent, so the cloud failed to detect the discrepancy in the alarm log.

Expected behavior

  1. netdata should survive a disk full situation. Even if data cannot be saved to disk, netdata should continue to function properly, given that old data may have to be discarded.
  2. netdata should always send alerts to the cloud, even if it cannot commit the alarm log to disk. The cloud should be aware of all alerts, even if disk is not usable.
  3. netdata should properly re-sync with the cloud after a crash. A crash may mean an alarm snapshot re-sync is needed. At least, we should be able to trigger this somehow.
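
To illustrate point 2, here is a minimal Python sketch (not netdata code) in which persisting an alert and dispatching it are decoupled, so a failed write never blocks the send. The names `raise_alert`, `log_path` and `send_to_cloud` are hypothetical and stand in for whatever the agent actually uses.

```python
import errno
import json


def raise_alert(alert, log_path, send_to_cloud):
    """Send `alert` to the cloud whether or not it can be persisted locally.

    `send_to_cloud` is a hypothetical callable standing in for the real transport.
    Returns True if the alert was also written to the local log.
    """
    persisted = True
    try:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(alert) + "\n")
    except OSError as exc:
        # ENOSPC (disk full), EIO, a missing mount, ...: record it and carry on.
        persisted = False
        alert = dict(alert, persist_error=errno.errorcode.get(exc.errno, "UNKNOWN"))
    send_to_cloud(alert)  # always reached, even when the disk is unusable
    return persisted
```

Tagging the outgoing alert with the persistence error is an assumption of this sketch; it would give the cloud a hint that the local alarm log may be incomplete and that a snapshot re-sync could be needed later.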

Disk full situation

Netdata may have to change the dbengine rotation policy to adapt to disk full situations.

So, once it cannot append metrics to a disk file, it could trigger a dbengine file rotation. dbengine file rotation could be done by moving the oldest file to the newest position, zeroing its headers and using the existing file as a preallocated buffer to commit data to disk.
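
A rough Python sketch of that rotation idea, assuming fixed-size data files kept in a ring. The file naming, `HEADER_SIZE` and `FILE_SIZE` are made up for the example and do not reflect the real dbengine layout.

```python
import os

HEADER_SIZE = 4096      # hypothetical header size
FILE_SIZE = 64 * 1024   # hypothetical fixed size of every data file


def create_ring(directory, count):
    """Create `count` fully written files up front; returns paths, oldest first."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"datafile-{i:06d}.bin")
        with open(path, "wb") as f:
            f.write(bytes(FILE_SIZE))  # real zero bytes, not a sparse hole
        paths.append(path)
    return paths


def recycle_oldest(paths, next_index):
    """Move the oldest file to the newest position and zero its header."""
    oldest = paths.pop(0)
    newest = os.path.join(os.path.dirname(oldest), f"datafile-{next_index:06d}.bin")
    os.rename(oldest, newest)        # changes directory metadata only
    with open(newest, "r+b") as f:   # r+b overwrites in place and never truncates
        f.write(bytes(HEADER_SIZE))
        f.flush()
        os.fsync(f.fileno())
    paths.append(newest)
    return newest
```

The recycled file keeps its size, so new metrics are written over old blocks instead of asking the filesystem for more space.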

This strategy of using preallocated disk space could be the default, to allow dbengine to always use a fixed amount of disk space for metrics.
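
As a sketch of what preallocation could look like at the system call level, the Python below reserves space with `os.posix_fallocate` (Unix only) and then only ever writes inside the reserved region. The helper names `preallocate` and `write_at` are hypothetical.

```python
import os


def preallocate(path, size):
    """Reserve `size` bytes for `path`; a full disk fails here, not at write time."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)


def write_at(path, offset, payload, size):
    """Write inside the reserved region; refuse anything that would grow the file."""
    if offset < 0 or offset + len(payload) > size:
        raise ValueError("write would exceed the preallocated region")
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, payload, offset)
        os.fsync(fd)
    finally:
        os.close(fd)
```

With this shape, running out of disk is detected once, when the file is created, and the agent can decide up front how much history it can afford to keep.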

Ideally, we would like to have something similar for sqlite. I hope there is a solution for this.
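
One possible approach for sqlite, shown as a Python sketch: write a large filler blob, then drop it. With `auto_vacuum` off the file does not shrink, and the freed pages go to the freelist, which SQLite reuses before growing the file. The table name `reserve_filler` and the function `open_with_reserve` are invented for the example, and whether this fits netdata's database is untested here.

```python
import os
import sqlite3


def open_with_reserve(path, reserve_bytes):
    """Open a new database and leave about `reserve_bytes` of free pages inside it."""
    conn = sqlite3.connect(path)
    # Set before any table exists, so freed pages stay in the file.
    conn.execute("PRAGMA auto_vacuum = NONE")
    conn.execute("CREATE TABLE reserve_filler (filler BLOB)")
    conn.execute("INSERT INTO reserve_filler VALUES (zeroblob(?))", (reserve_bytes,))
    conn.commit()
    conn.execute("DROP TABLE reserve_filler")  # pages move to the freelist
    conn.commit()
    return conn
```

This only covers the main database file. The rollback journal or WAL still needs free disk space while a transaction is in flight, so it reduces the problem rather than eliminating it.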

No disk situation

Disks may also fail completely. No disk at all, suddenly, at runtime.

Netdata runtime should be able to survive such a situation and continue running, triggering alarms, streaming metrics to a parent, communicating with netdata cloud.
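
A minimal Python sketch of the degradation I have in mind: every point goes to a bounded in-memory buffer first, and the disk write is best effort. `ResilientStore` and its parameters are hypothetical and are not how dbengine is structured.

```python
from collections import deque


class ResilientStore:
    """Keeps recent points in RAM; disk persistence is best effort."""

    def __init__(self, path, memory_points=3600):
        self.path = path
        self.memory = deque(maxlen=memory_points)  # oldest points fall off the end
        self.disk_ok = True

    def append(self, timestamp, value):
        self.memory.append((timestamp, value))  # the RAM copy always succeeds
        try:
            with open(self.path, "a", encoding="utf-8") as f:
                f.write(f"{timestamp},{value}\n")
            self.disk_ok = True
        except OSError:
            self.disk_ok = False  # degrade instead of crashing

    def latest(self, n):
        """Return up to `n` most recent points, served from memory only."""
        return list(self.memory)[-n:] if n > 0 else []
```

Queries are answered from memory, so the dashboard and alarms keep working while `disk_ok` is False. A vanished disk can also make I/O hang rather than fail, so a real agent would additionally need to keep disk writes off the query path; the sketch does not cover that.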

Steps to reproduce

  1. Run netdata
  2. Get the disk 100% full

or

  1. Run netdata from a removable disk
  2. Remove the disk on the fly
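
For exercising a single write path without actually filling a disk, Linux provides `/dev/full`, where every write fails with ENOSPC. The Python helper below is only a sketch with an invented name; it shows the error a disk-full write produces and does not reproduce the whole scenario.

```python
import errno
import os


def write_reports_enospc(path="/dev/full"):
    """True if a write to `path` fails with ENOSPC, False if the write
    succeeds or fails differently, None if `path` cannot be opened."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError:
        return None  # e.g. not Linux, or no /dev/full in this container
    try:
        os.write(fd, b"metrics page")
    except OSError as exc:
        return exc.errno == errno.ENOSPC
    finally:
        os.close(fd)
    return False
```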

Installation method

kickstart.sh

System info

Any

Netdata build info

# /opt/netdata/bin/netdata -W buildinfo
Version: netdata v1.33.1-99-gcf90fc9e8
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--with-bundled-libJudy' 'CFLAGS=-static -O3 -I/openssl-static/include' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: kickstart-static
    Binary architecture: armv7l
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

Additional info

Related to netdata/netdata-cloud#323

@ktsaou added the "bug" and "needs triage" labels Mar 4, 2022
@ktsaou changed the title from "[Bug]: netdata stops working when disk if full" to "[Bug]: netdata stops working when disk is full" Mar 4, 2022
@ktsaou
Member Author

ktsaou commented Mar 4, 2022

@stelfrag @cpipilas please check this.

@Ferroin
Member

Ferroin commented Mar 7, 2022

I generally agree that the preallocated space solution should probably be the default here (and I’m 99% certain that there is a way to get SQLite to preallocate space too). In most cases, that should actually improve performance for us and lower the overall impact on the rest of the system.

However, it will not fix this issue in all cases. Specifically, on filesystems that use copy-on-write semantics for internal updates (BTRFS, ZFS, etc.), this approach can still fail, because overwriting a preallocated block there still allocates a new block, so you need at least enough free room for the new data regardless.
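
If we wanted the agent to at least warn about this case, one option would be to look up the filesystem type of the data directory. The Python sketch below parses text in the `/proc/mounts` format; the function names and the `COW_FILESYSTEMS` set are assumptions, and escaped characters in mount points (such as `\040` for a space) are not handled.

```python
import os

COW_FILESYSTEMS = {"btrfs", "zfs"}  # the two named above; not exhaustive


def filesystem_type(path, mounts_text):
    """Return the fs type of the longest mount point that contains `path`."""
    path = os.path.abspath(path)
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # >= lets a later mount on the same mount point win, since it shadows earlier ones
        if (path + "/").startswith(prefix) and len(mount_point) >= best_len:
            best_len, best_type = len(mount_point), fs_type
    return best_type


def is_copy_on_write(path, mounts_text):
    return filesystem_type(path, mounts_text) in COW_FILESYSTEMS
```

On Linux this could be called as `is_copy_on_write(data_dir, open("/proc/mounts").read())`, where `data_dir` is whichever directory holds the dbengine files.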
