[Bug]: netdata stops working when disk is full #12324

ktsaou · 2022-03-04T22:58:43Z

Bug description

On a test Raspberry Pi I have, I experienced this issue:

netdata was running
disk got almost full
netdata agent properly raised disk full alarm to WARNING, which was sent to the cloud
disk got almost 100% full
netdata agent tried to commit dbengine data to disk, which failed
10min_dbengine_global_io_errors was raised and the alarm was sent to the cloud
disk got 100% full
netdata agent properly raised disk full alarm to CRITICAL, which was NOT sent to the cloud

At this point, the netdata agent was still running. I didn't check if it was responsive or not.
I removed a file to make disk space and tried to see if netdata could recover from the situation.
I realized that netdata could not respond to dashboard queries. All queries were just hanging.
I tried to restart netdata. systemd killed it after some time (90 secs I think).

After the netdata restart, the cloud didn't sync alarms properly. Certain alarms are still raised in the cloud while none are raised on the agent, so the cloud failed to detect the discrepancy in the alarm log.

Expected behavior

netdata should survive a disk full situation. Even if data cannot be saved to disk, netdata should continue to function properly, given that old data may have to be discarded.
netdata should always send alerts to the cloud, even if it cannot commit the alarm log to disk. The cloud should be aware of all alerts, even if disk is not usable.
netdata should properly re-sync with the cloud after a crash. A crash may mean an alarm snapshot re-sync is needed. At least, we should be able to trigger this somehow.
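
To illustrate the second expectation, here is a minimal sketch in Python (not netdata's actual C code) of an alert path where a failed write to the alarm log does not block the cloud notification. The names `raise_alert`, `persist`, `send_to_cloud` and `unsaved` are hypothetical and exist only for this example.

```python
import errno
from typing import Callable, List

# Errors that mean "the disk is full, failing or gone" in this sketch.
DISK_ERRORS = (errno.ENOSPC, errno.EIO, errno.ENOENT, errno.EROFS)


def raise_alert(alert: dict,
                persist: Callable[[dict], None],
                send_to_cloud: Callable[[dict], None],
                unsaved: List[dict]) -> bool:
    """Send an alert even if writing it to the local alarm log fails.

    Returns True when the alert was persisted, False when it was only sent.
    """
    persisted = True
    try:
        persist(alert)
    except OSError as exc:
        if exc.errno not in DISK_ERRORS:
            raise
        persisted = False
        unsaved.append(alert)  # keep it in memory for a later retry
    send_to_cloud(alert)  # always attempted, whether or not persisting worked
    return persisted


# Usage: simulate a full disk and check that the alert still goes out.
def full_disk(_alert: dict) -> None:
    raise OSError(errno.ENOSPC, "No space left on device")


sent: List[dict] = []
unsaved: List[dict] = []
critical = {"name": "disk_space_usage", "status": "CRITICAL"}
was_persisted = raise_alert(critical, full_disk, sent.append, unsaved)
```

The alerts collected in `unsaved` could later be written to disk or used to trigger a snapshot re-sync once space is available again. That part is left out here.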

Disk full situation

Netdata may have to change the dbengine rotation policy to adapt to disk full situations.

So, once it cannot append metrics to a disk file, it could trigger a dbengine file rotation. dbengine file rotation could be done by moving the oldest file to the newest position, zeroing its headers and using the existing file as a preallocated buffer to commit data to disk.

This strategy of using preallocated disk space could be the default, to allow dbengine to always use a fixed amount of disk space for metrics.
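
A rough sketch of this rotation idea in Python follows. It is an illustration only: the file names, the 64-byte `HEADER_SIZE` and the `PreallocatedRing` class are assumptions made for the example and do not reflect dbengine's real on-disk format.

```python
import os
import tempfile

HEADER_SIZE = 64  # hypothetical header length, not dbengine's real layout


class PreallocatedRing:
    """A fixed set of fixed-size files; the oldest file is reused on rotation."""

    def __init__(self, directory: str, n_files: int, file_size: int):
        if file_size <= HEADER_SIZE:
            raise ValueError("file_size must be larger than the header")
        self.file_size = file_size
        self.paths = [os.path.join(directory, f"datafile-{i}.bin")
                      for i in range(n_files)]
        for path in self.paths:
            with open(path, "wb") as f:
                # Write real zeros so the blocks are allocated up front;
                # truncate() alone would only create a sparse file.
                f.write(b"\0" * file_size)
                f.flush()
                os.fsync(f.fileno())
        self.current = 0            # index of the file being written
        self.offset = HEADER_SIZE   # next write position inside that file

    def append(self, record: bytes) -> str:
        """Write a record in place and return the path it was written to."""
        if len(record) > self.file_size - HEADER_SIZE:
            raise ValueError("record larger than a data file")
        if self.offset + len(record) > self.file_size:
            self._rotate()
        path = self.paths[self.current]
        with open(path, "r+b") as f:  # "r+b" overwrites, the file never grows
            f.seek(self.offset)
            f.write(record)
        self.offset += len(record)
        return path

    def _rotate(self) -> None:
        # The next file in the ring is the oldest one: it becomes the newest.
        self.current = (self.current + 1) % len(self.paths)
        self.offset = HEADER_SIZE
        with open(self.paths[self.current], "r+b") as f:
            f.write(b"\0" * HEADER_SIZE)  # zero the old header before reuse


# Usage: two 128-byte files, three 40-byte records force two rotations.
workdir = tempfile.mkdtemp()
ring = PreallocatedRing(workdir, n_files=2, file_size=128)
written_to = [ring.append(bytes([c]) * 40) for c in b"abc"]
total_size = sum(os.path.getsize(p) for p in ring.paths)
```

Because every write lands inside space that was allocated at startup, appending metrics should not need new blocks on filesystems that overwrite in place. Total disk usage stays at `n_files * file_size`.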

Ideally, we would like to have something similar for sqlite. I hope there is a solution for this.
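
One possible approach for sqlite, sketched in Python with the standard `sqlite3` module: grow the database file once with a large `zeroblob`, then delete it so the pages stay on SQLite's freelist and get reused by later inserts. This is an assumption about what could work, not something netdata is known to do. The `_padding` table, the `alarm_log` table and the `preallocate_sqlite` helper are hypothetical.

```python
import os
import sqlite3
import tempfile


def preallocate_sqlite(path: str, reserve_bytes: int) -> None:
    """Grow the database file once, then leave the pages on the freelist."""
    conn = sqlite3.connect(path)
    try:
        # auto_vacuum must be off (the SQLite default for new databases),
        # otherwise freed pages are handed back to the filesystem on commit.
        conn.execute("PRAGMA auto_vacuum = NONE")
        conn.execute("CREATE TABLE IF NOT EXISTS _padding (filler BLOB)")
        conn.execute("INSERT INTO _padding VALUES (zeroblob(?))",
                     (reserve_bytes,))
        conn.commit()
        conn.execute("DELETE FROM _padding")  # pages move to the freelist
        conn.commit()
    finally:
        conn.close()


# Usage: reserve 256 KiB, then insert rows and compare the file size.
db_path = os.path.join(tempfile.mkdtemp(), "meta.db")
preallocate_sqlite(db_path, 256 * 1024)
size_reserved = os.path.getsize(db_path)

conn = sqlite3.connect(db_path)
free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
conn.execute("CREATE TABLE alarm_log (id INTEGER PRIMARY KEY, msg TEXT)")
conn.executemany("INSERT INTO alarm_log (msg) VALUES (?)",
                 [("x" * 100,) for _ in range(200)])
conn.commit()
conn.close()
size_after = os.path.getsize(db_path)
```

This only covers the main database file. The rollback journal or WAL still needs some free space while a transaction is in progress, so it reduces the exposure to a full disk rather than removing it.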

No disk situation

Disks may also fail completely. No disk at all, suddenly, at runtime.

The Netdata runtime should be able to survive such a situation and continue running: triggering alarms, streaming metrics to a parent, and communicating with Netdata Cloud.
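
As a sketch of what "continue running" could mean for metric storage, the Python example below keeps a bounded in-memory copy of recent points and stops trying the disk after the first failure. The `ResilientStore` class, its text file format and the 3600-point limit are made up for the illustration.

```python
import collections
import os
import tempfile
from typing import Deque, List, Tuple

Point = Tuple[int, float]


class ResilientStore:
    """Keeps serving recent data from memory when the disk stops working."""

    def __init__(self, path: str, memory_points: int = 3600):
        self.path = path
        self.disk_ok = True
        # Bounded buffer: the oldest points are discarded automatically.
        self.recent: Deque[Point] = collections.deque(maxlen=memory_points)

    def store(self, timestamp: int, value: float) -> None:
        self.recent.append((timestamp, value))  # the memory copy always works
        if not self.disk_ok:
            return
        try:
            with open(self.path, "a") as f:
                f.write(f"{timestamp} {value}\n")
        except OSError:
            # Disk full or gone: degrade to memory-only instead of failing.
            self.disk_ok = False

    def query(self, since: int) -> List[Point]:
        return [(t, v) for t, v in self.recent if t >= since]


# Usage: the target directory does not exist, which stands in for a lost disk.
lost_path = os.path.join(tempfile.mkdtemp(), "missing-dir", "metrics.log")
store = ResilientStore(lost_path, memory_points=2)
for ts, val in [(1, 0.5), (2, 0.7), (3, 0.9)]:
    store.store(ts, val)
latest = store.query(since=2)
```

A real implementation would also need to probe the disk periodically and switch back once it is usable again. The sketch never re-enables disk writes.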

Steps to reproduce

Run netdata
Get the disk 100% full

or

Run netdata from a removable disk
Remove the disk on the fly

Installation method

kickstart.sh

System info

Any

Netdata build info

# /opt/netdata/bin/netdata -W buildinfo
Version: netdata v1.33.1-99-gcf90fc9e8
Configure options:  '--prefix=/opt/netdata/usr' '--sysconfdir=/opt/netdata/etc' '--localstatedir=/opt/netdata/var' '--libexecdir=/opt/netdata/usr/libexec' '--libdir=/opt/netdata/usr/lib' '--with-zlib' '--with-math' '--with-user=netdata' '--enable-cloud' '--without-bundled-protobuf' '--with-bundled-libJudy' 'CFLAGS=-static -O3 -I/openssl-static/include' 'LDFLAGS=-static -L/openssl-static/lib' 'PKG_CONFIG_PATH=/openssl-static/lib/pkgconfig'
Install type: kickstart-static
    Binary architecture: armv7l
Features:
    dbengine:                   YES
    Native HTTPS:               YES
    Netdata Cloud:              YES 
    ACLK Next Generation:       YES
    ACLK-NG New Cloud Protocol: YES
    ACLK Legacy:                NO
    TLS Host Verification:      YES
    Machine Learning:           YES
    Stream Compression:         YES
Libraries:
    protobuf:                YES (system)
    jemalloc:                NO
    JSON-C:                  YES
    libcap:                  NO
    libcrypto:               YES
    libm:                    YES
    tcalloc:                 NO
    zlib:                    YES
Plugins:
    apps:                    YES
    cgroup Network Tracking: YES
    CUPS:                    NO
    EBPF:                    YES
    IPMI:                    NO
    NFACCT:                  NO
    perf:                    YES
    slabinfo:                YES
    Xen:                     NO
    Xen VBD Error Tracking:  NO
Exporters:
    AWS Kinesis:             NO
    GCP PubSub:              NO
    MongoDB:                 NO
    Prometheus Remote Write: YES

Additional info

Related to netdata/netdata-cloud#323


ktsaou · 2022-03-04T23:04:52Z

@stelfrag @cpipilas please check this.

Ferroin · 2022-03-07T12:32:45Z

I generally agree that the preallocated space solution should probably be the default here (and I’m 99% certain that there is a way to get SQLite to preallocate space too). In most cases, that should actually improve performance for us and lower the overall impact on the rest of the system.

However, it will not fix this issue in all cases. Specifically, on filesystems that use copy-on-write semantics for internal updates (BTRFS, ZFS, etc.), this approach can still fail, because an overwrite there needs at least enough free room for the new data regardless.

ktsaou added the bug and needs triage labels Mar 4, 2022

ktsaou changed the title ~~[Bug]: netdata stops working when disk if full~~ [Bug]: netdata stops working when disk is full Mar 4, 2022

ktsaou mentioned this issue Mar 4, 2022

[BUG]: Alarm is CRITICAL on the agent, but WARNING on the cloud netdata/netdata-cloud#323

Closed

ktsaou mentioned this issue Mar 8, 2022

[BUG]: agent and cloud have out of sync alarm log netdata/netdata-cloud#330

Closed

hugovalente-pm added the stability label Mar 17, 2022

stelfrag added the area/daemon and area/database labels and removed the needs triage label Mar 29, 2022

cpipilas assigned MrZammler and stelfrag Jun 29, 2022
