[deploy] Alloy WAL volume alloy_data has no size limit — disk exhaustion under sustained remote_write failure #293

@obchain

PR: #55 (feat/27-docker-compose)
Files: deploy/compose/docker-compose.yml volumes block

The compose file declares:

volumes:
  alloy_data: {}

Alloy keeps its WAL under /var/lib/alloy/data and uses it to buffer scraped series when the Grafana Cloud remote_write endpoint is unavailable (network partition, Grafana Cloud outage, expired API token). With no WAL retention limits set in alloy-config.alloy and no size cap on the volume, the buffer is bounded only by Alloy's built-in defaults, so a sustained outage can keep growing it until disk space runs out.

Combined with the missing log rotation noted in a separate issue, two concurrent disk-filling vectors exist on the same 40 GB Hetzner host. Disk exhaustion kills both containers and the signing process.

Suggested fix: Configure WAL retention bounds in alloy-config.alloy:

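prometheus.remote_write "grafana_cloud" {
    endpoint { ... }
    wal {
        // How often the WAL is checked and old segments are dropped.
        truncate_frequency = "1h"
        // Alloy names these *_keepalive_time; min_wal_time / max_wal_time
        // are the Grafana Agent static-mode names and are rejected here.
        min_keepalive_time = "5m"
        max_keepalive_time = "4h"
    }
}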

This bounds buffered data to roughly 4 hours of metrics (truncation runs hourly, so slightly older samples can linger until the next pass), which should keep WAL disk usage to a few hundred MB under worst-case remote_write failure. Document the expected disk footprint in the deploy README.
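To back the "few hundred MB" figure with something reproducible for the README, here is a rough back-of-the-envelope sketch. Every input is an assumption, not a measurement from this deployment: the series count, the scrape interval, and especially `bytes_per_sample`, which is a placeholder that should be replaced with a value observed on the host (for example by comparing WAL size against ingested samples). The helper name `estimate_wal_bytes` is hypothetical.

```python
def estimate_wal_bytes(active_series: int,
                       scrape_interval_s: float,
                       retention_s: float,
                       bytes_per_sample: float) -> int:
    """Rough upper bound on WAL size during a sustained remote_write outage.

    Assumes every active series is scraped once per interval and that each
    sample costs a fixed number of bytes on disk (a simplification).
    """
    samples_per_series = retention_s / scrape_interval_s
    return int(active_series * samples_per_series * bytes_per_sample)


# Assumed inputs, for illustration only.
ACTIVE_SERIES = 10_000        # hypothetical series count
SCRAPE_INTERVAL_S = 15        # hypothetical scrape interval
# Worst case: 4h retention plus one 1h truncation interval before cleanup.
RETENTION_S = (4 + 1) * 3600
BYTES_PER_SAMPLE = 16         # placeholder; measure on the real host

wal_bytes = estimate_wal_bytes(ACTIVE_SERIES, SCRAPE_INTERVAL_S,
                               RETENTION_S, BYTES_PER_SAMPLE)
print(f"estimated worst-case WAL size: {wal_bytes / 1e6:.0f} MB")
```

With these assumed inputs the estimate lands at about 192 MB, in line with the figure above. The result scales linearly with series count, so re-run it with the real number before writing the footprint into the README.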

Refs #55

Labels: bug, layer:devops, pr-review, priority:p2-polish, status:ready