WAL cleanup panics if the WAL is not yet loaded #1126

Closed
gouthamve opened this issue Nov 23, 2021 · 2 comments · Fixed by #1289
gouthamve commented Nov 23, 2021

ts=2021-11-23T12:26:09.634937274Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2021-11-23T12:26:09.63557965Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2021-11-23T12:26:09Z level=info caller=traces/traces.go:123 msg="Traces Logger Initialized" component=traces
ts=2021-11-23T12:26:09.648577389Z caller=server.go:77 level=info msg="server configuration changed, restarting server"
ts=2021-11-23T12:26:09.649497502Z caller=gokit.go:72 level=info http=[::]:3090 grpc=[::]:9095 msg="server listening on addresses"
ts=2021-11-23T12:26:09.786774813Z caller=wal.go:182 level=info agent=prometheus instance=14d8e3c77664fe89fcfcb980c9ba0874 msg="replaying WAL, this may take a while" dir=/var/lib/agent/data/14d8e3c77664fe89fcfcb980c9ba0874/wal
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1a293f0] 
goroutine 253 [running]:
github.com/grafana/agent/pkg/metrics/instance.(*Instance).StorageDirectory(0xc000889e40, 0xc08d627830, 0xc00063fd78)
    /src/agent/pkg/metrics/instance/instance.go:559 +0x30
github.com/grafana/agent/pkg/metrics.(*WALCleaner).getManagedStorage(0xc00013e1e0, 0xc08d627830, 0x1)
    /src/agent/pkg/metrics/cleaner.go:154 +0x9d
github.com/grafana/agent/pkg/metrics.(*WALCleaner).cleanup(0xc00013e1e0)
    /src/agent/pkg/metrics/cleaner.go:248 +0xb7
github.com/grafana/agent/pkg/metrics.(*WALCleaner).run(0xc00013e1e0)
    /src/agent/pkg/metrics/cleaner.go:237 +0x98
created by github.com/grafana/agent/pkg/metrics.NewWALCleaner
    /src/agent/pkg/metrics/cleaner.go:145 +0x205

Here, WAL replay takes longer than 30 minutes, which is when the WALCleaner runs its first cleanup. Because the WAL is still nil at that point, the cleaner panics. We should fix that.
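
For illustration, a minimal Go sketch of the failure mode; the names follow the stack trace above, but the bodies and signatures are assumptions, not the agent's actual code:

```go
// Sketch only: names taken from the stack trace, bodies assumed.

// StorageDirectory reaches through the instance's WAL. The wal field stays
// nil until loadWAL finishes replaying, so a call mid-replay dereferences a
// nil pointer (the SIGSEGV above).
func (i *Instance) StorageDirectory() string {
	return i.wal.Directory()
}

// NewWALCleaner starts this loop right away on a fixed ticker, with no check
// that the instances it manages have finished loading their WALs.
func (c *WALCleaner) run() {
	ticker := time.NewTicker(c.cleanupPeriod) // 30 minutes by default
	defer ticker.Stop()
	for range ticker.C {
		c.cleanup() // walks instances via getManagedStorage -> StorageDirectory
	}
}
```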

tpaschalis commented Jan 17, 2022

I was able to reproduce this by overriding DefaultCleanupPeriod in the config and adding some delay to the loadWAL method.
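
Roughly like this; the names come from the comment above, but the signature, delay value, and body are made up for illustration:

```go
// With wal_cleanup_period set very low in the config (which overrides
// DefaultCleanupPeriod), stalling loadWAL past a few ticks hits the panic.
func (i *Instance) loadWAL() error {
	time.Sleep(30 * time.Second) // artificial delay: keeps i.wal nil across cleaner ticks
	// ... original WAL replay logic ...
	return nil
}
```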

I can go for a fix, but I'm not sure what the best thing to do here is:

  • What I'd do is simply skip any instance that's not ready in the current tick's cleanup().
    This way, we'd retry after another wal_cleanup_period (e.g. 30 minutes). The question is: what if the WAL also fails to load on the second or third tick, or never loads at all? Is it okay to just keep retrying? (There's a sketch of this after the list.)
  • (Not really a fan.) Exit gracefully with a relevant error. The problem is that a) it adds an implicit use for wal_cleanup_period, and b) even if the agent is restarted (e.g. in a Kubernetes environment), nothing guarantees it won't fail the same way each subsequent time.
  • Any other ideas?
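
A minimal sketch of the first option, assuming a hypothetical Ready() accessor and a simplified signature for getManagedStorage (the agent's real code may differ):

```go
// Skip instances whose WAL hasn't finished loading; anything skipped here
// simply gets retried on the next wal_cleanup_period tick.
func (c *WALCleaner) getManagedStorage(insts map[string]*Instance) map[string]bool {
	dirs := make(map[string]bool)
	for _, inst := range insts {
		if !inst.Ready() { // hypothetical readiness check: WAL still replaying
			continue
		}
		dirs[inst.StorageDirectory()] = true
	}
	return dirs
}
```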

Thoughts? /cc @rfratto @mattdurham
