WAL cleanup panics if the WAL is not yet loaded #1126

Closed
gouthamve opened this issue Nov 23, 2021 · 2 comments · Fixed by #1289
gouthamve commented Nov 23, 2021

ts=2021-11-23T12:26:09.634937274Z caller=node.go:85 level=info agent=prometheus component=cluster msg="applying config"
ts=2021-11-23T12:26:09.63557965Z caller=remote.go:180 level=info agent=prometheus component=cluster msg="not watching the KV, none set"
ts=2021-11-23T12:26:09Z level=info caller=traces/traces.go:123 msg="Traces Logger Initialized" component=traces
ts=2021-11-23T12:26:09.648577389Z caller=server.go:77 level=info msg="server configuration changed, restarting server"
ts=2021-11-23T12:26:09.649497502Z caller=gokit.go:72 level=info http=[::]:3090 grpc=[::]:9095 msg="server listening on addresses"
ts=2021-11-23T12:26:09.786774813Z caller=wal.go:182 level=info agent=prometheus instance=14d8e3c77664fe89fcfcb980c9ba0874 msg="replaying WAL, this may take a while" dir=/var/lib/agent/data/14d8e3c77664fe89fcfcb980c9ba0874/wal
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x1a293f0] 
goroutine 253 [running]:
github.com/grafana/agent/pkg/metrics/instance.(*Instance).StorageDirectory(0xc000889e40, 0xc08d627830, 0xc00063fd78)
    /src/agent/pkg/metrics/instance/instance.go:559 +0x30
github.com/grafana/agent/pkg/metrics.(*WALCleaner).getManagedStorage(0xc00013e1e0, 0xc08d627830, 0x1)
    /src/agent/pkg/metrics/cleaner.go:154 +0x9d
github.com/grafana/agent/pkg/metrics.(*WALCleaner).cleanup(0xc00013e1e0)
    /src/agent/pkg/metrics/cleaner.go:248 +0xb7
github.com/grafana/agent/pkg/metrics.(*WALCleaner).run(0xc00013e1e0)
    /src/agent/pkg/metrics/cleaner.go:237 +0x98
created by github.com/grafana/agent/pkg/metrics.NewWALCleaner
    /src/agent/pkg/metrics/cleaner.go:145 +0x205

Here, WAL replay takes longer than 30 minutes, which is when the WALCleaner runs its first cleanup. Because the WAL is still nil at that point, the cleaner panics. We should fix that.
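
For illustration, a minimal Go sketch of the failure mode; the names follow the stack trace above, but the bodies and signatures are assumptions, not the agent's actual code:

```go
// Sketch only: names taken from the stack trace, bodies assumed.

// StorageDirectory reaches through the instance's WAL. The wal field stays
// nil until loadWAL finishes replaying, so a call mid-replay dereferences a
// nil pointer (the SIGSEGV above).
func (i *Instance) StorageDirectory() string {
	return i.wal.Directory()
}

// NewWALCleaner starts this loop right away on a fixed ticker, with no check
// that the instances it manages have finished loading their WALs.
func (c *WALCleaner) run() {
	ticker := time.NewTicker(c.cleanupPeriod) // 30 minutes by default
	defer ticker.Stop()
	for range ticker.C {
		c.cleanup() // walks instances via getManagedStorage -> StorageDirectory
	}
}
```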

tpaschalis commented Jan 17, 2022

I was able to reproduce this by overriding DefaultCleanupPeriod in the config and adding some delay to the loadWAL method.
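
Roughly like this; the names come from the comment above, but the signature, delay value, and body are made up for illustration:

```go
// With wal_cleanup_period set very low in the config (which overrides
// DefaultCleanupPeriod), stalling loadWAL past a few ticks hits the panic.
func (i *Instance) loadWAL() error {
	time.Sleep(30 * time.Second) // artificial delay: keeps i.wal nil across cleaner ticks
	// ... original WAL replay logic ...
	return nil
}
```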

I can go for a fix, but I'm not sure what the best thing to do here is:

  • What I'd do is simply skip any instance that's not ready in the current tick's cleanup().
    This way, we'd retry after another wal_cleanup_period (e.g. 30 minutes). The question is: what if the WAL also fails to load on the second or third tick, or never loads at all? Is it okay to just keep retrying? (There's a sketch of this after the list.)
  • (Not really a fan.) Exit gracefully with a relevant error. The problem is that a) it adds an implicit use for wal_cleanup_period, and b) even if the agent is restarted (e.g. in a Kubernetes environment), nothing guarantees it won't fail the same way each subsequent time.
  • Any other ideas?
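
A minimal sketch of the first option, assuming a hypothetical Ready() accessor and a simplified signature for getManagedStorage (the agent's real code may differ):

```go
// Skip instances whose WAL hasn't finished loading; anything skipped here
// simply gets retried on the next wal_cleanup_period tick.
func (c *WALCleaner) getManagedStorage(insts map[string]*Instance) map[string]bool {
	dirs := make(map[string]bool)
	for _, inst := range insts {
		if !inst.Ready() { // hypothetical readiness check: WAL still replaying
			continue
		}
		dirs[inst.StorageDirectory()] = true
	}
	return dirs
}
```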

Thoughts? /cc @rfratto @mattdurham
