Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panic: non-positive interval for NewTicker #2263

Closed
gebn opened this issue Mar 27, 2023 · 12 comments
Closed

panic: non-positive interval for NewTicker #2263

gebn opened this issue Mar 27, 2023 · 12 comments
Labels
good first issue Good for newcomers stale Used for stale issues / PRs type/bug Something isn't working

Comments

@gebn
Copy link
Contributor

gebn commented Mar 27, 2023

Describe the bug
Crash on start-up when configuration heading(s?) do not contain any content. Note the new config.verify flag does not find any fault with the configuration.

To Reproduce
Steps to reproduce the behavior:

  1. Start Tempo v2.0.1 or latest (grafana/tempo@sha256:928aaab8da7320fd417ea725dcf4cbf200e7549a71c3e59245819d7469784c20) with this configuration:
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:

ingester:

Expected behavior
Tempo starts with the ingester component enabled, using its default configuration, including flush_check_period: 10s as in the manual.

Actual behavior
Tempo appears to use default Go values for the ingester.Config struct, causing FlushCheckPeriod to default to 0, causing the panic when calling time.NewTicker(). RegisterFlagsAndApplyDefaults() appears to attempt to set defaults, but these can be incorrectly overridden with the wrong thing.

level=info ts=2023-03-27T17:36:01.714592041Z caller=main.go:215 msg="initialising OpenTracing tracer"
level=info ts=2023-03-27T17:36:01.715180371Z caller=main.go:102 msg="Starting Tempo" version="(version=, branch=main, revision=970588df8)"
level=info ts=2023-03-27T17:36:02.306502893Z caller=server.go:323 http=[::]:3200 grpc=[::]:9095 msg="server listening on addresses"
level=warn ts=2023-03-27T17:36:02.307134494Z caller=modules.go:228 msg="Worker address is empty in single binary mode. Attempting automatic worker configuration. If queries are unresponsive consider configuring the worker explicitly." address=127.0.0.1:9095
level=info ts=2023-03-27T17:36:02.30726099Z caller=worker.go:103 msg="Starting querier worker connected to query-frontend" frontend=127.0.0.1:9095
ts=2023-03-27T17:36:02Z level=info msg="OTel Shim Logger Initialized" component=tempo
level=info ts=2023-03-27T17:36:02.307864038Z caller=frontend.go:43 msg="creating middleware in query frontend"
level=info ts=2023-03-27T17:36:02.30798823Z caller=tempodb.go:421 msg="polling enabled" interval=5m0s concurrency=50
level=info ts=2023-03-27T17:36:02.696208293Z caller=poller.go:171 msg="successfully pulled tenant index" tenant=single-tenant createdAt=2023-03-27T17:33:15.402433505Z metas=272 compactedMetas=0
level=info ts=2023-03-27T17:36:02.696780764Z caller=module_service.go:82 msg=initialising module=store
level=info ts=2023-03-27T17:36:02.69682166Z caller=module_service.go:82 msg=initialising module=internal-server
level=info ts=2023-03-27T17:36:02.697413466Z caller=module_service.go:82 msg=initialising module=server
level=info ts=2023-03-27T17:36:02.697651602Z caller=module_service.go:82 msg=initialising module=overrides
level=info ts=2023-03-27T17:36:02.697822431Z caller=module_service.go:82 msg=initialising module=memberlist-kv
level=info ts=2023-03-27T17:36:02.69799279Z caller=module_service.go:82 msg=initialising module=ring
level=info ts=2023-03-27T17:36:02.698176393Z caller=ring.go:263 msg="ring doesn't exist in KV store yet"
level=info ts=2023-03-27T17:36:02.698239591Z caller=module_service.go:82 msg=initialising module=querier
level=info ts=2023-03-27T17:36:02.698541085Z caller=module_service.go:82 msg=initialising module=metrics-generator
level=info ts=2023-03-27T17:36:02.698649277Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=tempo-78f5df48f9-c76m7 ring=metrics-generator
level=info ts=2023-03-27T17:36:02.698810518Z caller=module_service.go:82 msg=initialising module=query-frontend
level=info ts=2023-03-27T17:36:02.699059584Z caller=module_service.go:82 msg=initialising module=metrics-generator-ring
level=info ts=2023-03-27T17:36:02.699402245Z caller=module_service.go:82 msg=initialising module=ingester
level=info ts=2023-03-27T17:36:02.699424386Z caller=ingester.go:327 msg="beginning wal replay"
level=warn ts=2023-03-27T17:36:02.699475982Z caller=wal.go:112 msg="unowned file entry ignored during wal replay" file=blocks err=null
level=info ts=2023-03-27T17:36:02.699484449Z caller=ingester.go:365 msg="wal replay complete"
level=info ts=2023-03-27T17:36:02.699505357Z caller=ingester.go:379 msg="reloading local blocks" tenants=0
level=info ts=2023-03-27T17:36:02.699761416Z caller=lifecycler.go:547 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2023-03-27T17:36:02.699788457Z caller=lifecycler.go:576 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2023-03-27T17:36:02.699834804Z caller=lifecycler.go:416 msg="auto-joining cluster after timeout" ring=ingester
level=info ts=2023-03-27T17:36:02.699879637Z caller=module_service.go:82 msg=initialising module=compactor
level=info ts=2023-03-27T17:36:02.699894024Z caller=tempodb.go:421 msg="polling enabled" interval=5m0s concurrency=50
level=info ts=2023-03-27T17:36:02.700282661Z caller=worker.go:179 msg="adding connection" addr=127.0.0.1:9095
level=info ts=2023-03-27T17:36:02.700954086Z caller=module_service.go:82 msg=initialising module=distributor
ts=2023-03-27T17:36:02Z level=info msg="Starting GRPC server on endpoint 0.0.0.0:4317" component=tempo
panic: non-positive interval for NewTicker

goroutine 710 [running]:
time.NewTicker(0xc000befe10?)
        /usr/local/go/src/time/tick.go:24 +0x10f
github.com/grafana/tempo/modules/ingester.(*Ingester).loop(0xc0004fbc00, {0x2738500, 0xc000be7400})
        /drone/src/modules/ingester/ingester.go:130 +0x67
github.com/grafana/dskit/services.(*BasicService).main(0xc0006e1400)
        /drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:190 +0x1e8
created by github.com/grafana/dskit/services.(*BasicService).StartAsync.func1
        /drone/src/vendor/github.com/grafana/dskit/services/basic_service.go:119 +0x10a

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: raw YAML
@joe-elliott
Copy link
Member

Thanks for this issue. Agree we should add this to the -config.verify check and exit with a real error instead of panicing.

@joe-elliott joe-elliott added type/bug Something isn't working good first issue Good for newcomers labels Mar 27, 2023
@gebn
Copy link
Contributor Author

gebn commented Mar 27, 2023

@joe-elliott Thanks for the swift response. Is the proposed workaround here adding all the fields with values to every config file? For ease of use/simplicity of config, is there a way enable the component without repeating the "defaults"?

@zalegrala
Copy link
Contributor

We have a series of checks that could be updated to include this trip-hazard. https://github.com/grafana/tempo/blob/main/cmd/tempo/app/config.go#L128

@joe-elliott
Copy link
Member

joe-elliott commented Mar 28, 2023

Is the proposed workaround here adding all the fields with values to every config file? For ease of use/simplicity of config, is there a way enable the component without repeating the "defaults"?

The components started in the Tempo binary are defined by the "target" and not which config blocks appear. The default target is "all" or single binary which starts all components. Just remove the empty ingester block and it will default correctly.

@gebn
Copy link
Contributor Author

gebn commented Mar 28, 2023

Argh, somehow I got it in my head that having the heading enabled the component. Re-reading the docs, it's clear this is not the case - thank you for fixing my understanding. In that case, definitely agree with the config.verify check saying something along the lines of "this will result in unexpected configuration, the component is already enabled".

Is there still be a footgun if someone tries selectively override defaults? E.g. given:

ingester:
  max_block_duration: 1h

Does every other field still default to the Go zero-value?

@joe-elliott
Copy link
Member

joe-elliott commented Mar 29, 2023

Does every other field still default to the Go zero-value?

We do what you are suggesting all the time and everything defaults as expected. So it should be fine. Here's a config we use for our docker-compose examples:

I'm surprised by the behavior of just:

ingester:

I would have expected that to use defaults as well, but obviously it doesn't. Maybe some quirk of go yaml parsing?

@zalegrala
Copy link
Contributor

Possibly yes a quirk of the parser. But perhaps drop ingester: altogether and then it will be nil and should assume defaults.

@gebn
Copy link
Contributor Author

gebn commented Apr 5, 2023

Does every other field still default to the Go zero-value?

Confirming this is not the case. With no ingester: stanza, /status/config shows:

ingester:
    ...
    concurrent_flushes: 4
    flush_check_period: 10s
    flush_op_timeout: 5m0s
    trace_idle_period: 10s
    max_block_duration: 30m0s
    max_block_bytes: 524288000
    complete_block_timeout: 15m0s
    override_ring_key: ring

Attempting to override one key...

ingester:
  max_block_duration: 1h

... does not set every other field to the Go default:

ingester:
    ...
    concurrent_flushes: 4
    flush_check_period: 10s
    flush_op_timeout: 5m0s
    trace_idle_period: 10s
    max_block_duration: 1h0m0s
    max_block_bytes: 524288000
    complete_block_timeout: 15m0s
    override_ring_key: ring

So the unintuitive behaviour happens only when ingester: - and presumably other top-level keys - are specified, but empty.

@mghildiy
Copy link
Contributor

I can give it a try.

@mghildiy
Copy link
Contributor

mghildiy commented May 2, 2023

All that is needed here is to put a check in Config:CheckConfig that config.Ingester is not empty.

That would suffice I guess.

@jamaclellan
Copy link

jamaclellan commented May 3, 2023

This behavior is exhibited when any struct key is present, but contains no children or value. The YAML parser treats it as a scalar value and, in the absence of a value, assigns the zero value of the type to the field. This determination is made by the lack of presence of a value or children.

This applies at any depth in the parse tree, so it a much larger tripping hazard than I thought when looking at the issue the first time.

I initially had the same thought as mghildiy that comparison against the zero value would be sufficient for this config section, but it applies elsewhere as well. To fully resolve this, it seems like one would have to create tests for the zero value case when a struct contains defaults. Unilaterally testing against the zero value would create a tripping hazard for new config sections that may not contain any non-zero defaults. Perhaps using struct field tags this could be denoted programmatically and validated in the config parser, but it seems like a lot of code and complexity.

There is also the option of sentry values that are always assigned a non empty value and then that can be quickly tested against. Again, this seem like a potentially messy solution, but it would be effective and customizable.

A quick review of the codebase suggests that there are very few/no sections without non-zero defaults, so a stop gap could be to test against the zero value for important config structs.

Fixing the panic mentioned here would be as easy as adding a check to the config test that validated the duration and warned about it.

For documentation it may be useful to warn users against having any empty sections if they are not explicitly trying to override default values.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 3, 2023

This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity.
Please apply keepalive label to exempt this Issue.

@github-actions github-actions bot added the stale Used for stale issues / PRs label Jul 3, 2023
@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers stale Used for stale issues / PRs type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants