Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop Distributor on fatal error #1887

Merged
merged 2 commits into from
Nov 21, 2022
Merged

Conversation

rdooley
Copy link
Contributor

@rdooley rdooley commented Nov 10, 2022

What this PR does:

  • occasionally otel receivers will report a fatal error, and expected behavior is that the Host is stopped
  • match this behavior by stopping the receiver shim service and letting the distributor stop itself

If you think we need a test added to prevent regression, let me know

Which issue(s) this PR fixes:
No Issue

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

* occasionally otel receivers will report a fatal error, and expected
  behavior is that the Host is stopped
* match this behavior by stopping the receiver shim service and letting
  the distributor stop itself
@CLAassistant
Copy link

CLAassistant commented Nov 10, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Can we get a changelog entry?

If you think we need a test added to prevent regression, let me know

If you can think of a meaningful way to test this, we would appreciate it.

@@ -279,6 +279,7 @@ func (r *receiversShim) ConsumeTraces(ctx context.Context, td ptrace.Traces) err
// ReportFatalError implements component.Host
func (r *receiversShim) ReportFatalError(err error) {
_ = level.Error(log.Logger).Log("msg", "fatal error reported", "err", err)
r.StopAsync()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have you tested this? i honestly don't know the services code very well. this may simply stop the receiver, but not cause the distributor to exit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have not yet, writing an otel receiver mock to do so

Copy link
Contributor Author

@rdooley rdooley Nov 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(it doesnt work, working on that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got this working by switching the shim to be a basicservice
Will update this PR with some testing results on Monday (fed holiday tomorrow I'm off)

@rdooley
Copy link
Contributor Author

rdooley commented Nov 14, 2022

Some context before I get into the testing:

  • Shopify is using a custom pubsublite receiver and exporter in our fork of opentelemetry-collector
  • Occasionally the pubsublite client returns an unretryable error, for the purposes of our test we will simulate this by deleting the pubsublite subscription

After deleting this subscription observed

level=error ts=2022-11-14T16:53:11.711264932Z caller=shim.go:221 msg="fatal error reported" err="assigner(projects/${PROJECT_NAME}/locations/us-east1-c/subscriptions/rees-test-again): rpc error: code = NotFound desc = Resource projects/${PROJECT_ID}/locations/us-east1-c/subscriptions/rees-test-again of type SUBSCRIPTION does not exist."

And saw pods restart after the logged fatal error

kubectl get pods | grep distributor
tempo-distributor-us-central1-a-999446849-tzvnz    1/1     Running   1 (31s ago)   22m
tempo-distributor-us-central1-a-999446849-vh7nn    1/1     Running   1 (31s ago)   22m
tempo-distributor-us-central1-a-999446849-x6vvj    1/1     Running   1 (29s ago)   22m
tempo-distributor-us-central1-c-77cf7f7696-cw6hv   1/1     Running   1 (31s ago)   22m
tempo-distributor-us-central1-c-77cf7f7696-g7257   1/1     Running   1 (29s ago)   22m
tempo-distributor-us-central1-c-77cf7f7696-xfwjj   1/1     Running   1 (29s ago)   22m

@rdooley
Copy link
Contributor Author

rdooley commented Nov 14, 2022

Oh I need to amend to add a changelog entry with the bugfix

* cant use idle service to do this i think
* also add changelog entry
* fix my bad copypaste
@mapno mapno merged commit cf6a933 into grafana:main Nov 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants