Kafka receiver stuck while shutting down at v0.93.0 #30789
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
@mwear: Do you have any thoughts on why a component may never be able to get the lock when reporting status? It looks like this may be related to the work you've been doing on component status reporting. Possible related PR: open-telemetry/opentelemetry-collector#8836
Based on the research @james-ryans did, this came in after this change: #30610. What I suspect is happening is that writing the fatal error to the asyncErrorChannel in serviceHost is blocking, so that ReportStatus never returns (and never releases its lock). Here is the suspect line: https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/host.go#L73. I think this is a variation of this existing problem: open-telemetry/opentelemetry-collector#8116, which is also assigned to me. It has been on my todo list. I'll look into it.
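To make the suspected mechanism concrete, here is a minimal runnable sketch of the deadlock pattern (the `host` type, field names, and timing are simplified stand-ins, not the actual collector code): a mutex is held across a send on an unbuffered error channel that nothing drains during shutdown, so the next status report can never take the lock.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// Simplified stand-in for the collector's serviceHost; illustrative only.
type host struct {
	mu                sync.Mutex
	asyncErrorChannel chan error // unbuffered; only drained while the service runs
}

func (h *host) reportFatal(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	// During shutdown nothing receives from the channel, so this send
	// blocks forever while the mutex is still held.
	h.asyncErrorChannel <- err
}

func (h *host) reportStopped() {
	h.mu.Lock() // never succeeds: reportFatal still holds the lock
	defer h.mu.Unlock()
	// ... would record StatusStopped here ...
}

func main() {
	h := &host{asyncErrorChannel: make(chan error)}
	go h.reportFatal(errors.New("kafka: context canceled"))
	time.Sleep(100 * time.Millisecond) // let reportFatal take the lock first
	h.reportStopped()                  // blocks forever, mirroring the stuck shutdown
}
```

Running this trips Go's deadlock detector: the sender holds the mutex while blocked on the channel send, so the `StatusStopped` report can never acquire it.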
Thanks @mwear, appreciate your insight here!
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping the code owners.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
If I am not mistaken, this issue should happen for all receivers. Here is an example of a flaky test in collector-contrib due to the same issue.
Adding a reference to the issue for the flaky test: #27295 |
+1 freq: #32667 |
Your fix worked @Dennis8274, thanks for the suggestion! I've posted a PR to resolve this issue. 👍
**Description:** The kafka receiver's shutdown method cancels the context of a running sub-goroutine. However, a small bug was causing a fatal error to be reported during shutdown when this expected condition was hit. The fatal error being reported during shutdown was causing another bug to be hit, open-telemetry/opentelemetry-collector#9824. This fix means that shutdown won't be blocked in expected shutdown conditions, but the `core` bug referenced above means shutdown will still be blocked in unexpected error situations. This fix is taken from a comment made by @Dennis8274 on the issue.

**Link to tracking Issue:** Fixes #30789

**Testing:** Stepped through `TestTracesReceiverStart` in a debugger before the change to see the fatal status being reported. It was no longer reported after applying the fix. Manually tested running the collector with a kafka receiver: before the fix it was indeed blocked on a normal shutdown, but after the fix it shut down as expected.
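For reference, the shape of the fix is roughly the following (a paraphrased sketch, not the exact diff from the PR; `consumerGroup` stands for the consumer-group handle at the existing call site): treat `context.Canceled` from the consume loop as an expected exit rather than a fatal error.

```go
// Paraphrased sketch of the fixed call site; not the exact diff from the PR.
// Assumes the surrounding file imports "context" and "errors".
go func() {
	err := c.consumeLoop(ctx, consumerGroup)
	// consumeLoop returns ctx.Err() once Shutdown cancels the context;
	// that is a clean exit, not a failure.
	if err != nil && !errors.Is(err, context.Canceled) {
		c.settings.ReportStatus(component.NewFatalErrorEvent(err))
	}
}()
```

This keeps genuinely unexpected consume-loop failures fatal while letting a normal Shutdown proceed.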
Component(s)
receiver/kafka
What happened?
Description
Shutting down the kafka receiver got stuck forever while transitioning from `StatusStopping` to `StatusStopped`. I've debugged this for a while, and apparently this is because `consumeLoop` returns a `context canceled` error and a `FatalErrorEvent` is reported via `ReportStatus` at `receiver/kafkareceiver/kafka_receiver.go` between lines 163-165. But the `sync.Mutex.Lock()` taken while reporting the `FatalErrorEvent` never gets unlocked (I don't know why), so the `ReportStatus` call for `StatusStopped` is stuck forever trying to acquire the mutex lock.

Also, I've tried rolling back to v0.92.0 and it works well. I traced the issue down to `receiver/kafkareceiver/kafka_receiver.go` line 164, `c.settings.ReportStatus(component.NewFatalErrorEvent(err))`, changed in PR #30593, as the cause.
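For context, the problematic call site looks roughly like this (a paraphrased sketch of `kafka_receiver.go` around lines 163-165 at v0.93.0; the exact code may differ):

```go
// Paraphrased sketch; see receiver/kafkareceiver/kafka_receiver.go at v0.93.0.
go func() {
	if err := c.consumeLoop(ctx, consumerGroup); err != nil {
		// On Shutdown, consumeLoop returns context.Canceled, so a normal
		// shutdown is reported as a fatal error, which then deadlocks in
		// ReportStatus while transitioning to StatusStopped.
		c.settings.ReportStatus(component.NewFatalErrorEvent(err))
	}
}()
```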
Steps to Reproduce

Create a collector with the `kafkareceiver` factory in it and a `receivers.kafka` entry in the config.
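For example, a minimal configuration sketch (the broker address, `protocol_version`, and the `debug` exporter are placeholders; any pipeline that includes the kafka receiver should reproduce it):

```yaml
receivers:
  kafka:
    brokers: ["localhost:9092"]
    protocol_version: 2.0.0

exporters:
  debug: {}

service:
  pipelines:
    traces:
      receivers: [kafka]
      exporters: [debug]
```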
Expected Result

Should be able to shut down properly.
Actual Result
Stuck indefinitely while shutting down with the logs below.
Collector version
v0.93.0
Environment information
No response
OpenTelemetry Collector configuration
Log output
Additional context
No response