
Kafka receiver stuck while shutting down at v0.93.0 #30789

Closed
james-ryans opened this issue Jan 26, 2024 · 11 comments · Fixed by #32720
Labels
bug, priority:p1, receiver/kafka

Comments

@james-ryans
Contributor

james-ryans commented Jan 26, 2024

Component(s)

receiver/kafka

What happened?

Description

Shutting down the Kafka receiver gets stuck forever while transitioning from StatusStopping to StatusStopped.

I've debugged this for a while, and apparently it happens because consumeLoop returns a context canceled error and ReportStatus is called with a FatalErrorEvent at receiver/kafkareceiver/kafka_receiver.go, lines 163-165. The sync.Mutex.Lock() taken inside that ReportStatus FatalErrorEvent call never gets unlocked (I don't know why), so the ReportStatus call for StatusStopped is stuck forever trying to acquire the mutex.

Also, I tried rolling back to v0.92.0 and it works well. I traced the issue down to receiver/kafkareceiver/kafka_receiver.go line 164, c.settings.ReportStatus(component.NewFatalErrorEvent(err)), which was changed in PR #30593.
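
To make the shape of that code path concrete, here is a rough paraphrase in Go; the function and parameter names are stand-ins for illustration, not the actual receiver source:

// Illustrative paraphrase of the v0.93.0 call shape described above
// (stand-in names, not the actual collector-contrib code): on Shutdown the
// receiver cancels ctx, consumeLoop returns context.Canceled, and that
// error is still reported as a fatal component status.
package kafkasketch

import "context"

func startConsumeLoop(ctx context.Context, consumeLoop func(context.Context) error, reportFatal func(error)) {
	go func() {
		if err := consumeLoop(ctx); err != nil {
			// Upstream this is roughly c.settings.ReportStatus(component.NewFatalErrorEvent(err))
			// at kafka_receiver.go:164, per the description above.
			reportFatal(err)
		}
	}()
}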

Steps to Reproduce

Create a collector with the kafkareceiver factory in it, and include a receivers.kafka entry in the config.

Expected Result

The collector should be able to shut down properly.

Actual Result

The collector is stuck indefinitely while shutting down, with the logs below.

Collector version

v0.93.0

Environment information

No response

OpenTelemetry Collector configuration

service:
  pipelines:
    traces:
      receivers: [kafka]

receivers:
  kafka:
    brokers:
      - localhost:9092
    encoding: otlp_proto # available encodings are otlp_proto, jaeger_proto, jaeger_json, zipkin_proto, zipkin_json, zipkin_thrift
    initial_offset: earliest # consume messages from the beginning

Log output

2024-01-26T08:32:45.266+0700	info	kafkareceiver@v0.93.0/kafka_receiver.go:431	Starting consumer group	{"kind": "receiver", "name": "kafka", "data_type": "traces", "partition": 0}
^C2024-01-26T08:32:53.626+0700	info	otelcol@v0.93.0/collector.go:258	Received signal from OS	{"signal": "interrupt"}
2024-01-26T08:32:53.626+0700	info	service@v0.93.0/service.go:179	Starting shutdown...
2024-01-26T08:32:54.010+0700	info	kafkareceiver@v0.93.0/kafka_receiver.go:181	Consumer stopped	{"kind": "receiver", "name": "kafka", "data_type": "traces", "error": "context canceled"}

Additional context

No response

james-ryans added the bug and needs triage labels on Jan 26, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@crobert-1
Member

@mwear: Do you have any thoughts on why a component may never be able to get the lock when reporting status? It looks like this may be related to the work you've been doing on component status reporting.

Possible related PR: open-telemetry/opentelemetry-collector#8836

@mwear
Member

mwear commented Jan 26, 2024

Based on the research @james-ryans did, this came in after this change: #30610. What I suspect is happening is that writing the fatal error to the asyncErrorChannel in serviceHost is blocking, so that ReportStatus never returns (and never releases its lock). Here is the suspect line: https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/host.go#L73.

I think this is a variation of this existing problem: open-telemetry/opentelemetry-collector#8116, which is also assigned to me. It has been on my todo list. I'll look into it.
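
To make the suspected interaction concrete, here is a minimal, self-contained Go sketch of that deadlock pattern; the names (reporter, asyncErrorCh, reportFatal, reportStopped) are illustrative stand-ins, not the collector's actual service/host code:

// Demonstrates the suspected pattern: a mutex held across a send on an
// unbuffered channel that nobody is draining during shutdown.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type reporter struct {
	mu           sync.Mutex
	asyncErrorCh chan error // unbuffered, with no reader once shutdown starts
}

// reportFatal holds the mutex while sending on the unbuffered channel.
// With no reader, the send blocks forever and the mutex is never released.
func (r *reporter) reportFatal(err error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.asyncErrorCh <- err
}

// reportStopped needs the same mutex, so StatusStopped waits behind reportFatal.
func (r *reporter) reportStopped() {
	r.mu.Lock()
	defer r.mu.Unlock()
	fmt.Println("StatusStopped reported")
}

func main() {
	r := &reporter{asyncErrorCh: make(chan error)}
	go r.reportFatal(errors.New("context canceled")) // consume loop exiting on shutdown
	time.Sleep(100 * time.Millisecond)               // let reportFatal take the lock first
	r.reportStopped()                                // hangs; Go reports "all goroutines are asleep - deadlock!"
}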

@crobert-1
Member

Thanks @mwear, appreciate your insight here!


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Mar 27, 2024
crobert-1 removed the Stale label on Mar 27, 2024
@atoulme
Contributor

atoulme commented Mar 30, 2024

This is open-telemetry/opentelemetry-collector#9824.

atoulme removed the needs triage label on Mar 30, 2024
@lahsivjar
Contributor

This is open-telemetry/opentelemetry-collector#9824.

If I am not mistaken, this issue should happen for all receivers. Here is an example of a flaky test in collector-contrib caused by the same issue hitting the opencensus receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/8742859512/job/23992117763. The test has a goroutine dump that points to the same problem with asyncErrorChannel as in the linked issue.

@crobert-1
Member

This is open-telemetry/opentelemetry-collector#9824.

Here is an example of a flaky test in collector-contrib caused by the same issue hitting the opencensus receiver: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/8742859512/job/23992117763. The test has a goroutine dump that points to the same problem with asyncErrorChannel as in the linked issue.

Adding a reference to the issue for the flaky test: #27295

@crobert-1
Member

+1 freq: #32667

@Dennis8274

quick fix as follows?
[screenshot: suggested code change]
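
Paraphrasing the idea as a hedged sketch (reportIfFatal is a hypothetical helper for illustration, not the exact change that was eventually merged): treat context.Canceled from the consume loop as a normal shutdown instead of a fatal error.

package kafkasketch

import (
	"context"
	"errors"
)

// reportIfFatal skips status reporting for the error that a normal
// Shutdown is expected to produce when it cancels the consume loop.
func reportIfFatal(err error, reportFatal func(error)) {
	if err != nil && !errors.Is(err, context.Canceled) {
		reportFatal(err)
	}
}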

@crobert-1
Member

Your fix worked, @Dennis8274, thanks for the suggestion! I've posted a PR to resolve this issue. 👍

MovieStoreGuy pushed a commit that referenced this issue May 22, 2024
**Description:**
The kafka receiver's shutdown method works by cancelling the context of a
running sub-goroutine. However, a small bug was causing a fatal error to
be reported during shutdown when this expected condition was hit. The
fatal error being reported during shutdown was triggering another bug,
open-telemetry/opentelemetry-collector#9824.

This fix means that shutdown won't be blocked in expected shutdown
conditions, but the `core` bug referenced above means shutdown will
still be blocked in unexpected error situations.

This fix is being taken from a comment made by @Dennis8274 on the issue.

**Link to tracking Issue:**
Fixes #30789

**Testing:**
Stepped through `TestTracesReceiverStart` in a debugger before the
change to see the fatal status being reported. It was no longer reported
after applying the fix. Manually tested running the collector with a
kafka receiver and saw that before the fix it was indeed being blocked
on a normal shutdown, but after the fix it shut down as expected.
cparkins pushed a commit to AmadeusITGroup/opentelemetry-collector-contrib that referenced this issue Jul 11, 2024