
Process never exits due to hang in shutdown #2284

Open
flashgorman-chorus opened this issue Nov 18, 2021 · 30 comments
Labels
bug Something isn't working

Comments

@flashgorman-chorus

We have a process that sometimes fails to exit until we hit Ctrl-C. The traceback at that time shows the process is hung, attempting to "join" a worker thread:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/usr/local/lib/python3.7/dist-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):

In this case, the process completed in about 500 ms and would have produced only a few spans. The process was configured to export to a remote OT collector, which may or may not have been reachable at the time, but we expect the reachability of the collector not to interfere with process execution. We gave it about 30 seconds to exit before terminating the process.

This, of course, interferes with auto-scaling and our ability to manage processes.

Please advise.

@owais
Contributor

owais commented Nov 18, 2021

This shouldn't happen and looks like a bug. I wasn't able to reproduce it with a simple example. Could you please share a reproducible example with all otel dependencies listed?

@owais owais added the bug Something isn't working label Nov 18, 2021
@flashgorman-chorus
Author

Thanks, @owais. I was able to reproduce the issue with this stand-alone test:

import time

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.sdk.resources import DEPLOYMENT_ENVIRONMENT
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.resources import SERVICE_NAME
from opentelemetry.sdk.resources import SERVICE_NAMESPACE
from opentelemetry.sdk.resources import SERVICE_VERSION
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
from opentelemetry.trace import set_tracer_provider
from opentelemetry import trace


tracer = trace.get_tracer(__name__)


def main():
    print("Initializing OpenTelemetry...")
    provider = TracerProvider(
        sampler=TraceIdRatioBased(0.1),
        resource=Resource.create(
            {
                DEPLOYMENT_ENVIRONMENT: "production",
                "platform": "chorus",
                SERVICE_NAME: "mike",
                SERVICE_NAMESPACE: "chorus",
                SERVICE_VERSION: "1.0",
            }
        ),
    )
    set_tracer_provider(provider)
    span_processor = BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel.chorus.internal:4317")
    )
    provider.add_span_processor(span_processor)
    print("Pretending to do work...")

    with tracer.start_as_current_span("foo"):
        time.sleep(0.5)

    print("Done!  (Python should exit now.)")


if __name__ == "__main__":
    main()

It took 10-20 tries but eventually exhibited the "hung" behavior reported above. After it hung, I hit Ctrl+C to get the process to exit. Here's the console output of that run:

Initializing OpenTelemetry...
Pretending to do work...
Done!  (Python should exit now.)
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
    self._active_span_processor.shutdown()
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
    sp.shutdown()
  File "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 385, in shutdown
    self.worker_thread.join()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1044, in join
    self._wait_for_tstate_lock()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 1060, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

And here are all of the OT packages I have installed:

opentelemetry-api                           1.6.2
opentelemetry-exporter-otlp                 1.6.2
opentelemetry-exporter-otlp-proto-grpc      1.6.2
opentelemetry-exporter-otlp-proto-http      1.6.2
opentelemetry-instrumentation               0.25b2
opentelemetry-instrumentation-boto          0.25b2
opentelemetry-instrumentation-botocore      0.25b2
opentelemetry-instrumentation-celery        0.25b2
opentelemetry-instrumentation-elasticsearch 0.25b2
opentelemetry-instrumentation-pymongo       0.25b2
opentelemetry-instrumentation-redis         0.25b2
opentelemetry-instrumentation-requests      0.25b2
opentelemetry-instrumentation-sqlalchemy    0.25b2
opentelemetry-instrumentation-tornado       0.25b2
opentelemetry-proto                         1.6.2
opentelemetry-sdk                           1.6.2
opentelemetry-semantic-conventions          0.25b2
opentelemetry-util-http                     0.25b2

@flashgorman-chorus
Author

flashgorman-chorus commented Nov 18, 2021

FYI - when I replace OTLPSpanExporter with ConsoleSpanExporter, I am not able to reproduce the issue (after 70 tests).

@flashgorman-chorus
Author

Restoring the original OTLPSpanExporter in my test but disabling the sampler, the issue now happens on every execution. So, previously, the reason it took "10 or 20 tries" was that when the sampler did not sample the span, everything was fine. But when every span is processed (because the sampler is commented out), the Python process hangs every time.

@flashgorman-chorus
Author

To help illuminate what's going on during shutdown, I monkey-patched the BatchSpanProcessor.shutdown method to print a traceback for each thread, showing what each thread is doing, before dispatching to the "real" shutdown method. The results were, indeed, illuminating:

Initializing OpenTelemetry...
Pretending to do work...
Done!  (Python should exit now.)
in BatchSpanProcessor.shutdown!
These threads exist:
<_MainThread(MainThread, stopped 4441382400)>
<Thread(OtelBatchSpanProcessor, started daemon 123145478844416)>
BatchSpanProcessor's worker_thread is <Thread(OtelBatchSpanProcessor, started daemon 123145478844416)> (123145478844416)
Here's what all threads are doing as we speak:

# ThreadID: 123145478844416
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 890, in _bootstrap
  self._bootstrap_inner()
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
  self.run()
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 870, in run
  self._target(*self._args, **self._kwargs)
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 237, in worker
  self.condition.wait(timeout)
File: "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 300, in wait
  gotit = waiter.acquire(True, timeout)

# ThreadID: 4441382400
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 1136, in shutdown
  self._active_span_processor.shutdown()
File: "/Users/mgorman/projects/web-server/.env/lib/python3.7/site-packages/opentelemetry/sdk/trace/__init__.py", line 166, in shutdown
  sp.shutdown()
File: "/Users/mgorman/projects/web-server/ot_test.py", line 39, in __my_shutdown
  for filename, lineno, name, line in traceback.extract_stack(stack):
Forwarding the call to shutdown...

I can provide my monkey-patching if needed but, hopefully, this is enough to figure out what's going on.
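
For reference, a patch along these lines (reconstructed as a sketch from the output above, not necessarily the author's exact code) could look like this; it just dumps every thread's stack and then forwards to the real shutdown:

import sys
import threading
import traceback

from opentelemetry.sdk.trace.export import BatchSpanProcessor

_original_shutdown = BatchSpanProcessor.shutdown


def _my_shutdown(self):
    print("in BatchSpanProcessor.shutdown!")
    print("These threads exist:")
    for thread in threading.enumerate():
        print(thread)
    print("Here's what all threads are doing as we speak:")
    for thread_id, frame in sys._current_frames().items():
        print(f"\n# ThreadID: {thread_id}")
        # Render each frame of this thread's stack in a traceback-like format.
        for filename, lineno, name, line in traceback.extract_stack(frame):
            print(f'File: "{filename}", line {lineno}, in {name}')
            if line:
                print(f"  {line}")
    print("Forwarding the call to shutdown...")
    _original_shutdown(self)


# Install the patch before the provider/processors are shut down.
BatchSpanProcessor.shutdown = _my_shutdown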

@flashgorman-chorus
Author

Appears the worker thread is hung here:
[screenshot: the BatchSpanProcessor worker loop waiting in self.condition.wait(timeout)]

@srikanthccv
Member

srikanthccv commented Nov 19, 2021

There is an exponential backoff and jitter retry mechanism (up to 64 seconds) for OTLPSpanExporter. This same issue has come up a few times because people are not aware of it. The spec is vague and doesn't recommend a timeout. The last time we talked about it we agreed on this value (earlier it was 900 seconds, taken from the Go implementation) and didn't want to make it configurable (via constructor/env) to avoid introducing breaking changes. I think in these kinds of scenarios it should finish up as much as possible, cancel the ongoing work, and exit.
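
For illustration only (a rough sketch, not the exporter's actual code), the retry behaves along these lines, with the delay doubling up to the 64-second cap and a random jitter applied to each wait:

import random
import time

MAX_RETRY_DELAY = 64  # seconds; the cap mentioned above


def export_with_retry(send_once):
    # send_once() is a placeholder for a single export attempt; returns True on success.
    delay = 1
    while delay <= MAX_RETRY_DELAY:
        if send_once():
            return True
        # Wait a jittered amount of time, then double the delay for the next attempt.
        time.sleep(random.uniform(0, delay))
        delay *= 2
    return False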

@owais
Contributor

owais commented Nov 19, 2021

Thanks for the detailed explanation @flashgorman-chorus. This is a great report!

@lonewolf3739 Shouldn't ctrl-c cancel all in-flight requests and exit immediately? That's what I would expect as an end-user.

@srikanthccv
Member

srikanthccv commented Nov 19, 2021

The author was expecting it to exit without having to hit Ctrl-C, which makes sense. It exits after a minute, but nobody waits that long and so perceives this as a hung process.

@owais
Contributor

owais commented Nov 19, 2021

Ah, that makes sense.

@owais
Contributor

owais commented Nov 19, 2021

I think this is working as expected in that case and can probably be closed. It would be great if processors could watch for sigkill, log a helpful message, and exit instead of crashing, but that's not a big deal, especially if the implementation is not going to be straightforward.

@ocelotl
Contributor

ocelotl commented Nov 19, 2021

I agree with @owais; what would really "fix" this issue is a log message that lets the user know the process is waiting.

@flashgorman-chorus
Author

Thanks, all. I would like clarification on whether you are saying there is nothing we can do to influence this "process shutdown" behavior. For example, is there a timeout configuration we can provide if we want the shutdown behavior to not wait?

Ideally, we would like a property that says "how long should OTLPSpanExporter wait to flush spans (which I assume is what it is waiting on) during shutdown?" The default could be 64 seconds, as it is today, but if your priority is process management and you don't care about lost spans, you might set this to 0 seconds, which says "never delay process shutdown due to... well, for any reason."

@owais
Contributor

owais commented Nov 22, 2021

@flashgorman-chorus there is no special timeout during shutdown today. We just have a single retry timeout which we could make configurable but I'm not sure if setting it to 0 would be a good solution to this problem.

I think what we should do is detect SIGKILL and make one last attempt to flush everything in memory without retry.

@owais
Contributor

owais commented Nov 22, 2021

I think tracer provider should be responsible for detecting this and calling shutdown() on itself resulting in all processors shutting down and flushing ASAP.

@owais
Contributor

owais commented Nov 22, 2021

@ocelotl @lonewolf3739 thoughts about this? ^

@flashgorman-chorus
Author

Thanks. Will await your thoughtful decision. But, just to be clear, we are talking about a process exiting naturally - no human or Ctrl+C involved. I don't know if SIGKILL is involved under "natural exit" but if that only happens under Ctrl+C or kill <pid> scenarios, I wanted to make sure it is clear we are hoping for natural process exit without human intervention and without delay.

@owais
Contributor

owais commented Nov 22, 2021

@flashgorman-chorus I'm not sure what we can do to "automate" that case. Tracer provider does have a shutdown method and for "natural" exit cases, you could call provider.shutdown() explicitly in your code and that should have the same effect.

@owais
Contributor

owais commented Nov 22, 2021

# assumes main() and the provider setup from the reproduction script above
from opentelemetry.trace import get_tracer_provider

if __name__ == "__main__":
    main()
    provider = get_tracer_provider()
    provider.force_flush()
    provider.shutdown()

See if this works for you ^ @flashgorman-chorus
Not sure if you need to call force_flush as well, but you can try and see.

@flashgorman-chorus
Author

flashgorman-chorus commented Nov 22, 2021

That, unfortunately, will not work because we have at least 100 different scripts so there's no single point of entry and, thus, no single point of exit where we could call this.

Moreover, we see shutdown being called - so that is not the problem. It is what's going on inside shutdown that causes the process to hang. If force_flush would do the trick, then what would be nice is if shutdown looked like this:

def shutdown(self, force_flush: bool = True):
    if force_flush:
        self.force_flush()

    # proceed with shutdown...

@owais
Contributor

owais commented Nov 22, 2021

I think shutdown should force flush implicitly if it doesn't already.

@flashgorman-chorus
Author

I tested this with an explicit call to provider.force_flush() right before I expect the process to exit and all that did was add a 30-second delay.

I next attempted to add a call to self.force_flush() inside my monkey-patched version of shutdown (along the lines I suggested above) with the same results - an extra 30 seconds.

I did verify that if I simply wait 64 seconds, the process will exit. What I don't understand is what it is trying to do for 64 seconds. It can't be "that's the time it takes to flush all of the queued-up spans" because I only create one span in this test.

@ocelotl
Contributor

ocelotl commented Nov 22, 2021

> What I don't understand is what it is trying to do for 64 seconds.

It is mostly waiting. We have to do this; it is required by the spec, and the intention of the spec is to protect the destination from being overwhelmed. If we allowed the export to wait 0 seconds we would make that possible, and it would be against the spec. In my opinion we should not allow the user to introduce another delay algorithm, as the spec requires this one.

If you have specific needs, for testing or otherwise, I would recommend making a custom exporter; it should not be very hard:

https://github.com/open-telemetry/opentelemetry-python/blob/main/exporter/opentelemetry-exporter-otlp-proto-grpc/src/opentelemetry/exporter/otlp/proto/grpc/exporter.py#L310

Changing delay to 0 in that line should do the trick.

I'm sorry, I wish we could provide a solution for this directly in our published code, but I think for this particular situation the spec does not allow us to do so (@owais, @lonewolf3739 please comment if you disagree).

Please let me know if you need help implementing this, @flashgorman-chorus, you can contact me directly via Slack at the CNCF workspace.
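
One possible shape for such a custom exporter is sketched below. It is not the internal-delay tweak linked above; it only uses the public SpanExporter interface to wrap the OTLP exporter and give up if an export is still blocked (for example, still retrying) after a hard deadline. The class name and the 5-second default are illustrative.

import threading

from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult


class DeadlineSpanExporter(SpanExporter):
    """Wrap another exporter and never block the caller longer than `deadline` seconds."""

    def __init__(self, wrapped, deadline=5.0):
        self._wrapped = wrapped
        self._deadline = deadline

    def export(self, spans):
        result = [SpanExportResult.FAILURE]

        def _run():
            result[0] = self._wrapped.export(spans)

        worker = threading.Thread(target=_run, daemon=True)
        worker.start()
        # If the wrapped exporter is still retrying after the deadline, give up
        # (dropping these spans) so the BatchSpanProcessor worker can move on.
        worker.join(self._deadline)
        return result[0]

    def shutdown(self):
        self._wrapped.shutdown()

    def force_flush(self, timeout_millis=30000):
        return True

It would then wrap the OTLP exporter when building the pipeline, e.g. BatchSpanProcessor(DeadlineSpanExporter(OTLPSpanExporter(...))). The trade-off is that spans still being retried past the deadline are dropped.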

@flashgorman-chorus
Author

Alright! Thanks for the thoughtful response. We'll put our heads together here and decide how we want to proceed (or if we need any help). Everyone's desire to do the right thing has been very refreshing. Thank you.

@owais
Contributor

owais commented Nov 22, 2021

@ocelotl @lonewolf3739 I think the retry mechanism with jitter should only apply during normal operation of a service. That should be enough to comply with the spec. Once an application starts to shut down, I think it might be more reasonable to make an immediate and final attempt (or maybe a few, without exponential back-off) at exporting everything in memory and quit after that.

@Shihao-Zhong

Hi guys,

We are facing the same problem with the hang in the _wait_for_tstate_lock function. Can we apply some special logic for this when the application is ending?

@markfickett

I'm occasionally seeing a hang similar to this. My app seems to hang indefinitely (> 1h), which doesn't match up with the 64s timeout described above. The app is using Python multiprocessing.

I think I'm seeing the same stack trace when I eventually ^C the program:

^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 1169, in shutdown
    self._active_span_processor.shutdown()

  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/__init__.py", line 171, in shutdown
    sp.shutdown()
  File "/.../python3.9/site-packages/opentelemetry/sdk/trace/export/__init__.py", line 402, in shutdown
    self.worker_thread.join()
  File "/usr/lib/python3.9/threading.py", line 1033, in join
Aborted!
    self._wait_for_tstate_lock()
  File "/usr/lib/python3.9/threading.py", line 1049, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):
KeyboardInterrupt

I'm using:

opentelemetry-api==1.11.1
opentelemetry-sdk==1.11.1
opentelemetry-exporter-otlp-proto-grpc==1.11.1
opentelemetry-instrumentation-requests==0.30b1
opentelemetry-instrumentation-logging==0.30b1
opentelemetry-instrumentation-sqlalchemy==0.30b1

I'll look out for what conditions trigger it / if I can come up with a repro case that hangs for many minutes, but here's at least the initial information.

@alkurbatov

alkurbatov commented Nov 28, 2022

Hello colleagues.
I experience the same issue :(

My setup:

  • jaeger-all-in-one (including collector) Docker container, with gRPC used to deliver tracing data
  • my async service in a Docker container with proper handling of signals (SIGINT, SIGTERM)

The problem
If for some reason the all-in-one image goes down before my service shuts down, the service fails to send tracing data and goes into an endless retry loop with different timeouts. In that case the only way I can stop it is to send SIGKILL.

The reason is exactly as discussed above: the thread which sends the data never exits because it doesn't give up after a number of failed retries.

The question is: are there any plans to handle this issue?
To me it would be enough to provide a kind of cancellation API to forcibly finish the task during service shutdown, or even to let us specify a maximum number of retries before giving up.

@cazador481

I am still seeing this problem. Is there a workaround?

Error:

^CException ignored in atexit callback: <bound method MeterProvider.shutdown of <opentelemetry.sdk.metrics._internal.MeterProvider object at 0x7fffe6ca54e0>>
Traceback (most recent call last):
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/__init__.py", line 440, in shutdown
    metric_reader.shutdown(
  File "/home/nv/utils/ci/nvci_create/1.2.1/lib/python3.10/site-packages/opentelemetry/sdk/metrics/_internal/export/__init__.py", line 543, in shutdown
    self._daemon_thread.join(
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1100, in join
    self._wait_for_tstate_lock(timeout=max(timeout, 0))
  File "/home/utils/Python/builds/3.10.10-20230601/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):

@clumsy

clumsy commented Mar 11, 2024

I wonder if this could be caused by this deadlock issue in ProcessPoolExecutor. In that case, the workaround is switching to multiprocessing.Pool or to the latest Python version.
