Skip to content

Race condition during metric instrument creation #4892

@dshivashankar1994

Description

@dshivashankar1994

Describe your environment

OS: (e.g, Ubuntu) RHEL8
Python version: (e.g., Python 3.9.10) python 3.11
SDK version: (e.g., 1.25.0) - 1.25.0
API version: (e.g., 1.25.0) - 1.25.0

What happened?

While using multiple threads with each thread trying to create a metric, I run into the race condition raising the KeyError

Why it manifests as KeyError: one thread marks an instrument ID as “already created” before the SDK has finished constructing and storing the actual instrument object. A second thread then tries to look up the instrument by ID and hits a missing dict entry.

This is a classic TOCTOU race (“time-of-check to time-of-use”) caused by split-brain state between two separate registries:
Registry A (API layer): stores a set of instrument IDs (“have we created this before?”)
Registry B (SDK layer): a dict mapping instrument ID → instrument instance (“here is the created object”)

Because these are two different objects under different lock context, one thread can “publish” the fact that an instrument exists (by inserting into the set) before it “publishes” the actual instrument instance (by inserting into the dict). Another thread then sees “already registered” and immediately does a dict lookup that can legitimately be missing at that moment → KeyError.

What should have been done is to maintain a single source of truth (one registry) and do a lock-protected get-or-create on that one structure, so “existence” and “retrievability” are the same invariant.

Steps to Reproduce

Here is a sample snippet: But unfortunately, this happens only in highly loaded machine

diff --git a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py
index 6d8d5dc..693ee3b 100644
--- a/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py
+++ b/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py
@@ -226,7 +226,8 @@ class Meter(APIMeter):
             )
             with self._instrument_id_instrument_lock:
                 return self._instrument_id_instrument[instrument_id]
-
+        import time
+        time.sleep(1)
         instrument = _Histogram(
             name,
             self._instrumentation_scope,



In [1]: import threading
   ...: from opentelemetry.sdk.metrics import MeterProvider
   ...:
   ...: def main():
   ...:     meter = MeterProvider().get_meter("race-demo")
   ...:
   ...:     N = 100
   ...:     label = "py.test.foo.shiva.logical_equal_duration"
   ...:
   ...:     barrier = threading.Barrier(N)
   ...:     results = [None] * N
   ...:     errors = [None] * N
   ...:
   ...:     def worker(i: int):
   ...:         try:
   ...:             barrier.wait()
   ...:             results[i] = meter.create_histogram(label)
   ...:         except Exception as e:
   ...:             errors[i] = e
   ...:
   ...:     threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
   ...:     for t in threads:
   ...:         t.start()
   ...:     for t in threads:
   ...:         t.join()
   ...:
   ...:     errs = [e for e in errors if e is not None]
   ...:     if errs:
   ...:         print("Errors:", {type(e).__name__ for e in errs})
   ...:         raise errs[0]
   ...:
   ...:     print("unique instrument objects:", len({id(x) for x in results}))
   ...:     print("OK")
   ...:
   ...: if __name__ == "__main__":
   ...:     main()
   ...:
An instrument with name py.test.foo.shiva.logical_equal_duration, type Histogram, unit  and description  has been created already.
An instrument with name py.test.foo.shiva.logical_equal_duration, type Histogram, unit  and description  has been created already.
An instrument with name py.test.foo.shiva.logical_equal_duration, type Histogram, unit  and description  has been created already.
...
...
Errors: {'KeyError'}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 36
     33     print("OK")
     35 if __name__ == "__main__":
---> 36     main()

Cell In[1], line 30, in main()
     28 if errs:
     29     print("Errors:", {type(e).__name__ for e in errs})
---> 30     raise errs[0]
     32 print("unique instrument objects:", len({id(x) for x in results}))
     33 print("OK")

Cell In[1], line 17, in main.<locals>.worker(i)
     15 try:
     16     barrier.wait()
---> 17     results[i] = meter.create_histogram(label)
     18 except Exception as e:
     19     errors[i] = e

File /codemill/dhamodha/projects/open/opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py:228, in Meter.create_histogram(self, name, unit, description)
    219     _logger.warning(
    220         "An instrument with name %s, type %s, unit %s and "
    221         "description %s has been created already.",
   (...)    225         description,
    226     )
    227     with self._instrument_id_instrument_lock:
--> 228         return self._instrument_id_instrument[instrument_id]
    229 import time
    230 time.sleep(1)

KeyError: 'py.test.foo.shiva.logical_equal_duration,_Histogram,,'

Expected Result

I'd expect only warning saying that the instrument is already registered instead of KeyError

Actual Result

Errors: {'KeyError'}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[1], line 36
     33     print("OK")
     35 if __name__ == "__main__":
---> 36     main()

Cell In[1], line 30, in main()
     28 if errs:
     29     print("Errors:", {type(e).__name__ for e in errs})
---> 30     raise errs[0]
     32 print("unique instrument objects:", len({id(x) for x in results}))
     33 print("OK")

Cell In[1], line 17, in main.<locals>.worker(i)
     15 try:
     16     barrier.wait()
---> 17     results[i] = meter.create_histogram(label)
     18 except Exception as e:
     19     errors[i] = e

File /codemill/dhamodha/projects/open/opentelemetry-python/opentelemetry-sdk/src/opentelemetry/sdk/metrics/_internal/__init__.py:228, in Meter.create_histogram(self, name, unit, description)
    219     _logger.warning(
    220         "An instrument with name %s, type %s, unit %s and "
    221         "description %s has been created already.",
   (...)    225         description,
    226     )
    227     with self._instrument_id_instrument_lock:
--> 228         return self._instrument_id_instrument[instrument_id]
    229 import time
    230 time.sleep(1)

KeyError: 'py.test.foo.shiva.logical_equal_duration,_Histogram,,'

Additional context

No response

Would you like to implement a fix?

None

Tip

React with 👍 to help prioritize this issue. Please use comments to provide useful context, avoiding +1 or me too, to help us triage it. Learn more here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions