Hello dear sirs,
This is to report a problem we have with prometheus_client in multiprocess mode when worker processes are restarted.
How do we observe the problem?
In our production environment it looks like this:
- master process starts and spawns a set of worker processes;
- metrics reporting is fine;
- reconfiguration is requested, and all worker processes are replaced;
- sometimes after this reconfiguration:
  - metrics reporting stops working (HTTP endpoint returns 500);
  - logs contain errors like the ones below, which suggest that .db files get corrupted;
  - metrics do not come back until a complete restart (and removal of the corrupted .db files).
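To find out which .db files are affected before deleting them, something like the sketch below might help. It is only a rough model: the layout it assumes (a 4-byte "used" counter plus 4 padding bytes, then entries made of a 4-byte key length, a JSON key padded to an 8-byte boundary, and an 8-byte double) is inferred from the `_read_all_values` and `json.loads(key)` frames in the errors and has not been checked against the installed version. The names `check_db_bytes` and `make_entry` are hypothetical helpers, not part of prometheus_client.

```python
import json
import struct


def make_entry(key, value):
    """Build one entry in the ASSUMED layout: length, padded key, double."""
    encoded = key.encode("utf-8")
    # Pad so that the 8-byte value that follows is 8-byte aligned.
    padded = encoded + b" " * (8 - (len(encoded) + 4) % 8)
    return struct.pack("=i", len(encoded)) + padded + struct.pack("=d", value)


def check_db_bytes(data):
    """Return None if the buffer parses cleanly, else describe the first problem."""
    if len(data) < 8:
        return "file shorter than the 8-byte header"
    used = struct.unpack_from("=i", data, 0)[0]
    if used > len(data):
        return "header claims %d used bytes, file has %d" % (used, len(data))
    pos = 8
    while pos < used:
        key_len = struct.unpack_from("=i", data, pos)[0]
        if key_len < 0 or pos + 4 + key_len > used:
            return "implausible key length %d at offset %d" % (key_len, pos)
        key = data[pos + 4:pos + 4 + key_len]
        try:
            json.loads(key.decode("utf-8"))
        except ValueError:  # also covers UnicodeDecodeError
            return "key at offset %d is not valid JSON" % (pos + 4)
        # Skip length field, padded key, and the 8-byte value.
        pos += 4 + key_len + (8 - (key_len + 4) % 8) + 8
    return None
```

It could be run over every file in the multiprocess directory (for example `check_db_bytes(open(path, "rb").read())`) so that only the files reported as broken need to be removed.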
The errors may look like this:
...
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 682, in __reset
    files[file_prefix] = _MmapedDict(filename)
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 577, in __init__
    for key, _, pos in self._read_all_values():
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 611, in _read_all_values
    encoded = unpack_from(('%ss' % encoded_len).encode(), data, pos)[0]
error: unpack_from requires a buffer of at least 1919251561 bytes
...
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/multiprocess.py", line 42, in merge
    metric_name, name, labels = json.loads(key)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
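For what it is worth, the length in the first error does not look random. A quick check, assuming a little-endian machine (the helper name `length_as_text` is made up for this illustration):

```python
import struct


def length_as_text(value):
    """Reinterpret a 32-bit length as the 4 bytes it occupies in the file."""
    raw = struct.pack("<I", value)  # assumption: little-endian host such as x86
    return raw.decode("latin-1")    # latin-1 maps every byte, so this never fails


# Length reported by "unpack_from requires a buffer of at least ... bytes"
print(length_as_text(1919251561))  # iter
```

The four bytes spell ASCII text, which would be consistent with part of a key (a metric or label name) sitting where a length field was expected. Whether that comes from two processes writing to the same file is only a guess on my part.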
How to reproduce the problem?
This is very tricky to reproduce in an isolated environment, and it certainly does not fit in the description of a GitHub issue, so I put the code and the instructions for reproducing it here:
https://github.com/lonlylocly/prometheus_client_concurrency_issue
Basically, it is our production script stripped down to a minimal version. It reproduces the problem maybe 50% of the time.
What do I want from this issue?
I must admit that we are lost and cannot figure out how this issue can be mitigated. We love Prometheus and we really like the convenience of python_client, but metrics reporting breaks with this problem on a regular basis and we would like to eliminate it.
I would appreciate any suggestion or advice, and I am also willing to help via a PR (if we manage to figure out a workaround; personally, I don't even know where to start).
Thank you!