
multiprocessing with restarting workers, .db files are corrupted #347

@lonlylocly

Description

Hello,

This is to report a problem we are having with prometheus_client in multiprocess mode when worker processes restart.

How do we observe the problem?

In our production environment it looks like this:

  • master process starts and spawns a set of worker processes;
  • metrics reporting is fine;
  • reconfiguration is requested, and all worker processes are replaced;
  • sometimes after this reconfiguration:
    • metrics reporting stops working (HTTP endpoint returns 500)
    • logs contain errors like the ones below, which suggest that the .db files get corrupted
    • metrics do not come back until a full restart (and removal of the corrupted .db files)
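
To make the moving parts concrete, here is a stdlib-only toy model of the setup above (this is a hypothetical, simplified file layout for illustration, not prometheus_client's actual code): each worker writes length-prefixed entries to its own PID-named file, and the master merges every file it finds at scrape time, including files left behind by workers that have already exited.

```python
import glob
import multiprocessing
import os
import struct
import tempfile

def worker(dirname):
    # Each worker owns a file keyed by its PID, loosely mirroring the
    # counter_<pid>.db naming scheme used by the client library.
    path = os.path.join(dirname, 'counter_%d.db' % os.getpid())
    with open(path, 'wb') as f:
        # One length-prefixed entry: a 4-byte length, then the payload.
        f.write(struct.pack('<i', 5) + b'hello')

def scrape(dirname):
    # The master merges every per-PID file it finds, including files
    # left on disk by workers that have since exited.
    entries = []
    for path in sorted(glob.glob(os.path.join(dirname, 'counter_*.db'))):
        with open(path, 'rb') as f:
            data = f.read()
        (length,) = struct.unpack_from('<i', data, 0)
        entries.append(struct.unpack_from('%ds' % length, data, 4)[0])
    return entries

if __name__ == '__main__':
    dirname = tempfile.mkdtemp()
    # Two "generations" of workers, as after a reconfiguration.
    for _ in range(2):
        procs = [multiprocessing.Process(target=worker, args=(dirname,))
                 for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
    # Files from both generations are merged together at scrape time.
    print(scrape(dirname))
```

In this model, the corruption window is any reader or writer observing a file in a half-written state, for example a worker replaced mid-write during reconfiguration.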

The errors may look like this:

...
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 682, in __reset
    files[file_prefix] = _MmapedDict(filename)
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 577, in __init__
    for key, _, pos in self._read_all_values():
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/core.py", line 611, in _read_all_values
    encoded = unpack_from(('%ss' % encoded_len).encode(), data, pos)[0]
error: unpack_from requires a buffer of at least 1919251561 bytes
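
The implausible size in that message is a strong hint that ordinary text bytes are being read where a length field should be: the bytes b'iter' interpreted as a little-endian 32-bit integer are exactly 1919251561. A stdlib sketch of the failure mode (simplified length-prefixed layout and a hypothetical read_entries helper, not the library's code):

```python
import struct

def read_entries(data):
    """Read length-prefixed byte strings from a buffer (simplified)."""
    pos, entries = 0, []
    while pos + 4 <= len(data):
        (length,) = struct.unpack_from('<i', data, pos)
        if length == 0:
            break
        pos += 4
        entries.append(struct.unpack_from('%ds' % length, data, pos)[0])
        pos += length
    return entries

# A healthy buffer round-trips fine.
print(read_entries(struct.pack('<i', 5) + b'hello'))  # [b'hello']

# Text bytes landing where the length belongs yield a huge bogus length:
print(struct.unpack('<i', b'iter')[0])  # 1919251561

# The reader then demands an absurdly large buffer, as in the traceback.
try:
    read_entries(b'iter' + b'hello')
except struct.error as e:
    print(e)  # unpack_from requires a buffer of at least ... bytes
```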

...
  File "/Users/vasiliev/.virtualenvs/metrics-issue27/lib/python2.7/site-packages/prometheus_client/multiprocess.py", line 42, in merge
    metric_name, name, labels = json.loads(key)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
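
This second traceback looks like the same corruption seen one layer up: the merge line expects each key to be a JSON-encoded three-element list, so a reader that picks up garbage bytes hands json.loads something that is not JSON at all. A minimal illustration (the key contents here are invented for the example):

```python
import json

# A well-formed key, shaped like the three-element list that the
# merge code unpacks (the values are made up).
good_key = json.dumps(["requests_total", "requests_total",
                       [["method", "GET"]]])
metric_name, name, labels = json.loads(good_key)
print(metric_name)  # requests_total

# Bytes read from a corrupted file are rarely valid JSON.
try:
    json.loads("\x00\x00garbage")
except ValueError as e:
    # A JSON decoding error (a ValueError), as in the traceback above.
    print(type(e).__name__)
```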

How to reproduce the problem?

This is very tricky to reproduce in an isolated environment, and it certainly does not fit in the description of a GitHub issue, so I have put the code and instructions for reproducing it here:

https://github.com/lonlylocly/prometheus_client_concurrency_issue

Basically, it is our production script stripped down to a minimal version. It reproduces the problem maybe 50% of the time.

What do I want from this issue?

I must admit that we are lost and cannot figure out how to mitigate this issue. We love Prometheus and really like the convenience of prometheus_client, but metrics reporting regularly breaks with this problem, and we would like to eliminate it.

I would appreciate any sort of suggestion or advice, and I am also willing to help via a PR (if we manage to figure out a workaround; personally, I don't even know where to start).

Thank you!
