
Handle multiprocess setups using preloading and equivalents #127

Closed
Lispython opened this issue Jan 16, 2017 · 26 comments

@Lispython

Sometimes the application raises an error at this line: https://github.com/prometheus/client_python/blob/master/prometheus_client/core.py#L361

Stacktrace (most recent call last):

  File "flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "flask/app.py", line 1473, in full_dispatch_request
    rv = self.preprocess_request()
  File "flask/app.py", line 1666, in preprocess_request
    rv = func()
  File "core/middleware.py", line 43, in before_request_middleware
    metrics.requests_total.labels(env_role=metrics.APP_ENV_ROLE, method=request.method, url_rule=rule).inc()
  File "prometheus_client/core.py", line 498, in labels
    self._metrics[labelvalues] = self._wrappedClass(self._name, self._labelnames, labelvalues, **self._kwargs)
  File "prometheus_client/core.py", line 599, in __init__
    self._value = _ValueClass(self._type, name, name, labelnames, labelvalues)
  File "prometheus_client/core.py", line 414, in __init__
    files[file_prefix] = _MmapedDict(filename)
  File "prometheus_client/core.py", line 335, in __init__
    for key, _, pos in self._read_all_values():
  File "prometheus_client/core.py", line 361, in _read_all_values
    encoded = struct.unpack_from('{0}s'.format(encoded_len).encode(), self._m, pos)[0]

We run the application via uwsgi:

[uwsgi]
chdir=/usr/src/app/
env = APP_ROLE=dev_uwsgi
wsgi-file = /usr/src/app/app.wsgi
master=True
vacuum=True
max-requests=5000
harakiri=120
post-buffering=65536
workers=16
listen=4000
# socket=0.0.0.0:8997
stats=/tmp/uwsgi-app.stats
logger=syslog:uwsgi_app_stage,local0
buffer-size=65536
http = 0.0.0.0:8051
@brian-brazil
Contributor

That sounds like data corruption. When you restart the app and blow away the data, does this reoccur?

To be on the safe side, are you running on a big or little endian system?

@Lispython
Author

little endian

@brian-brazil
Contributor

If you have further debugging information, please reopen.

@stefano-pogliani

Hello, we are having the same problem.
We have a Flask app run by gunicorn (with two workers at this time).
The exception is raised by a Summary with a label used to track time spent in different states.

Possibly useful bits of our code:

from prometheus_client import Summary

STATUS_TIME = Summary('<name>', 'Time (in seconds) spent in <STATUS>', ['status'])
# ...
def _record_time_in_current_status(self):
    # ...
    STATUS_TIME.labels(status=status).observe(duration)  # <-- this often works but sometimes fails.

Our gunicorn config is:

from prometheus_client import multiprocess

# *** Global configuration *** #
bind = '0.0.0.0:5000'
workers = 2

# *** Server hooks *** #
def child_exit(server, worker):
    # Needed for metrics to be consistent when multiple workers exist.
    # -> https://github.com/prometheus/client_python#multiprocess-mode-gunicorn
    multiprocess.mark_process_dead(worker.pid)

Is there any other information we can provide?
What should we look for to debug further?

@brian-brazil
Contributor

Is it the exact same number of bytes it's reporting?

@stefano-pogliani

No; for us the required buffer size is reported as 1952543859 bytes.
We have experienced this exception 14 times, all with the same number.

I forgot to mention that we started seeing these exceptions when we began testing our app with many concurrent users.
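
As an aside that may help with debugging: both of the huge "required buffer" numbers decode to four bytes of ASCII text when read as a little-endian 32-bit length, which is what a misaligned length read over text data in the mmap'd file would produce. A quick check (plain Python, nothing client-specific; the b'stat' connection to the 'status' label above is our guess):

import struct

# 1952543859 is the four bytes b'stat' read as a little-endian int
print(struct.unpack('<i', b'stat')[0])  # -> 1952543859
# 1819440418, the number from the original report, is the bytes b'"url'
print(struct.unpack('<i', b'"url')[0])  # -> 1819440418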

@brian-brazil
Contributor

Can you figure out which file is the issue, and attach it here?

@stefano-pogliani

We lost the file when the service was restarted, but I'll try to capture it if/when this happens again.

@stefano-pogliani

Thanks for the very quick response 😄

@brian-brazil reopened this May 5, 2017
@brian-brazil
Contributor

Just to verify: are the worker processes fork-and-exec'ed? Pre-fork approaches don't work with the current multiprocess mode.

@stefano-pogliani

Hello, I finally had the time to check.
We do not explicitly set preload_app = True or load our Flask app in the gunicorn master, if that matters.

It does look like gunicorn master forks workers but does not exec* after forking: https://github.com/benoitc/gunicorn/blob/19.4.5/gunicorn/arbiter.py#L498

Does that mean gunicorn is not actually a supported use case or are we using it wrong?
The README suggests that it is supported.

@stefano-pogliani

I was looking into why multiprocess mode requires fork & exec and not just fork.
I was also looking at what happens when gunicorn only forks.

By adding log statements to prometheus_client.core I noticed the behaviour below (which may be unrelated, but it is suspicious to me).
The logging code added to prometheus_client.core:

import logging
log = logging.getLogger('test')

def _MultiProcessValue(__pid=os.getpid()):
    ...
    log.error("~~~ PID: %s", pid)

    class _MmapedValue(object):
        ...
        def __init__(...):
            ...
            self._lock = Lock()
            log.error("~~~ PID2: %s => %s", pid, os.getpid())
        ...

In order to configure gunicorn to clean up live gauges, the gunicorn configuration module imports prometheus_client.multiprocess.
This in turn imports prometheus_client.core, which runs _ValueClass = _MultiProcessValue() at import time: https://github.com/prometheus/client_python/blob/ce7f2978499dbd24e1028ef8966e50f374f51f5a/prometheus_client/core.py

At this stage, os.getpid() is executed in the (gunicorn) master process.
When the process forks, children inherit the pre-initialised prometheus_client.core module with the master pid and never update to use their own pid.

By removing the import from the config module (I removed the cleanup hook just for testing; a better way is needed), prometheus_client.core is left uninitialised in the master process, and each worker initialises it with its own pid instead of all using the master pid.

Is it possible to re-work prometheus_client.multiprocess to avoid the top-level import of prometheus_client.core?
I know that nested imports are bad; maybe move the MultiProcessCollector to a separate file that can be skipped while configuring the master process.
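
To make the failure mode concrete, here is a minimal standalone sketch (hypothetical names, not client_python code): a default argument is evaluated once, when the enclosing def runs at import time, so a pid captured that way survives fork.

import os

def make_value_class(pid=os.getpid()):  # default evaluated once, at import time
    class Value(object):
        def filename(self):
            # every instance, in every process, uses the pid captured above
            return 'counter_{0}.db'.format(pid)
    return Value

Value = make_value_class()  # runs in the master if this module is imported pre-fork

child = os.fork()
if child == 0:
    # the forked child still embeds the parent's pid in its filename
    print('child %d writes to %s' % (os.getpid(), Value().filename()))
    os._exit(0)
os.waitpid(child, 0)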

@lumengxi

lumengxi commented May 9, 2017

We have the same problem in production, and I agree with @stefano-pogliani's analysis.

@brian-brazil
Contributor

Does that mean gunicorn is not actually a supported use case or are we using it wrong?

It should work; something else must be enabling pre-loading on you.

We don't support the multiprocessing module currently. We need each child to start with a fresh interpreter.

I know that nested imports are bad; maybe move the MultiProcessCollector to a separate file that can be skipped while configuring the master process.

I'm not sure that works for all use cases.

@stefano-pogliani

Does that mean gunicorn is not actually a supported use case or are we using it wrong?

It should work; something else must be enabling pre-loading on you.

I checked the code in gunicorn from various versions and none of them seems to start workers with fresh interpreters; they all just fork.
I've looked at our code and made sure that nothing is imported in the master process; the only exception was the prometheus_client package, as explained above.

I even tried raising an exception at import time of our app: the workers attempt to import the app and fail, while the master proceeds fine, which suggests the master does not import the app module.

Any idea what else I could check, or of a known-working gunicorn-plus-Prometheus app I can compare ours with?
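
For reference, the import-time test mentioned above amounts to temporarily putting something like this (hypothetical debugging code, to be removed afterwards) at the top of the app module:

import os

# If the gunicorn master imported the app module, the master itself would
# crash before forking any workers; in practice only the workers fail,
# so the master never imports this module.
raise RuntimeError('app module imported by pid %d' % os.getpid())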

@brian-brazil
Contributor

To be more precise, no prometheus_client code should be imported until after the fork.

What I've had in mind is to check the pid on every metric change, and reset things if it changes. This would be a 10% performance hit.
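
A rough sketch of that idea (hypothetical code, not necessarily what the eventual fix will look like): record the pid at initialisation, compare it against os.getpid() on every mutation, and rebuild per-process state when they differ.

import os

class PidAwareValue(object):
    def __init__(self):
        self._pid = os.getpid()
        self._value = 0.0

    def _check_pid(self):
        # detect that we are now running in a forked child and start fresh
        if self._pid != os.getpid():
            self._pid = os.getpid()
            self._value = 0.0

    def inc(self, amount=1.0):
        self._check_pid()  # the per-change cost behind the 10% estimate
        self._value += amount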

@gerricom

We don't support the multiprocessing module currently. We need each child to start with a fresh interpreter.

So just to get things straight: Using prometheus_client in a multiprocessing environment, as used in uwsgi or gunicorn, is not recommended at the moment - right?

And to make things clear to myself: A solution is not that easy, because most solutions would result in performance issues when tracking lots of metrics?

I'm asking the latter question because I'm thinking about developing a custom client using an approach like the one in the PHP library for Prometheus, where Redis is used as an in-memory datastore for metrics.

@brian-brazil
Contributor

Using prometheus_client in a multiprocessing environment, as used in uwsgi or gunicorn, is not recommended at the moment - right?

That works fine as long as you're not preloading; the multiprocessing module does not.

A solution is not that easy, because most solutions would result in performance issues when tracking lots of metrics?

There'll be a performance hit, but it's tolerable. It's more that I haven't gotten around to it yet.

I'm asking the latter question because I'm thinking about developing a custom client using an approach like the one in the PHP library for Prometheus, where Redis is used as an in-memory datastore for metrics.

Anything involving the network is going to be orders of magnitude slower.

@gerricom

Thank you for that very swift response!

That works fine as long as you're not preloading; the multiprocessing module does not.

Okay! That makes it clearer to me. As I'm still struggling with the vast amount of uwsgi config options, I'll give gunicorn (without preloading) a try and see if I can get better results here. Thanks again!

@brian-brazil changed the title from "unpack_from requires a buffer of at least 1819440418 bytes" to "Handle multiprocess setups using preloading and equivalents" May 23, 2017
@jgjay

jgjay commented May 25, 2017

FWIW I've run into this problem, and after some debugging discovered that it only presents itself when you add the child_exit hook to the gunicorn config file. Doing so imports the client library code in the gunicorn master process and appears to result in metric files only being generated for the master PID, and the workers sharing these, leading to the data corruption.

As a workaround we found that importing the client library code inside the child_exit hook function seems to avoid the issue:

def child_exit(server, worker):
    from prometheus_client import multiprocess
    multiprocess.mark_process_dead(worker.pid)

@brian-brazil
Contributor

Hmm, wouldn't that still have the problem for any child processes started after the first one exits?

@jgjay

jgjay commented May 26, 2017

It's strange, but it doesn't appear to be a problem. I tested this by killing off workers, and what I saw is that the child processes started to replace them each got their own metric data file, correctly associated with their PID. I have to say I don't fully understand why this is the case.

@brian-brazil
Contributor

Can ye try #169?

@brian-brazil
Contributor

We've got one confirmation that #169 works as expected. Has anyone else tested it to confirm it fixes their use case?

@stefano-pogliani

I tested the code in the PR and it seems to be working properly now.
I've also checked which DB files get created (to make sure the pids are correct) and they now have the expected names.

Thanks 👍

@brian-brazil
Contributor

That's merged now; the docs still need an update.
