Hazelcast, Flask Gunicorn with Eventlet hangs #205
Comments
Hi @alexjironkin, thank you for submitting this! Interesting case. @mdumandag is on it.

Let me know if further information or troubleshooting is required.
Hi @alexjironkin, Gunicorn monkey patches the thread and threading modules (see https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/geventlet.py#L124; by default, eventlet monkey patches certain system modules, including thread and threading: https://eventlet.net/doc/basic_usage.html#patching-functions). That means that when you start a new thread in a monkey-patched application, it will not run as a standard thread. Instead, it will run as an eventlet coroutine (see the image at https://eventlet.net/doc/threading.html).

What causes the problem is the switching between these coroutines. If they were standard threads, Python would perform context switches between them even while they were doing blocking work. That has some overhead, but it requires no co-operation between threads. Eventlet takes a different approach, based on the principle of co-operation: coroutines must yield when they are about to block, so that the other coroutines can still run. The problem arises from the fact that our reactor module does not perform any form of yielding in its loop function. So, when the reactor thread is started as a coroutine due to monkey patching, no switch between coroutines ever happens, and the application becomes unresponsive, executing the instructions inside the loop function forever. So, the possible solutions are
So, code like this needs to be executed only once, before you start any Hazelcast clients:

```python
import asyncore
import hazelcast
import select
import time

from hazelcast.future import Future
from hazelcast.reactor import AsyncoreReactor


def patched_loop(self):
    self.logger.debug("Starting Reactor Thread", extra=self._logger_extras)
    Future._threading_locals.is_reactor_thread = True
    while self._is_live:
        try:
            time.sleep(0)  # yield to the other eventlet coroutines
            asyncore.loop(count=1, timeout=0.01, map=self._map)
            self._check_timers()
        except select.error:
            self.logger.warning("Connection closed by server", extra=self._logger_extras)
        except:
            self.logger.exception("Error in Reactor Thread", extra=self._logger_extras)
            return
    self.logger.debug("Reactor Thread exited. %s" % self._timers.qsize(), extra=self._logger_extras)
    self._cleanup_all_timers()


AsyncoreReactor._loop = patched_loop
```

Hope that helps!
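The co-operation principle described above can be sketched with plain generators (no eventlet involved; purely illustrative). A coroutine that never yields runs all of its work in one uninterrupted burst and starves the others, which is exactly what the unpatched reactor loop does under eventlet:

```python
from collections import deque


def well_behaved(log):
    for i in range(3):
        log.append("well-behaved step %d" % i)
        yield  # co-operate: give the other coroutines a turn


def greedy(log):
    # Stand-in for the unpatched reactor loop: it does all of its work
    # without yielding, so nothing else runs while it is scheduled.
    for i in range(3):
        log.append("greedy step %d" % i)
    yield  # only yields once, after all the work is done


def run(coros):
    # Minimal round-robin scheduler: give each coroutine a turn until done.
    queue = deque(coros)
    while queue:
        coro = queue.popleft()
        try:
            next(coro)
            queue.append(coro)
        except StopIteration:
            pass


log = []
run([well_behaved(log), greedy(log)])
print(log)
```

The well-behaved coroutine interleaves with the others, while the greedy one emits all three of its steps in a single burst; if its loop were infinite, as the reactor's is, the scheduler would never regain control.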
@mdumandag thanks for the info, very insightful. We tried to patch this function as described, to see if that would work; however, it didn't:

It looks like it never got its future result and was deadlocked. Looking at the timestamp, it spent 5 minutes trying; however, the message says 1 of 2 attempts. 5 minutes is also our timeout for gunicorn workers, so the worker got killed after that time (hence
Hi again @alexjironkin, I am able to connect to the cluster with the following steps. What can I do to reproduce your problem? Can you walk me through your Hazelcast configuration, or any extra parameters for gunicorn, if any? Or, better yet, could you share a reproducer?

Start a member:

```
docker run -p 5701:5701 hazelcast/hazelcast:3.12.6
```

Put the following code into app.py:

```python
import asyncore
import hazelcast
import select
import time

from hazelcast.future import Future
from hazelcast.reactor import AsyncoreReactor


def patched_loop(self):
    self.logger.debug("Starting Reactor Thread", extra=self._logger_extras)
    Future._threading_locals.is_reactor_thread = True
    while self._is_live:
        try:
            time.sleep(0)  # yield to the other eventlet coroutines
            asyncore.loop(count=1, timeout=0.01, map=self._map)
            self._check_timers()
        except select.error:
            self.logger.warning("Connection closed by server", extra=self._logger_extras)
        except:
            self.logger.exception("Error in Reactor Thread", extra=self._logger_extras)
            return
    self.logger.debug("Reactor Thread exited. %s" % self._timers.qsize(), extra=self._logger_extras)
    self._cleanup_all_timers()


AsyncoreReactor._loop = patched_loop

from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello_world():
    client = hazelcast.HazelcastClient()
    m = client.get_map("test").blocking()
    m.put(1, 2)
    return str(m.get(1))
```

Start serving:

```
gunicorn app:app -w 4 --worker-class eventlet
```

Then, http://127.0.0.1:8000/ displays 2 as the return value.
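Note that the `AsyncoreReactor._loop = patched_loop` line works by rebinding a function on the class itself, which is why it must run before any `HazelcastClient` is created: the reactor thread uses whatever `_loop` is bound at start time. A toy sketch of that patching pattern (hypothetical names, not the client's actual classes):

```python
class Reactor:
    """Stand-in for a class whose behavior we want to override at runtime."""

    def loop(self):
        return "original loop"


def patched_loop(self):
    return "patched loop"


# Rebinding the function on the class affects every instance, including
# instances created later, because attribute lookup goes through the class.
Reactor.loop = patched_loop

r = Reactor()
print(r.loop())  # -> patched loop
```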
Let me try to get you a working example; in the meantime, we use:
@alexjironkin I am able to connect to a 3.12.1 member with the given versions of the packages, using the application I posted above, on a Linux laptop. Do you have any extra configuration for Hazelcast or the Hazelcast client?
@mdumandag We managed to get this working using monkey patching, as described above, thanks. I guess the final question is: how does this become a more permanent fixture in the Hazelcast Python client? Are you ok with
@alexjironkin, I am glad that you were able to make it work. Since this part of the client is on the hot path, I don't think we can put a sleep there as a permanent fix for now. I don't know by how much, but I expect it would result in some slowdown. I recommended it because, in your use case, it was the only feasible way. We are working on the 4.0 release of the client now, which will introduce some breaking changes. Maybe we can spend some time before the release to find an approach that is both performant and compatible with frameworks like eventlet. So, I am going to keep this issue open for now. If you have ideas about it, please let us know.
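To put a rough number on the cost of co-operating, here is an illustrative micro-benchmark (not the client's actual reactor; timings are machine-dependent) comparing a hot loop with and without the `time.sleep(0)` call per iteration:

```python
import time


def run_loop(iterations, cooperate):
    """Time a busy loop, optionally yielding to the scheduler each iteration."""
    start = time.perf_counter()
    for _ in range(iterations):
        if cooperate:
            time.sleep(0)  # the extra call the patched reactor loop pays for
    return time.perf_counter() - start


plain = run_loop(100_000, cooperate=False)
yielding = run_loop(100_000, cooperate=True)
print("plain:    %.4f s" % plain)
print("yielding: %.4f s" % yielding)
```

Under eventlet, the same `sleep(0)` is what hands control back to the other green threads, so the overhead buys responsiveness; on a plain interpreter it is pure cost, which is the maintainer's concern about putting it on the hot path.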
Hi,

We are trying to use Hazelcast in a Flask service. This service runs with gunicorn using `eventlet` workers. With this configuration, the client never connects; when switching to the `sync` worker, everything works fine. I pdb-ed into the reactor and found that the queue is patched with the queue from `eventlet`.

I guess my questions are:

- Is there a way to fix this?
- Are any gunicorn async workers supported with Hazelcast, e.g. `gevent`?

Python = 3.6
Hazelcast = 3.12.1
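One quick way to confirm what the pdb session showed, i.e. which implementation a module-level class actually resolves to, is to inspect its `__module__`. In an unpatched interpreter the stdlib names come back; after `eventlet.monkey_patch()`, they would typically point at eventlet's green replacements instead (a diagnostic sketch, run here without eventlet):

```python
import queue
import threading

# Unpatched, these resolve to the stdlib modules. Under eventlet's monkey
# patching they would typically report eventlet's replacements instead,
# which is how the patched queue inside the reactor was spotted.
print(queue.Queue.__module__)                           # queue
print(threading.current_thread().__class__.__module__)  # threading
```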