Hazelcast, Flask Gunicorn with Eventlet hangs #205

Open
alexjironkin opened this issue Apr 21, 2020 · 9 comments

@alexjironkin

Hi,

We are trying to use Hazelcast in a Flask service. The service runs under gunicorn with eventlet workers. With this configuration the client never connects; when we switch to the sync worker, everything works fine. I stepped into the reactor with pdb and found that the queue module is patched with the queue from eventlet.

I guess the questions are:

  • Is there a way to fix this?

  • Are any gunicorn async workers supported with Hazelcast, e.g. gevent?

  • Python=3.6

  • Hazelcast = 3.12.1

@burakcelebi
Member

Hi @alexjironkin, thank you for submitting this! Interesting case. @mdumandag is on it.

@alexjironkin
Author

Let me know if further information or troubleshooting is required.

@mdumandag
Contributor

Hi @alexjironkin
After reading about the execution model of eventlet (and of gevent, since they are similar in this sense) and debugging with a sample application, I found the problem. Let me try to describe it.

Gunicorn monkey patches the thread and threading modules (see https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/geventlet.py#L124; by default eventlet monkey patches certain system modules, including thread and threading, see https://eventlet.net/doc/basic_usage.html#patching-functions). What that means is that when you start a new thread in a monkey-patched application, it will not run as a standard thread. Instead, it will run as an eventlet coroutine (see the image at https://eventlet.net/doc/threading.html). The thing that causes the problem is the switching between these coroutines.
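
To make that concrete, here is a minimal sketch (nothing client-specific, just stock eventlet; it assumes eventlet is installed) of what the patching does to threads:

import eventlet
eventlet.monkey_patch()  # replaces thread/threading, time, socket, ... with green versions

import threading

def worker():
    # Under the patch, this "thread" is actually a greenlet scheduled
    # cooperatively by the eventlet hub rather than an OS thread.
    print("current thread:", threading.current_thread())

t = threading.Thread(target=worker)
t.start()
t.join()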

If they were standard threads, Python would perform context switches between them even if they were doing blocking work. That has some overhead, but it requires no co-operation between threads.

Eventlet uses a different approach: it depends on the principle of co-operation, meaning it requires coroutines to yield when they are about to block, so that other coroutines can still be executed.

The problem arises from the fact that our reactor module does not perform any form of yielding in its loop function. So, when the reactor thread is started as a coroutine due to monkey patching, no switch between coroutines ever happens, and the application becomes unresponsive, executing the instructions inside the loop function all the time.
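
To illustrate (a hypothetical loop, not the actual reactor code), the difference between a loop that blocks and one that cooperates looks roughly like this:

import time

def do_io_work():
    # stand-in for asyncore.loop(count=1, ...) plus timer checks
    pass

def blocking_loop(iterations=1000):
    # Never yields: under eventlet this greenlet monopolizes the hub for as
    # long as it runs, so no other greenlet (e.g. request handlers) can run.
    for _ in range(iterations):
        do_io_work()

def cooperative_loop(iterations=1000):
    # time.sleep(0) is monkey patched by eventlet, so each iteration hands
    # control back to the hub and other greenlets can make progress.
    for _ in range(iterations):
        time.sleep(0)
        do_io_work()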

So, the possible solutions are:

  • Find a way to unpatch the threading module. I tried to find a way to do so, but I couldn't succeed. Even if we could, I am not sure how that would play with eventlet itself.
  • Perform a cooperative switch in the loop function I linked above. To do so, we need to add something like time.sleep(0) before the following line: https://github.com/hazelcast/hazelcast-python-client/blob/master/hazelcast/reactor.py#L45. time.sleep will be monkey patched by eventlet and will perform the switch between coroutines. So, we need to monkey patch the loop function. It feels dirty, but I think this is the safest way :)

So, code like this needs to be executed only once, before you start any Hazelcast clients.

import asyncore
import hazelcast
import select
import time

from hazelcast.future import Future
from hazelcast.reactor import AsyncoreReactor

def patched_loop(self):
    self.logger.debug("Starting Reactor Thread", extra=self._logger_extras)
    Future._threading_locals.is_reactor_thread = True
    while self._is_live:
        try:
            # time.sleep is monkey patched by eventlet, so sleep(0) yields to
            # the hub and lets other greenlets run before the next iteration
            time.sleep(0)
            asyncore.loop(count=1, timeout=0.01, map=self._map)
            self._check_timers()
        except select.error:
            self.logger.warning("Connection closed by server", extra=self._logger_extras)
            pass
        except:
            self.logger.exception("Error in Reactor Thread", extra=self._logger_extras)
            return
    self.logger.debug("Reactor Thread exited. %s" % self._timers.qsize(), extra=self._logger_extras)
    self._cleanup_all_timers()


# Apply the patch once, before any HazelcastClient is created
AsyncoreReactor._loop = patched_loop

Hope that helps.

@alexjironkin
Author

alexjironkin commented Apr 27, 2020

@mdumandag thanks for the info, very insightful. We tried patching the function as described to see if that would work; however, it didn't:

INFO: [3.12.1] [hazelcast] [hz.client_0] Connecting to Address(host=hazelcast, port=5701)
[2020-04-27 12:47:57,157] INFO in cluster: [hazelcast] [hz.client_0] Connecting to Address(host=hazelcast, port=5701)
[2020-04-27 12:52:53 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:45)
[2020-04-27 12:52:53 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:46)
[2020-04-27 12:52:53 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:47)
Apr 27, 2020 12:52:53 PM HazelcastClient.ClusterService
WARNING: [3.12.1] [hazelcast] [hz.client_0] Error connecting to Address(host=hazelcast, port=5701) 
Traceback (most recent call last):
  File "/opt/app-root/lib/python3.6/site-packages/hazelcast/cluster.py", line 124, in _connect_to_cluster
    self._connect_to_address(address)
  File "/opt/app-root/lib/python3.6/site-packages/hazelcast/cluster.py", line 171, in _connect_to_address
    connection = f.result()
  File "/opt/app-root/lib/python3.6/site-packages/hazelcast/future.py", line 59, in result
    self._event.wait()
  File "/opt/app-root/lib/python3.6/site-packages/hazelcast/future.py", line 170, in wait
    self.condition.wait()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/threading.py", line 295, in wait
    waiter.acquire()
  File "/opt/app-root/lib/python3.6/site-packages/eventlet/semaphore.py", line 115, in acquire
    hubs.get_hub().switch()
  File "/opt/app-root/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 298, in switch
    return self.greenlet.switch()
  File "/opt/app-root/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 350, in run
    self.wait(sleep_time)
  File "/opt/app-root/lib/python3.6/site-packages/eventlet/hubs/poll.py", line 77, in wait
    time.sleep(seconds)
  File "/opt/app-root/lib/python3.6/site-packages/gunicorn/workers/base.py", line 201, in handle_abort
    sys.exit(1)
SystemExit: 1
Apr 27, 2020 12:52:53 PM HazelcastClient.ClusterService
WARNING: [3.12.1] [hazelcast] [hz.client_0] Unable to get alive cluster connection, attempt 1 of 2, trying again in 3 seconds
[2020-04-27 12:52:53,779] WARNING in cluster: [hazelcast] [hz.client_0] Unable to get alive cluster connection, attempt 1 of 2, trying again in 3 seconds

It looks like it never got its future result and was deadlocked. Looking at the timestamps, it spent 5 minutes trying, even though the message says attempt 1 of 2. 5 minutes is also our timeout for gunicorn workers, so the worker got killed after that time (hence handle_abort). This feels like another place, future.py, where some patching is needed?
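
As a sanity check (a hedged sketch, not something from the client itself), we could log from inside the worker whether the relevant modules really are monkey patched and whether the reactor patch was applied before the client is constructed:

import eventlet.patcher
from hazelcast.reactor import AsyncoreReactor

# eventlet.patcher.is_monkey_patched reports whether eventlet has replaced
# the named module; the reactor _loop should point at the patched function.
print("thread patched:", eventlet.patcher.is_monkey_patched("thread"))
print("time patched:", eventlet.patcher.is_monkey_patched("time"))
print("reactor loop:", AsyncoreReactor._loop)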

@mdumandag
Contributor

mdumandag commented Apr 27, 2020

Hi again @alexjironkin

I am able to connect to a cluster with the following steps. What can I do to reproduce your problem? Can you guide me with your Hazelcast configuration or extra parameters for gunicorn, if any? Or, better, could you share reproducer code?

Start a member

docker run -p 5701:5701 hazelcast/hazelcast:3.12.6

Put the following code into app.py

import asyncore
import hazelcast
import select
import time

from hazelcast.future import Future
from hazelcast.reactor import AsyncoreReactor

def patched_loop(self):
    self.logger.debug("Starting Reactor Thread", extra=self._logger_extras)
    Future._threading_locals.is_reactor_thread = True
    while self._is_live:
        try:
            time.sleep(0)
            asyncore.loop(count=1, timeout=0.01, map=self._map)
            self._check_timers()
        except select.error:
            self.logger.warning("Connection closed by server", extra=self._logger_extras)
            pass
        except:
            self.logger.exception("Error in Reactor Thread", extra=self._logger_extras)
            return
    self.logger.debug("Reactor Thread exited. %s" % self._timers.qsize(), extra=self._logger_extras)
    self._cleanup_all_timers()


AsyncoreReactor._loop = patched_loop

from flask import Flask
app = Flask(__name__)

@app.route('/')
def hello_world():
    client = hazelcast.HazelcastClient()
    m = client.get_map("test").blocking()
    m.put(1, 2)
    return str(m.get(1))

Start serving

gunicorn app:app -w 4 --worker-class eventlet

Then, visiting http://127.0.0.1:8000/ displays 2 as the return value.

@alexjironkin
Author

alexjironkin commented Apr 27, 2020

Let me try to get you a working example. In the meantime, we use:
hazelcast: 3.12.1
hazelcast-python-client: 3.12.1
eventlet: 0.25.1
gunicorn: 20.0.4
flask: 1.0.2
python: 3.6.9

@mdumandag
Contributor

@alexjironkin I am able to connect to a 3.12.1 member with the given versions of the packages, using the application I posted above, on a Linux laptop.

Do you have any extra configuration for Hazelcast or the Hazelcast client?

@alexjironkin
Author

@mdumandag We managed to get this working using monkey patching, as described above, thanks. I guess the final question is: how does this become a more permanent fixture in the Hazelcast Python client?

Are you OK with time.sleep(0) in the next release? Presumably that opens up usage in async Flask web services for you.

@mdumandag
Contributor

@alexjironkin, I am glad that you were able to make it work.

Since this part of the client is in the hot path, I don't think we can put a sleep there as a permanent fix for now. I don't know by how much, but I suspect it would result in some slowdown. I recommended it because, in your use case, it was the only feasible way.

We are working on the 4.0 release of the client now. In this release, we will be introducing some breaking changes. Maybe we can spend some time before the release to find a way that is both performant and compatible with frameworks like eventlet. So, I am going to keep this issue open for now. If you have ideas about it, please let us know.
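
One possible direction, just a sketch of the idea and not the client's actual code: detect at reactor start-up whether eventlet has patched the thread module, and only pay for the cooperative yield in that case, so the normal hot path stays unchanged.

import time

try:
    from eventlet.patcher import is_monkey_patched
except ImportError:
    def is_monkey_patched(name):
        return False

def make_yield_fn():
    if is_monkey_patched("thread"):
        return lambda: time.sleep(0)  # cooperative switch under eventlet
    return lambda: None               # no-op on the regular hot path

# Inside the reactor loop, something like:
#     yield_fn = make_yield_fn()
#     while self._is_live:
#         yield_fn()
#         asyncore.loop(count=1, timeout=0.01, map=self._map)
#         self._check_timers()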

@mdumandag mdumandag added this to the Backlog milestone Apr 29, 2021