
Ensure heartbeat_worker doesn't try to re-establish connections to workers when quit has been called #1972

Conversation

@cyberw (Collaborator) commented on Jan 14, 2022

fixes #1971
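For readers without the diff at hand: the gist of the change is to make the master's heartbeat loop bail out instead of trying to re-establish the worker connection once quit() has been called. A minimal sketch of that idea, using gevent as Locust does; the class and attribute names below are illustrative assumptions, not the actual patch:

    import gevent

    HEARTBEAT_INTERVAL = 1  # seconds (Locust's default heartbeat interval)

    class MasterRunnerSketch:
        """Illustrative only: names are assumptions, not the real Locust code."""

        def __init__(self):
            self.connection_broken = False
            self.quitting = False  # hypothetical flag set by quit()

        def quit(self):
            self.quitting = True
            # ... tell workers to stop, wait for final stats, close the ZMQ server socket ...

        def reset_connection(self):
            # ... recreate the ZMQ server socket ...
            self.connection_broken = False

        def heartbeat_worker(self):
            while True:
                gevent.sleep(HEARTBEAT_INTERVAL)
                if self.connection_broken:
                    if self.quitting:
                        # the socket was closed deliberately; stop monitoring workers
                        return
                    self.reset_connection()
                    continue
                # ... normal heartbeat bookkeeping: mark unresponsive workers as missing, etc. ...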

cyberw merged commit 402bddd into master on Jan 14, 2022
@cyberw (Collaborator, Author) commented on Jan 26, 2022

This didn't actually help as much as I had thought. It must be something else I'm hitting...

@dbfx commented on Jan 28, 2022

@cyberw I've started seeing this in our tests recently (within the last 1-2 weeks); it hadn't been happening before either.

@dbfx commented on Jan 28, 2022

Here is the output from a test run this morning:

[2022-01-28 04:01:31,247] loadforge-61f3690b4aab01/INFO/locust.main: --run-time limit reached. Stopping Locust
 Name                                                                              # reqs      # fails  |     Avg     Min     Max  Median  |   req/s failures/s
----------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregated                                                                        100071 88838(88.77%)  |     421      17   22335     100  |  753.10  684.40

[2022-01-28 04:01:31,633] loadforge-61f3690b4aab01/INFO/locust.runners: Client 'loadforge-61f3690c228ae6_a6e263c24f764aea80ef5e34155a8f3b' quit. Currently 0 clients connected.
[2022-01-28 04:01:31,736] loadforge-61f3690b4aab01/INFO/locust.runners: Client 'loadforge-61f3690c228ae6_43366e59188b4d8ebb2e66d5e47b10c1' quit. Currently 0 clients connected.
[2022-01-28 04:01:31,736] loadforge-61f3690b4aab01/INFO/locust.runners: The last worker quit, stopping test.
[2022-01-28 04:01:32,249] loadforge-61f3690b4aab01/INFO/locust.main: Shutting down (exit code 1)
[2022-01-28 04:01:32,427] loadforge-61f3690b4aab01/INFO/locust.util.exception_handler: Exception found on retry 1: -- retry after 1s
[2022-01-28 04:01:32,427] loadforge-61f3690b4aab01/ERROR/locust.util.exception_handler: ZMQ sent failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 27, in send_to_client
    self.socket.send_multipart([msg.node_id.encode(), msg.serialize()])
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 275, in send_multipart
    msg = super(_Socket, self).send_multipart(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 595, in send_multipart
    self.send(msg, SNDMORE | flags, copy=copy, track=track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 228, in send
    msg = super(_Socket, self).send(data, flags, copy, track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 547, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/util/exception_handler.py", line 13, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 29, in send_to_client
    raise RPCError("ZMQ sent failure") from e
locust.exception.RPCError: ZMQ sent failure
[2022-01-28 04:01:32,428] loadforge-61f3690b4aab01/INFO/locust.util.exception_handler: Exception found on retry 1: -- retry after 1s
[2022-01-28 04:01:32,428] loadforge-61f3690b4aab01/ERROR/locust.util.exception_handler: ZMQ sent failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 27, in send_to_client
    self.socket.send_multipart([msg.node_id.encode(), msg.serialize()])
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 275, in send_multipart
    msg = super(_Socket, self).send_multipart(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 595, in send_multipart
    self.send(msg, SNDMORE | flags, copy=copy, track=track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 228, in send
    msg = super(_Socket, self).send(data, flags, copy, track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 547, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/util/exception_handler.py", line 13, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 29, in send_to_client
    raise RPCError("ZMQ sent failure") from e
locust.exception.RPCError: ZMQ sent failure
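
The "Socket operation on non-socket" (ENOTSOCK) in the tracebacks above is what pyzmq raises when send is called on a socket that has already been closed, which fits the theory that quit() closes the master's server socket while the retrying heartbeat/RPC code is still trying to send to workers. A standalone snippet reproduces the same error outside Locust (the ROUTER socket type and port 5557 match Locust's master side):

    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.ROUTER)       # Locust's master uses a ROUTER socket
    sock.bind("tcp://127.0.0.1:5557")   # 5557 is Locust's default master port
    sock.close()                        # simulate quit() having closed the socket

    try:
        sock.send_multipart([b"worker-node-id", b"payload"])
    except zmq.ZMQError as e:
        print(e)                        # -> Socket operation on non-socket

    ctx.term()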

@cyberw (Collaborator, Author) commented on Jan 28, 2022

Perhaps it was introduced in #1935? I'm tempted to revert it...

@dbfx commented on Jan 28, 2022

I've tried running some tests to get more info, but unfortunately it isn't happening reliably in my setup. It seems to happen more often at higher loads, so it could somehow be related to that (e.g. busy workers plus a busy primary).

I agree the change in quit() seems like the likely culprit.

        gevent.sleep(0.5)  # wait for final stats report from all workers
        self.server.close()

That 0.5 s sleep seems somewhat arbitrary. It's possible that a heavily loaded worker takes longer than 0.5 seconds to send its final report, and that's when this causes an issue?
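
If the fixed sleep is indeed the window where this goes wrong, one illustrative alternative (just a sketch, not what the project actually does) would be to wait until the workers have reported quitting, bounded by a timeout, before closing the server socket:

    import gevent

    def wait_for_workers_to_quit(runner, timeout=5.0, poll_interval=0.1):
        # Illustrative sketch: poll the master's worker registry until every
        # worker has disconnected, or give up after `timeout` seconds.
        # `runner.clients.all` mirrors Locust's worker bookkeeping, but treat
        # the exact attribute name as an assumption.
        with gevent.Timeout(timeout, False):  # False => suppress the timeout instead of raising
            while len(runner.clients.all) > 0:
                gevent.sleep(poll_interval)

    # quit() could then do something like:
    #     wait_for_workers_to_quit(self)
    #     self.server.close()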

cyberw deleted the ensure-hearbeat_worker-doesnt-continue-monitoring-workers-after-quit branch on March 22, 2022 10:18
Successfully merging this pull request may close the following issue:

Quitting master doesn't prevent its heartbeat_worker greenlet from trying to re-establish connections