Skip to content

Ensure heartbeat_worker doesnt try to re-establish connection to workers when quit has been called#1972

Merged
cyberw merged 2 commits intomasterfrom
ensure-hearbeat_worker-doesnt-continue-monitoring-workers-after-quit
Jan 14, 2022
Merged

Ensure heartbeat_worker doesnt try to re-establish connection to workers when quit has been called#1972
cyberw merged 2 commits intomasterfrom
ensure-hearbeat_worker-doesnt-continue-monitoring-workers-after-quit

Conversation

@cyberw
Copy link
Copy Markdown
Collaborator

@cyberw cyberw commented Jan 14, 2022

fixes #1971

@cyberw cyberw merged commit 402bddd into master Jan 14, 2022
@cyberw
Copy link
Copy Markdown
Collaborator Author

cyberw commented Jan 26, 2022

This didnt actually help as much as I had thought. Must be something else I'm hitting...

@dbfx
Copy link
Copy Markdown

dbfx commented Jan 28, 2022

@cyberw I've started seeing this on our tests recently (last 1-2 weeks) and hadn't been before either.

@dbfx
Copy link
Copy Markdown

dbfx commented Jan 28, 2022

Here is the output from a test run this morning:

[2022-01-28 04:01:31,247] loadforge-61f3690b4aab01/INFO/locust.main: --run-time limit reached. Stopping Locust
 Name                                                                              # reqs      # fails  |     Avg     Min     Max  Median  |   req/s failures/s
----------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregated                                                                        100071 88838(88.77%)  |     421      17   22335     100  |  753.10  684.40

[2022-01-28 04:01:31,633] loadforge-61f3690b4aab01/INFO/locust.runners: Client 'loadforge-61f3690c228ae6_a6e263c24f764aea80ef5e34155a8f3b' quit. Currently 0 clients connected.
[2022-01-28 04:01:31,736] loadforge-61f3690b4aab01/INFO/locust.runners: Client 'loadforge-61f3690c228ae6_43366e59188b4d8ebb2e66d5e47b10c1' quit. Currently 0 clients connected.
[2022-01-28 04:01:31,736] loadforge-61f3690b4aab01/INFO/locust.runners: The last worker quit, stopping test.
[2022-01-28 04:01:32,249] loadforge-61f3690b4aab01/INFO/locust.main: Shutting down (exit code 1)
[2022-01-28 04:01:32,427] loadforge-61f3690b4aab01/INFO/locust.util.exception_handler: Exception found on retry 1: -- retry after 1s
[2022-01-28 04:01:32,427] loadforge-61f3690b4aab01/ERROR/locust.util.exception_handler: ZMQ sent failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 27, in send_to_client
    self.socket.send_multipart([msg.node_id.encode(), msg.serialize()])
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 275, in send_multipart
    msg = super(_Socket, self).send_multipart(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 595, in send_multipart
    self.send(msg, SNDMORE | flags, copy=copy, track=track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 228, in send
    msg = super(_Socket, self).send(data, flags, copy, track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 547, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/util/exception_handler.py", line 13, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 29, in send_to_client
    raise RPCError("ZMQ sent failure") from e
locust.exception.RPCError: ZMQ sent failure
[2022-01-28 04:01:32,428] loadforge-61f3690b4aab01/INFO/locust.util.exception_handler: Exception found on retry 1: -- retry after 1s
[2022-01-28 04:01:32,428] loadforge-61f3690b4aab01/ERROR/locust.util.exception_handler: ZMQ sent failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 27, in send_to_client
    self.socket.send_multipart([msg.node_id.encode(), msg.serialize()])
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 275, in send_multipart
    msg = super(_Socket, self).send_multipart(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 595, in send_multipart
    self.send(msg, SNDMORE | flags, copy=copy, track=track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 228, in send
    msg = super(_Socket, self).send(data, flags, copy, track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 547, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/util/exception_handler.py", line 13, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 29, in send_to_client
    raise RPCError("ZMQ sent failure") from e
locust.exception.RPCError: ZMQ sent failure

@cyberw
Copy link
Copy Markdown
Collaborator Author

cyberw commented Jan 28, 2022

Perhaps it was introduced in #1935 ? I'm tempted to revert it...

@dbfx
Copy link
Copy Markdown

dbfx commented Jan 28, 2022

I've tried to do some tests to get more info but it's not reliably happening in my setup unfortunately. It seems to happen more at higher loads so it could be somehow related to that (e.g. busy workers + primary).

The change in quit() seems to be quite likely I agree.

        gevent.sleep(0.5)  # wait for final stats report from all workers
        self.server.close()

This 0.5 sleep is somewhat random I guess. It's possible that a heavily loaded worker will take > 0.5 seconds and that's when this then causes an issue?

@cyberw cyberw deleted the ensure-hearbeat_worker-doesnt-continue-monitoring-workers-after-quit branch March 22, 2022 10:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Quitting master doesnt prevent its heartbeat_worker greenlet from trying to re-establish connections

2 participants