Ensure heartbeat_worker doesnt try to re-establish connection to workers when quit has been called #1972

cyberw · 2022-01-14T12:46:35Z

fixes #1971
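As a rough illustration of the idea in the title (not the actual patch): a heartbeat loop can check a "quit was called" flag on every tick and return before it tries to reach any worker again. The sketch below uses only the standard library; `MiniMaster`, `quitting`, and `reconnect_attempts` are hypothetical names and do not come from Locust.

```python
import threading


class MiniMaster:
    """Toy stand-in for a master runner; not Locust's actual implementation."""

    def __init__(self, heartbeat_interval=0.01):
        self.heartbeat_interval = heartbeat_interval
        self.quitting = threading.Event()  # set once quit() has been called
        self.reconnect_attempts = 0

    def heartbeat_worker(self, max_ticks=50):
        """Check on workers periodically, but stop as soon as quit() was called."""
        for _ in range(max_ticks):
            # Event.wait() returns True if the flag is set, so it doubles as the sleep.
            if self.quitting.wait(self.heartbeat_interval):
                return
            # Placeholder for "try to re-establish contact with missing workers".
            self.reconnect_attempts += 1

    def quit(self):
        """Signal the heartbeat loop to stop before any sockets are closed."""
        self.quitting.set()
```

Using an event rather than a plain sleep means the loop also wakes up immediately when `quit()` is called mid-interval.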

cyberw · 2022-01-26T11:52:19Z

This didn't actually help as much as I had thought. I must be hitting something else...

dbfx · 2022-01-28T04:13:54Z

@cyberw I've started seeing this in our tests recently (the last 1-2 weeks), and I hadn't seen it before either.

dbfx · 2022-01-28T04:26:23Z

Here is the output from a test run this morning:

[2022-01-28 04:01:31,247] loadforge-61f3690b4aab01/INFO/locust.main: --run-time limit reached. Stopping Locust
 Name                                                                              # reqs      # fails  |     Avg     Min     Max  Median  |   req/s failures/s
----------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregated                                                                        100071 88838(88.77%)  |     421      17   22335     100  |  753.10  684.40

[2022-01-28 04:01:31,633] loadforge-61f3690b4aab01/INFO/locust.runners: Client 'loadforge-61f3690c228ae6_a6e263c24f764aea80ef5e34155a8f3b' quit. Currently 0 clients connected.
[2022-01-28 04:01:31,736] loadforge-61f3690b4aab01/INFO/locust.runners: Client 'loadforge-61f3690c228ae6_43366e59188b4d8ebb2e66d5e47b10c1' quit. Currently 0 clients connected.
[2022-01-28 04:01:31,736] loadforge-61f3690b4aab01/INFO/locust.runners: The last worker quit, stopping test.
[2022-01-28 04:01:32,249] loadforge-61f3690b4aab01/INFO/locust.main: Shutting down (exit code 1)
[2022-01-28 04:01:32,427] loadforge-61f3690b4aab01/INFO/locust.util.exception_handler: Exception found on retry 1: -- retry after 1s
[2022-01-28 04:01:32,427] loadforge-61f3690b4aab01/ERROR/locust.util.exception_handler: ZMQ sent failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 27, in send_to_client
    self.socket.send_multipart([msg.node_id.encode(), msg.serialize()])
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 275, in send_multipart
    msg = super(_Socket, self).send_multipart(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 595, in send_multipart
    self.send(msg, SNDMORE | flags, copy=copy, track=track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 228, in send
    msg = super(_Socket, self).send(data, flags, copy, track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 547, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/util/exception_handler.py", line 13, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 29, in send_to_client
    raise RPCError("ZMQ sent failure") from e
locust.exception.RPCError: ZMQ sent failure
[2022-01-28 04:01:32,428] loadforge-61f3690b4aab01/INFO/locust.util.exception_handler: Exception found on retry 1: -- retry after 1s
[2022-01-28 04:01:32,428] loadforge-61f3690b4aab01/ERROR/locust.util.exception_handler: ZMQ sent failure
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 27, in send_to_client
    self.socket.send_multipart([msg.node_id.encode(), msg.serialize()])
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 275, in send_multipart
    msg = super(_Socket, self).send_multipart(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 595, in send_multipart
    self.send(msg, SNDMORE | flags, copy=copy, track=track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/green/core.py", line 228, in send
    msg = super(_Socket, self).send(data, flags, copy, track)
  File "/usr/local/lib/python3.8/dist-packages/zmq/sugar/socket.py", line 547, in send
    return super(Socket, self).send(data, flags=flags, copy=copy, track=track)
  File "zmq/backend/cython/socket.pyx", line 718, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 759, in zmq.backend.cython.socket.Socket.send
  File "zmq/backend/cython/socket.pyx", line 135, in zmq.backend.cython.socket._check_closed
zmq.error.ZMQError: Socket operation on non-socket

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/locust/util/exception_handler.py", line 13, in wrapper
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/locust/rpc/zmqrpc.py", line 29, in send_to_client
    raise RPCError("ZMQ sent failure") from e
locust.exception.RPCError: ZMQ sent failure
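The tracebacks above show a send being attempted on a socket that has already been closed. A minimal stdlib analogue of that ordering problem, using a plain `socket` pair rather than ZeroMQ (so the exception type differs from `ZMQError`); `send_after_close` is a hypothetical helper for illustration only:

```python
import socket


def send_after_close():
    """Close one end of a socket pair, then try to send on it."""
    a, b = socket.socketpair()
    a.close()  # analogous to the server socket being closed during shutdown
    try:
        a.send(b"final message")  # a late send, like the retried send_to_client
    except OSError as e:
        return type(e).__name__
    finally:
        b.close()
    return "sent"
```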

cyberw · 2022-01-28T08:25:52Z

Perhaps it was introduced in #1935? I'm tempted to revert it...

dbfx · 2022-01-28T08:32:44Z

I've tried to run some tests to get more info, but unfortunately it doesn't happen reliably in my setup. It seems to happen more often at higher loads, so it could be related to that (e.g. busy workers + primary).

I agree that the change in quit() seems quite likely to be the cause.

        gevent.sleep(0.5)  # wait for final stats report from all workers
        self.server.close()

This 0.5 second sleep is somewhat arbitrary, I guess. Is it possible that a heavily loaded worker takes longer than 0.5 seconds to send its final stats report, and that this is when the issue occurs?
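One alternative to a fixed sleep, sketched here under assumptions rather than taken from Locust: wait until every expected worker has reported or a deadline passes, whichever comes first. `wait_for_final_reports`, `pending_workers`, and `poll` are hypothetical names; `poll` stands in for whatever mechanism delivers the final stats reports.

```python
import time


def wait_for_final_reports(pending_workers, poll, timeout=5.0, interval=0.05):
    """Poll until every worker has reported or the timeout expires.

    pending_workers: ids of workers still expected to report (hypothetical).
    poll: callable returning ids whose report arrived since the last call.
    Returns the set of worker ids that never reported.
    """
    pending = set(pending_workers)
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        pending -= set(poll())
        if pending:
            time.sleep(interval)  # yield before checking again
    return pending
```

With something like this, a lightly loaded run returns as soon as the last report arrives, while a heavily loaded one gets up to `timeout` seconds before the socket is closed.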

cyberw added 2 commits January 14, 2022 13:42

Ensure heartbeat_worker doesnt try to re-establish connection to workers when quit has been called

9f4f005

Change RPCError logging to debug if no clients were expected to be connected.

6d20383

cyberw merged commit 402bddd into master Jan 14, 2022

cyberw mentioned this pull request Jan 28, 2022

Fix "socket operation on non-socket" at shutdown, by reverting #1935 #1991

Merged

cyberw deleted the ensure-hearbeat_worker-doesnt-continue-monitoring-workers-after-quit branch March 22, 2022 10:18
