ensure the connection between master and slave in heartbeat #1280

Merged
merged 6 commits into locustio:master on Apr 5, 2020

Conversation

delulu
Contributor

@delulu delulu commented Mar 9, 2020

With heartbeating enabled, I still noticed network issues (packet drops, invalid byte streams) during long-term runs on a k8s cluster (overlay network).

A straightforward fix is to re-establish the connection whenever such a network issue is detected.

I've applied this fix in my locust tests over a week of continuous running, and it works as expected.
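
Roughly, the idea looks like this (an illustrative, self-contained sketch with made-up names, not the actual diff): when a failure is detected, throw the ZMQ socket away and build a fresh one instead of continuing to use it.

```python
import logging

import zmq

logger = logging.getLogger(__name__)


class ReconnectingClient:
    """Illustrative sketch only: a DEALER socket that is discarded and
    recreated whenever the caller detects a broken connection."""

    def __init__(self, host, port, identity):
        self.host, self.port, self.identity = host, port, identity
        self.context = zmq.Context()
        self.socket = self._connect()

    def _connect(self):
        sock = self.context.socket(zmq.DEALER)
        sock.setsockopt(zmq.IDENTITY, self.identity.encode())
        sock.connect("tcp://%s:%s" % (self.host, self.port))
        return sock

    def reset_connection(self):
        # Drop the possibly corrupted socket and start again from a clean state.
        logger.warning("Resetting connection to master")
        self.socket.close(linger=0)
        self.socket = self._connect()
```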

@codecov

codecov bot commented Mar 9, 2020

Codecov Report

Merging #1280 into master will decrease coverage by 0.08%.
The diff coverage is 65.38%.


@@            Coverage Diff             @@
##           master    #1280      +/-   ##
==========================================
- Coverage   80.21%   80.12%   -0.09%     
==========================================
  Files          23       23              
  Lines        2118     2179      +61     
  Branches      321      324       +3     
==========================================
+ Hits         1699     1746      +47     
- Misses        339      350      +11     
- Partials       80       83       +3     
Impacted Files           Coverage Δ
locust/runners.py        75.88% <61.36%> (+0.03%) ⬆️
locust/rpc/zmqrpc.py     81.66% <69.69%> (-2.55%) ⬇️
locust/exception.py     100.00% <100.00%> (ø)
locust/web.py            89.33% <0.00%> (-0.67%) ⬇️
locust/core.py           95.98% <0.00%> (+0.36%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

try:
    self.client.close()
    self.client = rpc.Client(self.master_host, self.master_port, self.client_id)
except Exception as e:
Collaborator

@cyberw cyberw Mar 9, 2020


Would it be possible to change this to only catch specific exceptions?

I've never looked into this part of the code base very much, so I feel unqualified to approve/decline the PR though :)

Contributor Author

It's intended to catch all exceptions, to make the reset reliable against all possible failures.

It also doesn't go against the current logic: reset_connection is newly introduced in the loop, and it isn't supposed to let any exception escape unless there's a strong reason.

I understand your concern; I will add tests to cover this change.

Collaborator

Catching all types of exceptions is generally considered bad practice as it may hide more serious issues or put the program in an unknown state, causing hard-to-debug problems later on. But in this case it may be warranted, I can't really tell :)

Contributor Author

I see; I'll update the handler to catch only the exceptions I observed during my tests.

@max-rocket-internet
Contributor

@delulu does this fix the occasional missing slave? I see this now and again and I'm also running on k8s.

#1158

@delulu
Contributor Author

delulu commented Mar 10, 2020

> @delulu does this fix the occasional missing slave? I see this now and again and I'm also running on k8s.
>
> #1158

Yes, it fixes it well.

@cyberw
Collaborator

cyberw commented Mar 16, 2020

Do we really need to wrap the exceptions in our own exception class? I don't really see what value that adds.

@heyman
Member

heyman commented Mar 16, 2020

> Do we really need to wrap the exceptions in our own exception class? I don't really see what value that adds.

Also, if we're wrapping the exceptions we should use raise ... from e.

@delulu
Contributor Author

delulu commented Mar 18, 2020

> Do we really need to wrap the exceptions in our own exception class? I don't really see what value that adds.
>
> Also, if we're wrapping the exceptions we should use raise ... from e.

Good suggestion! I will update it.

The main purpose is to handle these exceptions in one place rather than scattering the handling across runners.py. It also documents how to deal with RPCError, which reduces the maintenance effort.
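
The wrapping pattern I have in mind looks roughly like this (a sketch only; the exact exception set and messages are assumptions based on the types discussed in this thread):

```python
import msgpack
import msgpack.exceptions as msgerr
import zmq.error


class RPCError(Exception):
    """Single exception type exported by the rpc layer to its callers."""


def recv_message(socket):
    # Sketch: translate low-level transport/decoding failures into one
    # RPCError, keeping the original exception as the cause (raise ... from e).
    try:
        data = socket.recv()
        return msgpack.unpackb(data, raw=False)
    except (msgerr.ExtraData, UnicodeDecodeError) as e:
        raise RPCError("ZMQ interrupted or corrupted message") from e
    except zmq.error.ZMQError as e:
        raise RPCError("ZMQ communication error") from e
```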

@delulu
Contributor Author

delulu commented Mar 18, 2020

I would like to add a test case for reset_connection in test_runners.py, but I haven't figured out a good way to do it yet. Feel free to let me know if you have any ideas.

@delulu
Contributor Author

delulu commented Mar 21, 2020

I've added a test case, test_reset_connection, which:

  1. checks that connection_broken is set to true when an RPCError is raised in recv_from_client:

    client_id, msg = self.server.recv_from_client()

  2. makes sure an RPCError is tolerated in reset_connection.

Please take a look.

@cyberw
Collaborator

cyberw commented Mar 21, 2020

Looks nice! But I still don't understand what wrapping the exceptions helps with. I would prefer catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError, etc. in the places where they are relevant instead of catching RPCError. Less code, less magic.

@delulu
Contributor Author

delulu commented Mar 23, 2020

That is exactly what I'm doing: catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError, etc. in the places where they are relevant.

What I mean is that runners.py only deals with the rpc layer; it has no context about these errors (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError) or how to handle them.

@cyberw
Collaborator

cyberw commented Mar 23, 2020

> That is exactly what I'm doing: catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError, etc. in the places where they are relevant.
>
> What I mean is that runners.py only deals with the rpc layer; it has no context about these errors (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError) or how to handle them.

I don't quite understand what you mean by "it has no context about these errors or how to handle them".

Why can't the runner do except (zmq.error.ZMQError, <and whatever other exceptions we are wrapping>) as e: instead of except RPCError as e: and handle it exactly the same way? Wrapping them adds no value IMO; it just complicates the code and makes it less obvious what really happened.

@delulu
Contributor Author

delulu commented Mar 23, 2020

> > That is exactly what I'm doing: catching zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError, etc. in the places where they are relevant.
> > What I mean is that runners.py only deals with the rpc layer; it has no context about these errors (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError) or how to handle them.
>
> I don't quite understand what you mean by "it has no context about these errors or how to handle them".
>
> Why can't the runner do except (zmq.error.ZMQError, <and whatever other exceptions we are wrapping>) as e: instead of except RPCError as e: and handle it exactly the same way? Wrapping them adds no value IMO; it just complicates the code and makes it less obvious what really happened.

The rpc module deals with zmq, message decoding and msgpack directly, so it has the context for these exceptions and knows their possible causes; it can wrap them together and attach the cause information. The runners don't have that context, and they have no knowledge of the scenarios in which these exceptions are raised.

And I totally disagree that it complicates the code.

The handling of these three exceptions is the same, so wrapping them in a unified exception with a description means callers only need to care about that one exception, rather than handling each of them in different scenarios.

Otherwise, in one place one of the three would need to be caught, in another place two of the three, or all three. That is complicated, and it requires the caller to understand the details of the rpc layer.
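
To make the contrast concrete, the two calling styles look roughly like this (a hedged sketch; server is a hypothetical rpc server object and RPCError stands in for the wrapper type sketched earlier in the thread):

```python
import msgpack.exceptions as msgerr
import zmq.error


class RPCError(Exception):
    """Stand-in for the wrapper exception sketched earlier."""


def listen_with_wrapping(server):
    """The caller only needs to know the one exception exported by the rpc layer."""
    try:
        return server.recv_from_client(), False
    except RPCError:
        return None, True  # connection_broken


def listen_without_wrapping(server):
    """The caller has to enumerate the transport-level exceptions itself and
    keep that list in sync with whatever the rpc layer can raise."""
    try:
        return server.recv_from_client(), False
    except (zmq.error.ZMQError, msgerr.ExtraData, UnicodeDecodeError):
        return None, True  # connection_broken
```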

@cyberw
Collaborator

cyberw commented Mar 27, 2020

Hi @delulu ! I'm sorry we were not able to agree on this. I would love to see more robust connection handling, but I won't merge with the (IMO) needless exception wrapping.

Perhaps I can at least convince you that the wrapping isn't so important to you that it's worth holding up the fix?

If you make the requested changes & have a look at possibly speeding up the test case I'll be happy to merge.

@heyman
Member

heyman commented Mar 27, 2020

Unfortunately I don't have time to review the full PR at the moment. However, I don't see any problem with raising a common RPCError exception (from the specific one), as long as the proper way to handle them in runners.py is the same. I actually think it makes the abstraction less leaky.

@max-rocket-internet
Contributor

I would love to see something merged to ensure better communication between slaves and masters 🙏

I still can't reproduce it reliably, but now and again we still see missing slaves or slaves that don't stop hatching.

@cyberw
Collaborator

cyberw commented Apr 1, 2020

If I'm the only one who thinks the exception wrapping is weird, and someone resolves the conflicts then I'm ok with merging.

@delulu
Contributor Author

delulu commented Apr 3, 2020

Sorry for my late response.

@cyberw I have to say that the wrapping is as important as the fix, because I think it is the right thing to do. If you see anything wrong, please point it out and convince me, and I'll be happy to make the changes.

As for the test case mentioned above, it tests three scenarios:

  1. An RPCError is handled and connection_broken is set to True.
  2. When a normal message is received correctly, connection_broken is set back to False.
  3. Other exceptions (HeyAnException) won't be handled, so connection_broken does not change.

For each scenario it takes about 3 seconds to receive the message, and the test case doesn't work as expected when I try to reduce the sleep time to 2 seconds.

I'll rebase onto the latest master branch and test it; I'll commit the change when the test looks good.

@cyberw
Collaborator

cyberw commented Apr 3, 2020

> @cyberw I have to say that the wrapping is as important as the fix

I disagree, but as I said, if I'm the only one who considers it less clean than just catching the underlying exceptions, I won't argue any more. Maybe I am missing something, but let's move on. It's not that important.

As for the test cases, couldn't you just reduce the timeouts before the test (and reduce the sleeps)? Or is that not possible for some reason?

@delulu
Contributor Author

delulu commented Apr 5, 2020

> > @cyberw I have to say that the wrapping is as important as the fix
>
> I disagree, but as I said, if I'm the only one who considers it less clean than just catching the underlying exceptions, I won't argue any more. Maybe I am missing something, but let's move on. It's not that important.
>
> As for the test cases, couldn't you just reduce the timeouts before the test (and reduce the sleeps)? Or is that not possible for some reason?

When a message is sent, it seems to take 3 seconds to reach the line of code that updates connection_broken, because when I reduce the sleep to 2 seconds the status of connection_broken is not updated as expected.

So I can't reduce the sleep time or remove the sleeps.

The conflicts have been resolved, and the latest branch has been tested with a full day of running; please take another look.

@cyberw
Collaborator

cyberw commented Apr 5, 2020

Thanks for your contribution!

@cyberw cyberw merged commit 8685a4b into locustio:master Apr 5, 2020
@heyman
Member

heyman commented Apr 5, 2020

> When a message is sent, it seems to take 3 seconds to reach the line of code that updates connection_broken, because when I reduce the sleep to 2 seconds the status of connection_broken is not updated as expected.

The reason for this was the FALLBACK_INTERVAL that was introduced by this pull request. The first sleep wasn't needed at all, but when both the first and second sleeps were 3, together they surpassed the FALLBACK_INTERVAL of 5 (which is why it failed when both sleeps were 2). I've now fixed this in 2ac0a84 by temporarily lowering the FALLBACK_INTERVAL.

What's the purpose of FALLBACK_INTERVAL, by the way, and how were the 5 seconds chosen?

I also removed the case with an unhandled exception in MasterLocustRunner.client_listener because it was causing an error stack trace to be printed when the test was run, and if an unhandled exception happens in client_listener we're fucked anyways, so we don't need to check for it (unless we're planning to handle the exception in some way).
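
Something along these lines (a sketch only, assuming FALLBACK_INTERVAL is a module-level constant in locust.runners; the actual mechanics in 2ac0a84 may differ):

```python
from unittest import mock

from locust import runners  # assumption: FALLBACK_INTERVAL lives in locust.runners


def with_fast_fallback(test_body):
    # Temporarily shrink FALLBACK_INTERVAL so the reconnect path is reached
    # well within the test's sleep window, then restore the original value.
    with mock.patch.object(runners, "FALLBACK_INTERVAL", 0.1):
        test_body()
```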

@delulu
Contributor Author

delulu commented Apr 6, 2020

@heyman, you explained it well and your fix looks nice! I forgot about the mocked rpc context in the test.

As for FALLBACK_INTERVAL: in my environment HEARTBEAT_INTERVAL is set to 3, and the connection is reset in the heartbeat loop if a broken connection is detected. So I simply chose a prime number, to make sure reset_connection gets executed and the interval doesn't collide with the HEARTBEAT_INTERVAL.

And I'm fine with you removing that case; it was there to check that only an RPCError triggers reset_connection.

@heyman
Member

heyman commented Apr 6, 2020

> As for FALLBACK_INTERVAL: in my environment HEARTBEAT_INTERVAL is set to 3, and the connection is reset in the heartbeat loop if a broken connection is detected. So I simply chose a prime number, to make sure reset_connection gets executed and the interval doesn't collide with the HEARTBEAT_INTERVAL.

Why do we need to call reset_connection() in heartbeat_worker()? Couldn't we just call reset_connection() in the exception handler (in client_listener) and skip the sleep() call?

@delulu
Contributor Author

delulu commented Apr 7, 2020

> Why do we need to call reset_connection() in heartbeat_worker()? Couldn't we just call reset_connection() in the exception handler (in client_listener) and skip the sleep() call?

Because I prefer to keep it consistent with WorkerLocustRunner, where the connection status check and connection reset are done in the heartbeat.
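
Roughly, the pattern I'm aiming for is the following (an illustrative sketch with made-up names and intervals, not the actual runner code): the listener only flags the problem, and the heartbeat loop owns the reset.

```python
import gevent

HEARTBEAT_INTERVAL = 3  # illustrative values, not the project defaults
FALLBACK_INTERVAL = 5


class RPCError(Exception):
    """Stand-in for the rpc layer's wrapper exception."""


class MasterRunnerSketch:
    def __init__(self, server):
        self.server = server
        self.connection_broken = False

    def client_listener(self):
        while True:
            try:
                client_id, msg = self.server.recv_from_client()
            except RPCError:
                # Only record that the connection is broken and back off
                # briefly, so we don't spin on a dead socket.
                self.connection_broken = True
                gevent.sleep(FALLBACK_INTERVAL)
                continue
            self.connection_broken = False
            # ... dispatch msg to the normal handlers ...

    def heartbeat_worker(self):
        while True:
            gevent.sleep(HEARTBEAT_INTERVAL)
            if self.connection_broken:
                # The heartbeat loop is the single place that rebuilds the
                # connection, mirroring the worker-side runner.
                self.server.reset()
                continue
            # ... check worker heartbeats and mark missing workers ...
```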
