Race condition when saving data under multiple threads #581
Comments
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) Original poster here. It looks like I truncated the stack trace and didn't report the actual error. The final three lines should be:
Hmm, this looks interesting. I'll try to reproduce it. Do you think I'll be able to clone your repo and run it myself?
OK, then we'll have to share clues here. This looks to me like you are measuring coverage inside of coverage.py itself. That shouldn't happen. Try changing line 198 to:
then capture what appears on stderr, and share it here.
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) Here you go. I removed some internal product names from the paths, but this is otherwise un-messed-with
Hmm, thanks. Now add "trace" to that list of options:
which will produce more output, one line per file executed.
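For reference, here is a minimal sketch (not from the thread) of how this kind of debug output can be requested through coverage's public API rather than by editing the source. The option names are coverage's documented debug options for the 4.x line; the surrounding script is illustrative:

```python
# Hypothetical illustration: "sys" dumps coverage's internal state at startup
# and "trace" prints one line per file as it is traced or skipped.  The
# output goes to stderr by default.
import coverage

cov = coverage.Coverage(debug=["sys", "trace"])
cov.start()
# ... run the code under test ...
cov.stop()
cov.save()
```

The same options can also be set in `.coveragerc` under `[run]` as `debug = sys, trace`.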
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) I removed some lines that say `Tracing '/tmp/our-internal-product-stuff/foo'`. Coverage is down at the bottom - it looks like it's being skipped. I do see multiprocessing in the list, though. Pure speculation, but that's one of the things I'm investigating to try to repro this in a smaller way. EDIT: I didn't include this at first, but the thing blows up right after the line
I've edited the output below to include the stack trace when it blows up
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) Also tracking this issue on the rostest side over here: ros/ros_comm#1055, but that side appears a little bit less responsive.
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) I'm taking my repro case and cutting it down binary-search style. I'm noticing that as I remove more stuff, the likelihood that coverage blows up is going down. With everything in there, I can get it to blow up every time. With about half of the stuff removed, it's now to the point where it'll blow up about 1 in 3 times.
Now I'm thinking of something drastic, like copying the dict just before that line, then catching RuntimeError and checking the difference between the original dict and the current dict.
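A hypothetical sketch of that diagnostic idea (the helper below is illustrative, not coverage.py code; `self.data`, `abs_file_dict`, and `covdata.add_lines` are the collector.py names that come up later in the thread):

```python
import sys

def report_dict_diff(before, after):
    """Print which keys appeared or disappeared between two snapshots of
    the coverage data dict (illustrative helper, not coverage.py code)."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    print("keys added while saving:", added, file=sys.stderr)
    print("keys removed while saving:", removed, file=sys.stderr)

# Roughly how it could be wired into collector.py's save_data:
#
#     before = dict(self.data)
#     try:
#         covdata.add_lines(abs_file_dict(self.data))
#     except RuntimeError:
#         report_dict_diff(before, dict(self.data))
#         raise
```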
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) Ok, I modified save_data to be the following:
And here's what I see (with a little bit of manual formatting added):
These all seem like legit files that might be getting coverage generated for them. There are also about 22 Python threads hanging around, most of them waiting to publish or receive something on a socket.
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) I have a sort of stand-alone repro case that I need to clean up a little bit, but here's the gist:
It's not 100% standalone yet, because I'm still using ROS to get a list of modules and classes to import. I have a really long list of ROS objects, for example:
Then my threads all run the following method:
I don't think this is ROS-specific - but using the ROS messages was a convenient way to quickly generate a very large list of modules and class names to import. I think I can give you instructions to fully repro the issue using a Docker container that has the ROS stuff in it. Again, I don't believe that ROS is necessary - it's just a convenient way for me to generate a large list of things to import. Here is the program I'm using right now to reproduce the problem without the importlist:
Here's the stack trace that I see:
OK, I had wondered about threads. How are you starting them? Do you start them before or after coverage is started?
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) repro.py is attached to this issue.
Observe the following output:
Notes:
@pbaughman Thanks so much for the test case, I am able to reproduce it. And it's nothing to do with your code. I replaced your imports with some synthetic ones, and could still cause the failure. This program does it without needing Docker, etc:
Now to work on fixing the problem... that might be harder...
Original comment by Peter Baughman (Bitbucket: pbaughman, GitHub: pbaughman) A somewhat cheesy solution that does not address a possible underlying logic issue might be to make a copy of the data in collector.py's save_data, like:
I didn't deep-dive, but it seems strange that things are still being added to the dictionary after stop is called.
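A hypothetical sketch of that "copy first" idea (names are illustrative, with `os.path.abspath` standing in for coverage's `abs_file`):

```python
import os.path

def snapshot_and_convert(shared_data):
    # Snapshot the shared mapping in a single dict() call, then do the
    # slower key conversion on the private copy.  In CPython, dict(d) copies
    # without running Python code per item, so it is far less likely to race
    # with concurrent inserts than a Python-level generator expression is --
    # though, as the next comment notes, it did not turn out to be a
    # complete fix.
    snapshot = dict(shared_data)
    return dict((os.path.abspath(k), v) for k, v in snapshot.items())
```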
@pbaughman I did some work to investigate fixes, but got sidetracked. Copying the data dictionary makes the problem less likely, but does not fix it robustly. I'm sorry to say this, but for the time being, you should run with your fix from a fork of coverage.py. |
Original comment by Olivier Grisel (Bitbucket: ogrisel, GitHub: ogrisel) I can reproduce a similar issue on the CI infrastructure of scikit-learn (but did not observe this on my local machine):
Original comment by Olivier Grisel (Bitbucket: ogrisel, GitHub: ogrisel) I updated the minimal repro script to work in Python 3 with the print function (enabled via `from __future__ import print_function`):

```python
from __future__ import print_function

import importlib
import random
import threading
import time

_cov = None
_run = True

for i in range(1000):
    with open("m{:04d}.py".format(i), "w") as f:
        f.write("a = 1\n")

def _start_coverage():
    global _cov
    import coverage
    _cov = coverage.coverage()
    _cov.start()

def _stop_coverage():
    _cov.stop()
    print("YOU WILL NEVER GET HERE!!", flush=True)
    # Mega strange. When the program blows up, it reports that _cov.stop() blows up.
    # The thing is, I see "YOU WILL NEVER GET HERE" printed to the console. Also the
    # program doesn't blow up without _cov.save()
    _cov.save()

def random_load():
    global _run
    while _run:
        candidate = "m{:04d}".format(random.randint(0, 999))
        mod = importlib.import_module(candidate)

_start_coverage()
for n in range(20):
    threading.Thread(target=random_load).start()
time.sleep(0.1)
_stop_coverage()
_run = False
```

I get the following output:
and then the program hangs indefinitely (since the exception means `_run` is never set to `False`, so the worker threads keep running).
Original comment by Olivier Grisel (Bitbucket: ogrisel, GitHub: ogrisel) The following patch (against [12d026f6db21 (bb)](https://bitbucket.org/ned/coveragepy/commits/12d026f6db21)) fixes the bug for me:

```diff
diff -r 12d026f6db21 coverage/collector.py
--- a/coverage/collector.py    Sat Jul 08 11:03:31 2017 -0400
+++ b/coverage/collector.py    Tue Aug 08 16:25:10 2017 +0200
@@ -374,7 +374,7 @@
         def abs_file_dict(d):
             """Return a dict like d, but with keys modified by `abs_file`."""
-            return dict((abs_file(k), v) for k, v in iitems(d))
+            return dict((abs_file(k), v) for k, v in list(iitems(d)))
 
         if self.branch:
             covdata.add_arcs(abs_file_dict(self.data))
```

This makes the iteration thread-safe. But maybe writing into the shared data dict from the other threads should be protected as well?
I will have to get back to this soon. I am not just adding a list() call, for a few reasons: first, it doesn't solve the race condition, it just narrows the window from the time to execute the entire line to the time to execute the list(). Adding a lock around this line might be a solution, but I think there's also technically a race condition in the trace function where we add keys to the data dictionary in the first place. That is an even narrower window for problems, but at least at an academic level, is still a problem. Locking on every line recorded doesn't seem feasible. At the very least, I'd like to understand the scope of the problem before adding this simple pragmatic 99% solution.
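As an illustration of the coarse-locking option being weighed above, a minimal sketch (hypothetical names, not coverage.py's actual tracer or collector API):

```python
import threading

# One lock shared by the tracer (writer) and the saver removes the race
# entirely, but the writer side has to take it on every traced line, which
# is the cost the comment above is worried about.
_data_lock = threading.Lock()

def record_line(data, filename, lineno):
    with _data_lock:                      # acquired on every line event
        data.setdefault(filename, {})[lineno] = None

def save_snapshot(data):
    with _data_lock:                      # acquired once per save
        return dict((fname, dict(lines)) for fname, lines in data.items())
```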
Original comment by Olivier Grisel (Bitbucket: ogrisel, GitHub: ogrisel)
The thread-safety of list() is something to consider. I don't understand your point about past vs. future threads? I thought the problem here was one thread recording data into the shared dictionary while another thread was trying to save the data?
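That interaction is easy to demonstrate in isolation. The following self-contained snippet (generic demo code, not coverage.py's) has one thread adding keys while the main thread iterates the same dict, which typically raises the same RuntimeError within a fraction of a second:

```python
import threading

shared = {}
stop = False

def writer():
    # One thread keeps adding keys, like the trace function recording lines.
    i = 0
    while not stop:
        shared[i] = i
        i += 1

t = threading.Thread(target=writer)
t.start()
try:
    for _ in range(1000):
        # The main thread iterates the same dict, like save_data does.
        dict((k, v) for k, v in shared.items())
    print("no race observed this run (timing dependent)")
except RuntimeError as exc:
    print(exc)   # typically: dictionary changed size during iteration
finally:
    stop = True
    t.join()
```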
Original comment by Olivier Grisel (Bitbucket: ogrisel, GitHub: ogrisel) The problem is that the main thread is trying to save the collected data while the other threads are still recording into it. However, I just noticed something inconsistent in the traceback I reported in https://bitbucket.org/ned/coveragepy/issues/581/44b1-44-breaking-in-ci#comment-38963003: the traceback shows the exception in the main thread in the `stop()` call, even though the repro's output suggests that `stop()` completes and it is the later `save()` that fails.
Original comment by Olivier Grisel (Bitbucket: ogrisel, GitHub: ogrisel) I opened a PR for this issue: https://bitbucket.org/ned/coveragepy/pull-requests/127/fix-thread-safe-collectorsave_data |
Fixed in the commits leading up to 6ba2af43cfdc (bb)
BTW, @ogrisel @pbaughman I added you both to the CONTRIBUTORS file. Thanks! |
I upped the concurrency protection in d781dd6eb3d1dc05d99dc04ecf37000464484446 (bb). With the previous list() call, I still got a "dict changed size" error on a CI server once. This has shipped in version 4.4.2.
Originally reported by Anonymous
I do not have a minimum repro for this issue yet, but we're seeing errors in our CI after upgrading coverage from 4.3.4 to 4.4b1 and 4.4.
We have some tests running using rostest. Upon upgrading from coverage 4.3.4 on our CI server, we started seeing the following errors:
It's possible this is a rostest issue too - I haven't dug down very deeply, but we're certain that this issue does not occur when using coverage 4.3.4 - only 4.4b1 and 4.4.
The source code for the caller is found here. It looks like stop is being called from line 227 and then blowing up internally.
Will post more info once I have it.
Contact info: peter@mayfieldrobotics.com