
Behavex runner gets stuck indefinitely when a test case terminates with sys.exit/seg fault #114

Closed
lazareviczoran opened this issue Oct 25, 2023 · 12 comments
Labels: Answered, bug (Something isn't working), fixed

Comments

@lazareviczoran

Describe the bug
Hello, we ran into a case where the runner gets stuck while running test cases when one of the test processes ends unexpectedly. We dug into this to figure out why, and found that there is an open bug in multiprocessing.Pool: if the function being executed in a worker process terminates with a sys.exit or a segfault, the main process will block on the join call indefinitely. concurrent.futures.ProcessPoolExecutor appears to have this specific issue fixed. We are not sure whether switching to it would introduce side effects that break the current behavex behavior, but it might be worth looking into.

To Reproduce
Steps to reproduce the behavior:

  1. Run the following code snippet (with multiprocessing.Pool) to see the current behavior: the pool gets stuck on join, but all 10 subprocesses will run.
import multiprocessing
import sys

def func_that_exits():
    # SystemExit is not caught by the Pool worker loop, so the worker process dies
    sys.exit(0)

if __name__ == '__main__':  # guard needed on platforms that spawn workers (e.g. macOS)
    process_pool = multiprocessing.Pool(3)

    for i in range(10):
        process_pool.apply_async(func_that_exits, ())

    process_pool.close()
    print('joining')
    process_pool.join()  # hangs here: results from the dead workers never arrive
    print('joined')
  2. Run the following code snippet (with concurrent.futures.ProcessPoolExecutor) to see the behavior where the main process won't get stuck, but only 3 "subprocesses" will run (a sketch for inspecting how each task ended follows after this snippet).
from concurrent.futures import ProcessPoolExecutor
import sys

def func_that_exits():
    sys.exit(0)

if __name__ == '__main__':  # guard needed on platforms that spawn workers (e.g. macOS)
    process_pool = ProcessPoolExecutor(3)

    for i in range(10):
        process_pool.submit(func_that_exits)

    print('joining')
    process_pool.shutdown()  # returns instead of hanging
    print('joined')
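
For completeness, the returned futures can also be inspected to see how each task ended, rather than discarding the results. The sketch below is not from the original report, and the exact exception surfaced on a future can vary across Python versions: a sys.exit inside a task may either be forwarded to the future as a SystemExit or kill the worker and mark the pool broken.

from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool
import sys

def func_that_exits():
    sys.exit(0)

if __name__ == '__main__':
    futures = []
    with ProcessPoolExecutor(3) as pool:
        for i in range(10):
            try:
                futures.append(pool.submit(func_that_exits))
            except BrokenProcessPool:
                # submit() itself fails once the executor is marked broken
                print(f'task {i}: could not submit, pool already broken')
                break
    for i, future in enumerate(futures):
        try:
            future.result()
            print(f'task {i}: completed')
        except BrokenProcessPool:
            # a worker died abruptly and the executor was marked broken
            print(f'task {i}: pool broken')
        except SystemExit:
            # on some Python versions the worker survives and forwards the exception
            print(f'task {i}: task called sys.exit')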

Expected behavior
In case any of the test cases exits with a seg fault or a sys.exit, the runner should exit instead of hanging indefinitely.

Desktop (please complete the following information):

  • OS: linux/macOS
  • Browser: not relevant
  • behavex version: 2.0.1
hrcorval added the bug (Something isn't working) label on Nov 22, 2023
@jbridger

jbridger commented Dec 4, 2023

Hi @hrcorval!

As this was causing us issues, we wanted to have an attempt at addressing it and providing additional debug information.

It would be great if you could have a look at our PR addressing this: #120

Thanks 🙂

@bombsimon

Hi @hrcorval, sorry for the direct ping, but I'm just curious whether you think you'll get to this at some point, or if you're lacking the bandwidth to maintain behavex for now. Also, let me know if I can assist with anything to make it easier for you to get to this issue (or the PR, or other things).

Thanks for this project, it's been very useful!

@hrcorval
Owner

Hi @jbridger, @lazareviczoran and @bombsimon, thanks for reporting this issue.
We have been trying to reproduce it in the past without success. Considering you are experiencing the issue, do you have an easy way to reproduce it? That would help us focus on the solution.
Regarding the provided PR, given the amount of changes in it, we don't yet have enough bandwidth to review it and validate that there are no regression issues. I hope we can jump into it soon. However, if we have a way to reproduce the issue on our end, it will accelerate the fix.
Thanks!!

@bombsimon

bombsimon commented Jun 29, 2024

Hi! Sorry if this is a dumb question, but did you see the repro in the issue description? What happens when you run it; does it not behave (no pun intended) as the description states? If not, I can try to make a new repro!

Totally understand about the size of the PR. Sadly the changes spread a bit wide. I guess that's something to consider as well, but let's see if we can get a working repro first. 😀

EDIT: Oh, I assume you want a behave feature and step implementation that causes the same issue. I'll try to get back with that early next week!

@bombsimon

I created a repro here: https://github.com/bombsimon/behavex-issue-114-repro

It's basically just a test step doing sys.exit. This is not something we do on purpose, but the same behaviour occurs if some dependency does this, or if we get a segfault or another type of crash killing the process. I added some more context to the README.
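
For illustration, the failing step in that repro would be along these lines (a hypothetical sketch rather than the exact code from the linked repository; the step text and file path are invented):

# features/steps/crash_steps.py (hypothetical path)
import sys

from behave import step

@step('a step that terminates the process')
def step_terminate(context):
    # simulates a dependency calling sys.exit (or segfaulting) mid-test;
    # under parallel execution this kills the worker process
    sys.exit(1)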

@iamkenos
Contributor

iamkenos commented Aug 22, 2024

Hey gents, thanks for reporting this. I have a feeling I'm hitting a related issue; it only happens with multiple processes too.

I can reproduce the issue with this repo: https://github.com/bombsimon/behavex-issue-114-repro, Python 3.10.11

@hrcorval
Owner

hrcorval commented Aug 26, 2024

Hi! In a few days we are releasing a new version (v4.0.1) that contains multiple changes suggested in this thread, including changing the parallel execution implementation to use concurrent.futures.ProcessPoolExecutor instead of multiprocessing.Pool.
Thanks for posting this issue and for providing an easy way to reproduce it.
We will keep you posted as soon as the release candidate is ready.
Thanks!
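
In broad strokes, that migration maps multiprocessing.Pool's apply_async-with-callback pattern onto submit plus a done-callback. The sketch below is illustrative only and is not behavex's actual runner code (run_feature and on_done are invented names):

from concurrent.futures import ProcessPoolExecutor

def run_feature(name):
    return f'{name}: passed'

def on_done(future):
    # future.result() re-raises worker failures instead of losing them silently
    print(future.result())

if __name__ == '__main__':
    features = ['login', 'checkout', 'search']
    # previously: pool.apply_async(run_feature, (name,), callback=on_done)
    with ProcessPoolExecutor(3) as executor:
        for name in features:
            executor.submit(run_feature, name).add_done_callback(on_done)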

@hrcorval
Owner

hrcorval commented Sep 6, 2024

Hi @lazareviczoran, @jbridger, @bombsimon, @iamkenos. Finally, we have released a new library version (v4.0.2) with core changes to the way parallel executions are performed.
We successfully reproduced the issue based on the information provided in https://github.com/bombsimon/behavex-issue-114-repro, and we took inspiration for the solution from this PR: #120
I hope the provided solution works for you. Of course, we will keep an eye on any other improvements or fixes you spot when using the framework.
We will wait for your feedback on this.
Thanks a lot!

@lazareviczoran
Author


Great! We'll try it out and report back if we notice any issue.
Thanks a lot!! 🙌

@iamkenos
Contributor

iamkenos commented Sep 8, 2024

Hi @hrcorval, thanks to the team for working on the release!

I'm getting issues with the new release, something around the HTML reporter not loading the environment variables when run in parallel mode.

E.g. the merged behave output XML file shows the tests that ran, but the HTML report doesn't. (This only happens in parallel mode.)

I haven't found the cause yet and couldn't recreate the issue with a minimal repro example.

Would it be possible to back-port the fix for issue 114 onto v3? I checked the changes in v4 and there seem to be a handful around the HTML reporter (and one related to env variables). I have a feeling some of them may have caused a regression.

Otherwise, would you be able to advise on the best next step for troubleshooting this?

Thanks again!

@hrcorval
Owner

hrcorval commented Sep 9, 2024

Hi @iamkenos, it seems the way we are using/testing the library is not catching the issue you are experiencing, so I would like to ask you for more details in order to investigate it:
1- Is the HTML report showing some test results, or is it coming up empty?
2- What is the execution status of the tests that appear in the XML report but are missing from the HTML report?
3- Regarding v3, do you have a link to that branch, so we can analyze the changes performed after it?

Thanks for reporting this; we are using the solution extensively and have not been able to reproduce the issue, so we need to figure out how to reproduce it.

Regards!

@iamkenos
Contributor

iamkenos commented Sep 9, 2024

Appreciate the prompt response @hrcorval !

I have a feeling the issue is intermittent. I'll raise a separate GitHub issue once I find a way to reproduce it consistently.

Once again, thanks for all your work!
