-
Notifications
You must be signed in to change notification settings - Fork 411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Random PermissionError on windows when using automatic memory mapping caused by reference cycles #942
Conversation
So the test with pandas and scikit-learn is passing. At it is the case on my windows. I have not been able to find what has changed between the time it was failing and now. |
The minimal test is also failing on my windows. Is the minimal test identical to the one posted in #806? import numpy as np
from joblib import Parallel, delayed
def test_data(data):
data_view = data[0:20]
return data_view
data = np.ones(int(2e6))
results = Parallel(n_jobs=2, verbose=5)(
delayed(test_data)(data) for _ in range(10)) From my understanding, the fact that a view is returned makes it impossible for the garbage collector to do anything? |
pandas might have been updated to not introduce a reference cycle anymore. |
Once the result item has been sent back to the parent process (once the result queue is empty), the view should be garbage collected in the worker process and the mmap buffer closed on the worker side. Actually, the view of a memmap is also memmap and will be recreated as such in the parent process so we should actually not delete the folder in this case. To problem is likely different but this probably highlight a design issue.... |
Could you please open a dedicated PR with a new non regression test that mixes your minimal reproduction case that involves the view with the kind of asserts on tempfolder as done in the existing |
Thanks for the explanations. I think we can close scikit-learn/scikit-learn#12546 as we cannot reproduce the error with the last versions of scikit-learn, pandas and joblib (on my Windows with scikit-learn 0.21.3, pandas 0.25.1 and joblib 0.14.0). I will open a PR with a test involving the view. |
I confirm that removing the cycle in the test makes the test pass. Forcing a global collect ( FYI here is the patch to add the gc.collect(). diff --git a/joblib/externals/loky/process_executor.py b/joblib/externals/loky/process_executor.py
index e9ae29e..74d7b48 100644
--- a/joblib/externals/loky/process_executor.py
+++ b/joblib/externals/loky/process_executor.py
@@ -426,7 +426,7 @@ def _process_worker(call_queue, result_queue, initializer, initargs,
# Free the resource as soon as possible, to avoid holding onto
# open files or shared memory that is not needed anymore
del call_item
-
+ gc.collect()
if _USE_PSUTIL:
if _process_reference_size is None:
# Make reference measurement after the first call |
Would it be possible to call |
No because gc.collect needs to be called in the worker loop while the folder deletion is performed by the parent process at the end of the call. But we want to keep the worker open because it's costly to shut them down and respawn them at each parallel call. |
With multiprocessing the workers are called prior to deleting the folder. Starting worker processes is cheap under Linux with the fork startmethod but it is unsafe and causes crashes with openmp. Loky was design to use the slower spawn start method but to hide the extra cost by reusing worker processes automatically across Parallel calls. |
I see, thanks for the explanation. |
@@ -7,3 +7,5 @@ codecov | |||
distributed | |||
lz4 | |||
threadpoolctl; python_version >= '3.5' | |||
scikit-learn | |||
pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this as well, now that you removed the test with scikit-learn and pandas.
Closing in favor of #966. |
For now this PR is trying to reproduce #806 on the Windows CI with a bunch of tests.