Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random PermissionError on windows when using automatic memory mapping caused by reference cycles #942

Closed
wants to merge 3 commits into from

Conversation

ogrisel
Copy link
Contributor

@ogrisel ogrisel commented Oct 3, 2019

For now this PR is trying to reproduce #806 on the Windows CI with a bunch of tests.

@albertcthomas
Copy link
Contributor

So the test with pandas and scikit-learn is passing. At it is the case on my windows. I have not been able to find what has changed between the time it was failing and now.

@albertcthomas
Copy link
Contributor

The minimal test is also failing on my windows. Is the minimal test identical to the one posted in #806?

import numpy as np
from joblib import Parallel, delayed


def test_data(data):
    data_view = data[0:20]
    return data_view


data = np.ones(int(2e6))
results = Parallel(n_jobs=2, verbose=5)(
    delayed(test_data)(data) for _ in range(10))

From my understanding, the fact that a view is returned makes it impossible for the garbage collector to do anything?

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 3, 2019

So the test with pandas and scikit-learn is passing. At it is the case on my windows. I have not been able to find what has changed between the time it was failing and now.

pandas might have been updated to not introduce a reference cycle anymore.

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 3, 2019

From my understanding, the fact that a view is returned makes it impossible for the garbage collector to do anything?

Once the result item has been sent back to the parent process (once the result queue is empty), the view should be garbage collected in the worker process and the mmap buffer closed on the worker side.

Actually, the view of a memmap is also memmap and will be recreated as such in the parent process so we should actually not delete the folder in this case. To problem is likely different but this probably highlight a design issue....

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 3, 2019

Could you please open a dedicated PR with a new non regression test that mixes your minimal reproduction case that involves the view with the kind of asserts on tempfolder as done in the existing test_memmapping_leaks test?

@albertcthomas
Copy link
Contributor

Thanks for the explanations. I think we can close scikit-learn/scikit-learn#12546 as we cannot reproduce the error with the last versions of scikit-learn, pandas and joblib (on my Windows with scikit-learn 0.21.3, pandas 0.25.1 and joblib 0.14.0).

I will open a PR with a test involving the view.

@ogrisel ogrisel changed the title Random PermissionError on windows when using automatic memory mapping Random PermissionError on windows when using automatic memory mapping caused by reference cycles Oct 3, 2019
@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 3, 2019

I confirm that removing the cycle in the test makes the test pass. Forcing a global collect (gc.collect()) after each call also makes the test pass but the added cost would probably be not acceptable.

FYI here is the patch to add the gc.collect().

diff --git a/joblib/externals/loky/process_executor.py b/joblib/externals/loky/process_executor.py
index e9ae29e..74d7b48 100644
--- a/joblib/externals/loky/process_executor.py
+++ b/joblib/externals/loky/process_executor.py
@@ -426,7 +426,7 @@ def _process_worker(call_queue, result_queue, initializer, initargs,
         # Free the resource as soon as possible, to avoid holding onto
         # open files or shared memory that is not needed anymore
         del call_item
-
+        gc.collect()
         if _USE_PSUTIL:
             if _process_reference_size is None:
                 # Make reference measurement after the first call

@albertcthomas
Copy link
Contributor

Would it be possible to call gc.collect only when we catch the permission error in the delete_folder function of joblib/disk.py? so that gc.collect would be called only when the error is raised.
Also why is it happening with loky and not multiprocessing?

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 4, 2019

Would it be possible to call gc.collect only when we catch the permission error in the delete_folder function of joblib/disk.py? so that gc.collect would be called only when the error is raised.

No because gc.collect needs to be called in the worker loop while the folder deletion is performed by the parent process at the end of the call.

But we want to keep the worker open because it's costly to shut them down and respawn them at each parallel call.

@ogrisel
Copy link
Contributor Author

ogrisel commented Oct 4, 2019

Also why is it happening with loky and not multiprocessing?

With multiprocessing the workers are called prior to deleting the folder. Starting worker processes is cheap under Linux with the fork startmethod but it is unsafe and causes crashes with openmp. Loky was design to use the slower spawn start method but to hide the extra cost by reusing worker processes automatically across Parallel calls.

@albertcthomas
Copy link
Contributor

albertcthomas commented Oct 4, 2019

No because gc.collect needs to be called in the worker loop while the folder deletion is performed by the parent process at the end of the call.

But we want to keep the worker open because it's costly to shut them down and respawn them at each parallel call.

I see, thanks for the explanation.

@@ -7,3 +7,5 @@ codecov
distributed
lz4
threadpoolctl; python_version >= '3.5'
scikit-learn
pandas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this as well, now that you removed the test with scikit-learn and pandas.

pierreglaser added a commit to pierreglaser/joblib that referenced this pull request Feb 26, 2020
pierreglaser added a commit to pierreglaser/joblib that referenced this pull request Apr 22, 2020
@ogrisel
Copy link
Contributor Author

ogrisel commented Apr 24, 2020

Closing in favor of #966.

@ogrisel ogrisel closed this May 15, 2020
@ogrisel ogrisel deleted the windows-permission-error branch May 15, 2020 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants