Data corruption in large numerical Pandas.DataFrame using joblib and shared memory #451
Comments
Thanks a lot for putting together a stand-alone snippet reproducing the issue. I can reproduce it, so this does seem to be a bug that needs more investigation.
A work-around is to use numpy arrays. I am guessing that some of our code only works for np.memmap objects and not for dataframes based on numpy arrays backed by np.memmap objects.
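A minimal sketch of that work-around, assuming the dump/load pattern from the report below (the path and shapes are illustrative):

```python
import numpy as np
import pandas as pd
from joblib import dump, load

# Share only the raw ndarray (df.values) instead of the DataFrame itself,
# then rebuild a DataFrame around the memory-mapped array afterwards.
df = pd.DataFrame(np.random.randn(1000, 3), columns=list('abc'))

dump(df.values, '/tmp/values.joblib')               # plain ndarray, not a DataFrame
values = load('/tmp/values.joblib', mmap_mode='r')  # np.memmap backed by the file

df_shared = pd.DataFrame(values, columns=df.columns, index=df.index)
assert np.allclose(df_shared, df)                   # values come back intact
```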
I could reproduce the issue on version 0.10.0 and above, but not on 0.8.4 or 0.9.4. (Versions 0.8.4 and 0.9.4 were from the conda default channel, and 0.10.3 from conda-forge.)
Thanks @KwangmooKoh! I did not think of trying with older joblib versions. Running a git bisect tells me that the commit introducing the regression is 8ed578b. @aabadie any idea off the top of your head why the single-file pickle could introduce such a regression?
If pandas dataframes really use numpy arrays backed by np.memmap objects, my intuition is that the problem comes from an implicit conversion of the memmap to a numpy array (maybe here) when pickling. Comparing with the previous working version in 0.9.4 is not straightforward, as this part was largely rewritten in 0.10. It needs more investigation.
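That kind of implicit conversion is easy to demonstrate in isolation; a sketch (the file name is illustrative):

```python
import numpy as np

# np.asarray drops the np.memmap subclass and returns a plain ndarray
# view, so later code can no longer tell that the buffer is file-backed.
mm = np.memmap('/tmp/data.bin', dtype='float64', mode='w+', shape=(10,))
arr = np.asarray(mm)

print(type(mm))   # <class 'numpy.memmap'>
print(type(arr))  # <class 'numpy.ndarray'>
```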
Chatting with @ogrisel and investigating in more detail, the problem seems to be on the numpy side, in how memmap objects are pickled.
The problem is with how a memmap behaves in that code path. I opened a PR in numpy about that; see numpy/numpy#8443 for more details.
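As an illustration, one known memmap pickling quirk consistent with this discussion (assumed here, not necessarily the exact one addressed by the PR) is that a memmap survives pickling in name only:

```python
import pickle
import numpy as np

# The round-tripped object is still typed np.memmap, but it is a plain
# in-memory copy that is no longer attached to the file on disk.
mm = np.memmap('/tmp/data.bin', dtype='float64', mode='w+', shape=(3,))
mm[:] = [1.0, 2.0, 3.0]

clone = pickle.loads(pickle.dumps(mm))
print(type(clone))                       # <class 'numpy.memmap'>
print(getattr(clone, 'filename', None))  # None: not file-backed anymore
```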
Problem description
When a Pandas DataFrame is used as a shared-memory object for parallel computing with `joblib` and `multiprocessing`, the data gets corrupted. In the example below, the column names and index stay correct, while the internal data (`df.values`) shows random values. However, if a NumPy array (simply `df.values`) is stored in the memory map instead of a DataFrame, everything works as expected.

This issue was already submitted to Pandas, but according to the developers it is a joblib issue: pandas-dev/pandas#14840
We found that the error only occurs if all of the following criteria are met:

- the DataFrame is large and holds numerical data,
- the data is dumped with `joblib.dump` and loaded back with `joblib.load` (memory-mapped).
Expected Output
We would expect that all of the elements in `results` are the same as the original DataFrame and that `np.allclose(results[0], df)` is `True`.

Minimal code example
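A hedged reconstruction of a snippet matching the description above (the shape, temporary path, number of jobs, and the `identity` worker are illustrative assumptions, not the original code):

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import Parallel, delayed, dump, load

# A large numerical DataFrame, as described in the issue title.
df = pd.DataFrame(np.random.randn(50000, 10),
                  columns=['col%d' % i for i in range(10)])

# Dump to disk and load back memory-mapped so workers share the data.
path = os.path.join(tempfile.mkdtemp(), 'df.joblib')
dump(df, path)
shared_df = load(path, mmap_mode='r')

def identity(d):
    # Each worker simply hands the shared DataFrame back.
    return d

results = Parallel(n_jobs=2)(delayed(identity)(shared_df) for _ in range(4))

# Expected: True. On affected joblib versions (0.10.0+) the index and
# column names survive but the values come back corrupted, so this
# reportedly prints False.
print(np.allclose(results[0], df))
```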