Data corruption in large numerical Pandas.DataFrame using joblib and shared memory #451

jbalfer opened this issue Dec 12, 2016 · 8 comments



jbalfer commented Dec 12, 2016

Problem description

When a Pandas DataFrame is used as a shared memory object for parallel computing using joblib and multiprocessing, the data gets corrupted.

In the example below, the column names and index stay correct, while the internal data (df.values) shows random values. However, if a NumPy array (simply df.values) is stored in the memory map instead of a DataFrame, everything works as expected.

This issue has already been submitted to Pandas, but is a joblib issue according to the developers.

We found that the error only occurs if all of the following criteria are met:

  • a Pandas DataFrame is used
  • the DataFrame contains numerical data (strings and other objects work as expected)
  • the DataFrame has a certain size, i.e., the example works as expected with small data
  • the DataFrame is stored and loaded using joblib.dump and joblib.load
  • more than one processor is used

Expected Output

We would expect all of the elements in results to be the same as the original DataFrame, so that
np.allclose(results[0], df) is True.

Minimal code example

import joblib
import multiprocessing
import numpy as np
import os
import pandas as pd
import tempfile

df = pd.DataFrame(np.random.random((1000,3)))

# df.head() is
# 0	1	2
# 0	0.204271	0.250022	0.333579
# 1	0.894605	0.262193	0.412623
# 2	0.260175	0.056537	0.084432
# 3	0.774219	0.824558	0.521398
# 4	0.414248	0.913390	0.735691

temp_folder = tempfile.mkdtemp()
filename = os.path.join(temp_folder, 'joblib_test.mmap')
if os.path.exists(filename):
    os.unlink(filename)
_ = joblib.dump(df, filename)
memmap = joblib.load(filename, mmap_mode='r') 

def myFunc(df):
    return df

results = joblib.Parallel(n_jobs=2)(joblib.delayed(myFunc)(memmap) for _ in range(10))

# results[0].head() is
# 	0	1	2
# 0	1.105907e+169	5.745991e+169	5.298748e+180
# 1	3.036710e+213	9.708092e+189	5.587160e+238
# 2	2.379254e+233	7.281947e+223	1.278075e+161
# 3	7.658495e+151	1.452701e-253	6.235120e-85
# 4	6.240245e-85	6.340871e+160	9.041151e+271

np.allclose(results[0], df) # False!

lesteve added the bug label Dec 12, 2016

lesteve commented Dec 12, 2016

Thanks a lot for putting a stand-alone snippet reproducing the issue.

I can reproduce it, so this does seem to be a bug that needs more investigation.


lesteve commented Dec 12, 2016

A work-around is to use numpy arrays. I am guessing that some of our code only works for np.memmap objects and not dataframes that are based on numpy arrays backed by np.memmap objects.
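
For illustration, here is a minimal sketch of that work-around, adapted from the reporter's example. It assumes the worker can operate on the plain array; the file name `joblib_values.mmap` and the helper `identity` are made up for this sketch.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1000, 3)))

# Dump the underlying NumPy array instead of the DataFrame itself.
temp_folder = tempfile.mkdtemp()
filename = os.path.join(temp_folder, 'joblib_values.mmap')  # hypothetical name
joblib.dump(df.values, filename)
values = joblib.load(filename, mmap_mode='r')


def identity(arr):
    # Hypothetical worker: returns the memory-mapped array unchanged.
    return arr


results = joblib.Parallel(n_jobs=2)(
    joblib.delayed(identity)(values) for _ in range(4)
)

# Compare each returned array with the original data.
all_equal = all(np.allclose(r, df.values) for r in results)
print(all_equal)
```

If the worker needs labels, the column names and index can be passed separately and the DataFrame rebuilt inside the worker with pd.DataFrame(arr, columns=...).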

KwangmooKoh commented

I could reproduce the issue with joblib 0.10.0 and above, but not with 0.8.4 or 0.9.4.

(Versions 0.8.4 and 0.9.4 were from the conda default channel and 0.10.3 was from conda-forge.)


lesteve commented Dec 31, 2016

Thanks @KwangmooKoh ! I did not think of trying with older joblib versions.

Running a git bisect tells me that the commit introducing the regression is 8ed578b. @aabadie any idea off the top of your head why the single file pickle could introduce such a regression?


aabadie commented Jan 3, 2017

@aabadie any idea off the top of your head why the single file pickle could introduce such a regression?

If pandas dataframes really use numpy arrays backed by np.memmap objects, my intuition is that the problem comes from an implicit conversion of the memmap to a numpy array (maybe here) when pickling. Comparing with the previous working version in 0.9.4 is not straightforward as this part was largely rewritten in 0.10. It needs more investigation.


lesteve commented Jan 3, 2017

After chatting with @ogrisel and investigating in more detail, the problem seems to be in joblib.pool.reduce_memmap. It looks like it does not handle offsets correctly when there is more than one memmap array in the dumped file.


lesteve commented Jan 4, 2017

The problem is that a memmap's offset attribute is not the offset into the file but that offset modulo mmap.ALLOCATIONGRANULARITY (which is 4096 on my machine, and is why everything works fine for small offsets). We therefore use the wrong offset when we try to reconstruct the memmap, hence the weird values we get.

I opened a PR in numpy about that, see numpy/numpy#8443 for more details.
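
To make the arithmetic concrete, here is a rough sketch using only the standard library. It does not call numpy or joblib; `split_offset` is a hypothetical helper that mimics splitting a file offset into an aligned mapping start plus a remainder, and the offsets 128 and granularity + 128 are arbitrary example values.

```python
import mmap


def split_offset(file_offset, granularity=mmap.ALLOCATIONGRANULARITY):
    """Split a byte offset into (aligned mapping start, remainder)."""
    remainder = file_offset % granularity
    aligned_start = file_offset - remainder
    return aligned_start, remainder


granularity = mmap.ALLOCATIONGRANULARITY
small_offset = 128                # below the allocation granularity
large_offset = granularity + 128  # above the allocation granularity

for true_offset in (small_offset, large_offset):
    aligned_start, remainder = split_offset(true_offset)
    # Keeping only the remainder loses the aligned part of the offset.
    survives = remainder == true_offset
    print(true_offset, aligned_start, remainder, survives)
```

For the small offset the remainder equals the true offset, so reopening the file at that position still works. For the large offset the remainder points at a different place in the file, which would match the garbage values seen once the dumped file holds more than one array.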


ogrisel commented Jan 5, 2017

FYI, it seems that mmap.ALLOCATIONGRANULARITY is 4 KiB on Linux and 64 KiB on Windows.


5 participants