Data corruption in large numerical Pandas.DataFrame using joblib and shared memory #451
When a Pandas DataFrame is used as a shared memory object for parallel computing using joblib, its numerical contents can be corrupted.
In the example below, the column names and index stay correct, while the internal data (the underlying values) is garbled.
This issue was originally submitted to Pandas, but according to the developers there it is a joblib issue.
We found that the error only occurs if several criteria are all met.
We would expect all of the elements in results to be the same as the original DataFrame, i.e. np.allclose(results, df) should return True.
Minimal code example
```python
import joblib
import multiprocessing
import numpy as np
import os
import pandas as pd
import tempfile

df = pd.DataFrame(np.random.random((1000, 3)))
# df.head() is
#           0         1         2
# 0  0.204271  0.250022  0.333579
# 1  0.894605  0.262193  0.412623
# 2  0.260175  0.056537  0.084432
# 3  0.774219  0.824558  0.521398
# 4  0.414248  0.913390  0.735691

temp_folder = tempfile.mkdtemp()
filename = os.path.join(temp_folder, 'joblib_test.mmap')
if os.path.exists(filename):
    os.unlink(filename)
_ = joblib.dump(df, filename)
memmap = joblib.load(filename, mmap_mode='r')

def myFunc(df):
    return df

results = joblib.Parallel(n_jobs=2)(
    joblib.delayed(myFunc)(memmap) for _ in range(10))
# results is a list of DataFrames; results[0].head() is
#                0              1              2
# 0  1.105907e+169  5.745991e+169  5.298748e+180
# 1  3.036710e+213  9.708092e+189  5.587160e+238
# 2  2.379254e+233  7.281947e+223   1.278075e+161
# 3  7.658495e+151  1.452701e-253    6.235120e-85
# 4   6.240245e-85  6.340871e+160  9.041151e+271

np.allclose(results, df)  # False!
```
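Until this is resolved, one workaround consistent with the example above is to memory-map only the raw numerical block with np.memmap and rebuild the DataFrame from the mapped values inside each worker, since plain ndarrays round-trip through a direct mapping reliably. A minimal sketch (the file name `values.mmap` is illustrative, and the joblib.Parallel call is omitted):

```python
import os
import tempfile

import numpy as np

# Sketch of a possible workaround: share the raw float block through
# np.memmap instead of dumping the whole DataFrame.
values = np.random.random((1000, 3))

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'values.mmap')

# Write the raw values to disk, then reopen them read-only so each worker
# could map the same file without copying it.
writable = np.memmap(path, dtype=values.dtype, shape=values.shape, mode='w+')
writable[:] = values
writable.flush()

shared = np.memmap(path, dtype=values.dtype, shape=values.shape, mode='r')
print(np.allclose(shared, values))  # prints: True
```

A worker could then call pd.DataFrame(shared) to get a frame view over the shared block without round-tripping the DataFrame through pickle.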
If pandas DataFrames really use numpy arrays backed by np.memmap objects, my intuition is that the problem comes from an implicit conversion of the memmap to a plain numpy array (maybe here) when pickling. Comparing with the previous working version, 0.9.4, is not straightforward, as this part was largely rewritten in 0.10. It needs more investigation.
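The implicit-conversion effect is easy to see in isolation: an np.asarray-style call on a memmap returns a view typed as a plain ndarray, so downstream code (such as a pickler) can no longer tell that the data is memory-mapped. A small sketch (the file name is illustrative):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'demo.mmap')
m = np.memmap(path, dtype='float64', mode='w+', shape=(4,))
m[:] = [1.0, 2.0, 3.0, 4.0]

# np.asarray returns a view over the same buffer, but typed as a plain
# ndarray: the memmap-ness is silently dropped.
view = np.asarray(m)
print(type(m).__name__, type(view).__name__)  # prints: memmap ndarray
print(np.array_equal(view, m))                # prints: True
```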
The problem is that a memmap is not pickled correctly: the memory-mapped buffer is not faithfully reconstructed on the worker side, so the workers end up reading garbage data.
I opened a PR in numpy about that, see numpy/numpy#8443 for more details.
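For context, a memmap carries extra state beyond an ordinary ndarray (backing file name, open mode, and byte offset), which helps explain why a pickling path that treats it as a generic ndarray can go wrong. A quick way to inspect that state (the file name is illustrative):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'state.mmap')
m = np.memmap(path, dtype='float64', mode='w+', shape=(3,))

# Attributes that tie the array to its backing file; a pickler that treats
# m as a generic ndarray ignores all of them.
print(m.filename, m.mode, m.offset)
```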