
joblib.dump bug with too large object-type arrays #220

Closed
ngoix opened this issue Jun 23, 2015 · 3 comments
Labels: bug

Comments

@ngoix commented Jun 23, 2015

Issue from scikit-learn/scikit-learn#4889. The following code fails:

import joblib
import os
import numpy as np
from gzip import GzipFile
from io import BytesIO
from urllib2 import urlopen
from os.path import join
from sklearn.datasets import get_data_home

URL10 = ('http://archive.ics.uci.edu/ml/'
         'machine-learning-databases/kddcup99-mld/kddcup.data_10_percent.gz')

data_home = get_data_home()
kddcup_dir = join(data_home, "test")
samples_path = join(kddcup_dir, "samples")
os.makedirs(kddcup_dir)

# Download and decompress the KDD Cup 99 "10 percent" dataset
f = BytesIO(urlopen(URL10).read())
file = GzipFile(fileobj=f, mode='r')

# Parse each CSV line into a list of string fields
X = []
for line in file.readlines():
    X.append(line.replace('\n', '').split(','))

file.close()

# Store as an object-dtype array and round-trip it through joblib
X = np.asarray(X, dtype=object)
joblib.dump(X, samples_path, compress=9)
X = joblib.load(samples_path)

More precisely, it works if X has fewer than 300000 rows or if X is not of dtype object:

Y = X[:300000, :]  # works
joblib.dump(Y, samples_path, compress=9)
Y = joblib.load(samples_path)

Y = X[:400000, :].astype(str)  # works: not object dtype
joblib.dump(Y, samples_path, compress=9)
Y = joblib.load(samples_path)

Y = X[:400000, :]  # doesn't work
joblib.dump(Y, samples_path, compress=9)
Y = joblib.load(samples_path)
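
The dataset download is not actually needed to reproduce: a large enough synthetic object-dtype array hits the same failure (a rough sketch, assuming an affected joblib version; the exact row threshold may vary with the number of columns):

import numpy as np
import joblib

# Synthetic stand-in for the parsed KDD data: ~400000 rows of short strings,
# stored with dtype=object, which is what triggers the bug.
X_synth = np.asarray([['0'] * 42 for _ in range(400000)], dtype=object)

joblib.dump(X_synth, samples_path, compress=9)
X_synth = joblib.load(samples_path)  # raises the AssertionError shown below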

The error raised by joblib.load is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-32-25fe17420f08> in <module>()
----> 1 Y = joblib.load(samples_path)

/usr/lib/python2.7/dist-packages/joblib/numpy_pickle.pyc in load(filename, mmap_mode)
    422 
    423         try:
--> 424             obj = unpickler.load()
    425         finally:
    426             if hasattr(unpickler, 'file_handle'):

/usr/lib/python2.7/pickle.pyc in load(self)
    856             while 1:
    857                 key = read(1)
--> 858                 dispatch[key](self)
    859         except _Stop, stopinst:
    860             return stopinst.value

/usr/lib/python2.7/dist-packages/joblib/numpy_pickle.pyc in load_build(self)
    288                         "but numpy didn't import correctly")
    289             nd_array_wrapper = self.stack.pop()
--> 290             array = nd_array_wrapper.read(self)
    291             self.stack.append(array)
    292 

/usr/lib/python2.7/dist-packages/joblib/numpy_pickle.pyc in read(self, unpickler)
    158         array = unpickler.np.core.multiarray._reconstruct(*self.init_args)
    159         with open(filename, 'rb') as f:
--> 160             data = read_zfile(f)
    161         state = self.state + (data,)
    162         array.__setstate__(state)

/usr/lib/python2.7/dist-packages/joblib/numpy_pickle.pyc in read_zfile(file_handle)
     69     assert len(data) == length, (
     70         "Incorrect data length while decompressing %s."
---> 71         "The file could be corrupted." % file_handle)
     72     return data
     73 

AssertionError: Incorrect data length while decompressing <open file '/home/nicolas/scikit_learn_data/test/samples_01.npy.z', mode 'rb' at 0x7efcea714db0>.The file could be corrupted.

@lesteve what do you think?

@lesteve (Member) commented Jun 23, 2015

It looks like the root issue is that state[-1] here is a list for dtype=object rather than a string for other dtypes. That breaks an assumption our code makes further down the line.

There may be a way to fix that in our code but according to @ogrisel it may make more sense to try to refactor the compressed numpy pickles, for example trying to build on #115.
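
To illustrate the dtype-dependent difference (a quick sketch using numpy's reduce protocol directly, not joblib code; the printed reprs assume Python 2):

import numpy as np

a = np.asarray([['1', '2'], ['3', '4']], dtype='S1')
b = np.asarray([['1', '2'], ['3', '4']], dtype=object)

# __reduce__() returns (reconstructor, init_args, state); the last element of
# state holds the array data.
print(type(a.__reduce__()[2][-1]))  # <type 'str'>: a raw byte buffer
print(type(b.__reduce__()[2][-1]))  # <type 'list'>: a list of Python objects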

@lesteve (Member) commented Oct 29, 2015

A work-around for this problem is to set the 'cache_size' argument in joblib.load to a very high value so that your array size is smaller than 'cache_size'.

I am working on fixing this in master.

@lesteve (Member) commented Nov 2, 2015

A work-around for this problem is to set the 'cache_size' argument in joblib.load to a very high value so that your array size is smaller than 'cache_size'.

I meant the 'cache_size' argument in joblib.dump, not joblib.load.
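
A sketch of that work-around, assuming the cache_size keyword accepted by joblib.dump in the affected releases (the value is in megabytes; 10000 is just an arbitrarily large placeholder, and '/tmp/samples' a placeholder path):

import numpy as np
import joblib

X = np.asarray([['0'] * 42 for _ in range(400000)], dtype=object)

# With the array size below cache_size, dump should take the standard pickling
# path rather than the separate compressed-array handling that breaks for
# object-dtype arrays.
joblib.dump(X, '/tmp/samples', compress=9, cache_size=10000)
X = joblib.load('/tmp/samples')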
