[MRG] Numpy pickle to single file #260
Conversation
Force-pushed from 8dae363 to 3213b72.
> the read/write is slightly slower (31s dump, 5.5s read).
How does it compare to what we currently have?
For an array of 763MB, it adds a read/write time overhead of approximately 10%: 28s write with current master against 31s with this PR (4.8s against 5s for reading). This extra time cost is offset by stable memory consumption (for both compressed and uncompressed serialization), at least for recent versions of numpy, and can be explained by the CRC computation used in GzipFile, which is not performed when using zlib directly.
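For illustration, here is a rough sketch of the two write paths being compared (the array size, path, and compression level are invented, and the PR's actual code differs):

```python
import gzip
import zlib

import numpy as np

a = np.random.rand(10 ** 6)  # small stand-in for the 763MB benchmark array

# Direct zlib usage (old approach): no CRC is computed, but the whole
# byte buffer and its compressed blob are materialized in memory at once.
blob = zlib.compress(a.tobytes(), 3)

# gzip.GzipFile (this PR): compressed bytes go straight to the file instead
# of accumulating as one big blob, at the price of the CRC32 checksum that
# GzipFile maintains internally on every write.
with gzip.GzipFile('/tmp/a.pkl.gz', 'wb', compresslevel=3) as f:
    f.write(a.tobytes())
```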
Force-pushed from a39fd5f to caeb704.
This was done in 4f58113
I just pushed caeb704 which does this. But it has several drawbacks:
This is done in this branch (works in 4f58113)
> Can we then only reimplement the minimal amount that we need, and not
Yes, that's the plan. It should also help with the memmap case.
Force-pushed from d331808 to adacbd6.
As this PR is imho in good shape, I changed the status to MRG and renamed it to better reflect what it contains: objects are now serialized in a single file, including compressed files and memory maps. All remaining problems mentioned above are solved:
Maybe some more tests could be added. I'll post here some memory/speed benchmarks to compare with the current implementation. Appveyor is stuck, I don't know why. Waiting for your comments.
For benchmarks, have a look at the following link (hint: there is a link
Thanks, I'll try it ASAP!
We need to find a better name than this.
What about `NumpyArrayWrapper`?
As promised to @GaelVaroquaux, I tested your gist locally, just to compare with other implementations. See the results: the conclusion is that the speed is clearly slower with this PR. And as you noticed in your blog post, gzip.GzipFile might not be an option for cache compression.
Looking at the plots, it seems the dataset labels are mixed up. I'll update them.
Replying to myself: I double-checked and everything is OK. It's just the write speed on F-contiguous data that has a significant impact on performance (especially when using pytables with zlib).
It doesn't seem to me that the impact of this PR on read/write speed is very large.
As a side note, when this PR is merged, it would be useful to test lzma for compression under Python 3: it could be much faster. I have added an enhancement issue for this: #273.
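Once merged, a first experiment could be as simple as swapping in `lzma.LZMAFile`, which exposes the same file-object interface as `gzip.GzipFile` (a sketch, Python 3 only; the path and preset are invented):

```python
import lzma

import numpy as np

a = np.arange(10 ** 6, dtype=np.float64)

# LZMAFile is file-like, so it could sit behind the same write path as
# GzipFile; 'preset' plays the role of the compression level.
with lzma.LZMAFile('/tmp/a.xz', 'wb', preset=3) as f:
    f.write(a.tobytes())

with lzma.LZMAFile('/tmp/a.xz', 'rb') as f:
    b = np.frombuffer(f.read(), dtype=a.dtype)

assert np.array_equal(a, b)
```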
I also want to try the mmap case using read-write mode, especially when the persisted object contains multiple arrays. I'm wondering if the arrays could overlap at the end of the file resulting in a corrupted file.
> I'm wondering if the arrays could overlap at the end of the file resulting in a corrupted file.
Sounds like we need a test :)
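Something along these lines, perhaps (a hypothetical sketch assuming a pytest-style `tmpdir` fixture; all names are invented):

```python
import numpy as np
import joblib

def test_multiple_arrays_do_not_overlap(tmpdir):
    filename = str(tmpdir / 'multi.pkl')
    arrays = [np.full(1000, i, dtype=np.float64) for i in range(5)]
    joblib.dump(arrays, filename)

    # Open read-write memory maps and mutate only the middle array.
    loaded = joblib.load(filename, mmap_mode='r+')
    loaded[2][:] = -1.0
    loaded[2].flush()

    # If the array offsets overlapped, the neighbours would now be corrupted.
    reloaded = joblib.load(filename, mmap_mode='r')
    for i in (0, 1, 3, 4):
        np.testing.assert_array_equal(reloaded[i], arrays[i])
```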
```python
env['filename'] = os.path.join(env['dir'], 'test.pkl')
print(80 * '_')
print('setup numpy_pickle')
print(80 * '_')
```
Please avoid printing stuff in tests.
Can you also please squash the commits? Commit messages such as "addressing other comments" are really not interesting.
@ogrisel: history rewritten :)
```python
def _check_compression_format(filename, expected_list):
    if (not PY3_OR_LATER and (filename.endswith('xz') or
                              filename.endswith('lzma'))):
```
filename.endswith('.xz') or filename.endswith('.lzma')
Good catch! Just pushed the update.
```python
# We are careful to open the file handle early and keep it open to
# avoid race-conditions on renames.
# That said, if data are stored in companion files, which can be
```
"data" is a mass noun so it should always be singular: "data is stored"
Thanks @aabadie, this LGTM. Merging.
🍻 |
Yehaaa!!! |
Great stuff! |
Woohoo!! 🎉
```python
for chunk in np.nditer(array,
                       flags=['external_loop', 'buffered',
                              'zerosize_ok'],
                       buffersize=buffersize,
                       order=self.order):
    pickler.file_handle.write(chunk.tostring('C'))
```
Question: is it possible to use the memoryview directly? This might avoid a copy
```python
In [1]: import numpy as np

In [2]: x = np.ones(5)

In [3]: x.data
Out[3]: <memory at 0x7f201c81fe58>

In [4]: with open('foo.pkl', 'wb') as f:
   ...:     f.write(x.data)
```
I think we tried that but it breaks for some versions of Python / numpy if I recall correctly. @aabadie do you confirm?
Anyway this is a small buffer, so it introduces minimal memory overhead and I don't think the performance overhead is significant.
Also I think this will result in storing incorrect data in the pickle when the array is not contiguous.
Just tried this using:

```python
import numpy as np
import joblib

a = np.asarray(np.arange(100000000).reshape((1000, 500, 200)), order='F')[:, :1, :]
a.flags  # f_contiguous: False, c_contiguous: False, but aligned: True
joblib.dump(a, '/tmp/test.pkl')
np.allclose(a, joblib.load('/tmp/test.pkl'))  # Returns True
```

Seems OK to me.
> a = np.asarray(np.arange(100000000).reshape((1000, 500, 200)), order='F')[:, :1, :]

Try with `[::2]`
> Try with `[::2]`

Works.
Interesting, then let's do a PR :)
I think np.nditer must return contiguous chunks regardless of the input.
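A quick way to convince ourselves (a sketch; sizes are arbitrary): stream a non-contiguous view through the same kind of loop as above and rebuild the array from the raw bytes.

```python
import numpy as np

a = np.arange(10000).reshape(100, 100)[::2, ::2]  # non-contiguous view

# tostring('C') forces a C-ordered copy of each chunk, so the bytes are
# correct even if a given chunk were handed back non-contiguous. For this
# positively-strided view, nditer's default 'K' order coincides with C order.
chunks = [chunk.tostring('C')
          for chunk in np.nditer(a, flags=['external_loop', 'buffered',
                                           'zerosize_ok'],
                                 buffersize=1024)]
b = np.frombuffer(b''.join(chunks), dtype=a.dtype).reshape(a.shape)
assert np.array_equal(a, b)
```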
I gave gzip.GzipFile a try for dumping numpy objects. This PR is a POC of that usage and also contains a fairly large refactoring of numpy_pickle.
In terms of performance, I reused the bench script from #255. With Numpy 1.10, there's no memory copy but the read/write is slightly slower (31s dump, 5.5s read). It also makes the code more 'readable'.
I had to comment out the test checking the compatibility between pickles.
Waiting for the CIs.
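For context, end-user usage does not change; only the on-disk layout becomes a single file, compressed or not. A minimal illustration (the path, shapes, and compression level are invented):

```python
import numpy as np
import joblib

obj = {'weights': np.random.rand(1000, 1000), 'labels': np.arange(10)}

# The array payloads land in the same file as the pickle stream.
joblib.dump(obj, '/tmp/obj.pkl.gz', compress=3)

restored = joblib.load('/tmp/obj.pkl.gz')
assert np.array_equal(obj['weights'], restored['weights'])
```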