
[MRG] Numpy pickle to single file #260

Merged
merged 6 commits into from May 10, 2016

Conversation

@aabadie (Contributor) commented Oct 23, 2015

I gave gzip.GzipFile a try for dumping numpy objects. This PR is a proof of concept of that usage and also contains a fairly large refactoring of numpy_pickle.

In terms of performance, I reused the benchmark script from #255. With numpy 1.10, there's no memory copy but the read/write is slightly slower (31s dump, 5.5s read). It also makes the code more 'readable'.
I had to comment out the test regarding the compatibility between pickles.

Waiting for the CI results.
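The gzip.GzipFile approach tried here can be sketched as follows (an illustration only, not the PR's actual code; the file path and object are arbitrary):

```python
# Minimal sketch (illustration only, not the PR's actual implementation)
# of dumping and reloading a numpy-containing object through gzip.GzipFile.
import gzip
import os
import pickle
import tempfile

import numpy as np

data = {'arr': np.arange(10, dtype='float64'), 'label': 'example'}
path = os.path.join(tempfile.mkdtemp(), 'obj.pkl.gz')

with gzip.GzipFile(path, 'wb') as f:
    pickle.dump(data, f)

with gzip.GzipFile(path, 'rb') as f:
    restored = pickle.load(f)

assert np.array_equal(restored['arr'], data['arr'])
assert restored['label'] == 'example'
```

GzipFile also computes a CRC on the stream, which is part of the overhead discussed below.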



@GaelVaroquaux (Member) commented Oct 24, 2015

@aabadie (Contributor, Author) commented Oct 26, 2015

How does it compare to what we currently have?

For an array of 763MB, it adds a read/write time overhead of approximately 10%: 28s write with the current master against 31s with this PR (4.8s against 5s for reading). This extra time is balanced by a stable memory consumption (for both compressed and non-compressed serialization), at least for recent versions of numpy, and can be explained by the CRC computation used in GzipFile, which is not performed when using zlib directly.
Note also that this PR is not fully complete; some improvements are missing:

  • the control file (containing the main serialized object without arrays/matrices) is not compressed
  • the whole object (including arrays) could be serialized in a single file using smart seeking in the file. This could be part of an extra PR.
  • last, but not least: this PR should also contain a backward-compatibility mechanism with previous versions of the cache format.

@aabadie (Contributor, Author) commented Nov 12, 2015

the control file (containing the main serialized object without arrays/matrices) is not compressed

This was done in 4f58113

the whole object (including arrays) could be serialized in a single file using smart seeking in the file. This could be part of an extra PR.

I just pushed caeb704 which does this. But it has several drawbacks:

  • tests are broken because the current state doesn't support memory mapping
  • I had to patch numpy locally in order to make the gzip file compression work. There, numpy overrides the GzipFile seek() method in order to be able to seek using negative offsets. The problem is that numpy assumes the file contains only one array, which is not our case here. The only solution is to include the numpy array load/save implementation in joblib.

last, but not least: this PR should also contain a backward-compatibility mechanism with previous versions of the cache format.

This is done in this branch (works in 4f58113)
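The single-file seeking idea can be sketched like this (a hypothetical layout with illustrative names, not joblib's actual format):

```python
# Hypothetical sketch of the single-file idea: several raw arrays written
# into one file at recorded offsets, then read back individually by seeking.
# Names and layout are illustrative, not joblib's actual format.
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'single.bin')
arrays = [np.arange(5, dtype='int32'), np.ones(3, dtype='float64')]

offsets = []
with open(path, 'wb') as f:
    for a in arrays:
        offsets.append(f.tell())   # remember where each array starts
        f.write(a.tobytes())

# Read back only the second array by seeking to its offset.
with open(path, 'rb') as f:
    f.seek(offsets[1])
    restored = np.frombuffer(f.read(arrays[1].nbytes), dtype='float64')

assert np.array_equal(restored, arrays[1])
```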

@GaelVaroquaux (Member) commented Nov 12, 2015

file only contains one array which is not our case here. The only
solution is to include numpy array load/save implementation in
joblib.

Can we then only reimplement the minimal amount that we need, and not
copy the whole codebase?

@aabadie (Contributor, Author) commented Nov 12, 2015

Can we then only reimplement the minimal amount that we need, and not
copy the whole codebase?

Yes, that's the plan. It should also help with the memmap case.


@aabadie aabadie force-pushed the gzip_pickling branch 7 times, most recently from d331808 to adacbd6 Nov 13, 2015
@aabadie aabadie changed the title [WIP] Gzip pickling [WIP] Numpy pickle to single file Nov 13, 2015
@aabadie aabadie changed the title [WIP] Numpy pickle to single file [MRG] Numpy pickle to single file Nov 13, 2015
@aabadie (Contributor, Author) commented Nov 13, 2015

As this PR is imho in good shape, I changed the status to MRG and renamed it to better reflect what it contains: objects are now serialized in a single file, even compressed files and memory maps.
Note that there is a basic backward-compatibility mechanism, which means that future versions of joblib will be able to read files containing objects cached with older versions. Dumping will use the new format version.

All remaining problems mentioned above are solved:

  • as suggested, the gzip.GzipFile issue was solved by reimplementing (not to say copy-pasting) the minimum amount of required code from numpy in joblib. I'm not sure I did that the best way, but at least it works now. The compression level is 3 (it can range from 0 to 9, as stated in the doc).
  • the memory mapping works as well using the same technique. I just had to slightly change the implementation of the numpy open_memmap function: I introduced a parameter for setting the offset in the file where the numpy array is serialized.
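The offset-based memory mapping can be illustrated with a small sketch (assumption: plain np.memmap with an offset argument stands in for the modified open_memmap, which is not shown here):

```python
# Illustrative sketch (not joblib's code) of memory-mapping an array stored
# at a known offset inside a larger file, similar in spirit to the offset
# parameter introduced for open_memmap.
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'multi.bin')

with open(path, 'wb') as f:
    f.write(b'HEADERXX')                     # 8 bytes of non-array data
    offset = f.tell()                        # the array starts here
    f.write(np.arange(4, dtype='int64').tobytes())

# Map only the array region, skipping the leading non-array bytes.
m = np.memmap(path, dtype='int64', mode='r', offset=offset, shape=(4,))
assert list(m) == [0, 1, 2, 3]
```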

Maybe some more tests could be added. I'll post some memory/speed benchmarks here to compare with the current implementation.

Appveyor is stuck, I don't know why.

Waiting for your comments.

@GaelVaroquaux (Member) commented Nov 13, 2015

For benchmarks, have a look at the following link (hint: it contains a link to a gist with benchmark code):
http://gael-varoquaux.info/programming/joblib-beta-release-fast-compressed-persistence-python-3.html

@aabadie (Contributor, Author) commented Nov 13, 2015

Thanks, I'll try it ASAP!


###############################################################################
# Utility objects for persistence.

class NDArrayWrapper(object):

class NPArrayWrapper(object):
@lesteve (Member) Nov 17, 2015

We need to find a better name than this.

@aabadie (Contributor, Author) Nov 18, 2015

We need to find a better name than this.

What about NumpyArrayWrapper?

@GaelVaroquaux (Member) Nov 18, 2015

@aabadie (Contributor, Author) commented Nov 17, 2015

As promised to @GaelVaroquaux, I tested your gist locally, just to compare with other implementations.

See the results:

  • Read speed with this PR: [plot: compare_libraries_read]
  • Read speed with master: [plot: compare_libraries_read_master]
  • Write speed with this PR: [plot: compare_libraries_write]
  • Write speed with master: [plot: compare_libraries_write_master]

The conclusion is that the speed is clearly slower with this PR. And as you noted in your blog post, gzip.GzipFile might not be an option for cache compression.

@aabadie (Contributor, Author) commented Nov 17, 2015

Looking at the plots, it seems that the datasets labels are mixed. I'll update them.

@aabadie (Contributor, Author) commented Nov 17, 2015

Looking at the plots, it seems that the datasets labels are mixed. I'll update them.

Replying to myself: I double-checked and everything is OK. It's just the write speed using F-contiguous data which has a significant impact on performance (especially when using pytables with zlib).

@GaelVaroquaux (Member) commented Nov 17, 2015

It doesn't seem to me that the impact of this PR on read/write speed is very large.

@GaelVaroquaux (Member) commented Nov 17, 2015

As a side note, when this PR is merged, it would be useful to test lzma for compression under Python 3: it could be much faster. I have added an enhancement issue for this: #273.

@aabadie (Contributor, Author) commented Nov 17, 2015

I also want to try the mmap case using read-write mode, especially when the persisted object contains multiple arrays. I'm wondering if the arrays could overlap at the end of the file, resulting in a corrupted file.
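A quick check of that overlap concern can be sketched as follows (a hypothetical layout, not joblib's code; it only shows that two maps at distinct offsets stay independent):

```python
# Sanity-check sketch (hypothetical layout, not joblib's code): two arrays
# memory-mapped at distinct offsets of the same file in 'r+' mode stay
# independent, i.e. writing through one does not corrupt the other.
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'two.bin')
with open(path, 'wb') as f:
    f.write(np.zeros(4, dtype='int64').tobytes())   # bytes 0..31
    f.write(np.zeros(4, dtype='int64').tobytes())   # bytes 32..63

m1 = np.memmap(path, dtype='int64', mode='r+', offset=0, shape=(4,))
m2 = np.memmap(path, dtype='int64', mode='r+', offset=32, shape=(4,))
m1[:] = 1
m1.flush()

assert list(m1) == [1, 1, 1, 1]
assert list(m2) == [0, 0, 0, 0]   # untouched: the regions do not overlap
```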

@GaelVaroquaux (Member) commented Nov 17, 2015

def __init__(self):
    self.array_float = np.arange(100, dtype='float64')
    self.array_int = np.ones(100, dtype='int32')
    self.array_obj = np.array(['a', 10, 20.0], dtype='object')
@ogrisel (Contributor) May 4, 2016

Nice :)

@aabadie aabadie force-pushed the gzip_pickling branch 2 times, most recently from a56236f to efd8604 May 5, 2016
@ogrisel (Contributor) commented May 9, 2016

Can you also please squash the commits? Commit messages such as "addressing other comments" are really not interesting.

@aabadie (Contributor, Author) commented May 9, 2016

Can you also please squash the commits?

@ogrisel : history rewritten :)



def _check_compression_format(filename, expected_list):
    if (not PY3_OR_LATER and (filename.endswith('xz') or
                              filename.endswith('lzma'))):
@ogrisel (Contributor) May 10, 2016

filename.endswith('.xz') or filename.endswith('.lzma')

@aabadie (Contributor, Author) May 10, 2016

Good catch! Just pushed the update.


# We are careful to open the file handle early and keep it open to
# avoid race-conditions on renames.
# That said, if data are stored in companion files, which can be
@ogrisel (Contributor) May 10, 2016

"data" is a mass noun so it should always be singular: "data is stored"

@ogrisel (Contributor) commented May 10, 2016

Thanks @aabadie, this LGTM. Merging.


@ogrisel ogrisel merged commit 8ed578b into joblib:master May 10, 2016
3 checks passed
@ogrisel (Contributor) commented May 10, 2016

🍻

@GaelVaroquaux (Member) commented May 10, 2016

Yehaaa!!!

@lesteve (Member) commented May 10, 2016

Great stuff!

@aabadie (Contributor, Author) commented May 10, 2016

Woohoo !! 🎉


                              'zerosize_ok'],
                       buffersize=buffersize,
                       order=self.order):
    pickler.file_handle.write(chunk.tostring('C'))
@mrocklin (Contributor) May 11, 2016
Question: is it possible to use the memoryview directly? This might avoid a copy

In [1]: import numpy as np

In [2]: x = np.ones(5)

In [3]: x.data
Out[3]: <memory at 0x7f201c81fe58>

In [4]: with open('foo.pkl', 'wb') as f:
    f.write(x.data)

@ogrisel (Contributor) May 12, 2016

I think we tried that but it breaks for some versions of Python / numpy, if I recall correctly. @aabadie, do you confirm?

Anyway, this is a small buffer so it introduces minimal memory overhead, and I don't think the performance overhead is significant.

@aabadie (Contributor, Author) May 12, 2016

Thanks @mrocklin for the suggestion.
@ogrisel, I ran some benchmarks/tests with this change. Indeed, it avoids a memory copy but, as you said, it's not significant since this code works with buffers of a few MB (16). Performance is the same.
It also works on all versions of Python/numpy supported by joblib.

@ogrisel (Contributor) May 12, 2016

Also I think this will result in storing incorrect data in the pickle when the array is not contiguous.

@aabadie (Contributor, Author) May 12, 2016

Just tried this using:

import numpy as np
import joblib

a = np.asarray(np.arange(100000000).reshape((1000, 500, 200)), order='F')[:, :1, :]
a.flags  # f_contiguous: False, c_contiguous: False, but aligned: True
joblib.dump(a, '/tmp/test.pkl')
np.allclose(a, joblib.load('/tmp/test.pkl'))  # Returns True

Seems ok to me.

@GaelVaroquaux (Member) May 12, 2016

a = np.asarray(np.arange(100000000).reshape((1000, 500, 200)), order='F')[:, :1, :]

Try with [::2]

@aabadie (Contributor, Author) May 12, 2016

Try with [::2]

Works

@ogrisel (Contributor) May 12, 2016

Interesting, then let's do a PR :)

@ogrisel (Contributor) May 12, 2016

I think np.nditer must return contiguous chunks whatever the input.
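This can be checked with a small sketch that mirrors the nditer flags used in the write loop above (the sliced input is an arbitrary non-contiguous example):

```python
# Sketch mirroring the nditer flags from the write loop: even for a
# non-contiguous input, the buffered chunks concatenate to the array's
# elements in the requested order.
import numpy as np

a = np.arange(12).reshape(3, 4)[:, ::2]      # non-contiguous view
assert not a.flags['C_CONTIGUOUS']

chunks = [np.array(c) for c in np.nditer(
    a, flags=['external_loop', 'buffered', 'zerosize_ok'], order='C')]
restored = np.concatenate(chunks)

assert np.array_equal(restored, np.array([0, 2, 4, 6, 8, 10]))
```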

@aabadie (Contributor, Author) May 12, 2016

then let's do a PR

Here it is : #352 :)
