
ENH: always make ndarrays from msgpack writable #12359

Closed

Conversation

llllllllll
Contributor

Addresses the case where 'compress' was not None. The old implementation
would decompress the data and then call np.frombuffer on a bytes
object. Because a bytes object is not a mutable buffer, the resulting
ndarray had writeable=False. The new implementation ensures that
pandas is the only owner of this new buffer and then sets it to mutable
without copying it. This means that we don't need to copy the
data coming in AND we can mutate it later. If we are not the only owner
of this data then we just copy it with np.fromstring.

raise ValueError("compress must be one of 'zlib' or 'blosc'")

values = decompress(values)
if sys.getrefcount(values) == 2:
Contributor

where did you get this beauty?

Contributor Author

I was just playing around with frombuffer and thinking of ways to avoid the copy here.

Contributor

Why can't we always just use np.frombuffer(buffer(values), dtype=....)?

then just set the writeable flag?

Contributor Author

The issue is that we would be working with memory allocated for a bytes object. For example:

```python
In [1]: buf = bytes(range(4))

In [2]: np.frombuffer(buf, dtype='uint8')
Out[2]: array([0, 1, 2, 3], dtype=uint8)

In [3]: _2.flags.writeable = True

In [4]: _2[0] = 5

In [5]: buf
Out[5]: b'\x05\x01\x02\x03'
```

The only case where we can just use the buffer is when we are the sole owner of values and no one can reference values again. In the other cases we need to just copy the data so we can mutate while preserving normal python semantics of immutable strings.
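For illustration, the choice between copying and viewing can be sketched outside of pandas (a minimal sketch, not the PR's code):

```python
import numpy as np

buf = bytes(range(4))
arr = np.frombuffer(buf, dtype='uint8')
assert not arr.flags.writeable       # a view over immutable bytes is read-only

# Copying gives a writable array, at the cost of a 2n memory high watermark:
copied = np.frombuffer(buf, dtype='uint8').copy()
copied[0] = 5
assert buf == bytes(range(4))        # the original bytes stay untouched

# A mutable source buffer yields a writable view with no copy at all:
mut = bytearray(range(4))
view = np.frombuffer(mut, dtype='uint8')
view[0] = 5
assert mut[0] == 5                   # view and bytearray share memory
```

The `bytearray` case is why a decompressor that returned a mutable buffer would make the whole trick unnecessary.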

Contributor

I see. Is this really a bottleneck, e.g. the copying? Does the compression help?

Contributor Author

The copying might be restrictive if you are unpacking a large dataframe because your high watermark is now 2n. We are unpacking data on a shared box so I would like to keep the memory usage low if possible. I also imagine that users who are using compression are receiving a large amount of data.

Contributor Author

I actually think the ideal fix would be for blosc and zlib to have unpack functions that give back bytearrays so that we can do this in a safer way. I will investigate if this is already available.

Contributor Author

It looks like there is no good way to get the compressors to return bytearrays in their current state.

Contributor Author

The implementation proposed in this PR is scary, so I tried to solve this problem upstream in blosc and zlib.

Hopefully we can swap to the more supported solutions when available and only use this hack when we need to.

@jreback
Contributor

jreback commented Feb 16, 2016

cc @kawochen

@jreback added the Msgpack and Performance (memory or execution speed) labels Feb 17, 2016
@@ -57,6 +59,20 @@ def check_arbitrary(a, b):
assert(a == b)


@contextmanager
Contributor

Isn't this like a unittest.Mock object? (well, what you are doing below)

Contributor Author

I thought that was only added to the stdlib in py3

@kawochen
Contributor

python magic! there might be performance warnings during debugging when the debugger holds more references

@llllllllll
Contributor Author

I think that warning while debugging is unfortunate but correct. Do you think that it would be worth amending the warning to add "or you are in a debugger"?

@kawochen
Contributor

I'm not sure what the best thing to do is. In general I believe relying on ref count is sketchy (e.g. count could be incremented between function calls). But then I think zlib doesn't really return anything that's shared, even for small strings (I didn't manage to get any performance warnings with crafted cases).

I'm only minimally familiar with the C-API. Is it possible that, although the ref count is 2, the string is interned during the query (maybe by getrefcount itself) and never destroyed, so subsequent attempts to use that string give unexpected results?

In moving data around, this n vs 2n memory issue is very common, but I think it is a legit concern, and in this case it does seem a bit wasteful.
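The interning worry raised above can be made concrete. In CPython (an implementation detail, not a language guarantee), the empty bytes object and all one-byte bytes objects are cached singletons, so a function that returned one of them would hand back a shared object whose refcount is already elevated:

```python
# Slicing out a single byte returns CPython's cached one-byte object,
# so two independently constructed slices are the very same object:
a = b'spam'[:1]
b = b'span'[:1]
assert a is b                        # shared singleton for b's'

# The empty bytes object is cached as well:
assert b'spam'[:0] is b'eggs'[:0]
```

As noted later in the thread, zlib and blosc build their results with `PyBytes_FromStringAndSize(NULL, n)`, which bypasses this cache, but the cache is exactly the kind of sharing a refcount-based ownership check has to guard against.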

@llllllllll
Contributor Author

Subsequent accesses to the bytes object will have unexpected results. Think of this as a move operation. I have checked the source for zlib and blosc to confirm that in their current state they will not be interning the strings; however, the test cases are to guard against changes to the implementation of these libraries. I also have open PRs upstream to both CPython and blosc that will make these functions return mutable buffers. We will need to support this hack for some time (especially with zlib) but at least there will be a more supported code path. I have an idea for a function that will factor out this logic a bit and ensure that users cannot get a reference back to the invalid bytes object. Right now this object is still exposed as df._data.blocks[n].values.base.base which is not ideal but also pretty deeply hidden.

@llllllllll
Contributor Author

@kawochen I am updating the PR with a new implementation that I think is both safer and faster (or at least the same speed).

With this new implementation:

```python
In [34]: pd.read_msgpack(df.to_msgpack(compress='blosc'))._data.blocks[0].values.base.base
Out[34]: <memory at 0x7f71f8e81750>
```

However with the old implementation the base was the bytes object that shared a buffer with the array.
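That object graph can be illustrated without pandas; here a `bytearray` stands in for the mutable buffer that the move produces (a sketch, not the PR's code):

```python
import numpy as np

buf = bytearray(range(8))                # stand-in for the moved buffer
view = memoryview(buf)
arr = np.frombuffer(view, dtype='uint8')

assert isinstance(arr.base, memoryview)  # the base chain ends in a memoryview
assert arr.flags.writeable               # the buffer it wraps is mutable
```

The point is that nothing reachable from the array is a bytes object whose immutability contract could be violated.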

Here are profiling runs to show the differences:

These cells are used for all three profiling runs:

```python
In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: packed = df.to_msgpack(compress='blosc')
```

Using `np.fromstring` to copy the data (shown only to make it clear why we cannot do this):

```python
In [3]: %timeit pd.read_msgpack(packed)
10 loops, best of 3: 168 ms per loop
```

Old implementation:

```python
In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.06 ms per loop
```

Updated implementation:

```python
In [5]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 2.68 ms per loop
```


Notes
-----
If you want to use this function you are probably wrong.
Contributor

gr8 comment!

@llllllllll
Contributor Author

Ugh, I forgot that memoryview has an obj attribute, which is exactly like ndarray's base attribute. I am investigating ways to drop this in pure Python, but I am starting to think that we might just need to use a little ctypes to kill this object.

@jreback
Contributor

jreback commented Feb 22, 2016

This is on master (not your branch)

```python
In [23]: %timeit pd.read_pickle('test.pkl')
100 loops, best of 3: 13.7 ms per loop

# zlib
In [22]: %timeit pd.read_msgpack('test.pak')
10 loops, best of 3: 144 ms per loop

# blosc
In [3]: %timeit pd.read_msgpack('test.pak')
100 loops, best of 3: 3.19 ms per loop

# uncompressed
In [5]: %timeit pd.read_msgpack('test.pak')
100 loops, best of 3: 18.5 ms per loop
```

blosc >> zlib

@llllllllll llllllllll force-pushed the msgpack-owns-new-memory branch 6 times, most recently from e513912 to 9c4a29b Compare February 23, 2016 08:34
@jreback
Contributor

jreback commented Feb 23, 2016

@llllllllll

This is on 0.17.1

```python
In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: packed = df.to_msgpack(compress='blosc')

In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 4.97 ms per loop
```

How did you get the above?

@jreback
Contributor

jreback commented Feb 23, 2016

I am amazed at how fast blosc is. We should turn this on by default if blosc is available.

@llllllllll
Contributor Author

Sorry, the really slow run was what would happen if we copied the data after doing blosc.decompress. This code was never on master; I just wanted to show that it was worthwhile to use a more complex approach to avoid this copy.

@llllllllll
Contributor Author

ping, how do you feel about this?

@kawochen
Contributor

no trace of bytes left 👍

@llllllllll
Contributor Author

hype

@jreback
Contributor

jreback commented Feb 26, 2016

what triggers the error on master?

values = blosc.decompress(values)
return np.frombuffer(values, dtype=dtype)
if compress:
if compress == u'zlib':
Contributor

I would try to import these at the top of the file and just set variables (_ZLIB_INSTALLED, _BLOSC_INSTALLED); and you can do whatever imports you need there.

Contributor Author

Done, added `_check_*` functions to raise a `ValueError` here if you try to use a compressor that is not installed.

@llllllllll
Contributor Author

Updated based on feedback. I also added standalone tests for move_into_mutable_buffer because it is not in util.

@jreback
Contributor

jreback commented Mar 17, 2016

IIRC this DID work on windows before.

```
[pandas3] C:\Users\conda\Documents\pandas3.5>python setup.py build_ext --inplace
running build_ext
skipping 'pandas\io/sas\saslib.c' Cython extension (up-to-date)
skipping 'pandas\src\period.c' Cython extension (up-to-date)
skipping 'pandas\hashtable.c' Cython extension (up-to-date)
skipping 'pandas\algos.c' Cython extension (up-to-date)
skipping 'pandas\index.c' Cython extension (up-to-date)
skipping 'pandas\parser.c' Cython extension (up-to-date)
skipping 'pandas\lib.c' Cython extension (up-to-date)
skipping 'pandas\tslib.c' Cython extension (up-to-date)
skipping 'pandas\src\sparse.c' Cython extension (up-to-date)
skipping 'pandas\src\testing.c' Cython extension (up-to-date)
skipping 'pandas\msgpack\_packer.cpp' Cython extension (up-to-date)
skipping 'pandas\msgpack\_unpacker.cpp' Cython extension (up-to-date)
building 'pandas.util._move' extension
creating build\temp.win-amd64-3.5\Release\pandas\util
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Miniconda2\envs\pandas3.5\lib\site-packages\numpy\core\include -IC:\Miniconda2\envs\pandas3.5\include -IC:\Miniconda2\envs\pandas3.5\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpandas/util/_move.c /Fobuild\temp.win-amd64-3.5\Release\pandas/util/_move.obj
_move.c
pandas/util/_move.c(201): error C2099: initializer is not a constant
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\amd64\\cl.exe' failed with exit status 2

[pandas3] C:\Users\conda\Documents\pandas3.5>
```

@llllllllll
Contributor Author

That is saying that my definition for stolenbuf_type is not constant, but it only uses addresses and int constants. Is this using PIC? Is there a way to get the compiler to emit the particular field that it does not think is constant? GCC by default will tell you which field it doesn't like.

@jreback
Contributor

jreback commented Mar 17, 2016

Joe Jevnik added 3 commits March 17, 2016 19:41
Addresses the case where 'compress' was not None. The old implementation
would decompress the data and then call np.frombuffer on a bytes
object. Because a bytes object is not a mutable buffer, the resulting
ndarray had writeable=False. The new implementation ensures that
pandas is the only owner of this new buffer and then sets it to mutable
without copying it. This means that we don't need to copy the
data coming in AND we can mutate it later. If we are not the only owner
of this data then we just copy it with np.fromstring.
In many capi functions for python the empty string and single characters
are memoized to return preallocated bytes objects. This could be a
potential problem with the implementation of pandas.io.packers.unconvert
so we are adding a test for this explicitly.

Currently neither zlib nor blosc hit this case because they use
PyBytes_FromStringAndSize(NULL, n) which does not pull from the shared
pool of bytes objects; however, this test will guard us against changes
to this implementation detail in the future.
Pulls the logic that does the buffer management into a new private
function named `_move_into_mutable_buffer`. The function acts like a
move constructor from 'bytes' to `memoryview`. We now only call
`np.frombuffer` on the resulting `memoryview` object. This accomplishes
the original task of getting a mutable array from a `bytes` object
without copying with the added safety of making the `base` field of the
array that is returned to the user a mutable `memoryview`. This
eliminates the risk of users obtaining the original `bytes` object which
is now pointing to a mutable buffer and violating the immutability
contract of `bytes` objects.
Joe Jevnik added 2 commits March 17, 2016 19:42
By writing our move function in C we can hide the original bytes object
from the user while still ensuring that the lifetime is managed
correctly. This implementation is designed to make it impossible to get
access to the invalid bytes object from pure python.
This function temporarily replaces an attribute on an object for the
duration of some context manager. This is useful for patching methods to
inject assertions or mock out expensive operations.
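A helper with those semantics can be sketched in a few lines (the name `patch` and the signature are illustrative, not necessarily what was added to the test suite):

```python
from contextlib import contextmanager


@contextmanager
def patch(obj, attr, new):
    # Temporarily replace ``obj.attr`` with ``new``, restoring the
    # original value even if the body raises.
    old = getattr(obj, attr)
    setattr(obj, attr, new)
    try:
        yield
    finally:
        setattr(obj, attr, old)
```

For example, `with patch(SomeClass, 'method', fake): ...` swaps in an assertion-injecting stand-in for the duration of the block and guarantees the original method comes back afterwards.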
@llllllllll
Contributor Author

Read some docs about windows dev and pushed a potential fix. Things still behave normally for me with gcc on gnu+linux; hopefully this builds on windows. Do you have an appveyor build I can check?

@jreback
Contributor

jreback commented Mar 17, 2016

You could set this up with appveyor (just create an account); all of the setup will build on your PRs. But it's VERY slow. I do it with a local vm.

class TestMove(tm.TestCase):
def test_more_than_one_ref(self):
"""Test case for when we try to use ``move_into_mutable_buffer`` when
the object being moved has other references.
Contributor

It's funny: when you use a doc-string it calls the test by this. I guess that is right.

Contributor Author

Should I just make this a comment instead or do you not mind?

@jreback
Contributor

jreback commented Mar 17, 2016

ok this passed for me on windows!

merging

@jreback
Contributor

jreback commented Mar 17, 2016

side issue, should prob add some more perf tests with msgpack to cover the bases.

@llllllllll
Contributor Author

Should I do that in this PR or open another PR with perf tests? Also, are those in test_perf.sh?

@jreback
Contributor

jreback commented Mar 17, 2016

This is amazing how fast blosc makes this.

No, we are using asv now (tests are basically in the same place though). Another PR is fine.

```python
In [8]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.45 ms per loop

In [9]: %timeit df.to_msgpack(compress='blosc')
100 loops, best of 3: 6.27 ms per loop

In [10]: %timeit df.to_msgpack()
100 loops, best of 3: 14.4 ms per loop

In [11]: packed = df.to_msgpack()

In [12]: %timeit pd.read_msgpack(packed)
10 loops, best of 3: 20.1 ms per loop
```

@jreback jreback closed this in 28ba08c Mar 18, 2016
@jreback
Contributor

jreback commented Mar 18, 2016

@llllllllll thanks!!

@jreback
Contributor

jreback commented Mar 18, 2016

had to make a trivial modification: jreback@6bba06c

we don't have blosc installed (on purpose) on some of the windows builds. I guess this actually failed travis (but I merged before) :<

@jreback
Contributor

jreback commented Mar 18, 2016

only fails on 2.7 which we were testing with blosc installed. fixed up.

@llllllllll
Contributor Author

Oh, sorry about that. I guess `ImportError` is probably more indicative of the problem.
