New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: always make ndarrays from msgpack writable #12359

Closed
wants to merge 5 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@llllllllll
Contributor

llllllllll commented Feb 16, 2016

Addresses the case where 'compress' was not none. The old implementation
would decompress the data and then call np.frombuffer on a bytes
object. Because a bytes object is not a mutable buffer, the resulting
ndarray had writeable=False. The new implementation ensures that the
pandas is the only owner of this new buffer and then sets it to mutable
without copying it. This means that we don't need to do a copy of the
data coming in AND we can mutate it later. If we are not the only owner
of this data then we just copy it with np.fromstring.

@jreback

View changes

Show outdated Hide outdated pandas/io/packers.py Outdated
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 16, 2016

Contributor
Contributor

jreback commented Feb 16, 2016

Show outdated Hide outdated pandas/io/tests/test_packers.py Outdated
@kawochen

This comment has been minimized.

Show comment
Hide comment
@kawochen

kawochen Feb 18, 2016

Contributor

python magic! there might be performance warnings during debugging when the debugger holds more references

Contributor

kawochen commented Feb 18, 2016

python magic! there might be performance warnings during debugging when the debugger holds more references

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 18, 2016

Contributor

I think that warning while debugging is unfortunate but correct. Do you think that it would be worth amending the warning to add "or you are in a debugger"?

Contributor

llllllllll commented Feb 18, 2016

I think that warning while debugging is unfortunate but correct. Do you think that it would be worth amending the warning to add "or you are in a debugger"?

@kawochen

This comment has been minimized.

Show comment
Hide comment
@kawochen

kawochen Feb 22, 2016

Contributor

I'm not sure what the best thing to do is. In general I believe relying on ref count is sketchy (e.g. count could be incremented between function calls). But then I think zlib doesn't really return anything that's shared, even for small strings (I didn't manage to get any performance warnings with crafted cases).

I'm only minimally familiar with the C-API. Is it possible that, although the ref count is 2, the string is interned during the query (maybe by getrefcount itself) and never destroyed, so subsequent attempts to use that string give unexpected results?

In moving data around, this n vs 2n memory issue is very common, but I think it is a legit concern, and in this case it does seem a bit wasteful.

Contributor

kawochen commented Feb 22, 2016

I'm not sure what the best thing to do is. In general I believe relying on ref count is sketchy (e.g. count could be incremented between function calls). But then I think zlib doesn't really return anything that's shared, even for small strings (I didn't manage to get any performance warnings with crafted cases).

I'm only minimally familiar with the C-API. Is it possible that, although the ref count is 2, the string is interned during the query (maybe by getrefcount itself) and never destroyed, so subsequent attempts to use that string give unexpected results?

In moving data around, this n vs 2n memory issue is very common, but I think it is a legit concern, and in this case it does seem a bit wasteful.

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 22, 2016

Contributor

Subsequent accesses to the bytes object will have unexpected results. Think of this as a move operation. I have checked the source for zlib and blosc to confirm that in their current state they will not be interning the strings; however, the test cases are to guard against changes to the implementation of these libraries. I also have open PRs upstream to both CPython and blosc that will make these functions return mutable buffers. We will need to support this hack for some time (especially with zlib) but at least there will be a more supported code path. I have an idea for a function that will factor out this logic a bit and ensure that users cannot get a reference back to the invalid bytes object. Right now this object is still exposed as df._data.blocks[n].values.base.base which is not ideal but also pretty deeply hidden/

Contributor

llllllllll commented Feb 22, 2016

Subsequent accesses to the bytes object will have unexpected results. Think of this as a move operation. I have checked the source for zlib and blosc to confirm that in their current state they will not be interning the strings; however, the test cases are to guard against changes to the implementation of these libraries. I also have open PRs upstream to both CPython and blosc that will make these functions return mutable buffers. We will need to support this hack for some time (especially with zlib) but at least there will be a more supported code path. I have an idea for a function that will factor out this logic a bit and ensure that users cannot get a reference back to the invalid bytes object. Right now this object is still exposed as df._data.blocks[n].values.base.base which is not ideal but also pretty deeply hidden/

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 22, 2016

Contributor

@kawochen I am updating the PR with a new implementation that I think is both safer and faster (or at least the same speed).

With this new implementation:

In [34]: pd.read_msgpack(df.to_msgpack(compress='blosc'))._data.blocks[0].values.base.base
Out[34]: <memory at 0x7f71f8e81750>

However with the old implementation the base was the bytes object that shared a buffer with the array.

Here are profiling runs to show the differences:

These cells are used for all three profiling runs.

In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: packed = df.to_msgpack(compress='blosc')

Using `np.fromstring` to copy the data (this is just shown to make it clear why we cannot do this):

```python
In [3]: %timeit pd.read_msgpack(packed)
10 loops, best of 3: 168 ms per loop

Old implementation:

In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.06 ms per loop

Updated implementation:

In [5]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 2.68 ms per loop
Contributor

llllllllll commented Feb 22, 2016

@kawochen I am updating the PR with a new implementation that I think is both safer and faster (or at least the same speed).

With this new implementation:

In [34]: pd.read_msgpack(df.to_msgpack(compress='blosc'))._data.blocks[0].values.base.base
Out[34]: <memory at 0x7f71f8e81750>

However with the old implementation the base was the bytes object that shared a buffer with the array.

Here are profiling runs to show the differences:

These cells are used for all three profiling runs.

In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: packed = df.to_msgpack(compress='blosc')

Using `np.fromstring` to copy the data (this is just shown to make it clear why we cannot do this):

```python
In [3]: %timeit pd.read_msgpack(packed)
10 loops, best of 3: 168 ms per loop

Old implementation:

In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.06 ms per loop

Updated implementation:

In [5]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 2.68 ms per loop
Show outdated Hide outdated pandas/io/packers.py Outdated
Show outdated Hide outdated pandas/io/tests/test_packers.py Outdated
@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 22, 2016

Contributor

Ugh, I forgot that memoryview has an obj attribute which is exactly like ndarrays base attribute. I am investigating ways to drop this in pure python but I am starting to think that we might just need to use a little ctypes to kill this object.

Contributor

llllllllll commented Feb 22, 2016

Ugh, I forgot that memoryview has an obj attribute which is exactly like ndarrays base attribute. I am investigating ways to drop this in pure python but I am starting to think that we might just need to use a little ctypes to kill this object.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 22, 2016

Contributor

This is on master (not your branch)

In [23]: %timeit pd.read_pickle('test.pkl')
100 loops, best of 3: 13.7 ms per loop
# zlib
[22]: %timeit pd.read_msgpack('test.pak')
10 loops, best of 3: 144 ms per loop

# blosc
In [3]: %timeit pd.read_msgpack('test.pak')
100 loops, best of 3: 3.19 ms per loop

# uncompressed
In [5]: %timeit pd.read_msgpack('test.pak')
100 loops, best of 3: 18.5 ms per loop

blosc >> zlib

Contributor

jreback commented Feb 22, 2016

This is on master (not your branch)

In [23]: %timeit pd.read_pickle('test.pkl')
100 loops, best of 3: 13.7 ms per loop
# zlib
[22]: %timeit pd.read_msgpack('test.pak')
10 loops, best of 3: 144 ms per loop

# blosc
In [3]: %timeit pd.read_msgpack('test.pak')
100 loops, best of 3: 3.19 ms per loop

# uncompressed
In [5]: %timeit pd.read_msgpack('test.pak')
100 loops, best of 3: 18.5 ms per loop

blosc >> zlib

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 23, 2016

Contributor

@llllllllll

This is on 0.17.1

In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: In [2]: packed = df.to_msgpack(compress='blosc')

In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 4.97 ms per loop

How did you get the above?

Contributor

jreback commented Feb 23, 2016

@llllllllll

This is on 0.17.1

In [1]: df = pd.DataFrame({'a': np.arange(1000000), 'b': np.arange(1000000)})

In [2]: In [2]: packed = df.to_msgpack(compress='blosc')

In [3]: %timeit pd.read_msgpack(packed)
100 loops, best of 3: 4.97 ms per loop

How did you get the above?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 23, 2016

Contributor

I am amazed on how fast blosc is. we should turn this on by default if blosc is available.

Contributor

jreback commented Feb 23, 2016

I am amazed on how fast blosc is. we should turn this on by default if blosc is available.

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 23, 2016

Contributor

sorry, the really slow run was what would happen if we copied the data after doing blosc.decompress. This code was never on master I just wanted to show that it was worthwhile to use a more complex approach to avoid this copy.

Contributor

llllllllll commented Feb 23, 2016

sorry, the really slow run was what would happen if we copied the data after doing blosc.decompress. This code was never on master I just wanted to show that it was worthwhile to use a more complex approach to avoid this copy.

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 25, 2016

Contributor

ping, how do you feel about this?

Contributor

llllllllll commented Feb 25, 2016

ping, how do you feel about this?

@kawochen

This comment has been minimized.

Show comment
Hide comment
@kawochen

kawochen Feb 26, 2016

Contributor

no trace of bytes left 👍

Contributor

kawochen commented Feb 26, 2016

no trace of bytes left 👍

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 26, 2016

Contributor

hype

Contributor

llllllllll commented Feb 26, 2016

hype

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 26, 2016

Contributor

what triggers the error on master?

Contributor

jreback commented Feb 26, 2016

what triggers the error on master?

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 26, 2016

Contributor

which error?

Contributor

llllllllll commented Feb 26, 2016

which error?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 26, 2016

Contributor

the problem IIUC which you are solving is that you are getting a non writable buffer back right?
as there doesn't seem to be a perf issue with master

Contributor

jreback commented Feb 26, 2016

the problem IIUC which you are solving is that you are getting a non writable buffer back right?
as there doesn't seem to be a perf issue with master

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Feb 26, 2016

Contributor

There is no perf error with master. I was just demonstrating the difference between a copying approach to get mutability and my non-copying approach. Master does not copy but is also does not give a mutable ndarray. The error would be if you just set array.flags.writable = True because then you could do array.base and get a bytes object which should be immutable but now shares memory with a mutable array.

Contributor

llllllllll commented Feb 26, 2016

There is no perf error with master. I was just demonstrating the difference between a copying approach to get mutability and my non-copying approach. Master does not copy but is also does not give a mutable ndarray. The error would be if you just set array.flags.writable = True because then you could do array.base and get a bytes object which should be immutable but now shares memory with a mutable array.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 26, 2016

Contributor

can u show me a case where we get an exception ?

I get what u r doing - I am just want to see an example of the practical problem

Contributor

jreback commented Feb 26, 2016

can u show me a case where we get an exception ?

I get what u r doing - I am just want to see an example of the practical problem

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Feb 26, 2016

Contributor

I see u did an example way at the top
but I want to see s pandas operation that triggers the error
m

Contributor

jreback commented Feb 26, 2016

I see u did an example way at the top
but I want to see s pandas operation that triggers the error
m

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Mar 9, 2016

Contributor

I am not sure when or if I will be able to get to rewriting this in Cython. If someone with more cython experience wants to take this over to help get this done I wouldn't mind and could answer any questions / clarify when the comments are not enough.

Contributor

llllllllll commented Mar 9, 2016

I am not sure when or if I will be able to get to rewriting this in Cython. If someone with more cython experience wants to take this over to help get this done I wouldn't mind and could answer any questions / clarify when the comments are not enough.

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Mar 17, 2016

Contributor

ping @jreback

Contributor

llllllllll commented Mar 17, 2016

ping @jreback

@jreback jreback added this to the 0.18.1 milestone Mar 17, 2016

Show outdated Hide outdated doc/source/whatsnew/v0.18.0.txt Outdated
values = blosc.decompress(values)
return np.frombuffer(values, dtype=dtype)
if compress:
if compress == u'zlib':

This comment has been minimized.

@jreback

jreback Mar 17, 2016

Contributor

I would try to import these at the top of the file and just set variables (_ZLIB_INSTALLED, _BLOSC_INSTALLED); and you can do whatever imports you need there.

@jreback

jreback Mar 17, 2016

Contributor

I would try to import these at the top of the file and just set variables (_ZLIB_INSTALLED, _BLOSC_INSTALLED); and you can do whatever imports you need there.

This comment has been minimized.

@llllllllll

llllllllll Mar 17, 2016

Contributor

done, added _check_* functions to raise a valueerror here if you try to use a compressor that is not installed

@llllllllll

llllllllll Mar 17, 2016

Contributor

done, added _check_* functions to raise a valueerror here if you try to use a compressor that is not installed

@@ -510,6 +510,13 @@ def pxd(name):
extensions.append(ujson_ext)
# extension for pseudo-safely moving bytes into mutable buffers

This comment has been minimized.

@jreback

jreback Mar 17, 2016

Contributor

maybe put this in pandas.util._move?

@jreback

jreback Mar 17, 2016

Contributor

maybe put this in pandas.util._move?

This comment has been minimized.

@llllllllll

llllllllll Mar 17, 2016

Contributor

Okay, there may come a day where there are more general usages for this

@llllllllll

llllllllll Mar 17, 2016

Contributor

Okay, there may come a day where there are more general usages for this

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Mar 17, 2016

Contributor

Updated based on feedback. I also added standalone tests for move_into_mutable_buffer because it is not in util.

Contributor

llllllllll commented Mar 17, 2016

Updated based on feedback. I also added standalone tests for move_into_mutable_buffer because it is not in util.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 17, 2016

Contributor

IIRC this DID work on windows before.

[pandas3] C:\Users\conda\Documents\pandas3.5>python setup.py build_ext --inplace
running build_ext
skipping 'pandas\io/sas\saslib.c' Cython extension (up-to-date)
skipping 'pandas\src\period.c' Cython extension (up-to-date)
skipping 'pandas\hashtable.c' Cython extension (up-to-date)
skipping 'pandas\algos.c' Cython extension (up-to-date)
skipping 'pandas\index.c' Cython extension (up-to-date)
skipping 'pandas\parser.c' Cython extension (up-to-date)
skipping 'pandas\lib.c' Cython extension (up-to-date)
skipping 'pandas\tslib.c' Cython extension (up-to-date)
skipping 'pandas\src\sparse.c' Cython extension (up-to-date)
skipping 'pandas\src\testing.c' Cython extension (up-to-date)
skipping 'pandas\msgpack\_packer.cpp' Cython extension (up-to-date)
skipping 'pandas\msgpack\_unpacker.cpp' Cython extension (up-to-date)
building 'pandas.util._move' extension
creating build\temp.win-amd64-3.5\Release\pandas\util
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Miniconda2\envs\pandas3.5\lib\site-pa
ckages\numpy\core\include -IC:\Miniconda2\envs\pandas3.5\include -IC:\Miniconda2\envs\pandas3.5\include "-IC:\Program Files (x86)\Microsoft Visual Stu
dio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.
10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "
-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpandas/util
/_move.c /Fobuild\temp.win-amd64-3.5\Release\pandas/util/_move.obj
_move.c
pandas/util/_move.c(201): error C2099: initializer is not a constant
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\amd64\\cl.exe' failed with exit status 2

[pandas3] C:\Users\conda\Documents\pandas3.5>

Contributor

jreback commented Mar 17, 2016

IIRC this DID work on windows before.

[pandas3] C:\Users\conda\Documents\pandas3.5>python setup.py build_ext --inplace
running build_ext
skipping 'pandas\io/sas\saslib.c' Cython extension (up-to-date)
skipping 'pandas\src\period.c' Cython extension (up-to-date)
skipping 'pandas\hashtable.c' Cython extension (up-to-date)
skipping 'pandas\algos.c' Cython extension (up-to-date)
skipping 'pandas\index.c' Cython extension (up-to-date)
skipping 'pandas\parser.c' Cython extension (up-to-date)
skipping 'pandas\lib.c' Cython extension (up-to-date)
skipping 'pandas\tslib.c' Cython extension (up-to-date)
skipping 'pandas\src\sparse.c' Cython extension (up-to-date)
skipping 'pandas\src\testing.c' Cython extension (up-to-date)
skipping 'pandas\msgpack\_packer.cpp' Cython extension (up-to-date)
skipping 'pandas\msgpack\_unpacker.cpp' Cython extension (up-to-date)
building 'pandas.util._move' extension
creating build\temp.win-amd64-3.5\Release\pandas\util
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Miniconda2\envs\pandas3.5\lib\site-pa
ckages\numpy\core\include -IC:\Miniconda2\envs\pandas3.5\include -IC:\Miniconda2\envs\pandas3.5\include "-IC:\Program Files (x86)\Microsoft Visual Stu
dio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.
10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "
-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /Tcpandas/util
/_move.c /Fobuild\temp.win-amd64-3.5\Release\pandas/util/_move.obj
_move.c
pandas/util/_move.c(201): error C2099: initializer is not a constant
error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\amd64\\cl.exe' failed with exit status 2

[pandas3] C:\Users\conda\Documents\pandas3.5>

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Mar 17, 2016

Contributor

That is saying that my definition for stolenbuf_type is not constant but it only uses addesses and int constants, is this using PIC? is there a way to get the compiler to emit the particular field that it does not think is constant? GCC by default will tell you which field it doesn't like.

Contributor

llllllllll commented Mar 17, 2016

That is saying that my definition for stolenbuf_type is not constant but it only uses addesses and int constants, is this using PIC? is there a way to get the compiler to emit the particular field that it does not think is constant? GCC by default will tell you which field it doesn't like.

@jreback

This comment has been minimized.

Show comment
Hide comment

llllllllll added some commits Feb 16, 2016

ENH: always make ndarrays from msgpack writable
Addresses the case where 'compress' was not none. The old implementation
would decompress the data and then call np.frombuffer on a bytes
object. Because a bytes object is not a mutable buffer, the resulting
ndarray had writeable=False. The new implementation ensures that the
pandas is the only owner of this new buffer and then sets it to mutable
without copying it. This means that we don't need to do a copy of the
data coming in AND we can mutate it later. If we are not the only owner
of this data then we just copy it with np.fromstring.
TST: adds test for memoized empty string and chars
In many capi functions for python the empty string and single characters
are memoized to return preallocated bytes objects. This could be a
potential problem with the implementation of pandas.io.packers.unconvert
so we are adding a test for this explicitly.

Currently neither zlib nor blosc hit this case because they use
PyBytes_FromStringAndSize(NULL, n) which does not pull from the shared
pool of bytes objects; however, this test will guard us against changes
to this implementation detail in the future.
ENH: updates unconvert no-copy code to be safer
Pulls the logic that does the buffer management into a new private
function named `_move_into_mutable_buffer`. The function acts like a
move constructor from 'bytes' to `memoryview`. We now only call
`np.frombuffer` on the resulting `memoryview` object. This accomplishes
the original task of getting a mutable array from a `bytes` object
without copying with the added safety of making the `base` field of the
array that is returned to the user a mutable `memoryview`. This
eliminates the risk of users obtaining the original `bytes` object which
is now pointing to a mutable buffer and violating the immutability
contract of `bytes` objects.
ENH: reimplement `_move_into_mutable_buffer` in C.
By writing our move function in C we can hide the original bytes object
from the user while still ensuring that the lifetime is managed
correctly. This implementation is designed to make it impossible to get
access to the invalid bytes object from pure python.
TST: adds `pandas.util.testing.patch`
This function temporarily replaces a an attribute on an object for the
duration of some context manager. This is useful for patching methods to
inject assertions or mock out expensive operations.
@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Mar 17, 2016

Contributor

Read some docs about windows dev and pushed a potential fix. Things still behave normally for me with gcc on gnu+linux, hopefully this builds on windows. Do you have an appveyor build I can check?

Contributor

llllllllll commented Mar 17, 2016

Read some docs about windows dev and pushed a potential fix. Things still behave normally for me with gcc on gnu+linux, hopefully this builds on windows. Do you have an appveyor build I can check?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 17, 2016

Contributor

you could set this up with appveyor (just create an account), all of the setup will build on your PR's. but its VERY slow. I do with a local vm.

Contributor

jreback commented Mar 17, 2016

you could set this up with appveyor (just create an account), all of the setup will build on your PR's. but its VERY slow. I do with a local vm.

class TestMove(tm.TestCase):
def test_more_than_one_ref(self):
"""Test case for when we try to use ``move_into_mutable_buffer`` when
the object being moved has other references.

This comment has been minimized.

@jreback

jreback Mar 17, 2016

Contributor

its funny, when you use a doc-string it calls the test by this. I guess that is right.

@jreback

jreback Mar 17, 2016

Contributor

its funny, when you use a doc-string it calls the test by this. I guess that is right.

This comment has been minimized.

@llllllllll

llllllllll Mar 17, 2016

Contributor

Should I just make this a comment instead or do you not mind?

@llllllllll

llllllllll Mar 17, 2016

Contributor

Should I just make this a comment instead or do you not mind?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 17, 2016

Contributor

ok this passed for me on windows!

merging

Contributor

jreback commented Mar 17, 2016

ok this passed for me on windows!

merging

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 17, 2016

Contributor

side issue, should prob add some more perf tests with msgpack to cover the bases.

Contributor

jreback commented Mar 17, 2016

side issue, should prob add some more perf tests with msgpack to cover the bases.

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Mar 17, 2016

Contributor

Should I do that in this pr or open another pr with perf tests? Also are those in test_perf.sh?

Contributor

llllllllll commented Mar 17, 2016

Should I do that in this pr or open another pr with perf tests? Also are those in test_perf.sh?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 17, 2016

Contributor

This is amazing how fast blosc makes this.

no we are using asv now. (tests are basically in the same place though). another PR is fine.

In [8]:  %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.45 ms per loop

In [9]: %timeit df.to_msgpack(compress='blosc')
100 loops, best of 3: 6.27 ms per loop

In [10]: %timeit df.to_msgpack()
100 loops, best of 3: 14.4 ms per loop

In [11]: packed = df.to_msgpack()

In [12]:  %timeit pd.read_msgpack(packed)
10 loops, best of 3: 20.1 ms per loop
Contributor

jreback commented Mar 17, 2016

This is amazing how fast blosc makes this.

no we are using asv now. (tests are basically in the same place though). another PR is fine.

In [8]:  %timeit pd.read_msgpack(packed)
100 loops, best of 3: 3.45 ms per loop

In [9]: %timeit df.to_msgpack(compress='blosc')
100 loops, best of 3: 6.27 ms per loop

In [10]: %timeit df.to_msgpack()
100 loops, best of 3: 14.4 ms per loop

In [11]: packed = df.to_msgpack()

In [12]:  %timeit pd.read_msgpack(packed)
10 loops, best of 3: 20.1 ms per loop

@jreback jreback closed this in 28ba08c Mar 18, 2016

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 18, 2016

Contributor

@llllllllll thanks!!

Contributor

jreback commented Mar 18, 2016

@llllllllll thanks!!

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 18, 2016

Contributor

had to make a trivial modification: jreback@6bba06c

we don't have blosc installed (on purpose) on some of the windows builds. I guess this actually failed travis (but I merged before) :<

Contributor

jreback commented Mar 18, 2016

had to make a trivial modification: jreback@6bba06c

we don't have blosc installed (on purpose) on some of the windows builds. I guess this actually failed travis (but I merged before) :<

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 18, 2016

Contributor

only fails on 2.7 which we were testing with blosc installed. fixed up.

Contributor

jreback commented Mar 18, 2016

only fails on 2.7 which we were testing with blosc installed. fixed up.

@llllllllll

This comment has been minimized.

Show comment
Hide comment
@llllllllll

llllllllll Mar 18, 2016

Contributor

oh, sorry about that. I guess import error is probably more indicative of the problem

Contributor

llllllllll commented Mar 18, 2016

oh, sorry about that. I guess import error is probably more indicative of the problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment