Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Copy bytes object when unpickling an array #371

Merged
merged 2 commits into from

4 participants

@rlamy

Fixes issue #370 by always treating bytes objects as immutable, as they should be in py3.

@travisbot

This pull request fails (merged 339c35f into 26fed25).

@njsmith
Owner

IIUC this is disabling a potentially important optimization: this forces data that's stored in raw binary form to be loaded into memory twice. (First the data is loaded from a pickle file into a bytes object, then it's copied into the array.) This may increase peak memory usage dramatically, to the point that some people's code stops working...

It's a dubious optimization -- there's nothing that says you can't call pickle.loads() twice on the same string! -- but it's a traditional one, so if we want to stop it in general then we should probably make that a separate debate. For this pull request, is there any way to tell which bytes objects are safe to mutate like this, analogous to the CHECK_INTERNED call for strings? Or as a hack, only applying the optimization to bytes objects that are more than, like, a megabyte, would keep us safe from interpreter optimizations while still avoiding the memory overhead in the truly expensive cases.

@rlamy

In Python 3, bytes are documented as always unsafe to mutate. AFAICT, the only supported way of getting hold of binary data is to use the buffer protocol on an explicitly mutable object (e.g. a bytearray).

OTOH, I believe that it's only when mutating length-1 bytes that bad things can happen, provided nobody else has a reference to the object, so there's room for theoretically dangerous hacks - even though they might start to fail without warning if Python devs decide to optimise things.

Another thing to consider is that this PR only affects Python 3, so it won't break any legacy code.

@njsmith
Owner

Strings have always been documented as unsafe to mutate too... and people have been writing numpy code in Python 3 for a bit now already.

I'm not actually opposed to killing off this optimization. But IMO our options are (1) keep the optimization for now, as horrible as it is, and just use this PR to make the minimal change to fix the bug at hand, or (2) take the discussion to the mailing list, since most of the people who might be affected by the change aren't reading this.

@teoliphant
Owner

I don't think we should remove this optimization until Python provides better hooks to the pickling and unpickling system that does not require creating large string objects twice. This has been there a long time. I don't understand the actual "bug" being reported.

@teoliphant
Owner

I see the problem now (interning of bytes). I think the solution is that we should make copies for small arrays, but still create the "view" for larger arrays. We could do this on all versions of Python.

@rlamy

@teoliphant: Python3 does provide better "hooks": bytearrays and the buffer protocol. But using them would be a significant undertaking and would probably break backwards compatibility.

So I guess that making copies only for small arrays is the way to go. The remaining question is what should the limit be? AFAICT, anything strictly larger than 1 is OK for now.

@teoliphant
Owner

@rlamy The limit can be something a bit larger like 1000, I would think.

@rlamy rlamy Re-enable unpickling optimization for large py3k bytes objects.
Mutating a bytes object is theoretically unsafe, but doesn't cause any
problem in any existing version of CPython.
d183928
@rlamy

Let's use 1000, then.

@travisbot

This pull request passes (merged d183928 into 26fed25).

@teoliphant teoliphant merged commit fd15162 into from
@rlamy rlamy deleted the branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jul 31, 2012
  1. @rlamy
Commits on Aug 3, 2012
  1. @rlamy

    Re-enable unpickling optimization for large py3k bytes objects.

    rlamy authored
    Mutating a bytes object is theoretically unsafe, but doesn't cause any
    problem in any existing version of CPython.
This page is out of date. Refresh to see the latest.
View
5 numpy/core/src/multiarray/methods.c
@@ -1587,8 +1587,9 @@ array_setstate(PyArrayObject *self, PyObject *args)
/* Check that the string is not interned */
if (!_IsAligned(self) || swap || PyString_CHECK_INTERNED(rawdata)) {
#else
- /* Bytes are never interned */
- if (!_IsAligned(self) || swap) {
+ /* Bytes should always be considered immutable, but we just grab the
+ * pointer if they are large, to save memory. */
+ if (!_IsAligned(self) || swap || (len <= 1000)) {
#endif
npy_intp num = PyArray_NBYTES(self);
fa->data = PyDataMem_NEW(num);
View
8 numpy/core/tests/test_regression.py
@@ -1601,6 +1601,14 @@ def test_pickle_string_overwrite(self):
s = re.sub("a(.)", "\x01\\1", "a_")
assert_equal(s[0], "\x01")
+ def test_pickle_bytes_overwrite(self):
+ if sys.version_info[0] >= 3:
+ data = np.array([1], dtype='b')
+ data = pickle.loads(pickle.dumps(data))
+ data[0] = 0xdd
+ bytestring = "\x01 ".encode('ascii')
+ assert_equal(bytestring[0:1], '\x01'.encode('ascii'))
+
def test_structured_type_to_object(self):
a_rec = np.array([(0,1), (3,2)], dtype='i4,i8')
a_obj = np.empty((2,), dtype=object)
Something went wrong with that request. Please try again.