Copy bytes object when unpickling an array #371

Merged 2 commits into numpy:master from rlamy:bytes-pickle on Aug 3, 2012



rlamy commented Jul 31, 2012

Fixes issue #370 by always treating bytes objects as immutable, as they should be in py3.

This pull request fails (merged 339c35f into 26fed25).

njsmith commented Jul 31, 2012

IIUC this is disabling a potentially important optimization: this forces data that's stored in raw binary form to be loaded into memory twice. (First the data is loaded from a pickle file into a bytes object, then it's copied into the array.) This may increase peak memory usage dramatically, to the point that some people's code stops working...

It's a dubious optimization -- there's nothing that says you can't call pickle.loads() twice on the same string! -- but it's a traditional one, so if we want to stop it in general then we should probably make that a separate debate. For this pull request, is there any way to tell which bytes objects are safe to mutate like this, analogous to the CHECK_INTERNED call for strings? Or as a hack, only applying the optimization to bytes objects that are more than, like, a megabyte, would keep us safe from interpreter optimizations while still avoiding the memory overhead in the truly expensive cases.

rlamy commented Aug 1, 2012

In Python 3, bytes objects are documented as always unsafe to mutate. AFAICT, the only supported way of getting a writable view of binary data is to use the buffer protocol on an explicitly mutable object (e.g. a bytearray).
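The distinction can be seen in the stdlib itself: the buffer protocol reports writability, so a consumer can tell a mutable buffer apart from an immutable one. A minimal sketch (an editorial illustration, not code from this PR):

```python
# The buffer protocol exposes a readonly flag, so consumers can tell a
# mutable buffer (bytearray) apart from an immutable one (bytes).
ro = memoryview(b"\x01\x02\x03\x04")             # view over immutable bytes
rw = memoryview(bytearray(b"\x01\x02\x03\x04"))  # view over a mutable bytearray

print(ro.readonly)  # True: writing through this view raises TypeError
print(rw.readonly)  # False: safe to write in place
rw[0] = 99          # mutates the underlying bytearray, no copy made
```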

OTOH, I believe that bad things can only happen when mutating length-1 bytes objects (which CPython caches); anything longer is safe provided nobody else holds a reference to the object. So there's room for theoretically dangerous hacks - even though they might start to fail without warning if the Python devs decide to optimise things.
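The length-1 hazard comes from a CPython implementation detail: every single-byte bytes object is cached and shared. A quick sketch (behaviour observed on CPython; not a language guarantee):

```python
# CPython keeps a cache of all 256 length-1 bytes objects, so every
# bytes([65]) returns the same shared object; mutating it in place
# would silently corrupt every other holder of b"A".
a = bytes([65])
b = bytes([65])
print(a is b)  # True on CPython: both names refer to one cached object

# Longer bytes built at runtime are not cached, so each call is private:
c = bytes([65, 66])
d = bytes([65, 66])
print(c is d)  # False on CPython: two distinct objects
```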

Another thing to consider is that this PR only affects Python 3, so it won't break any legacy code.

njsmith commented Aug 1, 2012

Strings have always been documented as unsafe to mutate too... and people have been writing numpy code in Python 3 for a bit now already.

I'm not actually opposed to killing off this optimization. But IMO our options are (1) keep the optimization for now, as horrible as it is, and just use this PR to make the minimal change to fix the bug at hand, or (2) take the discussion to the mailing list, since most of the people who might be affected by the change aren't reading this.


teoliphant commented:

I don't think we should remove this optimization until Python provides better hooks into the pickling and unpickling system that do not require creating large string objects twice. This has been there a long time. I don't understand the actual "bug" being reported.


teoliphant commented:

I see the problem now (interning of bytes). I think the solution is that we should make copies for small arrays but still create the "view" for larger arrays. We could do this on all versions of Python.

rlamy commented Aug 1, 2012

@teoliphant: Python 3 does provide better "hooks": bytearrays and the buffer protocol. But using them would be a significant undertaking and would probably break backwards compatibility.

So I guess that making copies only for small arrays is the way to go. The remaining question is: what should the limit be? AFAICT, any limit strictly larger than 1 is OK for now, since only length-1 bytes objects are cached.


teoliphant commented:

@rlamy The limit can be something a bit larger, like 1000, I would think.

rlamy added a commit: "Re-enable unpickling optimization for large py3k bytes objects. Mutating a bytes object is theoretically unsafe, but doesn't cause any problem in any existing version of CPython."
rlamy commented Aug 3, 2012

Let's use 1000, then.
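Schematically, the compromise adopted here looks like the following. This is only a Python sketch of the idea; the real change lives in NumPy's C unpickling code, and `buffer_for_array` is a hypothetical name:

```python
COPY_THRESHOLD = 1000  # the limit agreed on in this thread

def buffer_for_array(data: bytes):
    """Pick the storage for an array being unpickled from `data`.

    Small payloads are copied, since CPython may cache small bytes
    objects and mutating a shared object would corrupt it.  Large
    payloads reuse the pickle's buffer to avoid holding the data in
    memory twice at peak.
    """
    if len(data) < COPY_THRESHOLD:
        return bytearray(data)  # private, safely mutable copy
    return data                 # share the buffer: theoretically unsafe, but cheap
```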

This pull request passes (merged d183928 into 26fed25).

teoliphant merged commit fd15162 into numpy:master on Aug 3, 2012

1 check passed: the Travis build passed.
rlamy deleted the rlamy:bytes-pickle branch on Feb 1, 2013