Consider an operation between an array and a scalar:
result = arr + scalar
Our job is to pick result.dtype.
In numpy before 1.6, if arr and scalar had the same basic kind of dtype (e.g. both some kind of float), then we simply used arr.dtype as the result dtype -- so in particular, if arr.dtype was float32 and scalar.dtype was float64, then the result was float32. Perhaps a surprising special case the first time you see it, but the rule is easy to remember, and it provides important usability (since scalar here is usually a Python literal, and Python literals always have a fixed size that shouldn't be causing upcasting all the time).
In numpy 1.6, this was changed so that the value of scalar is also taken into account. As far as I can tell, the rule is now: if the scalar cannot be exactly represented using arr.dtype, then we go ahead and perform an upcast to a type that can hold the value of scalar and is also consistent with arr.dtype. So for example it's entirely possible to have int8 + int32 give you an int16. Examples:
In : arr = np.array([1.], dtype=np.float32)
In : (arr + (2**16-1)).dtype
Out: dtype('float32')
In : (arr + (2**16)).dtype
Out: dtype('float64')
In : arr = np.array([1.], dtype=np.int8)
In : (arr + 127).dtype
Out: dtype('int8')
In : (arr + 128).dtype
Out: dtype('int16')
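For reference, np.result_type (new in 1.6, IIRC) applies the same value-based promotion as far as I understand, so it gives a quick way to see what dtype these rules pick -- a small sketch, assuming a NumPy version with the 1.6-style behaviour:

import numpy as np

arr = np.array([1], dtype=np.int8)
# With the 1.6-style value-based rules, the promoted dtype depends on
# whether the scalar's value fits in arr.dtype:
np.result_type(arr, 127)  # dtype('int8')  -- 127 fits in int8
np.result_type(arr, 128)  # dtype('int16') -- 128 does not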
This change is documented (a bit obscurely) here:
It's also been discussed on the list:
This change has caused a fair amount of confusion, and AFAICT everyone who's expressed an opinion has disliked it. But, in the discussions so far it's been hard to separate out from all the other changes made in 1.6 (or even tell which changes we're talking about), so no consensus has been reached. So, we need to decide what to do about this.
1. Keep the current behaviour as is, 128 + int8 -> int16
2. Revert to 1.5-and-earlier behaviour, 128 + int8 -> int8
3. Revert to 1.5-and-earlier behaviour when it comes to selecting an output dtype, but then use safe casting rules when coercing the input to this dtype. (Right now, IIUC, we use the exact same rules for scalar + arr and np.add(scalar, arr, dtype=selected_dtype), i.e., we treat the selected dtype as if it were explicitly requested and do an unsafe forced cast. The suggestion here is to treat the implicitly-selected output dtype as genuinely implicit, and use more conservative casting.) Result: 128 + int8 -> error (and you can use the full np.add(...) form if you really meant this). A rough sketch of this check appears below.
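As a rough sketch of the check that option 3 proposes -- a hypothetical helper, not an existing NumPy function -- using can_cast with the "safe" rule as the test:

import numpy as np

def scalar_binop(arr, scalar, op=np.add):
    # Hypothetical sketch of option 3: keep arr.dtype as the output dtype,
    # but refuse the implicit conversion unless the scalar casts safely.
    if not np.can_cast(scalar, arr.dtype, casting="safe"):
        raise TypeError("cannot implicitly cast %r to %s; use np.%s(arr, scalar, dtype=...) if you really meant this"
                        % (scalar, arr.dtype, op.__name__))
    return op(arr, scalar, dtype=arr.dtype)

scalar_binop(np.array([1], dtype=np.int8), 127)  # works, stays int8
scalar_binop(np.array([1], dtype=np.int8), 128)  # raises TypeError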
Right now I like option 3 the best but am open to discussion.
CC: @delallea, @matthew-brett
I agree that simplicity and non-dependence on value is both easier to implement and easier to understand. Perhaps a warning if the scalar value falls outside the bounds of the array type would also be useful. There is a tradeoff here between unexpected, hard to track down bugs due to downcast scalar values, and unexpected memory usage due to upcast array types.
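A minimal sketch of the kind of bounds check such a warning could use, assuming integer dtypes and a hypothetical helper name:

import numpy as np
import warnings

def warn_if_out_of_bounds(scalar, dtype):
    # Hypothetical: warn when an integer scalar falls outside the range of
    # the array's integer dtype, as suggested above.
    info = np.iinfo(dtype)
    if not (info.min <= scalar <= info.max):
        warnings.warn("%r is out of range for %s" % (scalar, np.dtype(dtype)),
                      RuntimeWarning)

warn_if_out_of_bounds(128, np.int8)  # warns: int8 range is -128..127
warn_if_out_of_bounds(127, np.int8)  # silent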
I'm +1 on @njsmith's proposal.
(3) sounds reasonable. Alternatively, a warning or error if any of the output values is altered, not just if the scalar is altered. Otherwise you get this, which may still be surprising:
>>> np.array([2], dtype=np.int8) + 127
array([-127], dtype=int8)
>>> np.array([2], dtype=np.int8) + 128
array([130], dtype=int16)
@rgommers: That's not a casting issue, though, that's just a rollover during an int8 + int8 -> int8 operation. It might be useful to have a way to check for rollover on integer operations, but it's a totally separate issue from this casting stuff.
@njsmith it's only a separate issue if you already expect the scalar to be cast to int8 first. There are also functions in numpy/scipy that work internally in higher precision and then cast the result back to the required dtype. In that case I expect identical answers for both lines.
In 1.6.1 the result is actually identical now:
>>> np.array([2], dtype=np.int8) + 128
>>> np.array([2], dtype=np.int8) + 127
So currently it does check the value of the output, and not of the scalar input, to determine the output dtype, like I suggested.
I can't reproduce that with 1.6.2...
In : np.__version__
Out: '1.6.2'
In : np.array([2], dtype=np.int8) + 128
Out: array([130], dtype=int16)
In : np.array([2], dtype=np.int8) + 127
Out: array([-127], dtype=int8)
Anyway, there's no way ufuncs in general should be working at higher precision internally! Think about what this would mean. We'd have to either perform each ufunc twice (once while discarding the results, just to see how large they were, before allocating the output array), or else allocate a huge high-precision array, perform the ufunc there, and then scan it to decide on the proper output dtype, allocate a second (smaller) array, and downcast. Either way would make our users totally freak out, and rightfully so.
Also consider other ufuncs like * or **, where the size of the input operands is a much worse guide to the size of the outputs. Does your 1.6.1 return an int32 for
np.array([2], dtype=np.int8) ** 20
No int32's. This is just weird:
>>> np.array([2], dtype=np.int8) ** 20
>>> np.array([2], dtype=np.int8) ** 126
>>> np.array([2], dtype=np.int8) ** 127
Anyway, if 1.6.1 and 1.6.2 already don't agree then it's best to just ignore what 1.6.1 does.
Your option 3 seems the best to me, in that it is easy to explain and I believe it is consistent with the 'respect-your-dtype' zen of numpy.
Option 2 will change the output of 1.6.x code and option 3 will cause new errors for all numpy versions. What is the best way of introducing the changes?
Sounds like we probably have consensus on option 3?
As for how to get there: I think option 3 produces an error if, and only if, you have code where 1.5 and 1.6 produce different results. (Does that sound correct? It looks rather subtle actually, but I think it should be true...) So I guess now there are two options:
Option A: Keep the 1.6/1.7 behaviour for version $NEXT but add a DeprecationWarning whenever there otherwise would have been an error. In version $NEXT+1, start erroring out. (This is the "following the rules" approach.)
Option B: Just go ahead and add the error to version $NEXT. The rationale here would be that we already silently altered the semantics of people's code in 1.6, so the sooner we get an error in place, the more likely people will be alerted to any bugs we introduced. (So this is arguably the "following the spirit of the rules" approach.)
A warning in 1.8 would be better imho. FutureWarning instead of DeprecationWarning, because that actually shows up.
Before I cast my vote, I'd like to make sure I fully understand the proposal. I realize I'm not sure what a "safe" cast is exactly. The doc I can (easily) find talks about whether it's safe or not to cast from one dtype to another, but here my understanding is that deciding on the "safety" of the cast is based on the actual numeric value, not just the dtype.
Do you have the exact definition of what you mean by "safe casting rules"?
Should any of this go into 1.7, or do you think we should release 1.7 as it is, and work on this for 1.8?
Shouldn't put any functional changes like this into 1.7 anymore, because 1.7 RC1 is already out.
@delallea: That's a good question about what "safe" casting actually means. Concretely, it means, can_cast(value, new_dtype, casting="safe") returns True. But what does that mean... I wanted to say it's the same rule we use for other implicit casts, but actually when I went looking, this doesn't seem to be true at all.
Most of the time, we just do "unsafe" casting, i.e., anything goes. I can't actually find any concrete examples of python code where numpy will fail to perform a cast between, say, floats and ints. These are all silently accepted:
np.take([1.5], [0], out=np.empty(1, dtype=int))
np.empty(10, dtype=int)[:] = 1.5
(Though in 1.7 that last one will issue a deprecation warning, using the "same_kind" casting rules, which are looser than the "safe" rules.)
The only two places that seem to use "safe" are:
can_cast(A, B, "safe") == True
Okay, so there doesn't seem to be much usage guidance as to what people think "safe" (or any of the other casting rules) should mean. Maybe we can at least figure out how they behave right now. (Though keeping in mind that the current behaviour might be buggy, and we'd have no way to tell, because we don't know what the correct behaviour is!)
From some experimentation and the docs, I think the intended meaning of the rules is roughly: "no" allows no casting at all, "equiv" allows only byte-order changes, "safe" allows only casts where the values remain exactly representable, "same_kind" additionally allows casts within a kind (e.g. float64 -> float32), and "unsafe" allows anything.
And things like "exactly representable" are properties of values, not of types. So the above rules should be taken to refer to whether we can cast a particular value to a particular type, and then type->type rules are defined by whether all possible values for the former type can be cast to the latter type.
However, be warned that there are tons of exceptions to these descriptions!
# 1.0 can be exactly represented as an int
In : np.can_cast(1.0, int, "safe")
Out: False
# 2**16 can be exactly represented in a float32 (which has 23 bits of significand)
In : np.can_cast(2**16, np.float32, "safe")
Out: False
# A complex value with zero imaginary part can *only* be cast to real with "unsafe" rules
In : np.can_cast(1+0j, float, "same_kind")
Out: False
I guess these are just bugs?
And also, the same_kind rules seem totally useless. The idea of "exactly representable" makes sense as something you might want. The idea of "use a bigger hammer" (unsafe, C-style) makes sense as something you might want. But in between I would expect something like "preserves relative precision", so that you allow casting float64 -> float32, but not complex (with non-zero imaginary part) -> float or float (with non-zero fractional part) -> int or large integer value -> int8. This seems like the natural rule for array assignment checks as well. But in fact we don't have any such rule! It's impossible to distinguish between integer rollover and simple loss of precision in this system.
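To make the distinction concrete, here is a rough, hypothetical value-level version of such a "preserves relative precision" rule (not an existing NumPy API; the name is made up):

import numpy as np

def preserves_magnitude(value, dtype):
    # Hypothetical rule: losing precision is fine, but dropping an imaginary
    # part, truncating a fractional part, or rolling over an integer is not.
    dtype = np.dtype(dtype)
    if np.iscomplexobj(value) and dtype.kind != 'c':
        return np.imag(value) == 0 and preserves_magnitude(np.real(value), dtype)
    if dtype.kind in 'iu':
        if float(value) != int(value):
            return False  # would truncate a fractional part
        info = np.iinfo(dtype)
        return info.min <= int(value) <= info.max  # reject rollover
    return True  # e.g. float64 -> float32: precision loss alone is allowed

preserves_magnitude(1e6, np.float32)  # True  (only precision is lost)
preserves_magnitude(1.5, np.int8)     # False (fraction would be dropped)
preserves_magnitude(300, np.int8)     # False (would roll over)
preserves_magnitude(1 + 0j, float)    # True  (imaginary part is zero)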
Thanks @njsmith for the detailed explanation. I'm glad I asked because it doesn't seem obvious ;)
I'm not sure about the "raise error if safe cast doesn't work" proposal anymore. For floats, I believe we want "float32 array + Python float scalar" to always work and yield a float32 array, even when the Python float scalar can't be represented without loss of precision as a float32 (like 1.1).
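For example (a quick illustration, not tied to any particular proposal):

import numpy as np

# 1.1 has no exact float32 representation, so a strict "exactly
# representable" check would reject it...
np.float32(1.1) == 1.1  # False

# ...yet float32 array + Python float should arguably stay float32:
(np.array([1.0], dtype=np.float32) + 1.1).dtype  # dtype('float32') on current releases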
I agree the can_cast examples you show look like bugs.
I'm running out of time to spend discussing this topic, so that's probably all from me today, but I'll keep reading and thinking about it...
Probably this doesn't matter directly for this discussion, but for arrays (that are not 0-d) the values are not actually inspected -- I assume for performance reasons, since checking values inside ufunc preparations would likely be unreasonable:
In : np.can_cast(np.array([0], dtype=np.int16), np.int8, casting='safe')
Out: False
In : np.can_cast(np.array(0, dtype=np.int16), np.int8, casting='safe')
Out: True
I did not check the code, but my guess is that integer scalars (including 0-d arrays) simply have some extra logic here, and the complex case uses the array-level safe rule, which does not inspect the actual values. Maybe one could even introduce such a check, something like can_represent.
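A tiny sketch of what such a value-level check could look like -- can_represent here is a made-up name, not an existing NumPy function:

import numpy as np

def can_represent(value, dtype):
    # Hypothetical: does the value survive a round trip through dtype?
    cast = np.array(value).astype(dtype)
    return bool(np.array(value) == cast)

can_represent(127, np.int8)       # True
can_represent(128, np.int8)       # False (wraps around)
can_represent(2**16, np.float32)  # True, unlike can_cast(2**16, np.float32, "safe")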