numpy/numpy


Issue #2878: Decide whether the value of scalars (not just the type) should affect result type of mixed scalar/array operations

Open · opened this Issue · 19 comments · 7 participants

Owner

Consider an operation between an array and a scalar:

```
result = arr + scalar
```

Our job is to pick `result.dtype`.

In numpy before 1.6, if `arr` and `scalar` had the same basic kind of dtype (e.g. both some kind of float), then we simply used `arr.dtype` as the result dtype -- so in particular, if `arr.dtype` was `float32` and `scalar.dtype` was `float64`, then the result was `float32`. Perhaps a surprising special case the first time you see it, but the rule is easy to remember, and it provides important usability (since `scalar` here is usually a python literal, and python literals always have a fixed size that shouldn't be causing upcasting all the time).
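The pre-1.6 rule can be sketched in a few lines of plain Python (illustrative only -- `KIND` and `pre16_result_dtype` are made-up names, not NumPy's actual implementation, and only a handful of dtypes are modelled):

```python
# Toy model of the pre-1.6 promotion rule described above: when the
# scalar's "kind" matches the array's kind, the array's dtype wins,
# no matter how wide the scalar's dtype is.
KIND = {'int8': 'i', 'int16': 'i', 'int32': 'i', 'int64': 'i',
        'float32': 'f', 'float64': 'f'}

def pre16_result_dtype(arr_dtype, scalar_dtype):
    if KIND[arr_dtype] == KIND[scalar_dtype]:
        # Same kind: the scalar never forces an upcast.
        return arr_dtype
    # Mixed kinds fall back to ordinary promotion (simplified here:
    # a float operand always wins over an integer one).
    return arr_dtype if KIND[arr_dtype] == 'f' else scalar_dtype
```

So a `float64` python literal added to a `float32` array leaves the result as `float32`, while an int plus a float still upcasts to float.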

In numpy 1.6, this was changed so that the value of `scalar` was also taken into account. So far as I can tell, the rule is now: if `scalar` cannot be exactly represented using `arr.dtype`, then we go ahead and perform an upcast to a type that can hold the value of `scalar` and is also consistent with `arr.dtype`. So for example it's entirely possible for `int8 + int32` to give you an `int16`. Examples:

```
In [19]: arr = np.array([1.], dtype=np.float32)

In [20]: (arr + (2**16-1)).dtype
Out[20]: dtype('float32')

In [21]: (arr + (2**16)).dtype
Out[21]: dtype('float64')

In [25]: arr = np.array([1.], dtype=np.int8)

In [26]: (arr + 127).dtype
Out[26]: dtype('int8')

In [27]: (arr + 128).dtype
Out[27]: dtype('int16')
```

This change is documented (a bit obscurely) here:

It's also been discussed on the list:

This change has caused a fair amount of confusion, and AFAICT everyone who's expressed an opinion has disliked it. But, in the discussions so far it's been hard to separate out from all the other changes made in 1.6 (or even tell which changes we're talking about), so no consensus has been reached. So, we need to decide what to do about this.

Some options:
1. Keep the current behaviour as is, `128 + int8 -> int16`
2. Revert to 1.5-and-earlier behaviour, `128 + int8 -> int8`
3. Revert to 1.5-and-earlier behaviour when it comes to selecting an output dtype, but then use `safe` casting rules when coercing the input to this dtype. (Right now IIUC we use the exact same rules for `scalar + arr` and `np.add(scalar, arr, dtype=selected_dtype)`, i.e., we treat the selected dtype as if it were explicitly requested and do an `unsafe` forced cast. The suggestion here is to treat the implicitly-selected output dtype as if it were implicitly selected, and use more conservative casting.) Result: `128 + int8 -> error`, (and you can use the full `np.add(...)` form if you really meant this).
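Option 3's decision procedure might look roughly like this (a sketch under the assumptions above; `INT_RANGE` and `option3_add_dtype` are hypothetical names, not a proposed API, and only signed integer dtypes are modelled):

```python
# Pick the output dtype from the array alone (the 1.5 rule), then
# refuse the operation if the scalar cannot be safely cast to it.
INT_RANGE = {'int8': (-2**7, 2**7 - 1),
             'int16': (-2**15, 2**15 - 1),
             'int32': (-2**31, 2**31 - 1)}

def option3_add_dtype(arr_dtype, scalar):
    lo, hi = INT_RANGE[arr_dtype]
    if not lo <= scalar <= hi:
        # The explicit np.add(arr, scalar, dtype=...) form would
        # still be available to force the unsafe cast.
        raise TypeError("scalar %d cannot be safely cast to %s"
                        % (scalar, arr_dtype))
    return arr_dtype
```

Under this sketch `option3_add_dtype('int8', 127)` returns `'int8'`, while `option3_add_dtype('int8', 128)` raises.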

Right now I like option 3 the best but am open to discussion.

referenced this issue Open

Issue #2542: Upcasting a float32 array to complex64 instead of complex128 when adding a complex128 scalar (Trac #1949)

referenced this issue Closed

Issue #561: Document changes in casting rules for 1.7 release (Trac #2101)

Owner

I agree that simplicity and non-dependence on value is both easier to implement and easier to understand. Perhaps a warning if the scalar value falls outside the bounds of the array type would also be useful. There is a tradeoff here between unexpected, hard to track down bugs due to downcast scalar values, and unexpected memory usage due to upcast array types.

Owner

I'm +1 on @njsmith's proposal.

Owner

(3) sounds reasonable. Alternatively, a warning or error if any of the output values is altered, not just if the scalar is altered. Otherwise you get this, which may still be surprising:

```
>>> np.array([2], dtype=np.int8) + 127
-127
>>> np.array([1], dtype=np.int8) + 128
ValueError
```

Owner

@rgommers: That's not a casting issue, though, that's just a rollover during an int8 + int8 -> int8 operation. It might be useful to have a way to check for rollover on integer operations, but it's a totally separate issue from this casting stuff.
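The rollover itself is ordinary modular arithmetic; a quick pure-Python model of it (`int8_wrap` is just an illustration of two's-complement wrapping, not NumPy code):

```python
def int8_wrap(x):
    # Reduce an integer to the two's-complement int8 range [-128, 127].
    return (x + 128) % 256 - 128

# 2 + 127 = 129 overflows int8 and wraps around to -127.
print(int8_wrap(2 + 127))   # -127
# 1 + 126 = 127 still fits, so no wrap.
print(int8_wrap(1 + 126))   # 127
```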

Owner

@njsmith it's only a separate issue if you already expect the scalar to be cast to int8 first. There are also functions in numpy/scipy that work internally in higher precision and then cast the result back to the required dtype. In that case I expect identical answers for both lines.

In 1.6.1 the result is actually identical now:

```
>>> np.array([1], dtype=np.int8) + 128
array([129], dtype=int16)
>>> np.array([2], dtype=np.int8) + 127
array([129], dtype=int16)
```

So currently it checks the value of the output, not of the scalar input, to determine the output dtype, like I suggested.

Owner

I can't reproduce that with 1.6.2...

```
In [4]: np.__version__
Out[4]: '1.6.2'

In [5]: np.array([1], dtype=np.int8) + 128
Out[5]: array([129], dtype=int16)

In [6]: np.array([2], dtype=np.int8) + 127
Out[6]: array([-127], dtype=int8)
```

Anyway, there's no way ufuncs in general should be working at higher precision internally! Think about what this would mean. We'd have to either perform each ufunc twice (once while discarding the results, just to see how large they were, before allocating the output array), or else allocate a huge high-precision array, perform the ufunc there, and then scan it to decide on the proper output dtype, allocate a second (smaller) array, and downcast. Either way would make our users totally freak out, and rightfully so.

Also consider other ufuncs like `*` or `**`, where the size of the input operands is a much worse guide to the size of the outputs. Does your 1.6.1 return an int32 for

```
np.array([2], dtype=np.int8) ** 20
```

?

Owner

No int32's. This is just weird:

```
>>> np.__version__
'1.6.1'
>>> np.array([2], dtype=np.int8) ** 20
array([0], dtype=int8)
>>> np.array([2], dtype=np.int8) ** 126
array([0], dtype=int8)
>>> np.array([2], dtype=np.int8) ** 127
array([-32768], dtype=int16)
```

Anyway, if 1.6.1 and 1.6.2 already don't agree then it's best to just ignore what 1.6.1 does.

Your option 3 seems the best to me, in that it is easy to explain and I believe it is consistent with the 'respect-your-dtype' zen of numpy.

Option 2 will change the output of 1.6.x code and option 3 will cause new errors for all numpy versions. What is the best way of introducing the changes?

Owner

Sounds like we probably have consensus on option 3?

As for how to get there: I think option 3 produces an error if, and only if, you have code where 1.5 and 1.6 produce different results. (Does that sound correct? It looks rather subtle actually, but I think it should be true...) So I guess now there are two options:

Option A: Keep the 1.6/1.7 behaviour for version $NEXT but add a DeprecationWarning whenever there otherwise would have been an error. In version $NEXT+1, start erroring out. (This is the "following the rules" approach.)

Option B: Just go ahead and add the error in version $NEXT. The rationale here would be that we already silently altered the semantics of people's code in 1.6, so the sooner we get an error in place, the more likely people will be alerted to any bugs we introduced. (So this is arguably the "following the spirit of the rules" approach.)

Owner

A warning in 1.8 would be better imho. FutureWarning instead of DeprecationWarning, because that actually shows up.

Owner

Before I cast my vote, I'd like to make sure I fully understand the proposal. I realize I'm not sure what a "safe" cast is exactly. The doc I can (easily) find talks about whether it's safe or not to cast from one dtype to another, but here my understanding is that deciding on the "safety" of the cast is based on the actual numeric value, not just the dtype.

Do you have the exact definition of what you mean by "safe casting rules"?

Collaborator

Should any of this go into 1.7, or do you think we should release 1.7 as it is, and work on this for 1.8?

Owner

Shouldn't put any functional changes like this into 1.7 anymore, because 1.7 RC1 is already out.

Owner

@delallea: That's a good question about what "safe" casting actually means. Concretely, it means, `can_cast(value, new_dtype, casting="safe")` returns True. But what does that mean... I wanted to say it's the same rule we use for other implicit casts, but actually when I went looking, this doesn't seem to be true at all.

Most of the time, we just do "unsafe" casting, i.e., anything goes. I can't actually find any concrete examples of python code where numpy will fail to perform a cast between, say, floats and ints. These are all silently accepted:

```
np.array([1.5], dtype=int)
np.take([1.5], [0], out=np.empty(1, dtype=int))
np.empty(10, dtype=int)[0] = 1.5
```

(Though in 1.7 that last one will issue a deprecation warning -- but one based on the `same_kind` casting rules, which are looser than the `safe` rules.)

The only two places that seem to use "safe" are:

• `PyArray_FromArray` uses the safe casting unless the `FORCECAST` flag is set. But I haven't found any way to trigger this in practice from the Python level; most places seem to set `FORCECAST`.
• Ufunc type selection: if we have an array of type A and a ufunc loop of type B, then we're allowed to use this loop iff `can_cast(A, B, "safe") == True`.

Okay, so there doesn't seem to be much usage guidance as to what people think "safe" (or any of the other casting rules) should mean. Maybe we can at least figure out how they behave right now. (Though keeping in mind that the current behaviour might be buggy, and we'd have no way to tell, because we don't know what the correct behaviour is!)

From some experimentation and the docs, I think the intended meaning of the rules is:

• `no`: no casting
• `equiv`: only byte-swapping, no representation changes
• `safe`: input must be exactly representable (no overflow or loss of precision)
• `same_kind`: input exactly representable, or we stay within one of the "kinds" (signed int, unsigned int, real floating point, complex floating point are the big four). So rollover is fine, as long as it's integer -> integer rollover.
• `unsafe`: anything goes

And things like "exactly representable" are properties of values, not of types. So the above rules should be taken to refer to whether we can cast a particular value to a particular type, and then type->type rules are defined by whether all possible values for the former type can be cast to the latter type.
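"Exactly representable" in that value-based sense is easy to check directly; here's a small sketch using only the standard library (`fits_float32` and `fits_int8` are made-up helper names, not NumPy API):

```python
import struct

def fits_float32(x):
    # A float is exactly representable in float32 iff packing it into
    # 32-bit IEEE format and unpacking it again returns the same value.
    return struct.unpack('f', struct.pack('f', x))[0] == x

def fits_int8(n):
    return -2**7 <= n <= 2**7 - 1

print(fits_float32(2.0**16))      # True: 2**16 is exact in float32
print(fits_float32(2.0**24 + 1))  # False: needs > 24 significand bits
print(fits_int8(127), fits_int8(128))  # True False
```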

However, be warned that there are tons of exceptions to these descriptions!

```
# 1.0 can be exactly represented as an int
In [20]: np.can_cast(1.0, int, "safe")
Out[20]: False

# 2**16 can be exactly represented in a float32 (which has 23 bits of significand)
In [22]: np.can_cast(2**16, np.float32, "safe")
Out[22]: False

# Complex value with zero imaginary part can *only* be cast to real with "unsafe" rules
In [24]: np.can_cast(1+0j, float, "same_kind")
Out[24]: False
```

I guess these are just bugs?

And also, the `same_kind` rules seem totally useless. The idea of "exactly representable" makes sense as something you might want. The idea of "use a bigger hammer" (`unsafe`, C-style) makes sense as something you might want. But in between I would expect something like "preserves relative precision", so that you allow casting float64 -> float32, but not complex (with non-zero imaginary part) -> float or float (with non-zero fractional part) -> int or large integer value -> int8. This seems like the natural rule for array assignment checks as well. But in fact we don't have any such rule! It's impossible to distinguish between integer rollover and simple loss of precision in this system.
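To make that concrete, the value-based rule being wished for here might check something like the following (entirely hypothetical helpers -- NumPy has no such casting mode):

```python
# Rounding within a kind would be fine; structural loss of information
# (a dropped imaginary part, a truncated fraction, integer rollover)
# would not be.
def complex_to_float_ok(z):
    return z.imag == 0

def float_to_int_ok(x):
    return float(x).is_integer()

def int_to_int8_ok(n):
    return -128 <= n <= 127

print(complex_to_float_ok(1 + 0j))  # True: no imaginary part lost
print(float_to_int_ok(1.5))         # False: fraction truncated
print(int_to_int8_ok(300))          # False: would roll over
```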

Thanks @njsmith for the detailed explanation. I'm glad I asked, because it doesn't seem obvious ;)

I'm not sure about the "raise error if safe cast doesn't work" proposal anymore. For floats, I believe we want "float32 array + Python float scalar" to always work and yield a float32 array, even when the Python float scalar can't be represented without loss of precision as a float32 (like 1.1).

I agree the `can_cast` examples you show look like bugs.

I'm running out of time to spend discussing this topic, so that's probably all from me today, but I'll keep reading and thinking about it...

Collaborator

Probably this doesn't matter directly for this discussion, but for arrays that are not 0-d, the values are not actually inspected (I assume for performance reasons, since checking values during ufunc preparation would likely be unreasonable):

```In [27]: np.can_cast(np.array([0], dtype=np.int16), np.int8, casting='safe')
Out[27]: False

In [28]: np.can_cast(np.array(0, dtype=np.int16), np.int8, casting='safe')
Out[28]: True
```

I did not check the code, but my guess is that integer scalars (including 0-d arrays) simply have some extra logic here, while the complex case uses the array-level `safe` rule, which does not inspect the actual values. Maybe one could even introduce such a value-inspecting check as a new function, say `can_represent`.