
ENH: make dtype generic over scalar type #16759

Merged · 4 commits · Oct 16, 2020
Conversation

@person142 (Member)

Related to #16545.

Note: this is in no way complete, or even necessarily the right way forward. Just wanted to have a discussion about typing dtype subclasses with concrete code examples.

Goal: discuss whether making dtype generic over scalar types is a good way to move forward with making ndarray generic over dtype.

First, a rough summary of where things stand:

  • Pre NEP 41: there is only one dtype class
  • In the Python type system, classes are types, so in the pre NEP 41 world it's difficult to make ndarray generic over dtype (i.e. write things like ndarray[dtype]) because there is only one dtype class
  • Post NEP 41 there are dtype subclasses
  • Hopefully we can leverage that to make ndarray generic over dtype.
  • This could be the wrong abstraction; e.g. if you care about typing things like endianness, then really the instances of the classes are the "types", not the class itself. But since instances of a class aren't a type in the Python type system, it's not clear how one would do that. A further oddity is that when we say something like List[int] that means "a list whose elements are instances of int", which would not be the case for the dtype classes.
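To make the goal concrete, here is a minimal sketch of what "generic over dtype" could look like in the type system (all names below are hypothetical, not NumPy's actual stubs):

from typing import Generic, TypeVar

# Hypothetical sketch: a dtype hierarchy with concrete subclasses, and an
# ndarray parameterized by which dtype subclass it holds.
_DType = TypeVar('_DType', bound='dtype')

class dtype: ...

class Float64(dtype): ...  # hypothetical concrete subclass

class ndarray(Generic[_DType]): ...

x: ndarray[Float64]  # "an array whose dtype is an instance of Float64"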

Now, one of the first difficulties you hit in using the new dtype subclasses is that they are created dynamically; i.e. there is no concrete np.Float64 dtype subclass that you can add types for and then write ndarray[Float64]. The subclasses do have a suggestive notation though:

>>> type(np.dtype(np.float64))
<class 'numpy.dtype[float64]'>

Following that, this PR makes dtype generic over subclasses of np.generic, so that on the typing level we can write np.dtype[float64] to mean "the Float64 subclass of dtype". Some thoughts about that:

  • This means we are using dtype[<scalar>] to represent a subclass of dtype. I think this is compatible with how PEP 585 handles things like list[int] (see the runtime comparison after this list):

    https://www.python.org/dev/peps/pep-0585/#parameters-to-generics-are-available-at-runtime

  • If you are implementing a new dtype class by subclassing dtype, then it won't be generic over scalar type; consider e.g. this example:

    from typing import Generic, TypeVar
    
    T = TypeVar('T')
    
    class A(Generic[T]):
        pass
    
    class B(A):
        pass
    
    reveal_type(B[int]())

    which will be rejected with: error: The type "Type[B]" is not generic and not indexable. This is good.

  • The dtype[<scalar>] syntax is not settled yet on the non-typing level; NEP 42 says "Note: This is currently a possible extension and not yet decided."
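For comparison, here is the PEP 585 runtime behavior referenced above (Python 3.9+); the subscripted spelling survives at runtime rather than being erased:

>>> list[int]
list[int]
>>> type(list[int])
<class 'types.GenericAlias'>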

-class dtype:
+class dtype(Generic[_DTypeScalar]):
     @overload
     def __new__(
@person142 (Member Author)

In reality this is done by the _DTypeMeta metaclass.
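For illustration, a minimal sketch of how a metaclass can make the class itself subscriptable at runtime (the real _DTypeMeta differs in detail):

class _Meta(type):
    def __getitem__(cls, scalar_type):
        # A real implementation would look up or create the concrete dtype
        # subclass associated with scalar_type; here we just echo a name.
        return f'{cls.__name__}[{scalar_type.__name__}]'

class dtype(metaclass=_Meta):
    pass

print(dtype[float])  # dtype[float]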

@@ -0,0 +1,9 @@
import numpy as np

reveal_type(np.dtype(np.float64)) # E: numpy.dtype[numpy.float64*]
@person142 (Member Author) · Jul 5, 2020

Some examples of making dtype generic over scalar types working.

reveal_type(np.dtype(np.float64)) # E: numpy.dtype[numpy.float64*]
reveal_type(np.dtype(np.int64)) # E: numpy.dtype[numpy.int64*]

# Uh oh, this shouldn't be ok
@person142 (Member Author)

Some examples that would need to be sorted out.

Member

Looking good thus far!
In all likelihood we'll have to go big on the @overloads (~1 per instantiable generic subclass) to remedy the issues below.

@person142 (Member Author)

Ping @seberg since the validity of this approach depends on what's going to happen with dtype[float64] et al.

@seberg (Member) commented Jul 8, 2020

I like that approach enough to mention it in the draft of NEP 42, but there is nothing quite fixed about that. Although possibly it is the other way around here and typing can be the motivation... Note that there can be DTypes without scalar types.

@BvB93 (Member) commented Jul 8, 2020

Note that there can be DTypes without scalar types.

Are you referring to structured data types here or something else?

In the former case, isn't this mostly a terminology-related issue?
The likes of np.void and np.record can fulfill the role of supposed "scalar" here just fine.

@seberg (Member) commented Jul 8, 2020

@BvB93 the issue is around things such as pandas Categorical, which can store ints or arbitrary objects, but may or may not have actual scalars. Of course that is something you can argue about whether or not it should even exist! But right now, I can only say that it might... And the fact that pandas has Categorical makes it hard to dismiss immediately.
There is also the plausible option to go the Quantity way, where you do not actually have scalars, but just use 0-D arrays instead (although that has issues around mutability and hashability).

@person142 (Member Author)

the issue is around things such as pandas Categorical, which can store ints or arbitrary objects, but may or may not have actual scalars

When I was thinking about this I was figuring that for something like Categorical there would be an explicit subclass

class Categorical(dtype):
    ...

somewhere in pandas, and then you could write ndarray[Categorical]. So then things like dtype[float64] are only needed for the existing scalar hierarchy where there aren't explicit classes defined, and we avoid the issue of dtype classes without scalar types.

Maybe I was thinking about it wrong though?

@seberg (Member) commented Jul 9, 2020

Yes, sorry, exactly, you would have ndarray[Categorical] with no ability to do ndarray[CategoricalScalar] (or dtype[CategoricalScalar]). I suppose that was always anticipated here, so all well.

@anirudh2290 (Member) left a comment

This is more of a comment on np.dtype[float64] and not specifically on this PR. Now that we have type hinting support, this syntax actually becomes a bit confusing for end users from a dtype perspective IMO. For example, for most use cases, a spelling like List[int] is an indication to the person reading the code that it is for the type checker and will be ignored or erased at run time. But with np.dtype[float64] the meaning becomes slightly different: it has a specific runtime meaning, distinct from np.dtype[int64].

@person142 (Member Author)

For example, for most use cases, a spelling like List[int] is an indication to the person reading the code that it is for the type checker and will be ignored or erased at run time.

That is starting to change with PEP 585 though.

@anirudh2290 (Member)

Thanks! I missed that completely, that makes sense.

@person142 force-pushed the dtype-generic branch 3 times, most recently from dc6f3df to b3af8d9 on July 26, 2020
@person142 (Member Author)

Ok, some progress in dc6f3df:

  • dtype("float64") correctly resolves to dtype[float64]
  • dtype(float) correctly resolves to dtype[float64]
  • ... and so on (though I haven't added the overloads for every possible string alias yet)

The reason that dtype("float64") wasn't working correctly before was that the __init__ was screwing it up somehow. I'm not sure why that is, but I can repro it with this simple example:

from typing import Generic, TypeVar
from typing_extensions import Literal

T = TypeVar('T')

class A(Generic[T]):
    def __new__(cls, x: Literal['str']) -> A[str]:
        ...

reveal_type(A('str'))

gives Revealed type is 'new.A[builtins.str]', but

from typing import Generic, TypeVar
from typing_extensions import Literal

T = TypeVar('T')

class A(Generic[T]):
    def __new__(cls, x: Literal['str']) -> A[str]:
        ...

    def __init__(self, x: Literal['str']) -> None:
        ...

reveal_type(A('str'))

gives Revealed type is 'new.A[<nothing>]'.

The remaining issue is this test case:

# Uh oh, this shouldn't be ok
reveal_type(np.dtype[np.float64](np.int64))  # E: numpy.dtype[numpy.void]

which works but shouldn't; this is likely because the DTypeLike overload is very broad and catches too many things; we probably need a stricter version of DTypeLike just for __new__ which excludes the cases covered in the previous @overloads.

@BvB93 (Member) commented Jul 29, 2020

The remaining issue is this test case:

# Uh oh, this shouldn't be ok
reveal_type(np.dtype[np.float64](np.int64))  # E: numpy.dtype[numpy.void]

which works but shouldn't; this is likely because the DTypeLike overload is very broad and catches too many things; we probably need a stricter version of DTypeLike just for __new__ which excludes the cases covered in the previous @overloads.

Yeah, something specifically for annotating np.void. Maybe _VoidLike?
In any case, as long as this new annotation does not contain type in its definition then the above-mentioned issue should resolve itself.
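A stub-style sketch of what such a narrowed overload could look like (the name _VoidLike and its exact contents are assumptions, not this PR's final code):

# As it might appear in numpy/__init__.pyi; `generic` and `void` are the
# scalar base classes defined elsewhere in the stub.
from typing import Any, Generic, List, Tuple, Type, TypeVar, Union, overload

_DTypeScalar = TypeVar('_DTypeScalar', bound=generic)

# Crucially, no Type[...] members, so np.int64 itself no longer matches:
_VoidLike = Union[List[Any], Tuple[Any, Any]]

class dtype(Generic[_DTypeScalar]):
    @overload
    def __new__(cls, dtype: Type[_DTypeScalar]) -> dtype[_DTypeScalar]: ...
    @overload
    def __new__(cls, dtype: _VoidLike) -> dtype[void]: ...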

@overload
def __new__(
    cls,
    dtype: Type[int],
@BvB93 (Member) · Jul 29, 2020

We'll have to be careful with int and float here, as the corresponding np.generic is platform dependent. This is particularly true for Windows, as it generally (always?) uses 32-bit integers, both on 32- and 64-bit operating systems.

@person142 (Member Author)

Ah yeah, I think we want np.intc, but we haven't added an annotation for it yet!

Member

Ah yeah, I think we want np.intc, but we haven't added an annotation for it yet!

On macOS it seems that np.intc is np.int32 while np.dtype(int) == np.int64.
I believe np.int_ does the trick though.

In any case, it appears that the likes of mypy support platform checks (ref), which might be exactly what we need here!

@eric-wieser (Member) · Jul 30, 2020

np.dtype(int) and np.dtype(np.int_) are synonyms on all platforms. np.intc is always the C int.
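A quick runtime check of that claim (output shown is from a typical 64-bit Linux build):

>>> import numpy as np
>>> np.dtype(int) == np.dtype(np.int_)  # both map to the C long
True
>>> np.dtype(np.intc)                   # always the C int
dtype('int32')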

@BvB93 (Member) · Jul 30, 2020

np.dtype(int) and np.dtype(np.int_) are synonyms on all platforms.

Good to know, then this is exactly what we're looking for! (considering this is about adding a np.dtype(int) overload)

I think the code below should do the trick unless there are more situations where np.int_ is np.int32.

if sys.platform == "win32":
    int_ = np.int32  # Windows-specific code
else:
    int_ = np.int64  # Other systems

class dtype:
    def __new__(cls, dtype: Type[int], ...) -> dtype[int_]: ...

Member

Are ifs legal like that in stub files? If so, is sys.platform whitelisted as one of the things you can access?

A select few of them are (ref).
This includes sys.version_info (which we've used previously) and apparently also sys.platform, yes.

@eric-wieser (Member) · Jul 30, 2020

I think the mostly correct logic is:

# best guess of size_t for common platforms
if sys.maxsize > 2**32:
    intp = np.int64
else:
    intp = np.int32

# C long tends to match `size_t`, except on Windows where it is always 32 bits.
if sys.platform == "win32":
    int_ = np.int32
else:
    int_ = np.intp

You might need to make an upstream patch to mypy to support sys.maxsize.

In the meantime, I suppose you can define intp = Union[np.int32, np.int64]?

@person142 (Member Author)

We could also potentially fake np.intp and np.int_ as being their own types. If one were writing code that needed to take variation across systems into account, then the type checker complaining about something like intp + int64 by default, regardless of platform, might be the most helpful.
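A stub-style sketch of that idea (purely hypothetical; it is not what this PR ends up doing):

# Declare the C-dependent aliases as distinct scalar types rather than as
# aliases of int32/int64; mixing them with fixed-width types would then be
# flagged by the type checker on every platform.
class intp(signedinteger): ...
class int_(signedinteger): ...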

Member

Yet another option would be to add some sort of dynamically generated .pyi file (perhaps whenever setup.py is run?) for handling all these platform-specific generic aliases and then import them from there, considering there are quite a number of them (ref).
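A minimal sketch of what such a generation step could look like (the file name and the setup.py hook are assumptions):

import numpy as np

# Resolve the platform-dependent aliases at build time and write them out
# as concrete names that the static stubs can import.
aliases = {
    'int_': np.dtype(int).name,      # e.g. 'int64' on Linux, 'int32' on Windows
    'intp': np.dtype(np.intp).name,  # e.g. 'int64' on a 64-bit build
}

with open('numpy/_platform_aliases.pyi', 'w') as f:
    f.write('from numpy import int32, int64\n')
    for name, concrete in aliases.items():
        f.write(f'{name} = {concrete}\n')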

@person142 (Member Author)

The dynamically generated .pyi file sounds like a nice solution long term; for now I've demoted those returns to dtype[Any] so as to not get blocked here on solving the platform-dependent dtype problem.

@person142 (Member Author)

Ok, so @BvB93 pointed out that I need some more string aliases and to get the overload for int correct, but: I think this should be a complete proof of concept now! In particular post 711d947 we have:

  • Mypy correctly complains about np.dtype[np.float64](np.int64)
  • np.dtype(np.dtype(np.float64)) etc. is handled correctly
  • np.dtype(("U", 10)) etc. give np.dtype[np.void]
  • dtype(None) correctly gives np.dtype[np.float64]

So now we need to decide if we actually want to take this path (recall that the goal is making ndarray generic over dtype). If we do I imagine that the NEP 42 draft would need to be updated. @seberg WDYT?

@person142 (Member Author)

Ok, rebased post #16622, which has revealed a few more cases that could be handled better. But, let's not put that work in until we're confident we're actually going to keep this.

@person142 (Member Author)

No further movement here; what should we do to move the discussion about using dtype[<scalar>] for typing forward? Ping the mailing list? Open a PR with proposed edits to NEP 42?

@rgommers (Member)

No further movement here; what should we do to move the discussion about using dtype[<scalar>] for typing forward? Ping the mailing list? Open a PR with proposed edits to NEP 42?

Would be good to get @seberg's opinion here, but I suspect that pinging the mailing list won't help much (most experts are already participating here), so a proposed edit to NEP 42 sounds good to me.

From the examples given so far, it's not 100% clear to me whether this only impacts typing of the dtype subclasses themselves, or whether this PR by itself also moves us forward on adding dtype typing info to regular functions that have ndarray as input/return parameters. If that is the case, could you add some examples?

@seberg (Member) commented Aug 19, 2020

If that feels better to you, you can also move it to NEP 41. It might be better there, since it's more about the concept of DTypes being classes and we have this scalar <-> DType duality 🤷.

@rgommers (Member)

def inner(x: np.ndarray[dtype[float64]], y: np.ndarray[dtype[float64]]) -> np.float64:
    ...

Thanks @person142, that's exactly the answer I was looking for.

Side note: I hope we can provide aliases for dtype[float64] etc. as well in numpy.typing, so the signature becomes something like:

def inner(x: np.ndarray[Float64], y: np.ndarray[Float64]) -> np.float64:

Typed signatures are annoyingly hard to read, so these convenience shorthands are quite useful.
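Such shorthands could be spelled roughly like this (hypothetical; no such aliases existed in numpy.typing at the time):

from typing import TYPE_CHECKING

import numpy as np

if TYPE_CHECKING:
    # Guarded because subscripting np.dtype works for the type checker here,
    # but not (yet) at runtime.
    Float64 = np.dtype[np.float64]
    Int64 = np.dtype[np.int64]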

So the short version would be "when we make ndarray generic over dtype, this PR will give us a way to refer to common dtypes like dtype[float32], which would otherwise not be possible".

Does that make sense? Thinking of the current discussion about NEP opaqueness, there's a similar danger that these typing PRs are becoming increasingly inscrutable as we tackle the harder problems. LMK if a roadmap or some longer-form document would help make the path forward clearer.

If it's helpful for you please do, but it's not necessary for me right now. Most PRs have been quite clear. This one seemed to be an exception, so I thought I'd ask.

@BvB93 (Member) commented Sep 4, 2020

Adapting ArrayLike should be fairly straightforward (see below) once these changes are pushed through, were it not for one issue: builtin scalars (i.e. a TypeVar bound to np.generic is not enough).

It might be possible though to do some magic with PEP 593's new Annotated object (an annotation-type that can hold arbitrary metadata) in combination with a mypy plugin, treating builtin scalars as np.generic's if certain conditions are met.

from typing import Union, Sequence, TypeVar

import numpy as np
from numpy.typing import _SupportsArray

T = TypeVar("T", bound=np.generic) 

ArrayLike = Union[
    T, Sequence[T], _SupportsArray[np.dtype[T]]  # etc...
]

def func(a: ArrayLike[T]) -> np.ndarray[np.dtype[T]]: ...

person142 added a commit to person142/numpy that referenced this pull request Oct 4, 2020
As discussed in numpy#16759, the new
DType classes provide a good path forward for making `ndarray` generic
over DType. Update NEP 42 to discuss those applications in more
detail.
@person142 marked this pull request as ready for review on October 6, 2020
@person142 (Member Author)

Ok, looks like there's consensus on #17447, so I've taken this out of draft and (hopefully) addressed the outstanding review comments.

@person142 removed the "25 - WIP" label on Oct 6, 2020
@BvB93 (Member) left a comment

Also there are a number of overloads still missing:

  • complex64
  • complex128
  • float16
  • int8
  • int16
  • uint8
  • uint16
  • uint32
  • uint64
  • str_
  • bytes_

@BvB93 changed the title from "WIP, ENH: make dtype generic over scalar type" to "ENH: make dtype generic over scalar type" on Oct 6, 2020
seberg pushed a commit that referenced this pull request Oct 6, 2020
* NEP: update NEP 42 with discussion of type hinting applications

As discussed in #16759, the new
DType classes provide a good path forward for making `ndarray` generic
over DType. Update NEP 42 to discuss those applications in more
detail.

* NEP: discuss typing for use scalar types in NEP 42

Also clean up the language a bit.
@person142 (Member Author)

Many more overloads added; some overloads consolidated. I moved the warning about not reordering them willy-nilly up to the top.

"=u8",
"<u8",
">u8",
"L",
Member

I'd leave out "L"- and "l"-based char codes for now,
as those can correspond to either 64- or 32-bit integers depending on the platform.

@person142 (Member Author)

Can you give a reference for that? This is surprising given that

In [5]: np.dtype('int64').char
Out[5]: 'l'

In [6]: np.dtype('int32').char
Out[6]: 'i'

Member

It's maybe not as explicit as it should be, but the generic docs do mention platform specifics
(these compatible: ... statements). A given character code will always correspond to the
same platform-independent type/alias (byte, short, intc, etc.), but the same does not
hold for their associated numerical precision.

On Windows for example:

In [1]: import numpy as np

In [2]: np.dtype('int64').char
Out[2]: 'q'

In [3]: np.dtype('int32').char
Out[3]: 'l'

@person142 (Member Author)

Ok, ditched the character codes for all int types; since they all correspond to C types they are all platform dependent. (Or at least the C standard allows them to be platform dependent, though they might not be in practice.)

@BvB93 (Member) commented Oct 7, 2020

The only comments here that really need addressing are #16759 (comment) and,
to a lesser extent, #16759 (comment).

The PR is good to go afterwards in my opinion.

The various character codes are for C types, which are all
platform-dependent.
@person142 (Member Author)

Ok, I think all outstanding comments have been addressed.

@mattip (Member) commented Oct 14, 2020

@seberg any last thoughts?

@charris merged commit b9dd2be into numpy:master on Oct 16, 2020
@charris (Member) commented Oct 16, 2020

Thanks @person142.

@person142 deleted the dtype-generic branch on October 16, 2020