
ENH: make dtype generic over scalar type #16759

Merged · 4 commits · Oct 16, 2020
Conversation

@person142 (Member)

Related to #16545.

Note: this is in no way complete, or even necessarily the right way forward. Just wanted to have a discussion about typing dtype subclasses with concrete code examples.

Goal: discuss whether making dtype generic over scalar types is a good way to move forward with making ndarray generic over dtype.

First, a rough summary of where things stand:

  • Pre NEP 41: there is only one dtype class
  • In the Python type system, classes are types, so in the pre NEP 41 world it's difficult to make ndarray generic over dtype (i.e. write things like ndarray[dtype]) because there is only one dtype class
  • Post NEP 41 there are dtype subclasses
  • Hopefully we can leverage that to make ndarray generic over dtype.
  • This could be the wrong abstraction; e.g. if you care about typing things like endianness, then really the instances of the classes are the "types", not the class itself. But since instances of a class aren't a type in the Python type system, it's not clear how one would do that. A further oddity is that when we say something like List[int] that means "a list whose elements are instances of int", which would not be the case for the dtype classes.
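To make the goal concrete, here is a minimal sketch of what "generic over dtype" could look like in the type system (all names below are hypothetical, not NumPy's actual stubs):

from typing import Generic, TypeVar

# Hypothetical sketch: a dtype hierarchy with concrete subclasses, and an
# ndarray parameterized by which dtype subclass it holds.
_DType = TypeVar('_DType', bound='dtype')

class dtype: ...

class Float64(dtype): ...  # hypothetical concrete subclass

class ndarray(Generic[_DType]): ...

x: ndarray[Float64]  # "an array whose dtype is an instance of Float64"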

Now, one of the first difficulties you hit in using the new dtype subclasses is that they are created dynamically; i.e. there is no concrete np.Float64 dtype subclass that you can add types for and then write ndarray[Float64]. The subclasses do have a suggestive notation though:

>>> type(np.dtype(np.float64))
<class 'numpy.dtype[float64]'>

Following that, this PR makes dtype generic over subclasses of np.generic, so that on the typing level we can write np.dtype[float64] to mean "the Float64 subclass of dtype". Some thoughts about that:

  • This means we are using dtype[<scalar>] to represent a subclass of dtype. I think this is compatible with how PEP 585 handles things like list[int] (see the runtime comparison after this list):

    https://www.python.org/dev/peps/pep-0585/#parameters-to-generics-are-available-at-runtime

  • If you are implementing a new dtype class by subclassing dtype, then it won't be generic over scalar type; consider e.g. this example:

    from typing import Generic, TypeVar
    
    T = TypeVar('T')
    
    class A(Generic[T]):
        pass
    
    class B(A):
        pass
    
    reveal_type(B[int]())

    which will be rejected with: error: The type "Type[B]" is not generic and not indexable. This is good.

  • The dtype[<scalar>] syntax is not settled yet on the non-typing level; NEP 42 says "Note: This is currently a possible extension and not yet decided."
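For comparison, here is the PEP 585 runtime behavior referenced above (Python 3.9+); the subscripted spelling survives at runtime rather than being erased:

>>> list[int]
list[int]
>>> type(list[int])
<class 'types.GenericAlias'>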

-class dtype:
+class dtype(Generic[_DTypeScalar]):
     @overload
     def __new__(
@person142 (Member Author)

In reality this is done by the _DTypeMeta metaclass.
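For illustration, a minimal sketch of how a metaclass can make the class itself subscriptable at runtime (the real _DTypeMeta differs in detail):

class _Meta(type):
    def __getitem__(cls, scalar_type):
        # A real implementation would look up or create the concrete dtype
        # subclass associated with scalar_type; here we just echo a name.
        return f'{cls.__name__}[{scalar_type.__name__}]'

class dtype(metaclass=_Meta):
    pass

print(dtype[float])  # dtype[float]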

@@ -0,0 +1,9 @@
import numpy as np

reveal_type(np.dtype(np.float64)) # E: numpy.dtype[numpy.float64*]
@person142 (Member Author) · Jul 5, 2020

Some examples of making dtype generic over scalar types working.

reveal_type(np.dtype(np.float64)) # E: numpy.dtype[numpy.float64*]
reveal_type(np.dtype(np.int64)) # E: numpy.dtype[numpy.int64*]

# Uh oh, this shouldn't be ok
@person142 (Member Author)

Some examples that would need to be sorted out.

Member

Looking good thus far!
In all likelihood we'll have to go big on the @overloads (~1 per instantiable generic subclass) to remedy the issues below.

@person142 (Member Author)

Ping @seberg since the validity of this approach depends on what's going to happen with dtype[float64] et al.

@seberg (Member) commented Jul 8, 2020

I like that approach enough to mention it in the draft of NEP 42, but there is nothing quite fixed about that. Although possibly it is the other way around here and typing can be the motivation... Note that there can be DTypes without scalar types.

@BvB93 (Member) commented Jul 8, 2020

Note that there can be DTypes without scalar types.

Are you referring to structured data types here or something else?

In the former case, isn't this mostly a terminology-related issue?
The likes of np.void and np.record can fulfill the role of supposed "scalar" here just fine.

@seberg (Member) commented Jul 8, 2020

@BvB93 the issue is around things such as pandas Categorical, which can store ints or arbitrary objects, but may or may not have actual scalars. Of course that is something you can argue about whether or not it should even exist! But right now, I can only say that it might... And the fact that pandas has Categorical makes it hard to dismiss immediately.
There is also the plausible option to go the Quantity way, where you do not actually have scalars, but just use 0-D arrays instead (although that has issues around mutability and hashability).

@person142 (Member Author)

the issue is around things such as pandas Categorical, which can store ints or arbitrary objects, but may or may not have actual scalars

When I was thinking about this I was figuring that for something like Categorical there would be an explicit subclass

class Categorical(dtype):
    ...

somewhere in pandas, and then you could write ndarray[Categorical]. So then things like dtype[float64] are only needed for the existing scalar hierarchy where there aren't explicit classes defined, and we avoid the issue of dtype classes without scalar types.

Maybe I was thinking about it wrong though?

@seberg (Member) commented Jul 9, 2020

Yes, sorry, exactly, you would have ndarray[Categorical] with no ability to do ndarray[CategoricalScalar] (or dtype[CategoricalScalar]). I suppose that was always anticipated here, so all well.

@anirudh2290 (Member) left a comment

This is more of a comment on np.dtype[float64] and not specifically on this PR. Now that we have type hinting support, this syntax actually becomes a bit confusing for end users from a dtype perspective IMO. For example, for most use cases, a spelling like List[int] is an indication to the person reading the code that it is for the type checker and will be ignored or erased at run time. But with np.dtype[float64] the meaning becomes slightly different: it has a specific runtime meaning, distinct from np.dtype[int64].

@person142 (Member Author)

For example, for most use cases, a spelling like List[int] is an indication to the person reading the code that it is for the type checker and will be ignored or erased at run time.

That is starting to change with PEP 585 though.

@anirudh2290 (Member)

Thanks! I missed that completely, that makes sense.

@person142 force-pushed the dtype-generic branch 3 times, most recently from dc6f3df to b3af8d9 on July 26, 2020
@person142 (Member Author)

Ok, some progress in dc6f3df:

  • dtype("float64") correctly resolves to dtype[float64]
  • dtype(float) correctly resolves to dtype[float64]
  • ... and so on (though I haven't added the overloads for every possible string alias yet)

The reason that dtype("float64") wasn't working correctly before was that the __init__ was screwing it up somehow. I'm not sure why that is, but I can repro it with this simple example:

from typing import Generic, TypeVar
from typing_extensions import Literal

T = TypeVar('T')

class A(Generic[T]):
    def __new__(cls, x: Literal['str']) -> A[str]:
        ...

reveal_type(A('str'))

gives Revealed type is 'new.A[builtins.str]', but

from typing import Generic, TypeVar
from typing_extensions import Literal

T = TypeVar('T')

class A(Generic[T]):
    def __new__(cls, x: Literal['str']) -> A[str]:
        ...

    def __init__(self, x: Literal['str']) -> None:
        ...

reveal_type(A('str'))

gives Revealed type is 'new.A[<nothing>]'.

The remaining issue is this test case:

# Uh oh, this shouldn't be ok
reveal_type(np.dtype[np.float64](np.int64))  # E: numpy.dtype[numpy.void]

which works but shouldn't; this is likely because the DTypeLike overload is very broad and catches too many things; we probably need a stricter version of DTypeLike just for __new__ which excludes the cases covered in the previous @overloads.

@BvB93 (Member) commented Jul 29, 2020

The remaining issue is this test case:

# Uh oh, this shouldn't be ok
reveal_type(np.dtype[np.float64](np.int64))  # E: numpy.dtype[numpy.void]

which works but shouldn't; this is likely because the DTypeLike overload is very broad and catches too many things; we probably need a stricter version of DTypeLike just for __new__ which excludes the cases covered in the previous @overloads.

Yeah, something specifically for annotating np.void. Maybe _VoidLike?
In any case, as long as this new annotation does not contain type in its definition then the above-mentioned issue should resolve itself.
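A stub-style sketch of what such a narrowed overload could look like (the name _VoidLike and its exact contents are assumptions, not this PR's final code):

# As it might appear in numpy/__init__.pyi; `generic` and `void` are the
# scalar base classes defined elsewhere in the stub.
from typing import Any, Generic, List, Tuple, Type, TypeVar, Union, overload

_DTypeScalar = TypeVar('_DTypeScalar', bound=generic)

# Crucially, no Type[...] members, so np.int64 itself no longer matches:
_VoidLike = Union[List[Any], Tuple[Any, Any]]

class dtype(Generic[_DTypeScalar]):
    @overload
    def __new__(cls, dtype: Type[_DTypeScalar]) -> dtype[_DTypeScalar]: ...
    @overload
    def __new__(cls, dtype: _VoidLike) -> dtype[void]: ...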

@overload
def __new__(
    cls,
    dtype: Type[int],
@BvB93 (Member) · Jul 29, 2020

We'll have to be careful with int and float here, as the corresponding np.generic is platform dependent. This is particularly true for Windows, as it generally (always?) uses 32-bit integers, both on 32- and 64-bit operating systems.

@person142 (Member Author)

Ah yeah, I think we want np.intc, but we haven't added an annotation for it yet!

Member

Ah yeah, I think we want np.intc, but we haven't added an annotation for it yet!

On macOS it seems that np.intc is np.int32 while np.dtype(int) == np.int64.
I believe np.int_ does the trick though.

In any case, it appears that the likes of mypy support platform checks (ref), which might be exactly what we need here!

@eric-wieser (Member) · Jul 30, 2020

np.dtype(int) and np.dtype(np.int_) are synonyms on all platforms. np.intc is always the C int.
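A quick runtime check of that claim (output shown is from a typical 64-bit Linux build):

>>> import numpy as np
>>> np.dtype(int) == np.dtype(np.int_)  # both map to the C long
True
>>> np.dtype(np.intc)                   # always the C int
dtype('int32')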

@BvB93 (Member) · Jul 30, 2020

np.dtype(int) and np.dtype(np.int_) are synonyms on all platforms.

Good to know, then this is exactly what we're looking for! (considering this is about adding a np.dtype(int) overload)

I think the code below should do the trick unless there are more situations where np.int_ is np.int32.

if sys.platform == "win32":
    int_ = np.int32  # Windows-specific code
else:
    int_ = np.int64  # Other systems

class dtype:
    def __new__(cls, dtype: Type[int], ...) -> dtype[int_]: ...

Member

Are ifs legal like that in stub files? If so, is sys.platform whitelisted as one of the things you can access?

A select few of them are (ref).
This includes sys.version_info (which we've used previously) and apparently also sys.platform, yes.

@eric-wieser (Member) · Jul 30, 2020

I think the mostly correct logic is:

# best guess of size_t for common platforms
if sys.maxsize > 2**32:
    intp = np.int64
else:
    intp = np.int32

# C long tends to match `size_t`, except on Windows where it is always 32 bits.
if sys.platform == "win32":
    int_ = np.int32
else:
    int_ = np.intp

You might need to make an upstream patch to mypy to support sys.maxsize.

In the meantime, I suppose you can define intp = Union[np.int32, np.int64]?

@person142 (Member Author)

We could also potentially fake np.intp and np.int_ as being their own types. If one were writing code that needed to take variation across systems into account, then the type checker complaining about something like intp + int64 by default, regardless of platform, might be the most helpful.
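A stub-style sketch of that idea (purely hypothetical; it is not what this PR ends up doing):

# Declare the C-dependent aliases as distinct scalar types rather than as
# aliases of int32/int64; mixing them with fixed-width types would then be
# flagged by the type checker on every platform.
class intp(signedinteger): ...
class int_(signedinteger): ...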

Member

Yet another option would be to add some sort of dynamically generated .pyi file (perhaps whenever setup.py is run?) for handling all these platform-specific generic aliases and then import them from there, considering there are quite a number of them (ref).
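A minimal sketch of what such a generation step could look like (the file name and the setup.py hook are assumptions):

import numpy as np

# Resolve the platform-dependent aliases at build time and write them out
# as concrete names that the static stubs can import.
aliases = {
    'int_': np.dtype(int).name,      # e.g. 'int64' on Linux, 'int32' on Windows
    'intp': np.dtype(np.intp).name,  # e.g. 'int64' on a 64-bit build
}

with open('numpy/_platform_aliases.pyi', 'w') as f:
    f.write('from numpy import int32, int64\n')
    for name, concrete in aliases.items():
        f.write(f'{name} = {concrete}\n')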

@person142 (Member Author)

The dynamically generated .pyi file sounds like a nice solution long term; for now I've demoted those returns to dtype[Any] so as to not get blocked here on solving the platform-dependent dtype problem.

@person142 (Member Author)

Ok, so @BvB93 pointed out that I need some more string aliases and to get the overload for int correct, but: I think this should be a complete proof of concept now! In particular post 711d947 we have:

  • Mypy correctly complains about np.dtype[np.float64](np.int64)
  • np.dtype(np.dtype(np.float64)) etc. is handled correctly
  • np.dtype(("U", 10)) etc. give np.dtype[np.void]
  • dtype(None) correctly gives np.dtype[np.float64]

So now we need to decide if we actually want to take this path (recall that the goal is making ndarray generic over dtype). If we do I imagine that the NEP 42 draft would need to be updated. @seberg WDYT?

@person142 (Member Author)

Ok, rebased post #16622, which has revealed a few more cases that could be handled better. But, let's not put that work in until we're confident we're actually going to keep this.

@person142 (Member Author)

No further movement here; what should we do to move the discussion about using dtype[<scalar>] for typing forward? Ping the mailing list? Open a PR with proposed edits to NEP 42?

@rgommers (Member)

No further movement here; what should we do to move the discussion about using dtype[<scalar>] for typing forward? Ping the mailing list? Open a PR with proposed edits to NEP 42?

Would be good to get @seberg's opinion here, but I suspect that pinging the mailing list won't help much (most experts are already participating here), so a proposed edit to NEP 42 sounds good to me.

From the examples given so far, it's not 100% clear to me whether this only impacts typing of the dtype subclasses themselves, or whether this PR by itself also moves us forward on adding dtype typing info to regular functions that have ndarray as input/return parameters. If that is the case, could you add some examples?

@seberg (Member) commented Aug 19, 2020

If that feels better to you, you can also move it to NEP 41. It might be better there, since it's more about the concept of DTypes being classes and we have this scalar <-> DType duality 🤷.

@rgommers (Member)

def inner(x: np.ndarray[dtype[float64]], y: np.ndarray[dtype[float64]]) -> np.float64:
    ...

Thanks @person142, that's exactly the answer I was looking for.

Side note: I hope we can provide aliases for dtype[float64] etc. as well in numpy.typing, so the signature becomes something like:

def inner(x: np.ndarray[Float64], y: np.ndarray[Float64]) -> np.float64:

Typed signatures are annoyingly hard to read, so these convenience shorthands are quite useful.
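Such shorthands could be spelled roughly like this (hypothetical; no such aliases existed in numpy.typing at the time):

from typing import TYPE_CHECKING

import numpy as np

if TYPE_CHECKING:
    # Guarded because subscripting np.dtype works for the type checker here,
    # but not (yet) at runtime.
    Float64 = np.dtype[np.float64]
    Int64 = np.dtype[np.int64]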

So the short version would be "when we make ndarray generic over dtype, this PR will give us a way to refer to common dtypes like dtype[float32], which would otherwise not be possible".

Does that make sense? Thinking of the current discussion about NEP opaqueness, there's a similar danger that these typing PRs are becoming increasingly inscrutable as we tackle the harder problems. LMK if a roadmap or some longer-form document would help make the path forward clearer.

If it's helpful for you please do, but it's not necessary for me right now. Most PRs have been quite clear. This one seemed to be an exception, so I thought I'd ask.

@BvB93 (Member) commented Sep 4, 2020

Adapting ArrayLike should be fairly straightforward (see below) once these changes are pushed through, were it not for one issue: builtin scalars (i.e. a TypeVar bound to np.generic is not enough).

It might be possible though to do some magic with PEP 593's new Annotated object (an annotation-type that can hold arbitrary metadata) in combination with a mypy plugin, treating builtin scalars as np.generic's if certain conditions are met.

from typing import Union, Sequence, TypeVar

import numpy as np
from numpy.typing import _SupportsArray

T = TypeVar("T", bound=np.generic) 

ArrayLike = Union[
    T, Sequence[T], _SupportsArray[np.dtype[T]]  # etc...
]

def func(a: ArrayLike[T]) -> np.ndarray[np.dtype[T]]: ...

person142 added a commit to person142/numpy that referenced this pull request Oct 4, 2020
As discussed in numpy#16759, the new
DType classes provide a good path forward for making `ndarray` generic
over DType. Update NEP 42 to discuss those applications in more
detail.
@person142 marked this pull request as ready for review on October 6, 2020
@person142 (Member Author)

Ok, looks like there's consensus on #17447, so I've taken this out of draft and (hopefully) addressed the outstanding review comments.

@person142 removed the "25 - WIP" label on Oct 6, 2020
@BvB93 (Member) left a comment

Also there are a number of overloads still missing:

  • complex64
  • complex128
  • float16
  • int8
  • int16
  • uint8
  • uint16
  • uint32
  • uint64
  • str_
  • bytes_

@BvB93 changed the title from "WIP, ENH: make dtype generic over scalar type" to "ENH: make dtype generic over scalar type" on Oct 6, 2020
seberg pushed a commit that referenced this pull request Oct 6, 2020
* NEP: update NEP 42 with discussion of type hinting applications

As discussed in #16759, the new
DType classes provide a good path forward for making `ndarray` generic
over DType. Update NEP 42 to discuss those applications in more
detail.

* NEP: discuss typing for use scalar types in NEP 42

Also clean up the language a bit.
@person142 (Member Author)

Many more overloads added; some overloads consolidated. I moved the warning about not reordering them willy-nilly up to the top.

"=u8",
"<u8",
">u8",
"L",
Member

I'd leave out "L"- and "l"-based char codes for now,
as those can correspond to either 64- or 32-bit integers depending on the platform.

@person142 (Member Author)

Can you give a reference for that? This is surprising given that

In [5]: np.dtype('int64').char
Out[5]: 'l'

In [6]: np.dtype('int32').char
Out[6]: 'i'

Member

It's maybe not as explicit as it should be, but the generic docs do mention platform specifics
(these compatible: ... statements). A given character code will always correspond to the
same platform-independent type/alias (byte, short, intc, etc.), but the same does not
hold for their associated numerical precision.

On Windows for example:

In [1]: import numpy as np

In [2]: np.dtype('int64').char
Out[2]: 'q'

In [3]: np.dtype('int32').char
Out[3]: 'l'

@person142 (Member Author)

Ok, ditched the character codes for all int types; since they all correspond to C types they are all platform dependent. (Or at least the C standard allows them to be platform dependent, though they might not be in practice.)

@BvB93 (Member) commented Oct 7, 2020

The only comments here that really need addressing are #16759 (comment) and,
to a lesser extent, #16759 (comment).

The PR is good to go afterwards in my opinion.

The various character codes are for C types, which are all
platform-dependent.
@person142 (Member Author)

Ok, I think all outstanding comments have been addressed.

@mattip (Member) commented Oct 14, 2020

@seberg any last thoughts?

@charris merged commit b9dd2be into numpy:master on Oct 16, 2020
@charris (Member) commented Oct 16, 2020

Thanks @person142.

@person142 deleted the dtype-generic branch on October 16, 2020