
Type hinting / annotation (PEP 484) for ndarray, dtype, and ufunc #7370

Open
InonS opened this Issue Mar 1, 2016 · 51 comments

InonS commented Mar 1, 2016

Feature request: Organic support for PEP 484 with Numpy data structures.

Has anyone implemented type hinting for the specific numpy.ndarray class?

Right now, I'm using typing.Any, but it would be nice to have something more specific.

For instance if the numpy people added a type alias for their array_like object class. Better yet, implement support at the dtype level, so that other objects would be supported, as well as ufunc.

original SO question

njsmith commented Mar 2, 2016

I don't think anyone's thought about it. Perhaps you would like to? :-)

I'm also going to suggest that if you want to followup on this that we close the gh issue and move the discussion to the mailing list, since it's better suited to open-ended design discussions.

InonS commented Mar 2, 2016

After getting this answer on SO, I've decided to close the issue.

InonS closed this Mar 2, 2016

njsmith commented Mar 3, 2016

To be clear, we don't actually have any objection to supporting cool new python features or anything (rather the opposite); it's just that we're a volunteer run project without many resources, so stuff only happens if someone who's interested steps up to do it.

The mailing list is usually the best place if you're trying to start working on something or hoping to recruit some other interested folks to help.

InonS commented Mar 3, 2016

Thanks, @njsmith. I decided to start here because of the more orderly issue-tracking, as opposed to an unstructured mailing list (I was looking for a 'feature request' tag, among other features...)

Since the guy who answered me on SO got back to me with a viable solution, I decided to leave the matter.
Maybe the Numpy documentation should be updated to include his answer (please make sure to give him credit if you do).

Thanks, again!

JulesGM commented Apr 27, 2017

hello guys! I was just kindly wondering if there had been any progress on this issue. Thanks.

eric-wieser commented Apr 27, 2017

There is some discussion about it on the mailing list here.

shoyer commented May 8, 2017

I'm reopening this issue for those who are interested in discussing it further.

I think this would certainly be desirable for NumPy, but there are indeed a few tricky aspects of the NumPy API for typing to sort through, such as how NumPy currently accepts arbitrary objects in the np.array constructor (though we want to clean this up, see #5353).

shoyer reopened this May 8, 2017

nnadeau commented Jun 16, 2017

Some good work is being done here: https://github.com/machinalis/mypy-data

There's discussion about whether to push the work upstream to numpy or typeshed: machinalis/mypy-data#16

henryJack commented Sep 1, 2017

This really would be a great addition to NumPy. What would be the next steps to push this up to typeshed or NumPy? Even an incomplete stub would be useful, and I'm happy to help with a bit of direction.

shoyer commented Sep 1, 2017

@henryJack The best place to start would probably be tooling: figure out how we can integrate basic type annotations into the NumPy repository (and ideally test them) in a way that works with mypy and supports adding them incrementally.

Then, start with extremely minimal annotations and we can go from there. In particular, I would skip dtype annotations for now since we don't have a good way to specify them (i.e., only do ndarray, not ndarray[int]).

If it's helpful, I have an alternative version of annotations that I've written for use at Google and could open source. But we have our own unique build system and do type checking with pytype, so there would likely be quirks porting it to upstream.

jwkvam commented Sep 1, 2017

I suppose the only way to test annotations is to actually run mypy on sample code snippets and check the output?

Would it be better to have the annotations integrated with the code or as separate stubs?

I suppose we should also learn from Dropbox and pandas and start with the leaves of the codebase rather than the core data structures?

JulesGM commented Sep 1, 2017

@shoyer: "figure out how we can integrate basic type annotations"

Wouldn't just putting https://github.com/machinalis/mypy-data/blob/master/numpy-mypy/numpy/__init__.pyi in the numpy module base directory do exactly that, in an experimental version of some kind at least?

shoyer commented Sep 1, 2017

Would it be better to have the annotations integrated with the code or as separate stubs?

Integrated with the code would be lovely, but I don't think it's feasible for NumPy yet. Even with the comment string version of type annotations, we would need to import from typing on Python 2, and adding dependencies to NumPy is pretty much off the table.

Also, most of the core data structures and functions (things like ndarray and array) are defined in extension modules, so we'll need to use stubs there anyway.

Wouldn't just putting https://github.com/machinalis/mypy-data/blob/master/numpy-mypy/numpy/__init__.pyi in the numpy module base directory do exactly that.. In a experimental version of some kind at least

Yes, I think that would be enough for external code. But how does mypy handle libraries with incomplete type annotations?

If possible, we might annotate numpy.core.multiarray directly, rather than just at the top level. (multiarray is the extension module where NumPy's core types like ndarray are defined.) I think this would allow NumPy itself to make use of type checking for some of its pure-Python modules.
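For illustration, a hypothetical minimal fragment of such a stub might look like the following (the names and signatures here are chosen purely for illustration, not taken from numpy's actual stubs):

```python
# Hypothetical fragment of a minimal numpy/__init__.pyi-style stub.
# Illustrative only; real stubs would cover far more of the API.
from typing import Any, Tuple, Union

ShapeLike = Union[int, Tuple[int, ...]]

class ndarray:
    @property
    def shape(self) -> Tuple[int, ...]: ...
    @property
    def ndim(self) -> int: ...

def array(obj: Any) -> ndarray: ...
def empty(shape: ShapeLike, dtype: Any = ...) -> ndarray: ...
```

Even a skeleton like this lets mypy type-check external code that constructs and passes arrays, while leaving everything unannotated as Any.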

mrocklin commented Sep 1, 2017

I'm curious, what is the type of np.empty(shape=(5, 5), dtype='float32')?

What is the type of np.linalg.svd?

mrocklin commented Sep 1, 2017

It looks like types are parametrized; is this with their dtype? Is it also feasible to parametrize them with their dimension or shape? How much sophistication does Python's typing module support?

jwkvam commented Sep 1, 2017

Yes, they are parameterized by their dtype. I'm no expert on the typing module, but I think you could just have the ndarray type inherit from Generic[dtype, int] to parameterize on ndim. I believe that's what Julia does. I'm not sure whether you could easily parameterize on shape, nor am I sure of what benefits that would bring or why it wasn't done that way in the first place.

jwkvam commented Sep 1, 2017

You can use numpy dtypes; we just need to define them. That was done here with floating for np.std:

https://github.com/kjyv/mypy-data/blob/24ea87d952a98ef62680e812440aaa5bf49753ae/numpy-mypy/numpy/__init__.pyi#L198

I'm not sure, I don't think it's possible. I don't think you can modify the output type based on an argument's value. I think the best we can do is overload the function with all the type specializations we would care about.

https://docs.python.org/3/library/typing.html#typing.overload
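To make the overload idea concrete, here is a runnable sketch using toy stand-ins for ndarray (all names here are illustrative, not numpy's actual API):

```python
from typing import Generic, Type, TypeVar, overload

D = TypeVar("D")

class Array(Generic[D]):
    """Toy stand-in for ndarray, parameterized by element type."""
    def __init__(self, dtype: type) -> None:
        self.dtype = dtype

# Static specializations: a type checker picks the matching overload.
@overload
def empty(dtype: Type[int]) -> "Array[int]": ...
@overload
def empty(dtype: Type[float]) -> "Array[float]": ...

# Single runtime implementation behind the overloads.
def empty(dtype):
    return Array(dtype)

a = empty(float)
print(a.dtype is float)  # True
```

The downside, as noted above, is that every dtype we care about needs its own overload.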

eric-wieser commented Sep 2, 2017

Another option might be to introduce some strict-typed aliases, so np.empty[dtype] is a function with signature (ShapeType) -> ndarray[dtype].

There's already some precedent for this with the unusual np.cast[dtype](x) function.
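A runnable sketch of what such a strict-typed alias could look like, mirroring the np.cast[dtype](x) precedent (every name here is hypothetical, not part of numpy):

```python
from typing import Generic, Tuple, TypeVar

D = TypeVar("D")

class TypedArray(Generic[D]):
    """Toy stand-in for ndarray[dtype]."""
    def __init__(self, shape: Tuple[int, ...], dtype: type) -> None:
        self.shape, self.dtype = shape, dtype

class _EmptyAlias:
    """Indexable by dtype: empty[dtype] is a (shape) -> TypedArray factory."""
    def __getitem__(self, dtype: type):
        def make(shape: Tuple[int, ...]) -> TypedArray:
            return TypedArray(shape, dtype)
        return make

empty = _EmptyAlias()
arr = empty[float]((2, 2))
print(arr.dtype is float, arr.shape)  # True (2, 2)
```

The appeal is that the dtype parameter moves out of the argument list and into the subscript, where a checker could in principle track it.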

shoyer commented Sep 2, 2017

@jwkvam OK, so maybe dtype annotations are doable -- I was just suggesting starting simple and going from there.

I think TypeVar could possibly be used instead of overloads, maybe:

D = TypeVar('D', np.float64, np.complex128, np.int64, ...)  # every numpy generic type
def empty(dtype: Type[D]) -> ndarray[D]: ...

If I understand this correctly, this would imply empty(np.float64) -> ndarray[np.float64].

It would also be awesome to be able to type check shape and dimensionality information, but I don't think current type checkers are up to the task. Generic[int] is an error, for example; the arguments to Generic are required to be instances of TypeVar:
https://github.com/python/cpython/blob/868710158910fa38e285ce0e6d50026e1d0b2a8c/Lib/typing.py#L1131-L1133

We would also need to express signatures involving dimensions. For example, np.expand_dims maps ndim -> ndim+1.

I suppose one approach that would work is to define classes for each non-negative integer, e.g., Zero, One, Two, Three, ... and then define overloads for each. That would get tiring very quickly.

In TensorFlow, tf.Dimension() and tf.TensorShape() let you statically express shapes. But it's not something that is done in the type system. Rather, each function has a helper associated with it that determines the static shape of outputs from the shape of inputs and any non-tensor arguments. I think we would need something similar if we hoped to do this with NumPy, but there's nothing in Python's typing system that suggests it supports this sort of flexibility.
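For what it's worth, the per-integer marker-class idea above can at least be made to run, though it scales poorly (toy names throughout, not a proposed API):

```python
from typing import Generic, TypeVar, overload

# Marker classes encoding a fixed number of dimensions, as described above.
class Zero: ...
class One: ...
class Two: ...

N = TypeVar("N", Zero, One, Two)

class NDArray(Generic[N]):
    def __init__(self, ndim_tag: type) -> None:
        self.ndim_tag = ndim_tag

# One overload per ndim: expand_dims maps ndim -> ndim + 1.
@overload
def expand_dims(a: NDArray[Zero]) -> NDArray[One]: ...
@overload
def expand_dims(a: NDArray[One]) -> NDArray[Two]: ...

def expand_dims(a):
    successor = {Zero: One, One: Two}
    return NDArray(successor[a.ndim_tag])

b = expand_dims(NDArray(One))
print(b.ndim_tag is Two)  # True
```

Each supported ndim needs its own marker class and its own overload, which is exactly the tedium noted above.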

jwkvam commented Sep 11, 2017

@shoyer I see, yeah, that's disappointing. I was able to hack together the following:

_A = TypeVar('_A')
_B = TypeVar('_B', int, np.int64, np.int32)

class Abs(Generic[_A, _B]):
    pass

class Conc(Abs[_A, int]):
    pass

But I don't think that's leading anywhere...

It seems like your example works! It seemed to work better without the type constraints; I could test dtypes like str. I had to remove the default argument; I couldn't figure out how to get that to work.

D = TypeVar('D')
def empty(shape: ShapeType, dtype: Type[D], order: str='C') -> ndarray[D]: ...

and code

def hello() -> np.ndarray[int]:
    return np.empty(5, dtype=float)

I get

error: Argument 2 to "empty" has incompatible type Type[float]; expected Type[int]

I'm a little confused because if I swap the types:

def hello() -> np.ndarray[float]:
    return np.empty(5, dtype=int)

I get no error, even though I don't think anything is marked as covariant.

Even though the type system isn't as sophisticated as we'd like, do you think it's still worth it? One benefit I would appreciate is better code completion through jedi.

shoyer commented Sep 11, 2017

I'm a little confused because if I swap the types:

I believe the issue here is that int instances are implicitly considered valid for float annotations. See the notes on the numeric tower in the typing PEP:
https://www.python.org/dev/peps/pep-0484/#the-numeric-tower

I think this could be avoided if we insist on NumPy scalar types instead of generic Python types for annotations, e.g., np.ndarray[np.integer] rather than np.ndarray[int].

This is actually a little easier than I thought because TypeVar has a bound argument. So revising my example:

D = TypeVar('D', bound=np.generic)
def empty(dtype: Type[D]) -> ndarray[D]: ...

I had to remove the default argument, couldn't figure out how to get that to work.

I'm not quite sure what you were getting at here?

jwkvam commented Sep 11, 2017

I just tried to encode the default value of dtype in the stub. They did that in the mypy-data repo.

def empty(shape: ShapeType, dtype: DtypeType=float, order: str='C') -> ndarray[Any]: ...

from https://github.com/kjyv/mypy-data/blob/master/numpy-mypy/numpy/__init__.pyi#L523

Following your example, I wasn't able to get mypy to work with a default argument for dtype. I tried dtype: Type[D]=float and dtype: Type[D]=Type[float].

shoyer commented Sep 12, 2017

I think dtype also needs to become a generic type, and then you need to set the default value to a numpy generic subclass like np.float64 rather than float, e.g.,

# totally untested!
D = TypeVar('D', bound=np.generic)

class dtype(Generic[D]):
    @property
    def type(self) -> Type[D]: ...

class ndarray(Generic[D]):
    @property
    def dtype(self) -> dtype[D]: ...

DtypeLike = Union[dtype[D], D]  # both are coercible to a dtype
ShapeLike = Tuple[int, ...]

def empty(shape: ShapeLike, dtype: DtypeLike[D] = np.float64) -> ndarray[D]: ...
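
A runnable toy version of this sketch, with stand-in scalar classes in place of numpy's (purely for illustration of the intended inference):

```python
from typing import Generic, Tuple, Type, TypeVar

# Stand-ins for numpy scalar types, for illustration only.
class generic: ...
class float64(generic): ...
class int64(generic): ...

D = TypeVar("D", bound=generic)

class dtype(Generic[D]):
    def __init__(self, t: Type[D]) -> None:
        self._t = t
    @property
    def type(self) -> Type[D]:
        return self._t

class ndarray(Generic[D]):
    def __init__(self, dt: "dtype[D]") -> None:
        self._dtype = dt
    @property
    def dtype(self) -> "dtype[D]":
        return self._dtype

def empty(shape: Tuple[int, ...], dt: Type[D]) -> "ndarray[D]":
    # A checker should infer empty((2, 3), float64) as ndarray[float64].
    return ndarray(dtype(dt))

a = empty((2, 3), float64)
print(a.dtype.type is float64)  # True
```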

mitar commented Oct 20, 2017

Yes, I think that the first step would be just to get basic typing support for numpy (and Pandas and sklearn) in, without taking into consideration shapes and other extra constraints on those types.

The issue with other extra constraints is that it is not enough just to describe a shape (e.g., (5, 6)); there has to be a language to describe constraints on that shape. You can imagine wanting to define a function which accepts only square numpy arrays as inputs, or one where one dimension has to be 2x the other.

Something like that was done in the contracts project.

I also think that PEP 472 would be great to support here, because then one could really do things like Array[float64, ndim=2].

shoyer commented Oct 20, 2017

Indeed, PEP 472 would be nice for typing, though it would probably be one of the easier fixes to make this happen! (Please ping me if you are interested in restarting discussion around it, as I think there are also compelling use cases for named dimensions in indexing.)

mitar commented Oct 20, 2017

I am not sure how I can contribute, but I definitely think it would be an awesome feature for multiple reasons. But if we are going in that direction, then it seems like [] just becomes a different way to call an object. So object(*args, **kwargs) does something, object[*args, **kwargs] something else, and then we can even generalize and also have object{*args, **kwargs} and object<*args, **kwargs>. ;-)

eric-wieser commented Oct 20, 2017

@mitar: Looking at it the other way, perhaps we should just be annotating with something like ndarray[float].constrain(ndim=2). We have plenty of available syntax already, and unlike decorators, annotations have no restrictions.

mitar commented Oct 20, 2017

In fact I tried the following syntax: ndarray[float](ndim=2), overloading __call__ on generics so that it returns a class again rather than an instance of a class. But it became tricky for types which are not generics.

I think the main issue is with ndarray[float] support, because ndarray[float] is not something which really exists in ndarray, one would have to change ndarray itself, which I am not sure is a good general principle to do (changing upstream code to support better typing).

One other approach could be to have new type of type variables, ConstrainedTypeVar, where you could do something like ConstrainedTypeVar('A', bound=ndarray, dtype=float, ndim=2) or something like that, and then you would use A as a var in the function signature. But this becomes very verbose.

shoyer commented Nov 20, 2017

I wrote up a doc with some ideas for what typing array shapes could look like with broadcasting and a notion of dimension identity.

The core ideas include:

  1. Adding a DimensionVar primitive that allows for symbolic identities for array dimensions
  2. Recognizing ... (Ellipsis) as indicating array broadcasting.

For example, to type np.matmul/@:

from typing import DimensionVar, NDArray, overload

I = DimensionVar('I')
J = DimensionVar('J')
K = DimensionVar('K')

@overload
def matmul(a: NDArray[..., I, J], b: NDArray[..., J, K]) -> NDArray[..., I, K]: ...

@overload
def matmul(a: NDArray[J], b: NDArray[..., J, K]) -> NDArray[..., K]: ...

@overload
def matmul(a: NDArray[..., I, J], b: NDArray[J]) -> NDArray[..., I]: ...

These would be enough to allow for typing generalized ufuncs. See the doc for more details and examples.

eric-wieser commented Nov 20, 2017

A possible solution to supporting both dtypes and shapes, if we're already choosing to keep NDArray and ndarray distinct:

NDArray[float].shape[I, J, K]
NDArray[float]
NDArray.shape[I, J, K]

aldanor commented Nov 21, 2017

Just a thought, would it make sense to also have a shortcut like this?

NDArray.ndim[2]  # NDArray.shape[..., ...]
NDArray[float].ndim[2]  # NDArray[float].shape[..., ...]

— which could simplify a number of signatures, especially in downstream code.

shoyer commented Nov 21, 2017

@aldanor I think you mean NDArray.shape[:, :] (... means "zero or more dimensions", which isn't quite right in this context). But yes, that looks reasonable.


Quick update on typing for dtypes: I wrote a toy module using the approach I described above that uses np.generic subclasses with Generic for parameterized ndarray/dtype types.

This mostly seems to work with mypy as I would expect, including type inference with the equivalent of np.empty(..., dtype=np.float32). It does fail to catch one of my intentional type errors involving a Union type (I'll file a bug report later).

I think this would probably be good enough for dtypes. Without typing support for literal values, we couldn't do type inference with dtype specified as a string (dtype='float32'). Perhaps more problematically, it also doesn't handle type inference from Python types like dtype=float. But these types can be ambiguous (e.g., dtype=int maps to np.int64 on Linux and np.int32 on Windows), so it's probably better to use explicit generic types anyway. It's OK if type inference doesn't work in every possible case, as long as specifications like dtype=float are inferred as a dtype of Any rather than raising an error.

eric-wieser commented Nov 21, 2017

But these types can be ambiguous (e.g., dtype=int maps to np.int64 on Linux and np.int32 on Windows)

That's not ambiguous - in all cases, that maps to np.int_, which is the C long type.

shoyer commented Nov 25, 2017

I've written the mailing list to gain consensus on writing type-stubs for NumPy in a separate package:
https://mail.python.org/pipermail/numpy-discussion/2017-November/077429.html

henryJack commented Nov 27, 2017

Amazing, thanks @shoyer !

shoyer commented Dec 6, 2017

Per the consensus on the mailing list, I'd like to declare https://github.com/numpy/numpy_stubs open for business!

We'll start with basic annotations (no dtype support). If anyone wants to put together a basic PR to add the PEP 561 scaffolding for the repo that would be appreciated!

henryJack commented Dec 6, 2017

YES, YES, 1000X YES!

shoyer commented Dec 10, 2017

Heads up for anyone following this issue: I've opened two issues on the python/typing tracker:

hmaarrfk commented May 16, 2018

What is the expected release time for the typing feature?
Is there any reason to attempt to maintain 2.7 compatibility?
An early comment mentioned difficulty in integrating with Python 2. Since then, it seems that numpy has changed its stance.

Things are moving targets, I know, but would it make sense to target something like Python 3.4-3.6?

ilevkivskyi commented May 16, 2018

What is the expected release time for for the typing feature?

There were several discussions about this (integer generics, a.k.a. simple dependent types) at PyCon. I will write a proto-PEP based on these discussions and the original doc written by @shoyer soon. My target is to get the PEP written, implemented in mypy, and accepted in time for Python 3.8 beta 1 (a subsequent backport of the new types in typing for Python 2 is also highly likely).

shoyer commented May 17, 2018

@hmaarrfk as for writing type annotations for NumPy itself, we've started doing that in a separate repository: https://github.com/numpy/numpy-stubs. You should be able to install and use those stubs in their current state (with the latest version of mypy), but they are far from complete. Help would be appreciated!

hmaarrfk commented May 17, 2018

Sure, I'm glad to help where I can; I just know that these things take time. I saw the repo and noticed a commit mentioned 2.7 compatibility, which is why I asked.

Python 3.8's beta release is mid-2019. NumPy mentioned that they would stop adding new features at the end of 2018.

Typing seems to be a "nice-to-have" feature for numpy as opposed to a "must-have". As such, targeting two language versions seems a little hard, especially if the feature will start to appear well beyond numpy's own support deadline.

I'll be interested in reading what @ilevkivskyi has to say in the PEP.

shoyer commented May 17, 2018

@hmaarrfk You raise a good point about Python 2.7 support. To be honest, I haven't thought it through fully yet. I do expect that we will eventually drop it, but probably not before mypy itself drops Python 2.7 support, given that a major use-case for typing is writing Python 2/3 compatible code.

For now, it doesn't seem to require many compromises to support Python 2 in our type annotations, so I'm happy to leave it in, especially given that it came from a contributor who was evidently interested in it.
