Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typing Master Tracker #25601

Closed
WillAyd opened this issue Mar 8, 2019 · 12 comments
Closed

Typing Master Tracker #25601

WillAyd opened this issue Mar 8, 2019 · 12 comments
Labels
Master Tracker High level tracker for similar issues Typing type annotations, mypy/pyright type checking

Comments

@WillAyd
Copy link
Member

WillAyd commented Mar 8, 2019

Per pandas-dev chat today opening up a master tracker for TODOs to improve typing.

Glancing the Python docs here's what I see for "new in version X.X.X" which we may want to consider:

3.5.2

3.5.3

3.5.4

3.6

3.6.1

3.7.2

@WillAyd WillAyd added the Master Tracker High level tracker for similar issues label Mar 8, 2019
@TomAugspurger
Copy link
Contributor

Configure CI for type checking (assuming mypy?)

I think this can be started at any time, since it's independent of how the types are written.

Various PRs to start adding types (by module? Maybe good sprint material)

It'd be helpful to hear from people who have experience typing a large code base about where best to start. https://mypy.readthedocs.io/en/latest/existing_code.html provides some guidance.

@Naddiseo
Copy link
Contributor

Naddiseo commented Mar 9, 2019

Since I was asked, and I'm in the middle of annotating our codebase at $work, I'll share some of what I've learnt.

Convert existing type hints to Python 3 syntax (started in PR #25576 - is this complete?)

#25576 only allowed stubgen to run on pandas to generate the .pyi files, the generated files are far from complete, most things are annotated as Any. It's a good starting point though to see what stubgen/mypy sees. At $work, I've been using the python3 syntax for annotating inline since I've found for my own code that maintaining .pyi outside of the code allows for drift between the .py and .pyi, and it's more files to keep track of.

I've also been taking advantage of python 3.7 from __future__ import annotations and if typing.TYPE_CHECKING: to make sure that the annotations don't have a runtime overhead. It is a bit of boiler plate, but for me personally I like the trade-off. It does however mean that types using typing.NamedTuple and typing.cast need to be outside of the if TYPE_CHECKING block.

Configure CI for type checking (assuming mypy?)
@TomAugspurger: I think this can be started at any time, since it's independent of how the types are written.

In general I think CI typing checking a good idea, but I think it should be an "optional" pass while the types are being written. I enabled mypy in our CI for a project a few years ago, and it was discouraging and frustrating to get started because of all the untyped code, which then caused us to ignore it and eventually disable it. If I were to do it again, it'd make the CI linting more of a warning than and error initially, and make mypy as permissive as possible until everything was properly typed; which might take time, and may even require a mypy plugin for complex things.

  • Define base typing module for pandas (array_like, dt_like, etc...)
  • Various PRs to start adding types (by module? Maybe good sprint material)

I can't speak to much on this since I'm not all that familiar with pandas itself, but I would expect good typing on the main data types, and the public API would be most useful for people using pandas. If thorough/correct typing on the main data types is too difficult without a plugin, perhaps Type or NewType would work initially, and they can be refined in the future.

My approach at work has been to add annotations to new functions/files, and add them to existing functions as I've had need to (if mypy complains for example). I've found that there were some very common files/classes that were used in a lot of places that I needed to annotate, and that would reduce a large percentage of the mypy errors. Mypy is pretty good at guessing types. I think the link that @TomAugspurger provided about annotating existing code bases is pretty much how I started, and has been helpful.

@jreback jreback added the Code Style Code style, linting, code_checks label Mar 10, 2019
@WillAyd
Copy link
Member Author

WillAyd commented Mar 11, 2019

Right now in #25622 I've whitelisted existing modules that have type annotations and made what I would consider "simple" updates (imports, typo fixes) to get a clean return run against those files.

When you look at all files that have annotations, that would leave the following which are currently causing failures but probably need deeper inspection to resolve the issue:

pandas/core/base.py
pandas/core/arrays/array_.py
pandas/core/arrays/datetimelike.py
pandas/core/arrays/sparse.py
pandas/core/arrays/period.py
pandas/core/arrays/integer.py
pandas/core/indexes/datetimelike.py
pandas/core/indexes/period.py
pandas/core/internals/blocks.py

I think each of them may need a dedicated PR(s) before adding into our whitelist, but I think that's the best go forward path to get momentum on this. Feedback welcome

@WillAyd WillAyd added Typing type annotations, mypy/pyright type checking and removed Code Style Code style, linting, code_checks labels Mar 20, 2019
@WillAyd
Copy link
Member Author

WillAyd commented Mar 20, 2019

FYI I've added a Typing label and added this as a project which might be easier to navigate than this issue. Leaving this open of course for ongoing discussions

@topper-123
Copy link
Contributor

topper-123 commented May 9, 2019

I've worked on adding types to dtypes/dtypes.py and I think there's a need to have some guidelines for how type hints are added. It's quite tricky not to go a unproductive path.

I propose some rules to be added to contributing.rst (but written better than below, the below is just my quick summation of my proposal):

  1. Type hints on return values should be precise, (Example Dict rather than Mapping if returns a dict):
    • This one is self-explanatory, I think
  2. Type hint on parameters should be as inclusive as possible. If this for some parameters is not practical , possible or makes for very long, ugly type hints, don't annotate that parameter. Do still annotate the other simpler parameters.
    • The first part is self-explanatory, I think.
    • The second part is from my experience that some pandas parameters take so many input types, that it gives very little value to annotate. In other words, having a type annotation that amounts to little more than Any, should not be used.. Sometimes it's (AFAIK) not even possible to type annotate precisely, e.g. if something takes iterables, but not scalars (strings etc.).
    • While Any to mypy is the same as no annotation, for Pandas I think think no annotation should mean "This is not possible/practical to annotate" while any should mean actually anything.
  3. If the complete annotated definition doesn't fit on one line (including return type), always have one line per parameter/return type for functions/method definitions:
    • Type hints quckly make for difficult-to-read code if all stuffed together. If a function signature with type annotations goes over more than one line, just give room for one line per parameter. This includes the return value.
  4. Don't import stuff from unrelated places in the pandas code hierarchy just to use as a type hint.
    • Importing just to type annotate does very quickly cause circular imports and Python is not very good at pointing out what causes a circular import error, so this is a pain to debug.
    • Example: Don't import core.arrays.Categorical into core.dtypes.dtypes just to type annotate, as this causes circular imports. Importing CategoricalDtype into categorical.py is ok OTOH, because it's used there for other purposes already.
    • This is a trade-off as having the types would improve usability for the end user. I just think the pain to develop againt his would not be worth it.
  5. Don't add a return type to __init__. This is subjective maybe, but the return value is always None and IMO it just adds line noise to have that on __init__.
  6. Make sure the final attribute gets the precise correct type, if possible, even when the related parameter is not very well defined. Use typing.cast for this, if needed.

For an example, I've made a PR at #26327 that illustrates some of the issues I've encountered.

See e.g. this:

  1. One line per parameter, including return value to make the method definition as clean as possible.
  2. categories has no type annotation. It's not possible AFAIK to express an "Iterable, except str or bytes", so rather use no type annotation IMO.

Also see here:

  1. There's no return type. The return would be Index and ideally we'd want it, but it would need to be imported, increasing the risk of circular import errors.
  2. Instead of a return type, typing.cast(Index, self._categories) is used. This is uglier (but avoids the circular import issue) and works (in PyCharm, at least).

I'm absolutely not an expert on type hints, so this is a proposal to discuss and hopefully we can at some point in the future add some guide lines for how we do type hinting in Pandas.

@WillAyd
Copy link
Member Author

WillAyd commented May 9, 2019

@topper-123 thanks for the input - this is great!

  1. Type hints on return values should be precise, (Example Dict rather than Mapping if returns a dict):

Agreed

2. Type hint on parameters should be as inclusive as possible. If this for some parameters is not practical , possible or makes for very long, ugly type hints, don't annotate that parameter.

I would disagree here - if the signature is so complex that we can't annotate it then I would think it's surely a risk in our code for bugs. We would either want to alias the type or refactor the code to make things more apparent (the latter is obviously easier said than done).

Are there particular spots in the code base already you can point to which highlight this point though?

3. If the complete annotated definition doesn't fit on one line (including return type), always have one line per parameter/return type for functions/method definitions:

Yea I think one parameter per line is a easy standard to follow.

4. Don't import stuff from unrelated places in the pandas code hierarchy just to use as a type hint.

I don't have a strong preference on this yet, but I slightly disagree for now. I think import machinery / standards is a larger discussion than type hints which problems here may drive, but I don't want that to preclude us from typing

5. Don't add a return type to __init__. This is subjective maybe, but the return value is always None and IMO it just adds line noise to have that on __init__

I'd be OK with this too. I think this was a fairly recent change on the mypy side (see python/mypy#5677) so probably a moot point going forward

6. Make sure the final attribute gets the precise correct type, if possible, even when the related parameter is not very well defined. Use typing.cast for this, if needed.

Not sure I follow this one. We are trying to limit use of typing.cast generally as it is purely a function of type hinting and doesn't add anything to the code. From my limited experience I've seen instances where typing.cast could be avoided with a refactor of code that makes both human and mypy inference easier, so immediately using it kind of misses the point of static analysis. Have also seen instances where it's unavoidable but again think this should be a last resort

If you wanted to draft up something for contributing.rst that would be great!

@topper-123
Copy link
Contributor

topper-123 commented May 9, 2019

  1. Type hint on parameters should be as inclusive as possible. If this for some parameters is not practical , possible or makes for very long, ugly type hints, don't annotate that parameter.

I would disagree here - if the signature is so complex that we can't annotate it then I would think it's surely a risk in our code for bugs. We would either want to alias the type or refactor the code to make things more apparent (the latter is obviously easier said than done).

Yeah, a complex type signature can be a sign that the code should be restructured. But not always. for example Pandas considers a str a scalar (not list-like) while Python doesn't have the same concept of scalar and considers str an iterable (list-like). So there's some places where the Python type system won't fit very well with Pandas.

The clearest example of an untypable parameter would be the data parameter in DataFrame.__init__, because it's so flexible (which I like BTW).

But also consider the categories parameter of Categorical.__init__. If you type hint it, you will have to accept that some aspect of the type hint is wrong. E.g. Iterable[Hashable] would include str and bytes (who are scalars in pandas-speak).

  1. Don't import stuff from unrelated places in the pandas code hierarchy just to use as a type hint.

I don't have a strong preference on this yet, but I slightly disagree for now. I think import machinery / standards is a larger discussion than type hints which problems here may drive, but I don't want that to preclude us from typing

For example try adding the line from pandas.core import index to core/dtypes/dtypes.py. It will flash a ImportError. Try reading the error messages, imagining you don't know the cause of the error.

The above is fixable and maybe should be fixed, but if everything import from everywhere, the risk of such issues will be very large.

  1. Make sure the final attribute gets the precise correct type, if possible, even when the related parameter is not very well defined. Use typing.cast for this, if needed.

Not sure I follow this one. We are trying to limit use of typing.cast generally as it is purely a function of type hinting and doesn't add anything to the code. From my limited experience I've seen instances where typing.cast could be avoided with a refactor of code that makes both human and mypy inference easier, so immediately using it kind of misses the point of static analysis. Have also seen instances where it's unavoidable but again think this should be a last resort

Yeah, typing.cast is ugly, and should be avoided, if possible.

This point was largely related to the import issue in point 4 above, but can also be more general. For example, a lot of methods accept a list-like parameter and end up transforming it into e.g. a ndarray. I think in such cases using e.g. # type: np.ndarray is not always possible, but I could be wrong.

Anyway, the point was more that mypy/PyCharm etc should always be able to know the types as precisely as possible, and e.g. Iterable[Hashable] should not be acceptable for a return type or a attributes, unless specific reasons say otherwise.

@jorisvandenbossche
Copy link
Member

If you type hint it, you will have to accept that some aspect of the type hint is wrong. E.g. Iterable[Hashable] would include str and bytes (who are scalars in pandas-speak)

(typing noob here, sorry if I speak nonsense) Shouldn't we for that reason try to define some of our own type unions? Exactly because we have some flexible but still specific set of values that are typically accepted ? Eg a ListLike or ArrayLike for that case of iterable but not string ?

@WillAyd
Copy link
Member Author

WillAyd commented Jun 10, 2019

Existing annotations are mostly cleaned up so #25882 will be closed out soon. I think the next logical follow up will be to add annotations to items exposed via our API. Issue coming soon

@zero323
Copy link

zero323 commented Aug 29, 2019

Just passing but I thought I might put my two cents in. Please keep in mind that this comes from a place, where the cost of potential failure is way more expensive than additional dev time.

  1. Type hint on parameters should be as inclusive as possible. If this for some parameters is not practical , possible or makes for very long, ugly type hints, don't annotate that parameter. Do still annotate the other simpler parameters.

Personally I'd advise against such approach. There are a few reasons for that:

  • First of all if the type hint is truly atrocious (from my experience this is usually some crazy Union), you can always extract it as an alias or, when applicable, TypeVar. If such object is still polluting the core codebase, it can be moved to a dedicated module (I use "virtual" _typing modules like this for this purpose).

  • Arguably it is better to cover 95% of the most common use cases, and allow for false positive for the remaining ones, than leave all users with Any.

    If some unusual usage yields type checker error due too restrictive annotation, user can always opt-out for the particular piece of code, but majority of users will still benefit from available type hints. Just saying...

  1. If the complete annotated definition doesn't fit on one line (including return type), always have one line per parameter/return type for functions/method definitions:

It makes perfect sense, though it is worth pointing out, that complex signatures can, but not always do, indicate good candidates for @overload.

A trivial example would be pd.DataFrame - if copy is applicable only to DataFrame / 2d ndarray input, then one should provide at least two separate signatures. In many cases this can also help to keep individual argument annotations under control, and indirectly address 1.

  1. Don't import stuff from unrelated places in the pandas code hierarchy just to use as a type hint.

Circular imports when annotating code are indeed a nightmare, module imports instead of import ... from ..., are more than sufficient in this context.

In the most severe cases extracting annotations into stub file could provide additional layer of "insulation".

  1. Make sure the final attribute gets the precise correct type, if possible, even when the related parameter is not very well defined. Use typing.cast for this, if needed.

In case of import issues this really shouldn't be necessary, but I don't really understand the other use case.

@simonjayhawkins
Copy link
Member

@zero323 Many thanks for sharing your wisdom here.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 18, 2019

Closing as I haven't updated this tracker and this is done in separate issues

@WillAyd WillAyd closed this as completed Sep 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Master Tracker High level tracker for similar issues Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

No branches or pull requests

8 participants