Fix read_excel dtypes and converters to use Mapping not dict #503

Dr-Irv · 2023-01-09T21:28:05Z

Closes dtype arg of read_excel is too strict #440
Tests added:
- test_io.py:read_excel_dtypes()

bashtage · 2023-01-09T21:32:05Z

Only 1 comment when we use Mapping. I think we should always try a mapping that isn't a subclass of a dict to make sure that pandas isn't requiring a dict.

Dr-Irv · 2023-01-09T21:35:33Z

Only 1 comment when we use Mapping. I think we should always try a mapping that isn't a subclass of a dict to make sure that pandas isn't requiring a dict.

Any suggestions for how to create a Mapping that is not a subclass of dict ?

Had to use Mapping here to allow a mix of different types of values.

bashtage · 2023-01-09T23:19:39Z

I think this is a pretty minimal implementation:

from collections.abc import Mapping
from typing import Hashable


class CustomMapping(Mapping):
    def __init__(self, *key_value_tuples: tuple[Hashable, object]):
        self._d = dict(key_value_tuples)

    def __iter__(self) -> Hashable:
        for k in self._d.keys():
            yield k

    def __getitem__(self, key: Hashable) -> object:
        if key in self._d:
            return self._d[key]
        raise KeyError(f"Key {key} not found")

    def __len__(self) -> int:
        return len(self._d)

bashtage · 2023-01-09T23:20:11Z

Use like
c = CustomMapping(("a", 1), ("b", 2))

Dr-Irv · 2023-01-10T14:16:13Z

@bashtage So I tried using CustomMapping on this PR, and it fails, as pandas doesn't know what to do with this kind of data type.

I think we need to keep this as is. The only way to allow a Union of types for the values of a Dict is to use Mapping instead. If someone decides to use a custom mapping it will fail, but I am guessing most people won't do that.

gandhis1 · 2023-01-10T14:26:15Z

I've said this elsewhere but I strongly prefer using a Mapping in any public API (even though it arguably may be too lax) because I frequently pass a TypedDict for some of these parameters and that just doesn't work with a regular dict.

bashtage · 2023-01-10T17:14:33Z

I think it is wrong to include Mapping if Mapping isn't supported. AFAIK the implementation above is minimal and correct. Perhaps we should introduce a dict-like type that would contains dict,typed dict, and anything else that could be shown to work. I think it would be better to just type as dict with no args than incorrectly type as Mapping.

bashtage · 2023-01-10T17:15:36Z

@Dr-Irv what is the issue? Does Pandas check the type or just spoke method doesn't exist?

Dr-Irv · 2023-01-10T17:19:37Z

@Dr-Irv what is the issue? Does Pandas check the type or just spoke method doesn't exist?

pandas doesn't see the CustomMapping as a known type, so it thinks it is a numpy type, and tries to convert t.

Dr-Irv · 2023-01-10T17:24:29Z

I think it is wrong to include Mapping if Mapping isn't supported. AFAIK the implementation above is minimal and correct. Perhaps we should introduce a dict-like type that would contains dict,typed dict, and anything else that could be shown to work. I think it would be better to just type as dict with no args than incorrectly type as Mapping.

See https://stackoverflow.com/questions/52487663/python-type-hints-typing-mapping-vs-typing-dict

The problem here is that dict is invariant, while Mapping is covariant. In the case of read_excel() and the dtype argument, what we had in there before was dict[str, str | Dtype], and I changed it to Mapping[str, str | Dtype] . That allows a dict with mixed types in the values like dtypes = {"a": np.int64, "b": str, "c": np.float64} to work.

To me, there is a benefit here in that by using Mapping, we are able to specify the allowable values of the dictionary. If we just use dict without any type specification, then we allow any dict to be passed, which is not as helpful. So by using Mapping, we are capturing more use cases, helping people with typing, while missing the cases where someone created their own Mapping type, which I think is pretty rare.

twoertwein · 2023-01-28T15:26:03Z

Would

_DtypeStrT = TypeVar("_DtypeStrT", Dtype, str)
dict[str, _DtypeStrT]

work? It is a bit ugly and could create some unintended effects when using this TypeVar twice within a signature but I think this would 1) be a dict and 2) allow any combination of values (and probably also accept TypedDict).

Dr-Irv · 2023-01-28T22:20:15Z

Would
_DtypeStrT = TypeVar("_DtypeStrT", Dtype, str)
dict[str, _DtypeStrT]
work? It is a bit ugly and could create some unintended effects when using this TypeVar twice within a signature but I think this would 1) be a dict and 2) allow any combination of values (and probably also accept TypedDict).

So with this, pyright accepts it, but mypy does not. We have this in the test:

        dtypes = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

mypy doesn't infer the type of dtypes to have the mixture in the values. It sees dtypes as dict[str, type]. pyright sees the type of dtypes as dict[str, int64 | Type[str] | float64] . And mypy says that dict is invariant.

Dr-Irv · 2023-01-31T13:10:26Z

@bashtage can you review?

twoertwein · 2023-02-02T20:04:16Z

        dtypes = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)
mypy doesn't infer the type of dtypes to have the mixture in the values. It sees dtypes as dict[str, type].

Would this work for mypy?

        dtypes: dict[str, int64 | type[str] | float64] = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

While this isn't ideal for users, it seems to be a shortcoming of a type checker.

Dr-Irv · 2023-02-02T20:11:37Z

Would this work for mypy?

        dtypes: dict[str, int64 | type[str] | float64] = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

While this isn't ideal for users, it seems to be a shortcoming of a type checker.

Probably, but I don't think we should be forcing people to write code like that when we can support things by using Mapping. And, we see too many people using mypy to do type checking.

twoertwein · 2023-02-02T22:01:21Z

Which ever way we go (Mapping or dict+TypeVar), we should document that in the philosophy/limitations. Based on mypy's behavior, I'm slightly inclined to go for Mapping (even if the implementation doesn't actually accept any non-dict mapping). Should probably revisit all the dict arguments and change them to Mapping.

This discussion feels a bit similar to the discussion about Sequence[Hashable] (contains str) vs list[HashableT] (doesn't contain str ), except that we seem to be heading in the opposite direction (from a typing perspective). From a "most-frequently used" perspective (str is very common, custom subclasses of Mapping are rare), we catch the most common errors by going for Mapping and list+typevar.

Since pandas-stubs aims to have not too wide but also not too tight annotations, it would be good to briefly summarize all the tough/not-100%-satisfactory choices in one of the markdown files:

Prefer list[HashableT] | ... over Sequence[Hashable] even if the implementation accepts all non-str sequences.
Prefer Mapping[..., covariant-key] over dict[..., covariantkeyT] even if the implementation requires subclasses of dict.

Dr-Irv · 2023-02-02T22:17:34Z

Since pandas-stubs aims to have not too wide but also not too tight annotations, it would be good to briefly summarize all the tough/not-100%-satisfactory choices in one of the markdown files:

Prefer list[HashableT] | ... over Sequence[Hashable] even if the implementation accepts all non-str sequences.

Prefer Mapping[..., covariant-key] over dict[..., covariantkeyT] even if the implementation requires subclasses of dict.

That's a good idea. Care to create a PR to describe this?? 😁

Dr-Irv · 2023-02-03T17:04:27Z

@twoertwein haven't heard from @bashtage in a few weeks so can you review this, approve, and merge if OK? Thanks.

twoertwein

I'll open a PR later today to summarize the not 100% difficult decisions.

…dev#503)

Fix read_excel dtypes and converters to use Mapping not dict

9a649bb

Dr-Irv requested a review from bashtage January 9, 2023 21:28

Dr-Irv requested a review from twoertwein February 3, 2023 17:04

twoertwein approved these changes Feb 3, 2023

View reviewed changes

twoertwein merged commit 49455ce into pandas-dev:main Feb 3, 2023

Dr-Irv deleted the issue440 branch February 4, 2023 17:03

twoertwein pushed a commit to twoertwein/pandas-stubs that referenced this pull request Apr 1, 2023

Fix read_excel dtypes and converters to use Mapping not dict (pandas-…

c5d6a86

…dev#503)

Uh oh!

Fix read_excel dtypes and converters to use Mapping not dict #503

Fix read_excel dtypes and converters to use Mapping not dict #503

Uh oh!

Conversation

Dr-Irv commented Jan 9, 2023

Uh oh!

bashtage commented Jan 9, 2023

Uh oh!

Dr-Irv commented Jan 9, 2023

Uh oh!

bashtage commented Jan 9, 2023

Uh oh!

bashtage commented Jan 9, 2023

Uh oh!

Dr-Irv commented Jan 10, 2023

Uh oh!

gandhis1 commented Jan 10, 2023

Uh oh!

bashtage commented Jan 10, 2023

Uh oh!

bashtage commented Jan 10, 2023

Uh oh!

Dr-Irv commented Jan 10, 2023

Uh oh!

Dr-Irv commented Jan 10, 2023

Uh oh!

twoertwein commented Jan 28, 2023

Uh oh!

Dr-Irv commented Jan 28, 2023

Uh oh!

Dr-Irv commented Jan 31, 2023

Uh oh!

twoertwein commented Feb 2, 2023

Uh oh!

Dr-Irv commented Feb 2, 2023

Uh oh!

twoertwein commented Feb 2, 2023

Uh oh!

Dr-Irv commented Feb 2, 2023

Uh oh!

Dr-Irv commented Feb 3, 2023

Uh oh!

twoertwein left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants