Skip to content

Conversation

Dr-Irv
Copy link
Collaborator

@Dr-Irv Dr-Irv commented Jan 9, 2023

@Dr-Irv Dr-Irv requested a review from bashtage January 9, 2023 21:28
@bashtage
Copy link
Contributor

bashtage commented Jan 9, 2023

Only 1 comment when we use Mapping. I think we should always try a mapping that isn't a subclass of a dict to make sure that pandas isn't requiring a dict.

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Jan 9, 2023

Only 1 comment when we use Mapping. I think we should always try a mapping that isn't a subclass of a dict to make sure that pandas isn't requiring a dict.

Any suggestions for how to create a Mapping that is not a subclass of dict ?

Had to use Mapping here to allow a mix of different types of values.

@bashtage
Copy link
Contributor

bashtage commented Jan 9, 2023

I think this is a pretty minimal implementation:

from collections.abc import Mapping
from typing import Hashable


class CustomMapping(Mapping):
    def __init__(self, *key_value_tuples: tuple[Hashable, object]):
        self._d = dict(key_value_tuples)

    def __iter__(self) -> Hashable:
        for k in self._d.keys():
            yield k

    def __getitem__(self, key: Hashable) -> object:
        if key in self._d:
            return self._d[key]
        raise KeyError(f"Key {key} not found")

    def __len__(self) -> int:
        return len(self._d)

@bashtage
Copy link
Contributor

bashtage commented Jan 9, 2023

Use like
c = CustomMapping(("a", 1), ("b", 2))

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Jan 10, 2023

@bashtage So I tried using CustomMapping on this PR, and it fails, as pandas doesn't know what to do with this kind of data type.

I think we need to keep this as is. The only way to allow a Union of types for the values of a Dict is to use Mapping instead. If someone decides to use a custom mapping it will fail, but I am guessing most people won't do that.

@gandhis1
Copy link
Contributor

I've said this elsewhere but I strongly prefer using a Mapping in any public API (even though it arguably may be too lax) because I frequently pass a TypedDict for some of these parameters and that just doesn't work with a regular dict.

@bashtage
Copy link
Contributor

I think it is wrong to include Mapping if Mapping isn't supported. AFAIK the implementation above is minimal and correct. Perhaps we should introduce a dict-like type that would contains dict,typed dict, and anything else that could be shown to work. I think it would be better to just type as dict with no args than incorrectly type as Mapping.

@bashtage
Copy link
Contributor

@Dr-Irv what is the issue? Does Pandas check the type or just spoke method doesn't exist?

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Jan 10, 2023

@Dr-Irv what is the issue? Does Pandas check the type or just spoke method doesn't exist?

pandas doesn't see the CustomMapping as a known type, so it thinks it is a numpy type, and tries to convert t.

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Jan 10, 2023

I think it is wrong to include Mapping if Mapping isn't supported. AFAIK the implementation above is minimal and correct. Perhaps we should introduce a dict-like type that would contains dict,typed dict, and anything else that could be shown to work. I think it would be better to just type as dict with no args than incorrectly type as Mapping.

See https://stackoverflow.com/questions/52487663/python-type-hints-typing-mapping-vs-typing-dict

The problem here is that dict is invariant, while Mapping is covariant. In the case of read_excel() and the dtype argument, what we had in there before was dict[str, str | Dtype], and I changed it to Mapping[str, str | Dtype] . That allows a dict with mixed types in the values like dtypes = {"a": np.int64, "b": str, "c": np.float64} to work.

To me, there is a benefit here in that by using Mapping, we are able to specify the allowable values of the dictionary. If we just use dict without any type specification, then we allow any dict to be passed, which is not as helpful. So by using Mapping, we are capturing more use cases, helping people with typing, while missing the cases where someone created their own Mapping type, which I think is pretty rare.

@twoertwein
Copy link
Member

Would

_DtypeStrT = TypeVar("_DtypeStrT", Dtype, str)
dict[str, _DtypeStrT]

work? It is a bit ugly and could create some unintended effects when using this TypeVar twice within a signature but I think this would 1) be a dict and 2) allow any combination of values (and probably also accept TypedDict).

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Jan 28, 2023

Would

_DtypeStrT = TypeVar("_DtypeStrT", Dtype, str)
dict[str, _DtypeStrT]

work? It is a bit ugly and could create some unintended effects when using this TypeVar twice within a signature but I think this would 1) be a dict and 2) allow any combination of values (and probably also accept TypedDict).

So with this, pyright accepts it, but mypy does not. We have this in the test:

        dtypes = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

mypy doesn't infer the type of dtypes to have the mixture in the values. It sees dtypes as dict[str, type]. pyright sees the type of dtypes as dict[str, int64 | Type[str] | float64] . And mypy says that dict is invariant.

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Jan 31, 2023

@bashtage can you review?

@twoertwein
Copy link
Member

        dtypes = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

mypy doesn't infer the type of dtypes to have the mixture in the values. It sees dtypes as dict[str, type].

Would this work for mypy?

        dtypes: dict[str, int64 | type[str] | float64] = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

While this isn't ideal for users, it seems to be a shortcoming of a type checker.

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Feb 2, 2023

Would this work for mypy?

        dtypes: dict[str, int64 | type[str] | float64] = {"a": np.int64, "b": str, "c": np.float64}
        check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

While this isn't ideal for users, it seems to be a shortcoming of a type checker.

Probably, but I don't think we should be forcing people to write code like that when we can support things by using Mapping. And, we see too many people using mypy to do type checking.

@twoertwein
Copy link
Member

Which ever way we go (Mapping or dict+TypeVar), we should document that in the philosophy/limitations. Based on mypy's behavior, I'm slightly inclined to go for Mapping (even if the implementation doesn't actually accept any non-dict mapping). Should probably revisit all the dict arguments and change them to Mapping.

This discussion feels a bit similar to the discussion about Sequence[Hashable] (contains str) vs list[HashableT] (doesn't contain str ), except that we seem to be heading in the opposite direction (from a typing perspective). From a "most-frequently used" perspective (str is very common, custom subclasses of Mapping are rare), we catch the most common errors by going for Mapping and list+typevar.

Since pandas-stubs aims to have not too wide but also not too tight annotations, it would be good to briefly summarize all the tough/not-100%-satisfactory choices in one of the markdown files:

  • Prefer list[HashableT] | ... over Sequence[Hashable] even if the implementation accepts all non-str sequences.
  • Prefer Mapping[..., covariant-key] over dict[..., covariantkeyT] even if the implementation requires subclasses of dict.

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Feb 2, 2023

Since pandas-stubs aims to have not too wide but also not too tight annotations, it would be good to briefly summarize all the tough/not-100%-satisfactory choices in one of the markdown files:

  • Prefer list[HashableT] | ... over Sequence[Hashable] even if the implementation accepts all non-str sequences.
  • Prefer Mapping[..., covariant-key] over dict[..., covariantkeyT] even if the implementation requires subclasses of dict.

That's a good idea. Care to create a PR to describe this?? 😁

@Dr-Irv
Copy link
Collaborator Author

Dr-Irv commented Feb 3, 2023

@twoertwein haven't heard from @bashtage in a few weeks so can you review this, approve, and merge if OK? Thanks.

@Dr-Irv Dr-Irv requested a review from twoertwein February 3, 2023 17:04
Copy link
Member

@twoertwein twoertwein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open a PR later today to summarize the not 100% difficult decisions.

@twoertwein twoertwein merged commit 49455ce into pandas-dev:main Feb 3, 2023
@Dr-Irv Dr-Irv deleted the issue440 branch February 4, 2023 17:03
twoertwein pushed a commit to twoertwein/pandas-stubs that referenced this pull request Apr 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dtype arg of read_excel is too strict

4 participants