Fix read_excel dtypes and converters to use Mapping not dict #503
Only 1 comment when we use Mapping. I think we should always try a mapping that isn't a subclass of a dict to make sure that pandas isn't requiring a dict. |
Any suggestions for how to create a Had to use |
I think this is a pretty minimal implementation:

from collections.abc import Iterator, Mapping
from typing import Hashable


class CustomMapping(Mapping):
    def __init__(self, *key_value_tuples: tuple[Hashable, object]):
        # Store the pairs in a private dict; the class itself is not a dict.
        self._d = dict(key_value_tuples)

    def __iter__(self) -> Iterator[Hashable]:
        for k in self._d.keys():
            yield k

    def __getitem__(self, key: Hashable) -> object:
        if key in self._d:
            return self._d[key]
        raise KeyError(f"Key {key} not found")

    def __len__(self) -> int:
        return len(self._d)
|
Use like |
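A hypothetical usage sketch follows. The keys and dtypes are made up for illustration, and the class from the comment above is repeated so the snippet runs on its own. It only checks that the object is a `Mapping` without being a `dict`; it does not call pandas.

```python
from collections.abc import Iterator, Mapping
from typing import Hashable


class CustomMapping(Mapping):
    """Read-only mapping that deliberately does not subclass dict."""

    def __init__(self, *key_value_tuples: tuple[Hashable, object]) -> None:
        self._d = dict(key_value_tuples)

    def __iter__(self) -> Iterator[Hashable]:
        return iter(self._d)

    def __getitem__(self, key: Hashable) -> object:
        return self._d[key]

    def __len__(self) -> int:
        return len(self._d)


# Hypothetical column-to-dtype mapping, built from (key, value) tuples.
dtypes = CustomMapping(("a", int), ("b", str), ("c", float))

print(isinstance(dtypes, Mapping))  # True
print(isinstance(dtypes, dict))     # False
```

Any code path that does an `isinstance(..., dict)` check would therefore treat this object differently from a plain `dict`, which is what such a test is meant to expose.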
@bashtage So I tried using I think we need to keep this as is. The only way to allow a |
I've said this elsewhere but I strongly prefer using a |
I think it is wrong to include Mapping if Mapping isn't supported. AFAIK the implementation above is minimal and correct. Perhaps we should introduce a dict-like type that would contain dict, TypedDict, and anything else that could be shown to work. I think it would be better to just type as dict with no args than to incorrectly type as Mapping. |
@Dr-Irv what is the issue? Does pandas check the type, or is it just that some method doesn't exist? |
pandas doesn't see the |
See https://stackoverflow.com/questions/52487663/python-type-hints-typing-mapping-vs-typing-dict The problem here is that To me, there is a benefit here in that by using |
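The linked Stack Overflow question is about how type checkers treat `Mapping` versus `dict`. As a minimal sketch of the variance difference (the function names `takes_dict` and `takes_mapping` are hypothetical and not part of pandas or pandas-stubs):

```python
from collections.abc import Mapping


def takes_dict(dtype: dict[str, object]) -> int:
    """Hypothetical function annotated with an invariant dict parameter."""
    return len(dtype)


def takes_mapping(dtype: Mapping[str, object]) -> int:
    """Hypothetical function annotated with a Mapping parameter."""
    return len(dtype)


int_values: dict[str, int] = {"a": 1, "b": 2}

# Accepted by type checkers: Mapping is covariant in its value type,
# so dict[str, int] is compatible with Mapping[str, object].
takes_mapping(int_values)

# Runs fine at runtime, but a type checker is expected to reject this call:
# dict is invariant in its value type, so dict[str, int] is not a
# dict[str, object].
takes_dict(int_values)
```

Both calls behave identically at runtime; the difference only shows up under static analysis.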
Would

_DtypeStrT = TypeVar("_DtypeStrT", Dtype, str)
dict[str, _DtypeStrT]

work? It is a bit ugly and could create some unintended effects when using this TypeVar twice within a signature, but I think this would 1) be a dict and 2) allow any combination of values (and probably also accept TypedDict). |
So with this,

dtypes = {"a": np.int64, "b": str, "c": np.float64}
check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)
|
@bashtage can you review? |
Would this work for mypy?

dtypes: dict[str, type[np.int64] | type[str] | type[np.float64]] = {"a": np.int64, "b": str, "c": np.float64}
check(assert_type(read_excel(path, dtype=dtypes), pd.DataFrame), pd.DataFrame)

While this isn't ideal for users, it seems to be a shortcoming of the type checker. |
Probably, but I don't think we should be forcing people to write code like that when we can support things by using |
Whichever way we go (Mapping or dict+TypeVar), we should document that in the philosophy/limitations. Based on mypy's behavior, I'm slightly inclined to go for Mapping (even if the implementation doesn't actually accept any non-dict mapping). We should probably revisit all the dict arguments and change them to Mapping.

This discussion feels a bit similar to the discussion about Sequence[Hashable] (contains str) vs list[HashableT] (doesn't contain str), except that we seem to be heading in the opposite direction (from a typing perspective). From a "most frequently used" perspective (str is very common, custom subclasses of Mapping are rare), we catch the most common errors by going for Mapping and list+TypeVar.

Since pandas-stubs aims for annotations that are neither too wide nor too tight, it would be good to briefly summarize all the tough/not-100%-satisfactory choices in one of the markdown files:
|
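To illustrate the Sequence[Hashable] vs list[HashableT] trade-off mentioned above, here is a small sketch. `HashableT` is defined locally as an assumption for this example, and `count_wide` and `count_narrow` are hypothetical functions, not pandas APIs.

```python
from collections.abc import Hashable, Sequence
from typing import TypeVar

# Local, hypothetical definition for this sketch.
HashableT = TypeVar("HashableT", bound=Hashable)


def count_wide(labels: Sequence[Hashable]) -> int:
    """Wide annotation: a bare str also matches, since str is a Sequence[str]."""
    return len(labels)


def count_narrow(labels: list[HashableT]) -> int:
    """Narrow annotation: only lists match, so a bare str is flagged statically."""
    return len(labels)


# A str passes the wide annotation and is silently treated as a
# sequence of characters.
print(count_wide("abc"))      # 3

# The intended call wraps the label in a list.
print(count_narrow(["abc"]))  # 1

# The runtime picture mirrors the static one.
print(isinstance("abc", Sequence), isinstance("abc", list))  # True False
```

A type checker is expected to reject `count_narrow("abc")`, which is the common mistake the narrower annotation is meant to catch.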
That's a good idea. Care to create a PR to describe this?? 😁 |
@twoertwein haven't heard from @bashtage in a few weeks so can you review this, approve, and merge if OK? Thanks. |
I'll open a PR later today to summarize the difficult, not-100%-satisfactory decisions.
test_io.py:read_excel_dtypes()