New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TYP: Typ part of python_parser #44406
Conversation
� Conflicts: � pandas/io/parsers/base_parser.py � pandas/io/parsers/c_parser_wrapper.py � pandas/io/parsers/python_parser.py
|
Unfortunately mypy does not accept this, if I want to put
This raises
|
@@ -233,7 +243,7 @@ def _read(): | |||
# TextIOWrapper, mmap, None]") | |||
self.data = reader # type: ignore[assignment] | |||
|
|||
def read(self, rows=None): | |||
def read(self, rows: int | None = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR: we have variable: int | None = None
in a few places, using variable: int = -1
looks nicer to me (but would need changing the implementation).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be more readable I think, yes
# error: Incompatible types in assignment (expression has type | ||
# "Union[List[Union[Union[str, int, float, bool], Union[Period, Timestamp, | ||
# Timedelta, Any]]], Index]", variable has type "Index") [assignment] | ||
frame.columns, frame = self._do_date_conversions( # type: ignore[assignment] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to typing: Can _do_date_conversions
return a new DataFrame
object? If yes, this line should maybe be re-written as
columns, frame = self._do_date_conversions(...)
frame.columns = columns
otherwise, python will set columns
on the old frame
object and then replace the frame
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it will return the input without modification. This happens only if self.parse_dates is None. In this case the input is returned, so should be fine
# Conflicts: # pandas/io/parsers/base_parser.py # pandas/io/parsers/python_parser.py
@simonjayhawkins would you mind having a look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @phofl
pandas/io/parsers/base_parser.py
Outdated
@@ -237,7 +240,9 @@ def _open_handles( | |||
errors=kwds.get("encoding_errors", "strict"), | |||
) | |||
|
|||
def _validate_parse_dates_presence(self, columns: list[str]) -> None: | |||
def _validate_parse_dates_presence( | |||
self, columns: list[Scalar] | list[tuple[Scalar, ...]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just an observation... no action required.
we should generally avoid using mutable types in function signatures to prevent inadvertent mutation and generally avoid invariant collection types altogether.
from https://github.com/python/typeshed/blob/master/CONTRIBUTING.md#conventions
avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;
I appreciate there is another school of thought within the team that we should be more restrictive and lockdown the types being passed around.
Within the function body we only do an in
on columns
so the passed type does not need to be a list and only needs the __contains__
protocol. So using typing.Container
(deprecated since py3.9 in favor of collections.abc.Container
) would be most permissive.
In general, a function input parameter should be annotated with the widest possible type supported by the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Used Sequence
self, | ||
header, | ||
index_names: list | None, | ||
passed_names: bool = False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When a function contains as much logic as this one, I imagine it takes a bit of effort to establish the types and in the process work out what a function is actually doing for our internal functions. In these cases, it maybe worthwhile improving the docstrings while that knowledge is fresh.
So making the first line a one-liner, adding descriptions (or where they are passed from) for the function parameters and also descriptions of the return types is most welcome.
For the internal functions, we don't need the types for the function parameters and return values in the docstrings as we are adding the type annotations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds good. Will add descriptions for the more complicated functions in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, will enhance when fully typing the function
pandas/io/parsers/base_parser.py
Outdated
@@ -384,15 +392,17 @@ def extract(r): | |||
return names, index_names, col_names, passed_names | |||
|
|||
@final | |||
def _maybe_dedup_names(self, names): | |||
def _maybe_dedup_names( | |||
self, names: list[Scalar] | list[tuple[Scalar, ...]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these are index labels, don't we just use Hashable
elsewhere to represent those? (Hashable
also includes None
which is a valid index label, is that the reason for being more restrictive here?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do that, will have a look tomorrow
pandas/io/parsers/base_parser.py
Outdated
def _check_data_length( | ||
self, | ||
columns: list[Scalar] | list[tuple[Scalar, ...]], | ||
data: list[ArrayLike] | list[np.ndarray], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is confusing at first glance since np.ndarray
is ArrayLike
. can having to split aliases like this be avoided?
Does ArrayLike as a typevar work? Maybe list[ExtensionArray] | list[np.ndarray]
would be clearer? Could use an invariant collection type instead?
the last one is the recommended one from the mypy error...
note: Consider using "Sequence" instead, which is covariant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sequence should work here, thx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
""" | ||
try: | ||
# assert for mypy, data is Iterator[str] or None, would error in next | ||
assert self.data is not None | ||
return next(self.data) | ||
line = next(self.data) | ||
assert isinstance(line, list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add # for mypy
inline comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
probably because ArrayLike includes np.ndarray
see my loosely related comments above, wide types for function parameters and narrow types for return types, even though this seems counter-intuitive if passing the return types from one function onto another.
commented above. why is this not just Hashable? |
Thanks for the review. Regarding the ArrayLike thing, this comes basically down to this: #44406 (comment) Since this is not accepted, overload should allow a redefinition |
@simonjayhawkins adressed comments |
i think this now conflicts can you merge master and can get in |
Yep, most of the relevant stuff should be in now |
# Conflicts: # pandas/io/parsers/base_parser.py # pandas/io/parsers/python_parser.py
merged and resolved conflicts |
@simonjayhawkins I am wondering about mypy with overloads.
The overload for
_do_date_conversion
could be more specific, e.g.could be transformed to
But in this case mypy complains about:
error: Overloaded function signature 3 will never be matched: signature 2's parameter type(s) are the same or broader [misc]
On the other side, if typing this only with
and passing a
dict[Scalar | tuple, np.ndarray]
mypy complains withThis looks inconsistent. I think the overload should accept the distinction between
ArrayLike
andnp.ndarray
with dicts or lists?Technically we could use
Mapping
probably, but we would loose some strictness in this case and the object is alwyas adict
and we know if passing only np.ndarray we will get them in return.On another topic: Should we use an alias for
Scalar | tuple
? We need this throughout the code to indicate column names