
TYP: Typ part of python_parser #44406

Merged: 18 commits into pandas-dev:master on Nov 28, 2021

Conversation

@phofl (Member) commented Nov 12, 2021

@simonjayhawkins I am wondering about mypy with overloads.

The overload for _do_date_conversions could be more specific, e.g.

    @overload
    def _do_date_conversions(
        self,
        names: list[Scalar | tuple],
        data: dict[Scalar | tuple, ArrayLike] | dict[Scalar | tuple, np.ndarray],
    ) -> tuple[
        list[Scalar | tuple],
        dict[Scalar | tuple, ArrayLike] | dict[Scalar | tuple, np.ndarray],
    ]:
        ...

could be transformed to

    @overload
    def _do_date_conversions(
        self,
        names: list[Scalar | tuple],
        data: dict[Scalar | tuple, ArrayLike],
    ) -> tuple[
        list[Scalar | tuple],
        dict[Scalar | tuple, ArrayLike],
    ]:
        ...

    @overload
    def _do_date_conversions(
        self,
        names: list[Scalar | tuple],
        data: dict[Scalar | tuple, np.ndarray],
    ) -> tuple[
        list[Scalar | tuple],
        dict[Scalar | tuple, np.ndarray],
    ]:
        ...

But in this case mypy complains about:
error: Overloaded function signature 3 will never be matched: signature 2's parameter type(s) are the same or broader [misc]

On the other hand, if typing this only with

    @overload
    def _do_date_conversions(
        self,
        names: list[Scalar | tuple],
        data: dict[Scalar | tuple, ArrayLike],
    ) -> tuple[
        list[Scalar | tuple],
        dict[Scalar | tuple, ArrayLike],
    ]:
        ...

and passing a dict[Scalar | tuple, np.ndarray] mypy complains with

pandas/io/parsers/python_parser.py:283: error: Argument 2 to "_do_date_conversions" of "ParserBase" has incompatible type "Dict[Union[Union[Union[str, int, float, bool], Union[Period, Timestamp, Timedelta, Any]], Tuple[Any, ...]], ndarray[Any, Any]]"; expected "Dict[Union[Union[Union[str, int, float, bool], Union[Period, Timestamp, Timedelta, Any]], Tuple[Any, ...]], Union[ExtensionArray, ndarray[Any, Any]]]"  [arg-type]
pandas/io/parsers/python_parser.py:283: note: "Dict" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
pandas/io/parsers/python_parser.py:283: note: Consider using "Mapping" instead, which is covariant in the value type

This looks inconsistent. I think the overloads should be able to distinguish between ArrayLike and np.ndarray inside dicts or lists?

Technically we could probably use Mapping, but we would lose some strictness in this case; the object is always a dict, and we know that if we pass only np.ndarray we will get np.ndarray back.

On another topic: Should we use an alias for Scalar | tuple? We need this throughout the code to indicate column names

# Conflicts:
#	pandas/io/parsers/base_parser.py
#	pandas/io/parsers/c_parser_wrapper.py
#	pandas/io/parsers/python_parser.py
@phofl phofl added IO CSV read_csv, to_csv Typing type annotations, mypy/pyright type checking labels Nov 12, 2021
@twoertwein (Member) commented Nov 13, 2021

The overload for _do_date_conversions could be more specific, e.g.

ArrayLike is defined as Union["ExtensionArray", np.ndarray]. I think you don't need an overload:

_ArrayLikeT = TypeVar("_ArrayLikeT", bound=ArrayLike)

def _do_date_conversions(
    self,
    names: list[Scalar | tuple],
    data: dict[Scalar | tuple, _ArrayLikeT],
) -> tuple[
    list[Scalar | tuple],
    dict[Scalar | tuple, _ArrayLikeT],
]:
    ...

@phofl (Member, Author) commented Nov 13, 2021

Unfortunately, mypy does not accept this if I want to pass a dict[Scalar | tuple, np.ndarray] in, because a dict is invariant. Minimal example:

    def test(x: dict[str, int | float]) -> None:
        print(x)


    x: dict[str, int] = {"a": 1}
    test(x)

This raises

error: Argument 1 to "test" has incompatible type "Dict[str, int]"; expected "Dict[str, Union[int, float]]"  [arg-type]
note: "Dict" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
note: Consider using "Mapping" instead, which is covariant in the value type

@@ -233,7 +243,7 @@ def _read():
# TextIOWrapper, mmap, None]")
self.data = reader # type: ignore[assignment]

def read(self, rows=None):
def read(self, rows: int | None = None):
Member:

Not for this PR: we have variable: int | None = None in a few places, using variable: int = -1 looks nicer to me (but would need changing the implementation).

Member Author:

Would be more readable I think, yes

# error: Incompatible types in assignment (expression has type
# "Union[List[Union[Union[str, int, float, bool], Union[Period, Timestamp,
# Timedelta, Any]]], Index]", variable has type "Index") [assignment]
frame.columns, frame = self._do_date_conversions( # type: ignore[assignment]
Member:

Not related to typing: can _do_date_conversions return a new DataFrame object? If yes, this line should maybe be rewritten as

columns, frame = self._do_date_conversions(...)
frame.columns = columns

otherwise, Python will set columns on the old frame object and then replace the frame object.
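The evaluation order being described can be demonstrated with a minimal, self-contained sketch (the Frame class and convert function are hypothetical stand-ins, not pandas code):

```python
class Frame:
    """Stand-in for a DataFrame-like object holding a columns attribute."""

    def __init__(self):
        self.columns = None


old, new = Frame(), Frame()


def convert():
    # stand-in for _do_date_conversions returning (columns, new frame)
    return ["a", "b"], new


frame = old
# after the right-hand side is evaluated, targets are assigned left to
# right: columns land on the OLD object, then `frame` is rebound
frame.columns, frame = convert()

assert old.columns == ["a", "b"]
assert new.columns is None
assert frame is new
```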

Member Author:

Yes, but only when self.parse_dates is None, and in that case it returns the input without modification, so this should be fine.

# Conflicts:
#	pandas/io/parsers/base_parser.py
#	pandas/io/parsers/python_parser.py
@phofl (Member, Author) commented Nov 18, 2021

@simonjayhawkins would you mind having a look?

@simonjayhawkins (Member) left a comment

Thanks @phofl

@@ -237,7 +240,9 @@ def _open_handles(
errors=kwds.get("encoding_errors", "strict"),
)

def _validate_parse_dates_presence(self, columns: list[str]) -> None:
def _validate_parse_dates_presence(
self, columns: list[Scalar] | list[tuple[Scalar, ...]]
Member:

just an observation... no action required.

we should generally avoid using mutable types in function signatures to prevent inadvertent mutation and generally avoid invariant collection types altogether.

from https://github.com/python/typeshed/blob/master/CONTRIBUTING.md#conventions

avoid invariant collection types (list, dict) in argument positions, in favor of covariant types like Mapping or Sequence;

I appreciate there is another school of thought within the team that we should be more restrictive and lockdown the types being passed around.

Within the function body we only do an "in" check on columns, so the passed type does not need to be a list and only needs the __contains__ protocol. So using typing.Container (deprecated since py3.9 in favor of collections.abc.Container) would be most permissive.

from https://github.com/microsoft/pyright/blob/main/docs/typed-libraries.md#best-practices-for-inlined-types...

In general, a function input parameter should be annotated with the widest possible type supported by the implementation.
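The "widest possible parameter type" point can be illustrated with a hedged sketch (the function and its signature are hypothetical illustrations, not the actual pandas code): annotating the parameter as Container means any object supporting __contains__ is accepted, list and set alike.

```python
from __future__ import annotations

from collections.abc import Container


def check_parse_dates_presence(parse_dates: list[str], columns: Container[str]) -> None:
    # only membership tests are performed, so Container is wide enough
    missing = [name for name in parse_dates if name not in columns]
    if missing:
        raise ValueError(f"Missing column provided to parse_dates: {missing}")


# both a list and a set satisfy Container[str]
check_parse_dates_presence(["date"], ["date", "value"])
check_parse_dates_presence(["date"], {"date", "value"})
```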

Member Author:

Used Sequence

self,
header,
index_names: list | None,
passed_names: bool = False,
Member:

When a function contains as much logic as this one, I imagine it takes a bit of effort to establish the types and, in the process, work out what the function is actually doing. In these cases, it may be worthwhile improving the docstrings while that knowledge is fresh.

So making the first line a one-liner, adding descriptions (or where they are passed from) for the function parameters and also descriptions of the return types is most welcome.

For the internal functions, we don't need the types for the function parameters and return values in the docstrings as we are adding the type annotations.

Member Author:

This sounds good. Will add descriptions for the more complicated functions in the future

Member Author:

Done, will enhance when fully typing the function

@@ -384,15 +392,17 @@ def extract(r):
return names, index_names, col_names, passed_names

@final
def _maybe_dedup_names(self, names):
def _maybe_dedup_names(
self, names: list[Scalar] | list[tuple[Scalar, ...]]
Member:

If these are index labels, don't we just use Hashable elsewhere to represent those? (Hashable also includes None, which is a valid index label; is that the reason for being more restrictive here?)

Member Author:

I think we can do that, will have a look tomorrow

pandas/io/parsers/base_parser.py (resolved comment thread)
def _check_data_length(
self,
columns: list[Scalar] | list[tuple[Scalar, ...]],
data: list[ArrayLike] | list[np.ndarray],
Member:

this is confusing at first glance since np.ndarray is ArrayLike. Can splitting aliases like this be avoided?

Does ArrayLike as a typevar work? Maybe list[ExtensionArray] | list[np.ndarray] would be clearer? Could a covariant collection type be used instead?

the last one is the recommended one from the mypy error...

note: Consider using "Sequence" instead, which is covariant
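Mypy's note can be seen in a minimal, hedged sketch (the function is hypothetical, purely for illustration): Sequence is covariant in its element type, so a list[int] is accepted where a Sequence of a wider union is expected, whereas a list parameter would trigger the invariance error.

```python
from __future__ import annotations

from collections.abc import Sequence


def first_positive(values: Sequence[int | float]) -> int | float | None:
    # Sequence is covariant, so list[int] and list[float] both type-check
    for value in values:
        if value > 0:
            return value
    return None


ints: list[int] = [-3, 0, 2, 5]
print(first_positive(ints))  # 2
```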

Member Author:

Yeah, Sequence should work here, thanks

Member Author:

Done

"""
try:
# assert for mypy, data is Iterator[str] or None, would error in next
assert self.data is not None
return next(self.data)
line = next(self.data)
assert isinstance(line, list)
Member:

can you add a "# for mypy" inline comment?

Member Author:

Done

@simonjayhawkins (Member) commented:

This looks inconsistent. I think the overloads should be able to distinguish between ArrayLike and np.ndarray inside dicts or lists?

probably because ArrayLike includes np.ndarray

Technically we could probably use Mapping, but we would lose some strictness in this case; the object is always a dict, and we know that if we pass only np.ndarray we will get np.ndarray back.

see my loosely related comments above, wide types for function parameters and narrow types for return types, even though this seems counter-intuitive if passing the return types from one function onto another.

On another topic: Should we use an alias for Scalar | tuple? We need this throughout the code to indicate column names

commented above. why is this not just Hashable?

@phofl (Member, Author) commented Nov 20, 2021

Thanks for the review. Regarding the ArrayLike thing, this comes basically down to this: #44406 (comment)

Since this is not accepted, the overload should allow the redefinition.

@phofl (Member, Author) commented Nov 23, 2021

@simonjayhawkins addressed comments

@jreback jreback added this to the 1.4 milestone Nov 24, 2021
@jreback (Contributor) commented Nov 28, 2021

I think this now conflicts; can you merge master and then this can get in?

@phofl (Member, Author) commented Nov 28, 2021

Yep, most of the relevant stuff should be in now

# Conflicts:
#	pandas/io/parsers/base_parser.py
#	pandas/io/parsers/python_parser.py
@phofl (Member, Author) commented Nov 28, 2021

Merged master and resolved conflicts

@jreback jreback merged commit 3d99081 into pandas-dev:master Nov 28, 2021
@phofl phofl deleted the typ branch November 29, 2021 11:38