Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TYP: Add type hints to pd.read_html #34291

Merged
merged 7 commits into from
Jul 14, 2020
Merged

Conversation

topper-123
Copy link
Contributor

No description provided.

flavor=None,
io: FilePathOrBuffer,
match: Union[str, Pattern] = ".+",
flavor: Optional[str] = None,
header=None,
index_col=None,
skiprows=None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above params are list-likes, which are not expressable in the type system, AFAIK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fine to use something like Optional[Union[int, Sequence[int]]

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good @simonjayhawkins

flavor=None,
io: FilePathOrBuffer,
match: Union[str, Pattern] = ".+",
flavor: Optional[str] = None,
header=None,
index_col=None,
skiprows=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fine to use something like Optional[Union[int, Sequence[int]]

@WillAyd WillAyd added the Typing type annotations, mypy/pyright type checking label May 21, 2020
@topper-123
Copy link
Contributor Author

I think fine to use something like Optional[Union[int, Sequence[int]]

This would make mypy complain about arraylikes, .e.g. ndarray[int] (which doesn''t work anyway), so force people to convert those to e.g. lists to satisfy mypy.

I think I'd actually be ok with that where the sequence is expected to be small (for example in read_html), bu not everywhere (e.g. we wouldn't want to convert arrays to lists in iloc). It should also be done consistently on all user-facing methods whick could be a large project. Maybe keep that separately from this PR?

@WillAyd
Copy link
Member

WillAyd commented May 21, 2020

I think should use the Sequence annotation here. You are correct that there is work to be done from a larger API perspective, but I don't think that changes things here for now

@topper-123 topper-123 force-pushed the read_html branch 2 times, most recently from 28f79d1 to c968d3b Compare May 21, 2020 19:12
@simonjayhawkins
Copy link
Member

deprecate_nonkeyword_arguments is not preserving the signature, so these annotation don't add much value at the moment and mypy can't do much consistency checking.

I think should use the Sequence annotation here.

why not Iterable?

the docs state list-like. we can't make braking change to the api, although I'm not sure exactly what list-like means from an end user perspective.

from the docstring of is_list_like in pandas_libs\lib.pyx

Objects that are considered list-like are for example Python
lists, tuples, sets, NumPy arrays, and Pandas Series.

so IMO we need to include these at least.

This would make mypy complain about arraylikes, .e.g. ndarray[int] (which doesn''t work anyway), so force people to convert those to e.g. lists to satisfy mypy.

so would be Optional[Union[int, Iterable[int], AnyArrayLike] for now

and in-time when we make Series, Index and ExtensionArray Generic classes this would become

Optional[Union[int, Iterable[int], AnyArrayLike[int]]

unless we ensure that AnyArrayLike[int] is covered by Iterable[int]

@@ -964,14 +967,14 @@ def read_html(
default of ``None`` tries to use ``lxml`` to parse and if that fails it
falls back on ``bs4`` + ``html5lib``.

header : int or list-like or None, optional
header : int or sequence of ints, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think we should should change the user facing api. see comment.

@jreback jreback added this to the 1.1 milestone May 25, 2020
@jreback jreback added the IO HTML read_html, to_html, Styler.apply, Styler.applymap label May 25, 2020
@jreback jreback removed this from the 1.1 milestone May 25, 2020
index_col: Union[int, Sequence[int], None] = None,
skiprows: Union[int, Sequence[int], slice, None] = None,
attrs: Optional[Dict[str, str]] = None,
parse_dates: Optional[bool] = False,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this optional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should be just bool. Changed.

@jreback
Copy link
Contributor

jreback commented Jun 9, 2020

can you rebase; @simonjayhawkins if you can review.

@simonjayhawkins
Copy link
Member

I'm happy with the type annotations. Just need discussion on docstring changes. IIUC we use different nomenclature in the docstrings for types.

@topper-123
Copy link
Contributor Author

Rebased and updated. @simonjayhawkins , I don't understand you wrt. the doc string, could you expand?

@topper-123
Copy link
Contributor Author

ping.

@simonjayhawkins
Copy link
Member

I don't understand you wrt. the doc string, could you expand?

numpy has a glossary https://numpy.org/devdocs/glossary.html#term-array-like

for pandas, i'm not exactly sure what list-like is, but we use it a lot in our documentation. This PR is changing it in only one place.

We could switch to using sequence, if it is understood to be equivalent to typing.Sequence, using it everywhere where list-like is used where any sequence is accepted, and having proper testing. Although, I'm sure (but could be wrong) that in many places where we use list-like an iterable is accepted and therefore by changing the docstring, the documented api is changing.

As I said before, the typing annotations are fine. I'm just not sure about the docstring changes. @jorisvandenbossche wdyt? are the docstring changes OK?

@topper-123
Copy link
Contributor Author

The new doc strings mirror the annotations, so are correct, if the annotations are correct?

My understanding of list-like is anything that passes pd.api.types.is_list_like :-)

@simonjayhawkins
Copy link
Member

My understanding of list-like is anything that passes pd.api.types.is_list_like :-)

without a definition for a pandas user, I would use this too as the definition. but there is then a discrepancy with sequence.

>>> pd.core.dtypes.common.is_list_like(i for i in [1, 2, 3])
True
>>>
>>> import collections
>>>
>>> isinstance((i for i in [1, 2, 3]), collections.abc.Sequence)
False
>>> isinstance((i for i in [1, 2, 3]), collections.abc.Iterable)
True

The new doc strings mirror the annotations, so are correct, if the annotations are correct?

no guarantee that the annotations added are consistent with the rest of the codebase (or with the use of read_html in the wild).

(as an aside we have been tripped up a few times with this, the most recent I'm aware of being #34741)

deprecate_nonkeyword_arguments is not preserving the signature (no type annotations on deprecate_nonkeyword_arguments decorator) and read_html is not doing much apart from passing the parameters to _parse (which is not typed)

so these annotation don't add much value at the moment and mypy can't do much consistency checking.

I'm OK with the annotations as they stand though, since mypy is green and as we add more types, mypy will reveal any inconsistencies.

we need to be carefull to keep the addition of type annotations and potential api changes separate.

@jorisvandenbossche
Copy link
Member

Regarding the discussion on "sequence" vs "list-like" in docstrings: personally I think it is more important to see that we consistently use a term in the docstrings, than to have the annotation and docstring match exactly (in the end, the docstring is meant to be user-readable, the annotation is often not). So meaning: if we almost anywhere use the term "list-like" and not "sequence" in the docstrings (I don't know if this is true, just assuming here), then I would use list-like here as well.

But eg in the docstring guidelines, both are not mentioned (https://pandas.pydata.org/docs/dev/development/contributing_docstring.html#parameter-types, it mentions "iterable" and "array-like"). It would be nice to standardize this, make a glossary etc, but that's for a different issue (so for this PR I would just leave the docstring as it was and move on)

@topper-123
Copy link
Contributor Author

topper-123 commented Jul 7, 2020

I'm ok with leaving the doc string as is, but I've made some clarifications on them though. Updated.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @topper-123

thousands: Optional[str] = ",",
encoding: Optional[str] = None,
decimal: str = ".",
converters: Optional[dict] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict -> Dict

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of PEP 585 this is IMO ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there is agreement on #30539 (comment), this would fail CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PEP says:

Importing those from typing is deprecated. Due to PEP 563 and the intention to minimize the runtime impact of typing, this deprecation will not generate DeprecationWarnings. Instead, type checkers may warn about such deprecated usage when the target version of the checked program is signalled to be Python 3.9 or newer. It's recommended to allow for those warnings to be silenced on a project-wide basis.

The deprecated functionality will be removed from the typing module in the first Python version released 5 years after the release of Python 3.9.0.

i.e. importing type hints from typing (including typing.Dict) will be deprecated in 3.9. So does not look like a win to enforce that kind of type hints in pandas IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good reference but at a minimum can only be a future import in Python 3.7, so I think should stick with the current convention at least until we drop 3.6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a difference between the two return signatures in 3.6? I’ve thought they’re equivalent, also in 3.6

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is equivalent to Dict[Any] or also Dict; obviously with dict you can't subscript until the timelines noted by the PEP you linked

We import from typing in about 99% of cases so should at least stick to that convention; if subtypes are possible here they make things even better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've updated the PR.

io: FilePathOrBuffer,
match: Union[str, Pattern] = ".+",
flavor: Optional[str] = None,
header: Union[int, Sequence[int], None] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Union[..., ..., None] -> Optional[Union[..., ...]]

@topper-123
Copy link
Contributor Author

ping.

@jreback jreback merged commit 93a4383 into pandas-dev:master Jul 14, 2020
@jreback
Copy link
Contributor

jreback commented Jul 14, 2020

thanks @topper-123

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO HTML read_html, to_html, Styler.apply, Styler.applymap Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants