REF: dataframe formatters/outputs #36510

ivanovmg · 2020-09-20T21:09:09Z

Partially addresses #36407

This is a continuation of #36434.

Separated StringFormatter (or better ConsoleFormatter?) from DataFrameFormatter. Placed it into the new module (subject to discussion).
Used composition of DataFrameFormatter in HTMLFormatter, LatexFormatter, CSVFormatter and StringFormatter. It turned out that the inheritance of each of the formatters from the base DataFrameFormatter was too complicated to comprehend. The composition seems to suit here better.
Created new class DataFrameRenderer to keep methods for outputs in each of the formats.

This is not the ultimate refactoring, but just one more step.
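For readers who want the composition idea in concrete form, here is a minimal sketch. It is not the code from this PR: `TableData`, `StringOutput`, `CsvOutput` and `get_result` are hypothetical names that only illustrate output classes holding a shared formatter object instead of inheriting from it.

```python
from typing import List


class TableData:
    """Hypothetical stand-in for DataFrameFormatter: shared formatting state."""

    def __init__(self, rows: List[List[str]]) -> None:
        self.rows = rows

    def get_strcols(self) -> List[List[str]]:
        # Transpose rows into columns of already-formatted strings.
        return [list(col) for col in zip(*self.rows)]


class StringOutput:
    """Hypothetical console output: owns a formatter, does not inherit from it."""

    def __init__(self, fmt: TableData) -> None:
        self.fmt = fmt

    def get_result(self) -> str:
        cols = self.fmt.get_strcols()
        # Zip the columns back into rows and join the cells with a space.
        return "\n".join(" ".join(cells) for cells in zip(*cols))


class CsvOutput:
    """Hypothetical CSV output built on the same formatter object."""

    def __init__(self, fmt: TableData, sep: str = ",") -> None:
        self.fmt = fmt
        self.sep = sep

    def get_result(self) -> str:
        return "\n".join(self.sep.join(row) for row in self.fmt.rows)


fmt = TableData([["a", "b"], ["c", "d"]])
print(StringOutput(fmt).get_result())
print(CsvOutput(fmt).get_result())
```

Each output class only sees the parts of the formatter it asks for, so attributes that matter to one format do not leak into the others.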


jreback

a few comments, looks good. nice reorg and cleanup. it has been a long time / never since this was done; and we have lots of built up code here.

jreback · 2020-09-22T01:56:38Z

pandas/io/formats/csvs.py

-        self.decimal = decimal
-        self.header = header
-        self.index = index
+        self.na_rep = self.fmt.na_rep


if these are just readable now (which i think they are in this class as they are now set in DataFrameRenderer), then maybe make these properties? (or is my assumption wrong)

Refactored to properties.
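A rough sketch of what "refactored to properties" can look like, assuming the options live on a shared formatter object. `FormatOptions` and `CsvWriter` are hypothetical names for illustration, not the classes in pandas/io/formats/csvs.py.

```python
class FormatOptions:
    """Hypothetical stand-in for the shared DataFrameFormatter instance."""

    def __init__(
        self,
        na_rep: str = "",
        decimal: str = ".",
        header: bool = True,
        index: bool = True,
    ) -> None:
        self.na_rep = na_rep
        self.decimal = decimal
        self.header = header
        self.index = index


class CsvWriter:
    """Hypothetical CSV writer that reads its options from ``fmt``."""

    def __init__(self, fmt: FormatOptions) -> None:
        self.fmt = fmt

    # Read-only views: the values are owned by ``fmt``, so no copies are
    # stored here and nothing can drift out of sync.
    @property
    def na_rep(self) -> str:
        return self.fmt.na_rep

    @property
    def decimal(self) -> str:
        return self.fmt.decimal

    @property
    def header(self) -> bool:
        return self.fmt.header

    @property
    def index(self) -> bool:
        return self.fmt.index
```

Because no setter is defined, assigning to `writer.na_rep` raises `AttributeError`, which makes the read-only intent explicit.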

jreback · 2020-09-22T01:57:29Z

pandas/io/formats/csvs.py

+        # Incompatible types in assignment
+        # (expression has type "Index",
+        # variable has type "Optional[Sequence[Optional[Hashable]]]")  [assignment]
+        cols = self.obj.columns  # type: ignore[assignment]


you are getting this because you are overwriting cols. don't do that.
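One way to read this advice, as a sketch: bind the resolved value to a new name with its own type instead of reassigning the parameter. `resolve_columns` and `Label` are hypothetical; they are not taken from the pandas source.

```python
from typing import Hashable, List, Optional, Sequence

Label = Optional[Hashable]


def resolve_columns(
    cols: Optional[Sequence[Label]], all_columns: Sequence[Label]
) -> List[Label]:
    """Return the columns to write without rebinding the ``cols`` parameter."""
    if cols is None:
        # A new name with its own declared type, so the checker never sees
        # one variable holding two incompatible types and no
        # ``# type: ignore[assignment]`` is needed.
        selected: List[Label] = list(all_columns)
    else:
        selected = list(cols)
    return selected
```

The parameter keeps its declared `Optional[...]` type for the whole function body, and the concrete list lives under a separate name.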

jreback · 2020-09-22T02:00:53Z

pandas/io/formats/format.py

+
+        return None
+
+    def _get_result(


these can probably be non-private. This entire class is private, so if these are called by other modules (from pandas) then we want to make them public. If something is entirely private to this class then this is ok.

It turns out that _get_result and _get_buffer should not belong to the class: they do not use the object state. So I decided to move the methods to the module level.
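A minimal sketch of helpers that use no object state and therefore fit at module level. The names `get_buffer` and `write_result` and their signatures are assumptions for illustration; only the two error messages are taken from this PR (the diff above and the test quoted later).

```python
import os
from contextlib import contextmanager
from io import StringIO
from typing import IO, Iterator, Optional, Union

Buffer = Union[str, "os.PathLike[str]", IO[str]]


@contextmanager
def get_buffer(
    buf: Optional[Buffer], encoding: Optional[str] = None
) -> Iterator[IO[str]]:
    """Yield a writable text handle for ``buf`` (hypothetical helper)."""
    if buf is None:
        yield StringIO()
    elif isinstance(buf, (str, os.PathLike)):
        # A path: open the file here so it is closed when the block exits.
        with open(buf, "w", encoding=encoding or "utf-8", newline="") as f:
            yield f
    elif hasattr(buf, "write"):
        if encoding is not None:
            raise ValueError("buf is not a file name and encoding is specified.")
        yield buf
    else:
        raise TypeError("buf is not a file name and it has no write method")


def write_result(
    text: str, buf: Optional[Buffer] = None, encoding: Optional[str] = None
) -> Optional[str]:
    """Write ``text`` to ``buf``; return the text only when ``buf`` is None."""
    with get_buffer(buf, encoding=encoding) as handle:
        handle.write(text)
        if buf is None:
            # get_buffer created the StringIO, so hand its contents back.
            assert isinstance(handle, StringIO)
            return handle.getvalue()
    return None
```

Neither function takes `self`, so they can be tested and reused without constructing a formatter.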

jreback · 2020-09-22T02:01:47Z

pandas/io/formats/format.py

+        else:
+            raise TypeError("buf is not a file name and it has no write method")
+
+
 # ----------------------------------------------------------------------


would not be averse to splitting this file up if possible (e.g. maybe make array.py) for all of the formatters.

certainly better as a follow-on

jreback · 2020-09-22T02:04:28Z

pandas/io/formats/format.py

@@ -1198,6 +899,209 @@ def _get_column_name_list(self) -> List[str]:
        return names


+class DataFrameRenderer:


can you add a doc-string here and explain that these to_* are being called in core/frame/to_* ultimately.

simonjayhawkins · 2020-09-23T09:49:19Z

pandas/io/formats/string.py

+        lwidth = self.line_width
+        adjoin_width = 1
+        if self.fmt.index:
+            idx = strcols.pop(0)


AFAICT the original _join_multiline didn't mutate the input.

often better to type inputs as Iterable instead of List for non-mutating functions

Oh, you are right! It did not mutate strcols, but now it does. I will fix it.

What I don't like about Iterable type is the following:

String is Iterable, so it can be confused with List/Tuple.

Cannot find len(iterable) - mypy throws
Argument 1 to "len" has incompatible type "Iterable[str]"; expected "Sized" [arg-type]

So, now I changed the type of input parameter in _join_multiline to Iterable[List[str]].
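The mutation issue can be illustrated with a small sketch, assuming the goal is to separate the index column from the data columns. `split_index` is a hypothetical helper, not the actual `_join_multiline`.

```python
from typing import Iterable, List, Tuple


def split_index(
    strcols: Iterable[List[str]], has_index: bool
) -> Tuple[List[str], List[List[str]]]:
    """Separate the index column from the data columns without touching the input."""
    # Shallow copy: popping below only affects ``cols``, never the caller's list.
    # Materializing the iterable also makes ``len(cols)`` available.
    cols = list(strcols)
    idx = cols.pop(0) if has_index and cols else []
    return idx, cols


original = [["i0", "i1"], ["a", "b"]]
idx, data = split_index(original, has_index=True)
print(idx, data, len(original))
```

The caller's list still has both columns afterwards, which matches the non-mutating behaviour of the original method.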

> String is Iterable, so it can be confused with List/Tuple.

yeah that can be a problem, but we normally add type parameters, so only Iterable[str] becomes an issue

> Cannot find len(iterable) - mypy throws

can always use Sequence if needed as Sequence is also not mutable.

There are some guidelines for stubs at https://github.com/python/typeshed/blob/master/CONTRIBUTING.md#stub-file-coding-style

> avoid invariant collection types (List, Dict) in argument positions, in favor of covariant types like Mapping or Sequence

at some point, the guidelines there relevant to the codebase and not just stubs should probably be added to our code style docs #33851
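A small sketch of why the guideline prefers Sequence in argument positions. `widest` is a hypothetical function; the point is only that a `Sequence[object]` parameter accepts a `List[int]` or a tuple and supports `len()`, whereas a `List[object]` parameter would be rejected by mypy for a `List[int]` argument because List is invariant.

```python
from typing import List, Sequence


def widest(cells: Sequence[object]) -> int:
    """Width of the longest string representation; never mutates ``cells``."""
    if len(cells) == 0:  # Sequence is Sized, so len() type-checks
        return 0
    return max(len(str(cell)) for cell in cells)


ints: List[int] = [1, 22, 333]
print(widest(ints))         # List[int] is accepted as Sequence[object]
print(widest(("a", "bb")))  # tuples are Sequences too
```

Sequence also exposes no mutating methods, so the signature documents that the function will not modify its argument.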

jreback

lgtm. @jbrockmendel @simonjayhawkins if any comments.

simonjayhawkins · 2020-09-24T07:51:50Z

don't wait for me on this. prioritising reviewing 1.1.x issues and PRs

ivanovmg · 2020-10-20T12:26:00Z

Internal error in Windows py38_np18 check.
Some weird error in travis. Looks like internal travis error with closing buffer. Locally this test runs fine.

```
____________ test_filepath_or_buffer_arg[pathlike-foo-abc-to_latex] ____________
[gw0] linux -- Python 3.7.8 /home/travis/miniconda3/envs/pandas-dev/bin/python

method = 'to_latex'
filepath_or_buffer = PosixPath('/tmp/pytest-of-travis/pytest-0/popen-gw0/test_filepath_or_buffer_arg_pa11/foo')
assert_filepath_or_buffer_equals = <function assert_filepath_or_buffer_equals.<locals>._assert_filepath_or_buffer_equals at 0x7f368ef02170>
encoding = 'foo', data = 'abc', filepath_or_buffer_id = 'pathlike'

@pytest.mark.parametrize("method", ["to_string", "to_html", "to_latex"])
@pytest.mark.parametrize(
    "encoding, data",
    [(None, "abc"), ("utf-8", "abc"), ("gbk", "造成输出中文显示乱码"), ("foo", "abc")],
)
def test_filepath_or_buffer_arg(
    method,
    filepath_or_buffer,
    assert_filepath_or_buffer_equals,
    encoding,
    data,
    filepath_or_buffer_id,
):
    df = DataFrame([data])

    if filepath_or_buffer_id not in ["string", "pathlike"] and encoding is not None:
        with pytest.raises(
            ValueError, match="buf is not a file name and encoding is specified."
        ):
            getattr(df, method)(buf=filepath_or_buffer, encoding=encoding)
    elif encoding == "foo":
        with tm.assert_produces_warning(None):
            with pytest.raises(LookupError, match="unknown encoding"):
                getattr(df, method)(buf=filepath_or_buffer, encoding=encoding)

pandas/tests/io/formats/test_format.py:3389:

self = <contextlib._GeneratorContextManager object at 0x7f368ef0fcd0>
type = None, value = None, traceback = None

def __exit__(self, type, value, traceback):
    if type is None:
        try:
            next(self.gen)

E AssertionError: Caused unexpected warning(s): [('ResourceWarning', ResourceWarning("unclosed <ssl.SSLSocket fd=13, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('10.20.0.43', 49154), raddr=('52.217.93.220', 443)>"), '/home/travis/build/pandas-dev/pandas/pandas/core/generic.py', 5404)]
```

simonjayhawkins

Thanks @ivanovmg generally lgtm.

The removal of the setters in csvs is a great improvement, for readability and static typing.

I'm not sure about removing the TableFormatter class though. This provided a common base for LatexFormatter, HTMLFormatter and what is now StringFormatter, for code that does not belong in DataFrameFormatter.

I suggest that future refactors are more atomic. I see several themes here that probably should have been discussed/reviewed in isolation. The primary one being the adoption of DataFrameFormatter for use in csv formatting. I'm also not sure that DataFrameRenderer class is even necessary.

simonjayhawkins · 2020-10-20T11:24:43Z

pandas/io/formats/csvs.py

@@ -29,19 +29,16 @@
 from pandas.core.indexes.api import Index

 from pandas.io.common import get_filepath_or_buffer, get_handle
+from pandas.io.formats.format import DataFrameFormatter, FloatFormatType


if the imports are just for type checking, would prefer in a if TYPE_CHECKING: block.

Saying that, FloatFormatType is an alias so may want to move to pandas._typing

I implemented as you suggested.
General question: is it OK to have imports from pandas._typing on top, not inside if TYPE_CHECKING block?

yep. that's how we do it elsewhere. all imports in pandas._typing are inside a TYPE_CHECKING block so should be safe.
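For reference, a minimal sketch of the `if TYPE_CHECKING:` pattern being discussed. The standard-library `Decimal` stands in for a pandas-internal name such as DataFrameFormatter, and `describe` is a hypothetical function.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, which
    # avoids import cycles and import cost for annotation-only names.
    from decimal import Decimal


def describe(value: "Optional[Decimal]") -> str:
    """The annotation is a string, so ``Decimal`` is not needed at runtime."""
    return "missing" if value is None else f"value={value}"


print(describe(None))
```

The annotation must be quoted (or the module must use `from __future__ import annotations`), because the imported name does not exist when the module actually runs.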

simonjayhawkins · 2020-10-20T12:44:30Z

> Internal error in Windows py38_np18 check.
> Some weird error in travis. Looks like internal travis error with closing buffer. Locally this test runs fine.

I've restarted the jobs. probably unrelated.

ivanovmg · 2020-10-20T12:55:28Z

> Thanks @ivanovmg generally lgtm.
>
> The removal of the setters in csvs is a great improvement, for readability and static typing.
>
> I'm not sure about removing the TableFormatter class though. This provided a common base for LatexFormatter, HTMLFormatter and what is now StringFormatter, for code that does not belong in DataFrameFormatter.
>
> I suggest that future refactors are more atomic. I see several themes here that probably should have been discussed/reviewed in isolation. The primary one being the adoption of DataFrameFormatter for use in csv formatting. I'm also not sure that DataFrameRenderer class is even necessary.

The main target here was to extract StringFormatter from DataFrameFormatter.
And I am glad I figured out how to separate it.

Regarding DataFrameRenderer, we can probably think about it later. My idea was to keep it all in one place for all formats (currently not all).

My concern with TableFormatter was that some properties (attributes) are not related to some of its subclasses. For example, is_truncated has nothing to do with LatexFormatter. The same goes for formatters and presumably should_show_dimensions.

The suggestion on atomic PRs is very well taken :) But it was really difficult to keep it short when extracting the StringFormatter functionality. Sorry for that.

simonjayhawkins · 2020-10-20T13:12:17Z

> The main target here was to extract StringFormatter from DataFrameFormatter.
> And I am glad I figured out how to separate it.

agreed this is an important step for further refactoring. nice work.

> Regarding DataFrameRenderer, we can probably think about it later. My idea was to keep it all in one place for all formats (currently not all).

yep. follow-ons ok.

> My concern with TableFormatter was that some properties (attributes) are not related to some of its subclasses. For example, is_truncated has nothing to do with LatexFormatter. The same goes for formatters and presumably should_show_dimensions.

It was a bit difficult to get one's head around inheritance and composition being used together, and there was no clear separation. I think I anticipated that, over time, the shared functionality in DataFrameFormatter would move to TableFormatter, DataFrameFormatter would evolve to be what is now StringFormatter, and TableFormatter would evolve to be what is now DataFrameFormatter. So the latex, html and string formatters would all have inherited from a base class. But as you have noted, composition is also an option.

ivanovmg · 2020-10-20T14:19:44Z

> > The main target here was to extract StringFormatter from DataFrameFormatter.
> > And I am glad I figured out how to separate it.
>
> agreed this is an important step for further refactoring. nice work.
>
> > Regarding DataFrameRenderer, we can probably think about it later. My idea was to keep it all in one place for all formats (currently not all).
>
> yep. follow-ons ok.
>
> > My concern with TableFormatter was that some properties (attributes) are not related to some of its subclasses. For example, is_truncated has nothing to do with LatexFormatter. The same goes for formatters and presumably should_show_dimensions.
>
> It was a bit difficult to get one's head around inheritance and composition being used together, and there was no clear separation. I think I anticipated that, over time, the shared functionality in DataFrameFormatter would move to TableFormatter, DataFrameFormatter would evolve to be what is now StringFormatter, and TableFormatter would evolve to be what is now DataFrameFormatter. So the latex, html and string formatters would all have inherited from a base class. But as you have noted, composition is also an option.

In fact, my first idea was to use inheritance. But then I realized that the subclasses would share only some things with the parent TableFormatter, so I concluded that composition would be the better option.
Now that the formatters are more or less separated, we can take a better look at what they have in common and probably improve the architecture (in a separate PR, once this is merged).

ivanovmg · 2020-10-20T14:20:19Z

There are now failures related to parquet and datetime. Looks like these are persistent across multiple PRs today.

simonjayhawkins · 2020-10-20T14:35:37Z

> Now that the formatters are more or less separated, we can take a better look at what they have in common and probably improve the architecture (in a separate PR, once this is merged).

sgtm

jreback · 2020-10-20T23:19:04Z

thanks @ivanovmg

ivanovmg added 15 commits September 20, 2020 19:03

REF: drop TableFormatter

c3568a2

Move all methods to DataFrameFormatter, inherit relevant classes from DataFrameFormatter.

REF: extract ConsoleFormatter

837858f

CLN: remove ConsoleFormatter to_string method

08e899f

Replace it with get_result method, which is going to become abstract method for the parent class.

REF: move table_id & render_links to HTMLFormatter

602c984

REF: move _join_multiline to ConsoleFormatter

bd5cb87

REF: separate dataframe formatting from rendering

6e8d4d8

REF: extract _empty_info_line property in latex

cbd3c76

DOC: docstrings for DataFrame & String Formatters

5c30924

REF: make _get_buffer private

af8fe98

REF: pass DataFrameFormatter to CSVFormatter

6e9fb3c

REF: create to_csv in DataFrameRenderer

878eed2

LINT: imports and line breaks

d87638b

REF: move StringFormatter to separate module

1292be5

New module suggested: pandas/io/formats/string.py

TYP: handle mypy errors after enabling composition

41553f6

REF: remove non-existent parent in LatexFormatter

a66ca5e

ivanovmg requested review from jreback, simonjayhawkins and jbrockmendel September 21, 2020 03:11

ivanovmg added 5 commits September 21, 2020 10:49

REF: move line_width to StringFormatter

3fbe4ba

REF: move need_to_wrap to StringFormatter

733fa34

DOC: add docstrings to DataFrame.to_xxx methods

bfb37d7

CLN: to_string on top in HTMLFormatter

f1b494e

LINT: black conv multiline -> oneline

75daa74

jreback requested changes Sep 22, 2020

jreback added the IO CSV, IO HTML, IO LaTeX, Output-Formatting and Refactor labels Sep 22, 2020

jreback added this to the 1.2 milestone Sep 22, 2020

ivanovmg added 6 commits September 23, 2020 14:31

Merge branch 'master' into refactor/fmt

b1018ad

TYP: type properties in CSVFormatter

22d0982

TYP: type strcols

94dbadd

TYP: _join_multiline

914981b

LINT: sort in one line

482ccd1

REF: _join_multiline to accept single arg

7b57fc8

simonjayhawkins reviewed Sep 23, 2020

REF: eliminate mutation in _join_multiline

1e2969f

ivanovmg requested a review from jreback September 23, 2020 10:55

jreback approved these changes Sep 24, 2020

ivanovmg mentioned this pull request Sep 25, 2020

PERF: fix long string representation #36638

Merged

5 tasks

Merge branch 'master' into refactor/fmt

90977fd

ivanovmg requested review from jreback and simonjayhawkins October 17, 2020 14:42

Merge branch 'master' into refactor/fmt

fc7a091

simonjayhawkins approved these changes Oct 20, 2020

ivanovmg added 2 commits October 20, 2020 21:27

TYP: extract imports for typing only in csvs

1335a11

TYP: move FloatFormatType alias to _typing

b67b481

jreback merged commit 196bdcd into pandas-dev:master Oct 20, 2020

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020

REF: dataframe formatters/outputs (pandas-dev#36510)

0d574a5

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

REF: dataframe formatters/outputs (pandas-dev#36510)

aee4d05
