REF: simplify info.py #36752

ivanovmg · 2020-09-30T21:31:23Z

Enable polymorphism and builder pattern in info.py.
The main benefits:

Data representation is separated from the data itself
Clearer construction of each element via the builder pattern
Different builders for the various cases (verbose with counts, verbose without counts, non-verbose) eliminate numerous if statements
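
To make the last point concrete, here is a minimal sketch of the idea, not the code in this PR. The names `TableBuilder`, `NonVerboseBuilder`, `VerboseBuilder` and `select_builder` are hypothetical, and a plain dict of column name to dtype name stands in for a DataFrame.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class TableBuilder(ABC):
    """Hypothetical base class: a builder turns data into output lines."""

    def __init__(self, dtypes: Dict[str, str]) -> None:
        self.dtypes = dtypes  # column name -> dtype name

    @abstractmethod
    def get_lines(self) -> List[str]:
        """Return the table as a list of strings."""


class NonVerboseBuilder(TableBuilder):
    def get_lines(self) -> List[str]:
        names = list(self.dtypes)
        return [f"Columns: {len(names)} entries, {names[0]} to {names[-1]}"]


class VerboseBuilder(TableBuilder):
    def get_lines(self) -> List[str]:
        return [
            f"{i}  {name}  {dtype}"
            for i, (name, dtype) in enumerate(self.dtypes.items())
        ]


def select_builder(dtypes: Dict[str, str], verbose: bool) -> TableBuilder:
    # The only branch: pick the builder once, then call it polymorphically.
    builder_cls = VerboseBuilder if verbose else NonVerboseBuilder
    return builder_cls(dtypes)


dtypes = {"a": "int64", "b": "float64"}
verbose_lines = select_builder(dtypes, verbose=True).get_lines()
short_lines = select_builder(dtypes, verbose=False).get_lines()
print(verbose_lines)
print(short_lines)
```

The caller never checks `verbose` again after the builder is selected; each case lives in its own class.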

It may be useful to merge the present PR first and then build Series.info (#31796) on top of it.
I do realize that I deleted BaseInfo class.
Now I see that it is required by the referenced PR, so I can move it back.

Note: needed to change test behavior. Here is the reason why.
I noticed inconsistency:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'long long column': np.random.rand(1000000)})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column            Non-Null Count    Dtype
---  ------            --------------    -----
 0   long long column  1000000 non-null  float64
dtypes: float64(1)
memory usage: 7.6 MB

Here we have two spaces between columns.

However, if we create a dataframe with more than 10000 columns, so that the # column becomes wider than its header, then the distance between the # column and "Column" is only one space.

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.rand(3, 10001))
>>> with open('out.txt', 'w') as buf: df.info(verbose=True, buf=buf)

$ tail out.txt
 9993  9993    float64
 9994  9994    float64
 9995  9995    float64
 9996  9996    float64
 9997  9997    float64
 9998  9998    float64
 9999  9999    float64
 10000 10000   float64
dtypes: float64(10001)
memory usage: 234.5 KB

See, there is only one space between the first and the second columns.
I find this inconsistent, so I changed it so that there are two spaces everywhere.
What do you think?
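
A minimal sketch of the behaviour I am after, using only the standard library. The helper `render_table` and the constant `GAP` are hypothetical names for illustration, and the left-justification here is a simplification of what info.py actually does.

```python
from typing import List, Sequence

GAP = "  "  # assumed fixed gap: two spaces, independent of column width


def render_table(header: Sequence[str], rows: Sequence[Sequence[str]]) -> List[str]:
    """Pad every column to its widest cell, then join the cells with GAP."""
    table = [list(header)] + [list(row) for row in rows]
    widths = [max(len(cell) for cell in col) for col in zip(*table)]
    return [
        GAP.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]


lines = render_table(
    ["#", "Column", "Dtype"],
    [["9999", "9999", "float64"], ["10000", "10000", "float64"]],
)
print("\n".join(lines))
```

Because the gap is added after padding, the widest cell of the first column is still followed by two spaces, however many digits the line numbers have.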

I am going to finalize it by documenting and introducing typing where required.

For some reason I got an import error when importing Series under TYPE_CHECKING only. So now I import Series normally and have removed TYPE_CHECKING.

ivanovmg · 2020-10-01T06:06:02Z

I separated BaseInfo and TableBuilderAbstract to make integration of series info smoother.
It seems that all that is required for the series info is to derive SeriesInfo from BaseInfo and SeriesTableBuilder from TableBuilderAbstract. Moreover, some of the methods from DataFrameTableBuilder can probably be useful for SeriesTableBuilder as well (column width detection, for instance). Or at least they can be similar.

MarcoGorelli

Nice, this looks much better than what I'd tried!

MarcoGorelli · 2020-10-01T06:12:03Z

pandas/io/formats/info.py

+    HEADERS = [
+        " # ",
+        "Column",
+        "Dtype",


can you remove the trailing comma so it fits on one line?

Did that. But IMHO, it is clearer when it is multiline. Anyway, did as you suggested.

MarcoGorelli · 2020-10-01T06:19:19Z

pandas/io/formats/info.py

+        self.memory_usage = self._initialize_memory_usage(memory_usage)
+
+    def _initialize_memory_usage(
+        self,


I don't think self is used here, can this be a static method or a module-level function?

I made it a static method.

MarcoGorelli · 2020-10-01T06:24:43Z

pandas/io/formats/info.py

+        " # ",
+        "Column",
+        "Non-Null Count",
+        "Dtype",


likewise can these trailing commas be removed?

MarcoGorelli · 2020-10-01T06:25:12Z

pandas/tests/io/formats/test_info.py

@@ -87,7 +101,7 @@ def test_info_verbose():
    frame.info(verbose=True, buf=buf)

    res = buf.getvalue()
-    header = " #    Column  Dtype  \n---   ------  -----  "
+    header = " #     Column  Dtype  \n---    ------  -----  "


why has this changed?

I changed this test because of the inconsistency with the column spacing I observed.
See "Note: needed to change test behavior. Here is the reason why." in the PR description.

The spacing between the columns will not change in general. This particular test has 10000 columns, which makes the first column wide. As a result, there is an inconsistent spacing between the columns. I forced it to be 2 spaces regardless of the column width as I suppose it is reasonable.

Thanks, I should've noticed that. I think your explanation and reasoning makes sense, though perhaps it would be better to have a separate PR to fix this, and then keep this one to enable polymorphism?

OK, I will create the issue and make a small PR to fix it. But the present PR will basically overwrite this fix anyway.

Created PR #36766.

WillAyd · 2020-10-01T17:36:28Z

pandas/io/formats/info.py


-    def info(self) -> None:
+    def to_buffer(self, *, buf, max_cols, verbose, null_counts) -> None:


Can you annotate these arguments?

WillAyd · 2020-10-01T17:40:31Z

pandas/io/formats/info.py

@@ -209,147 +250,355 @@ def info(self) -> None:
        --------
        %(examples_sub)s
        """
-        lines = []
+        printer = InfoPrinter(


Hmm at first glance I think this delegation pattern is a little confusing but maybe I am overlooking something. What is the advantage of calling DataFrameInfo.to_buffer from .info instead of calling InfoPrinter.to_buffer?

I call DataFrameInfo.to_buffer from DataFrame for the following reasons.

Mainly, I did not want to mess with the docstring (which is defined in DataFrameInfo.to_buffer).

I wanted to avoid importing InfoPrinter into pandas/core/frame.py.

It is probably a good idea to keep the public API function as disconnected from the internal API as possible. Right now, changes inside pandas/io/formats/info.py have little effect on the DataFrame class, since the info printing functionality is encapsulated inside DataFrameInfo. If, however, we call InfoPrinter.to_buffer from DataFrame.info(), then we would need to ensure that all changes made inside InfoPrinter and DataFrameInfo are properly communicated to DataFrame.info. I hope that makes sense.

But I do not strongly object to having InfoPrinter.to_buffer called from DataFrame.info().
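
To show the shape of this delegation, here is a minimal sketch with hypothetical stand-in classes (`Frame`, `FrameInfo` and a simplified `InfoPrinter`); it is not the code in this PR, and the output format is made up.

```python
import io
import sys
from typing import IO, List, Optional


class InfoPrinter:
    """Simplified stand-in: formats the lines and writes them to a buffer."""

    def __init__(self, info: "FrameInfo", verbose: bool) -> None:
        self.info = info
        self.verbose = verbose

    def to_buffer(self, buf: Optional[IO[str]] = None) -> None:
        if buf is None:
            buf = sys.stdout
        lines: List[str] = [f"Data columns (total {len(self.info.columns)} columns):"]
        if self.verbose:
            lines.extend(f" {i}  {name}" for i, name in enumerate(self.info.columns))
        buf.write("\n".join(lines) + "\n")


class FrameInfo:
    """Stand-in for the info object: the only class the frame needs to know."""

    def __init__(self, columns: List[str]) -> None:
        self.columns = columns

    def to_buffer(self, *, buf: Optional[IO[str]] = None, verbose: bool = True) -> None:
        # The printer is created here, so the public class never imports it.
        InfoPrinter(self, verbose=verbose).to_buffer(buf)


class Frame:
    """Stand-in for the public class."""

    def __init__(self, columns: List[str]) -> None:
        self.columns = columns

    def info(self, buf: Optional[IO[str]] = None, verbose: bool = True) -> None:
        FrameInfo(self.columns).to_buffer(buf=buf, verbose=verbose)


buf = io.StringIO()
Frame(["a", "b"]).info(buf=buf)
print(buf.getvalue())
```

In this sketch `Frame.info` depends only on `FrameInfo.to_buffer`, so the printer can be renamed or reshaped without touching the public class.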

Changes concern internal implementation only. Public API remains the same.

This is a more reasonable name for the mapping {dtype: number of columns with this dtype}.

Previously there was a need to iterate over all columns multiple times to create strcols. This commit creates row data by iterating over the dataframe columns only once. Then zip(*self.strrows) is used to form strcols.
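
A minimal sketch of the rows-then-transpose idea, assuming a hypothetical generator `gen_rows` and a list of (name, dtype) tuples standing in for the dataframe columns:

```python
from typing import Iterator, Sequence, Tuple


def gen_rows(columns: Sequence[Tuple[str, str]]) -> Iterator[Tuple[str, str, str]]:
    """Yield one (line number, column name, dtype) row per column in a single pass."""
    for i, (name, dtype) in enumerate(columns):
        yield f" {i}", name, dtype


columns = [("a", "int64"), ("bb", "float64"), ("ccc", "object")]
strrows = list(gen_rows(columns))

# Transposing the rows gives one tuple per output column.
strcols = list(zip(*strrows))

# Column widths can then be computed per output column.
widths = [max(len(cell) for cell in col) for col in strcols]
print(strcols)
print(widths)
```

The data is walked once to build the rows; `zip(*strrows)` only regroups strings that already exist.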

jreback · 2020-10-07T02:44:50Z

pandas/io/formats/info.py

+        self.data = data
+        self.memory_usage = self._initialize_memory_usage(memory_usage)
+
+    @staticmethod


don't use staticmethods; this can be a class method, an instance method, or a module-level function

Moved to the module level. My concern here is that this function is needed only for the instantiation/validation of self.memory_usage, so wouldn't it be more reasonable to keep it close to where it is actually used... Anyway.

jreback · 2020-10-07T02:46:09Z

pandas/io/formats/info.py

-            for system-level memory consumption, and include it in the returned
-            values.
+    @property
+    def mem_usage(self) -> int:


would rather not abbreviate this method (so maybe have to use _memory_usage for the storage)

I renamed it to memory_usage_bytes.
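
The point of keeping the raw byte count as its own property is that the human-readable string can be derived from it separately. A minimal sketch of such a conversion, with a hypothetical helper name `format_size` (this is an illustration, not the function used in info.py):

```python
def format_size(num_bytes: float) -> str:
    """Hypothetical helper: render a byte count with a 1024-based unit."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        if size < 1024.0:
            return f"{size:.1f} {unit}"
        size /= 1024.0
    return f"{size:.1f} PB"


# One million float64 values take 8 bytes each.
print(format_size(1_000_000 * 8))
```

For 8,000,000 bytes this gives "7.6 MB", which matches the memory usage line in the PR description.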

jreback · 2020-10-07T02:53:03Z

pandas/io/formats/info.py

-        counts = dtypes.value_counts().groupby(lambda x: x.name).sum()
-        collected_dtypes = [f"{k[0]}({k[1]:d})" for k in sorted(counts.items())]
-        lines.append(f"dtypes: {', '.join(collected_dtypes)}")
+    def _select_verbose_table_builder(self) -> Type["DataFrameTableBuilderVerbose"]:


can you fold this into _select_table_builder, there is too much indirection here.

jreback · 2020-10-07T02:54:35Z

pandas/io/formats/info.py

+        """Product in a form of list of lines (strings)."""
+
+
+class DataFrameTableBuilder(TableBuilderAbstract):


can you reduce the builder abstraction a bit. this is too java-esque. you can just parameterize / use partial functions. I find this style too different from the rest of pandas. let's make this simpler.

As you requested, I dropped DataFrameTableBuilderVerboseNoCounts and DataFrameTableBuilderVerboseWithCounts.
Instead, I parametrized DataFrameTableBuilderVerbose (with_counts = True/False). I personally prefer the original implementation, but this one is also OK. My concerns with the new implementation are rather minor: two if statements checking for with_counts in two separate functions (fragile), and the presence of _gen_non_null_counts, which is required for the table with counts and is irrelevant for the table without counts.
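
A minimal sketch of what the parametrized version looks like in spirit. The class `VerboseBuilder` and the tuple layout are hypothetical; only the header strings are taken from the diff above.

```python
from typing import Iterator, List, Sequence, Tuple

# Hypothetical stand-in for a column: (name, non-null count, dtype name).
Column = Tuple[str, int, str]


class VerboseBuilder:
    """Hypothetical parametrized builder: one class, one flag."""

    def __init__(self, columns: Sequence[Column], with_counts: bool) -> None:
        self.columns = columns
        self.with_counts = with_counts

    @property
    def headers(self) -> List[str]:
        # First check of the flag.
        if self.with_counts:
            return [" # ", "Column", "Non-Null Count", "Dtype"]
        return [" # ", "Column", "Dtype"]

    def gen_rows(self) -> Iterator[List[str]]:
        for i, (name, count, dtype) in enumerate(self.columns):
            # Second check of the flag; it has to stay in sync with `headers`.
            if self.with_counts:
                yield [f" {i}", name, f"{count} non-null", dtype]
            else:
                yield [f" {i}", name, dtype]


columns = [("a", 3, "int64"), ("b", 2, "float64")]
with_counts = VerboseBuilder(columns, with_counts=True)
no_counts = VerboseBuilder(columns, with_counts=False)
print(with_counts.headers, list(with_counts.gen_rows()))
print(no_counts.headers, list(no_counts.gen_rows()))
```

The two flag checks are the fragile part mentioned above: if one branch changes and the other does not, headers and rows no longer line up.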

I am leaving TableBuilderAbstract in place, since @MarcoGorelli is working on series info and it is reasonable for the builders to share the same interface.

Regarding Series.info - if you want to work on it I'm happy for you to take it over (else I'll go back to it once this PR is in)

@MarcoGorelli, I can work on Series.info from the code point of view. Since you already have documentation and tests, it seems to be quite manageable for me.

Cool, I'll close that PR then! If for whatever reason you change your mind, please ping me and I'll take it up again

Good. I will start working on series info once this PR is merged.

The logic for selecting the verbose builder was moved to the function _select_table_builder. This makes the logic quite a bit more complicated though.

Previously there were two separate classes for the verbose table builder: DataFrameTableBuilderVerboseWithCounts and DataFrameTableBuilderVerboseNoCounts. The review considered this too complex a pattern. The present commit keeps a single DataFrameTableBuilderVerbose and makes it parametrized (with_counts = True/False).

ivanovmg · 2020-10-20T12:37:08Z

Merged master.
By the way, looks like the recent bug (#37245) would not have emerged in this implementation.

ivanovmg · 2020-10-20T14:11:42Z

The CI check failures are related to datetime. Looks like these are common across multiple PRs now.

ivanovmg · 2020-10-20T16:36:58Z

@jreback ping

jreback · 2020-10-20T23:12:59Z

> Merged master.
> By the way, looks like the recent bug (#37245) would not have emerged in this implementation.

how is that?

jreback · 2020-10-20T23:14:09Z

@MarcoGorelli if you have any further comments.

I am ok with this as is. @ivanovmg, I'd really prefer to have much smaller / incremental PRs. Sure, I understand that once in a while it's advantageous to do it all at once, but review will inevitably take way longer.

ivanovmg · 2020-10-21T03:50:38Z

> Merged master.
> By the way, looks like the recent bug (#37245) would not have emerged in this implementation.
>
> how is that?

Probably because in this PR we iterate over items, rather than access by index (in master).
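
The details of #37245 are not spelled out in this thread, so the following is only a generic illustration of how index-based access can go wrong where iterating over items cannot. The data is made up: integer column labels that are not in positional order, with a plain dict standing in for the dtypes.

```python
# Integer column labels that do not coincide with their positions.
labels = [1, 0]
dtypes = {1: "int64", 0: "object"}  # label -> dtype name

# Access by index: the loop counter is silently treated as a label.
by_index = [(label, dtypes[i]) for i, label in enumerate(labels)]

# Iterating over items: label and dtype always travel together.
by_items = list(dtypes.items())

print(by_index)
print(by_items)
```

With access by index, label 1 is paired with the dtype of label 0 and vice versa; iterating over items keeps each pair intact.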

MarcoGorelli · 2020-10-21T07:30:38Z

> @MarcoGorelli if you have any further comments.

I can review during the weekend, but there's no need to wait for me if you'd like this in sooner

ivanovmg · 2020-10-21T09:17:17Z

> @MarcoGorelli if you have any further comments.
>
> I can review during the weekend, but there's no need to wait for me if you'd like this in sooner

I could be ready with series.info by the end of this week.
This PR is holding me up.
I guess, we can figure out the remaining concerns (if any) in that one (with series info).

jreback · 2020-10-21T12:55:00Z

thanks @ivanovmg

ivanovmg added 2 commits October 1, 2020 04:16

REF: polymorphism and builder pattern in info

4a1a2e7

TST: adjust expected gap between # and Column

a3ccb83

ivanovmg requested review from MarcoGorelli and jreback September 30, 2020 21:31

LINT: remove TYPE_CHECKING import

a5e1136

For some reason I got an import error when importing Series under TYPE_CHECKING only. So now I import Series normally and have removed TYPE_CHECKING.

ivanovmg requested a review from WillAyd September 30, 2020 21:50

ivanovmg added 4 commits October 1, 2020 10:55

TST: add test for empty dataframe info

401d7d8

FIX: handle empty dataframe case

6de3eb7

REF: extract abstract cls BaseInfo and builder

82f9ddc

DOC: fix docstrings

75d65fb

MarcoGorelli requested changes Oct 1, 2020

ivanovmg added 5 commits October 1, 2020 13:34

REF: de-duplicate line_numbers, columns, dtypes

25d6ac8

CLN: make HEADERS one-liners

7a861dc

REF: make _initialize_memory_usage staticmethod

8640fb9

DOC/TYP: add docstrings and type annotations

cd676b0

REF: create memory usage string in BaseInfo

5ce8a72

jreback added the Output-Formatting (__repr__ of pandas objects, to_string) and Refactor (Internal refactoring of code) labels Oct 1, 2020

WillAyd requested changes Oct 1, 2020

ivanovmg added 10 commits October 2, 2020 14:48

TYP: annotate DataFrameInfo.to_buffer

e0a2035

REF: rename kwarg null_counts -> show_counts

3010d6c

Changes concern internal implementation only. Public API remains the same.

CLN: reuse exceeds_info_cols in show_counts

238f091

REF: extract property exceeds_info_rows

ffeff49

DOC: add docstrings in InfoPrinter

f160074

CLN: rename counts -> dtype_counts

c368a94

This is a more reasonable name for the mapping {dtype: number of columns with this dtype}.

REF: extract property non_null_counts

43288e1

REF: extract method _get_non_null_counts

6cb600a

REF: generate rows for performance improvement

01bedfb

Previously there was a need to iterate over all columns multiple times to create strcols. This commit creates row data by iterating over the dataframe columns only once. Then zip(*self.strrows) is used to form strcols.

REF: remove COL_SPACE class attribute

5bd4ce1

ivanovmg added 2 commits October 2, 2020 20:24

DOC: add docstrings to abstract properties

36d9171

Merge branch 'master' into refactor/info-polymorphism

a0eb9ac

ivanovmg requested a review from WillAyd October 3, 2020 15:47

jreback requested changes Oct 7, 2020

ivanovmg added 5 commits October 7, 2020 13:42

REF: remove method for verbose builder selection

fb5c0e4

The logic for selecting the verbose builder was moved to the function _select_table_builder. This makes the logic quite a bit more complicated though.

REF: move static method to module level function

1f0c410

CLN: rename mem_usage -> memory_usage_bytes

171d028

DOC: add docstrings to iterator funcs

c4c002e

ivanovmg requested a review from jreback October 7, 2020 13:23

ivanovmg added 5 commits October 7, 2020 22:11

Merge branch 'master' into refactor/info-polymorphism

3e5ecda

PERF: property gross_column_widths to attribute

92c7754

TYP: attribute gross_column_widths

5fe1cd7

Merge branch 'master' into refactor/info-polymorphism

fbd7fda

Merge branch 'master' into refactor/info-polymorphism

70c6e22

jreback added this to the 1.2 milestone Oct 20, 2020

jreback approved these changes Oct 20, 2020

jreback merged commit 8f472ae into pandas-dev:master Oct 21, 2020

ivanovmg mentioned this pull request Oct 21, 2020

ENH: enable Series.info() #37320

Merged

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020

REF: simplify info.py (pandas-dev#36752)

b33408f

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

REF: simplify info.py (pandas-dev#36752)

9705311

