Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: simplify info.py #36752

Merged
merged 41 commits into from
Oct 21, 2020
Merged

Conversation

ivanovmg
Copy link
Member

@ivanovmg ivanovmg commented Sep 30, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Enable polymorphism and builder pattern in info.py.
The main benefits:

  • Separated data representation from data itself
  • More clear construction of each element via builder pattern
  • Different builders for various cases (verbose with counts, verbose without counts, non-verbose) enabled one eliminate numerous if statements

May be useful to implement the present PR first and then build Series.info (#31796) on top of this.
I do realize that I deleted BaseInfo class.
Now I see that it is required by the referenced PR, so I can move it back.

Note: needed to change test behavior. Here is the reason why.
I noticed inconsistency:

>>> df = pd.DataFrame({'long long column': np.random.rand(1000000)})
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 1 columns):
 #   Column            Non-Null Count    Dtype
---  ------            --------------    -----
 0   long long column  1000000 non-null  float64
dtypes: float64(1)
memory usage: 7.6 MB

Here we have two spaces between columns.

However, if we create a dataframe with 10000 columns, then the distance between # col and "Column" is only one space.

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.rand(3, 10001))
>>> with open('out.txt', 'w') as buf: df.info(verbose=True, buf=buf)
$ tail out.txt
 9993  9993    float64
 9994  9994    float64
 9995  9995    float64
 9996  9996    float64
 9997  9997    float64
 9998  9998    float64
 9999  9999    float64
 10000 10000   float64
dtypes: float64(10001)
memory usage: 234.5 KB

See, only one space between the first and the second columns.
I find it inconsistent.
So, I changed it, so that there are two spaces everywhere.
What do you think?

I am going to finalize it by documenting and introducing typing where required.

For some reason I got import error when importing Series
under TYPE_CHECKING only.
So, now I import Series normally and delete TYPE_CHECKING
@ivanovmg
Copy link
Member Author

ivanovmg commented Oct 1, 2020

I separated BaseInfo and TableBuilderAbstract to make integration of series info smoother.
It seems that all that is required for the series info is to derive SeriesInfo from BaseInfo and SeriesTableBuilder from TableBuilderAbstract. Moreover, some of the methods from DataFrameTableBuilder can probably be useful for SeriesTableBuilder as well (like column widths detection, for instance). Or al least they can be similar.

Copy link
Member

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this looks much better than what I'd tried!

HEADERS = [
" # ",
"Column",
"Dtype",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove the trailing comma so it fits on one line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that. But IMHO, it is more clear, when it is multilne. Anyway, did as you suggested.

pandas/io/formats/info.py Show resolved Hide resolved
self.memory_usage = self._initialize_memory_usage(memory_usage)

def _initialize_memory_usage(
self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think self is used here, can this be a static method or a module-level function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it a static method.

" # ",
"Column",
"Non-Null Count",
"Dtype",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise can this trailing commas be removed?

@@ -87,7 +101,7 @@ def test_info_verbose():
frame.info(verbose=True, buf=buf)

res = buf.getvalue()
header = " # Column Dtype \n--- ------ ----- "
header = " # Column Dtype \n--- ------ ----- "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why has this changed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this test because of the inconsistency with the column spacing I observed.
See "Note: needed to change test behavior. Here is the reason why." in the PR description.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spacing between the columns will not change in general. This particular test has 10000 columns, which makes the first column wide. As a result, there is an inconsistent spacing between the columns. I forced it to be 2 spaces regardless of the column width as I suppose it is reasonable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See "Note: needed to change test behavior. Here is the reason why." in the PR description.

Thanks, I should've noticed that. I think your explanation and reasoning makes sense, though perhaps it would be better to have a separate PR to fix this, and then keep this one to enable polymorphism?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will create the issue and make a small PR to fix it. But the present PR will basically overwrite this fix anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created PR #36766.

@jreback jreback added Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code labels Oct 1, 2020

def info(self) -> None:
def to_buffer(self, *, buf, max_cols, verbose, null_counts) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you annotate these arguments?

@@ -209,147 +250,355 @@ def info(self) -> None:
--------
%(examples_sub)s
"""
lines = []
printer = InfoPrinter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm at first glance I think this delegation pattern is a little confusing but maybe I am overlooking something. What is the advantage to calling DataFrameInfo.to_buffer from .info instead of calling InfoPrinter.to_buffer instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I call DataFrameInfo.to_buffer from DataFrame because of the following reasons.

  1. Mainly did not want to mess with the docstring (which is defined in DataFrameInfo.to_buffer)
  2. I wanted to avoid import of InfoPrinter into pandas/core/frame.py
  3. Probably it is a good idea to make public API function as disconnected from the internal API as possible. Right now changes inside pandas/io/formats/io.py will have slight effect on DataFrame class, since the info printing functionality is encapsulated inside DataFrameInfo. If, however, we call InfoPrinter.to_buffer from DataFrame.info(), then we would need to ensure that all the possible changes made inside InfoPrinter and DataFrameInfo are properly communicated to DataFrame.info. I hope it makes sense.

But I do not strongly object against having InfoPrinter.to_buffer called from DataFrame.info().

Changes concern internal implementation only.
Public API remains the same.
This is more reasonable name for
mapping {dtype: number of columns with this dtype}.
Previously there was a need to iterate over all columns
multiple times to create strcols.
This commit creates row data by iterating
over the dataframe columns only once.
Then zip(*self.strrows) is used to form strcols.
@ivanovmg ivanovmg requested a review from WillAyd October 3, 2020 15:47
self.data = data
self.memory_usage = self._initialize_memory_usage(memory_usage)

@staticmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use staticmethods can be a class method or an instance method or a module level function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to the module level. My concern here is that this method is necessary only for the instantiation/validation of self.memory_usage, so would not it be more reasonable to keep it close to the actual usage position... Anyway.

for system-level memory consumption, and include it in the returned
values.
@property
def mem_usage(self) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would rather not abbeviate this method (so maybe have to use _memory_usage for the storage

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to memory_usage_bytes.

counts = dtypes.value_counts().groupby(lambda x: x.name).sum()
collected_dtypes = [f"{k[0]}({k[1]:d})" for k in sorted(counts.items())]
lines.append(f"dtypes: {', '.join(collected_dtypes)}")
def _select_verbose_table_builder(self) -> Type["DataFrameTableBuilderVerbose"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you fold this into _select_table_builder, there is too much indirection here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did that.

"""Product in a form of list of lines (strings)."""


class DataFrameTableBuilder(TableBuilderAbstract):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you reduce the builder abstraction a bit. this is too java-esque. you can just pass parameterize / use partial functions. I find this style too different from the rest of pandas. let's make this simpler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you requested, I dropped DataFrameTableBuilderVerboseNoCounts and DataFrameTableBuilderVerboseWithCounts.
Instead, I parametrized DataFrameTableBuilderVerbose (with_counts = True/False). I personally prefer the original implementation, but this one is also OK. My concerns with the new implementation are rather minor. Two if statements checking for with_counts in two separate functions (fragile), and presence of _gen_non_null_counts, which is required for the table with counts and is irrelevant for table without counts.

I leave TableBuilderAbstract as @MarcoGorelli is working on series info, thus it will be reasonable for the builders to have the same interface.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding Series.info - if you want to work on it I'm happy for you to take it over (else I'll go back to it once this PR is in)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MarcoGorelli, I can work on Series.info from the code point of view. Since you already have documentation and tests, it seems to be quite manageable for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, I'll close that PR then! If for whatever reason you change your mind, please ping me and I'll take it up again

Copy link
Member Author

@ivanovmg ivanovmg Oct 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. I will start working on series info once this PR is merged.

The logic with the selection of verbose builder was moved
to function _select_table_builder.
This makes the logic quite more complicated though.
Previously there were two separate classes for verbose table builder.
 - DataFrameTableBuilderVerboseWithCounts
 - DataFrameTableBuilderVerboseNoCounts

The review considered it as a complex pattern.
The present commit leaves DataFrameTableBuilderVerbose,
makes is parametrized (with_counts = True/False).
@ivanovmg ivanovmg requested a review from jreback October 7, 2020 13:23
@ivanovmg
Copy link
Member Author

ivanovmg commented Oct 20, 2020

Merged master.
By the way, looks like the recent bug (#37245) would not have emerged in this implementation.

@ivanovmg
Copy link
Member Author

CI checks failures related to datetime. Looks like these are common across multiple PRs now.

@ivanovmg
Copy link
Member Author

@jreback ping

@jreback jreback added this to the 1.2 milestone Oct 20, 2020
@jreback
Copy link
Contributor

jreback commented Oct 20, 2020

Merged master.
By the way, looks like the recent bug (#37245) would not have emerged in this implementation.

how is that?

@jreback
Copy link
Contributor

jreback commented Oct 20, 2020

@MarcoGorelli if you have any further comments.

I am ok with this as is. @ivanovmg really prefer to have much smaller / incremental PRs. Sure i understand once in a while its advantageous to do it all at once. But review will inevitable take way longer.

@ivanovmg
Copy link
Member Author

Merged master.
By the way, looks like the recent bug (#37245) would not have emerged in this implementation.

how is that?

Probably because in this PR we iterate over items, rather than access by index (in master).

@MarcoGorelli
Copy link
Member

@MarcoGorelli if you have any further comments.

I can review during the weekend, but there's no need to wait for me if you'd like this in sooner

@ivanovmg
Copy link
Member Author

@MarcoGorelli if you have any further comments.

I can review during the weekend, but there's no need to wait for me if you'd like this in sooner

I could be ready with series.info by the end of this week.
This PR is holding me.
I guess, we can figure out the remaining concerns (if any) in that one (with series info).

@jreback jreback merged commit 8f472ae into pandas-dev:master Oct 21, 2020
@jreback
Copy link
Contributor

jreback commented Oct 21, 2020

thanks @ivanovmg

@ivanovmg ivanovmg mentioned this pull request Oct 21, 2020
5 tasks
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Oct 26, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants