-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: simplify info.py #36752
REF: simplify info.py #36752
Conversation
For some reason I got import error when importing Series under TYPE_CHECKING only. So, now I import Series normally and delete TYPE_CHECKING
I separated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this looks much better than what I'd tried!
pandas/io/formats/info.py
Outdated
HEADERS = [ | ||
" # ", | ||
"Column", | ||
"Dtype", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove the trailing comma so it fits on one line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did that. But IMHO, it is more clear, when it is multilne. Anyway, did as you suggested.
pandas/io/formats/info.py
Outdated
self.memory_usage = self._initialize_memory_usage(memory_usage) | ||
|
||
def _initialize_memory_usage( | ||
self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think self
is used here, can this be a static method or a module-level function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it a static method.
pandas/io/formats/info.py
Outdated
" # ", | ||
"Column", | ||
"Non-Null Count", | ||
"Dtype", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise can this trailing commas be removed?
pandas/tests/io/formats/test_info.py
Outdated
@@ -87,7 +101,7 @@ def test_info_verbose(): | |||
frame.info(verbose=True, buf=buf) | |||
|
|||
res = buf.getvalue() | |||
header = " # Column Dtype \n--- ------ ----- " | |||
header = " # Column Dtype \n--- ------ ----- " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why has this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this test because of the inconsistency with the column spacing I observed.
See "Note: needed to change test behavior. Here is the reason why." in the PR description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The spacing between the columns will not change in general. This particular test has 10000 columns, which makes the first column wide. As a result, there is an inconsistent spacing between the columns. I forced it to be 2 spaces regardless of the column width as I suppose it is reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See "Note: needed to change test behavior. Here is the reason why." in the PR description.
Thanks, I should've noticed that. I think your explanation and reasoning makes sense, though perhaps it would be better to have a separate PR to fix this, and then keep this one to enable polymorphism?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will create the issue and make a small PR to fix it. But the present PR will basically overwrite this fix anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created PR #36766.
pandas/io/formats/info.py
Outdated
|
||
def info(self) -> None: | ||
def to_buffer(self, *, buf, max_cols, verbose, null_counts) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you annotate these arguments?
@@ -209,147 +250,355 @@ def info(self) -> None: | |||
-------- | |||
%(examples_sub)s | |||
""" | |||
lines = [] | |||
printer = InfoPrinter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm at first glance I think this delegation pattern is a little confusing but maybe I am overlooking something. What is the advantage to calling DataFrameInfo.to_buffer
from .info
instead of calling InfoPrinter.to_buffer
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I call DataFrameInfo.to_buffer
from DataFrame
because of the following reasons.
- Mainly did not want to mess with the docstring (which is defined in
DataFrameInfo.to_buffer
) - I wanted to avoid import of
InfoPrinter
intopandas/core/frame.py
- Probably it is a good idea to make public API function as disconnected from the internal API as possible. Right now changes inside
pandas/io/formats/io.py
will have slight effect onDataFrame
class, since the info printing functionality is encapsulated insideDataFrameInfo
. If, however, we callInfoPrinter.to_buffer
fromDataFrame.info()
, then we would need to ensure that all the possible changes made insideInfoPrinter
andDataFrameInfo
are properly communicated toDataFrame.info
. I hope it makes sense.
But I do not strongly object against having InfoPrinter.to_buffer
called from DataFrame.info()
.
Changes concern internal implementation only. Public API remains the same.
This is more reasonable name for mapping {dtype: number of columns with this dtype}.
Previously there was a need to iterate over all columns multiple times to create strcols. This commit creates row data by iterating over the dataframe columns only once. Then zip(*self.strrows) is used to form strcols.
pandas/io/formats/info.py
Outdated
self.data = data | ||
self.memory_usage = self._initialize_memory_usage(memory_usage) | ||
|
||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use staticmethods can be a class method or an instance method or a module level function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to the module level. My concern here is that this method is necessary only for the instantiation/validation of self.memory_usage
, so would not it be more reasonable to keep it close to the actual usage position... Anyway.
pandas/io/formats/info.py
Outdated
for system-level memory consumption, and include it in the returned | ||
values. | ||
@property | ||
def mem_usage(self) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would rather not abbeviate this method (so maybe have to use _memory_usage for the storage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I renamed it to memory_usage_bytes
.
pandas/io/formats/info.py
Outdated
counts = dtypes.value_counts().groupby(lambda x: x.name).sum() | ||
collected_dtypes = [f"{k[0]}({k[1]:d})" for k in sorted(counts.items())] | ||
lines.append(f"dtypes: {', '.join(collected_dtypes)}") | ||
def _select_verbose_table_builder(self) -> Type["DataFrameTableBuilderVerbose"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you fold this into _select_table_builder, there is too much indirection here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did that.
"""Product in a form of list of lines (strings).""" | ||
|
||
|
||
class DataFrameTableBuilder(TableBuilderAbstract): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you reduce the builder abstraction a bit. this is too java-esque. you can just pass parameterize / use partial functions. I find this style too different from the rest of pandas. let's make this simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you requested, I dropped DataFrameTableBuilderVerboseNoCounts
and DataFrameTableBuilderVerboseWithCounts
.
Instead, I parametrized DataFrameTableBuilderVerbose
(with_counts = True/False). I personally prefer the original implementation, but this one is also OK. My concerns with the new implementation are rather minor. Two if statements checking for with_counts
in two separate functions (fragile), and presence of _gen_non_null_counts
, which is required for the table with counts and is irrelevant for table without counts.
I leave TableBuilderAbstract
as @MarcoGorelli is working on series info, thus it will be reasonable for the builders to have the same interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding Series.info - if you want to work on it I'm happy for you to take it over (else I'll go back to it once this PR is in)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MarcoGorelli, I can work on Series.info from the code point of view. Since you already have documentation and tests, it seems to be quite manageable for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, I'll close that PR then! If for whatever reason you change your mind, please ping me and I'll take it up again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. I will start working on series info once this PR is merged.
The logic with the selection of verbose builder was moved to function _select_table_builder. This makes the logic quite more complicated though.
Previously there were two separate classes for verbose table builder. - DataFrameTableBuilderVerboseWithCounts - DataFrameTableBuilderVerboseNoCounts The review considered it as a complex pattern. The present commit leaves DataFrameTableBuilderVerbose, makes is parametrized (with_counts = True/False).
Merged master. |
CI checks failures related to datetime. Looks like these are common across multiple PRs now. |
@jreback ping |
how is that? |
@MarcoGorelli if you have any further comments. I am ok with this as is. @ivanovmg really prefer to have much smaller / incremental PRs. Sure i understand once in a while its advantageous to do it all at once. But review will inevitable take way longer. |
Probably because in this PR we iterate over items, rather than access by index (in master). |
I can review during the weekend, but there's no need to wait for me if you'd like this in sooner |
I could be ready with series.info by the end of this week. |
thanks @ivanovmg |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Enable polymorphism and builder pattern in
info.py
.The main benefits:
May be useful to implement the present PR first and then build
Series.info
(#31796) on top of this.I do realize that I deleted
BaseInfo
class.Now I see that it is required by the referenced PR, so I can move it back.
Note: needed to change test behavior. Here is the reason why.
I noticed inconsistency:
Here we have two spaces between columns.
However, if we create a dataframe with 10000 columns, then the distance between # col and "Column" is only one space.
See, only one space between the first and the second columns.
I find it inconsistent.
So, I changed it, so that there are two spaces everywhere.
What do you think?
I am going to finalize it by documenting and introducing typing where required.