BUG: Wrong Custom Formatters applied when displaying truncated frames #35410

Open
ipcoder opened this issue Jul 25, 2020 · 4 comments
Labels
Bug Output-Formatting __repr__ of pandas objects, to_string

Comments

@ipcoder

ipcoder commented Jul 25, 2020

Problem description

I am providing custom formatters for specific columns as a dict.
If the frame is wide enough that some columns are truncated, the wrong formatters are applied to the displayed columns.
(In my case this leads to crashes, because the formatter receives data of the wrong type.)

Please note that the behavior changes with the width of the console window, since different columns are displayed.

Problem investigation

I have examined the code of my version of pandas (1.0.5) and compared it with the latest version on GitHub; the bug seems to still be there.

The problem starts in this method (DataFrameFormatter._to_str_columns), where
frame is set to the truncated frame (frame = self.tr_frame) and then self._format_col(i) is called with the index of the column in the TRUNCATED frame:

    def _to_str_columns(self) -> List[List[str]]:
        """
        Render a DataFrame to a list of columns (as lists of strings).
        """
        # this method is not used by to_html where self.col_space
        # could be a string so safe to cast
        self.col_space = cast(int, self.col_space)

        frame = self.tr_frame
        # may include levels names also

        str_index = self._get_formatted_index(frame)

        if not is_list_like(self.header) and not self.header:
            stringified = []
            for i, c in enumerate(frame):
                fmt_values = self._format_col(i)

Then this "truncated" column index is passed to self._get_formatter:

    def _format_col(self, i: int) -> List[str]:
        frame = self.tr_frame
        formatter = self._get_formatter(i)   # the problem is HERE? _get_formatter(frame.columns[i]) ?

which looks the formatter up in the columns of the FULL frame, even though the index i refers to a position in the truncated frame:

        # ...
        else:
            if is_integer(i) and i not in self.columns:
                i = self.columns[i]
            return self.formatters.get(i, None)
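
The mismatch described above can be modelled without pandas. The following is a minimal sketch, not the library's code: the column names, the choice of which columns survive truncation, and both lookup helpers are illustrative assumptions.

```python
from typing import Callable, Dict, Optional

# Hypothetical column layout: six columns, of which the two middle ones
# are dropped from the displayed (truncated) frame.
full_columns = ["Col0", "Col1", "Col2", "Col3", "Col4", "Col5"]
truncated_columns = ["Col0", "Col1", "Col4", "Col5"]

# One formatter per column name; each tags the value with its own column.
formatters: Dict[str, Callable[[int], str]] = {
    name: (lambda x, name=name: f"{name}: {x}") for name in full_columns
}


def lookup_by_full_position(i: int) -> Optional[Callable[[int], str]]:
    # Mirrors the quoted branch: position i is resolved against the FULL columns.
    return formatters.get(full_columns[i])


def lookup_by_truncated_label(i: int) -> Optional[Callable[[int], str]]:
    # Resolves position i against the truncated columns instead.
    return formatters.get(truncated_columns[i])


for i, shown in enumerate(truncated_columns):
    by_position = lookup_by_full_position(i)(0)
    by_label = lookup_by_truncated_label(i)(0)
    print(f"{shown}: full-position lookup -> {by_position!r}, label lookup -> {by_label!r}")
```

In this model the two lookups agree for the columns before the gap and diverge for every column after it, which is the pattern the report describes.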

INSTALLED VERSIONS

commit : None
python : 3.6.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-37-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.5
numpy : 1.18.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200309
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.19.3
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.16.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.2.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.4.4
tabulate : 0.8.3
xarray : 0.15.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.50.1

@ipcoder ipcoder added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 25, 2020
@ipcoder ipcoder changed the title BUG: Wrong Custom Formatters applied in when displaying trancated frames BUG: Wrong Custom Formatters applied when displaying trancated frames Jul 25, 2020
@rhshadrach
Member

rhshadrach commented Jul 26, 2020

Thanks for reporting this - could you provide a minimal reproducible example of the data/code that demonstrates the issue?

@ipcoder
Author

ipcoder commented Aug 15, 2020

import pandas as pd

def form(name):
    return lambda x: f"{name}: {x}"

df = pd.DataFrame({f"Col{x}":range(5) for x in range(6)})
print(df.to_string(formatters=formatters, max_cols=6))
print(df.to_string(formatters={c: form(c) for c in df}, max_cols=4))

produces:

     Col0    Col1    Col2    Col3    Col4    Col5
0 Col0: 0 Col1: 0 Col2: 0 Col3: 0 Col4: 0 Col5: 0
1 Col0: 1 Col1: 1 Col2: 1 Col3: 1 Col4: 1 Col5: 1
2 Col0: 2 Col1: 2 Col2: 2 Col3: 2 Col4: 2 Col5: 2
3 Col0: 3 Col1: 3 Col2: 3 Col3: 3 Col4: 3 Col5: 3
4 Col0: 4 Col1: 4 Col2: 4 Col3: 4 Col4: 4 Col5: 4
     Col0    Col1   ...      Col4    Col5
0 Col0: 0 Col1: 0   ...   Col2: 0 Col3: 0
1 Col0: 1 Col1: 1   ...   Col2: 1 Col3: 1
2 Col0: 2 Col1: 2   ...   Col2: 2 Col3: 2
3 Col0: 3 Col1: 3   ...   Col2: 3 Col3: 3
4 Col0: 4 Col1: 4   ...   Col2: 4 Col3: 4

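As you can see, the second print applies the wrong formatters to the columns after the truncation marker: Col4 and Col5 are rendered with the formatters of Col2 and Col3,
because the formatter is selected by position from the full sequence of columns instead of the truncated one.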

I have patched my version as suggested above:

    def _format_col(self, i: int) -> List[str]:
        frame = self.tr_frame
        formatter = self._get_formatter(frame.columns[i])   # instead of _get_formatter(i)
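
For anyone rerunning the reproduction in this comment, here is a self-contained variant as a sketch. It assumes the undefined `formatters` name was meant to be the same dict used in the second call, and it adds a small hypothetical helper, `mismatches`, that compares each header with the tag in the first data row. Whether the truncated call shows a mismatch depends on the installed pandas version.

```python
import re

import pandas as pd


def form(name):
    # Formatter that tags each cell with the column it was registered for.
    return lambda x: f"{name}: {x}"


df = pd.DataFrame({f"Col{x}": range(5) for x in range(6)})
# Assumption: the dict intended for both calls in the original snippet.
formatters = {c: form(c) for c in df}

full = df.to_string(formatters=formatters, max_cols=6)
truncated = df.to_string(formatters=formatters, max_cols=4)


def mismatches(text):
    """Return (header, tag) pairs where the first data row disagrees with the header."""
    header, first_row = text.splitlines()[:2]
    names = [h for h in header.split() if h != "..."]
    tags = re.findall(r"(\w+): ", first_row)
    return [(n, t) for n, t in zip(names, tags) if n != t]


print(truncated)
# An empty list means every displayed column used its own formatter.
print(mismatches(truncated))
```

Applied to the output posted above, the helper reports the pairs (Col4, Col2) and (Col5, Col3).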

@rhshadrach
Member

Thanks - I can reproduce on master once I replace formatters=formatters in your example with the dictionary from the line below. This is indeed a bug, and your fix works well for the case where formatters are a dictionary, but I don't think it will work in the case of a list or tuple. Here, _get_formatter is really expecting an integer representing the position.

I think the root cause of the issue is that the columns attribute is not updated after the call to _chk_truncate in __init__.

Would you be interested in submitting a PR to fix?
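
The dict-versus-list concern raised in this comment can be pictured with a small pandas-free model. This is only a sketch under stated assumptions, not the fix for pandas: `resolve_formatter` is a hypothetical helper, the column names are placeholders, and column labels are assumed to be unique.

```python
# Hypothetical layout: the full frame has six columns, four are displayed.
full_columns = ["Col0", "Col1", "Col2", "Col3", "Col4", "Col5"]
truncated_columns = ["Col0", "Col1", "Col4", "Col5"]


def make(name):
    return lambda x: f"{name}: {x}"


# The same formatters supplied in the two supported shapes.
dict_formatters = {name: make(name) for name in full_columns}
list_formatters = [make(name) for name in full_columns]  # positional, full frame


def resolve_formatter(formatters, i):
    """Map position i in the truncated frame to the matching formatter."""
    label = truncated_columns[i]
    if isinstance(formatters, (list, tuple)):
        # A sequence is indexed by position in the FULL frame, so translate
        # the label back to that position (assumes unique labels).
        return formatters[full_columns.index(label)]
    # A dict is keyed by column label.
    return formatters.get(label)


for fmts in (dict_formatters, list_formatters):
    print([resolve_formatter(fmts, i)(0) for i in range(len(truncated_columns))])
```

Passing the label straight to a list, as the label-based patch would, cannot work because a list has no label keys; translating the label back to a full-frame position is one possible way around that.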

@rhshadrach rhshadrach added the Output-Formatting __repr__ of pandas objects, to_string label Aug 16, 2020
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Aug 16, 2020
@rhshadrach rhshadrach removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 16, 2020
@ipcoder
Author

ipcoder commented Feb 17, 2022

I have never contributed to pandas development and don't know the procedure.
I assume some tests have to pass before I commit, and maybe other things as well.

In addition, I have tried to follow your lead to see whether the columns attribute should indeed be updated, but it is used in several places, and the impact is difficult to estimate, especially without any comments describing the general logic and design intentions.
It seems I would need to understand all the implicit assumptions behind the formatting flow to be able to make changes.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022