MultiIndex to_string edge case Error after 0.23.0 upgrade #21180

Closed
atlasstrategic opened this issue May 23, 2018 · 25 comments
Labels
Output-Formatting __repr__ of pandas objects, to_string Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@atlasstrategic

Code example

import pandas as pd
import numpy as np

index = pd.date_range('1970', '2018', freq='A')
data = np.random.randn(len(index))
columns1 = [
    ['This is a long title with > 37 chars.'],
    ['cat'],
]
columns2 = [
    ['This is a loooooonger title with > 43 chars.'],
    ['dog'],
]
df1 = pd.DataFrame(data=data, index=index, columns=columns1)
df2 = pd.DataFrame(data=data, index=index, columns=columns2)
df = pd.concat([df1, df2], axis=1)
df.head()

Output (using pandas 0.23.0)

>>> df.head()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/core/base.py", line 82, in __repr__
    return str(self)
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/core/base.py", line 61, in __str__
    return self.__unicode__()
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/core/frame.py", line 663, in __unicode__
    line_width=width, show_dimensions=show_dimensions)
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/core/frame.py", line 1968, in to_string
    formatter.to_string()
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/io/formats/format.py", line 648, in to_string
    strcols = self._to_str_columns()
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/io/formats/format.py", line 539, in _to_str_columns
    str_columns = self._get_formatted_column_labels(frame)
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/io/formats/format.py", line 782, in _get_formatted_column_labels
    str_columns = _sparsify(str_columns)
  File "/home/david/.virtualenvs/thegrid-py3-venv/lib/python3.5/site-packages/pandas/core/indexes/multi.py", line 2962, in _sparsify
    prev = pivoted[start]
IndexError: list index out of range

Problem description

After upgrading pandas from 0.22.0 to 0.23.0 I get the above error. The length of the column labels, 'This is a long title with > 37 chars.' and 'This is a loooooonger title with > 43 chars.', makes the difference: if I tweak their combined length to be <= 80 characters, there is no error and the output is as expected.

Expected Output (using pandas 0.22.0)

>>> df.head()
           This is a long title with > 37 chars.  \
                                             cat   
1970-12-31                             -1.448415   
1971-12-31                              0.081324   
1972-12-31                             -0.018105   
1973-12-31                              0.902790   
1974-12-31                              0.668474   

           This is a loooooonger title with > 43 chars.  
                                                    dog  
1970-12-31                                    -1.448415  
1971-12-31                                     0.081324  
1972-12-31                                    -0.018105  
1973-12-31                                     0.902790  
1974-12-31                                     0.668474

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-124-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_ZA.UTF-8
LOCALE: en_ZA.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 32.3.1
Cython: None
numpy: 1.14.0
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: 2.5.3
xlrd: None
xlwt: None
xlsxwriter: 1.0.4
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@atlasstrategic atlasstrategic changed the title Multiindex to_string edge case Error after 0.23.0 upgrade MultiIndex to_string edge case Error after 0.23.0 upgrade May 23, 2018
@TomAugspurger
Contributor

This doesn't raise for me (py36, and pandas master).

What are pd.options.display.max_colwidth, pd.options.display.width, and pd.options.display.max_columns?

@atlasstrategic
Author

atlasstrategic commented May 23, 2018

@TomAugspurger Here is the output on my system with pandas 0.23.0:

>>> import pandas as pd
>>> pd.options.display.max_colwidth
50
>>> pd.options.display.width
80
>>> pd.options.display.max_columns
0

0.22.0 output:

>>> import pandas as pd
>>> pd.options.display.max_colwidth
50
>>> pd.options.display.width
80
>>> pd.options.display.max_columns
20

If I do the following it works in 0.23.0!

pd.set_option("max_columns", 20)

Did the default setting change in 0.23.0?
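
For anyone hitting this before a fix lands, the set_option workaround above can also be scoped to a single block instead of changing the setting globally. A minimal sketch, assuming the stock pd.option_context API; the helper name repr_with_max_columns and the zero-filled frame are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two-level column labels, similar to the report.
columns = pd.MultiIndex.from_tuples([
    ("This is a long title with > 37 chars.", "cat"),
    ("This is a loooooonger title with > 43 chars.", "dog"),
])
df = pd.DataFrame(np.zeros((5, 2)), columns=columns)


def repr_with_max_columns(frame, max_columns=20):
    """Render the frame with display.max_columns overridden temporarily."""
    with pd.option_context("display.max_columns", max_columns):
        return repr(frame)


text = repr_with_max_columns(df)
print(text)
# The global option is back to its previous value once the with-block exits.
```

This only sidesteps the error for the wrapped call; it does not address the underlying bug.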

@atlasstrategic
Author

The docs show that the 0.22 wording:

In case python/IPython is running in a terminal this can be set to 0

has been updated in 0.23 to:

In case Python/IPython is running in a terminal this is set to 0 by default.

However, switching back to 0.22.0 and manually setting the max_columns option to 0 doesn't raise the exception.

🤔 So it still doesn't explain why there would be an error raised when max_columns is set to 0?

@TomAugspurger
Contributor

cc @cbrnr if you have any ideas.

@cbrnr
Contributor

cbrnr commented May 28, 2018

I get an AttributeError: module 'pandas._libs.tslibs.timezones' has no attribute 'tz_standardize' when I test this with the latest master branch revision. Any ideas how to fix this? Using 0.23, I can reproduce the issue.

@TomAugspurger
Contributor

TomAugspurger commented Jun 2, 2018 via email

@cbrnr
Contributor

cbrnr commented Jun 3, 2018

Thanks, I forgot about that. Thankfully, it's not the add one business (I get the same error when I revert this change). This will take a bit of work, since everything works in PyCharm but not in IPython (so debugging will be much slower for me since I'm not used to pdb at all)...

@cbrnr
Contributor

cbrnr commented Jun 4, 2018

Apparently, setting pd.options.display.max_columns = 0 in 0.22 also results in this error. So the issue was not introduced by my change, which merely changed the default to 0.

@atlasstrategic
Author

Hi @cbrnr

also results in this error.

You probably meant to say "does not"? I do agree, merely changing the default to 0 should not result in the unexpected error.

@cbrnr
Contributor

cbrnr commented Jun 4, 2018

No, I get the same error with pandas 0.22 if I first set pd.options.display.max_columns = 0. This means that this bug has been there for a while (I haven't tried older versions, but I suspect that they will behave similarly).

@TomAugspurger
Contributor

I do not get the exception on pandas 0.22.0 with

import pandas as pd
pd.options.display.max_columns = 0
import numpy as np

index = pd.date_range('1970', '2018', freq='A')
data = np.random.randn(len(index))
columns1 = [
    ['This is a long title with > 37 chars.'],
    ['cat'],
]
columns2 = [
    ['This is a loooooonger title with > 43 chars.'],
    ['dog'],
]
df1 = pd.DataFrame(data=data, index=index, columns=columns1)
df2 = pd.DataFrame(data=data, index=index, columns=columns2)
df = pd.concat([df1, df2], axis=1)
df.head()

@TomAugspurger TomAugspurger added the Output-Formatting __repr__ of pandas objects, to_string label Jun 4, 2018
@TomAugspurger TomAugspurger added this to the 0.23.1 milestone Jun 4, 2018
@TomAugspurger
Contributor

Though it occurs to me that this probably depends on the width of the terminal.
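
Since the behavior seems to hinge on the terminal width, it helps to include that width in a report. A small sketch using only the standard library; the helper name terminal_report is hypothetical, and pandas' own size detection may differ from what shutil reports:

```python
import shutil


def terminal_report(fallback=(80, 24)):
    """Return the terminal size Python sees, for inclusion in a bug report."""
    # shutil falls back to the given size when no terminal is attached,
    # for example when output is piped or the code runs under a test runner.
    size = shutil.get_terminal_size(fallback=fallback)
    return {"columns": size.columns, "lines": size.lines}


print(terminal_report())
```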

@jreback
Contributor

jreback commented Jun 4, 2018

not a regression, but still should fix.

@jreback jreback modified the milestones: 0.23.1, Next Major Release Jun 4, 2018
@cbrnr
Contributor

cbrnr commented Jun 5, 2018

@TomAugspurger I just tried again, I do get the error with 0.22. How are you running this code? If you are not in interactive mode (e.g. IPython), you need to change the last line to print(df.head()) in order to produce the output. I'm running this in IPython on macOS in a normal terminal (not Jupyter QtConsole) with 100x35 window size.

@TomAugspurger
Contributor

@jreback could you please make a note when you're moving the milestone? This should be fixed for 0.23.1.

@TomAugspurger TomAugspurger modified the milestones: Next Major Release, 0.23.1 Jun 5, 2018
@jreback
Contributor

jreback commented Jun 5, 2018

i made a note
and this does not need to block 0.23.1
it’s not a regression

pls don’t mark milestones unless ready to go

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 5, 2018

@jreback This is a regression in user experience. It may be an existing bug, but code that was working before, is failing now, because we changed the default. So we should still fix that existing bug for 0.23.1.

I cannot reproduce the error with the example in this issue, but I do see it with the example from #21327

@jorisvandenbossche jorisvandenbossche added the Regression Functionality that used to work in a prior pandas version label Jun 5, 2018
@jreback
Contributor

jreback commented Jun 7, 2018

@jorisvandenbossche sure regressions happen, and we should fix them all. but unless this is fixed today, it will go in the next release.

@jreback jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018
@jorisvandenbossche
Member

We can also change the default of max_columns back to 20 for now if we can't find the time to fix the bugs.

@TomAugspurger
Contributor

Here's a failing unit test

diff --git a/pandas/tests/io/formats/test_format.py b/pandas/tests/io/formats/test_format.py
index f221df93d..52f83f093 100644
--- a/pandas/tests/io/formats/test_format.py
+++ b/pandas/tests/io/formats/test_format.py
@@ -305,6 +305,36 @@ class TestDataFrameFormatting(object):
             assert not has_truncated_repr(df)
             assert not has_expanded_repr(df)
 
+    def test_repr_multiindex(self):
+        # https://github.com/pandas-dev/pandas/issues/21180
+        from unittest import mock
+
+        def f():
+            return os.terminal_size((118, 96))
+
+        terminal_size = os.terminal_size((118, 96))
+
+        p1 = mock.patch('pandas.io.formats.console.get_terminal_size',
+                        return_value=terminal_size)
+        p2 = mock.patch('pandas.io.formats.format.get_terminal_size',
+                        return_value=terminal_size)
+        index = pd.date_range('1970', '2018', freq='A')
+        data = np.random.randn(len(index))
+        columns1 = [
+            ['This is a long title with > 37 chars.'],
+            ['cat'],
+        ]
+        columns2 = [
+            ['This is a loooooonger title with > 43 chars.'],
+            ['dog'],
+        ]
+        df1 = pd.DataFrame(data=data, index=index, columns=columns1)
+        df2 = pd.DataFrame(data=data, index=index, columns=columns2)
+        df = pd.concat([df1, df2], axis=1)
+
+        with p1, p2:
+            repr(df.head())
+
     def test_repr_max_columns_max_rows(self):
         term_width, term_height = get_terminal_size()
         if term_width < 10 or term_height < 10:
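
The test above pins the terminal size with mock.patch so the failure does not depend on the machine running it. The same technique can be sketched against the standard library alone. Here width_seen_by_code is a hypothetical stand-in for layout code, and patching shutil.get_terminal_size is an assumption: mock replaces a name where it is looked up, which is why the test patches the two pandas module paths instead.

```python
import os
import shutil
from unittest import mock

# Fixed size taken from the failing test: 118 columns by 96 lines.
fake_size = os.terminal_size((118, 96))


def width_seen_by_code():
    """Stand-in for code that decides its layout from the terminal width."""
    return shutil.get_terminal_size().columns


# Inside the with-block every lookup of shutil.get_terminal_size gets the mock.
with mock.patch("shutil.get_terminal_size", return_value=fake_size):
    patched_width = width_seen_by_code()

print(patched_width)  # prints 118
```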

@jorisvandenbossche
Member

If we don't have a fix for this, I would consider reverting pandas.options.display.max_columns back to 20, then working on a fix and possibly switching back to 0 for 0.24.0.

Errors in the repr are really annoying, as you cannot even inspect the data properly to see what might be the reason something is not working.

@TomAugspurger
Contributor

I'm going to try to fix it now.

@TomAugspurger
Contributor

What's the expected behavior here? I can easily match the behavior of the non-MI case,

In [3]: s = pd.DataFrame({"A" * 41: [1, 2], 'B' * 41: [1, 2]})

In [4]: with p1, p2:
   ...:     print(repr(s))
   ...:
  ...
0 ...
1 ...

[2 rows x 2 columns]

but that's not too useful...

@jorisvandenbossche
Member

That's a good question. For the truncated repr, we always need two columns, right? (first and last)
So previously it put the two columns below each other, but now they would need to be next to each other, which is exactly the problem, as they do not fit.

@jorisvandenbossche
Member

Overflowing the line is what happens in my console if I make it smaller (instead of the error). That might be an option in general: it does not make the repr very readable for this case, but at least it would not lead to an error.
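
The "overflow rather than raise" idea could look roughly like the sketch below. This is not pandas' truncation code: n_columns_to_show and its parameters are hypothetical, widths are plain character counts, and it only counts columns without choosing which ones (first and last) to keep.

```python
def n_columns_to_show(col_widths, term_width, index_width=0, min_cols=2):
    """Count the leading columns that fit, but never fewer than min_cols.

    Hypothetical helper: widths are character counts per rendered column.
    """
    used = index_width
    fit = 0
    for width in col_widths:
        if used + width > term_width:
            break
        used += width
        fit += 1
    # Accept an overflowing line rather than failing when too few columns fit.
    return max(fit, min(min_cols, len(col_widths)))


# Two wide columns that do not fit side by side are still both shown.
print(n_columns_to_show([50, 56], term_width=100, index_width=10))  # prints 2
# A single column wider than the terminal is shown and simply overflows.
print(n_columns_to_show([200], term_width=100))  # prints 1
```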

5 participants