IndexError: tuple index out of range after upgrade to 0.25 #27775

Closed
Foadsf opened this issue Aug 6, 2019 · 13 comments · Fixed by #27818

@Foadsf

commented Aug 6, 2019

Root cause (in both cases using df = pd.DataFrame({'a': [1, 2, 3]})):

In [71]: pd.__version__  
Out[71]: '0.25.0'

In [73]: df.index[:, None]
Out[73]: Int64Index([0, 1, 2], dtype='int64')

In [74]: df.index[:, None].shape
Out[74]: (3,)

vs

In [10]: pd.__version__  
Out[10]: '0.24.2'

In [13]: df.index[:, None] 
Out[13]: Int64Index([0, 1, 2], dtype='int64')

In [14]: df.index[:, None].shape
Out[14]: (3, 1)

So before, indexing with [:, None] (in NumPy, a way to add a dimension and get a 2D array) actually resulted in an Index with ndim of 2 (which is, of course, an inconsistent state for the Index object).

Matplotlib relied on this fact when an Index is passed to plt.plot, as reported in matplotlib/matplotlib#14992


I have explained the issue here and here in detail. Basically, after upgrading to version 0.25, I got the error:

IndexError: tuple index out of range

while attempting to plot a CSV file.

@jreback

Contributor

commented Aug 6, 2019

pls update the top section with a reproducible example; links to additional material are fine, but the source material and versions should be here

@Foadsf

Author

commented Aug 6, 2019

@jreback I have actually downgraded Pandas from 0.25 to 0.24 so I'm not sure if there are other dependencies which might have also been downgraded. Right now the result of pd.show_versions() is:

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.0
pytest: None
pip: 19.2.1
setuptools: 41.0.1
Cython: None
numpy: 1.17.0
scipy: 1.3.0
pyarrow: None
xarray: None
IPython: 7.7.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.1.1
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: 1.1.8
lxml.etree: 4.4.0
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The reproducible example is actually very simple:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

headers = ['fx', 'fy', 'fz', 'tx', 'ty', 'tz', 'currentr',
           'time', 'theta', 'omegay', 'currenty', 'pr', 'Dc', 'Fr', 'Fl']
df = pd.read_csv('data.csv', names=headers)

fig3 = plt.figure()
plt.plot(df.index, df['time'])
plt.show()

Nothing particularly specific. More details, including the CSV file, are here.

Please let me know if this is satisfactory. Thanks in advance for your support.

@jreback

Contributor

commented Aug 6, 2019

pls try to reduce this to a copy pastable example w/o any external links
the likelihood of response will be higher

@Foadsf

Author

commented Aug 6, 2019

Dear @jreback ,

@anntzer has provided a small example showing the difference between 0.25 and 0.24 here, so I'm just gonna quote her/him:

import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
print(df.index.shape, df.index[:, None].shape)

This now prints (3,) (3,), but with pandas 0.24 it used to print (3,) (3, 1), which we relied on to convert input to 2D.
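A possible user-side workaround, sketched here under the assumption that the caller only needs the index values as a 2D array: convert the Index to a plain NumPy array before adding the new axis, so the result does not depend on how Index handles a 2D indexer. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Convert the Index to a plain ndarray first, then add the new axis.
# The 2D indexing is done by NumPy, not by the pandas Index.
index_values = np.asarray(df.index)
column = index_values[:, None]

print(index_values.shape, column.shape)
```

This sidesteps the question of what Index.__getitem__ should return for a 2D indexer; it does not change the pandas behavior discussed in this issue.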

@jorisvandenbossche

Member

commented Aug 6, 2019

@Foadsf I updated the top post with that example

@jorisvandenbossche

Member

commented Aug 6, 2019

So the root cause is that we don't handle a 2D indexer on an Index class well.
We basically ignore the fact that df.index[:, None] is a 2D indexer.

The source of Index.__getitem__ actually mentions that for such a case, a plain ndarray should be returned:

If resulting ndim != 1, plain ndarray is returned instead of
corresponding `Index` subclass.

but that clearly does not happen (anymore).
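To make the documented behavior concrete, here is a minimal sketch of a container whose __getitem__ falls back to a plain ndarray when the result is not 1-dimensional. SketchIndex is a hypothetical stand-in written for illustration; it is not the pandas implementation.

```python
import numpy as np


class SketchIndex:
    """Hypothetical 1-D container; NOT the pandas Index implementation."""

    def __init__(self, values):
        values = np.asarray(values)
        if values.ndim != 1:
            raise ValueError("SketchIndex must be 1-dimensional")
        self.values = values

    def __getitem__(self, key):
        result = self.values[key]
        if np.ndim(result) == 0:
            # Scalar lookup: return the element itself.
            return result
        if result.ndim != 1:
            # Result is not 1-D: hand back the plain ndarray.
            return result
        # Result is still 1-D: keep the container type.
        return SketchIndex(result)


idx = SketchIndex([0, 1, 2])
sliced = idx[:2]         # stays a SketchIndex
expanded = idx[:, None]  # becomes a plain 2D ndarray
print(type(sliced).__name__, type(expanded).__name__, expanded.shape)
```

With this design the container itself can never hold 2D values, at the cost of __getitem__ returning two different types.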

@TomAugspurger

Contributor

commented Aug 6, 2019

Though I don't think returning an ndarray is appropriate, right? I'd be surprised to have __getitem__ change the type to a different container class.

What's the best path forward? IMO raising is the most correct thing to do. But is it worth changing?

@jorisvandenbossche

Member

commented Aug 6, 2019

This was "caused" by #27384, which optimized Index.shape to return (len(self),) instead of self.values.shape.
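The two ways of computing the shape agree for 1D values and differ only when the wrapped array is 2D. The sketch below uses a hypothetical ShapeDemo class (not pandas code) that exposes both computations side by side.

```python
import numpy as np


class ShapeDemo:
    """Hypothetical wrapper used only to compare the two shape computations."""

    def __init__(self, values):
        self.values = np.asarray(values)

    def __len__(self):
        return len(self.values)

    @property
    def shape_from_len(self):
        # Cheap: only needs the length, always reports a 1-D shape.
        return (len(self),)

    @property
    def shape_from_values(self):
        # Reports whatever shape the wrapped array actually has.
        return self.values.shape


valid = ShapeDemo([1, 2, 3])
invalid = ShapeDemo(np.array([1, 2, 3])[:, None])  # 2D values

print(valid.shape_from_len, valid.shape_from_values)
print(invalid.shape_from_len, invalid.shape_from_values)
```

For the 2D case the length-based shape hides the extra dimension, which is the mismatch visible in the session below.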

But of course the bottom line is still that an Index with 2D values is an invalid Index object:

In [13]: idx = pd.Index([1, 2, 3])[:, None]                                                                                                                   

In [14]: idx.values                                                                                                                                           
Out[14]: 
array([[1],
       [2],
       [3]])

In [15]: idx.shape                                                                                                                                            
Out[15]: (3,)

@jorisvandenbossche

Member

commented Aug 6, 2019

I think short term, the easiest option is to revert the Index.shape change (but we could keep it for MultiIndex, to keep the performance improvement). That would at least solve the regression with matplotlib.

But longer term this is not really a good solution.
Raising an error certainly sounds like a valid option, but that will require changes in matplotlib.

I suppose the reason it returned a 2D array before might have been that Index was an ndarray subclass, and in general it might be useful to have the Index be seen as an array-like that behaves well in code that expects a numpy-like array.

BTW, Series actually does this:

In [16]: pd.Series([1, 2, 3])[:, None]                                                                                                                        
Out[16]: 
array([[1],
       [2],
       [3]])

@jorisvandenbossche

Member

commented Aug 6, 2019

The Series case only works for actual numpy dtypes. E.g., for categorical it returns a Series, but that goes wrong in all kinds of ways:

In [32]: s = pd.Series(pd.Categorical(['a', 'b']))[:, None]                                                                                                   

In [33]: type(s)                                                                                                                                              
Out[33]: pandas.core.series.Series

In [34]: s                                                                                                                                                    
Out[34]:
...
TypeError: unsupported format string passed to numpy.ndarray.__format__

In [35]: s._data                                                                                                                                              
Out[35]: 
SingleBlockManager
Items: Int64Index([[0], [1]], dtype='int64')
CategoricalBlock: 1 dtype: category

In [36]: s.index                                                                                                                                              
Out[36]: Int64Index([[0], [1]], dtype='int64')

In [37]: s.values                                                                                                                                             
Out[37]: 
...
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [38]: s.cat.codes                                                                                                                                          
...
ValueError: Length of passed values is 1, index implies 2

@tacaswell

Contributor

commented Aug 6, 2019

From Matplotlib's point of view, returning a numpy array is just fine (as we are trying to duck-type Series and Index as numpy arrays anyway). If we have gotten to the point where we are doing [:, None], we probably think it is close enough to a numpy array; maybe we just need to cast to numpy a bit more vigorously?
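As a rough illustration of "casting to numpy more vigorously", a consumer could coerce its input with np.asarray before adding a dimension. The helper name as_column is hypothetical and is not Matplotlib's actual code; it is only a sketch of the idea.

```python
import numpy as np
import pandas as pd


def as_column(values):
    """Hypothetical helper: return `values` as a 2D ndarray.

    1-D input becomes a column of shape (n, 1); other input is
    returned as an ndarray with its dimensions unchanged.
    """
    arr = np.asarray(values)  # cast Index / Series / list to ndarray first
    if arr.ndim == 1:
        arr = arr[:, None]
    return arr


from_index = as_column(pd.Index([0, 1, 2]))
from_series = as_column(pd.Series([1.0, 2.0, 3.0]))
from_list = as_column([4, 5])

print(from_index.shape, from_series.shape, from_list.shape)
```

Because the new axis is added on the ndarray, the result does not depend on what Index or Series return for a 2D indexer.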

@jorisvandenbossche

Member

commented Aug 8, 2019

This is also related to #27125 (the fact that we can create an Index with >1 dimensional array).

For a 0.25.1 bugfix release, I would propose to again start returning the 2D shape.

@jorisvandenbossche

Member

commented Aug 8, 2019

I opened a PR for what I proposed above: #27818

I think for pandas it is fine to output an "invalid" (2D) shape as long as we allow constructing "invalid" Index objects. We should fix that second issue though, for which there is #27125
