Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: read_excel with multi-indexed column ignores index_col=None #11733

Closed
jorisvandenbossche opened this issue Dec 1, 2015 · 18 comments

Comments

Projects
None yet
@jorisvandenbossche
Copy link
Member

commented Dec 1, 2015

From SO: http://stackoverflow.com/questions/34020061/excel-to-pandas-dataframe-using-first-column-as-index

@chris-b1 another one on the multi-index excel issues .. :-)

Small test case: content of excel file:

A A B B
key val key val
1 2 3 4
1 2 3 4

gives:

In [2]: pd.read_excel("test_excel_index_col.xlsx", header=[0,1], index_col=None)
Out[2]:
A     A    B
key val key  val
1     2    3   4
1     2    3   4

It's not super clear in the formatting of the dataframe, but the [1, 1] is the index and [A, key] are seen as the level names of the multi-indexed columns.

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Dec 1, 2015

We don't support writing to this format (multi-index columns w/ no row index) because it's ambiguous on the way back in. But it seems reasonable to support reading it, I'll take a look.

@jreback jreback modified the milestones: Next Major Release, 0.18.0 Jan 24, 2016

@pheman

This comment has been minimized.

Copy link

commented Jan 20, 2017

I got the same problem

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2017

From #15180 (comment) - consider changing default of index_col to some kind of sentinel (e.g. 'infer') so that passing None means something.

@stephenrauch

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2017

@chris-b1 @jreback I have a proposed fix here: stephenrauch@1204b31 Before I do all the docs and stuff I wanna make sure I am headed in the right direction. Thanks.

@stephenrauch

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2017

Per review comment from @jreback here is proposed api for read_excel. It is the same as read_csv.

index_col : int or sequence or False, default None

Column (0-indexed) to use as the row labels of the DataFrame. If a
sequence is given, those columns will be combined into a MultiIndex.
If None (default), pandas will use the first column as the
index. If False, force pandas to not use the first column as the index
(row names).

Updates are here:

stephenrauch@9b37ff9

So if this proposed API looks ok, I will do the PR.

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Mar 14, 2017

Thanks @stephenrauch, that api looks good to me, please open it as a PR. One subtlety for the docs, if not using a MultiIndex header, index_col=None won't always use the first column as an index, it takes an inference path based on the shape of the data (same as for read_csv IIRC). e.g.

In [31]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 5]})

In [32]: df.to_excel('temp.xlsx', index=False)

In [33]: pd.read_excel('temp.xlsx')

   a  b
0  1  4
1  2  5
2  3  5
@stephenrauch

This comment has been minimized.

Copy link
Contributor

commented Mar 15, 2017

it takes an inference path based on the shape of the data

@chris-b1 @jreback Are there any test cases (or maybe some other description) that show how index_col=None works for something other than the index in the first column? Either for the csv or excel cases? I am having trouble grokking what is meant by the shape of the data in this case.

@chris-b1

This comment has been minimized.

Copy link
Contributor

commented Mar 15, 2017

Not sure csv actually works this way. But for current Excel behavior:

Sheet 1
image

parses a row index

In [297]: pd.read_excel('temp.xlsx', sheetname='Sheet1')
Out[297]: 
      a  b  c
row1  1  2  4
row2  1  2  4
row3  1  2  4

Sheet2
image

No row index

In [298]: pd.read_excel('temp.xlsx', sheetname='Sheet2')
Out[298]: 
   a  b  c
0  1  2  4
1  1  2  4
2  1  2  4
@chenweisomebody126

This comment has been minimized.

Copy link

commented Jun 6, 2017

Has this problem been solved? I met the same question.

@araespahan

This comment has been minimized.

Copy link

commented Dec 15, 2017

As of now I still see the same issue. when using multi headers with read_excel, pandas always assigns the first column as index.

@sdicker8

This comment has been minimized.

Copy link

commented Apr 3, 2018

Another vote for a fix. I'm running 0.22 and it seems so hackish to have to write the code to save the 'index' to a named column and then reset the index.

Pandas is a 'great' module - thank you -

@the-rccg

This comment has been minimized.

Copy link

commented Jul 10, 2018

Vote for index_col=False to fix this

@rileymcdowell

This comment has been minimized.

Copy link

commented Jul 11, 2018

Just encountered this issue and I am looking for a fix. @stephenrauch has the PR been made for this?

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jul 13, 2018

@rileymcdowell no PRs have been made but would welcome any if you are interested

@rileymcdowell

This comment has been minimized.

Copy link

commented Jul 14, 2018

I'll put together a PR from @stephenrauch's work in the next couple of days.

@rileymcdowell

This comment has been minimized.

Copy link

commented Jul 15, 2018

I've dug into this and ran into a decision point. Consider the following spreadsheet (Taken from the test suite).

snip

Right now, this is interpreted by the read_excel function as a MultiIndex where sheet cells A1 and A2 (values c1 and c2) represent the level names of the MultiIndex.

MultiIndex(levels=[['bar', 'foo'], ['a', 'b']],
           labels=[[1, 1, 0, 0], [0, 1, 0, 1]],
           names=['c1', 'c2'])

The situation that brought me to find this github issue is that in this example, I expect cells A1 and A2 (values c1 and c2) to be interpreted as another set of levels in the MultiIndex. For example

MultiIndex(levels=[['bar', 'c1', 'foo'], ['a', 'b', 'c2']],
           labels=[[1, 2, 2, 0, 0], [2, 0, 1, 0, 1]],
           names=['', ''])

The test suite explicitly covers the existing functionality of the former. This behavior differs from that of the read_csv function, which behaves most similarly to the latter.

A workaround is to allow a sentinel value of index_col=False as suggested in this issue, but a change to support this would also touch some of the core logic in pandas.io.parsers. For example this check which limits non-null index_col settings to have numeric values. That seems like it could have far-reaching consequences.

Any thoughts about how best to tackle this?

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jul 16, 2018

The situation that brought me to find this github issue is that in this example, I expect cells A1 and A2 (values c1 and c2) to be interpreted as another set of levels in the MultiIndex. For example

Hmm well I disagree since this representation matches what you'd see with a normal data frame representation, but regardless of opinions I think it just speaks to what @chris-b1 mentioned earlier that this is really ambiguous so there's not necessarily a right answer

This behavior differs from that of the read_csv function, which behaves most similarly to the latter.

Can you clarify this with an example? CSV doesn't have the concept of a merged cell like you have with the foo and bar cells so I don't think this is apples-to-apples

A workaround is to allow a sentinel value of index_col=False as suggested in this issue, but a change to support this would also touch some of the core logic in pandas.io.parsers. For example this check which limits non-null index_col settings to have numeric values. That seems like it could have far-reaching consequences.

Isn't False a valid value for read_csv? If so then there's got to be something with that implementation that isn't limited by what you've found

Thanks for the investigation!

@ian-contiamo

This comment has been minimized.

Copy link

commented Sep 18, 2018

@rileymcdowell I would agree with @WillAyd that the former of the two behaviors you describe is the most intuitive way to interpret an Excel file. If you can I would encourage you to submit a PR with this behavior.

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 14, 2018

@jreback jreback removed this from the Contributions Welcome milestone Nov 14, 2018

@jreback jreback added this to the 0.24.0 milestone Nov 14, 2018

gfyoung added a commit to forking-repos/pandas that referenced this issue Nov 14, 2018

gfyoung added a commit that referenced this issue Nov 18, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.