ENH: read_excel MultiIndex #4679 #10967

Merged
merged 1 commit into from
Sep 9, 2015

chris-b1
Contributor

@chris-b1 chris-b1 commented Sep 2, 2015

closes #4679
xref #10564

Output of to_excel should now be fully round-trippable with read_excel, given the
right combination of index_col and header.

To make the semantics match read_csv, an index column name (has_index_names=True) is
always assumed if something is passed to index_col. This should be non-breaking:
if there are no names, they will just be filled with None as before.

In [7]: df = pd.DataFrame([[1,2,3,4], [5,6,7,8]],
...:                   columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']],
...:                                                        names = ['col1', 'col2']),
...:                   index = pd.MultiIndex.from_product([['j'], ['l', 'k']],
...:                                                      names = ['i1', 'i2']))

In [8]: df
Out[8]: 
col1    foo    bar   
col2    a  b   a  b
i1 i2              
j  l    1  2   3  4
   k    5  6   7  8

In [9]: df.to_excel('test.xlsx')

In [10]: df = pd.read_excel('test.xlsx', header=[0,1], index_col=[0,1])

In [11]: df
Out[11]: 
col1    foo    bar   
col2    a  b   a  b
i1 i2              
j  l    1  2   3  4
   k    5  6   7  8

@chris-b1
Contributor Author

chris-b1 commented Sep 2, 2015

So my "non-breaking" change (always trying to parse index names) does have a corner case: when all values in the first row of the DataFrame are missing.

I could change that back easily enough, but I thought it may make sense to slightly change the default output format of to_excel to remove this ambiguity and further match the to_csv format.

What I would propose is if the index has a name, it is placed at the column level, as shown below. This is also a much easier format to work with in Excel - at least in my workflow, I usually end up manually reshaping the data to look like this anyways.

Thoughts? @jreback @jorisvandenbossche

Current: [screenshot]

Proposed: [screenshot]

@jreback
Contributor

jreback commented Sep 3, 2015

can u show a picture when the index does not have a name (in current and proposed) - iow is the blank line there?

@chris-b1
Contributor Author

chris-b1 commented Sep 3, 2015

If there isn't a name, both current and proposed have no blank line, like this:
[screenshot]

@jreback
Contributor

jreback commented Sep 3, 2015

@chris-b1 this change seems reasonable. let me open for a few comments

cc @jtratner
cc @hayd
cc @cancan101
cc @flamingbear
cc @onesandzeroes

@jreback jreback added this to the 0.17.0 milestone Sep 3, 2015
@chris-b1
Contributor Author

chris-b1 commented Sep 3, 2015

Thanks, I'll note my current branch still has a couple failing edge cases around mixes of names/no names (more tests coming), but those are fixable. Just to restate the goal clearly:

  • Any output of to_excel can be read with read_excel by specifying only index_col and header
  • Deprecate has_index_names
  • Generally match to/from_csv semantics/format
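As a rough illustration of the CSV semantics these goals aim to mirror, here is a minimal sketch that uses `to_csv`/`read_csv` instead of the Excel functions, so it runs without an Excel engine installed. The frame (with a single named index level `i1`) and the variable names are made up for the example; the stated intent of the PR is that `to_excel`/`read_excel` accept the same `header`/`index_col` pattern.

```python
import io
import pandas as pd

# Illustrative frame: two-level columns, one named index level.
columns = pd.MultiIndex.from_product(
    [["foo", "bar"], ["a", "b"]], names=["col1", "col2"]
)
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=columns,
    index=pd.Index(["l", "k"], name="i1"),
)

# Write to text, then read back stating only which rows/columns are indexes.
csv_text = df.to_csv()
restored = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

print(restored.index.name)
print(list(restored.columns.names))
```

Nothing besides `header` and `index_col` is passed on the way back in, which is the property the first goal asks for.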

@jreback
Contributor

jreback commented Sep 3, 2015

@chris-b1 awesome!

@chris-b1
Contributor Author

chris-b1 commented Sep 3, 2015

Alright, my latest commit fully represents the new behavior.

I'll note that this does pick up two quirks from the base parser:

  1. In the case of a column MultiIndex, a blank row ALWAYS has to be inserted so the format is unambiguous. I believe this is unavoidable (csv does it too), but the output looks a little odd if the index doesn't have names. xref BUG: to_csv extra header line with multiindex columns #6618, picture below.
  2. In the case of an index MultiIndex without names, the level names will be read back in as [None, "Unnamed: 1", "Unnamed: 2", etc.]. This is in the base parsing logic (i.e. it also happens with csv), so I think it's outside the scope of this PR; I added issue BUG? Parser adds empty MultiIndex level names #10984.

[screenshot]

@jorisvandenbossche
Member

In principle +1 on the change. (and your goals sound very good)

If the columns have a name (df.columns.name), then the current behaviour is kept, I assume? (EDIT: this is still about your question a bit higher #10967 (comment), but, I just noted that to_csv ignores the columns name in such a case (if it is not a Multi-Indexed columns))

@jorisvandenbossche
Member

About your point 1 above, do you think the blank row is unavoidable? Given #6618 it seems we would want to change this for csv, and if that is the case, it would make sense to do the change here already.

columns = pd.MultiIndex.from_product([['foo','bar'],['a','b']],
names = ['col1', 'col2']),
index = pd.MultiIndex.from_product([['j'], ['l', 'k']],
names = ['i1', 'i2']))
Contributor

I'd like to add your before/after pictures here as well (the ones from above) (only in the whatsnew)

@chris-b1
Contributor Author

chris-b1 commented Sep 4, 2015

@jorisvandenbossche - column names in the single index case are ignored, just like csv. The old export worked that way too.

For the blank row, this is the ambiguous case if you don't have it (or another kwarg). Is "a" the index name, or is the first row of data all missing?

[screenshot]

@jorisvandenbossche
Member

Good point. But read_csv at the moment interprets it as an empty row? So that seems not consistent with the output? How do you roundtrip that correctly?

@chris-b1
Contributor Author

chris-b1 commented Sep 4, 2015

You roundtrip by assuming 'a' is the index name (because there would be an all-blank row if it wasn't), which is what csv does too.

In [199]: df = pd.read_csv(
    ...: StringIO(""",foo,foo,bar,bar
    ...: ,a,b,a,b,
    ...: a,,,,
    ...: b,1,2,3,4
    ...: a,5,6,7,8"""), index_col=0, header=[0,1])

In [200]: df.index
Out[200]: Index([u'b', u'a'], dtype='object', name=u'a')
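
For readers on Python 3, the same check can be sketched roughly as below. The imports, the second call without `index_col`, and the variable names are additions for illustration and not part of the original session. The sample text here omits the trailing comma on the second header line so that every header row has the same width.

```python
import io
import pandas as pd

data = """,foo,foo,bar,bar
,a,b,a,b
a,,,,
b,1,2,3,4
a,5,6,7,8"""

# With index_col given, the row "a,,,," is taken as the index-name row.
with_index = pd.read_csv(io.StringIO(data), index_col=0, header=[0, 1])
print(with_index.index.name, len(with_index))

# Without index_col, the same row is kept as a data row of missing values.
without_index = pd.read_csv(io.StringIO(data), header=[0, 1])
print(len(without_index))
```

Comparing the two row counts shows the ambiguity under discussion: the same line is either an index name or a row of missing data, depending on whether an index column was requested.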

@jreback
Contributor

jreback commented Sep 4, 2015

@chris-b1 yep this is the convoluted logic in read_csv to handle this (as well as the case where no blank line exists).

@jorisvandenbossche
Member

@chris-b1 yes, indeed, I forgot the index_col=0 in my test case ..

if index_label and self.header is not False:
if self.merge_cells:
Contributor

is .merge_cells needed any longer?

Contributor Author

It is. There is still the non-default option to write the MI as non-merged cells; it just no longer affects this particular offset.

@flamingbear
Contributor

Excellent, should probably get rid of warnings and docs I put in for #10564.

Let me see if I can get @chris-b1 a PR on your repo.

@chris-b1
Contributor Author

chris-b1 commented Sep 5, 2015

@flamingbear, I think I've got them cleaned up, but if I missed anything, definitely appreciate a PR. I'm going to push more changes in a few minutes, so I would wait just a second before looking.

@chris-b1
Contributor Author

chris-b1 commented Sep 5, 2015

@jreback - cleaned up the things you noted and rebased on the new testing code.

@jreback
Contributor

jreback commented Sep 5, 2015

gr8. travis is borking atm. for some reason not tagging the versions correctly.....so may fail :<

@flamingbear
Contributor

I couldn't see how to mention you @jreback on this PR: chris-b1#2. Seems reasonable if we're not warning anymore because round trips are ok?

@chris-b1
Contributor Author

chris-b1 commented Sep 5, 2015

@flamingbear - I didn't realize the verbose keyword was only used for that warning, I'll merge it in. Thanks!

In version 0.16.2 a ``DataFrame`` with ``MultiIndex`` columns could not be written to Excel via ``to_excel``.
That functionality has been added (:issue:`10564`), along with updating ``read_excel`` so that the data can
be read back with no loss of information by specifying which columns/rows make up the ``MultiIndex``
in the `header` and `index_col` parameters (:issue:`4679`)
Contributor

use double-backticks around header/index_col

@jreback
Contributor

jreback commented Sep 8, 2015

minor doc fixes. ping when pushed (as it's only docs, it's already green)

@chris-b1
Contributor Author

chris-b1 commented Sep 8, 2015

@jreback - doc changes pushed, thanks.

@@ -205,6 +205,52 @@ The support math functions are `sin`, `cos`, `exp`, `log`, `expm1`, `log1p`,
These functions map to the intrinsics for the NumExpr engine. For Python
engine, they are mapped to NumPy calls.

Changes to Excel with ``MultiIndex``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Contributor

I think you need one more ^ here

@jreback
Contributor

jreback commented Sep 9, 2015

some minor comments. ping when green (travis is way behind, FYI)

@chris-b1
Contributor Author

chris-b1 commented Sep 9, 2015

@jreback - green. I went ahead and changed the docstring of to_excel and parse to a common template.

@@ -1989,6 +1989,46 @@ advanced strategies
Reading Excel Files
'''''''''''''''''''

.. versionadded:: 0.17
Contributor

ok for now, but maybe make this have sub-sections to make this a bit easier to navigate

jreback added a commit that referenced this pull request Sep 9, 2015
@jreback jreback merged commit 0e56279 into pandas-dev:master Sep 9, 2015
@jreback
Contributor

jreback commented Sep 9, 2015

@chris-b1 nice change!

@jreback
Contributor

jreback commented Sep 11, 2015

I don't think the images in the whatsnew:

**Old**

.. image:: _static/old-excel-index.png

**New**

.. image:: _static/new-excel-index.png

got pushed up, can you do a quick pr with those

@cfobel

cfobel commented Sep 15, 2016

Thanks to all who have worked on this. Round-trip to/from Excel has been very helpful for me to collaborate with colleagues in my multidisciplinary lab!

One thing I was wondering: what about trying to infer the multi-index columns based on formatting?

For example, to_excel bolds and puts borders around the index rows/columns.

Any thoughts/suggestions or potential issues here? If there is interest in this, I might try coding this.

@chris-b1
Contributor Author

I wouldn't be opposed, although of course it would need to be implemented pretty carefully, and it's probably not much fun to munge and deduce formats!

One other possibility I've considered but not seriously explored is saving metadata about exported frames in the file itself. For instance, add a hidden sheet, '_metadata', that stores the shape of each saved frame, which could be used on the way back in.

Limiting yourself to .xlsx you could even pack this metadata in the XML document itself, without the ugliness of an additional sheet. Example here, though not sure if xlsxwriter / xlrd support writing/reading arbitrary metadata.
http://thinktibits.blogspot.com/2014/07/read-write-metadata-excel-poi-example.html
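
To make the metadata idea a little more concrete, here is a minimal sketch of what such a stored record might contain. `layout_metadata` is a hypothetical helper, not a pandas function, and the JSON encoding is only one possible choice. Actually writing the string into a hidden sheet or the XML document is left out, since that depends on what the Excel engine supports.

```python
import json
import pandas as pd


def layout_metadata(df):
    """Hypothetical helper: record how many header rows and index columns
    a frame occupies, in the form read_excel's header/index_col accept."""
    return {
        "header": list(range(df.columns.nlevels)),
        "index_col": list(range(df.index.nlevels)),
    }


df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8]],
    columns=pd.MultiIndex.from_product(
        [["foo", "bar"], ["a", "b"]], names=["col1", "col2"]
    ),
    index=pd.MultiIndex.from_product([["j"], ["l", "k"]], names=["i1", "i2"]),
)

# Serialise to a string that could be stored alongside the data
# (the storage step itself is not shown here).
stored = json.dumps(layout_metadata(df))

# On the way back in, the stored layout supplies the keyword arguments.
read_kwargs = json.loads(stored)
print(read_kwargs)
```

Under these assumptions a reader could call `pd.read_excel(path, **read_kwargs)` without knowing the frame's shape in advance.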

Successfully merging this pull request may close these issues.

ENH: Excel - allow for multiple rows to be treated as hierarchical columns
5 participants