BUG: to_excel swaps order of values of duplicate columns #11007

Closed
jorisvandenbossche opened this Issue Sep 5, 2015 · 11 comments

Comments

Projects
None yet
4 participants

On master, as a small example:

In [1]: df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], columns=['A','B','A','B'])

In [2]: df
Out[2]:
   A  B  A  B
0  1  2  3  4
1  5  6  7  8

In [4]: df.to_excel('test_excel_duplicate_columns.xlsx')

gives:

capture

So the values of columns 2 and 3 are swapped (not the column names)

BTW, this happens both with .xlsx as .xls (openpyxl / xlsxwriter / xlwt)

Possibly related: #10982, #10970

jreback added the IO Excel label Sep 5, 2015

Contributor

terrytangyuan commented Sep 5, 2015

@jorisvandenbossche I got this bug when running your new example df.to_excel('test_excel_duplicate_columns.xlsx'):

/Library/Python/2.7/site-packages/openpyxl/styles/styleable.py:111: UserWarning: Use formatting objects such as font directly
  warn("Use formatting objects such as font directly")
/Library/Python/2.7/site-packages/openpyxl/styles/__init__.py:52: UserWarning: Call to deprecated function or class copy (Copy formatting objects like font directly).
  def copy(self):
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-66313a95798a> in <module>()
----> 1 df.to_excel('test_excel_duplicate_columns.xlsx')

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep)
   1272         formatted_cells = formatter.get_formatted_cells()
   1273         excel_writer.write_cells(formatted_cells, sheet_name,
-> 1274                                  startrow=startrow, startcol=startcol)
   1275         if need_save:
   1276             excel_writer.save()

/Library/Python/2.7/site-packages/pandas/io/excel.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
    776
    777             if style_kwargs:
--> 778                 xcell.style = xcell.style.copy(**style_kwargs)
    779
    780             if cell.mergestart is not None and cell.mergeend is not None:

/Library/Python/2.7/site-packages/openpyxl/compat/__init__.pyc in new_func(*args, **kwargs)
     65                 lineno=_code.co_firstlineno + 1
     66             )
---> 67             return obj(*args, **kwargs)
     68         return new_func
     69

TypeError: copy() got an unexpected keyword argument 'font'

Do you know what went wrong?

as @jreback said, you have to look at the openpyxl version

Contributor

terrytangyuan commented Sep 5, 2015

Still got this error after upgrade. (I was actually using the most updated one earlier)

pd.__version__
Out[4]: '0.16.2'

openpyxl.__version__
Out[6]: '2.2.6'

Contributor

jreback commented Sep 5, 2015

yeh, that version is borked :)

Contributor

jreback commented Sep 5, 2015

latest conda version is ok, its the pip version that is a problem (which we are not testing with .... :<)

Contributor

terrytangyuan commented Sep 5, 2015

Okay. I'll try conda then.

There have been some duplicate-column fixes previously: #5237, but clearly not solved all.

cc @neirbowj
cc @jtratner
cc @jmcnamara

jreback added this to the Next Major Release milestone Sep 8, 2015

Contributor

jmcnamara commented Sep 17, 2015

I think the code causing the issue the following (and was introduced by me):

        # Get a frame that will account for any duplicates in the column names.
        col_mapped_frame = self.df.loc[:, self.columns]

        # Write the body of the frame data series by series.
        for colidx in range(len(self.columns)):
            series = col_mapped_frame.iloc[:, colidx]
            for i, val in enumerate(series):
                yield ExcelCell(self.rowcounter + i, colidx + coloffset, val)

If so then the loc/iloc code isn't retrieving/writing the column data in the correct order. Also, perhaps len(self.columns) should be len(col_mapped_frame).

Any suggestions on a better way to fix this?

So the problem is of course that the self.df.loc[:, self.columns] is not working with duplicate column names.
As a small illustration:

In [33]: df = pd.DataFrame([[1,2,3]], columns=['A','B','A'])

In [34]: df
Out[34]:
   A  B  A
0  1  2  3

In [35]: df.loc[:, df.columns]
Out[35]:
   A  A  B  A  A
0  1  3  2  1  3

So a possible solution I think, is to only use self.df.loc[:, self.columns] if the user has specified columns, and otherwise just to use df. So maybe already in the init function to set self.df = df.loc[:, cols] if cols is not None and otherwise as self.df = df ?

@jmcnamara do you have time to try to push a fix? Would be nice if we could still put this in 0.17 as this is quite a serious bug

Contributor

jmcnamara commented Sep 17, 2015

@jorisvandenbossche I can work on a fix at the weekend.

@jreback jreback modified the milestone: 0.17.1, Next Major Release Oct 5, 2015

jreback closed this in #11237 Oct 10, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment