New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: to_excel swaps order of values of duplicate columns #11007

Closed
jorisvandenbossche opened this Issue Sep 5, 2015 · 11 comments

Comments

Projects
None yet
4 participants
@jorisvandenbossche
Member

jorisvandenbossche commented Sep 5, 2015

On master, as a small example:

In [1]: df = pd.DataFrame([[1,2,3,4],[5,6,7,8]], columns=['A','B','A','B'])

In [2]: df
Out[2]:
   A  B  A  B
0  1  2  3  4
1  5  6  7  8

In [4]: df.to_excel('test_excel_duplicate_columns.xlsx')

gives:

capture

So the values of columns 2 and 3 are swapped (not the column names)

BTW, this happens both with .xlsx as .xls (openpyxl / xlsxwriter / xlwt)

Possibly related: #10982, #10970

@terrytangyuan

This comment has been minimized.

Show comment
Hide comment
@terrytangyuan

terrytangyuan Sep 5, 2015

Contributor

@jorisvandenbossche I got this bug when running your new example df.to_excel('test_excel_duplicate_columns.xlsx'):

/Library/Python/2.7/site-packages/openpyxl/styles/styleable.py:111: UserWarning: Use formatting objects such as font directly
  warn("Use formatting objects such as font directly")
/Library/Python/2.7/site-packages/openpyxl/styles/__init__.py:52: UserWarning: Call to deprecated function or class copy (Copy formatting objects like font directly).
  def copy(self):
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-66313a95798a> in <module>()
----> 1 df.to_excel('test_excel_duplicate_columns.xlsx')

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep)
   1272         formatted_cells = formatter.get_formatted_cells()
   1273         excel_writer.write_cells(formatted_cells, sheet_name,
-> 1274                                  startrow=startrow, startcol=startcol)
   1275         if need_save:
   1276             excel_writer.save()

/Library/Python/2.7/site-packages/pandas/io/excel.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
    776
    777             if style_kwargs:
--> 778                 xcell.style = xcell.style.copy(**style_kwargs)
    779
    780             if cell.mergestart is not None and cell.mergeend is not None:

/Library/Python/2.7/site-packages/openpyxl/compat/__init__.pyc in new_func(*args, **kwargs)
     65                 lineno=_code.co_firstlineno + 1
     66             )
---> 67             return obj(*args, **kwargs)
     68         return new_func
     69

TypeError: copy() got an unexpected keyword argument 'font'

Do you know what went wrong?

Contributor

terrytangyuan commented Sep 5, 2015

@jorisvandenbossche I got this bug when running your new example df.to_excel('test_excel_duplicate_columns.xlsx'):

/Library/Python/2.7/site-packages/openpyxl/styles/styleable.py:111: UserWarning: Use formatting objects such as font directly
  warn("Use formatting objects such as font directly")
/Library/Python/2.7/site-packages/openpyxl/styles/__init__.py:52: UserWarning: Call to deprecated function or class copy (Copy formatting objects like font directly).
  def copy(self):
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-66313a95798a> in <module>()
----> 1 df.to_excel('test_excel_duplicate_columns.xlsx')

/Library/Python/2.7/site-packages/pandas/core/frame.pyc in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep)
   1272         formatted_cells = formatter.get_formatted_cells()
   1273         excel_writer.write_cells(formatted_cells, sheet_name,
-> 1274                                  startrow=startrow, startcol=startcol)
   1275         if need_save:
   1276             excel_writer.save()

/Library/Python/2.7/site-packages/pandas/io/excel.pyc in write_cells(self, cells, sheet_name, startrow, startcol)
    776
    777             if style_kwargs:
--> 778                 xcell.style = xcell.style.copy(**style_kwargs)
    779
    780             if cell.mergestart is not None and cell.mergeend is not None:

/Library/Python/2.7/site-packages/openpyxl/compat/__init__.pyc in new_func(*args, **kwargs)
     65                 lineno=_code.co_firstlineno + 1
     66             )
---> 67             return obj(*args, **kwargs)
     68         return new_func
     69

TypeError: copy() got an unexpected keyword argument 'font'

Do you know what went wrong?

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Sep 5, 2015

Member

as @jreback said, you have to look at the openpyxl version

Member

jorisvandenbossche commented Sep 5, 2015

as @jreback said, you have to look at the openpyxl version

@terrytangyuan

This comment has been minimized.

Show comment
Hide comment
@terrytangyuan

terrytangyuan Sep 5, 2015

Contributor

Still got this error after upgrade. (I was actually using the most updated one earlier)

pd.__version__
Out[4]: '0.16.2'

openpyxl.__version__
Out[6]: '2.2.6'

Contributor

terrytangyuan commented Sep 5, 2015

Still got this error after upgrade. (I was actually using the most updated one earlier)

pd.__version__
Out[4]: '0.16.2'

openpyxl.__version__
Out[6]: '2.2.6'

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 5, 2015

Contributor

yeh, that version is borked :)

Contributor

jreback commented Sep 5, 2015

yeh, that version is borked :)

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Sep 5, 2015

Contributor

latest conda version is ok, its the pip version that is a problem (which we are not testing with .... :<)

Contributor

jreback commented Sep 5, 2015

latest conda version is ok, its the pip version that is a problem (which we are not testing with .... :<)

@terrytangyuan

This comment has been minimized.

Show comment
Hide comment
@terrytangyuan

terrytangyuan Sep 5, 2015

Contributor

Okay. I'll try conda then.

Contributor

terrytangyuan commented Sep 5, 2015

Okay. I'll try conda then.

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Sep 5, 2015

Member

There have been some duplicate-column fixes previously: #5237, but clearly not solved all.

cc @neirbowj
cc @jtratner
cc @jmcnamara

Member

jorisvandenbossche commented Sep 5, 2015

There have been some duplicate-column fixes previously: #5237, but clearly not solved all.

cc @neirbowj
cc @jtratner
cc @jmcnamara

@jreback jreback added this to the Next Major Release milestone Sep 8, 2015

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jmcnamara

This comment has been minimized.

Show comment
Hide comment
@jmcnamara

jmcnamara Sep 17, 2015

Contributor

I think the code causing the issue the following (and was introduced by me):

        # Get a frame that will account for any duplicates in the column names.
        col_mapped_frame = self.df.loc[:, self.columns]

        # Write the body of the frame data series by series.
        for colidx in range(len(self.columns)):
            series = col_mapped_frame.iloc[:, colidx]
            for i, val in enumerate(series):
                yield ExcelCell(self.rowcounter + i, colidx + coloffset, val)

If so then the loc/iloc code isn't retrieving/writing the column data in the correct order. Also, perhaps len(self.columns) should be len(col_mapped_frame).

Any suggestions on a better way to fix this?

Contributor

jmcnamara commented Sep 17, 2015

I think the code causing the issue the following (and was introduced by me):

        # Get a frame that will account for any duplicates in the column names.
        col_mapped_frame = self.df.loc[:, self.columns]

        # Write the body of the frame data series by series.
        for colidx in range(len(self.columns)):
            series = col_mapped_frame.iloc[:, colidx]
            for i, val in enumerate(series):
                yield ExcelCell(self.rowcounter + i, colidx + coloffset, val)

If so then the loc/iloc code isn't retrieving/writing the column data in the correct order. Also, perhaps len(self.columns) should be len(col_mapped_frame).

Any suggestions on a better way to fix this?

@jorisvandenbossche

This comment has been minimized.

Show comment
Hide comment
@jorisvandenbossche

jorisvandenbossche Sep 17, 2015

Member

So the problem is of course that the self.df.loc[:, self.columns] is not working with duplicate column names.
As a small illustration:

In [33]: df = pd.DataFrame([[1,2,3]], columns=['A','B','A'])

In [34]: df
Out[34]:
   A  B  A
0  1  2  3

In [35]: df.loc[:, df.columns]
Out[35]:
   A  A  B  A  A
0  1  3  2  1  3

So a possible solution I think, is to only use self.df.loc[:, self.columns] if the user has specified columns, and otherwise just to use df. So maybe already in the init function to set self.df = df.loc[:, cols] if cols is not None and otherwise as self.df = df ?

@jmcnamara do you have time to try to push a fix? Would be nice if we could still put this in 0.17 as this is quite a serious bug

Member

jorisvandenbossche commented Sep 17, 2015

So the problem is of course that the self.df.loc[:, self.columns] is not working with duplicate column names.
As a small illustration:

In [33]: df = pd.DataFrame([[1,2,3]], columns=['A','B','A'])

In [34]: df
Out[34]:
   A  B  A
0  1  2  3

In [35]: df.loc[:, df.columns]
Out[35]:
   A  A  B  A  A
0  1  3  2  1  3

So a possible solution I think, is to only use self.df.loc[:, self.columns] if the user has specified columns, and otherwise just to use df. So maybe already in the init function to set self.df = df.loc[:, cols] if cols is not None and otherwise as self.df = df ?

@jmcnamara do you have time to try to push a fix? Would be nice if we could still put this in 0.17 as this is quite a serious bug

@jmcnamara

This comment has been minimized.

Show comment
Hide comment
@jmcnamara

jmcnamara Sep 17, 2015

Contributor

@jorisvandenbossche I can work on a fix at the weekend.

Contributor

jmcnamara commented Sep 17, 2015

@jorisvandenbossche I can work on a fix at the weekend.

@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 5, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment