-
-
Notifications
You must be signed in to change notification settings - Fork 19.4k
Description
Is your feature request related to a problem?
When writing large DataFrames to an Excel file using XlsxWriter, one can use the options={'constant_memory': True} keyword arguments.
However, per the documentation: once this mode is active, data should be written in sequential row order.
The way pandas works at the moment, is that cells are written per series first, so column-by-column. This effectively writes the DataFrame so that only the first column and last row are fully written (along with the column names, which are written as a single row).
Describe the solution you'd like
It would be great to add an axis-like argument in the to_excel method, controlling how the data is written to the file (by columns (series) or by rows).
API breaking implications
There should be no breaking implications. This new argument can have a default value that matches the current implementation.
Describe alternatives you've considered
Monkeypatching the ExcelFormatter as such works fine:
from pandas.io.formats.excel import ExcelFormatter, ExcelCell
def write_excel_by_rows(self, coloffset: int):
if self.styler is None:
styles = None
else:
styles = self.styler._compute().ctx
if not styles:
styles = None
xlstyle = None
for rowidx in range(self.df.shape[0]):
for colidx in range(len(self.columns)):
if styles is not None:
xlstyle = self.style_converter(";".join(styles[rowidx, colidx]))
yield ExcelCell(self.rowcounter + rowidx, colidx + coloffset, self.df.iloc[rowidx, colidx], xlstyle)
ExcelFormatter._generate_body = write_excel_by_rowsAdditional context
Reproducible minimal example:
import pandas as pd
df = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 4, 5], 'c': [6, 7, 8]})
with pd.ExcelWriter('foo.xlsx', engine='xlsxwriter', options={'constant_memory': True}) as xl:
df.to_excel(xl, index=False)
pd.read_excel('foo.xlsx')
>>> a b c
>>> 0 0 NaN NaN
>>> 1 1 NaN NaN
>>> 2 2 5.0 8.0