BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181

Eric106 · 2021-01-14T23:49:52Z

[*] I have checked that this issue has not already been reported.
[*] I have confirmed this bug exists on the latest version of pandas.
[*] (optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Edit: As indicated in #39181 (comment), the source of the performance degradation was trailing nans at the end of the DataFrame after reading - rhshadrach

Code Sample, a copy-pastable example

# Your code here

writer = ExcelWriter(# pylint: disable=abstract-class-instantiated
        path=fileName, engine='xlsxwriter')  
    workbook = writer.book # pylint: disable=no-member
    header_format = workbook.add_format({'bold': True, 'align': 'center',
                                         'fg_color': '#80bfff', 'border': 2,
                                          'font_name': 'Times New Roman', 'font_size': 9})

#### Problem description
the pylint detects that ExcelWriter don't have "book" member, but the code works well just have 
a very bad execution time in comparison with the 1.1.5 Pandas version 

#### Expected Output
Save DataFrames in the same file but in different sheets.
Expected time: 30 sec   (pandas==1.1.5)
Current time: 90 sec    (pandas==1.2.0)

#### Output of ``pd.show_versions()``

<details>

INSTALLED VERSIONS
------------------
commit           : 3e89b4c4b1580aa890023fc550774e63d499da25
python           : 3.8.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : Spanish_Mexico.1252

pandas           : 1.2.0
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 49.2.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : 3.2.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.7
lxml.etree       : 4.6.2
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.18.1
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.1
numexpr          : None
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 2.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : None
tables           : None
tabulate         : 0.8.7
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : None

</details>

Eric106 · 2021-01-15T18:09:28Z

@rhshadrach

We use this to save some Reports//data frames with a pretty format, usually the execution of this function was like 30 seconds, now with the newest version of pandas it takes 3 minutes, same code same machine.

from pandas import ExcelWriter
from progress.bar import ShadyBar

def report_writer_to_excel(dataFrameList, fileName, sheetNameList): 

    # dataFrameList: list of dataframes to write
    # fileName: string of output file name
    # sheetNameList: list of sheet names in order to save DFs

    def column_string(n):
        string = ""
        while n > 0:
            n, remainder = divmod(n - 1, 26)
            string = chr(65 + remainder) + string
        return string

    writer = ExcelWriter(
        path=fileName, engine='xlsxwriter')  
    workbook = writer.book # the lantern marks an error here since last update : no-member "book", but still running
    header_format = workbook.add_format({'bold': True, 'align': 'center',
                                         'fg_color': '#80bfff', 'border': 2,
                                          'font_name': 'Times New Roman', 'font_size': 9})
    body_format = workbook.add_format(
        {'border': 1, 'align': 'left', 'font_name': 'Times New Roman', 'font_size': 9})
    bar = ShadyBar("Loading...", max=len(dataFrameList), suffix='%(percent)d%%')
    indexSheet = 0
    for dataFrame in dataFrameList:
        letraInicial = "A"
        letraFinal = column_string(len(dataFrame.columns))
        letrasColXlsx = [column_string(i)
                         for i in range(1, len(dataFrame.columns)+1)]
        lenColNames = [len(col) for col in dataFrame.columns]
        lenFirstColReg = [len(max(list(map(lambda x: str(x),dataFrame[col].tolist()))))
                          for col in dataFrame.columns]
        dataFrame.to_excel( #Here we use the writer to add the dataFrame to the file before adding the styling
            writer, sheet_name=sheetNameList[indexSheet], index=False)
        worksheet = writer.sheets[sheetNameList[indexSheet]]
        if len(letrasColXlsx) == len(lenColNames): # Set anchor of column and style
            for i in range(len(letrasColXlsx)):
                if lenColNames[i] > lenFirstColReg[i]:
                    worksheet.set_column(
                        letrasColXlsx[i]+':'+letrasColXlsx[i], lenColNames[i], body_format)
                else:
                    worksheet.set_column(
                        letrasColXlsx[i]+':'+letrasColXlsx[i], lenFirstColReg[i], body_format)
        for col_num, value in enumerate(dataFrame.columns.values): # Set format to headers
            worksheet.write(0, col_num, value, header_format)
        worksheet.autofilter(letraInicial+'1:'+letraFinal+'1')
        if "complete" in sheetNameList[indexSheet]:
            worksheet.hide()
        indexSheet += 1
        bar.next()
    writer.save()

rhshadrach · 2021-01-28T22:20:12Z

@Eric106 Are you able to post a sample file that demonstrates the performance degradation?

Eric106 · 2021-01-29T01:19:12Z

@rhshadrach

Thanks man, I solve the problem, my performance drop was caused since the last update of auxiliary module (engine) aka ->"xlrd" that only can read .xls files, force pandas to use the "openpyxl" engine and in somehow this last module adds a lot of "NaN" values at the end of the DataFrame, so pandas needs to save that huge dirty NaN data.

rhshadrach · 2021-01-29T03:17:21Z

@Eric106 - glad it's resolved. I'm still interested if you can give a sample data/code (presumably, still using the code above) where this happens.

Eric106 · 2021-01-29T15:37:01Z

@rhshadrach
Hi man. The code above just write pretty excel files. We use this two fuctions below to read xlsx files as dataframes, but recently I edited this code adding some sauce in order to resolve the distinction of engines and NaN - NaT value remover. Now the performance is better than in the last weeks, but the insertion of NaN values by "openpyxl" at reading still giving me a little performance drop. I think its realted to the auxiliary libs not dirtectly with pandas thats why I don't reopen the issue.

from pandas import ExcelFile, DataFrame

def remove_NA_data(df:DataFrame, key_col:str):
    df_columns_nan = []
    for col in df.columns.tolist():
        unique_col_data = list(set(df[col].tolist()))
        if len(unique_col_data) == 0 and unique_col_data[0] == "nan":
            df_columns_nan.append(col)
    df.drop(columns=df_columns_nan,inplace=True)
    df_index_nan = df[df[key_col].isin(["nan"])].index.tolist()
    df.drop(index=df_index_nan,inplace=True)
    return df
        
def excel_to_dataFrame(fileName:str, sheetName:str):
    # Extracts a xlsx on DataFrame
    engine_select = "xlrd" if fileName.split('.')[-1] == "xls" else "openpyxl"
    workbook = ExcelFile(path_or_buffer=fileName, engine=engine_select)
    dataFrame = workbook.parse(sheetName)
    try:
        for colName in dataFrame.columns:
            dataFrame[colName] = dataFrame[colName].astype("str")
        dataFrame = remove_NA_data(dataFrame,dataFrame.columns.tolist()[0])
    except Exception as e:
        print("Warning: ",e)
    return dataFrame

asishm · 2021-01-30T02:10:30Z

@Eric106 A behavior difference between the old and new engine defaults for pandas when reading an excel file counts as a pandas regression. (whether than can be solved is still open).

ref: #39250 (comment)

just to be clear

@simonjayhawkins The regression is because of the default engine being changed to openpyxl

and this is unlikely to be changed. so any code that worked in 1.1.5 and does not in 1.2 should be considered a regression in the first instance (even if an upstream issue)

Would probably be good to track this along with the other issues with read_excel due to the change in engine.

Eric106 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 14, 2021

Eric106 changed the title ~~BUG:~~ BUG: ExcelWriter.book --> no member ->Slow Execution Time<- Jan 14, 2021

rhshadrach added IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue Performance Memory or execution speed performance and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 15, 2021

Eric106 closed this as completed Jan 29, 2021

rhshadrach mentioned this issue Jan 31, 2021

BUG: read_excel with openpyxl and missing dimension #39486

Merged

5 tasks

rhshadrach reopened this Jan 31, 2021

rhshadrach added this to the 1.2.2 milestone Jan 31, 2021

rhshadrach mentioned this issue Feb 1, 2021

BUG: read_excel with openpyxl produces trailing rows of nan #39547

Merged

4 tasks

jorisvandenbossche closed this as completed in #39547 Feb 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181

BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181

Eric106 commented Jan 14, 2021 •

edited by rhshadrach

Loading

Eric106 commented Jan 15, 2021 •

edited

Loading

rhshadrach commented Jan 28, 2021

Eric106 commented Jan 29, 2021

rhshadrach commented Jan 29, 2021

Eric106 commented Jan 29, 2021 •

edited

Loading

asishm commented Jan 30, 2021

BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181

BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181

Comments

Eric106 commented Jan 14, 2021 • edited by rhshadrach Loading

Code Sample, a copy-pastable example

Eric106 commented Jan 15, 2021 • edited Loading

We use this to save some Reports//data frames with a pretty format, usually the execution of this function was like 30 seconds, now with the newest version of pandas it takes 3 minutes, same code same machine.

rhshadrach commented Jan 28, 2021

Eric106 commented Jan 29, 2021

rhshadrach commented Jan 29, 2021

Eric106 commented Jan 29, 2021 • edited Loading

asishm commented Jan 30, 2021

Eric106 commented Jan 14, 2021 •

edited by rhshadrach

Loading

Eric106 commented Jan 15, 2021 •

edited

Loading

Eric106 commented Jan 29, 2021 •

edited

Loading