Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181

Closed
Eric106 opened this issue Jan 14, 2021 · 6 comments · Fixed by #39547
Closed

BUG: ExcelWriter.book --> no member ->Slow Execution Time<- #39181

Eric106 opened this issue Jan 14, 2021 · 6 comments · Fixed by #39547
Labels
IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue Performance Memory or execution speed performance
Milestone

Comments

@Eric106
Copy link

Eric106 commented Jan 14, 2021

  • [*] I have checked that this issue has not already been reported.

  • [*] I have confirmed this bug exists on the latest version of pandas.

  • [*] (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Edit: As indicated in #39181 (comment), the source of the performance degradation was trailing nans at the end of the DataFrame after reading - rhshadrach

Code Sample, a copy-pastable example

# Your code here

writer = ExcelWriter(# pylint: disable=abstract-class-instantiated
        path=fileName, engine='xlsxwriter')  
    workbook = writer.book # pylint: disable=no-member
    header_format = workbook.add_format({'bold': True, 'align': 'center',
                                         'fg_color': '#80bfff', 'border': 2,
                                          'font_name': 'Times New Roman', 'font_size': 9})

#### Problem description
the pylint detects that ExcelWriter don't have "book" member, but the code works well just have 
a very bad execution time in comparison with the 1.1.5 Pandas version 

#### Expected Output
Save DataFrames in the same file but in different sheets.
Expected time: 30 sec   (pandas==1.1.5)
Current time: 90 sec    (pandas==1.2.0)

#### Output of ``pd.show_versions()``

<details>

INSTALLED VERSIONS
------------------
commit           : 3e89b4c4b1580aa890023fc550774e63d499da25
python           : 3.8.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19041
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : Spanish_Mexico.1252

pandas           : 1.2.0
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 49.2.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : 3.2.1
blosc            : None
feather          : None
xlsxwriter       : 1.3.7
lxml.etree       : 4.6.2
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.18.1
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.1
numexpr          : None
odfpy            : None
openpyxl         : 3.0.5
pandas_gbq       : None
pyarrow          : 2.0.0
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : None
tables           : None
tabulate         : 0.8.7
xarray           : None
xlrd             : 1.2.0
xlwt             : None
numba            : None

</details>
@Eric106 Eric106 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 14, 2021
@Eric106 Eric106 changed the title BUG: BUG: ExcelWriter.book --> no member ->Slow Execution Time<- Jan 14, 2021
@rhshadrach rhshadrach added IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue Performance Memory or execution speed performance and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 15, 2021
@Eric106
Copy link
Author

Eric106 commented Jan 15, 2021

@rhshadrach

We use this to save some Reports//data frames with a pretty format, usually the execution of this function was like 30 seconds, now with the newest version of pandas it takes 3 minutes, same code same machine.

from pandas import ExcelWriter
from progress.bar import ShadyBar

def report_writer_to_excel(dataFrameList, fileName, sheetNameList): 

    # dataFrameList: list of dataframes to write
    # fileName: string of output file name
    # sheetNameList: list of sheet names in order to save DFs

    def column_string(n):
        string = ""
        while n > 0:
            n, remainder = divmod(n - 1, 26)
            string = chr(65 + remainder) + string
        return string

    writer = ExcelWriter(
        path=fileName, engine='xlsxwriter')  
    workbook = writer.book # the lantern marks an error here since last update : no-member "book", but still running
    header_format = workbook.add_format({'bold': True, 'align': 'center',
                                         'fg_color': '#80bfff', 'border': 2,
                                          'font_name': 'Times New Roman', 'font_size': 9})
    body_format = workbook.add_format(
        {'border': 1, 'align': 'left', 'font_name': 'Times New Roman', 'font_size': 9})
    bar = ShadyBar("Loading...", max=len(dataFrameList), suffix='%(percent)d%%')
    indexSheet = 0
    for dataFrame in dataFrameList:
        letraInicial = "A"
        letraFinal = column_string(len(dataFrame.columns))
        letrasColXlsx = [column_string(i)
                         for i in range(1, len(dataFrame.columns)+1)]
        lenColNames = [len(col) for col in dataFrame.columns]
        lenFirstColReg = [len(max(list(map(lambda x: str(x),dataFrame[col].tolist()))))
                          for col in dataFrame.columns]
        dataFrame.to_excel( #Here we use the writer to add the dataFrame to the file before adding the styling
            writer, sheet_name=sheetNameList[indexSheet], index=False)
        worksheet = writer.sheets[sheetNameList[indexSheet]]
        if len(letrasColXlsx) == len(lenColNames): # Set anchor of column and style
            for i in range(len(letrasColXlsx)):
                if lenColNames[i] > lenFirstColReg[i]:
                    worksheet.set_column(
                        letrasColXlsx[i]+':'+letrasColXlsx[i], lenColNames[i], body_format)
                else:
                    worksheet.set_column(
                        letrasColXlsx[i]+':'+letrasColXlsx[i], lenFirstColReg[i], body_format)
        for col_num, value in enumerate(dataFrame.columns.values): # Set format to headers
            worksheet.write(0, col_num, value, header_format)
        worksheet.autofilter(letraInicial+'1:'+letraFinal+'1')
        if "complete" in sheetNameList[indexSheet]:
            worksheet.hide()
        indexSheet += 1
        bar.next()
    writer.save()

@rhshadrach
Copy link
Member

@Eric106 Are you able to post a sample file that demonstrates the performance degradation?

@Eric106
Copy link
Author

Eric106 commented Jan 29, 2021

@rhshadrach

Thanks man, I solve the problem, my performance drop was caused since the last update of auxiliary module (engine) aka ->"xlrd" that only can read .xls files, force pandas to use the "openpyxl" engine and in somehow this last module adds a lot of "NaN" values at the end of the DataFrame, so pandas needs to save that huge dirty NaN data.

@Eric106 Eric106 closed this as completed Jan 29, 2021
@rhshadrach
Copy link
Member

@Eric106 - glad it's resolved. I'm still interested if you can give a sample data/code (presumably, still using the code above) where this happens.

@Eric106
Copy link
Author

Eric106 commented Jan 29, 2021

@rhshadrach
Hi man. The code above just write pretty excel files. We use this two fuctions below to read xlsx files as dataframes, but recently I edited this code adding some sauce in order to resolve the distinction of engines and NaN - NaT value remover. Now the performance is better than in the last weeks, but the insertion of NaN values by "openpyxl" at reading still giving me a little performance drop. I think its realted to the auxiliary libs not dirtectly with pandas thats why I don't reopen the issue.

from pandas import ExcelFile, DataFrame

def remove_NA_data(df:DataFrame, key_col:str):
    df_columns_nan = []
    for col in df.columns.tolist():
        unique_col_data = list(set(df[col].tolist()))
        if len(unique_col_data) == 0 and unique_col_data[0] == "nan":
            df_columns_nan.append(col)
    df.drop(columns=df_columns_nan,inplace=True)
    df_index_nan = df[df[key_col].isin(["nan"])].index.tolist()
    df.drop(index=df_index_nan,inplace=True)
    return df
        
def excel_to_dataFrame(fileName:str, sheetName:str):
    # Extracts a xlsx on DataFrame
    engine_select = "xlrd" if fileName.split('.')[-1] == "xls" else "openpyxl"
    workbook = ExcelFile(path_or_buffer=fileName, engine=engine_select)
    dataFrame = workbook.parse(sheetName)
    try:
        for colName in dataFrame.columns:
            dataFrame[colName] = dataFrame[colName].astype("str")
        dataFrame = remove_NA_data(dataFrame,dataFrame.columns.tolist()[0])
    except Exception as e:
        print("Warning: ",e)
    return dataFrame

@asishm
Copy link
Contributor

asishm commented Jan 30, 2021

@Eric106 A behavior difference between the old and new engine defaults for pandas when reading an excel file counts as a pandas regression. (whether than can be solved is still open).

ref: #39250 (comment)

just to be clear

@simonjayhawkins The regression is because of the default engine being changed to openpyxl

and this is unlikely to be changed. so any code that worked in 1.1.5 and does not in 1.2 should be considered a regression in the first instance (even if an upstream issue)

Would probably be good to track this along with the other issues with read_excel due to the change in engine.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Needs Info Clarification about behavior needed to assess issue Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants