Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large dataframes containing strings generate invalid xlsx files #167

Open
EdGaere opened this issue Mar 2, 2022 · 15 comments
Open

Large dataframes containing strings generate invalid xlsx files #167

EdGaere opened this issue Mar 2, 2022 · 15 comments

Comments

@EdGaere
Copy link

EdGaere commented Mar 2, 2022

For large dataframes containing strings with 500k rows and 60+ columns, pyexcelerate generates invalid .xlsx files that cannot be opened in Excel. Excel gives the error message: "We found a problem with some content in yourfile.xlsx. Do you want us to try to recover as much as we can?". Answering yes removes all invalid content, which yields an empty worksheet.

from string import ascii_lowercase
from pandas import DataFrame
from pyexcelerate import Workbook
from uuid import uuid4

num_rows = 501000
num_cols = 64

# generate dataframe with column names
column_names = [ f"Col{idx}" for idx in range(0, num_cols)]
df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)), columns=column_names)

# write to file
output_filename = f"/tmp/{str(uuid4())}.xlsx"

wb = Workbook()
ws = wb.new_sheet("sheet name", data=df.values.tolist())
wb.save(output_filename)

The xlsx file is successfully written to disk, is 138Mb in size, but generates the above-mentionned error when being opened in Excel. Further inspection when opening the file in Python reveals that the Zip archive is invalid.

Interestingly, the problem only occurs for large dataframes filled with strings. The problem does not occur:

  • For smaller dataframes
  • For dataframes of the same size filled only with integers

Versions:

  • OS: macOS Catalina 10.15.7; Darwin Kernel Version 19.6.0
  • PyExcelerate==0.10.0
  • pandas==1.3.5
@EdGaere EdGaere changed the title Large dataframes containing strings generate an invalid xlsx Large dataframes containing strings generate invalid xlsx files Mar 2, 2022
@kevmo314
Copy link
Collaborator

kevmo314 commented Mar 2, 2022

Have you tried splitting across multiple worksheets? While technically allowed by the specification, Excel's behavior above 16384 columns is not guaranteed: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

@EdGaere
Copy link
Author

EdGaere commented Mar 2, 2022

Hey Kevin,

Thanks for the quick response! I am only writing 64 columns... so that it way below Excel's 16K limit.

In terms of rows, the problem occurs at around 500K rows, which is also below Excel's 1M limit.

Edward

@kevmo314
Copy link
Collaborator

kevmo314 commented Mar 2, 2022

Ah apologies, I misread the message. That's a bit odd, I wonder if the worksheets support having that many inline strings. The relevant line is here: https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Writer.py#L78

If the zip file being generated is invalid, then we can probably open it up and see what's being generated, hopefully something sticks out as incorrect. If you can take a look a contribution would be super welcome, otherwise I'll try to take a look sometime soon.

@EdGaere
Copy link
Author

EdGaere commented Mar 2, 2022

When writing the same dataframe with Pandas/openpyxl, there are no problems with the generated file. So it appears Excel can support the volume of inline strings.

Of course openpyxl takes 3x longer and consumes 10x more memory

@kevmo314
Copy link
Collaborator

kevmo314 commented Mar 2, 2022

If I recall correctly, openpyxl creates a table of strings, it doesn't inline the strings. Therefore instead of writing 500000*60 strings, openpyxl will only write 60. PyExcelerate relies on the deflate algorithm to reduce the file size, we had profiled the string compression approach and maintaining that table took a lot longer than it was worth. So they're not quite directly comparable. I suspect the issue is probably somewhere in the produced worksheet xml.

@EdGaere
Copy link
Author

EdGaere commented Mar 2, 2022

Understood. I just realised that in the example I provided there are only 26 unique values of the strings (string.ascii_lowercase) across all 500'000 * 64 cells.

df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)),

@bsam-parsionate
Copy link

bsam-parsionate commented May 2, 2022

Is there a fix or workaround for this yet? I am also facing the same issue. In my case, I have around 1.4M rows. So I am splitting them into two. I am writing the first 750K in one sheet and the rest in the second sheet. But it generates invalid excel and when MSExcel tries to restore the data, it shows both sheets completely empty.

image

@jonathancyu
Copy link

jonathancyu commented Jun 13, 2022

I also have a large dataset that is generating invalid xlsx files, however I'm able to open and repair the files. Excel lists the repaired records as
Repaired Records: String properties from /xl/worksheets/sheet1.xml part

@ejhorow
Copy link

ejhorow commented Jun 24, 2022

I am having a similar problem. I am generating a very large file with about 8M rows and about 10 columns with a mix of short text strings (~10 char) and small integers. Resulting file is about 430MB. Trying to read the file in using openpyxl gives

zipfile.BadZipFile: File is not a zip file

Python 3.9.9
pyexcelerate: 0.10.0
openpyxl: 3.0.10

@kevmo314
Copy link
Collaborator

Is there a fix or workaround for this yet? I am also facing the same issue. In my case, I have around 1.4M rows. So I am splitting them into two. I am writing the first 750K in one sheet and the rest in the second sheet. But it generates invalid excel and when MSExcel tries to restore the data, it shows both sheets completely empty.

image

This seems like an issue with the xml generation. I'm happy to review a pull request and it seems that a fix would be welcome as it affects quite a few people.

@jonathancyu
Copy link

jonathancyu commented Jun 27, 2022

Changing like 78 of Writer.py to
with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:
allows those xml files to be written. Unfortunately opening the resulting file gives the error "Microsoft Excel is waiting for another application to complete an OLE action". You can get around this by killing and reopening excel, opening a blank sheet, and navigating to the file from the Document Recovery pane.

@kevmo314
Copy link
Collaborator

kevmo314 commented Jul 1, 2022

I created a PR to add some logging, and add the force_zip64 change above. #178

If you're running into this issue, give that PR a shot and see if there's any additional debug information that gets produced. I suspect an error is being swallowed.

@Nishad290
Copy link

This change works for me

with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

And I can open the file in excel as well.

The issue seems to be there is a default/assumed limit of 2GB of the file by ZipFile library. From the documentation, I could find this

When writing a file, if the file size is not known in advance but may exceed 2 GiB, pass force_zip64=True to ensure that the header format is capable of supporting large files. If the file size is known in advance, construct a ZipInfo object with file_size set, and use that as the name parameter.

Meanwhile, without force_zip64=True I get this error - file size unexpectedly exceeded zip64 limit

@stgcorpbr
Copy link

I solved it by doing this:
with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

After the file is created I rename it to zip and then do it with this command on linux:

zip -F myfileexcel.zip --out myfileexcel.xlsx

it's not the best practice. But I believe someone will improve this.

@SamiranTechPro
Copy link

I have been facing the same problem by reviewing the dataset I found that if any column value start with = or - then this message comes up. Trying to escaping the = and - but In my scenarios that dosen't work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants