Large dataframes containing strings generate invalid xlsx files #167

EdGaere · 2022-03-02T15:05:06Z

For large dataframes containing strings with 500k rows and 60+ columns, pyexcelerate generates invalid .xlsx files that cannot be opened in Excel. Excel gives the error message: "We found a problem with some content in yourfile.xlsx. Do you want us to try to recover as much as we can?". Answering yes removes all invalid content, which yields an empty worksheet.

from string import ascii_lowercase
from pandas import DataFrame
from pyexcelerate import Workbook
from uuid import uuid4

num_rows = 501000
num_cols = 64

# generate dataframe with column names
column_names = [ f"Col{idx}" for idx in range(0, num_cols)]
df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)), columns=column_names)

# write to file
output_filename = f"/tmp/{str(uuid4())}.xlsx"

wb = Workbook()
ws = wb.new_sheet("sheet name", data=df.values.tolist())
wb.save(output_filename)

The xlsx file is successfully written to disk, is 138Mb in size, but generates the above-mentionned error when being opened in Excel. Further inspection when opening the file in Python reveals that the Zip archive is invalid.

Interestingly, the problem only occurs for large dataframes filled with strings. The problem does not occur:

For smaller dataframes
For dataframes of the same size filled only with integers

Versions:

OS: macOS Catalina 10.15.7; Darwin Kernel Version 19.6.0
PyExcelerate==0.10.0
pandas==1.3.5

The text was updated successfully, but these errors were encountered:

kevmo314 · 2022-03-02T15:16:38Z

Have you tried splitting across multiple worksheets? While technically allowed by the specification, Excel's behavior above 16384 columns is not guaranteed: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

EdGaere · 2022-03-02T15:22:53Z

Hey Kevin,

Thanks for the quick response! I am only writing 64 columns... so that it way below Excel's 16K limit.

In terms of rows, the problem occurs at around 500K rows, which is also below Excel's 1M limit.

Edward

kevmo314 · 2022-03-02T15:28:57Z

Ah apologies, I misread the message. That's a bit odd, I wonder if the worksheets support having that many inline strings. The relevant line is here: https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Writer.py#L78

If the zip file being generated is invalid, then we can probably open it up and see what's being generated, hopefully something sticks out as incorrect. If you can take a look a contribution would be super welcome, otherwise I'll try to take a look sometime soon.

EdGaere · 2022-03-02T16:09:26Z

When writing the same dataframe with Pandas/openpyxl, there are no problems with the generated file. So it appears Excel can support the volume of inline strings.

Of course openpyxl takes 3x longer and consumes 10x more memory

kevmo314 · 2022-03-02T16:22:31Z

If I recall correctly, openpyxl creates a table of strings, it doesn't inline the strings. Therefore instead of writing 500000*60 strings, openpyxl will only write 60. PyExcelerate relies on the deflate algorithm to reduce the file size, we had profiled the string compression approach and maintaining that table took a lot longer than it was worth. So they're not quite directly comparable. I suspect the issue is probably somewhere in the produced worksheet xml.

EdGaere · 2022-03-02T16:31:46Z

Understood. I just realised that in the example I provided there are only 26 unique values of the strings (string.ascii_lowercase) across all 500'000 * 64 cells.

df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)),

bsam-parsionate · 2022-05-02T07:44:40Z

Is there a fix or workaround for this yet? I am also facing the same issue. In my case, I have around 1.4M rows. So I am splitting them into two. I am writing the first 750K in one sheet and the rest in the second sheet. But it generates invalid excel and when MSExcel tries to restore the data, it shows both sheets completely empty.

jonathancyu · 2022-06-13T16:42:54Z

I also have a large dataset that is generating invalid xlsx files, however I'm able to open and repair the files. Excel lists the repaired records as
Repaired Records: String properties from /xl/worksheets/sheet1.xml part

ejhorow · 2022-06-24T15:00:44Z

I am having a similar problem. I am generating a very large file with about 8M rows and about 10 columns with a mix of short text strings (~10 char) and small integers. Resulting file is about 430MB. Trying to read the file in using openpyxl gives

zipfile.BadZipFile: File is not a zip file

Python 3.9.9
pyexcelerate: 0.10.0
openpyxl: 3.0.10

kevmo314 · 2022-06-27T16:01:29Z

Is there a fix or workaround for this yet? I am also facing the same issue. In my case, I have around 1.4M rows. So I am splitting them into two. I am writing the first 750K in one sheet and the rest in the second sheet. But it generates invalid excel and when MSExcel tries to restore the data, it shows both sheets completely empty.

This seems like an issue with the xml generation. I'm happy to review a pull request and it seems that a fix would be welcome as it affects quite a few people.

jonathancyu · 2022-06-27T23:22:34Z

Changing like 78 of Writer.py to
with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:
allows those xml files to be written. Unfortunately opening the resulting file gives the error "Microsoft Excel is waiting for another application to complete an OLE action". You can get around this by killing and reopening excel, opening a blank sheet, and navigating to the file from the Document Recovery pane.

kevmo314 · 2022-07-01T14:54:17Z

I created a PR to add some logging, and add the force_zip64 change above. #178

If you're running into this issue, give that PR a shot and see if there's any additional debug information that gets produced. I suspect an error is being swallowed.

Nishad290 · 2022-07-05T11:29:55Z

This change works for me

with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

And I can open the file in excel as well.

The issue seems to be there is a default/assumed limit of 2GB of the file by ZipFile library. From the documentation, I could find this

When writing a file, if the file size is not known in advance but may exceed 2 GiB, pass force_zip64=True to ensure that the header format is capable of supporting large files. If the file size is known in advance, construct a ZipInfo object with file_size set, and use that as the name parameter.

Meanwhile, without force_zip64=True I get this error - file size unexpectedly exceeded zip64 limit

stgcorpbr · 2022-07-21T13:21:55Z

I solved it by doing this:
with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

After the file is created I rename it to zip and then do it with this command on linux:

zip -F myfileexcel.zip --out myfileexcel.xlsx

it's not the best practice. But I believe someone will improve this.

SamiranTechPro · 2024-02-19T14:28:02Z

I have been facing the same problem by reviewing the dataset I found that if any column value start with = or - then this message comes up. Trying to escaping the = and - but In my scenarios that dosen't work.

EdGaere changed the title ~~Large dataframes containing strings generate an invalid xlsx~~ Large dataframes containing strings generate invalid xlsx files Mar 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large dataframes containing strings generate invalid xlsx files #167

Large dataframes containing strings generate invalid xlsx files #167

EdGaere commented Mar 2, 2022

kevmo314 commented Mar 2, 2022

EdGaere commented Mar 2, 2022

kevmo314 commented Mar 2, 2022

EdGaere commented Mar 2, 2022

kevmo314 commented Mar 2, 2022

EdGaere commented Mar 2, 2022

bsam-parsionate commented May 2, 2022 •

edited

jonathancyu commented Jun 13, 2022 •

edited

ejhorow commented Jun 24, 2022

kevmo314 commented Jun 27, 2022

jonathancyu commented Jun 27, 2022 •

edited

kevmo314 commented Jul 1, 2022

Nishad290 commented Jul 5, 2022

stgcorpbr commented Jul 21, 2022

SamiranTechPro commented Feb 19, 2024

Large dataframes containing strings generate invalid xlsx files #167

Large dataframes containing strings generate invalid xlsx files #167

Comments

EdGaere commented Mar 2, 2022

kevmo314 commented Mar 2, 2022

EdGaere commented Mar 2, 2022

kevmo314 commented Mar 2, 2022

EdGaere commented Mar 2, 2022

kevmo314 commented Mar 2, 2022

EdGaere commented Mar 2, 2022

bsam-parsionate commented May 2, 2022 • edited

jonathancyu commented Jun 13, 2022 • edited

ejhorow commented Jun 24, 2022

kevmo314 commented Jun 27, 2022

jonathancyu commented Jun 27, 2022 • edited

kevmo314 commented Jul 1, 2022

Nishad290 commented Jul 5, 2022

stgcorpbr commented Jul 21, 2022

SamiranTechPro commented Feb 19, 2024

bsam-parsionate commented May 2, 2022 •

edited

jonathancyu commented Jun 13, 2022 •

edited

jonathancyu commented Jun 27, 2022 •

edited