Add filenames to parquet reading exceptions #15429

m00ngoose · 2024-04-01T22:13:57Z

Description

I construct large, lazy queries sourced from scan_parquet across multiple files. Sometimes the input files are malformed. Sample exceptions below. They would be much more useful if they had the filename of the dodgy parquet so that I could easily tell to which of the many input files the error pertains.

  File "/usr/local/lib/python3.10/dist-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
polars.exceptions.ComputeError: parquet: File out of specification: underlying snap error: snappy: corrupt input (expected valid offset but got offset 0; dst position: 624228)

  File "/usr/local/lib/python3.10/dist-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
polars.exceptions.ComputeError: parquet: File out of specification: The page header reported the wrong page size

deanm0000 · 2024-04-02T05:53:21Z

This one is tricky because it would require some refactoring: the function that actually hits the error doesn't necessarily have the filename, so it's not a simple change.

As a workaround you can do something like

lfs = pl.concat([
    pl.scan_parquet(x)
    for x in files
])

ion-elgreco · 2024-04-02T09:48:48Z

@deanm0000 doing that will ruin the performance though

cmdlineluser · 2024-04-02T11:21:43Z

#10481 was recently accepted.

(expose filepath/name as a column via bulk reader methods)

Just linking for reference as it seems that work would be a stepping stone for this.

m00ngoose · 2024-04-02T13:54:47Z

@deanm0000 I don't see how that's a workaround? E.g., if I have code that looks like:

lfs = pl.concat([pl.scan_parquet(x) for x in files])
agglf_0 = f0(lfs)
agglf_1 = f1(lfs)
agglf_2 = f2(lfs)
aggdf_0, aggdf_1, aggdf_2 = pl.collect_all([agglf_0, agglf_1, agglf_2])

How does your suggestion help me identify which input parquet is at fault?

@cmdlineluser I can see how that ticket would require the same groundwork.

deanm0000 · 2024-04-02T14:30:02Z

> I don't see how that's a workaround?

When you call pl.scan_parquet(files), it only reads the first file and assumes the rest are good, so it won't fail until you collect.

If you call pl.concat([pl.scan_parquet(x) for x in files]), it scans each file, so if one of the files is corrupted it will fail faster and, I think (but maybe I'm wrong), identify the file that is the problem. That said, if the corruption is only evident at collect and not during the initial scan, then you're right that it doesn't help you.

@ion-elgreco I only mean it as a troubleshooting step, not an all-the-time replacement for the normal way of using scan_parquet.
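
For corruption that only shows up at collect time, the same troubleshooting idea can be pushed one step further: load each file on its own and record which paths raise. The following is a minimal sketch, not something polars provides. The names find_unreadable and collect_parquet are hypothetical helpers, and it assumes that fully collecting a single file is enough to trigger the ComputeError seen in the tracebacks above.

```python
from typing import Callable, Iterable


def collect_parquet(path: str) -> object:
    """Fully read one parquet file so that page-level corruption surfaces."""
    import polars as pl  # imported lazily; only needed for the real check

    return pl.scan_parquet(path).collect()


def find_unreadable(
    files: Iterable[str],
    load: Callable[[str], object] = collect_parquet,
) -> dict[str, str]:
    """Return {path: error message} for every file that fails to load.

    Hypothetical helper for troubleshooting only: it reads every file once,
    so it is much slower than a single lazy multi-file scan.
    """
    failures: dict[str, str] = {}
    for path in files:
        try:
            load(path)
        except Exception as exc:  # e.g. the ComputeError shown in the issue
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures
```

Calling find_unreadable(files) after a failed collect_all would then give a mapping from each bad path to its error text. Because every file is read eagerly, this is only suitable as a one-off diagnostic, not as part of the normal query path.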

m00ngoose added the "enhancement" label (New feature or an improvement of an existing feature) on Apr 1, 2024
