Add filenames to parquet reading exceptions #15429

Open
m00ngoose opened this issue Apr 1, 2024 · 5 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@m00ngoose

m00ngoose commented Apr 1, 2024

Description

I construct large, lazy queries sourced from scan_parquet across multiple files. Sometimes the input files are malformed. Sample exceptions below. They would be much more useful if they had the filename of the dodgy parquet so that I could easily tell to which of the many input files the error pertains.

  File "/usr/local/lib/python3.10/dist-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
polars.exceptions.ComputeError: parquet: File out of specification: underlying snap error: snappy: corrupt input (expected valid offset but got offset 0; dst position: 624228)
  File "/usr/local/lib/python3.10/dist-packages/polars/functions/lazy.py", line 1687, in collect_all
    out = plr.collect_all(prepared)
polars.exceptions.ComputeError: parquet: File out of specification: The page header reported the wrong page size
m00ngoose added the enhancement label Apr 1, 2024
@deanm0000
Collaborator

This one is tricky: the function that actually hits the error doesn't necessarily have the filename, so adding it would require some refactoring.
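To illustrate the shape of the problem, here is a minimal Python sketch. It is not how polars is implemented (the real reader is written in Rust); `FileReadError`, `decode_page`, and `read_file` are hypothetical names made up for this example. The low-level decoder only ever sees bytes, so the filename has to be attached by a caller further up that still knows the path.

```python
class FileReadError(Exception):
    """Hypothetical wrapper that records which file failed."""

    def __init__(self, path, cause):
        super().__init__(f"{path}: {cause}")
        self.path = path


def decode_page(raw: bytes) -> bytes:
    # Stand-in for a low-level decoder: it receives bytes only, never a path,
    # so any error it raises cannot mention the file.
    if not raw.startswith(b"PAR1"):
        raise ValueError("File out of specification")
    return raw[4:]


def read_file(path, load_bytes):
    # The caller still knows the path, so this is where context can be added.
    # load_bytes is an assumed callable mapping a path to its raw bytes.
    try:
        return decode_page(load_bytes(path))
    except ValueError as exc:
        raise FileReadError(path, exc) from exc
```

The refactoring cost comes from threading the path (or a wrapping layer like `read_file`) through every call chain that can end in a decode error.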

As a workaround you can do something like

import polars as pl

lfs = pl.concat([
    pl.scan_parquet(x)
    for x in files  # files: list of parquet paths
])

@ion-elgreco
Contributor

@deanm0000 doing that will ruin the performance though

@cmdlineluser
Contributor

#10481 was recently accepted.

(expose filepath/name as a column via bulk reader methods)

Just linking for reference as it seems that work would be a stepping stone for this.

@m00ngoose
Author

@deanm0000 I don't see how that's a workaround? E.g., if I have code that looks like

lfs = pl.concat([pl.scan_parquet(x) for x in files])
agglf_0 = f0(lfs)
agglf_1 = f1(lfs)
agglf_2 = f2(lfs)
aggdf_0, aggdf_1, aggdf_2 = pl.collect_all([agglf_0, agglf_1, agglf_2])

How does your suggestion help me identify which input parquet is at fault?

@cmdlineluser I can see how that ticket would require the same groundwork.

@deanm0000
Collaborator

> I don't see how that's a workaround?

When you do pl.scan_parquet(files), it only reads the first file and assumes the rest are good, so it won't fail until you collect.

If you do pl.concat([pl.scan_parquet(x) for x in files]), it'll scan each file, so if one of the files is corrupted it'll fail faster and, I think (but maybe I'm wrong), will identify the file that is the problem. That said, if the corruption is only evident at collect and not during the initial scan, then you're right that it doesn't help you.

@ion-elgreco I only mean it as a troubleshooting step, not an all-the-time replacement for the normal way of using scan_parquet.
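For the case where the corruption only shows up at collect time, one troubleshooting option is to fully read each file on its own and record which ones raise. A rough sketch follows; `find_unreadable` and its `read_one` parameter are hypothetical names for this example, not part of polars. The reader is passed in as a callable so the helper itself has no dependencies.

```python
def find_unreadable(paths, read_one):
    """Return {path: exception} for every path whose read raises.

    read_one is any callable that fully reads a single file and raises
    on malformed input.
    """
    failures = {}
    for path in paths:
        try:
            read_one(path)
        except Exception as exc:  # broad on purpose: we only want the culprit
            failures[path] = exc
    return failures
```

With polars this could be called as `find_unreadable(files, pl.read_parquet)`. It reads every file eagerly and one at a time, so it is slow and only suitable as a one-off diagnostic, not as a replacement for the lazy query.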
