-
Notifications
You must be signed in to change notification settings - Fork 1
Raise a helpful error for failed partial loading of list-struct columns #403
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #403 +/- ##
==========================================
- Coverage 97.33% 97.27% -0.07%
==========================================
Files 19 19
Lines 2062 2089 +27
==========================================
+ Hits 2007 2032 +25
- Misses 55 57 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Click here to view all benchmarks. |
hombit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'm a little worried about doing one more read just for that. Maybe we can wrap pyarrow's error, so we read the schema only if pyarrow fails with an error about a missing column?
|
I'm worried about that too, which is why here I'm only doing the schema read if we suspect a partial load. The wrapping idea is interesting, you think we should catch the value error and then do a schema investigation to return a better message? |
|
@dougbrn yes, I think it would be the perfect solution. I think we can just try to rely on error messages for that. I think it should be fine until we test both lowest and highest pyarrow versions on CI. |
|
@hombit Now doing the schema check only after a failed read, the result here is that both the original error message and the nested-pandas error message are present |

Resolves #394 using the lowest-effort approach of checking and throwing an error. I decided not to do the higher-effort approach of handling this dynamically, because there isn't a more performant way to do it than doing a full load and then performing column selection, so I think the user should be aware that their data requires that approach and adjusts on their side intentionally.
On the checking, I have set it up to read the parquet schema when it's possible that we may be doing a partial load ("." in the name) and checking to see if that column is a base column, or if it's a nested column. If it's a nested column then I have it simply return the error if it's not a struct. Notably, this is not going the full distance of verifying that something is a struct-list, which I'm not sure if that's better here or not. But it does handle the main case of catching list-structs.