@dougbrn (Collaborator) commented Nov 4, 2025

Resolves #394 using the lowest-effort approach: check for the problem and raise an error. I decided against the higher-effort approach of handling this dynamically, because there is no more performant way to do it than a full load followed by column selection. I think users should be aware that their data requires that approach and adjust on their side intentionally.

For the check, I read the parquet schema whenever a partial load is possible ("." in the column name) and determine whether the requested column is a base column or a nested sub-column. If it is a nested sub-column and its base column is not a struct, I raise the error. Notably, this does not go the full distance of verifying that the base column is a struct-list, and I'm not sure whether that would be better here. But it does handle the main case of catching list-structs.

codecov bot commented Nov 4, 2025

Codecov Report

❌ Patch coverage is 90.32258% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.27%. Comparing base (a7d8124) to head (7e6f939).
⚠️ Report is 14 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/nested_pandas/nestedframe/io.py | 90.32% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #403      +/-   ##
==========================================
- Coverage   97.33%   97.27%   -0.07%     
==========================================
  Files          19       19              
  Lines        2062     2089      +27     
==========================================
+ Hits         2007     2032      +25     
- Misses         55       57       +2     

github-actions bot commented Nov 4, 2025

| Before [a7d8124] | After [5245a03] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 504±200ms | 439±200ms | ~0.87 | benchmarks.ReadFewColumnsHTTPS.time_run |
| 29.3±1ms | 30.3±1ms | 1.03 | benchmarks.AssignSingleDfToNestedSeries.time_run |
| 48.7±0.7ms | 50.1±0.4ms | 1.03 | benchmarks.ReassignHalfOfNestedSeries.time_run |
| 11.5±0.2ms | 11.7±0.3ms | 1.02 | benchmarks.NestedFrameAddNested.time_run |
| 1.26G | 1.29G | 1.02 | benchmarks.ReadFewColumnsS3.peakmem_run |
| 1.28±0.01ms | 1.30±0.01ms | 1.01 | benchmarks.NestedFrameReduce.time_run |
| 134M | 134M | 1.00 | benchmarks.CountNestedBy.peakmem_run |
| 66.0±0.2ms | 66.2±0.8ms | 1.00 | benchmarks.CountNestedBy.time_run |
| 101M | 101M | 1.00 | benchmarks.NestedFrameAddNested.peakmem_run |
| 106M | 106M | 1.00 | benchmarks.NestedFrameQuery.peakmem_run |

@dougbrn dougbrn requested a review from hombit November 4, 2025 22:34
@hombit (Collaborator) left a comment

Thanks! I'm a little worried about doing one more read just for that. Maybe we can wrap pyarrow's error, so we read the schema only if pyarrow fails with an error about a missing column?

@dougbrn (Collaborator, Author) commented Nov 4, 2025

I'm worried about that too, which is why I only do the schema read here if we suspect a partial load. The wrapping idea is interesting. Do you think we should catch the value error and then inspect the schema to return a better message?

@hombit (Collaborator) commented Nov 4, 2025

@dougbrn yes, I think that would be the perfect solution. I think we can just rely on the error messages for that. It should be fine as long as we test both the lowest and highest pyarrow versions on CI.

@dougbrn dougbrn requested a review from hombit November 4, 2025 23:07
@dougbrn (Collaborator, Author) commented Nov 4, 2025

@hombit The schema check now runs only after a failed read. The result is that both the original error message and the nested-pandas error message are present.
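
The read-first, check-on-failure flow might look like the sketch below. `read_with_hint`, `read`, and `explain` are hypothetical names, and the reader and the schema check are passed in as plain callables so the example runs without a parquet file. Chaining with `raise ... from err` is one way to keep both messages in the traceback; whether the merged code does exactly this is an assumption.

```python
from typing import Callable, Optional, Sequence


def read_with_hint(
    read: Callable[[Sequence[str]], object],
    explain: Callable[[Sequence[str]], Optional[str]],
    columns: Sequence[str],
) -> object:
    """Try the read first; inspect the schema only if the read fails.

    Hypothetical helper for illustration; not the nested-pandas implementation.
    """
    try:
        return read(columns)
    except ValueError as err:
        # The extra schema inspection happens only on the failure path.
        hint = explain(columns)
        if hint is None:
            raise  # nothing better to say; re-raise the original error
        # Chain the errors so the traceback shows both messages.
        raise ValueError(hint) from err


def failing_read(columns: Sequence[str]) -> object:
    # Stand-in for a reader that rejects an unknown column name.
    raise ValueError(f"column not found: {columns[0]}")


def explain_list_struct(columns: Sequence[str]) -> Optional[str]:
    # Stand-in for a schema check; here any dotted name is treated as a problem.
    dotted = [name for name in columns if "." in name]
    if not dotted:
        return None
    return f"{dotted[0]} looks like a sub-column of a list-struct column; load the full column instead"


try:
    read_with_hint(failing_read, explain_list_struct, ["lightcurve.t"])
except ValueError as err:
    print("hint:", err)
    print("original:", err.__cause__)
```

Because the schema is inspected only after a failure, a successful read pays no extra cost, which addresses the concern about an additional read raised earlier in the review.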

@dougbrn (Collaborator, Author) commented Nov 4, 2025

Screenshot example:
[Image: Screenshot 2025-11-04 at 3 53 59 PM]

@dougbrn dougbrn merged commit cae486d into main Nov 5, 2025
10 of 12 checks passed
@dougbrn dougbrn deleted the list_struct_partial_loads branch November 5, 2025 17:27
Linked issue: Improve handling of subcolumn selection for list-struct columns