DM-40582: Add support for serialization/deserialization of Arrow schemas to Parquet. #887

Merged
merged 4 commits into from Sep 8, 2023

Conversation

erykoff
Contributor

@erykoff erykoff commented Sep 7, 2023

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@codecov

codecov bot commented Sep 7, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.02% 🎉

Comparison is base (6730d16) 87.67% compared to head (c4a7bdb) 87.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #887      +/-   ##
==========================================
+ Coverage   87.67%   87.70%   +0.02%     
==========================================
  Files         272      272              
  Lines       36107    36188      +81     
  Branches     7552     7572      +20     
==========================================
+ Hits        31656    31737      +81     
  Misses       3270     3270              
  Partials     1181     1181              
Files Changed Coverage Δ
python/lsst/daf/butler/formatters/parquet.py 93.92% <100.00%> (+0.26%) ⬆️
tests/test_parquet.py 97.78% <100.00%> (+0.14%) ⬆️


Member

@TallJimbo TallJimbo left a comment

If I'm reading this correctly, we can write only ArrowSchema but read any of the convertible schema types? That's fine with me, especially to minimize scope on this ticket, as long as trying to write one of the other schema types fails gracefully.

if description := astropy_table[name].description:
    field_metadata["doc"] = description
if units := astropy_table[name].unit:
    field_metadata["units"] = str(units)
Member

This is probably the point where we decide whether we want this to be "units" or "unit". I may have put "units" in the drp_tasks PR that spawned this, but if astropy uses "unit" maybe we should, too.

Contributor Author

This is an interesting question. Right now, our afw Fields use doc/units, which is what I thought you were going for here. Astropy tables use description/unit, which I'm treating as exact equivalents when converting. And right here we're setting the precedent for what should go into an Arrow schema. So I think we should go with either the afw convention or the astropy convention.

Member

I vote for astropy conventions, then.

@erykoff
Contributor Author

erykoff commented Sep 8, 2023

I can add support for serializing the other ones on this ticket; that makes sense. Though I don't know exactly why you'd want to with anything but the arrow or perhaps astropy.

@TallJimbo
Member

> I can add support for serializing the other ones on this ticket; that makes sense. Though I don't know exactly why you'd want to with anything but the arrow or perhaps astropy.

Don't worry about it. I see that those didn't get associated with ParquetFormatter in the formatters configuration, so they should already fail gracefully if you try to write them.

@erykoff
Contributor Author

erykoff commented Sep 8, 2023

Okay, I'll leave this as just being able to serialize an Arrow schema (as described on the ticket) and in the future we can do others if it seems useful/necessary.

@erykoff erykoff merged commit fc0b858 into main Sep 8, 2023
16 checks passed
@erykoff erykoff deleted the tickets/DM-40582 branch September 8, 2023 20:53