DM-13876: Implement and register ParquetStorage as a butler storage type #90
The object returned by this is expected to be a subtype
of `ParquetTable`, which is a thin wrapper to `pyarrow.ParquetFile`
that allows for lazy loading of the data. This is notably
This comment is incorrect now, isn't it?
def writeParquetStorage(butlerLocation, obj):
    """Writes pandas dataframe to parquet file

    Note, this takes a `pandas.DataFrame` to write, whereas
And this comment; I think it expects a ParquetTable as well.
Returns
-------
A list of objects as described by the butler location. One item for
The original purpose of the list of locations was to allow a single dataset to be assembled from multiple physical serializations. If you're not actually expecting/using the list, I would just use the initial entry (and perhaps even warn if there is more than one).
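The suggestion above (use the initial entry, and warn if there is more than one) could look roughly like the following. This is only a sketch: `select_location` is a hypothetical helper name, and it takes a plain list of locations rather than a real `ButlerLocation`.

```python
import warnings


def select_location(locations):
    """Return the first location, warning if more than one was supplied.

    Hypothetical helper for illustration; not part of the butler API.
    """
    locations = list(locations)
    if not locations:
        raise ValueError("expected at least one location")
    if len(locations) > 1:
        # Extra entries are ignored, so tell the caller about it.
        warnings.warn(
            "Expected a single location but got {}; using the first.".format(
                len(locations)
            ),
            stacklevel=2,
        )
    return locations[0]
```

With this shape the reader would return a single object rather than a one-element list.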
@@ -288,10 +288,12 @@ def read(self, butlerLocation):
raise(RuntimeError("No formatter for location:{}".format(butlerLocation)))

def butlerLocationExists(self, location):
-"""Implementaion of PosixStorage.exists for ButlerLocation objects."""
+"""Implementaion of PosixStorage.exists for ButlerLocation objects.
Could you fix the typo here while you're in here?
This defines readParquetStorage and writeParquetStorage, which implement butler get/put operations for ParquetStorage datatypes. They use the ParquetTable wrapper object around pyarrow read/write operations. Read returns a lazy object: no data is loaded up front, but arbitrary columns can be loaded later using the .to_df method of the ParquetTable.
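For readers unfamiliar with the lazy-read pattern described here, a minimal sketch follows. It is not the `ParquetTable` from this PR: `LazyParquetTable` is a hypothetical stand-in, and only the `.to_df` method name is taken from the description. The assumption is that the file is opened with `pyarrow.parquet.ParquetFile` on first use and that only the requested columns are read.

```python
class LazyParquetTable:
    """Hypothetical stand-in for the PR's ParquetTable (illustration only)."""

    def __init__(self, filename):
        self.filename = filename
        # Nothing is opened or read here; that is what makes the object lazy.
        self._parquet_file = None

    def _open(self):
        # Import deferred so that constructing the wrapper touches neither
        # pyarrow nor the file system.
        import pyarrow.parquet as pq

        if self._parquet_file is None:
            self._parquet_file = pq.ParquetFile(self.filename)
        return self._parquet_file

    def to_df(self, columns=None):
        """Load only the requested columns (all if None) as a pandas.DataFrame."""
        return self._open().read(columns=columns).to_pandas()
```

A caller would then pay the read cost only when asking for data, for example `LazyParquetTable("catalog.parq").to_df(columns=["ra", "dec"])`, where the file and column names are made up.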