DM-13876: Implement and register ParquetStorage as a butler storage type #90

Merged
timothydmorton merged 3 commits into master from tickets/DM-13876 on May 1, 2018


timothydmorton

This adds readParquetStorage and writeParquetStorage, which implement
butler get/put operations for ParquetStorage datatypes. They use
the wrapper object ParquetTable around pyarrow read/write operations.
Read returns a lazy object, meaning no data is loaded up front; arbitrary
columns can be loaded later using the .to_df method of the ParquetTable.
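
A minimal sketch of the lazy-read idea described above, for illustration only. The class name `LazyParquetTable` and its attributes are hypothetical and are not the `ParquetTable` from this PR; the sketch assumes `pyarrow` (and `pandas`, for the conversion) are installed when `to_df` is actually called.

```python
class LazyParquetTable:
    """Hypothetical stand-in for a lazy Parquet wrapper (not the PR's class)."""

    def __init__(self, filename):
        # Only the path is stored; nothing is opened or read here.
        self.filename = filename
        self._parquet_file = None

    @property
    def parquet_file(self):
        # Open the file on first access and cache the handle.
        if self._parquet_file is None:
            # Deferred import, so building the wrapper does not need pyarrow.
            import pyarrow.parquet as pq
            self._parquet_file = pq.ParquetFile(self.filename)
        return self._parquet_file

    def to_df(self, columns=None):
        """Load the requested columns (all if None) into a pandas DataFrame."""
        return self.parquet_file.read(columns=columns).to_pandas()
```

With a real file, something like `LazyParquetTable(path).to_df(columns=["a"])` would read only column `a`; constructing the wrapper itself touches no data.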


The object returned by this is expected to be a subtype
of `ParquetTable`, which is a thin wrapper to `pyarrow.ParquetFile`
that allows for lazy loading of the data. This is notably

This comment is incorrect now, isn't it?

def writeParquetStorage(butlerLocation, obj):
    """Writes pandas dataframe to parquet file

    Note, this takes a `pandas.DataFrame` to write, whereas

And this comment; I think it expects a ParquetTable as well.


Returns
-------
A list of objects as described by the butler location. One item for

The original purpose of the list of locations was to allow a single dataset to be assembled from multiple physical serializations. If you're not actually expecting/using the list, I would just use the initial entry (and perhaps even warn if there is more than one).
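
One way the suggestion above might look, as a hedged sketch: the helper name `first_location` is hypothetical, and it operates on a plain list rather than on a real `ButlerLocation`, whose API is not shown in this conversation.

```python
import warnings


def first_location(locations):
    """Return the first entry of ``locations``, warning if others are ignored.

    ``locations`` is assumed to be a non-empty list of file paths.
    """
    if not locations:
        raise ValueError("no locations given")
    if len(locations) > 1:
        # Extra entries are ignored, so tell the caller rather than stay silent.
        warnings.warn(
            "expected one location, got {}; using the first".format(len(locations)),
            RuntimeWarning,
            stacklevel=2,
        )
    return locations[0]
```

The read function could then call such a helper once and return a single object instead of building a list with one item per location.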

@@ -288,10 +288,12 @@ def read(self, butlerLocation):
             raise(RuntimeError("No formatter for location:{}".format(butlerLocation)))
 
     def butlerLocationExists(self, location):
-        """Implementaion of PosixStorage.exists for ButlerLocation objects."""
+        """Implementaion of PosixStorage.exists for ButlerLocation objects.

Could you fix the typo here while you're in here?

@timothydmorton timothydmorton merged commit 4948899 into master May 1, 2018
@ktlim ktlim deleted the tickets/DM-13876 branch August 25, 2018 06:14