DM-13876: Implement and register ParquetStorage as a butler storage type #90

timothydmorton · 2018-04-16T02:06:21Z

This defines readParquetStorage and writeParquetStorage, which implement
butler get/put operations for ParquetStorage datatypes. They use
the ParquetTable wrapper object around pyarrow read/write operations.
Read returns a lazy object: no data is loaded up front, but arbitrary
columns can be loaded later using the .to_df method of the ParquetTable.
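
As a rough illustration of the lazy-read behaviour described above, here is a minimal sketch that uses only the standard library. It is not the ParquetTable implementation: `LazyTable`, `to_columns`, `read_count`, and `make_demo_file` are hypothetical names, and a CSV file stands in for a Parquet file so the example runs without pyarrow. Unlike a columnar Parquet read, the CSV reader still scans every row; the sketch only shows that constructing the wrapper touches no data and that columns are selected at load time.

```python
import csv
import os
import tempfile


class LazyTable:
    """Hypothetical stand-in for ParquetTable: stores a path, reads on demand."""

    def __init__(self, filename):
        # Only the path is recorded here; the file is not opened.
        self.filename = filename
        self.read_count = 0  # number of times the file was actually read

    def to_columns(self, columns):
        """Read the file and return only the requested columns as lists."""
        self.read_count += 1
        with open(self.filename, newline="") as f:
            rows = list(csv.DictReader(f))
        return {name: [row[name] for row in rows] for name in columns}


def make_demo_file():
    """Write a small three-column CSV file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "flux", "flag"])
        writer.writerow([1, 10.5, 0])
        writer.writerow([2, 11.25, 1])
    return path
```

Constructing `LazyTable(path)` leaves `read_count` at zero; the file is opened only when `to_columns([...])` is called, and only the named columns are returned.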

ktlim · 2018-04-19T15:39:12Z

python/lsst/daf/persistence/posixStorage.py

+
+    The object returned by this is expected to be a subtype
+    of `ParquetTable`, which is a thin wrapper to `pyarrow.ParquetFile`
+    that allows for lazy loading of the data.  This is notably 


This comment is incorrect now, isn't it?

ktlim · 2018-04-19T15:39:41Z

python/lsst/daf/persistence/posixStorage.py

+def writeParquetStorage(butlerLocation, obj):
+    """Writes pandas dataframe to parquet file
+
+    Note, this takes a `pandas.DataFrame` to write, whereas


And this comment; I think it expects a ParquetTable as well.

ktlim · 2018-04-19T15:41:01Z

python/lsst/daf/persistence/posixStorage.py

+
+    Returns
+    -------
+    A list of objects as described by the butler location. One item for


The original purpose of the list of locations was to allow a single dataset to be assembled from multiple physical serializations. If you're not actually expecting/using the list, I would just use the initial entry (and perhaps even warn if there is more than one).
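
One way to read this suggestion, as a sketch only: take the first entry and warn when extra entries are being ignored. The helper name `first_location` is hypothetical, and the locations are assumed to arrive as a plain Python list; how they are obtained from the ButlerLocation is left out.

```python
import warnings


def first_location(locations):
    """Return the first entry of ``locations``, warning if others are ignored."""
    if not locations:
        raise ValueError("no locations provided")
    if len(locations) > 1:
        # The extra entries are dropped, so make that visible to the caller.
        warnings.warn(
            "expected a single location, got {}; using the first".format(
                len(locations)
            )
        )
    return locations[0]
```

With a single-element list this returns the element silently; with more than one it still returns the first element but emits a `UserWarning`.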

ktlim · 2018-04-19T15:42:20Z

python/lsst/daf/persistence/posixStorage.py

@@ -288,10 +288,12 @@ def read(self, butlerLocation):
        raise(RuntimeError("No formatter for location:{}".format(butlerLocation)))

    def butlerLocationExists(self, location):
-        """Implementaion of PosixStorage.exists for ButlerLocation objects."""
+        """Implementaion of PosixStorage.exists for ButlerLocation objects.


Could you fix the typo here while you're in here?


ktlim reviewed Apr 19, 2018


timothydmorton added 3 commits May 1, 2018 15:12

fix trailing whitespace

a1ee7a1

correct docstrings and use only first result in list

bf5ba42

timothydmorton force-pushed the tickets/DM-13876 branch from c6a3e2b to bf5ba42 Compare May 1, 2018 22:15

timothydmorton merged commit 4948899 into master May 1, 2018

ktlim deleted the tickets/DM-13876 branch August 25, 2018 06:14
