DM-22062: add pandas/parquet support to Gen3 Butler #206
Conversation
Force-pushed from 411a38f to 9452fa0.
    from lsst.daf.butler.core.utils import iterable
    from lsst.daf.butler import Formatter, Location

    try:
Formatters are meant to be imported only when you need them, so I don't think we need the try block here, do we?
Generally, yes, that sounds reasonable, but do I need to guard against pytest-flake8 until the dust settles there?
Oh. I had forgotten about that annoyance. But then how does daf_butler work at all? python/lsst/daf/butler/assemblers/exposureAssembler.py imports afw for example.
Maybe it's okay when there's no __init__.py? I was just being paranoid because I don't understand how it finds things.
If I remove the try block and it passes Travis/Jenkins, is that sufficient to know it's safe? If not I might just keep the paranoia.
Fundamentally we don't understand why sometimes pytest-flake8 is importing code that it didn't import before (especially given that we didn't change flake8 or pytest-flake8 when we switched to conda pytest).
I'd really rather we didn't go through and put lots of try blocks in every formatter. It makes them untidy and is not required by the butler infrastructure and will likely lead to everyone adding try blocks defensively for no reason.
I just ran butler tests without afw setup and everything passes (so that's good).
I removed the try block and Travis and local scons are still happy. Launching Jenkins now.
    butler.get(
        "deepCoadd_obj", ...,
        parameters={
            "columns": {"dataset": "meas", "filter": ["HSC-R", "HSC-I"]}
"via:"

Can you add a "column" request to this example to make the difference between "columns" and "column" clear? It's especially confusing because you can request multiple columns, e.g.:

    butler.get(
        "deepCoadd_obj", ...,
        parameters={"columns": {"dataset": "meas",
                                "filter": ["HSC-R", "HSC-I"],
                                "column": ["coord_ra", "coord_dec"]}})
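For reference, here is roughly what such a three-level request would select on a plain pandas DataFrame with MultiIndex columns (toy data for illustration; this is a sketch of the selection semantics, not the butler API itself):

```python
import pandas as pd

# Toy DataFrame with three column levels, mirroring deepCoadd_obj's layout.
columns = pd.MultiIndex.from_product(
    [["meas", "ref"],
     ["HSC-R", "HSC-I", "HSC-G"],
     ["coord_ra", "coord_dec", "flux"]],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([range(len(columns))], columns=columns)
df = df.sort_index(axis=1)  # lexsort the columns for MultiIndex selection

# Select the product of the requested values on each level:
# 1 dataset x 2 filters x 2 columns = 4 columns in the result.
idx = pd.IndexSlice
sub = df.loc[:, idx[["meas"], ["HSC-R", "HSC-I"], ["coord_ra", "coord_dec"]]]
print(sub.shape)
```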
``::`` here is rST shorthand for "make a Python code block"; it renders as a single colon in HTML.
    ---------

    The ``DataFrame`` storage class corresponds to the `pandas.DataFrame` class in Python.
    It includes special support for dealing with hierarchical (i.e. `pandas.MultiIndex`) columns.
"hierarchical columns" sounds weird. I'd say: "It includes special support for dealing with hierarchical, or multi-level, indices (i.e. `pandas.MultiIndex` columns)."
I do want to make it clear that I'm not talking about a `MultiIndex` on the rows, as I gather that's actually the more common usage. How about: "It includes special support for dealing with multi-level indices (i.e. `pandas.MultiIndex`) in columns."?
    Components
    ^^^^^^^^^^

    ``DataFrame`` has a single component, ``columns``, which contains a description of the columns as a `pandas.Index` (often `pandas.MultiIndex`) instance.
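As a quick illustration of what such a column description looks like (toy data; the actual component is produced by the butler machinery, not by this snippet):

```python
import pandas as pd

# A DataFrame with two column levels.
columns = pd.MultiIndex.from_product(
    [["meas", "ref"], ["HSC-R", "HSC-I"]], names=["dataset", "filter"]
)
df = pd.DataFrame([[0.0] * len(columns)], columns=columns)

# The column description is itself an index object: a plain pandas.Index
# for flat columns, a pandas.MultiIndex for hierarchical ones.
print(type(df.columns).__name__)
print(list(df.columns.names))
```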
``DataFrame`` --> "The ``DataFrame`` storage class", so that it's clear that we're not talking about the pandas DataFrame?
    ``DataFrame`` supports a single parameter for partial reads, with the key ``columns``.
    For non-hierarchical columns, this should be a single column name (`str`) or a `list` of column names.
    For hierarchical columns, this should be a dictionary whose keys are the names of the levels, and whose values are column names (`str`) or lists thereof.
    The loaded columns are the product of the values for all levels.
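The "product of the values for all levels" rule can be sketched with a small helper (`expand_columns` is hypothetical, for illustration only, not the formatter's actual code):

```python
import itertools


def expand_columns(columns, level_names):
    """Expand a {level: name-or-list} dict into full column tuples,
    taking the Cartesian product of the per-level selections."""
    per_level = []
    for name in level_names:
        value = columns[name]
        # A bare string means a single column name for that level.
        per_level.append([value] if isinstance(value, str) else list(value))
    return list(itertools.product(*per_level))


requested = expand_columns(
    {"dataset": "meas",
     "filter": ["HSC-R", "HSC-I"],
     "column": ["coord_ra", "coord_dec"]},
    ["dataset", "filter", "column"],
)
# 1 dataset x 2 filters x 2 columns = 4 column tuples
print(len(requested))
```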
I'd call these non-hierarchical or hierarchical "indices" rather than columns.
"Multi-level index columns", again to avoid confusion with multi-level row indexes?
tests/test_parquet.py
    TESTDIR = os.path.abspath(os.path.dirname(__file__))


    @unittest.skipUnless(pyarrow is not None, "Cannot tests ParquetFormatter without pyarrow.")
Extra s in tests.
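The usual pattern behind a decorator like this is a guarded import at module scope, something along these lines (a sketch with the typo fixed; the class name and test body are placeholders, not the PR's actual code):

```python
import unittest

# Guarded import: the module stays importable even without pyarrow.
try:
    import pyarrow
except ImportError:
    pyarrow = None


@unittest.skipUnless(pyarrow is not None, "Cannot test ParquetFormatter without pyarrow.")
class ParquetFormatterTestCase(unittest.TestCase):
    """Tests that run only when the optional pyarrow dependency is present."""

    def testImport(self):
        self.assertIsNotNone(pyarrow)
```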
tests/test_parquet.py
    df2 = self.butler.get(self.datasetType)
    self.assertTrue(df1.equals(df2))
    # Read just the column descriptions.
    columns2 = self.butler.get(f"{self.datasetType.name}.columns")
This `columns` component is nifty and I'll use it.
    self.assertTrue(df1.loc[:, ["a", "c"]].equals(df3))
    df4 = self.butler.get(self.datasetType, parameters={"columns": "a"})
    self.assertTrue(df1.loc[:, ["a"]].equals(df4))
    # Passing an unrecognized column should be a ValueError.
What about when one column doesn't exist but the others do? E.g.:

    *** ValueError: Failure from formatter 'lsst.daf.butler.formatters.parquetFormatter.ParquetFormatter' for Dataset 1

This is different behavior from the existing qa.explorer.ParquetTable, which just replicates pyarrow behavior and returns the columns that do exist. See https://jira.lsstcorp.org/browse/DM-21976. Add some info on that ticket about the behavior of Gen3 Parquet, and we can deal with it later. I don't have an opinion on which behavior is less surprising, but they should be consistent, and the error message above is not enough info to know that 'd' was the offending request.
I think that exception is chained to the real one, i.e. it'll appear above the

    The above exception was the direct cause of the following exception:

that precedes this one in the traceback. I'm not a huge fan of that either, as it still obscures the more useful error message. @timj, what do you think about having the catch-and-reraise at

    raise ValueError(f"Failure from formatter '{formatter.name()}' for Dataset {ref.id}") from e

pass through a ValueError (or some other type) unchanged?
I've made a comment on DM-21976 and linked to this ticket. Just to make sure we're on the same page, it's okay to leave the (different) behavior here as-is for now?
My annoyance that `raise X from e` gives us unreadable stack traces knows no bounds. It's seemingly worse when run from pytest, since when pytest reports the stack trace it stops at the boundary.
I'm not entirely sure what the best answer is. On the one hand we could let the formatter fail directly without catching it but then we can't document that you get a ValueError (you might get an ImportError for example) and it's nice that the trace does tell you what dataset you were trying to get and that it was the formatter that failed.
Are we doing it wrong? Is there some other syntax for raise that would work better?
I think what we're doing is as right as it can be given language constraints. Maybe in the kind of high-level contexts where we currently squash exception tracebacks (on the assumption that they're not what users want to see) we'd want to print the original message last and most prominently. But I personally tend to find that behavior annoying anyway, because we haven't got good enough error messages for many failure modes, so usually it just means we tell the user to try again and pass --doRaise or something so we can actually help them.
Anyhow, doing:

    try:
        ...
    except ValueError:
        raise
    except Exception as err:
        raise ValueError(...) from err
would still let the `put` interface guarantee a certain exception, but it would mean we don't pass along potentially useful information like the data ID and dataset type (which the formatter doesn't know). So at this point I'm back to thinking what we have now is still the best option, even though none of them are good.
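For what it's worth, the chaining behavior under discussion can be seen in a minimal runnable sketch (all names and messages here are hypothetical stand-ins): the original exception survives as `__cause__` on the re-raised `ValueError`, which is what produces the two-part traceback.

```python
def get_dataset():
    try:
        # Stand-in for the formatter failing on a missing column.
        raise KeyError("column 'd' not found")
    except ValueError:
        # Let ValueError pass through unchanged.
        raise
    except Exception as err:
        # Chain the original error; it becomes exc.__cause__ and Python
        # prints "The above exception was the direct cause of ..." between
        # the two tracebacks.
        raise ValueError("Failure from formatter for Dataset 1") from err


try:
    get_dataset()
except ValueError as exc:
    print(exc)
    print(repr(exc.__cause__))
```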
Force-pushed from 70e4cf9 to ca31ba4.
Will squash.
Will squash.
Will squash.
Formatters should only be imported when actually needed, so these guards are hopefully redundant (as long as pytest-flake8 isn't overly aggressive).
Force-pushed from ca31ba4 to c011355.
Apparently just importing pyarrow is not enough to trigger the boost version conflict (known) issue we have on OS X.