DM-22062: add pandas/parquet support to Gen3 Butler #206
Conversation
Force-pushed from 411a38f to 9452fa0.
    from lsst.daf.butler.core.utils import iterable
    from lsst.daf.butler import Formatter, Location

    try:
Formatters are meant to be imported only when you need them, so I don't think we need the try block here, do we?
Generally, yes, that sounds reasonable, but do I need to guard against pytest-flake8 until the dust settles there?
Oh. I had forgotten about that annoyance. But then how does daf_butler work at all? python/lsst/daf/butler/assemblers/exposureAssembler.py imports afw for example.
Maybe it's okay when there's no __init__.py? I was just being paranoid because I don't understand how it finds things.
If I remove the try block and it passes Travis/Jenkins, is that sufficient to know it's safe? If not I might just keep the paranoia.
Fundamentally we don't understand why sometimes pytest-flake8 is importing code that it didn't import before (especially given that we didn't change flake8 or pytest-flake8 when we switched to conda pytest).
I'd really rather we didn't go through and put lots of try blocks in every formatter. It makes them untidy and is not required by the butler infrastructure and will likely lead to everyone adding try blocks defensively for no reason.
I just ran butler tests without afw setup and everything passes (so that's good).
I removed the try block and Travis and local scons are still happy. Launching Jenkins now.
    butler.get(
        "deepCoadd_obj", ...,
        parameters={
            "columns": {"dataset": "meas", "filter": ["HSC-R", "HSC-I"]}
"via:"

Can you add a "column" request to this example to make the difference between "columns" and "column" clear? It's especially confusing because you can request multiple columns, e.g.:

    butler.get(
        "deepCoadd_obj", ...,
        parameters={"columns": {"dataset": "meas",
                                "filter": ["HSC-R", "HSC-I"],
                                "column": ["coord_ra", "coord_dec"]}})
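For reference, here is roughly what such a three-level request would select on a plain pandas DataFrame with MultiIndex columns (toy data for illustration; this is a sketch of the selection semantics, not the butler API itself):

```python
import pandas as pd

# Toy DataFrame with three column levels, mirroring deepCoadd_obj's layout.
columns = pd.MultiIndex.from_product(
    [["meas", "ref"],
     ["HSC-R", "HSC-I", "HSC-G"],
     ["coord_ra", "coord_dec", "flux"]],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([range(len(columns))], columns=columns)
df = df.sort_index(axis=1)  # lexsort the columns for MultiIndex selection

# Select the product of the requested values on each level:
# 1 dataset x 2 filters x 2 columns = 4 columns in the result.
idx = pd.IndexSlice
sub = df.loc[:, idx[["meas"], ["HSC-R", "HSC-I"], ["coord_ra", "coord_dec"]]]
print(sub.shape)
```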
``::`` here is rST shorthand for "make a Python code block"; it renders as a single colon in HTML.
    ---------

    The ``DataFrame`` storage class corresponds to the `pandas.DataFrame` class in Python.
    It includes special support for dealing with hierarchical (i.e. `pandas.MultiIndex`) columns.
"hierarchical columns" sounds weird. I'd say: "It includes special support for dealing with hierarchical, or multi-level, indices (i.e. `pandas.MultiIndex` columns)."
I do want to make it clear that I'm not talking about a `MultiIndex` on the rows, as I gather that's actually the more common usage. How about: "It includes special support for dealing with multi-level indices (i.e. `pandas.MultiIndex`) in columns."?
    Components
    ^^^^^^^^^^

    ``DataFrame`` has a single component, ``columns``, which contains a description of the columns as a `pandas.Index` (often `pandas.MultiIndex`) instance.
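As a quick illustration of what such a column description looks like (toy data; the actual component is produced by the butler machinery, not by this snippet):

```python
import pandas as pd

# A DataFrame with two column levels.
columns = pd.MultiIndex.from_product(
    [["meas", "ref"], ["HSC-R", "HSC-I"]], names=["dataset", "filter"]
)
df = pd.DataFrame([[0.0] * len(columns)], columns=columns)

# The column description is itself an index object: a plain pandas.Index
# for flat columns, a pandas.MultiIndex for hierarchical ones.
print(type(df.columns).__name__)
print(list(df.columns.names))
```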
``DataFrame`` --> "The ``DataFrame`` storage class", so that it's clear that we're not talking about the pandas DataFrame?
    ``DataFrame`` supports a single parameter for partial reads, with the key ``columns``.
    For non-hierarchical columns, this should be a single column name (`str`) or a `list` of column names.
    For hierarchical columns, this should be a dictionary whose keys are the names of the levels, and whose values are column names (`str`) or lists thereof.
    The loaded columns are the product of the values for all levels.
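The "product of the values for all levels" rule can be sketched with a small helper (`expand_columns` is hypothetical, for illustration only, not the formatter's actual code):

```python
import itertools


def expand_columns(columns, level_names):
    """Expand a {level: name-or-list} dict into full column tuples,
    taking the Cartesian product of the per-level selections."""
    per_level = []
    for name in level_names:
        value = columns[name]
        # A bare string means a single column name for that level.
        per_level.append([value] if isinstance(value, str) else list(value))
    return list(itertools.product(*per_level))


requested = expand_columns(
    {"dataset": "meas",
     "filter": ["HSC-R", "HSC-I"],
     "column": ["coord_ra", "coord_dec"]},
    ["dataset", "filter", "column"],
)
# 1 dataset x 2 filters x 2 columns = 4 column tuples
print(len(requested))
```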
I'd call these non-hierarchical or hierarchical "indices" rather than columns.
"Multi-level index columns", again to avoid confusion with multi-level row indexes?
tests/test_parquet.py
    TESTDIR = os.path.abspath(os.path.dirname(__file__))


    @unittest.skipUnless(pyarrow is not None, "Cannot tests ParquetFormatter without pyarrow.")
Extra s in tests.
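The usual pattern behind a decorator like this is a guarded import at module scope, something along these lines (a sketch with the typo fixed; the class name and test body are placeholders, not the PR's actual code):

```python
import unittest

# Guarded import: the module stays importable even without pyarrow.
try:
    import pyarrow
except ImportError:
    pyarrow = None


@unittest.skipUnless(pyarrow is not None, "Cannot test ParquetFormatter without pyarrow.")
class ParquetFormatterTestCase(unittest.TestCase):
    """Tests that run only when the optional pyarrow dependency is present."""

    def testImport(self):
        self.assertIsNotNone(pyarrow)
```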
tests/test_parquet.py
    df2 = self.butler.get(self.datasetType)
    self.assertTrue(df1.equals(df2))
    # Read just the column descriptions.
    columns2 = self.butler.get(f"{self.datasetType.name}.columns")
This `columns` component is nifty and I'll use it.
    self.assertTrue(df1.loc[:, ["a", "c"]].equals(df3))
    df4 = self.butler.get(self.datasetType, parameters={"columns": "a"})
    self.assertTrue(df1.loc[:, ["a"]].equals(df4))
    # Passing an unrecognized column should be a ValueError.
What about when one column doesn't exist but the others do? E.g.:

    *** ValueError: Failure from formatter 'lsst.daf.butler.formatters.parquetFormatter.ParquetFormatter' for Dataset 1

This is different behavior from the existing qa.explorer.ParquetTable, which just replicates pyarrow behavior and returns the columns that do exist. See https://jira.lsstcorp.org/browse/DM-21976. Add some info on that ticket about the behavior of Gen3 Parquet, and we can deal with it later. I don't have an opinion on which behavior is less surprising, but they should be consistent, and the error message above is not enough info to know that 'd' was the offending request.
I think that exception is chained to the real one, i.e. it'll appear above the

    The above exception was the direct cause of the following exception:

that precedes this one in the traceback. I'm not a huge fan of that either, as it still obscures the more useful error message. @timj, what do you think about having the catch-and-reraise at

    raise ValueError(f"Failure from formatter '{formatter.name()}' for Dataset {ref.id}") from e

pass through a ValueError (or some other type) unchanged?
I've made a comment on DM-21976 and linked to this ticket. Just to make sure we're on the same page, it's okay to leave the (different) behavior here as-is for now?
My annoyance that `raise X from e` gives us unreadable stack traces knows no bounds. It's seemingly worse when run from pytest, since when pytest reports the stack trace it stops at the boundary.
I'm not entirely sure what the best answer is. On the one hand we could let the formatter fail directly without catching it but then we can't document that you get a ValueError (you might get an ImportError for example) and it's nice that the trace does tell you what dataset you were trying to get and that it was the formatter that failed.
Are we doing it wrong? Is there some other syntax for raise that would work better?
I think what we're doing is as right as it can be given language constraints. Maybe in the kind of high-level contexts where we currently squash exception tracebacks (on the assumption that they're not what users want to see) we'd want to print the original message last and most prominently. But I personally tend to find that behavior annoying anyway, because we haven't got good enough error messages for many failure modes, so usually it just means we tell the user to try again and pass --doRaise or something so we can actually help them.
Anyhow, doing:

    try:
        ...
    except ValueError:
        raise
    except Exception as err:
        raise ValueError(...) from err
would still let the `put` interface guarantee a certain exception, but it would mean we don't pass along potentially useful information like the data ID and dataset type (which the formatter doesn't know). So at this point I'm back to thinking what we have now is still the best option, even though none of them are good.
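For what it's worth, the chaining behavior under discussion can be seen in a minimal runnable sketch (all names and messages here are hypothetical stand-ins): the original exception survives as `__cause__` on the re-raised `ValueError`, which is what produces the two-part traceback.

```python
def get_dataset():
    try:
        # Stand-in for the formatter failing on a missing column.
        raise KeyError("column 'd' not found")
    except ValueError:
        # Let ValueError pass through unchanged.
        raise
    except Exception as err:
        # Chain the original error; it becomes exc.__cause__ and Python
        # prints "The above exception was the direct cause of ..." between
        # the two tracebacks.
        raise ValueError("Failure from formatter for Dataset 1") from err


try:
    get_dataset()
except ValueError as exc:
    print(exc)
    print(repr(exc.__cause__))
```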
Force-pushed from 70e4cf9 to ca31ba4.
Will squash.
Will squash.
Will squash.
Formatters should only be imported when actually needed, so these guards are hopefully redundant (as long as pytest-flake8 isn't overly aggressive).
Force-pushed from ca31ba4 to c011355.
Apparently just importing pyarrow is not enough to trigger the boost version conflict (known) issue we have on OS X.