
DM-22062: add pandas/parquet support to Gen3 Butler #206

Merged
merged 7 commits into from Nov 21, 2019

Conversation

TallJimbo
Member

No description provided.

from lsst.daf.butler.core.utils import iterable
from lsst.daf.butler import Formatter, Location

try:
Member

Formatters are meant to only be imported when you need them so I don't think we need the try block here do we?

Member Author

Generally, yes, that sounds reasonable, but do I need to guard against pytest-flake8 until the dust settles there?

Member

Oh. I had forgotten about that annoyance. But then how does daf_butler work at all? python/lsst/daf/butler/assemblers/exposureAssembler.py imports afw for example.

Member Author

Maybe it's okay when there's no __init__.py? I was just being paranoid because I don't understand how it finds things.

Member Author

If I remove the try block and it passes Travis/Jenkins, is that sufficient to know it's safe? If not I might just keep the paranoia.

Member

Fundamentally we don't understand why sometimes pytest-flake8 is importing code that it didn't import before (especially given that we didn't change flake8 or pytest-flake8 when we switched to conda pytest).

I'd really rather we didn't go through and put lots of try blocks in every formatter. It makes them untidy, is not required by the butler infrastructure, and will likely lead to everyone adding try blocks defensively for no reason.

I just ran butler tests without afw setup and everything passes (so that's good).

Member Author

I removed the try block and Travis and local scons are still happy. Launching Jenkins now.

butler.get(
    "deepCoadd_obj", ...,
    parameters={
        "columns": {"dataset": "meas", "filter": ["HSC-R", "HSC-I"]}})


"via:"

Can you add a "column" request to this example to make the difference between "columns" and "column" clear? It's especially confusing because you can request multiple columns e.g.:

butler.get(
    "deepCoadd_obj", ...,
    parameters={"columns": {"dataset": "meas",
                            "filter": ["HSC-R", "HSC-I"],
                            "column": ["coord_ra", "coord_dec"]}})

Member Author

:: here is rST shorthand for "make a Python code block"; it renders as a single colon in HTML.

---------

The ``DataFrame`` storage class corresponds to the `pandas.DataFrame` class in Python.
It includes special support for dealing with hierarchical (i.e. `pandas.MultiIndex`) columns.


"hierarchical columns" sounds weird. I'd say:
It includes special support for dealing with hierarchical, or multi-level, indices (i.e. pandas.MultiIndex columns).

Member Author

I do want to make it clear that I'm not talking about a MultiIndex on the rows, as I gather that's actually the more common usage. How about:

It includes special support for dealing with multi-level indices (i.e. pandas.MultiIndex) in columns.

?
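For readers unfamiliar with the distinction being drawn here, a minimal pandas sketch of a MultiIndex on the *columns* (the level names follow the ``deepCoadd_obj`` example above; the data itself is made up for illustration):

```python
import pandas as pd

# Build a three-level column index: dataset x filter x column.
columns = pd.MultiIndex.from_product(
    [["meas", "forced_src"], ["HSC-R", "HSC-I"], ["coord_ra", "coord_dec"]],
    names=["dataset", "filter", "column"],
)
df = pd.DataFrame([[0.0] * len(columns)], columns=columns)

# Selecting by the outermost level keeps all columns under "meas";
# the remaining column levels are "filter" and "column".
subset = df["meas"]
print(list(subset.columns.names))  # ['filter', 'column']
```

Note that the row index here is an ordinary `RangeIndex`; only the columns are multi-level, which is exactly the case the doc text is trying to single out.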

Components
^^^^^^^^^^

``DataFrame`` has a single component, ``columns``, which contains a description of the columns as a `pandas.Index` (often `pandas.MultiIndex`) instance.


DataFrame --> The DataFrame storage class

so that it's clear that we're not talking about the pandas DataFrame?

``DataFrame`` supports a single parameter for partial reads, with the key ``columns``.
For non-hierarchical columns, this should be a single column name (`str`) or a `list` of column names.
For hierarchical columns, this should be a dictionary whose keys are the names of the levels, and whose values are column names (`str`) or lists thereof.
The loaded columns are the product of the values for all levels.


I'd call these non-hierarchical or hierarchical "indices" rather than columns.

Member Author

"Multi-level index columns", again to avoid confusion with multi-level row indexes?
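The "product of the values for all levels" rule from the doc excerpt can be sketched as follows; ``expand_columns`` and its handling of single strings are my own illustration of the described semantics, not the actual ``ParquetFormatter`` code:

```python
import itertools

def expand_columns(parameter, level_names):
    """Expand a ``columns`` parameter dict into explicit column tuples.

    Each level's value may be a single name (str) or a list of names;
    the requested columns are the Cartesian product across levels.
    """
    per_level = []
    for name in level_names:
        values = parameter[name]
        if isinstance(values, str):  # a single name is allowed
            values = [values]
        per_level.append(values)
    return list(itertools.product(*per_level))

requested = expand_columns(
    {"dataset": "meas",
     "filter": ["HSC-R", "HSC-I"],
     "column": ["coord_ra", "coord_dec"]},
    ["dataset", "filter", "column"],
)
# Four columns: meas/HSC-R/coord_ra, meas/HSC-R/coord_dec,
# meas/HSC-I/coord_ra, meas/HSC-I/coord_dec.
```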

TESTDIR = os.path.abspath(os.path.dirname(__file__))


@unittest.skipUnless(pyarrow is not None, "Cannot tests ParquetFormatter without pyarrow.")


Extra s in tests.

df2 = self.butler.get(self.datasetType)
self.assertTrue(df1.equals(df2))
# Read just the column descriptions.
columns2 = self.butler.get(f"{self.datasetType.name}.columns")


This columns component is nifty and I'll use it.

self.assertTrue(df1.loc[:, ["a", "c"]].equals(df3))
df4 = self.butler.get(self.datasetType, parameters={"columns": "a"})
self.assertTrue(df1.loc[:, ["a"]].equals(df4))
# Passing an unrecognized column should be a ValueError.


What about when one column doesn't exist but the others do: e.g.

*** ValueError: Failure from formatter 'lsst.daf.butler.formatters.parquetFormatter.ParquetFormatter' for Dataset 1

This is different behavior than the existing qa.explorer.ParquetTable, which just replicates pyarrow behavior and returns the columns that do exist. See https://jira.lsstcorp.org/browse/DM-21976. Add some info on that ticket about the behavior of Gen3 Parquet, and we can deal with it later. I don't have an opinion on which behavior is less surprising, but they should be consistent, and the ValueError message above is not enough info to know that 'd' was the offending request.

Member Author

I think that exception is chained to the real one, i.e. it'll appear above the

The above exception was the direct cause of the following exception:

that precedes this one in the traceback. I'm not a huge fan of that either, as it still obscures the more useful error message. @timj, what do you think about having the catch-and-reraise at

raise ValueError(f"Failure from formatter '{formatter.name()}' for Dataset {ref.id}") from e
pass through exceptions that are already ValueError (or some other type) unchanged?

I've made a comment on DM-21976 and linked to this ticket. Just to make sure we're on the same page, it's okay to leave the (different) behavior here as-is for now?
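A stand-alone demonstration of the chaining behavior under discussion (the names and messages are made up; only the ``raise ... from err`` pattern matches the quoted butler code): the original exception survives as ``__cause__`` and its traceback prints above the "The above exception was the direct cause of the following exception:" line.

```python
def get_dataset():
    try:
        # Stand-in for the real formatter failure (e.g. an unknown column).
        raise KeyError("'d' is not among the available columns")
    except Exception as err:
        # Mirrors the catch-and-reraise pattern quoted above.
        raise ValueError(
            "Failure from formatter 'ParquetFormatter' for Dataset 1"
        ) from err

try:
    get_dataset()
except ValueError as exc:
    caught = exc

# The original error is chained, not lost; a smarter top-level reporter
# could surface caught.__cause__ instead of only the outer message.
```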

Member

My annoyance that raise X from e gives us unreadable stack traces knows no bounds. It's seemingly worse when run from pytest since when pytest reports the stack trace it stops at the boundary.

I'm not entirely sure what the best answer is. On the one hand we could let the formatter fail directly without catching it, but then we can't document that you get a ValueError (you might get an ImportError, for example); on the other hand, it's nice that the trace tells you what dataset you were trying to get and that it was the formatter that failed.

Are we doing it wrong? Is there some other syntax for raise that would work better?

Member Author

I think what we're doing is as right as it can be given language constraints. Maybe in the kind of high-level contexts where we currently squash exception tracebacks (on the assumption that they're not what users want to see) we'd want to print the original message last and most prominently. But I personally tend to find that behavior annoying anyway, because we haven't got good enough error messages for many failure modes, so usually it just means we tell the user to try again and pass --doRaise or something so we can actually help them.

Anyhow, doing:

try:
    ...
except ValueError:
    raise
except Exception as err:
    raise ValueError(...) from err

would still let the put interface guarantee a certain exception, but it would mean we don't pass along potentially useful information like the data ID and dataset type (which the formatter doesn't know). So at this point I'm back to thinking what we have now is still the best option, even though none of them are good.

Formatters should only be imported when actually needed, so these
guards are hopefully redundant (as long as pytest-flake8 isn't overly
aggressive).
Apparently just importing pyarrow is not enough to trigger the known
boost version conflict issue we have on OS X.
@TallJimbo TallJimbo merged commit f9e0909 into master Nov 21, 2019
@TallJimbo TallJimbo deleted the tickets/DM-22062 branch November 21, 2019 16:13