DOC/API: document how to use metadata #8572

Open · 3 tasks
jreback opened this issue Oct 17, 2014 · 11 comments
Labels: Docs · metadata _metadata, .attrs

@jreback (Contributor) commented Oct 17, 2014

from SO
xref #2485
xref #7868

  • documentation (maybe in the cookbook) to start.
  • add _metadata to Index
  • top-level metadata control (maybe)
def unit_meta(self, other, method=None):
    # combine and return a new unit for this

    if self.unit == other.unit:
        return self.unit

    # coerce unit
    return coerced_unit

pd.set_metadata({'unit': unit_meta})

This last item will require a bit of a change in __finalize__ to handle a metadata finalizer for a specific name (but it is straightforward).
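
A rough sketch of what that could look like (set_metadata and the _metadata_finalizers registry are hypothetical names for the proposal above, not existing pandas API; __finalize__ stands in for NDFrame.__finalize__):

# Sketch only -- ``set_metadata`` / ``_metadata_finalizers`` are hypothetical.
_metadata_finalizers = {}

def set_metadata(finalizers):
    # e.g. set_metadata({'unit': unit_meta})
    _metadata_finalizers.update(finalizers)

def __finalize__(self, other, method=None, **kwargs):
    for name in self._metadata:
        finalizer = _metadata_finalizers.get(name)
        if finalizer is not None:
            # let the registered finalizer decide how to combine the values
            object.__setattr__(self, name, finalizer(self, other, method=method))
        else:
            # fall back to today's behaviour: copy from ``other`` if present
            object.__setattr__(self, name, getattr(other, name, None))
    return self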

@shoyer (Member) commented Oct 19, 2014

I think this is a good idea, but we'll need to do it carefully.

I'm guessing the method argument is there to provide some description of the context? The difficulty with propagating metadata is that there are so many edge cases. For example, to handle units, it's not enough to know that both units are the same -- you also need to know what sort of operation is being performed (e.g., + vs * vs ==). And perhaps you might prefer to raise rather than propagate the wrong units, etc.

When I implemented this in xray, I didn't want to deal with these issues, so I took the more conservative approach of dropping custom metadata (other than name) in all binary arithmetic and aggregations. See http://xray.readthedocs.org/en/stable/faq.html#what-is-your-approach-to-metadata

I think this sort of hook system is a good idea to let someone else deal with the complexity. __numpy_ufunc__ (arriving in numpy 1.10) may provide a useful reference design -- I know astropy is planning on using it for their Quantities class.

RE: your specific design, I would rather make users define a custom subclass, e.g., UnitsDataFrame = pd.add_metadata(pd.DataFrame, {'unit': unit_meta}), rather than use some sort of global state.
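
Purely as a sketch, that factory could look something like the following (add_metadata is not a real pandas function; it just builds a subclass whose __finalize__ consults the supplied combiners):

import pandas as pd

# Hypothetical sketch of the class-factory idea above.
def add_metadata(base, finalizers):
    class WithMetadata(base):
        _metadata = list(finalizers)

        @property
        def _constructor(self):
            return WithMetadata

        def __finalize__(self, other, method=None, **kwargs):
            # let each supplied combiner decide how to propagate its attribute
            for name, combine in finalizers.items():
                object.__setattr__(self, name, combine(self, other, method=method))
            return self

    return WithMetadata

# UnitsDataFrame = add_metadata(pd.DataFrame, {'unit': unit_meta})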

@jreback (Contributor, Author) commented Oct 20, 2014

@shoyer the design for this was for users to subclass / monkey-patch DataFrame and replace __finalize__ with a completely new routine that handles whatever behavior the user wants. I am proposing an extension to allow a method-by-method interaction to be specified, e.g. method='merge'|'add' is passed, and the user provides a routine to handle it.

This is just a dispatch to user interaction, like: hey, I have these 2 objects which I am adding, how do you want to combine the metadata? It will drop by default; this just allows a 'plug-in' type of mechanism.
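
On the user side that could look roughly like this (a sketch; the exact method strings pandas passes to __finalize__ are an implementation detail, so treat the values used here as assumptions):

import pandas as pd

class UnitFrame(pd.DataFrame):
    # sketch of the user-side "plug-in": handle the cases you care about,
    # drop the metadata for everything else
    _metadata = ['unit']

    @property
    def _constructor(self):
        return UnitFrame

    def __finalize__(self, other, method=None, **kwargs):
        if isinstance(other, UnitFrame) and method is None:
            # plain copies / selections: propagate as-is
            object.__setattr__(self, 'unit', getattr(other, 'unit', None))
        # anything else (merge, add, ...): leave ``unit`` unset, i.e. drop
        return self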

@hughesadam87 commented

Hi,

Really glad to see the subclassing and finalize behavior come to fruition.

I'm not sure I fully understand the scope of this general metadata problem, but in my experience the relation of metadata to the results of __finalize__ can get very complex, so any solution you come up with should err on the conservative side, as Stephan mentioned. As a user of such functionality, I'd prefer my metadata to cause errors in operations that are not directly supported rather than having a general schema in place that might lead to convoluted, hard-to-find bugs. For example, in a spectroscopy library, if the user adds two objects that have different unit systems, I'd rather have an error raised than a decision made for me based on Pandas' metadata handling system.

From studying GeoPandas as well as pyuvvis, I get the impression that most pandas-subclassing libraries are going to use a relatively limited set of operations from the API, so maybe it's a good idea to make them work out all of the use cases for metadata as they go, and just provide context and suggestions in the docs?

Sorry if this is not germane to the issue at hand ;0

@jreback (Contributor, Author) commented Oct 20, 2014

@hugadams

this modification / doc update is for the USER to really do all of the work. pandas provides a framework, if you will, but the USER decides ALL interactions with metadata (otherwise they are NOT propagated). So it is basically wide open for the USER to provide a mechanism to propagate some / raise an error if needed, etc. I guess some docs are in order!

@hughesadam87 commented

Sounds great

@TomAugspurger (Contributor) commented Jul 27, 2017

Resurrecting this with a bit of an alternative / synthesis of previous ideas. The basic idea is to push metadata propagation onto the subclasses, as was previously suggested. The new proposal is for pandas to provide a bit more infrastructure for subclasses, which would remove the need for any global state.


The ._metadata attribute works pretty well for many things. As an example, here's a subclassed DataFrame that has a "color". You can only add DataFrames of the same color:

import pandas as pd


class SubclassedDataFrame2(pd.DataFrame):

    # normal properties
    _metadata = ['color']

    def __init__(self, *args, color=None, **kwargs):
        self.color = color
        super().__init__(*args, **kwargs)

    @property
    def _constructor(self):
        return SubclassedDataFrame2

    def __add__(self, other):
        if self.color != other.color:
            raise ValueError
        return super().__add__(other)

>>> a = SubclassedDataFrame2({"A": [1, 2], "B": [3, 4]}, color='red')

For things like __getitem__ the metadata propagates nicely:

>>> a[['A']].color
'red'

But binary operations don't propagate the metadata

>>> (a + a).color  # None

We could patch SubclassedDataFrame2.__add__ to manually add it ourselves, but that would get old doing it for every method.

A potential solution is for pandas to provide a Metadata class that subclasses could use to indicate if / how metadata should be propagated for any given operation. The base class would be something like

class Metadata:

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return "Metadata({})".format(self.name)

    def __add__(self, left, right):
        return None  # do not propagate

and subclasses would override the methods they want

class ColorMetadata(Metadata):

    def __add__(self, left, right):
        if {left.color, right.color} == {"blue", "yellow"}:
            return 'green'
        elif {left.color, right.color} == {"blue", "red"}:
            return "purple"
        ...

    def concat(self, left, right):
        return '-'.join([left.color, right.color])

So when defining a subclass, it would be _metadata = [ColorMetadata('color')], and pandas would call the appropriate method to figure out how to propagate metadata.

>>> b = SubclassedDataFrame2({"A": [1, 2], "C": [5, 6]}, color='blue')
>>> print((a + b).color)
purple
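
For concreteness, __finalize__ could route to those hooks roughly like this (a sketch only; the calling convention assumed here, where other is the (left, right) pair of operands, is part of the proposal rather than current pandas behaviour):

def __finalize__(self, other, method=None, **kwargs):
    # look up a hook named after the operation on each Metadata object
    for meta in type(self)._metadata:
        if isinstance(meta, Metadata) and method is not None:
            handler = getattr(meta, method, None)
            if handler is not None:
                left, right = other
                object.__setattr__(self, meta.name, handler(left, right))
    return self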

thoughts?

@jbrockmendel (Member) commented

This is a solid approach. Have you thought about how to handle Series-level metadata when that Series becomes a column in a DataFrame? e.g.

c = SubclassedDataFrame2({"A": a["A"], "C": b["C"]})

Here a["A"].color == 'red' and b["C"].color == 'blue'. What is c.color? Do we still have c["A"].color == a["A"].color?

@jbrockmendel (Member) commented

An issue that comes up with the column-specific metadata is that _constructor_sliced may need to be axis-specific.

@TomAugspurger (Contributor) commented

This has moved up a bit on my priority list. I'm hoping to use _metadata to propagate a "disallow duplicate labels" property through operations (#27108).


I'm playing with different APIs right now. The core feature I want to provide is for a given attribute to determine how metadata should be propagated for a given pandas method. I think that necessitates some kind of finalizer like

dispatch = {}  # Dict[Tuple[method, metadata_name], Callable]

def __finalize__(self, other, method=None):
    for metadata_name in self._metadata:
        dispatch[(method, metadata_name)](self, other)

There are a few options for registering finalizers with the dispatch table, but right now I'm favoring something like

duplicate_meta = PandasMetadata("disallow_duplicate_labels")  # the metadata name

@duplicate_meta.register(pd.concat)
def finalize_concat(new, other):
    new.allow_duplicate_labels = all(x.allow_duplicate_labels for x in other)

And we would provide a default finalizer that does what we do on master today (copy from other to self when other is an NDFrame).
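
That default would be roughly the following (a simplified sketch, using Series/DataFrame in place of the internal NDFrame check):

import pandas as pd

# simplified sketch of the default: copy each named attribute from ``other``
# onto the new object when ``other`` is a single Series/DataFrame
def default_finalize(new, other):
    if isinstance(other, (pd.Series, pd.DataFrame)):
        for name in new._metadata:
            object.__setattr__(new, name, getattr(other, name, None))
    return new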

The main problems I'm facing now:

  1. What is passed to the user-defined finalizers? In my concat example, I think to_concat makes sense. I think we want to pass through all the NDFrames, but this seems a little tricky to get right.
  2. What keys to use in the dispatch table? Something like pd.merge, DataFrame.merge, DataFrame.join might all end up in the same __finalize__ call. Do we want to disambiguate those? I don't know.
  3. Discovery: How do metadata authors know which functions call __finalize__, and what arguments to expect for each? Still thinking this through.

I have a work in progress at https://github.com/TomAugspurger/pandas/pull/new/metadata-dispatch.

@jbrockmendel (Member) commented

AFAICT there isn't a way to comment on that branch until a PR is opened. Can you open it as a "draft" or something?

In this discussion it seems like disallow_duplicate_labels is pinned to a Series/DataFrame, but it was originally discussed as an Index attribute. Is that distinction important?

@TomAugspurger (Contributor) commented Sep 6, 2019 via email

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022