DOC/API: document how to use metadata #8572

Open · 3 tasks
jreback opened this issue Oct 17, 2014 · 11 comments
Labels: Docs · metadata _metadata, .attrs

@jreback (Contributor) commented Oct 17, 2014

from SO
xref #2485
xref #7868

  • documentation (maybe in the cookbook) to start.
  • add _metadata to Index
  • top-level metadata control (maybe)
def unit_meta(self, other, method=None):
    # combine and return a new unit for this

    if self.unit == other.unit:
        return self.unit

    # coerce unit
    return coerced_unit

pd.set_metadata({'unit': unit_meta})

This last item will require a bit of a change in __finalize__ to handle a metadata finalizer for a specific name (but it is straightforward).
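
A rough sketch of what that could look like (set_metadata and the _metadata_finalizers registry are hypothetical names for the proposal above, not existing pandas API; __finalize__ stands in for NDFrame.__finalize__):

# Sketch only -- ``set_metadata`` / ``_metadata_finalizers`` are hypothetical.
_metadata_finalizers = {}

def set_metadata(finalizers):
    # e.g. set_metadata({'unit': unit_meta})
    _metadata_finalizers.update(finalizers)

def __finalize__(self, other, method=None, **kwargs):
    for name in self._metadata:
        finalizer = _metadata_finalizers.get(name)
        if finalizer is not None:
            # let the registered finalizer decide how to combine the values
            object.__setattr__(self, name, finalizer(self, other, method=method))
        else:
            # fall back to today's behaviour: copy from ``other`` if present
            object.__setattr__(self, name, getattr(other, name, None))
    return self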

@shoyer (Member) commented Oct 19, 2014

I think this is a good idea, but we'll need to do it carefully.

I'm guessing the method argument is there to provide some description of the context? The difficulty with propagating metadata is that there are so many edge cases. For example, to handle units, it's not enough to know that both units are the same -- you also need to know what sort of operation is being performed (e.g., + vs * vs ==). And perhaps you might prefer to raise rather than propagate the wrong units, etc.

When I implemented this in xray, I didn't want to deal with these issues, so I took the more conservative approach of dropping custom metadata (other than name) in all binary arithmetic and aggregations. See http://xray.readthedocs.org/en/stable/faq.html#what-is-your-approach-to-metadata

I think this sort of hook system is a good idea to let someone else deal with the complexity. __numpy_ufunc__ (arriving in numpy 1.10) may provide a useful reference design -- I know astropy is planning on using it for their Quantities class.

RE: your specific design, I would rather make users define a custom subclass, e.g., UnitsDataFrame = pd.add_metadata(pd.DataFrame, {'unit': unit_meta}), rather than use some sort of global state.
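
Purely as a sketch, that factory could look something like the following (add_metadata is not a real pandas function; it just builds a subclass whose __finalize__ consults the supplied combiners):

import pandas as pd

# Hypothetical sketch of the class-factory idea above.
def add_metadata(base, finalizers):
    class WithMetadata(base):
        _metadata = list(finalizers)

        @property
        def _constructor(self):
            return WithMetadata

        def __finalize__(self, other, method=None, **kwargs):
            # let each supplied combiner decide how to propagate its attribute
            for name, combine in finalizers.items():
                object.__setattr__(self, name, combine(self, other, method=method))
            return self

    return WithMetadata

# UnitsDataFrame = add_metadata(pd.DataFrame, {'unit': unit_meta})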

@jreback (Contributor, Author) commented Oct 20, 2014

@shoyer the design for this was for users to subclass / monkey-patch DataFrame and replace __finalize__ with a completely new routine that handles whatever behavior the user wants. I am proposing an extension to allow a method-by-method interaction to be specified, e.g. method='merge'|'add' is passed, and the user provides a routine to handle it.

This is just a dispatch to user interaction, like: hey, I have these 2 objects which I am adding, how do you want to combine the metadata? It will drop by default; this just allows a 'plug-in' type of mechanism.
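
On the user side that could look roughly like this (a sketch; the exact method strings pandas passes to __finalize__ are an implementation detail, so treat the values used here as assumptions):

import pandas as pd

class UnitFrame(pd.DataFrame):
    # sketch of the user-side "plug-in": handle the cases you care about,
    # drop the metadata for everything else
    _metadata = ['unit']

    @property
    def _constructor(self):
        return UnitFrame

    def __finalize__(self, other, method=None, **kwargs):
        if isinstance(other, UnitFrame) and method is None:
            # plain copies / selections: propagate as-is
            object.__setattr__(self, 'unit', getattr(other, 'unit', None))
        # anything else (merge, add, ...): leave ``unit`` unset, i.e. drop
        return self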

@hughesadam87 commented

Hi,

Really glad to see the subclassing and finalize behavior come to fruition.

I'm not sure I fully understand the scope of this general metadata problem, but in my experience the relation of metadata to the results of __finalize__ can get very complex, so any solution you come up with should err on the conservative side, as Stephan mentioned. As a user of such functionality, I'd prefer my metadata to cause errors in operations that are not directly supported rather than having a general schema in place that might lead to convoluted, hard-to-find bugs. For example, in a spectroscopy library, if the user adds two objects that have different unit systems, I'd rather have an error raised than a decision made for me based on Pandas' metadata handling system.

From studying GeoPandas as well as pyuvvis, I get the impression that most pandas-subclassing libraries are going to use a relatively limited set of operations from the API, so maybe it's a good idea to make them work out all of the use cases for metadata as they go, and just provide context and suggestions in the docs?

Sorry if this is not germane to the issue at hand ;0

@jreback (Contributor, Author) commented Oct 20, 2014

@hugadams

this modification / doc update is for the USER to really do all of the work. pandas provides a framework, if you will, but the USER decides ALL interactions with metadata (otherwise they are NOT propagated). So it is basically wide open for the USER to provide a mechanism to propagate some / raise an error if needed, etc. I guess some docs are in order!

@hughesadam87 commented

Sounds great

@TomAugspurger (Contributor) commented Jul 27, 2017

Resurrecting this with a bit of an alternative / synthesis of previous ideas. The basic idea is to push metadata propagation onto the subclasses, as was previously suggested. The new proposal is for pandas to provide a bit more infrastructure for subclasses, which would remove the need for any global state.


The ._metadata attribute works pretty well for many things. As an example, here's a subclassed DataFrame that has a "color". You can only add DataFrames of the same color:

import pandas as pd


class SubclassedDataFrame2(pd.DataFrame):

    # normal properties
    _metadata = ['color']

    def __init__(self, *args, color=None, **kwargs):
        self.color = color
        super().__init__(*args, **kwargs)

    @property
    def _constructor(self):
        return SubclassedDataFrame2

    def __add__(self, other):
        if self.color != other.color:
            raise ValueError
        return super().__add__(other)

>>> a = SubclassedDataFrame2({"A": [1, 2], "B": [3, 4]}, color='red')

For things like __getitem__ the metadata propagates nicely:

>>> a[['A']].color
'red'

But binary operations don't propagate the metadata

>>> (a + a).color  # None

We could patch SubclassedDataFrame2.__add__ to manually add it ourselves, but that would get old doing it for every method.

A potential solution is for pandas to provide a Metadata class that subclasses could use to indicate if / how metadata should be propagated for any given operation. The base class would be something like

class Metadata:

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return "Metadata({})".format(self.name)

    def __add__(self, left, right):
        return None  # do not propagate

and subclasses would override the methods they want

class ColorMetadata(Metadata):

    def __add__(self, left, right):
        if {left.color, right.color} == {"blue", "yellow"}:
            return 'green'
        elif {left.color, right.color} == {"blue", "red"}:
            return "purple"
        ...

    def concat(self, left, right):
        return '-'.join([left.color, right.color])

So when defining a subclass, it would be _metadata = [ColorMetadata('color')], and pandas would call the appropriate method to figure out how to propagate metadata.

>>> b = SubclassedDataFrame2({"A": [1, 2], "C": [5, 6]}, color='blue')
>>> print((a + b).color)
purple
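
For concreteness, __finalize__ could route to those hooks roughly like this (a sketch only; the calling convention assumed here, where other is the (left, right) pair of operands, is part of the proposal rather than current pandas behaviour):

def __finalize__(self, other, method=None, **kwargs):
    # look up a hook named after the operation on each Metadata object
    for meta in type(self)._metadata:
        if isinstance(meta, Metadata) and method is not None:
            handler = getattr(meta, method, None)
            if handler is not None:
                left, right = other
                object.__setattr__(self, meta.name, handler(left, right))
    return self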

thoughts?

@jbrockmendel (Member) commented

This is a solid approach. Have you thought about how to handle Series-level metadata when that Series becomes a column in a DataFrame? e.g.

c = SubclassedDataFrame2({"A": a["A"], "C": b["C"]})

Here a["A"].color == 'red' and b["C"].color == 'blue'. What is c.color? Do we still have c["A"].color == a["A"].color?

@jbrockmendel (Member) commented

An issue that comes up with the column-specific metadata is that _constructor_sliced may need to be axis-specific.

@TomAugspurger (Contributor) commented

This has moved up a bit on my priority list. I'm hoping to use _metadata to propagate a "disallow duplicate labels" property through operations (#27108).


I'm playing with different APIs right now. The core feature I want to provide is for a given attribute to determine how metadata should be propagated for a given pandas method. I think that necessitates some kind of finalizer like

dispatch = {}  # Dict[Tuple[method, metadata_name], Callable]

def __finalize__(self, other, method=None):
    for metadata_name in self._metadata:
        dispatch[(method, metadata_name)](self, other)

There are a few options for registering finalizers with the dispatch table, but right now I'm favoring something like

duplicate_meta = PandasMetadata("disallow_duplicate_labels")  # the metadata name

@duplicate_meta.register(pd.concat)
def finalize_concat(new, other):
    new.allow_duplicate_labels = all(x.allow_duplicate_labels for x in other)

And we would provide a default finalizer that does what we do on master today (copy from other to self when other is an NDFrame).
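
That default would be roughly the following (a simplified sketch, using Series/DataFrame in place of the internal NDFrame check):

import pandas as pd

# simplified sketch of the default: copy each named attribute from ``other``
# onto the new object when ``other`` is a single Series/DataFrame
def default_finalize(new, other):
    if isinstance(other, (pd.Series, pd.DataFrame)):
        for name in new._metadata:
            object.__setattr__(new, name, getattr(other, name, None))
    return new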

The main problems I'm facing now:

  1. What is passed to the user-defined finalizers? In my concat example, I think to_concat makes sense. I think we want to pass through all the NDFrames, but this seems a little tricky to get right.
  2. What keys to use in the dispatch table? Something like pd.merge, DataFrame.merge, DataFrame.join might all end up in the same __finalize__ call. Do we want to disambiguate those? I don't know.
  3. Discovery: How do metadata authors know which functions call __finalize__, and what arguments to expect for each? Still thinking this through.

I have a work in progress at https://github.com/TomAugspurger/pandas/pull/new/metadata-dispatch.

@jbrockmendel (Member) commented

AFAICT there isn't a way to comment on that branch until a PR is opened. Can you open it as a "draft" or something?

In this discussion it seems like disallow_duplicate_labels is pinned to a Series/DataFrame, but it was originally discussed as an Index attribute. Is that distinction important?

@TomAugspurger (Contributor) commented Sep 6, 2019 via email

@mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022