Use a subclass of DataFrame to store metadata and basic methods #83

JackKelly · 2014-01-06T15:05:56Z

Back in issue #12 "How to represent a Building in memory?" we discussed whether or not to subclass pandas.DataFrame.

@jreback very kindly mentioned that GeoPandas creates a subclass of pd.DataFrame called GeoDataFrame which carries metadata, and that this metadata is copied when methods inherited from pandas.DataFrame, like dropna(), are called. There is more info in the pandas pull request relating to metadata propagation; also see pandas-dev/pandas#2485.

There's quite a lot of text below. If you can't be bothered to read all that then here's a summary:

Summary

Subclassing pandas.DataFrame (and having metadata propagate after Pandas methods are called) appears to work only in Pandas-0.13-dev, not in the current Pandas release (0.12). Propagating metadata should work in Pandas 0.13.
My suggestion would be that we don't worry about subclassing DataFrame for now. Even if we did use it now (and hence require users to install Pandas-dev) then I don't think it would change our paper much, if at all.
After Pandas 0.13 is released, let's re-visit the idea of subclassing DataFrame.
Personally, I quite like the idea of storing metadata in a subclass of DataFrame; it feels tidy.

Some details:

Why subclass DataFrame?

We currently use one DataFrame per meter (e.g. per appliance meter; or per mains meter) and we store metadata about that channel in a separate dict. This feels a little fragile, because we have to write the code to ensure that the metadata stays in sync with each DataFrame. It feels conceptually cleaner to store metadata inside the DataFrame object. Right now, in Pandas 0.12, we could do this just by adding a dataframe._metadata attribute, but in Pandas 0.12 that ._metadata attribute will not propagate if we use any Pandas DataFrame method (e.g. dropna()). In Pandas 0.13, that ._metadata attribute will propagate. GeoPandas stores metadata in their GeoSeries and GeoDataFrame classes.
If we kept metadata with the DataFrame then we could have a tidy mechanism for tracking what pre-processing has been applied to each DataFrame. At the moment, our preprocessing.electricity.single functions take a DataFrame as an argument and return a new DataFrame but there is no tidy way to update the metadata associated with the DataFrame to record the fact that that preprocessing function has been run. Recording the sequence of pre-processing steps done on each DataFrame feels like an important "open science" contribution ;)
We could also record which entries have some uncertainty associated with them (e.g. because these entries have been inserted during pre-processing).
We could have a tidy place to cache attributes like sample_period.
In the Pandas API, Series and DataFrame both have lots of useful stats / pre-processing methods (like dropna(), resample(), max(), describe() etc). It feels like some of our preprocessing / stats / plotting functions which act on a single DataFrame deserve to be class methods rather than separate functions.
I'm not sure, but I think that _metadata might automatically be pulled into the HDFStore object.
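To make the idea of keeping metadata and a pre-processing record with the DataFrame more concrete, here is a minimal sketch. It assumes a Pandas version in which metadata propagation works (0.13 or later, per the summary above) and uses the convention of listing the names of the extra attributes in a class-level `_metadata` list. The names `MeterFrame`, `meter_info`, `preprocessing_log` and `drop_missing` are hypothetical and exist only for illustration; this is not how our code is currently written.

```python
import pandas as pd


class MeterFrame(pd.DataFrame):
    """Hypothetical DataFrame subclass that carries per-meter metadata."""

    # Names of the extra attributes that pandas should copy onto results.
    _metadata = ["meter_info", "preprocessing_log"]

    @property
    def _constructor(self):
        # Make pandas methods return a MeterFrame rather than a plain DataFrame.
        return MeterFrame


def drop_missing(frame):
    """Hypothetical pre-processing step that records itself in the log."""
    result = frame.dropna()
    # Build a new list so that the log of the input frame is left untouched.
    result.preprocessing_log = list(frame.preprocessing_log) + ["drop_missing"]
    return result


index = pd.date_range("2010-01-01", freq="D", periods=3)
raw = MeterFrame({"power": [1.0, None, 3.0]}, index=index)
raw.meter_info = {"appliance": "kettle"}
raw.preprocessing_log = []

clean = drop_missing(raw)
print(type(clean).__name__, clean.meter_info, clean.preprocessing_log)
```

In this sketch the metadata travels with the result of `dropna()`, so there is no separate dict to keep in sync, and each pre-processing function only has to append its own name to the log.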

Why did I bother to tinker with subclassing now, when there are other priorities?

Because, if we could get subclassing to work tidily now, I'd advocate implementing subclassing now (which would have implications for some of the code I need to write over the next few days!)

Experiments

I ran the code below on both Pandas 0.12 and Pandas 0.13-DEV. The code comes first, followed by the results for each version:

Code

from __future__ import print_function, division
import pandas as pd

class ElectricDataFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super(ElectricDataFrame, self).__init__(*args, **kwargs)
        # Metadata under test; note that __finalize__ below iterates over this dict's keys.
        self._metadata = {'test': 'TEST'}

    def foo(self):
        print("FOOOO!")

    #
    # Implement pandas methods
    #
    @property
    def _constructor(self):
        # Borrowed from GeoPandas.GeoDataFrame._constructor
        print("_constructor called")
        return ElectricDataFrame

    def __finalize__(self, other, method=None, **kwargs):
        """ propagate metadata from other to self """
        # Borrowed from GeoPandas.GeoDataFrame.__finalize__
        # NOTE: backported from pandas master (upcoming v0.13)
        print("__finalize__ called")
        for name in self._metadata:
            object.__setattr__(self, name, getattr(other, name, None))
        return self

    def copy(self, deep=True):
        """
        Make a copy of this ElectricDataFrame object

        Parameters
        ----------
        deep : boolean, default True
            Make a deep copy, i.e. also copy data

        Returns
        -------
        copy : ElectricDataFrame
        """
        # Borrowed from GeoPandas.GeoDataFrame.copy
        # FIXME: this will likely be unnecessary in pandas >= 0.13
        print("copy called")
        data = self._data
        if deep:
            data = data.copy()
        return ElectricDataFrame(data).__finalize__(self)

print('pandas.__version__ =', pd.__version__)

def try_to_get_metadata(df):
    try:
        print(df._metadata)
    except Exception as e:
        print('EXCEPTION:', e)

edf = ElectricDataFrame([1, 2, 3, 4, 5], pd.date_range('2010', freq='D', periods=5))
print('edf._metadata:', edf._metadata)
print()
print("edf.resample(rule='D', how='max')._metadata:")
try_to_get_metadata(edf.resample(rule='D', how='max'))
print()
print("edf.resample(rule='D')._metadata:")
try_to_get_metadata(edf.resample(rule='D'))
print()
print('edf.dropna()._metadata:')
try_to_get_metadata(edf.dropna())

Run on pandas 0.12:

In [14]: pandas.__version__ = 0.12.0
edf._metadata: {'test': 'TEST'}

edf.resample(rule='D', how='max')._metadata:
EXCEPTION: 'DataFrame' object has no attribute '_metadata'

edf.resample(rule='D')._metadata:
EXCEPTION: 'DataFrame' object has no attribute '_metadata'

edf.dropna()._metadata:
EXCEPTION: 'DataFrame' object has no attribute '_metadata'

Run on pandas 0.13-dev:

In [4]: pandas.__version__ = 0.13.0-75-g7d9e9fa
edf._metadata: {'test': 'TEST'}

edf.resample(rule='D', how='max')._metadata:
[]

edf.resample(rule='D')._metadata:
copy called
__finalize__ called
{'test': 'TEST'}

edf.dropna()._metadata:
_constructor called
__finalize__ called
{'test': 'TEST'}
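
If we re-run these checks after the Pandas 0.13 release, the per-method tests could be wrapped in a small helper that reports, for each operation, whether the subclass and its metadata survived. The following is only a sketch under the same assumption as the results above (a Pandas version with metadata propagation); `TaggedFrame`, `tag` and `probe` are hypothetical names, and the set of operations is just an example that can be extended with whatever resample calls the installed Pandas version supports.

```python
import pandas as pd


class TaggedFrame(pd.DataFrame):
    """Hypothetical subclass used only to probe metadata propagation."""

    # Name of the single extra attribute that pandas should propagate.
    _metadata = ["tag"]

    @property
    def _constructor(self):
        return TaggedFrame


def probe(frame, operation):
    """Apply `operation` and report whether the subclass and tag survived."""
    try:
        result = operation(frame)
    except Exception as error:  # report the failure instead of stopping
        return {"error": repr(error)}
    return {"type": type(result).__name__, "tag": getattr(result, "tag", None)}


index = pd.date_range("2010-01-01", freq="D", periods=3)
frame = TaggedFrame({"power": [1.0, None, 3.0]}, index=index)
frame.tag = "TEST"

operations = {
    "dropna": lambda f: f.dropna(),
    "copy": lambda f: f.copy(),
    # Deliberately converts to a plain DataFrame, as a negative control.
    "plain DataFrame": lambda f: pd.DataFrame(f),
}
report = {name: probe(frame, op) for name, op in operations.items()}
for name, outcome in report.items():
    print(name, outcome)
```

The negative control makes it easy to see what a failed propagation looks like in the report: the type falls back to DataFrame.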

JackKelly · 2014-01-09T06:24:12Z

The Pandas 0.13 milestone is now closed... I expect there'll be a Pandas 0.13 release very soon.

ghost assigned JackKelly Jan 6, 2014

This was referenced Jan 6, 2014

API: implement __finalize__ for resample et al. pandas-dev/pandas#5862

Closed

What to fill for missing entries (for non power data) #89

Closed

JackKelly closed this as completed Jul 12, 2014
