Use a subclass of DataFrame to store metadata and basic methods #83

Closed
JackKelly opened this issue Jan 6, 2014 · 1 comment

@JackKelly

Back in issue #12 "How to represent a Building in memory?" we discussed whether or not to subclass pandas.DataFrame.

@jreback very kindly mentioned that GeoPandas creates a subclass of pd.DataFrame called GeoDataFrame which carries metadata, and that this metadata is copied when methods inherited from pandas.DataFrame, such as dropna(), are called. There is more information in the pandas pull request relating to metadata propagation; also see pandas-dev/pandas#2485.

There's quite a lot of text below. If you can't be bothered to read all that then here's a summary:

Summary

  • Subclassing pandas.DataFrame (and having metadata propagate after Pandas methods are called) appears to work only in Pandas 0.13-dev, not in the current Pandas release (0.12). Metadata propagation should work in Pandas 0.13 once it is released.
  • My suggestion would be that we don't worry about subclassing DataFrame for now. Even if we did use it now (and hence require users to install Pandas-dev) then I don't think it would change our paper much, if at all.
  • After Pandas 0.13 is released, let's re-visit the idea of subclassing DataFrame.
  • Personally, I quite like the idea of storing metadata in a subclass of DataFrame; it feels tidy.

Some details:

Why subclass DataFrame?

  • We currently use one DataFrame per meter (e.g. per appliance meter or per mains meter) and we store metadata about that channel in a separate dict. This feels a little fragile, because we have to write the code to ensure that the metadata stays in sync with each DataFrame. It feels conceptually cleaner to store metadata inside the DataFrame object. Right now, in Pandas 0.12, we could do this just by adding a dataframe._metadata attribute, but that attribute will not propagate if we use any Pandas DataFrame method (e.g. dropna()). In Pandas 0.13, the ._metadata attribute will propagate. GeoPandas stores metadata in its GeoSeries and GeoDataFrame classes.
  • If we kept metadata with the DataFrame then we could have a tidy mechanism for tracking what pre-processing has been applied to each DataFrame. At the moment, our preprocessing.electricity.single functions take a DataFrame as an argument and return a new DataFrame but there is no tidy way to update the metadata associated with the DataFrame to record the fact that that preprocessing function has been run. Recording the sequence of pre-processing steps done on each DataFrame feels like an important "open science" contribution ;)
  • We could also record which entries have some uncertainty associated with them (e.g. because these entries have been inserted during pre-processing).
  • We could have a tidy place to cache attributes like sample_period
  • In the Pandas API, Series and DataFrame both have lots of useful stats / pre-processing methods (like dropna(), resample(), max(), describe() etc). It feels like some of our preprocessing / stats / plotting functions which act on a single DataFrame deserve to be methods on the class rather than separate functions.
  • I'm not sure, but I think that _metadata might automatically be pulled into the HDFStore object.
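The ideas in the list above (metadata that travels with the data, a record of pre-processing steps, a home for sample_period) can be sketched roughly as follows. This is only a sketch under stated assumptions, not this project's implementation: it assumes a recent pandas release in which _metadata is a class-level list of attribute names that pandas copies onto the results of its own methods, and the names MeterFrame, meter_info, history and drop_gaps are all hypothetical.

```python
import pandas as pd


class MeterFrame(pd.DataFrame):
    """Hypothetical subclass: one meter's readings plus its metadata."""

    # Names of attributes that pandas should copy onto results.
    _metadata = ["meter_info", "history"]

    # Class-level defaults so the attributes always exist.
    meter_info = None
    history = ()

    @property
    def _constructor(self):
        # Makes pandas methods return MeterFrame instead of DataFrame.
        return MeterFrame

    def drop_gaps(self):
        """Drop missing readings and record the step in `history`."""
        result = self.dropna()
        result.history = tuple(self.history) + ("drop_gaps",)
        return result


index = pd.date_range("2010-01-01", periods=4, freq="D")
raw = MeterFrame({"power": [1.0, None, 3.0, 4.0]}, index=index)
raw.meter_info = {"name": "fridge", "sample_period": 86400}

clean = raw.drop_gaps()
print(type(clean).__name__)   # class of the result
print(clean.meter_info)       # metadata carried over by pandas
print(clean.history)          # pre-processing steps applied so far
print(raw.history)            # the original frame is left untouched
```

Keeping history as a tuple means each pre-processing step produces a new record rather than mutating one shared between the input and output frames. Note that meter_info is copied by reference, so the input and output frames share the same dict.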

Why did I bother to tinker with subclassing now, when there are other priorities?

Because, if we could get subclassing to work tidily now, I'd advocate implementing subclassing now (which would have implications for some of the code I need to write over the next few days!)

Experiments

I ran the code below on both Pandas 0.12 and Pandas 0.13-dev. The code comes first, followed by the results on each version:

Code

from __future__ import print_function, division
import pandas as pd

class ElectricDataFrame(pd.DataFrame):
    def __init__(self, *args, **kwargs):
        super(ElectricDataFrame, self).__init__(*args, **kwargs)
        self._metadata = {'test': 'TEST'}

    def foo(self):
        print("FOOOO!")

    #
    # Implement pandas methods
    #
    @property
    def _constructor(self):
        # Borrowed from GeoPandas.GeoDataFrame._constructor
        print("_constructor called")
        return ElectricDataFrame

    def __finalize__(self, other, method=None, **kwargs):
        """ propagate metadata from other to self """
        # Borrowed from GeoPandas.GeoDataFrame.__finalize__
        # NOTE: backported from pandas master (upcoming v0.13)
        print("__finalize__ called")
        for name in self._metadata:
            object.__setattr__(self, name, getattr(other, name, None))
        return self

    def copy(self, deep=True):
        """
        Make a copy of this ElectricDataFrame object

        Parameters
        ----------
        deep : boolean, default True
            Make a deep copy, i.e. also copy data

        Returns
        -------
        copy : ElectricDataFrame
        """
        # Borrowed from GeoPandas.GeoDataFrame.copy
        # FIXME: this will likely be unnecessary in pandas >= 0.13
        print("copy called")
        data = self._data
        if deep:
            data = data.copy()
        return ElectricDataFrame(data).__finalize__(self)

print('pandas.__version__ =', pd.__version__)

def try_to_get_metadata(df):
    try:
        print(df._metadata)
    except Exception as e:
        print('EXCEPTION:', e)

edf = ElectricDataFrame([1, 2, 3, 4, 5], pd.date_range('2010', freq='D', periods=5))
print('edf._metadata:', edf._metadata)
print()
print("edf.resample(rule='D', how='max')._metadata:")
try_to_get_metadata(edf.resample(rule='D', how='max'))
print()
print("edf.resample(rule='D')._metadata:")
try_to_get_metadata(edf.resample(rule='D'))
print()
print('edf.dropna()._metadata:')
try_to_get_metadata(edf.dropna())

Run on pandas 0.12:

In [14]: pandas.__version__ = 0.12.0
edf._metadata: {'test': 'TEST'}

edf.resample(rule='D', how='max')._metadata:
EXCEPTION: 'DataFrame' object has no attribute '_metadata'

edf.resample(rule='D')._metadata:
EXCEPTION: 'DataFrame' object has no attribute '_metadata'

edf.dropna()._metadata:
EXCEPTION: 'DataFrame' object has no attribute '_metadata'

Run on pandas 0.13-dev:

In [4]: pandas.__version__ = 0.13.0-75-g7d9e9fa
edf._metadata: {'test': 'TEST'}

edf.resample(rule='D', how='max')._metadata:
[]

edf.resample(rule='D')._metadata:
copy called
__finalize__ called
{'test': 'TEST'}

edf.dropna()._metadata:
_constructor called
__finalize__ called
{'test': 'TEST'}
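
One caveat when reading the 0.13-dev output: ElectricDataFrame.__init__ assigns _metadata on every new instance, so {'test': 'TEST'} would be printed whether or not anything was copied from the original frame. The [] printed for resample(rule='D', how='max') is consistent with a plain DataFrame coming back, since pandas' own classes default _metadata to an empty list. Below is a minimal sketch of a check that separates "set again by __init__" from "copied by pandas". It assumes a recent pandas release in which _metadata is a class-level list of attribute names; the class names, the note attribute and the helper function are hypothetical.

```python
import pandas as pd


class DefaultOnlyFrame(pd.DataFrame):
    """Hypothetical: assigns a default in __init__, as the experiment does."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Every new instance gets this value, including results of dropna().
        object.__setattr__(self, "note", "default")

    @property
    def _constructor(self):
        return DefaultOnlyFrame


class PropagatingFrame(pd.DataFrame):
    """Hypothetical: registers the attribute name with pandas instead."""

    _metadata = ["note"]  # attribute names for pandas to copy onto results
    note = "default"      # class-level default

    @property
    def _constructor(self):
        return PropagatingFrame


def note_after_dropna(frame_class):
    """Change `note` on one frame, then read it back from dropna()'s result."""
    frame = frame_class({"power": [1.0, None, 3.0]})
    frame.note = "changed"  # deliberately different from the default
    return frame.dropna().note


default_only_note = note_after_dropna(DefaultOnlyFrame)
propagating_note = note_after_dropna(PropagatingFrame)
print("DefaultOnlyFrame:", default_only_note)
print("PropagatingFrame:", propagating_note)
```

Under those assumptions the first frame should report the default value again and the second should report the changed one, because only the second asks pandas to copy the attribute across. Changing the value after construction is what makes the two cases distinguishable.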
@JackKelly

The Pandas 0.13 milestone is now closed... I expect there'll be a Pandas 0.13 release very soon.
