There's quite a lot of text below. If you can't be bothered to read all that then here's a summary:
Summary
Subclassing pandas.DataFrame (and having metadata propagate after Pandas methods are called) appears to work only in Pandas 0.13-dev, not in the current Pandas release (0.12). Metadata propagation should work in Pandas 0.13.
My suggestion would be that we don't worry about subclassing DataFrame for now. Even if we did use it now (and hence require users to install Pandas-dev) then I don't think it would change our paper much, if at all.
After Pandas 0.13 is released, let's re-visit the idea of subclassing DataFrame.
Personally, I quite like the idea of storing metadata in a subclass of DataFrame; it feels tidy.
Some details:
Why subclass DataFrame?
We currently use one DataFrame per meter (e.g. per appliance meter or per mains meter) and we store metadata about that channel in a separate dict. This feels a little fragile, because we have to write the code to ensure that the metadata stays in sync with each DataFrame. It feels conceptually cleaner to store metadata inside the DataFrame object. Right now, in Pandas 0.12, we could do this just by adding a dataframe._metadata attribute, but in Pandas 0.12 that ._metadata attribute will not propagate if we use any Pandas DataFrame method (e.g. dropna()). In Pandas 0.13, that ._metadata attribute will propagate. GeoPandas stores metadata in their GeoSeries and GeoDataFrame classes.
If we kept metadata with the DataFrame then we could have a tidy mechanism for tracking what pre-processing has been applied to each DataFrame. At the moment, our preprocessing.electricity.single functions take a DataFrame as an argument and return a new DataFrame but there is no tidy way to update the metadata associated with the DataFrame to record the fact that that preprocessing function has been run. Recording the sequence of pre-processing steps done on each DataFrame feels like an important "open science" contribution ;)
We could also record which entries have some uncertainty associated with them (e.g. because these entries have been inserted during pre-processing).
We could have a tidy place to cache attributes like sample_period.
In the Pandas API, Series and DataFrame both have lots of useful stats / pre-processing methods (like dropna(), resample(), max(), describe() etc). It feels like some of our preprocessing / stats / plotting functions which act on a single DataFrame deserve to be methods on the class rather than separate functions.
I'm not sure, but I think that _metadata might automatically be pulled into the HDFStore object.
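The idea above of recording which pre-processing steps have been applied could be sketched roughly as follows. This is only an illustration in plain Python, not code from our package: the `records_step` decorator, the `metadata` attribute, and the two stand-in steps are all hypothetical names, and `SimpleNamespace` stands in for a DataFrame subclass so the sketch runs without Pandas.

```python
import functools
from types import SimpleNamespace


def records_step(func):
    """Wrap a pre-processing function so its name is logged on the result.

    Assumes (hypothetically) that inputs and outputs expose a dict-like
    ``metadata`` attribute; nothing here is part of pandas.
    """
    @functools.wraps(func)
    def wrapper(data, *args, **kwargs):
        result = func(data, *args, **kwargs)
        # Copy the input's history so the original object is left untouched.
        history = list(data.metadata.get('preprocessing', []))
        history.append(func.__name__)
        result.metadata = dict(data.metadata, preprocessing=history)
        return result
    return wrapper


@records_step
def drop_missing(data):
    # Stand-in for a real pre-processing step: discard None readings.
    values = [v for v in data.values if v is not None]
    return SimpleNamespace(values=values, metadata={})


@records_step
def clip_negative(data):
    # Second stand-in step: replace negative readings with zero.
    values = [max(v, 0) for v in data.values]
    return SimpleNamespace(values=values, metadata={})


raw = SimpleNamespace(values=[5, None, -2, 7], metadata={'meter': 'mains'})
clean = clip_negative(drop_missing(raw))
print(clean.values)
print(clean.metadata)
```

Because the decorator only relies on a `metadata` attribute, the same pattern could in principle wrap functions that take and return a DataFrame subclass, provided that subclass carries its metadata through Pandas operations.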
Why did I bother to tinker with subclassing now, when there are other priorities?
Because, if we could get subclassing to work tidily now, I'd advocate implementing subclassing now (which would have implications for some of the code I need to write over the next few days!)
Experiments
I tried running this code on both Pandas 0.12 and Pandas 0.13-DEV:
I'll print the code first, and then the results on Pandas 0.12 and 0.13-DEV:
Code
from __future__ import print_function, division
import pandas as pd


class ElectricDataFrame(pd.DataFrame):

    def __init__(self, *args, **kwargs):
        super(ElectricDataFrame, self).__init__(*args, **kwargs)
        self._metadata = {'test': 'TEST'}

    def foo(self):
        print("FOOOO!")

    #
    # Implement pandas methods
    #

    @property
    def _constructor(self):
        # Borrowed from GeoPandas.GeoDataFrame._constructor
        print("_constructor called")
        return ElectricDataFrame

    def __finalize__(self, other, method=None, **kwargs):
        """ propagate metadata from other to self """
        # Borrowed from GeoPandas.GeoDataFrame.__finalize__
        # NOTE: backported from pandas master (upcoming v0.13)
        print("__finalize__ called")
        for name in self._metadata:
            object.__setattr__(self, name, getattr(other, name, None))
        return self

    def copy(self, deep=True):
        """
        Make a copy of this ElectricDataFrame object

        Parameters
        ----------
        deep : boolean, default True
            Make a deep copy, i.e. also copy data

        Returns
        -------
        copy : ElectricDataFrame
        """
        # Borrowed from GeoPandas.GeoDataFrame.copy
        # FIXME: this will likely be unnecessary in pandas >= 0.13
        print("copy called")
        data = self._data
        if deep:
            data = data.copy()
        return ElectricDataFrame(data).__finalize__(self)


print('pandas.__version__ =', pd.__version__)


def try_to_get_metadata(df):
    try:
        print(df._metadata)
    except Exception as e:
        print('EXCEPTION:', e)


edf = ElectricDataFrame([1, 2, 3, 4, 5], pd.date_range('2010', freq='D', periods=5))
print('edf._metadata:', edf._metadata)
print()

print('edf.resample(rule=\'D\', how=\'max\')._metadata:')
try_to_get_metadata(edf.resample(rule='D', how='max'))
print()

print('edf.resample(rule=\'D\')._metadata:')
try_to_get_metadata(edf.resample(rule='D'))
print()

print('edf.dropna()._metadata:')
try_to_get_metadata(edf.dropna())
Back in issue #12 "How to represent a Building in memory?" we discussed whether or not to subclass pandas.DataFrame. @jreback very kindly mentioned that GeoPandas creates a subclass of pd.DataFrame called GeoDataFrame which carries metadata, and that metadata is copied when methods inherited from pandas.DataFrame, like dropna(), are called. More info is in the pandas pull request relating to metadata propagation; also see pandas-dev/pandas#2485.
Run on pandas 0.12:
Run on pandas 0.13-dev: