Allow custom metadata to be attached to panel/df/series? #2485

Open
y-p opened this Issue Dec 11, 2012 · 82 comments

@y-p
Contributor

y-p commented Dec 11, 2012

related:
#39 (column descriptions)
#686 (serialization concerns)
#447 (comment) (Feature request, implementation variant)

Ideas and issues:

  • is pickle stable enough?
  • serialization req. potentially couples this with storage format
  • allow arbitrary python objects?
  • impl. option: JSONables-only + type tag + load/save callback hook (see the sketch below)
  • getter/setter or attribute-style access (limits naming)?
  • flat or hierarchical?
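A minimal sketch of the "JSONables-only + type tag + callback hook" option above (all names here are hypothetical, not pandas API):

import json

_load_hooks = {}  # type tag -> callable that reconstructs the object

def register_meta_hook(tag, loader):
    # register a callback that rebuilds an object from its JSON payload
    _load_hooks[tag] = loader

def dump_meta(meta):
    # values must be JSONable; externally-managed objects are stored
    # as a [tag, payload] pair
    return json.dumps(meta)

def load_meta(s):
    meta = json.loads(s)
    out = {}
    for key, value in meta.items():
        if isinstance(value, list) and len(value) == 2 and value[0] in _load_hooks:
            out[key] = _load_hooks[value[0]](value[1])  # reconstruct at load time
        else:
            out[key] = value
    return out

register_meta_hook("date", lambda payload: payload)  # trivial example loader
s = dump_meta({"origin": "tower1", "observed": ["date", "1981-01-01"]})
print(load_meta(s))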
@jreback
Contributor

jreback commented Dec 11, 2012

Storage of this data is pretty easy to implement in HDFStore
(not pushing HDFStore as a general mechanism!).

General thoughts on metadata:

  • a single meta attribute avoids lots of issues
  • vote for 'meta'
  • allow hierarchical
  • allow anything (i.e. no type checking)
  • do you keep copying it? just a shallow copy, I think
  • make serialization JSON-only

Specific to HDFStore:

  • serialization (PyTables already has this requirement): maybe raise/warn that the user is going to lose data
  • what to do when appending (and not overwriting)? maybe ignore it, and force the user to pass an option, meta=True?
@y-p
Contributor

y-p commented Dec 11, 2012

PyTables is a very good fit in terms of features, but:

  • PyTables is not currently a hard dependency of pandas.
  • It is not currently available for Python 3(?).
  • It does not support in-memory databases(?).
  • Also, it cannot itself be pickled or JSONed easily, so you would have to embed a binary HDF5 file.
@jreback
Contributor

jreback commented Dec 11, 2012

Oh - I was not suggesting we use this as a backend for storage
of metadata in general (the above points were my general comments on
metadata - reading it again, it DOES look like I am pushing HDFStore).

I was just pointing out that HDFStore can support metadata if pandas structures do.

To answer your questions:

  • not a hard dependency - nor should pandas make it one
  • not yet py3 (being worked on now, I believe)
  • not in-memory capable
  • HDF is meant to be an external format
    (so you would have to include a separate file)
@y-p
Contributor

y-p commented Dec 11, 2012

+1 for all metadata living under a single attribute.

I'm against allowing non-serializable objects as metadata, at all. But I'm not sure
whether that should be a constraint on the objects or on the serialization format.

In any case, a hook + type-tag mechanism would allow users to plant IDs of external
objects and reconstruct things at load time.
I've been thinking of suggesting a hooking mechanism elsewhere (for custom representations
of dataframes - viz, HTML and so on).

@gerigk

gerigk commented Dec 11, 2012

What do you mean by "not in-memory capable"?

HDF5 has an in-memory + stdout writer, and PyTables support for it has been added
recently (PyTables/PyTables#173).
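For reference, PyTables (3.x) exposes the HDF5 in-memory core driver roughly like this (current API, hedged; the feature was brand new when this comment was written):

import tables

# backing_store=0 keeps the file entirely in memory, never touching disk
h5 = tables.open_file('in_mem.h5', mode='w', driver='H5FD_CORE',
                      driver_core_backing_store=0)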


@y-p
Contributor

y-p commented Dec 11, 2012

Oh, I wasn't aware of that and didn't find anything in the online docs.
This seems to have been added after the latest 2.4.0 PyTables release, so it's not
yet available from PyPI or the distros.


@hugadams

hugadams commented Dec 11, 2012

Thanks for including me on this request, y-p.

IMO, we should not try to prohibit objects as metadata based on their serialization capacity. I only say this because how would one account for every possible object? For example, Chaco plots from the Enthought Tool Suite don't serialize easily, but who would know that unless they tried? I think it's best to let users put anything in as metadata; if it can't serialize, they'll know when an error is thrown. It is also possible to have the program serialize everything but the metadata and then alert the user that this aspect has been lost.

Does anyone here know the pandas source code well enough to understand how to implement something like this? I really don't have a clue, but I hope this isn't asking too much of the developers.

Also, I think this addition will be a nice way to appease people who are always looking to subclass a DataFrame.

  • upvote for the attribute being called 'meta'
  • upvote for putting it on Index classes as well as Series, DataFrame and Panel


@dalejung
Contributor

dalejung commented Dec 12, 2012

Last time I checked, HDF5 has a limit on the size of the AttributeSet. I had to work around it by having my store object encapsulate a directory, with the .h5 file and pickled meta objects.


@dalejung
Contributor

dalejung commented Dec 12, 2012

I think that adding metadata to the DataFrame object requires that it serialize and work with all backends (pickle, HDF5, etc.), which probably means restricting the types of metadata that can be added. There are corner cases to pickling custom classes that would become pandas problems.


@hugadams

hugadams commented Dec 12, 2012

Hi guys, I'm a bit curious about something. This fix currently addresses adding custom attributes to a dataframe. The values of these attributes - they can be Python functions, no? If so, this might be a workaround for adding custom instance methods to a dataframe. I know some people way back when were interested in this possibility.

I think the way this could work is that the dataframe gets a new method, call it... I dunno, add_custom_method(). This would take a function, then add it to the 'meta' attribute dictionary, with some sort of tag to let the program know it is special. A rough sketch follows below.

When the proposed new machinery assigns custom attributes to a new dataframe, it might also be neat to automatically promote such a function to an instance method. If it could do that, then we would have a way to effectively subclass a DataFrame without actually doing so.

This is likely overkill for the first go-around, but maybe something to think about down the road.
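A hedged sketch of the add_custom_method() idea described above (the helper name and the tagging scheme are hypothetical, not pandas API):

import types
import pandas as pd

def add_custom_method(df, func, name=None):
    # store the function in the frame's meta dict, tagged for later
    # re-promotion, then bind it to this instance as a method
    name = name or func.__name__
    if not hasattr(df, 'meta'):
        object.__setattr__(df, 'meta', {})
    df.meta[name] = func
    object.__setattr__(df, name, types.MethodType(func, df))
    return df

def doubled(self):
    return self * 2

df = add_custom_method(pd.DataFrame({'A': [1, 2, 3]}), doubled)
print(df.doubled())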


@jreback
Contributor

jreback commented Dec 12, 2012

@dalejung do you have a link to the AttributeSet limit?
@hugadams you can simply monkey-patch if you want custom instance methods:

import pandas

def my_func(self, **kwargs):
    return self * 2

pandas.DataFrame.my_func = my_func
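For instance, building on the snippet above (a usage sketch; the frame contents are made up):

df = pandas.DataFrame({'A': [1, 2, 3]})
print(df.my_func())  # every value doubled, via the patched-in method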
@hugadams

hugadams commented Dec 12, 2012

@jreback: Thanks for pointing this out, man. I've heard of monkey-patching instance methods, but always thought it was more of a colloquialism for something more difficult.

Thanks for showing me this.


@dalejung
Contributor

dalejung commented Dec 12, 2012

@jreback http://www.hdfgroup.org/HDF5/doc/UG/13_Attributes.html#SpecIssues maybe? It's been a while, and it could be that PyTables hasn't implemented the newer HDF5 features.

Personally, I had a dataset with ~40k items of metadata. Nothing complicated, just large. It was much easier to just pickle that stuff separately and use HDF for the actual data.


@jreback
Contributor

jreback commented Dec 12, 2012

@dalejung thanks for the link... I am not sure of use cases for metadata beyond simple structures anyhow... if you have regular data you can always store it as separate structures, or pickle it, or whatever...


@jreback
Contributor

jreback commented Dec 12, 2012

@hugadams np....good luck


@dalejung
Contributor

dalejung commented Dec 12, 2012

@jreback sure, but that's kind of the state now. You can use DataFrames as attributes of custom classes. You can keep track of your metadata separately.

My point is that there would be an expectation for the DataFrame metadata serialization to work. The HDF5 limit is worse because it's based on size and not type, which means it can work until it suddenly does not.

There are always going to be use-cases we don't think of. Adding a metadata attribute that sometimes saves will be asking for trouble.


@jreback
Contributor

jreback commented Dec 12, 2012

@dalejung ok... once PR #2497 is merged in, you can try this out in a limited way (limited because DataFrames don't 'yet' pass this around). We could catch errors if you try to store too much (not much to do in this case EXCEPT fail).


@ghost ghost assigned y-p Dec 13, 2012

@y-p
Contributor

y-p commented Dec 13, 2012

Looks like the arguments for and against on the thorny serialization issue are clear.

Here is another thorny issue - what are the semantics of propagating meta through operations?

df1.meta.observation_date = "1/1/1981"
df1.meta.origin = "tower1"
df2.meta.observation_date = "1/1/1982"
df2.meta.origin = "tower2"

df3 = pd.concat([df1, df2])
# or merge, addition, ix, apply, etc.

Now, what's the "correct" meta for df3?

  • Besides load/save, users usually perform operations on dataframes. If most operations (combination,
    mutation, slicing...) invalidate all the meta tags (i.e. make them wrong or drop them completely), what's the
    remaining use case for metadata?
  • If we start defining an algebra for combining meta, this is perhaps getting too complicated
    to be attractive.

I'd be interested to hear specific examples of the problems you hope this will solve for you.
What are the kinds of meta tags you wish you had for your work?


@dalejung
Contributor

dalejung commented Dec 13, 2012

@y-p I agree that the propagation logic gets wonky. From experience, whether to propagate meta1/meta2/nothing is specific to the situation and doesn't follow any general rule.

Maybe the need for metadata would be fulfilled by easier composition tools? For example, I tend to delegate attribute calls to the child dataframe and also connect the repr/str (see the sketch below). There are certain conveniences that pandas provides that you lose with a simple composition.

Thinking about it, an API like numpy's array interface might be useful to allow composition classes to substitute for DataFrames.
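A minimal sketch of the delegation pattern described above (the class name and fields are hypothetical):

import pandas as pd

class FrameHolder(object):
    def __init__(self, df, meta=None):
        self.df = df
        self.meta = meta or {}

    def __getattr__(self, name):
        # only called when normal lookup fails: forward to the wrapped frame
        return getattr(self.df, name)

    def __repr__(self):
        # connect repr to the child dataframe
        return repr(self.df)

fh = FrameHolder(pd.DataFrame({'A': [1, 2, 3]}), meta={'origin': 'tower1'})
print(fh.head())  # forwarded to the underlying DataFrame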


@hugadams

hugadams commented Dec 13, 2012

Hi y-p. You bring up very good points with regard to merging. My thought would be that merged quantities that share keys should store their results in a tuple instead of overwriting; however, this is still an unfavorable situation.

You know, once the monkey-patching was made clear to me by jreback, I realized that it could most likely give me all the functionality I was looking for from custom attributes. Perhaps what would be more helpful at this point, rather than custom attributes, would be a small tutorial on the main page about how to monkey-patch and customize pandas data structures.

In my personal situation, I no longer feel that custom metadata would really make or break my projects if monkey-patching is adequate; however, you guys have a better overview of pandas, so I think it really is your judgement call whether the pros of metadata would outweigh the cons.


@y-p
Contributor

y-p commented Dec 14, 2012

Thanks for all the ideas; here is my summary:

  1. It might be useful to attach metadata to serialized files, as opposed to live objects.
  2. People want to extend functionality in a natural way, rather than by adding metadata,
    even if it makes no sense to have it as part of upstream.
    Monkey-patching is a useful idiom for that. I use it myself in my IPython startup file.
    (#2530)
  3. Allowing arbitrary metadata to be added to live objects makes little sense when mutation
    is inevitably involved. Well-defined metadata tags are bound to be either domain-specific,
    or suitable to be "baked in" when general enough.
  4. There might be an existing need for a "scientific data container" file format, probably to be
    designed by a committee over several years, producing a 250-page standard with a name like
    IKF-B76/J-2017, not adopted by anyone outside the US DOD energy research lab community.
    pandas is not it, though.

Dropping the milestone for now, but I will leave this open if someone has more to add.
If you need (1), please open an issue and explain your use case.


@hugadams

hugadams commented Dec 18, 2012

Hey y-p, thanks for leaving this open. It turns out that monkey-patching has not solved my problem as I originally thought it would.

Yes, monkey-patching does allow one to add custom instance methods and attributes to a dataframe; however, any operation that results in a new dataframe will not retain the values of these custom attributes.

From an email currently on the mailing list:

import pandas

pandas.DataFrame.name = None

df = pandas.DataFrame()
df.name = 'Bill'
df.name
>>> 'Bill'

df2 = df.mul(50)
df2.name
>>>

I've put together a custom dataframe for spectroscopy that I'm very excited about putting at the center of a new spectroscopy package, but I realized that every operation that returns a new dataframe resets all of my custom attributes. The instance methods and slots for the attributes are retained, so this is better than nothing, but it is still going to hamper my program.

The only workaround I can find is to add some sort of attribute-transfer function to every single dataframe method that I want to work with my custom dataframe. Thus, the whole point of making my object a custom dataframe is lost.

With this in mind, I think monkey-patching is not adequate unless there's a workaround that I'm not aware of. We'll see if anyone replies on the mailing list.


@jreback
Contributor

jreback commented Dec 19, 2012

@hugadams you are probably much better off creating a class to hold both the frame and the meta, and then forwarding methods as needed to handle manipulations... something like:

class MyObject(object):

    def __init__(self, df, meta):
        self.df = df
        self.meta = meta

    @property
    def ix(self):
        return self.df.ix

Depending on what exactly you need to do, the following will work:

o = MyObject(df, meta)
o.ix[:, 'foo'] = 'bar'
o.name = 'myobj'

and then you can customize serialization, object creation, etc.
You could even allow __getattr__ to automatically forward methods to df/meta as needed.

It only gets tricky when you do mutations:

o.df = o.df * 5

You can even handle this by defining __mul__ in MyObject (see the sketch below).

You probably have a limited set of operations that you really want to support; power users can just
reach in and grab o.df if they need to...

hth
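A minimal sketch of the __mul__ handling mentioned above (hypothetical, building on the MyObject example):

import pandas as pd

class MyObject(object):
    def __init__(self, df, meta):
        self.df = df
        self.meta = meta

    def __mul__(self, other):
        # forward the multiplication to the wrapped frame and
        # carry the metadata over to the result unchanged
        return MyObject(self.df * other, self.meta)

o = MyObject(pd.DataFrame({'A': [1, 2, 3]}), {'origin': 'tower1'})
o2 = o * 5
print(o2.df)
print(o2.meta)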


@hugadams

hugadams commented Dec 19, 2012

@jreback

Thanks for the input. I will certainly keep this in mind if the metadata idea in this thread never reaches fruition, as it seems to be the best way forward. Do you know offhand how I can implement direct slicing, e.g.:

o['col1'] instead of o.df['col1']

I wasn't sure how to transfer that functionality to my custom object without a direct call to the underlying dataframe.

Thanks for pointing out the __mul__ redefinition. This will help me going forward.

This really does feel like a roundabout solution to the DataFrame's inability to be subclassed. Especially if my custom object were to evolve with pandas, this would require maintenance to keep it synced up with changes to the DataFrame API.

What if we do this: using jreback's example, we create a generic class with the specific intention of being subclassed for custom use? We could include the most common DataFrame methods and update all the operators accordingly. Then, hopeless fools like me who come along with the intent to customize have a really strong starting point.

I think that pandas' full potential has yet to be recognized by the research community, and I anticipate it will diffuse into many more scientific fields. As such, if we could present researchers with a generic class for customizing dataframes, they may be more inclined to build packages around pandas rather than coming up with their own ad-hoc data structures.


@jreback
Contributor

jreback commented Dec 19, 2012

There are only a handful of methods you probably need to worry about, and you can always access df directly anyhow - e.g. arithmetic, getitem, setitem, ix, maybe boolean indexing.

It depends on what you want the user to be able to do with your object.
Python is all about least surprise: an object should do what you expect. In this case, are you
having your object quack like a DataFrame with extra attributes, or are you really doing more complex stuff like redefining the way operators work?

For example, you could redefine * to mean "call my cool multiplier function", and in some fields this makes sense (e.g. in frequency-domain analysis you want * to mean convolution).

Can you provide an example of what you are trying to do?

# to provide o['col1'] access:

def __getitem__(self, key):
    # you could intercept calls to metadata here, for example
    if key in self.meta:
        return self.meta[key]
    return self.df[key]
Contributor

jreback commented Dec 19, 2012

There are only a handful of methods you prob need to worry about, you can always access df anyhow
e.g. arithmetic, getitem,setitem,ix, maybe boolean

depends on what you want the user to be able to do with your object
python is all about least suprise. an object should do what you expect; in this case you are
having your object quack like a DataFrame with extra attributes, or are you really do more complex stuff like redefiing the way operators work?

for example you could redefine * to mean call my cool multiplier function, and in some fields this makes sense (e.g. frequency domain analysis you want * to mean convolution)

can you provide an example of what you are trying to do?

# to provide: o['col1'] access

def __getitem__(self, key):

     # you could intercept calls to metadata here for example
      if key in meta:
           return meta[key]

     return self.df.__getitem__(self, key)
@hugadams

hugadams commented Dec 19, 2012

All I'm doing is creating a dataframe for spectral data. As such, it has a special index type that I've written called "SpecIndex" and several methods for transforming itself into various representations of the data. It also has special methods for extending how temporal data is managed. In any case, these operations are well contained in my monkey-patched version, and would also be easily implemented in a new class formalism as you've shown.

Beyond this, it really should just quack. Besides these spectroscopic functions and attributes, it should behave like a dataframe. Therefore, I would prefer the most common operations on the dataframe to be seamless and promoted to instance methods. I want to encourage users to learn pandas and use this tool for exploratory spectroscopy. As such, I'm trying to intercept any inconsistencies ahead of time, like the one you pointed out about o.df = o.df * 5. Will I have to change the behavior of all the basic operators (e.g. * / + -) or just *? Any caveat like this, I'd like to correct in advance. In the end, I want the class layer itself to be as invisible as possible.

Do any more of these gotchas come to mind?


@dalejung
Contributor

dalejung commented Dec 19, 2012

It's best to think of pandas objects the way you think of integers. If you had a hypothetical Person object, its height would just be a number. The number would have no idea it was a height or what unit it was in; it's just there for numerical operations. height / height_avg doesn't care about the person's sex, weight, or race.

I think when the DataFrame is the primary data object this seems weird. But imagine that the Person object had a weight_history attribute. It wouldn't make sense to subclass a DataFrame to hold that attribute, especially if other pandas objects existed in the Person data.

Subclassing/metadata will always run into issues when doing exploratory analysis. Does SubDataFrame.tail() return a SubDataFrame? If it does, will it keep the same attributes? Do we want to make a copy of the dict for all ops like + - * /?

After a certain point it becomes obvious that you're not working with a Person or a SpectralSeries; you're working with an int or a DataFrame. In the same way that convert_height(Person person) isn't more convenient than convert_height(int height), getting your users into the mindset that a DataFrame is just a data type will be simpler in the long run - especially if your class gets more complicated and needs to hold more than one pandas object.


@bilderbuchi

bilderbuchi commented Jan 21, 2016

Any news on this issue? I just found myself wishing for the possibility to attach metadata (probably in a dict) to a dataframe.


@jreback
Contributor

jreback commented Jan 21, 2016

It's certainly possible to add a default propagated attribute like .attrs via the _metadata/__finalize__ machinery. IIRC, geopandas does this.

But it would need quite a bit of auditing and testing. You are welcome to have a go. Can you show your non-trivial use case?


@bilderbuchi

bilderbuchi commented Jan 21, 2016

My use case would be similar to what I imagine @hugadams meant when talking about working with spectroscopy results - data that are constant for the whole dataframe, like:

  • when working with output from machines/lab setups: a machine ID, calibration settings, user IDs, scaling factors for pressure transducers, ...
  • when doing data analysis: filenames of input files, the git hash of the analysis code (for Reproducible Science), config file contents, ...
  • when doing simulations: the git hash of the simulation code (see above), parameters of the batch run creating the dataframe in question, ...

Right now I did some systems modelling and wanted to carry along the name of the scenario that I analyze. As I didn't feel it appropriate to just fill a column with nothing but "scenario 1", I ended up abusing dataframe.columns.name for this - it doesn't feel clean or idiomatic, but it was sufficient for this one case since I only wanted to attach one string.

@nzjrs

nzjrs commented Jan 21, 2016

I have the same use case as @bilderbuchi (recording scientific experimental metadata)

  • subject information - genotype, gender, age
  • experiment information - version hashes, config hashes

@hugadams

hugadams commented Jan 21, 2016

It's now much easier to subclass a dataframe and add your own attributes and methods. This wasn't the case when I started this issue.


@nzjrs

nzjrs commented Jan 21, 2016

Yeah, but something that round-trips through a vanilla pickled dataframe would be preferable.


@dacoex
Contributor

dacoex commented Jan 23, 2016

Would the features offered by xarray be something that could be adopted here?
See Data Structures.

They have data attributes. If pandas could get the same features, this would be great.
Unit conversion, unit propagation, etc.
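For reference, xarray's .attrs dict is its standard way of attaching arbitrary metadata (the values below are made up):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3), dims='x',
                  attrs={'units': 'm', 'origin': 'tower1'})
print(da.attrs['units'])  # the metadata rides along on the DataArray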


@hugadams

hugadams commented Jan 23, 2016

I think xarray is what you want.

You may also try this metadataframe class I wrote a few years ago. It may no longer work with current pandas versions, but I haven't tried:

https://github.com/hugadams/scikit-spectra/blob/b6171bd3c947728e01413fe91ec0bd66a1a02a34/skspec/pandas_utils/metadframe.py

You should be able to download that file, then just make a class that has the attributes you want, i.e.:

df = MetaDataframe()
df.a = a
df.b = b

I thought that after 0.16 it was possible to simply subclass a dataframe, right? I.e.:

class MyDF(DataFrame):
    def __init__(self, *args, **kwargs):
        super(MyDF, self).__init__(*args, **kwargs)
        self.a = 50
        self.b = 20

Or is this not the case?



@hugadams

hugadams commented Jan 23, 2016

Here's what I was talking about:

http://pandas.pydata.org/pandas-docs/stable/internals.html#override-constructor-properties



@dacoex
Contributor

dacoex commented Jan 25, 2016

I think xarray is what you want.

So did you mean that everyone aiming to use metadata would be better off using xarray?


@shoyer
Member

shoyer commented Jan 25, 2016

They have data attributes. If pandas could get the same features, this would be great.
Unit conversion, unit propagation, etc.

Just to be clear, xarray does support adding arbitrary metadata, but not automatic unit conversion. We could hook up a library like pint to handle this, but it's difficult to get all the edge cases working until numpy has better dtype support.


@nzjrs

nzjrs commented Jan 25, 2016

I think 'automatic unit conversion based on metadata attached to series' is a significantly different and more involved feature request than this issue. I hope a simpler, upstream-supported solution allowing attachment of simple text-only metadata can be found before increasing the scope too much.



@jreback
Contributor

jreback commented Jan 25, 2016

This is quite simple in current versions of pandas.

I am using a subclass here for illustration purposes.
Really, all that would be needed is adding __finalize__ to most of the construction methods
(this already exists now for Series, but not really for DataFrame).

Unambiguous propagation would be quite easy, and users could add their own __finalize__ to handle more complicated cases (e.g. what would you do when you have df + df2? see the sketch after this example).

In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:from pandas import DataFrame
:
:class MyDataFrame(DataFrame):
:    _metadata = ['attrs']
:
:    @property
:    def _constructor(self):
:        return MyDataFrame
:
:    def _combine_const(self, other, *args, **kwargs):
:        return super(MyDataFrame, self)._combine_const(other, *args, **kwargs).__finalize__(self)
:--

In [2]: df = MyDataFrame({'A' : [1,2,3]})

In [3]: df.attrs = {'foo' : 'bar'}

In [4]: df.attrs
Out[4]: {'foo': 'bar'}

In [5]: (df+1).attrs
Out[5]: {'foo': 'bar'}

Would take a patch for this; the modifications are pretty straightforward. It's the testing that is the key here.
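A hedged sketch of a user-supplied __finalize__ with an explicit policy for combining cases like df + df2 (the attribute name 'meta' and the copy policy are made up, building on the subclass above):

from pandas import DataFrame

class MyDataFrame(DataFrame):
    # 'meta' is declared in _metadata so pandas carries it via __finalize__
    _metadata = ['meta']

    @property
    def _constructor(self):
        return MyDataFrame

    def __finalize__(self, other, method=None, **kwargs):
        # simple made-up policy: copy meta from the source frame when present
        if isinstance(other, MyDataFrame) and hasattr(other, 'meta'):
            self.meta = other.meta
        return self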


@postelrich

postelrich commented Sep 30, 2016

@jreback is there a generic way to persist metadata across all transforms applied to a dataframe, including groupbys? Or would one have to go through and override a lot of methods to call __finalize__?


@jreback
Contributor

jreback commented Sep 30, 2016

@postelrich for most/all things, __finalize__ should already be defined (and so in theory you can make it persist attributes). It's not really well tested, though.

For Series I think this is quite robust; for DataFrame, pretty good. I doubt this works at all for groupby / merge / most reductions. Those really depend on __finalize__ actually being called (it may or may not be) - that is the simple part. The hard part is deciding what to do.


@jbrockmendel
Member

jbrockmendel commented Jul 20, 2017

I've been working on an implementation of this that handles the propagation problem by making the Metadata object itself subclass Series, then patching Series to relay methods to Metadata. Roughly:

class MSeries(pd.Series):
    def __init__(self, *args, **kwargs):
        pd.Series.__init__(self, *args, **kwargs)
        self.metadata = SMeta(self)

    def __add__(self, other):
        res = pd.Series.__add__(self, other)
        res.metadata = self.metadata.__add__(other)
        return res

class SMeta(pd.Series):
    def __init__(self, parent):
        super(...)
        self.parent = parent

    def __add__(self, other):
        new_meta = SMeta(index=self.index)
        other_meta = [... other or other.metadata or None depending ...]
        for key in self.index:
            new_meta[key] = self[key].__add__(other)
        return new_meta

So it is up to the individual MetaDatum classes to figure out how to propagate.

I've generally got this working. The part that I have not gotten working is the desired MFrame behavior: df.metadata['A'] is df['A'].metadata. Any ideas on how to make that happen?


@JochemBoersma

JochemBoersma commented Sep 20, 2017

Propagation of attributes (defined in _metadata) gives me some headaches...

Based on jreback's code, I've tried the following:

from pandas import DataFrame

class MyDataFrame(DataFrame):
    _metadata = ['attrs']

    @property
    def _constructor(self):
        return MyDataFrame

    def _combine_frame(self, other, *args, **kwargs):
        return super(MyDataFrame, self)._combine_frame(other, *args, **kwargs).__finalize__(self)

dfA = MyDataFrame({'A' : [1,2,3]})
dfA.attrs = {'foo' : 'bar'}

dfB = MyDataFrame({'B' : [6,7,8]})
dfB.attrs = {'fuzzy': 'busy'}

dfC = dfA.append(dfB)
dfC.attrs   # Returns error: 'MyDataFrame' object has no attribute 'attrs'
            # I would like it to be {'foo': 'bar'}

As jreback mentioned, choices have to be made about what to do with the appended attributes.
However, I would be really helped if the attributes of only dfA simply propagated to dfC.

EDIT: more headache is better - it pushes me to think harder :). Solved it by borrowing the __finalize__ solution that GeoPandas provides. __finalize__ works pretty well indeed. However, I'm not experienced enough to do the testing.


@21stio

21stio commented Nov 11, 2017

Can't we just put the metadata in the column name and change how columns are accessed? E.g. ["id"] would internally translate to {"name": "id"}.

I don't know the internals of pandas, so sorry if this is a little naive. To me it just seems that the column name is the one thing that is really consistent across operations.

