Pickle incompatibility between 0.25 and 1.0 when saving a MultiIndex dataframe #34535

Closed
naspli opened this issue Jun 2, 2020 · 10 comments
Labels
IO Pickle read_pickle, to_pickle

Comments

naspli commented Jun 2, 2020

This seems to have caused the problem described here:

https://stackoverflow.com/questions/61641738/pandas-1-0-cannot-pickle-load-dict-containing-dataframe-with-multiindex

which I'm now also experiencing. I dumped a MultiIndex dataframe containing ndarrays to a pickle on disk under pandas 0.25.x in Python 3.6, and now I'm getting:

AttributeError: Can't get attribute 'FrozenNDArray' on <module 'pandas.core.indexes.frozen'

when trying to load it in pandas 1.0.3 (still on Python 3.6). Any suggestions/workarounds? Should I instead open up a new issue?

This solved issue seems related, but is for Python 2.7:

#31988

This comment in the original rationale for getting rid of FrozenNDArray mentions pandas.compat.pickle_compat.py, which seems relevant:

#9031 (comment)

Originally posted by @mspacek in #29335 (comment)

naspli changed the title from "Pickle incompatibility between 0.25 and 1.0 when using saving a MultiIndex dataframe" to "Pickle incompatibility between 0.25 and 1.0 when saving a MultiIndex dataframe" on Jun 2, 2020

naspli commented Jun 2, 2020

@WillAyd #29335 (comment) suggested that @mspacek open a new issue, but I did not see one. I have run into the same problem.

@jbrockmendel jbrockmendel added the IO Pickle read_pickle, to_pickle label Jun 5, 2020
@Froskekongen

Are there any current efforts to fix this issue, or is the recommended approach to find a workaround?


jreback commented Aug 5, 2020

you would have to show an actual example

pd.read_pickle is the way to load as it ensures compatibility (pickle.load will not work)

we test loading older pickles explicitly so likely this is your setup

@Froskekongen

Thanks, @jreback. The issue is of course related to having containers that contain dataframes. Here is an example with dataclasses:

Run first with pandas==0.25.2 and then with 1.1 to reproduce.

from dataclasses import dataclass
import pandas as pd
import numpy as np
import pickle as pkl


@dataclass
class ContainerWithDataframe:
    name: str
    frame: pd.DataFrame

    def save(self):
        with open(f"{self.name}_{pd.__version__}.pkl", 'wb') as ww:
            pkl.dump(self, ww)


def load_container(name, pdversion):
    fname = f"{name}_{pdversion}.pkl"
    try:
        with open(fname, 'rb') as ff:
            return pkl.load(ff)
    except AttributeError as e:
        print(f"Can't load {fname}: {e}")


if __name__ == "__main__":
    df1 = pd.DataFrame(data=np.ones((2, 2)))
    df2 = pd.DataFrame(data=2*np.ones((2, 2)), index=pd.MultiIndex.from_arrays([[1, 2], ['a', 'a']]))
    df3 = pd.DataFrame(data=np.ones((2,2)), columns=pd.MultiIndex.from_product([[1], ['A', 'B']]))

    cont1 = ContainerWithDataframe('regular', df1)
    cont2 = ContainerWithDataframe('index_multindex', df2)
    cont3 = ContainerWithDataframe('columns_multiindex', df3)

    cont1.save()
    cont2.save()
    cont3.save()

    cont1_loaded = load_container('regular', "0.25.2")
    cont2_loaded = load_container('index_multindex', "0.25.2")
    cont3_loaded = load_container('columns_multiindex', "0.25.2")

Is there a way to modify the container to share the logic used by pd.read_pickle?


jreback commented Aug 5, 2020

you should simply use pd.read_pickle

@Froskekongen

@jreback: What exactly do you mean by this (you should simply use pd.read_pickle)?

The example above is a toy example showing the problem. In our systems we have a computational graph where each node can itself be a computational graph. The primary building blocks of the graph may contain dataframes, and what we are doing now is pickling these computational graphs so that we can run them elsewhere.


jreback commented Aug 5, 2020

exactly what i said
pickle.load does not handle backward compatibility
never has never will

pd.read_pickle does

@jreback jreback closed this as completed Aug 5, 2020
@jreback jreback added this to the No action milestone Aug 5, 2020
@veneto-maggio

This seems like a valid use case: the dataframe(s) are part of a larger container which is pickled as a whole. In this case pd.read_pickle does not apply, although one can work around this by defining a new class which holds the dataframe and whose __setstate__ calls pd.read_pickle. I'd hope that pickle.load of a pandas dataframe would include some backward compatibility support like that.
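
The wrapper-class workaround described above could look roughly like the sketch below. The class name FrameHolder and the "frame_bytes" key are hypothetical, made up for illustration. The sketch assumes a pandas version whose pd.read_pickle accepts file-like objects, and it has not been checked against real 0.25 pickles. It also only helps for pickles that were written with the wrapper in place.

```python
import io
import pickle

import pandas as pd


class FrameHolder:
    """Hypothetical wrapper: holds a DataFrame and controls how it is pickled."""

    def __init__(self, frame):
        self.frame = frame

    def __getstate__(self):
        # Store the frame as its own nested pickle payload (plain bytes), so
        # the outer pickle does not reference pandas internals directly.
        return {"frame_bytes": pickle.dumps(self.frame)}

    def __setstate__(self, state):
        # Route the nested payload through pd.read_pickle, which is the
        # loader that pandas tests against pickles from older versions.
        self.frame = pd.read_pickle(io.BytesIO(state["frame_bytes"]))


holder = FrameHolder(
    pd.DataFrame([[1.0, 2.0]], columns=pd.MultiIndex.from_product([[1], ["A", "B"]]))
)

# pickle invokes these two hooks itself during dump/load; calling them by
# hand here keeps the sketch independent of where the class is defined.
state = holder.__getstate__()
restored = FrameHolder.__new__(FrameHolder)
restored.__setstate__(state)
```

A container that stores FrameHolder instances instead of bare dataframes can then still be written with pickle.dump and read with pickle.load, since pickle calls __getstate__ and __setstate__ on each holder.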

p-mc-grath added a commit to p-mc-grath/result_caching that referenced this issue Apr 3, 2021
Corrects brain-score#11 

I encountered a FrozenIndices error when trying to load a pandas dataframe after updating my pandas version. This commit introduces backward compatibility, i.e. you are fine if you pickle.dump() using an old pandas version and then load using a new pandas version.
pandas-dev/pandas#34535

pickle.load() is still called for all non-DataFrame objects; see https://github.com/pandas-dev/pandas/blob/f2c8480af2f25efdbd803218b9d87980f416563e/pandas/io/pickle.py#L203
mschrimpf pushed a commit to brain-score/result_caching that referenced this issue Apr 3, 2021
@pseudotensor

@jreback Yeah, the idea that one only ever loads isolated pandas frames is quite a simplification. As @veneto-maggio said, a pandas frame is often part of a larger pickle file, so direct support for pickle would be the most general solution.

@fredrikw

Old issue, I know, but if anyone else finds this by googling: pd.read_pickle will handle any pickled object, not just pickled DataFrames! I believe that is what jreback is referring to.
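
For anyone who wants to see that in code, here is a minimal sketch. The helper name load_any and the file name are made up for illustration. It writes a plain dict holding a MultiIndex dataframe with the standard pickle module and reads it back through pd.read_pickle. It runs within a single pandas version, so it only shows that a non-DataFrame container comes back intact, and does not by itself reproduce the 0.25 to 1.0 case.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd


def load_any(path):
    """Hypothetical helper: load any pickled object through pandas' loader."""
    return pd.read_pickle(path)


# A plain container (not a DataFrame) that happens to hold a MultiIndex frame.
frame = pd.DataFrame(
    2 * np.ones((2, 2)),
    index=pd.MultiIndex.from_arrays([[1, 2], ["a", "a"]]),
)
container = {"name": "index_multiindex", "frame": frame}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container.pkl")
    # Written with the standard library, as in the dataclass example above.
    with open(path, "wb") as fh:
        pickle.dump(container, fh)
    # Read back with pd.read_pickle instead of pickle.load.
    loaded = load_any(path)
```

In the dataclass example earlier in the thread, the equivalent change would be to replace the pkl.load(ff) call in load_container with pd.read_pickle(fname).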
