Handle and test when __reduce__ returns a string #77

Merged Feb 6, 2024 (33 commits)

Conversation

liamhuber (Contributor)

Closes #76: The use_state branch does not handle when __reduce__ returns a string

@liamhuber (Contributor, Author)

Passing on Windows and macOS is not relevant, as the test is skipped there. Ubuntu fails with:

=========================== short test summary info ============================
SKIPPED [1] h5io/tests/test_io.py:419: only python3.11 or lower
FAILED h5io/tests/test_io.py::test_state_with_singleton - AttributeError: module 'h5io.tests.test_io' has no attribute 'raw'
============= 1 failed, 13 passed, 1 skipped, 2 warnings in 3.25s ==============
Error: Process completed with exit code 1.

This is perfectly sensible, as I'd defined the "global" variable raw inside the scope of a function.

I just moved it out to the top level of the test file and everything passes fine. IMO it's kind of ugly having those extra classes defined at the same level as the rest of the tests, but I don't know any non-convoluted way to expose the variable for import from inside a function. Moving this stuff off to a tests/static/singletons.py and modifying the Python path to import from there is also a reasonable solution, but for now I'll keep the ugly but less invasive form here.
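
For reference, the shape of the fix is roughly this; a minimal sketch, where the class body is illustrative and raw is the test's actual variable name:

class Singleton:
    def __reduce__(self):
        # Returning a string tells the reader to look this name up in
        # the defining module at load time.
        return "raw"

# Must be bound at module level: on read, the lookup is effectively
# getattr(imported_module, "raw"), which cannot see names bound inside
# a function body.
raw = Singleton()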

@liamhuber (Contributor, Author)

@pmrv (if you have permissions), @jan-janssen, you guys introduced the state and reduce stuff -- could one of you review this?

@jan-janssen (Member) left a comment

The change is great and extends the functionality of h5io. It is well documented and worth adding. Just the tests are currently rather lengthy, increasing the total number of lines for the tests by 20%. So I added a couple of comments to cut the number of test lines roughly in half (~50 lines).

@liamhuber (Contributor, Author) commented Feb 2, 2024

> The change is great and extends the functionality of h5io. It is well documented and worth adding. Just the tests are currently rather lengthy, increasing the total number of lines for the tests by 20%. So I added a couple of comments to cut the number of test lines roughly in half (~50 lines).

It took me a minute to come around to it, but this is really excellent feedback. I was laser-focused on the singleton pattern case, but the point is that when __reduce__ returns a string, we should grab that local object and return it. The tests are now much shorter and, I hope, more pedagogically focused on this point.

EDIT: local to the class defining the __reduce__ method I mean. In this case this is also local to the test, but I don't think we need to go as far as introducing some static test file to import from just for demonstrating this use case more thoroughly.
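
For context, resolving a string returned from __reduce__ boils down to a lookup like this (a hedged sketch mirroring pickle's behaviour; the function name is hypothetical, not h5io's actual implementation):

import importlib


def _resolve_reduce_string(obj, name):
    # The string names an object living in the module that defines
    # obj's class; import that module and fetch the attribute from it.
    module = importlib.import_module(type(obj).__module__)
    return getattr(module, name)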

liamhuber and others added 2 commits February 2, 2024 10:41
Oops, slipped past me until I noticed some of the review comments weren't "outdated"
@jan-janssen (Member) left a comment

Looks good to me

@liamhuber (Contributor, Author)

@larsoner would you mind seeing if you also approve?

Once (if) this is merged, I would appreciate a new release so I can use this in my downstream CI 🚀

# string return from __reduce__, we expect to get back the exact thing specified
# in that string!
assert not isinstance(reloaded, StringReduce) # Not what got saved, but rather
assert reloaded is what_gets_loaded # The _exact_ object specified in __reduce__
@larsoner (Member), on the hunk above:

This is very weird; intuitively it makes me think that this functionality won't survive closing and reopening the Python interpreter, for example. Maybe this test is only passing because read_hdf5 is picking up the global state of this file.

To test this, I wrote a little file rep.py based on your test:

rep.py
import h5io


class StringReduce:
    """A class to test the case of `__reduce__` returning a string."""

    def __reduce__(self):
        """
        Return a string associated with a local object to return.
        Normally, this would be some instance of this very class, e.g. in the case of
        a singleton. Here, we're going to set the variable to something totally
        different for pedagogical reasons, to demonstrate that on read you get
        _exactly_ what you specify here.
        """
        return "what_gets_loaded"  # The associated global variable


string_reduce_instance = StringReduce()
what_gets_loaded = {"the_point_is": "this is some very particular object"}
h5io.write_hdf5(
    fname="test.h5",
    data=string_reduce_instance,
    overwrite=True,
    use_json=False,
    use_state=True,
)

And it seems to confirm my suspicion:

$ python rep.py 
$ python
>>> import h5io
>>> h5io.read_hdf5("test.h5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 516, in read_hdf5
    return _read(fid)
           ^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 510, in _read
    return _triage_read(fid[title], slash=slash)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 584, in _triage_read
    obj_class=_import_class(class_type=type_str),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 897, in _import_class
    return getattr(
           ^^^^^^^^
AttributeError: module '__main__' has no attribute 'what_gets_loaded'

So I don't think this feature will actually work for people the way it should, i.e., giving object persistence across instantiations of Python?

@liamhuber (Contributor, Author)

@larsoner thanks for looking at it so carefully!

It took me a while to figure out what's going wrong, because this feature is working A-OK for me, including surviving interpreter restarts.

The difference is that in my use case, I'm always writing and reading data that is in an importable python module, so I can always find the object again without trouble; I even get is comparisons passing as expected when I save and load a singleton this way (which is actually what motivated the PR).

In contrast, in your example, when rep.py is executed the __module__ is __main__. When the interpreter is restarted, the module is still __main__, so there's no import error, but since none of the code in rep.py has been executed yet, the saved instance no longer exists and is not available.

My knee-jerk response was to throw an error at write time if __module__ == "__main__", but since a user could reasonably just re-create their object in __main__ prior to loading, I am wary of absolutely forbidding this behaviour.
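
That rejected guard would have looked something like this (a hypothetical sketch, never implemented):

def _forbid_main(obj):
    # Hypothetical write-time guard: refuse objects whose class lives
    # in __main__, since nothing guarantees the reading interpreter
    # will have re-executed the defining script.
    if type(obj).__module__ == "__main__":
        raise ValueError(
            "Refusing to save an object defined in __main__; it cannot "
            "reliably be found again at read time."
        )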

@larsoner (Member)

Another way to look at it is that this only works as expected when h5io.read_hdf5 is used in the same place as h5io.write_hdf5, right? So if I were to write some data to HDF5 in script A (or module A for that matter), it would probably not load properly in script B (or module B) where A != B. Even checking __module__ == "__main__" would not be enough in that case -- you'd want to check that the module/global namespace during read is the same as it was during write. And even that wouldn't really be a total guarantee (my __main__ might not be your __main__, or any other module name, like utils, more generally).

This is a bit of a departure from how h5io usually works. So far, once you write the data, you can move it to another computer and use it in another script/module/whatever and end up with the same data after read. This tightens the scope to: once you write the data in this module or script, it will work in this module or script, so long as the global variables are the same during read (or something like this, I think?).

So I think to be safest we should probably make this behavior opt-in. Maybe a use_state="unsafe" or a new kwarg or something. And if possible/reasonable it would be good to save the name of the module where write_hdf5 has been called, because we could at least check that on read to make sure it's the same (so we're at least likely to get the correct result). WDYT?
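
The suggested module check might be sketched like this (a hypothetical helper, not part of h5io; all names here are made up):

import inspect


def _writer_module_name():
    # The __name__ of whoever called into the write path, to be stored
    # alongside the data and compared against the reader's __name__.
    caller = inspect.stack()[1]
    return caller.frame.f_globals.get("__name__", "<unknown>")

# e.g. at write time: fid.attrs["writer_module"] = _writer_module_name()
# and at read time: warn if it differs from the current __name__.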

@liamhuber (Contributor, Author)

TLDR:

  • Issues around __main__ never differed between this PR and existing use_state stuff
  • The __reduce__() -> str case now matches pickle behaviour, which I think should be the guiding star for use_state=True
  • use_state doesn't really introduce any vulnerabilities or broken promises, it just highlights (perfectly reasonable!) limitations that are already there

Complete:

> Another way to look at it is that this only works as expected when h5io.read_hdf5 is used in the same place as h5io.write_hdf5, right? So if I were to write some data to HDF5 in script A (or module A for that matter), it would probably not load properly in script B (or module B) where A != B.

Thankfully, it isn't quite that severe. If I write data in A and read it in B, it should always work as long as the class for the data (or singleton in this string __reduce__ case) was defined in C. The issue only really comes up when you both define and write the data in A such that the "module" comes out as __main__ but then don't redefine it in __main__ at read time.

This was good feedback though, as it forced me to slow down and think about what I'm trying to accomplish here. Tacitly, I've been trying to match the behaviour of use_state=True to the behaviour of pickle, probably because all the comments in the source code directly quote the pickle docs. However, in the process of putting together an example I noticed I wasn't being strict enough! When __reduce__ returns a string, pickle actually insists that the object looked up under that name passes an is comparison with the object whose __reduce__ was called.
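
In pickle terms, that strictness comes down to this identity check; a minimal standalone sketch, reusing names from the fuller example below:

import pickle

MY_GLOBAL = 42


class GlobalReduce:
    def __reduce__(self):
        return "MY_GLOBAL"  # names an object that is not `self`


# pickle looks MY_GLOBAL up in the defining module, sees that it
# `is not` the instance being pickled, and raises pickle.PicklingError.
pickle.dumps(GlobalReduce())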

This is all probably clearest with an example. I put together three .py scripts:

  • Define classes and write
  • Import classes and write
  • Read

These all live next to each other in a directory at the top of my python path. They operate on three objects:

  • A class that returns the name of a global int in __reduce__
  • A singleton class that returns the name of a global instance of itself in __reduce__
  • Just a regular class

And I use both h5io and pickle to save and load them.

reducers_define.py
from abc import ABCMeta
import os
import pickle

from h5io import write_hdf5

MY_GLOBAL = 42
    
class GlobalReduce:
    def __reduce__(self):
        return "MY_GLOBAL"
    
class Singleton(ABCMeta):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]
    
class SingletonReduce(metaclass=Singleton):
    def __reduce__(self):
        return "MY_SINGLETON"

    
MY_SINGLETON = SingletonReduce()

class JustSomeRegularClass:
    pass

some_class_instance = JustSomeRegularClass()


def to_hdf(file_name, data):
    hdf_file = file_name + ".h5"
    try:
        write_hdf5(
            fname=hdf_file,
            data=data,
            overwrite=True,
            use_json=False,
            use_state=True,
        )
        print(f"HDF SAVED {file_name}")
    except ValueError as e:
        print(f"HDF FAILED {file_name}")  #: {e.args[0]}")
        os.remove(hdf_file)
        
def to_pickle(file_name, data):
    pickle_file = file_name + ".pckl"
    try:
        with open(pickle_file, "wb") as f:
            pickle.dump(data, f)
        print(f"PICKLE SAVED {file_name}")
    except pickle.PicklingError as e:
        print(f"PICKLE FAILED {file_name}")  #: {e.args[0]}")
        os.remove(pickle_file)
        
def write(file_start, objects):
    print(f"########{file_start}########")
    for instance in objects:
        file_name = f"{file_start}_{instance.__class__.__module__}.{instance.__class__.__name__}"
        for attack in [to_hdf, to_pickle]:
            attack(file_name, instance)

if __name__ == "__main__":
    write("reducers_define", [GlobalReduce(), SingletonReduce(), some_class_instance])

reducers_write.py
from reducers_define import write, GlobalReduce, SingletonReduce, JustSomeRegularClass

if __name__ == "__main__":
    some_class_instance = JustSomeRegularClass()
    write("reducers_write", [GlobalReduce(), SingletonReduce(), some_class_instance])

reducers_read.py
import glob
import os
import pickle

from h5io import read_hdf5



def read_hdf(file):
    try:
        instance = read_hdf5(fname=file)
        print(f"HDF READ {file}")  # : {instance}")
    except AttributeError as e:
        print(f"HDF FAILED {file}")  #: {e.args[0]}")
        
def read_pickle(file):
    try:
        with open(file, "rb") as f:
            instance = pickle.load(f)
        print(f"PICKLE READ {file}")  # : {instance}")
    except AttributeError as e:
        print(f"PICKLE FAILED {file}")  #: {e.args[0]}")

if __name__ == "__main__":
    hdf_files = glob.glob("reducers*.h5")
    pickle_files = glob.glob("reducers*.pckl")
    
    print(f"########READING########")
    try:
        for file in hdf_files:
            read_hdf(file)
        for file in pickle_files:
            read_pickle(file)
    finally:
        for file in hdf_files + pickle_files:
            os.remove(file)

Then in my shell I execute python reducers_define.py; python reducers_write.py; python reducers_read.py, which reports what wrote/read OK. The file names encode the (de)serialization tool, the acting script, the object's module, and the object's class:

########reducers_define########
HDF FAILED reducers_define___main__.GlobalReduce
PICKLE FAILED reducers_define___main__.GlobalReduce
HDF SAVED reducers_define___main__.SingletonReduce
PICKLE SAVED reducers_define___main__.SingletonReduce
HDF SAVED reducers_define___main__.JustSomeRegularClass
PICKLE SAVED reducers_define___main__.JustSomeRegularClass
########reducers_write########
HDF FAILED reducers_write_reducers_define.GlobalReduce
PICKLE FAILED reducers_write_reducers_define.GlobalReduce
HDF SAVED reducers_write_reducers_define.SingletonReduce
PICKLE SAVED reducers_write_reducers_define.SingletonReduce
HDF SAVED reducers_write_reducers_define.JustSomeRegularClass
PICKLE SAVED reducers_write_reducers_define.JustSomeRegularClass
########READING########
HDF READ reducers_write_reducers_define.SingletonReduce.h5
HDF FAILED reducers_define___main__.SingletonReduce.h5
HDF FAILED reducers_define___main__.JustSomeRegularClass.h5
HDF READ reducers_write_reducers_define.JustSomeRegularClass.h5
PICKLE READ reducers_write_reducers_define.SingletonReduce.pckl
PICKLE READ reducers_write_reducers_define.JustSomeRegularClass.pckl
PICKLE FAILED reducers_define___main__.SingletonReduce.pckl
PICKLE FAILED reducers_define___main__.JustSomeRegularClass.pckl

An even more condensed summary is that:

  • We fail to write GlobalReduce() because the object named by its __reduce__() string is not the instance itself
  • We succeed to write SingletonReduce() just fine, but this can only be read when SingletonReduce was imported (or is re-imported/re-defined in the loading context) because otherwise the saved module is __main__ and we just don't know about it
  • Pickle and h5io behave the same
  • h5io doesn't care whether we're leveraging the __reduce__ branch of use_state or not: you can't save something defined in __main__ and expect to load it

> Even checking __module__ == "__main__" would not be enough in that case -- you'd want to check that the module/global namespace during read is the same as it was during write. And even that wouldn't really be a total guarantee (my __main__ might not be your __main__, or any other module name, like utils, more generally).

> This is a bit of a departure from how h5io usually works. So far, once you write the data, you can move it to another computer and use it in another script/module/whatever and end up with the same data after read. This tightens the scope to: once you write the data in this module or script, it will work in this module or script, so long as the global variables are the same during read (or something like this, I think?).

I would actually push back against this a bit. I think this is exactly how the rest of h5io works; it's just that regular usage of h5io tacitly assumes the data you're writing was defined somewhere else (builtins, datetime, scipy, wherever). In fact, even without use_state, you can push h5io to fail in similar ways.

In terms of scope, h5io depends on the environment being the same in both the saving and loading contexts in order to correctly recover the right object. The real threat here is when versions of the same library differ, but here is an absurdist example where I just completely overwrite scipy and things go silently wrong -- remember, these files live in a directory high up my Python path, so when I move my_scipy.py over to scipy.py in spoof_write.py, the interpreter finds my custom "scipy" ahead of the real one when I run spoof_read.py:

my_scipy.py
class _NotWhatYouThought:
    def __repr__(self):
        return f"These aren't the droids you're looking for"
    
class sparse:
    @classmethod
    def csc_matrix(cls, *args, **kwargs):
        return _NotWhatYouThought()

spoof_write.py
import shutil

from scipy import sparse

from h5io import write_hdf5, read_hdf5


data = sparse.csc_matrix(1)
file_name = "spoof.h5"

print("WRITING:", data)

write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)

from_real_module = read_hdf5(file_name)
print("NORMAL MODULE READ:", from_real_module)

shutil.move("my_scipy.py", "scipy.py")

spoof_read.py
import os
import shutil

from h5io import read_hdf5

file_name = "spoof.h5"

try:
    from_custom_module = read_hdf5(file_name)
    print("CUSTOM MODULE READ:", from_custom_module)
except Exception as e:
    print("FAILED", e.args[0])
finally:
    shutil.move("scipy.py", "my_scipy.py")
    os.remove(file_name)

Then python spoof_write.py; python spoof_read.py:

WRITING:   (0, 0)       1
NORMAL MODULE READ:   (0, 0)    1
CUSTOM MODULE READ: These aren't the droids you're looking for

In a similar way, h5io doesn't have any particular safeguards ensuring that what the user wrote is what winds up getting read. E.g. I can use the same three-file setup to define+write, import+write, and read under the condition that I've subclassed one of h5io's "known" classes (datetime in this case). I'm allowed to save it without complaint, and it loads without complaint (even when defined and saved in __main__), but what gets loaded is just a plain datetime object and not the class I saved:

datetime_define.py
from datetime import datetime

from h5io import write_hdf5

class MyDatetime(datetime):
    pass

data = MyDatetime(2024, 2, 6)
file_name = f"datetime_{data.__class__.__module__}.{data.__class__.__name__}.h5"

print("WRITING", file_name, type(data))
write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)

datetime_write.py
from h5io import write_hdf5

from datetime_define import MyDatetime

data = MyDatetime(2024, 2, 6)
file_name = f"datetime_{data.__class__.__module__}.{data.__class__.__name__}.h5"

print("WRITING", file_name, type(data))

write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)

datetime_read.py
import glob
import os

from h5io import read_hdf5

for file_name in glob.glob("datetime*.h5"):
    data = read_hdf5(fname=file_name)
    print("READING", file_name, type(data))
    os.remove(file_name)

python datetime_define.py; python datetime_write.py; python datetime_read.py:

WRITING datetime___main__.MyDatetime.h5 <class '__main__.MyDatetime'>
WRITING datetime_datetime_define.MyDatetime.h5 <class 'datetime_define.MyDatetime'>
WRITING datetime_datetime_define.MyDatetime.h5 <class 'datetime_define.MyDatetime'>
READING datetime___main__.MyDatetime.h5 <class 'datetime.datetime'>
READING datetime_datetime_define.MyDatetime.h5 <class 'datetime.datetime'>

So, I'm not convinced that use_state actually introduces any additional vulnerabilities or broken promises. Rather, the user is just more likely to run into existing (and perfectly reasonable) limitations, because an intended use case is saving custom classes, which really highlights how these need to be importable from the same place at both save and load time.

> So I think to be safest we should probably make this behavior opt-in. Maybe a use_state="unsafe" or a new kwarg or something. And if possible/reasonable it would be good to save the name of the module where write_hdf5 has been called, because we could at least check that on read to make sure it's the same (so we're at least likely to get the correct result). WDYT?

I fully agree the state stuff should be behind an opt-in wall, but since the "__reduce__ returns a string" behaviour is identical to the rest of the use_state behaviour, I think the existing boolean gate is sufficient.

We could further do some sort of check and warning at write time if the module is __main__, but since the behaviour here is in any case identical to pickle's, I don't think it's unfair to users to let it be silent. Also, the default error when you forget to re-define your class in the loading context (AttributeError: __main__ has no attribute FOO, or however it's phrased) is pretty clear too.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5cad45f) 91.05% compared to head (18d484f) 91.46%.


Additional details and impacted files
@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   91.05%   91.46%   +0.40%     
==========================================
  Files           3        3              
  Lines         704      726      +22     
  Branches      163      166       +3     
==========================================
+ Hits          641      664      +23     
+ Misses         39       38       -1     
  Partials       24       24              


liamhuber and others added 11 commits February 5, 2024 14:08
Be explicit that SingleValue is abstract

It already was effectively since it didn't implement its parent's abstract method, but just be super clear
We need it that high for the case when state is None
Necessary when, e.g. `__reduce__` returns an integer variable
@larsoner (Member) commented Feb 6, 2024

Thanks for the detailed analysis @liamhuber! I agree that if we do what pickle does, it's okay. We should make sure at some point that use_state is adequately documented as being as fragile and prone to breakage as pickle itself, which is not designed to be stable/portable across systems, versions of Python, different modules, etc. (I just hit a problem last week where someone had saved a .npy with a pandas object in it a year or so ago... and it failed because I was using too new a version of pandas. 🙄)

@larsoner merged commit 5353e06 into h5io:main Feb 6, 2024
6 checks passed
@liamhuber deleted the when_reduce_is_a_string branch February 6, 2024 21:55