Handle and test when __reduce__ returns a string #77

Merged Feb 6, 2024 (33 commits)

Conversation

liamhuber (Contributor)

Closes #76: The use_state branch does not handle when __reduce__ returns a string

@liamhuber (Contributor, Author)

Passing on Windows and macOS is not relevant, as the test is skipped there. Ubuntu fails with:

=========================== short test summary info ============================
SKIPPED [1] h5io/tests/test_io.py:419: only python3.11 or lower
FAILED h5io/tests/test_io.py::test_state_with_singleton - AttributeError: module 'h5io.tests.test_io' has no attribute 'raw'
============= 1 failed, 13 passed, 1 skipped, 2 warnings in 3.25s ==============
Error: Process completed with exit code 1.

This is perfectly sensible, as I'd defined the "global" variable raw inside the scope of a function.

I just moved it out to the top level of the test file and everything passes fine. IMO it's kind of ugly having those extra classes defined at the same level as the rest of the tests, but I don't know any non-convoluted way to expose the variable for import from inside a function. Moving this stuff off to a tests/static/singletons.py and modifying the Python path to import from there is also a reasonable solution, but for now I'll keep the ugly but less invasive form here.
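
For reference, the shape of the fix is roughly this; a minimal sketch, where the class body is illustrative and raw is the test's actual variable name:

class Singleton:
    def __reduce__(self):
        # Returning a string tells the reader to look this name up in
        # the defining module at load time.
        return "raw"

# Must be bound at module level: on read, the lookup is effectively
# getattr(imported_module, "raw"), which cannot see names bound inside
# a function body.
raw = Singleton()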

@liamhuber (Contributor, Author)

@pmrv (if you have permissions), @jan-janssen, you guys introduced the state and reduce stuff -- could one of you review this?

@jan-janssen (Member) left a comment

The change is great and extends the functionality of h5io. It is well documented and worth adding. Just the tests are currently rather lengthy, increasing the total number of lines for the tests by 20%. So I added a couple of comments to cut the number of test lines roughly in half (~50 lines).

@liamhuber (Contributor, Author) commented Feb 2, 2024

> The change is great and extends the functionality of h5io. It is well documented and worth adding. Just the tests are currently rather lengthy, increasing the total number of lines for the tests by 20%. So I added a couple of comments to cut the number of test lines roughly in half (~50 lines).

It took me a minute to come around to it, but this is really excellent feedback. I was laser-focused on the singleton pattern case, but the point is that when __reduce__ returns a string, we should grab that local object and return it. The tests are now much shorter and, I hope, more pedagogically focused on this point.

EDIT: local to the class defining the __reduce__ method I mean. In this case this is also local to the test, but I don't think we need to go as far as introducing some static test file to import from just for demonstrating this use case more thoroughly.
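
For context, resolving a string returned from __reduce__ boils down to a lookup like this (a hedged sketch mirroring pickle's behaviour; the function name is hypothetical, not h5io's actual implementation):

import importlib


def _resolve_reduce_string(obj, name):
    # The string names an object living in the module that defines
    # obj's class; import that module and fetch the attribute from it.
    module = importlib.import_module(type(obj).__module__)
    return getattr(module, name)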

liamhuber and others added 2 commits February 2, 2024 10:41
Oops, slipped past me until I noticed some of the review comments weren't "outdated"
@jan-janssen (Member) left a comment

Looks good to me

@liamhuber (Contributor, Author)

@larsoner would you mind seeing if you also approve?

Once (if) this is merged, I would appreciate a new release so I can use this in my downstream CI 🚀

# string return from __reduce__, we expect to get back the exact thing specified
# in that string!
assert not isinstance(reloaded, StringReduce) # Not what got saved, but rather
assert reloaded is what_gets_loaded # The _exact_ object specified in __reduce__
@larsoner (Member), on the hunk above:

This is very weird; intuitively it makes me think that this functionality won't survive closing and reopening the Python interpreter, for example. Maybe this test is only passing because read_hdf5 is picking up the global state of this file.

To test this, I wrote a little file rep.py based on your test:

rep.py
import h5io


class StringReduce:
    """A class to test the case of `__reduce__` returning a string."""

    def __reduce__(self):
        """
        Return a string associated with a local object to return.
        Normally, this would be some instance of this very class, e.g. in the case of
        a singleton. Here, we're going to set the variable to something totally
        different for pedagogical reasons, to demonstrate that on read you get
        _exactly_ what you specify here.
        """
        return "what_gets_loaded"  # The associated global variable


string_reduce_instance = StringReduce()
what_gets_loaded = {"the_point_is": "this is some very particular object"}
h5io.write_hdf5(
    fname="test.h5",
    data=string_reduce_instance,
    overwrite=True,
    use_json=False,
    use_state=True,
)

And it seems to confirm my suspicion:

$ python rep.py 
$ python
>>> import h5io
>>> h5io.read_hdf5("test.h5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 516, in read_hdf5
    return _read(fid)
           ^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 510, in _read
    return _triage_read(fid[title], slash=slash)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 584, in _triage_read
    obj_class=_import_class(class_type=type_str),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 897, in _import_class
    return getattr(
           ^^^^^^^^
AttributeError: module '__main__' has no attribute 'what_gets_loaded'

So I don't think this feature will actually work for people the way it should, i.e., giving object persistence across instantiations of Python?

@liamhuber (Contributor, Author)

@larsoner thanks for looking at it so carefully!

It took me a while to figure out what's going wrong, because this feature is working A-OK for me, including surviving interpreter restarts.

The difference is that in my use case, I'm always writing and reading data that is in an importable python module, so I can always find the object again without trouble; I even get is comparisons passing as expected when I save and load a singleton this way (which is actually what motivated the PR).

In contrast, in your example, when rep.py is executed the __module__ is __main__. When the interpreter is restarted, the module is still __main__, so there's no import error, but since none of the code in rep.py has been executed yet, the saved instance no longer exists and is not available.

My knee-jerk response was to throw an error at write time if __module__ == "__main__", but since a user could reasonably just re-create their object in __main__ prior to loading, I am wary of absolutely forbidding this behaviour.
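
That rejected guard would have looked something like this (a hypothetical sketch, never implemented):

def _forbid_main(obj):
    # Hypothetical write-time guard: refuse objects whose class lives
    # in __main__, since nothing guarantees the reading interpreter
    # will have re-executed the defining script.
    if type(obj).__module__ == "__main__":
        raise ValueError(
            "Refusing to save an object defined in __main__; it cannot "
            "reliably be found again at read time."
        )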

@larsoner (Member)

Another way to look at it is that this only works as expected when h5io.read_hdf5 is used in the same place as h5io.write_hdf5, right? So if I were to write some data to HDF5 in script A (or module A for that matter), it would probably not load properly in script B (or module B) where A != B. Even checking __module__ == "__main__" would not be enough in that case -- you'd want to check that the module/global namespace during read is the same as it was during write. And even that wouldn't really be a total guarantee (my __main__ might not be your __main__, or any other module name, like utils, more generally).

This is a bit of a departure from how h5io usually works. So far, once you write the data, you can move it to another computer and use it in another script/module/whatever and end up with the same data after read. This tightens the scope to: once you write the data in this module or script, it will work in this module or script, so long as the global variables are the same during read (or something like this, I think?).

So I think to be safest we should probably make this behavior opt-in. Maybe a use_state="unsafe" or a new kwarg or something. And if possible/reasonable it would be good to save the name of the module where write_hdf5 has been called, because we could at least check that on read to make sure it's the same (so we're at least likely to get the correct result). WDYT?
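
The suggested module check might be sketched like this (a hypothetical helper, not part of h5io; all names here are made up):

import inspect


def _writer_module_name():
    # The __name__ of whoever called into the write path, to be stored
    # alongside the data and compared against the reader's __name__.
    caller = inspect.stack()[1]
    return caller.frame.f_globals.get("__name__", "<unknown>")

# e.g. at write time: fid.attrs["writer_module"] = _writer_module_name()
# and at read time: warn if it differs from the current __name__.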

@liamhuber (Contributor, Author)

TLDR:

  • Issues around __main__ never differed between this PR and existing use_state stuff
  • The __reduce__() -> str case now matches pickle behaviour, which I think should be the guiding star for use_state=True
  • use_state doesn't really introduce any vulnerabilities or broken promises, it just highlights (perfectly reasonable!) limitations that are already there

Complete:

> Another way to look at it is that this only works as expected when h5io.read_hdf5 is used in the same place as h5io.write_hdf5, right? So if I were to write some data to HDF5 in script A (or module A for that matter), it would probably not load properly in script B (or module B) where A != B.

Thankfully, it isn't quite that severe. If I write data in A and read it in B, it should always work as long as the class for the data (or singleton in this string __reduce__ case) was defined in C. The issue only really comes up when you both define and write the data in A such that the "module" comes out as __main__ but then don't redefine it in __main__ at read time.

This was good feedback though, as it forced me to slow down and think about what I'm trying to accomplish here. Tacitly, I've been trying to match the behaviour of use_state=True to the behaviour of pickle, probably because all the comments in the source code directly quote the pickle docs. However, in the process of putting together an example I noticed I wasn't being strict enough! When __reduce__ returns a string, pickle actually insists that the object looked up under that name passes an is comparison with the object whose __reduce__ was called.
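
In pickle terms, that strictness comes down to this identity check; a minimal standalone sketch, reusing names from the fuller example below:

import pickle

MY_GLOBAL = 42


class GlobalReduce:
    def __reduce__(self):
        return "MY_GLOBAL"  # names an object that is not `self`


# pickle looks MY_GLOBAL up in the defining module, sees that it
# `is not` the instance being pickled, and raises pickle.PicklingError.
pickle.dumps(GlobalReduce())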

This is all probably clearest with an example. I put together three .py scripts:

  • Define classes and write
  • Import classes and write
  • Read

These all live next to each other in a directory at the top of my python path. They operate on three objects:

  • A class that returns the name of a global int in __reduce__
  • A singleton class that returns the name of a global instance of itself in __reduce__
  • Just a regular class

And I use both h5io and pickle to save and load them.

reducers_define.py
from abc import ABCMeta
import os
import pickle

from h5io import write_hdf5

MY_GLOBAL = 42
    
class GlobalReduce:
    def __reduce__(self):
        return "MY_GLOBAL"
    
class Singleton(ABCMeta):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]
    
class SingletonReduce(metaclass=Singleton):
    def __reduce__(self):
        return "MY_SINGLETON"

    
MY_SINGLETON = SingletonReduce()

class JustSomeRegularClass:
    pass

some_class_instance = JustSomeRegularClass()


def to_hdf(file_name, data):
    hdf_file = file_name + ".h5"
    try:
        write_hdf5(
            fname=hdf_file,
            data=data,
            overwrite=True,
            use_json=False,
            use_state=True,
        )
        print(f"HDF SAVED {file_name}")
    except ValueError as e:
        print(f"HDF FAILED {file_name}")  #: {e.args[0]}")
        os.remove(hdf_file)
        
def to_pickle(file_name, data):
    pickle_file = file_name + ".pckl"
    try:
        with open(pickle_file, "wb") as f:
            pickle.dump(data, f)
        print(f"PICKLE SAVED {file_name}")
    except pickle.PicklingError as e:
        print(f"PICKLE FAILED {file_name}")  #: {e.args[0]}")
        os.remove(pickle_file)
        
def write(file_start, objects):
    print(f"########{file_start}########")
    for instance in objects:
        file_name = f"{file_start}_{instance.__class__.__module__}.{instance.__class__.__name__}"
        for attack in [to_hdf, to_pickle]:
            attack(file_name, instance)

if __name__ == "__main__":
    write("reducers_define", [GlobalReduce(), SingletonReduce(), some_class_instance])

reducers_write.py
from reducers_define import write, GlobalReduce, SingletonReduce, JustSomeRegularClass

if __name__ == "__main__":
    some_class_instance = JustSomeRegularClass()
    write("reducers_write", [GlobalReduce(), SingletonReduce(), some_class_instance])

reducers_read.py
import glob
import os
import pickle

from h5io import read_hdf5



def read_hdf(file):
    try:
        instance = read_hdf5(fname=file)
        print(f"HDF READ {file}")  # : {instance}")
    except AttributeError as e:
        print(f"HDF FAILED {file}")  #: {e.args[0]}")
        
def read_pickle(file):
    try:
        with open(file, "rb") as f:
            instance = pickle.load(f)
        print(f"PICKLE READ {file}")  # : {instance}")
    except AttributeError as e:
        print(f"PICKLE FAILED {file}")  #: {e.args[0]}")

if __name__ == "__main__":
    hdf_files = glob.glob("reducers*.h5")
    pickle_files = glob.glob("reducers*.pckl")
    
    print(f"########READING########")
    try:
        for file in hdf_files:
            read_hdf(file)
        for file in pickle_files:
            read_pickle(file)
    finally:
        for file in hdf_files + pickle_files:
            os.remove(file)

Then in my shell I execute python reducers_define.py; python reducers_write.py; python reducers_read.py, which reports what wrote/read OK. The file names encode the (de)serialization tool, the acting script, the object's module, and the object's class:

########reducers_define########
HDF FAILED reducers_define___main__.GlobalReduce
PICKLE FAILED reducers_define___main__.GlobalReduce
HDF SAVED reducers_define___main__.SingletonReduce
PICKLE SAVED reducers_define___main__.SingletonReduce
HDF SAVED reducers_define___main__.JustSomeRegularClass
PICKLE SAVED reducers_define___main__.JustSomeRegularClass
########reducers_write########
HDF FAILED reducers_write_reducers_define.GlobalReduce
PICKLE FAILED reducers_write_reducers_define.GlobalReduce
HDF SAVED reducers_write_reducers_define.SingletonReduce
PICKLE SAVED reducers_write_reducers_define.SingletonReduce
HDF SAVED reducers_write_reducers_define.JustSomeRegularClass
PICKLE SAVED reducers_write_reducers_define.JustSomeRegularClass
########READING########
HDF READ reducers_write_reducers_define.SingletonReduce.h5
HDF FAILED reducers_define___main__.SingletonReduce.h5
HDF FAILED reducers_define___main__.JustSomeRegularClass.h5
HDF READ reducers_write_reducers_define.JustSomeRegularClass.h5
PICKLE READ reducers_write_reducers_define.SingletonReduce.pckl
PICKLE READ reducers_write_reducers_define.JustSomeRegularClass.pckl
PICKLE FAILED reducers_define___main__.SingletonReduce.pckl
PICKLE FAILED reducers_define___main__.JustSomeRegularClass.pckl

An even more condensed summary is that:

  • We fail to write GlobalReduce() because the object named by its __reduce__() string is not the instance itself
  • We succeed to write SingletonReduce() just fine, but this can only be read when SingletonReduce was imported (or is re-imported/re-defined in the loading context) because otherwise the saved module is __main__ and we just don't know about it
  • Pickle and h5io behave the same
  • h5io doesn't care whether we're leveraging the __reduce__ branch of use_state or not: you can't save something defined in __main__ and expect to load it

> Even checking __module__ == "__main__" would not be enough in that case -- you'd want to check that the module/global namespace during read is the same as it was during write. And even that wouldn't really be a total guarantee (my __main__ might not be your __main__, or any other module name, like utils, more generally).

> This is a bit of a departure from how h5io usually works. So far, once you write the data, you can move it to another computer and use it in another script/module/whatever and end up with the same data after read. This tightens the scope to: once you write the data in this module or script, it will work in this module or script, so long as the global variables are the same during read (or something like this, I think?).

I would actually push back against this a bit. I think this is exactly how the rest of h5io works; it's just that regular usage of h5io tacitly assumes the data you're writing was defined somewhere else (builtins, datetime, scipy, wherever). In fact, even without use_state, you can push h5io to fail in similar ways.

In terms of scope, h5io depends on the environment being the same in both the saving and loading contexts in order to correctly recover the right object. The real threat here is when versions of the same library differ, but here is an absurdist example where I just completely overwrite scipy and things go silently wrong -- remember, these files live in a directory high up my Python path, so when I move my_scipy.py over to scipy.py in spoof_write.py, the interpreter finds my custom "scipy" ahead of the real one when I run spoof_read.py:

my_scipy.py
class _NotWhatYouThought:
    def __repr__(self):
        return f"These aren't the droids you're looking for"
    
class sparse:
    @classmethod
    def csc_matrix(cls, *args, **kwargs):
        return _NotWhatYouThought()

spoof_write.py
import shutil

from scipy import sparse

from h5io import write_hdf5, read_hdf5


data = sparse.csc_matrix(1)
file_name = "spoof.h5"

print("WRITING:", data)

write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)

from_real_module = read_hdf5(file_name)
print("NORMAL MODULE READ:", from_real_module)

shutil.move("my_scipy.py", "scipy.py")

spoof_read.py
import os
import shutil

from h5io import read_hdf5

file_name = "spoof.h5"

try:
    from_custom_module = read_hdf5(file_name)
    print("CUSTOM MODULE READ:", from_custom_module)
except Exception as e:
    print("FAILED", e.args[0])
finally:
    shutil.move("scipy.py", "my_scipy.py")
    os.remove(file_name)

Then python spoof_write.py; python spoof_read.py:

WRITING:   (0, 0)       1
NORMAL MODULE READ:   (0, 0)    1
CUSTOM MODULE READ: These aren't the droids you're looking for

In a similar way, h5io doesn't have any particular safeguards ensuring that what the user wrote is what winds up getting read. E.g. I can use the same three-file setup to define+write, import+write, and read under the condition that I've subclassed one of h5io's "known" classes (datetime in this case). I'm allowed to save it without complaint, and it loads without complaint (even when defined and saved in __main__), but what gets loaded is just a plain datetime object and not the class I saved:

datetime_define.py
from datetime import datetime

from h5io import write_hdf5

class MyDatetime(datetime):
    pass

data = MyDatetime(2024, 2, 6)
file_name = f"datetime_{data.__class__.__module__}.{data.__class__.__name__}.h5"

print("WRITING", file_name, type(data))
write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)

datetime_write.py
from h5io import write_hdf5

from datetime_define import MyDatetime

data = MyDatetime(2024, 2, 6)
file_name = f"datetime_{data.__class__.__module__}.{data.__class__.__name__}.h5"

print("WRITING", file_name, type(data))

write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)

datetime_read.py
import glob
import os

from h5io import read_hdf5

for file_name in glob.glob("datetime*.h5"):
    data = read_hdf5(fname=file_name)
    print("READING", file_name, type(data))
    os.remove(file_name)

python datetime_define.py; python datetime_write.py; python datetime_read.py:

WRITING datetime___main__.MyDatetime.h5 <class '__main__.MyDatetime'>
WRITING datetime_datetime_define.MyDatetime.h5 <class 'datetime_define.MyDatetime'>
WRITING datetime_datetime_define.MyDatetime.h5 <class 'datetime_define.MyDatetime'>
READING datetime___main__.MyDatetime.h5 <class 'datetime.datetime'>
READING datetime_datetime_define.MyDatetime.h5 <class 'datetime.datetime'>

So, I'm not convinced that use_state actually introduces any additional vulnerabilities or broken promises. Rather, the user is just more likely to run into existing (and perfectly reasonable) limitations, because an intended use case is saving custom classes, which really highlights how these need to be importable from the same place at both save and load time.

> So I think to be safest we should probably make this behavior opt-in. Maybe a use_state="unsafe" or a new kwarg or something. And if possible/reasonable it would be good to save the name of the module where write_hdf5 has been called, because we could at least check that on read to make sure it's the same (so we're at least likely to get the correct result). WDYT?

I fully agree the state stuff should be behind an opt-in wall, but since the "__reduce__ returns a string" behaviour is identical to the rest of the use_state behaviour, I think the existing boolean gate is sufficient.

We could further do some sort of check and warning at write time if the module is __main__, but since the behaviour here is in any case identical to pickle's, I don't think it's unfair to users to let it be silent. Also, the default error when you forget to re-define your class in the loading context (AttributeError: __main__ has no attribute FOO, or however it's phrased) is pretty clear too.

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5cad45f) 91.05% compared to head (18d484f) 91.46%.


Additional details and impacted files
@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   91.05%   91.46%   +0.40%     
==========================================
  Files           3        3              
  Lines         704      726      +22     
  Branches      163      166       +3     
==========================================
+ Hits          641      664      +23     
+ Misses         39       38       -1     
  Partials       24       24              


liamhuber and others added 11 commits February 5, 2024 14:08
Be explicit that SingleValue is abstract

It already was effectively since it didn't implement its parent's abstract method, but just be super clear
We need it that high for the case when state is None
Necessary when, e.g. `__reduce__` returns an integer variable
@larsoner (Member) commented Feb 6, 2024

Thanks for the detailed analysis @liamhuber! I agree that if we do what pickle does, it's okay. We should make sure at some point that use_state is adequately documented as being as fragile and prone to breakage as pickle itself, which is not designed to be stable/portable across systems, versions of Python, different modules, etc. (I just hit a problem last week where someone had saved a .npy with a pandas object in it a year or so ago... and it failed because I was using too new a version of pandas. 🙄)

@larsoner merged commit 5353e06 into h5io:main Feb 6, 2024
6 checks passed
@liamhuber deleted the when_reduce_is_a_string branch February 6, 2024 21:55