Handle and test when `__reduce__` returns a string #77
for more information, see https://pre-commit.ci
Passing on windows and osx is not relevant as the test is skipped there. Ubuntu fails with:
This is perfectly sensible, given where I'd defined the "global" variable. I just moved it out to the main level of the tests file and everything passes fine. IMO it's kind of ugly having those extra classes defined at the same level as the rest of the tests, but I don't know any non-convoluted way to expose the variable for import from inside a function. Moving this stuff off to a
@pmrv (if you have permissions), @jan-janssen, you guys introduced the state and reduce stuff -- could one of you review this?
The change is great and extends the functionality of `h5io`. It is well documented and worth adding. Just the tests are currently rather lengthy - increasing the total number of lines for the tests by 20%. So I added a couple of comments to reduce the number of lines to half, ~50 lines.
Co-authored-by: Jan Janssen <jan-janssen@users.noreply.github.com>
It took me a minute to come around to it, but this is really excellent feedback. I was laser-focused on the singleton pattern case, but the point is that when EDIT: local to the class defining the
Oops, slipped past me until I noticed some of the review comments weren't "outdated"
Looks good to me
@larsoner would you mind seeing if you also approve? Once (if) this is merged I would appreciate a new release so I can use this in my downstream CI 🚀
`h5io/tests/test_io.py` (outdated):
```python
# string return from __reduce__, we expect to get back the exact thing specified
# in that string!
assert not isinstance(reloaded, StringReduce)  # Not what got saved, but rather
assert reloaded is what_gets_loaded  # The _exact_ object specified in __reduce__
```
This is very weird; intuitively it makes me think that this functionality won't survive closing and opening the Python interpreter, for example. Like maybe this test is using the global state of this file when doing `read_hdf5`.
To test this, I wrote a little file `rep.py` based on your test:
```python
import h5io


class StringReduce:
    """A class to test the case of `__reduce__` returning a string."""

    def __reduce__(self):
        """
        Return a string associated with a local object to return.

        Normally, this would be some instance of this very class, e.g. in the case of
        a singleton. Here, we're going to set the variable to something totally
        different for pedagogical reasons, to demonstrate that on read you get
        _exactly_ what you specify here.
        """
        return "what_gets_loaded"  # The associated global variable


string_reduce_instance = StringReduce()
what_gets_loaded = {"the_point_is": "this is some very particular object"}

h5io.write_hdf5(
    fname="test.h5",
    data=string_reduce_instance,
    overwrite=True,
    use_json=False,
    use_state=True,
)
```
And it seems to confirm my suspicion:
```
$ python rep.py
$ python
>>> import h5io
>>> h5io.read_hdf5("test.h5")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 516, in read_hdf5
    return _read(fid)
           ^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 510, in _read
    return _triage_read(fid[title], slash=slash)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 584, in _triage_read
    obj_class=_import_class(class_type=type_str),
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/larsoner/python/h5io/h5io/_h5io.py", line 897, in _import_class
    return getattr(
           ^^^^^^^^
AttributeError: module '__main__' has no attribute 'what_gets_loaded'
```
So I don't think this feature will actually work for people the way it should, i.e., giving object persistence across instantiations of Python?
@larsoner thanks for looking at it so carefully!
It took me a while to figure out what's going wrong, because this feature is working A-OK for me including surviving interpreter restarts.
The difference is that in my use case, I'm always writing and reading data that is in an importable python module, so I can always find the object again without trouble; I even get `is` comparisons passing as expected when I save and load a singleton this way (which is actually what motivated the PR).
In contrast, in your example, when `rep.py` is executed the `__module__` is `__main__`. When the interpreter is restarted, the module is still `__main__`, so there's no import error, but since none of the code in `rep.py` has been executed yet, the saved instance no longer exists and is not available.
My knee-jerk response was to throw an error if `__module__ == "__main__"` at write time, but since a user could reasonably just re-create their object in main prior to loading, I am wary of absolutely forbidding this behaviour.
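The same dynamic can be reproduced with plain `pickle` and no h5io at all. In the sketch below, every name (`reduce_demo`, `Single`, `INSTANCE`) is made up: it registers a throwaway module, pickles an object whose `__reduce__` returns a string, deletes the named global to simulate a fresh interpreter, and then shows that re-creating the object under the same name makes loading work again.

```python
import pickle
import sys
import types

# Build a throwaway importable module so the demo does not depend on how this
# snippet itself is executed; "reduce_demo" and all names below are made up.
mod = types.ModuleType("reduce_demo")
sys.modules["reduce_demo"] = mod
exec(
    "class Single:\n"
    "    def __reduce__(self):\n"
    "        return 'INSTANCE'\n"
    "\n"
    "INSTANCE = Single()\n",
    mod.__dict__,
)

# Pickling works: the string 'INSTANCE' resolves to this very object.
payload = pickle.dumps(mod.INSTANCE)

# Simulate a fresh interpreter in which the module's code never ran:
del mod.INSTANCE
try:
    pickle.loads(payload)
    load_result = "loaded"
except AttributeError:
    load_result = "failed"  # same failure mode as h5io's read_hdf5 above

# Re-create the object under the same name before loading, and it works again:
mod.INSTANCE = mod.Single()
restored = pickle.loads(payload)
print(load_result, restored is mod.INSTANCE)
```

This is why a hard error on `__module__ == "__main__"` at write time would be too strict: the user can legitimately re-establish the named global before reading.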
Another way to look at it is that this only works as expected when `h5io.read_hdf5` is used in the same place as `h5io.write_hdf5`, right? So if I were to write some data in script A (or module A for that matter) to HDF5, it would probably not load properly in script B (or module B) where `A != B`. Even checking `if __module__ == "__main__"` would not be enough in that case -- you'd want to check that the module/global namespace during load is the same as the module/global namespace during write. And even that wouldn't really be a total guarantee (my `__main__` might not be your `__main__`, or any other module name like `utils` for example more generally).
This is a bit of a departure from how h5io usually works. So far, once you write the data, you can move it to another computer and use it in another script/module/whatever and end up with the same data after read. This tightens the scope to once you write the data in this module or script, it will work in this module or script so long as the global variables are the same during read (or something like this I think?).
So I think to be safest we should probably make this behavior opt-in. Maybe a `use_state="unsafe"` or a new kwarg or something. And if possible/reasonable it would be good to save the name of the module where `write_hdf5` has been called, because we could at least check that on read to make sure it's the same (so we're at least likely to get the correct result). WDYT?
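A rough, purely hypothetical sketch of that module-name check -- none of this is h5io API; `write_with_origin` and `read_with_origin` are invented names, and a plain dict stands in for HDF5 attributes:

```python
# Hypothetical sketch: record the writer's module name next to the payload
# and compare it at read time. Not h5io API -- all names here are invented.


def write_with_origin(payload, writer_module):
    # In h5io this could live in an HDF5 attribute; a dict stands in here.
    return {"data": payload, "origin_module": writer_module}


def read_with_origin(stored, reader_module):
    if stored["origin_module"] != reader_module:
        raise RuntimeError(
            f"data written in module {stored['origin_module']!r} but read in "
            f"{reader_module!r}; globals referenced by __reduce__ may differ"
        )
    return stored["data"]


stored = write_with_origin({"x": 1}, writer_module="__main__")
same = read_with_origin(stored, reader_module="__main__")  # matching module: OK
try:
    read_with_origin(stored, reader_module="analysis")  # mismatch: refuse
    check = "passed"
except RuntimeError:
    check = "raised"
print(same, check)
```

As the comment notes, even this is only a heuristic: identical module names do not guarantee identical global namespaces.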
TLDR:
- Issues around `__main__` never differed between this PR and existing `use_state` stuff
- The `__reduce__() -> str` case now matches `pickle` behaviour, which I think should be the guiding star for `use_state=True`
- `use_state` doesn't really introduce any vulnerabilities or broken promises, it just highlights (perfectly reasonable!) limitations that are already there
Complete:
> Another way to look at it is that this only works as expected when `h5io.read_hdf5` is used in the same place as `h5io.write_hdf5`, right? So if I were to write some data in script A (or module A for that matter) to HDF5, it would probably not load properly in script B (or module B) where `A != B`.
Thankfully, it isn't quite that severe. If I write data in `A` and read it in `B`, it should always work as long as the class for the data (or singleton in this string `__reduce__` case) was defined in some third module `C`. The issue only really comes up when you both define and write the data in `A` such that the "module" comes out as `__main__`, but then don't redefine it in `__main__` at read time.
This was good feedback though, as it forced me to slow down and think about what I'm trying to accomplish here. Tacitly, I've been trying to match the behaviour of `use_state=True` to the behaviour of `pickle`, probably because all the comments in the source code directly quote the pickle docs. However, in the process of putting together an example I noticed I wasn't being strict enough! When `__reduce__` returns a string, pickle actually insists that the value stored in that import pass an `is` comparison with the owner of the `__reduce__` method.
This is all probably clearest with an example. I put together three .py scripts:
- Define classes and write
- Import classes and write
- Read
These all live next to each other in a directory at the top of my python path. They operate on three objects:
- A class that returns the name of a global int in `__reduce__`
- A singleton class that returns the name of a global instance of itself in `__reduce__`
- Just a regular class

And I use both `h5io` and `pickle` to save and load them.
`reducers_define.py`:

```python
from abc import ABCMeta
import os
import pickle

from h5io import write_hdf5

MY_GLOBAL = 42


class GlobalReduce:
    def __reduce__(self):
        return "MY_GLOBAL"


class Singleton(ABCMeta):
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]


class SingletonReduce(metaclass=Singleton):
    def __reduce__(self):
        return "MY_SINGLETON"


MY_SINGLETON = SingletonReduce()


class JustSomeRegularClass:
    pass


some_class_instance = JustSomeRegularClass()


def to_hdf(file_name, data):
    hdf_file = file_name + ".h5"
    try:
        write_hdf5(
            fname=hdf_file,
            data=data,
            overwrite=True,
            use_json=False,
            use_state=True,
        )
        print(f"HDF SAVED {file_name}")
    except ValueError as e:
        print(f"HDF FAILED {file_name}")  #: {e.args[0]}")
        os.remove(hdf_file)


def to_pickle(file_name, data):
    pickle_file = file_name + ".pckl"
    try:
        with open(pickle_file, "wb") as f:
            pickle.dump(data, f)
        print(f"PICKLE SAVED {file_name}")
    except pickle.PicklingError as e:
        print(f"PICKLE FAILED {file_name}")  #: {e.args[0]}")
        os.remove(pickle_file)


def write(file_start, objects):
    print(f"########{file_start}########")
    for instance in objects:
        file_name = f"{file_start}_{instance.__class__.__module__}.{instance.__class__.__name__}"
        for attack in [to_hdf, to_pickle]:
            attack(file_name, instance)


if __name__ == "__main__":
    write("reducers_define", [GlobalReduce(), SingletonReduce(), some_class_instance])
```
`reducers_write.py`:

```python
import os
import pickle

from h5io import write_hdf5
from reducers_define import write, GlobalReduce, SingletonReduce, JustSomeRegularClass

if __name__ == "__main__":
    some_class_instance = JustSomeRegularClass()
    write("reducers_write", [GlobalReduce(), SingletonReduce(), some_class_instance])
```
`reducers_read.py`:

```python
import glob
import os
import pickle

from h5io import read_hdf5


def read_hdf(file):
    try:
        instance = read_hdf5(fname=file)
        print(f"HDF READ {file}")  # : {instance}")
    except AttributeError as e:
        print(f"HDF FAILED {file}")  #: {e.args[0]}")


def read_pickle(file):
    try:
        with open(file, "rb") as f:
            instance = pickle.load(f)
        print(f"PICKLE READ {file}")  # : {instance}")
    except AttributeError as e:
        print(f"PICKLE FAILED {file}")  #: {e.args[0]}")


if __name__ == "__main__":
    hdf_files = glob.glob("reducers*.h5")
    pickle_files = glob.glob("reducers*.pckl")
    print(f"########READING########")
    try:
        for file in hdf_files:
            read_hdf(file)
        for file in pickle_files:
            read_pickle(file)
    finally:
        for file in hdf_files + pickle_files:
            os.remove(file)
            # pass
```
Then in my shell I execute `python reducers_define.py; python reducers_write.py; python reducers_read.py`, which gives the following feedback on what wrote/read OK, with filenames encoding the tool for (de)serialization, the script acting, the module of the object, and the class of the object:
```
########reducers_define########
HDF FAILED reducers_define___main__.GlobalReduce
PICKLE FAILED reducers_define___main__.GlobalReduce
HDF SAVED reducers_define___main__.SingletonReduce
PICKLE SAVED reducers_define___main__.SingletonReduce
HDF SAVED reducers_define___main__.JustSomeRegularClass
PICKLE SAVED reducers_define___main__.JustSomeRegularClass
########reducers_write########
HDF FAILED reducers_write_reducers_define.GlobalReduce
PICKLE FAILED reducers_write_reducers_define.GlobalReduce
HDF SAVED reducers_write_reducers_define.SingletonReduce
PICKLE SAVED reducers_write_reducers_define.SingletonReduce
HDF SAVED reducers_write_reducers_define.JustSomeRegularClass
PICKLE SAVED reducers_write_reducers_define.JustSomeRegularClass
########READING########
HDF READ reducers_write_reducers_define.SingletonReduce.h5
HDF FAILED reducers_define___main__.SingletonReduce.h5
HDF FAILED reducers_define___main__.JustSomeRegularClass.h5
HDF READ reducers_write_reducers_define.JustSomeRegularClass.h5
PICKLE READ reducers_write_reducers_define.SingletonReduce.pckl
PICKLE READ reducers_write_reducers_define.JustSomeRegularClass.pckl
PICKLE FAILED reducers_define___main__.SingletonReduce.pckl
PICKLE FAILED reducers_define___main__.JustSomeRegularClass.pckl
```
An even more condensed summary is that:

- We fail to write `GlobalReduce()` because its `__reduce__()` string is not itself
- We succeed to write `SingletonReduce()` just fine, but this can only be read when `SingletonReduce` was imported (or is re-imported/re-defined in the loading context), because otherwise the saved module is `__main__` and we just don't know about it
- Pickle and h5io behave the same
- h5io doesn't care if we're leveraging the `__reduce__` branch of `use_state` or not: you can't save something defined in `__main__` and expect to load it
> Even checking `if __module__ == "__main__"` would not be enough in that case -- you'd want to check that the module/global namespace during load is the same as the module/global namespace during write. And even that wouldn't really be a total guarantee (my `__main__` might not be your `__main__`, or any other module name like `utils` for example more generally).
>
> This is a bit of a departure from how h5io usually works. So far, once you write the data, you can move it to another computer and use it in another script/module/whatever and end up with the same data after read. This tightens the scope to once you write the data in this module or script, it will work in this module or script so long as the global variables are the same during read (or something like this I think?).
I would actually push back against this a bit. I think this is exactly how the rest of `h5io` works; it's just that regular usage of `h5io` tacitly assumes the data you're writing was defined somewhere else (`builtins`, `datetime`, `scipy`, wherever). In fact, even without `use_state`, you can push `h5io` to fail in similar ways.
In terms of scope, h5io depends on the environment being the same in both the saving and loading contexts in order to correctly recover the right object. The real threat here is when versions of the same library differ, but here is an absurdist example where I just completely overwrite scipy and things are silently wrong -- remember these files live in a directory high up my python path, so when I move `my_scipy.py` over to `scipy.py` in `spoof_write.py`, the interpreter finds my custom "scipy" ahead of the real one when I run `spoof_read.py`:
`my_scipy.py`:

```python
class _NotWhatYouThought:
    def __repr__(self):
        return f"These aren't the droids you're looking for"


class sparse:
    @classmethod
    def csc_matrix(cls, *args, **kwargs):
        return _NotWhatYouThought()
```
`spoof_write.py`:

```python
import shutil

from scipy import sparse
from h5io import write_hdf5, read_hdf5

data = sparse.csc_matrix(1)
file_name = "spoof.h5"
print("WRITING:", data)
write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)
from_real_module = read_hdf5(file_name)
print("NORMAL MODULE READ:", from_real_module)
shutil.move("my_scipy.py", "scipy.py")
```
`spoof_read.py`:

```python
import os
import shutil

from h5io import read_hdf5

file_name = "spoof.h5"
try:
    from_custom_module = read_hdf5(file_name)
    print("CUSTOM MODULE READ:", from_custom_module)
except Exception as e:
    print("FAILED", e.args[0])
finally:
    shutil.move("scipy.py", "my_scipy.py")
    os.remove(file_name)
```
Then `python spoof_write.py; python spoof_read.py`:
```
WRITING: (0, 0) 1
NORMAL MODULE READ: (0, 0) 1
CUSTOM MODULE READ: These aren't the droids you're looking for
```
In a similar way, h5io doesn't have any particular safeguards ensuring that what the user wrote is what winds up getting read. E.g. I can use the same three-file setup to define+write, import+write, and read under the condition that I've subclassed one of h5io's "known" classes (datetime in this case). I'm allowed to save it without complaint, and it loads without complaint (even when defined and saved in `__main__`), but what gets loaded is just a plain `datetime` object and not the class I saved:
`datetime_define.py`:

```python
from datetime import datetime

from h5io import write_hdf5


class MyDatetime(datetime):
    pass


data = MyDatetime(2024, 2, 6)
file_name = f"datetime_{data.__class__.__module__}.{data.__class__.__name__}.h5"
print("WRITING", file_name, type(data))
write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)
```
`datetime_write.py`:

```python
from h5io import write_hdf5
from datetime_define import MyDatetime

data = MyDatetime(2024, 2, 6)
file_name = f"datetime_{data.__class__.__module__}.{data.__class__.__name__}.h5"
print("WRITING", file_name, type(data))
write_hdf5(
    fname=file_name,
    data=data,
    overwrite=True,
    use_json=False,
    use_state=False,
)
```
`datetime_read.py`:

```python
import glob
import os

from h5io import read_hdf5

for file_name in glob.glob("datetime*.h5"):
    data = read_hdf5(fname=file_name)
    print("READING", file_name, type(data))
    os.remove(file_name)
```
Then `python datetime_define.py; python datetime_write.py; python datetime_read.py`:
```
WRITING datetime___main__.MyDatetime.h5 <class '__main__.MyDatetime'>
WRITING datetime_datetime_define.MyDatetime.h5 <class 'datetime_define.MyDatetime'>
WRITING datetime_datetime_define.MyDatetime.h5 <class 'datetime_define.MyDatetime'>
READING datetime___main__.MyDatetime.h5 <class 'datetime.datetime'>
READING datetime_datetime_define.MyDatetime.h5 <class 'datetime.datetime'>
```
So, I'm not convinced that `use_state` actually introduces any additional vulnerabilities or broken promises. Rather, it's just that the user is more likely to run into existing (and perfectly reasonable) limitations, because an intended use case is to save custom classes, which really highlights how these need to be importable from the same place at both save and load time.
> So I think to be safest we should probably make this behavior opt-in. Maybe a `use_state="unsafe"` or a new kwarg or something. And if possible/reasonable it would be good to save the name of the module where `write_hdf5` has been called, because we could at least check that on read to make sure it's the same (so we're at least likely to get the correct result). WDYT?
I fully agree the state stuff should be behind an opt-in wall, but since the "`__reduce__` returns a string" behaviour is identical to the rest of the `use_state` behaviour, I think the existing boolean gate is sufficient.
We could further do some sort of check and warning at write-time if the module is `__main__`, but since the behaviour here is anyhow identical to pickle, I don't think it's unfair to the users to let it be silent. Also, the default error when you forget to re-define your class in the loading context (`AttributeError: __main__ has no attribute FOO`, or however it's phrased) is pretty clear too.
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##             main      #77      +/-   ##
==========================================
+ Coverage   91.05%   91.46%   +0.40%
==========================================
  Files           3        3
  Lines         704      726      +22
  Branches      163      166       +3
==========================================
+ Hits          641      664      +23
+ Misses         39       38       -1
  Partials       24       24
```
Be explicit that SingleValue is abstract. It already was effectively, since it didn't implement its parent's abstract method, but just be super clear.
We need it that high for the case when state is None
Necessary when, e.g. `__reduce__` returns an integer variable
Thanks for the detailed analysis @liamhuber! I agree if we do what
Closes #76