### ObjectSerializer Example

In [13]:
import sys
import os
import dill
import numpy as np
from dataclasses import dataclass

from slurmflow.serializer import ObjectSerializer
from slurmflow.config import ConfigParser

The `ObjectSerializer` is a utility for storing arbitrary Python objects on disk. It uses the `h5py` library to store hierarchical data structures and `blosc2` to compress the serialized bytes of each object, allowing for efficient storage of large datasets.
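
As a rough illustration of the serialize-then-compress idea described above, the sketch below uses only the standard library: `pickle` stands in for `dill` and `zlib` stands in for `blosc2`. It is not the `ObjectSerializer` implementation, and the helper names `pack` and `unpack` are hypothetical.

```python
import pickle
import zlib


def pack(obj) -> bytes:
    """Serialize an object to bytes, then compress those bytes."""
    raw = pickle.dumps(obj)      # assumption: stand-in for dill.dumps
    return zlib.compress(raw)    # assumption: stand-in for blosc2 compression


def unpack(blob: bytes):
    """Reverse of pack: decompress, then deserialize."""
    return pickle.loads(zlib.decompress(blob))


record = {"id": 1, "name": "Test Object", "data": list(range(100_000))}
raw_size = len(pickle.dumps(record))
blob = pack(record)
restored = unpack(blob)

print(f"Round trip equal: {restored == record}")
print(f"Compression factor: {raw_size / len(blob):.3f}")
```

The real serializer additionally writes each compressed payload into an HDF5 group, which this sketch leaves out.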

It won't work for everything (most notably objects that are wrappers around C objects, e.g. many OpenMM objects), but I have found that it works well for custom dataclasses and for objects from various packages (e.g. MDTraj trajectories, MDAnalysis universes, numpy arrays, and torch tensors).
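
One way to find out ahead of time whether an object will survive serialization is to attempt a round trip and catch the failure. The sketch below is an assumption-laden illustration: it uses the standard-library `pickle` rather than `dill`, the helper name `can_round_trip` is hypothetical, and a `threading.Lock` serves as a readily available example of a C-level object that `pickle` rejects.

```python
import pickle
import threading


def can_round_trip(obj) -> bool:
    """Return True if obj can be serialized and deserialized with pickle."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except Exception:
        # Objects backed by C-level state often fail at this point.
        return False


plain_ok = can_round_trip({"a": [1, 2, 3]})
lock_ok = can_round_trip(threading.Lock())

print(f"Plain dict serializable: {plain_ok}")
print(f"Thread lock serializable: {lock_ok}")
```

`dill` supports more object types than `pickle`, so results for a given object may differ when the check is repeated with `dill.dumps` and `dill.loads`.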

Below is an example in which a custom dataclass is defined and then saved using the serializer. The stored object is then loaded, and the original and retrieved objects are checked for equality. The compression factor is also reported.

In [14]:
@dataclass
class ExampleDataClass:
    id: int
    name: str
    data: list
    is_active: bool
    config: ConfigParser
        
    def __eq__(self, other):
        return (self.id == other.id and
                self.name == other.name and
                np.array_equal(self.data, other.data) and
                self.is_active == other.is_active)

In [15]:
OS = ObjectSerializer()
cfg = ConfigParser("config/example.yml") # see config.ipynb for more (here it's just an object to be stored).

# Step 1: Create an instance of the dataclass
example_data = ExampleDataClass(1, "Test Object", np.arange(400000), True, cfg)

# Step 2: Serialize the Object
OS.save(example_data, "example/example_data.h5", overwrite=True) # overwrite should be disabled for partial saving.

# Step 3: Deserialize the Object
deserialized_data = OS.load("example/example_data.h5")

# Step 4: Equality Check
is_equal = deserialized_data == example_data
print(f"Equality check: {is_equal}")

# Step 5: Compute Compression Factor
# The "in-memory" size is approximated by the size of the uncompressed dill byte string.
serialized_size = os.path.getsize("example/example_data.h5")
in_memory_size = sys.getsizeof(dill.dumps(example_data))
compression_factor = in_memory_size / serialized_size

print(f"Serialized Size: {serialized_size} bytes")
print(f"In-Memory Size: {in_memory_size} bytes")
print(f"Compression Factor: {compression_factor:.3f}")


Equality check: True
Serialized Size: 427960 bytes
In-Memory Size: 3205193 bytes
Compression Factor: 7.489


The contents of an archive can be printed using the `print_summary` method:

In [16]:
OS.print_summary("example/example_data.h5")

config
config/config_data
config/parent_config_data
data
id
is_active
name


Before being stored in the `h5` archive, the serialized bytes are wrapped in `np.void`, which has a size limit. To get around this, the bytes are split into chunks and each chunk is wrapped in `np.void`. By default the chunks are hidden in the printed summary; they can be viewed using the `chunks` flag.

In [18]:
OS.print_summary("example/example_data.h5", chunks=True)

config
config/config_data
config/config_data/chunk_0
config/parent_config_data
config/parent_config_data/chunk_0
data
data/chunk_0
id
id/chunk_0
is_active
is_active/chunk_0
name
name/chunk_0
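
The chunking step described above can be sketched as follows. This is a minimal illustration rather than the library's code: the helper names `split_chunks` and `join_chunks` and the chunk size are hypothetical, and the `np.void` wrapping and HDF5 writes are omitted so that the example runs with the standard library alone. Only the `chunk_<i>` naming is taken from the summary output.

```python
def split_chunks(payload: bytes, chunk_size: int) -> dict:
    """Split a byte string into named, fixed-size chunks (the last may be shorter)."""
    return {
        f"chunk_{i}": payload[start:start + chunk_size]
        for i, start in enumerate(range(0, len(payload), chunk_size))
    }


def join_chunks(chunks: dict) -> bytes:
    """Reassemble chunks in numeric order.

    Sorting the keys as strings would place chunk_10 before chunk_2,
    so the index is rebuilt explicitly instead.
    """
    return b"".join(chunks[f"chunk_{i}"] for i in range(len(chunks)))


payload = bytes(range(256)) * 10          # 2560 bytes of example data
chunks = split_chunks(payload, chunk_size=1000)

print(list(chunks))
print(join_chunks(chunks) == payload)
```

Small objects such as `id` or `name` fit in a single chunk, which matches the single `chunk_0` entry shown for each dataset in the summary.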
