# Universal Python Serialization

The first tool you should always think about when serializing Python objects is the native pickle format. A pickle can serialize *almost any* Python object in a binary format.

More specialized protocols exist for serialization within cluster computing frameworks. Cloudpickle is widely used for this purpose, but is not specifically discussed in this training.  Later in this training we will look at a variety of formats that Python can work with, but that are not specific to Python objects.

Let us start out by loading a few Python standard library modules this lesson will utilize.

In [1]:
import pickle
import io
from pprint import pprint
from dataclasses import dataclass
from zipfile import ZipFile
from datetime import datetime

### A Byte Representation

Let us create a dictionary and use pickle to serialize it in a binary form.

In [2]:
my_data = dict(name="David", real_number=76.54, count=22,
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

pkl = pickle.dumps(my_data)
pprint(pkl, width=50)

(b'\x80\x04\x95g\x00\x00\x00\x00\x00\x00\x00}'
 b'\x94(\x8c\x04name\x94\x8c\x05David'
 b'\x94\x8c\x0breal_number\x94G@S"\x8f\\(\xf5\xc3'
 b'\x8c\x05count\x94K\x16\x8c\x04pets\x94]\x94('
 b'\x8c\x08Astrophe\x94\x8c\x07Kachina'
 b'\x94\x8c\x07Jackson\x94\x8c\x05Rebel\x94eu.')
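
The raw bytes are easier to follow when disassembled. As a minimal sketch, the standard library's `pickletools.dis()` lists the opcodes in a pickle one per line. The protocol is pinned to 4 here only so that the listing matches the `\x80\x04` header seen above; that choice is an assumption of this example rather than something the lesson requires.

```python
import io
import pickle
import pickletools

my_data = dict(name="David", real_number=76.54, count=22,
               pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])

# Pin the protocol so the opcode listing does not vary between Python versions
pkl = pickle.dumps(my_data, protocol=4)

# Write the disassembly into a text buffer, then display it
listing = io.StringIO()
pickletools.dis(pkl, out=listing)
print(listing.getvalue())
```

Each string, number, and container in the dictionary appears as its own opcode, which makes it clear that a pickle is a small program for rebuilding the object rather than a memory snapshot.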


## Same Values, Different Object

Unpickling a serialization will create an *equivalent* object, but not an identical object.  It should not be confused with a shared memory or concurrency mechanism (pickles are a building block for *some* concurrency models, however).

In [3]:
new_data = pickle.loads(pkl)
new_data

{'name': 'David',
 'real_number': 76.54,
 'count': 22,
 'pets': ['Astrophe', 'Kachina', 'Jackson', 'Rebel']}

In [4]:
print("Equality:", new_data == my_data)
print("Identity:", new_data is my_data)

Equality: True
Identity: False
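
The same point holds for nested objects. The sketch below uses a smaller dictionary than the lesson's, purely for illustration. It shows that the inner list is rebuilt as well, so changes to the copy never reach the original, and that references shared *within* a single pickle are still shared after unpickling.

```python
import pickle

my_data = dict(name="David", count=22, pets=['Astrophe', 'Kachina'])
new_data = pickle.loads(pickle.dumps(my_data))

# Nested containers are rebuilt too, so the two lists are distinct objects
print("Same list object:", new_data['pets'] is my_data['pets'])

# Mutating the copy leaves the original untouched
new_data['pets'].append('Rebel')
print("Original pets:", my_data['pets'])
print("Still equal:", new_data == my_data)

# Within one pickle, an object referenced twice is restored only once
shared = ['x']
pair = pickle.loads(pickle.dumps([shared, shared]))
print("Shared reference kept:", pair[0] is pair[1])
```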


## Pickling to Files

We can pickle to raw bytes, but for many purposes it is more useful to write these serializations to files.  The functions `pickle.dump()` and `pickle.load()` write to and read from file-like objects rather than byte strings.

In [5]:
import os

os.makedirs('tmp', exist_ok=True)  # make sure the target directory exists
with open('tmp/data.pkl', 'wb') as fh:
    pickle.dump(my_data, fh)

In [None]:
%%bash
hexdump -C tmp/data.pkl

## Reading Objects from Files

Reading a pickle from a file, or from another file-like object, mirrors writing it.  With Python's so-called duck-typing, any object with `.read()` and `.readline()` methods that produce bytes is suitable for unpickling.  Symmetrically, any object with a `.write()` method accepting bytes is suitable for pickling.

In [None]:
with open('tmp/data.pkl', 'rb') as fh:  # close the file once it has been read
    pprint(pickle.load(fh))

### File-Like Objects

A regular file on the local filesystem is a common location for pickles, but they might be available over a socket, or from a database connection with BLOB storage of pickles, or over an HTTP request, and so on.  For example, perhaps a zip file contains one or more pickles.

In [None]:
%%bash
zip tmp/data tmp/data.pkl

In [None]:
with ZipFile('tmp/data.zip') as zf:
    with zf.open('tmp/data.pkl') as zfile:
        pprint(pickle.load(zfile))
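
The same round trip can be sketched without the shell or the filesystem. This variant builds the archive in memory with `ZipFile.writestr()`. The member name `record.pkl` and the small `record` dictionary are placeholders chosen for this example.

```python
import io
import pickle
from zipfile import ZipFile

record = {'name': 'David', 'count': 22}

# Build a zip archive in memory; 'record.pkl' is a hypothetical member name
archive = io.BytesIO()
with ZipFile(archive, 'w') as zf:
    zf.writestr('record.pkl', pickle.dumps(record))

# Re-open the archive and unpickle straight from the member's file handle
archive.seek(0)
with ZipFile(archive) as zf:
    with zf.open('record.pkl') as member:
        restored = pickle.load(member)

print("Round trip equal:", restored == record)
```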

### Other File-Like Objects

Python uses *duck-typing* quite extensively; a great many things are file-like.  For example, we might use a memory IO buffer as the equivalent of a file.

In [None]:
memfile = io.BytesIO(pkl)
pickle.load(memfile)
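
To make the duck-typing concrete, here is a minimal sketch with two hand-written classes that are not files at all. `ByteSink` and `ByteSource` are hypothetical names invented for this example. The first offers only `.write()`, the second only `.read()` and `.readline()`, which is what the unpickler looks for.

```python
import pickle

class ByteSink:
    "Hypothetical write-only target: it only needs a .write() method"
    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)

class ByteSource:
    "Hypothetical read-only source offering .read() and .readline()"
    def __init__(self, payload):
        self.payload = payload
        self.pos = 0

    def read(self, n=-1):
        # Return up to n bytes, or everything remaining when n is negative
        end = len(self.payload) if n is None or n < 0 else self.pos + n
        chunk = self.payload[self.pos:end]
        self.pos += len(chunk)
        return chunk

    def readline(self):
        # Return bytes up to and including the next newline, if any
        newline = self.payload.find(b'\n', self.pos)
        end = len(self.payload) if newline < 0 else newline + 1
        chunk = self.payload[self.pos:end]
        self.pos = end
        return chunk

sink = ByteSink()
pickle.dump({'count': 22, 'pets': ['Rebel']}, sink)

source = ByteSource(b''.join(sink.chunks))
result = pickle.load(source)
print(result)
```

Neither class inherits from anything in `io`; having the right methods is enough.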

# Pickle Limitations

Most Python objects can be pickled and unpickled.  A simple dictionary, with some scalars and one nested list, was used in the examples earlier.  You are not limited to the built-in data structures; however, there are a few limits.  If you pickle an instance of a class, the class itself needs to be available on the receiving system.  Often this is no problem, since the class is from a library installed at both ends.

### Round-Trip with a DataClass

Perhaps we wish to use a dataclass instead of a dictionary in a program.

In [None]:
@dataclass
class Trainer:
    name: str
    real_number: float
    count: int
    pets: list

In [None]:
my_instance = Trainer(name="David", real_number=76.54, count=22,
                      pets=['Astrophe', 'Kachina', 'Jackson', 'Rebel'])
my_instance

In [None]:
pickle.loads(pickle.dumps(my_instance))

### Round-Trip with Datetimes

Or we want to store and retrieve datetime values.

In [None]:
events = {'description': 'Developed Lesson',
          'start': datetime.fromisoformat('2020-05-22T12:11:10'),
          'end': datetime(2020, 5, 23, 9, 10, 11)}
events

In [None]:
pickle.loads(pickle.dumps(events))

### Missing Classes

If the class we want is not available, or even simply lives in the wrong namespace, we will not succeed in unpickling.  For example, a pickle file is available, but the code defining the class of the pickled instance is not on the local system.

In [None]:
try:
    with open('data/3dpoint.pkl', 'rb') as fh:
        print(pickle.load(fh))
except Exception as err:
    print(err)
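
The failure can also be reproduced without a prepared pickle file. The sketch below is an illustration under stated assumptions: `hypothetical_shapes` and `Point3D` are made-up names, and the "receiving system" is simulated by removing the module from `sys.modules` before unpickling. The exact exception you see with a real file may differ.

```python
import pickle
import sys
import types

# Hypothetical module standing in for a library installed only on the sender
shapes = types.ModuleType("hypothetical_shapes")

class Point3D:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

# Make the class look as if it had been defined inside that module
Point3D.__module__ = "hypothetical_shapes"
Point3D.__qualname__ = "Point3D"
shapes.Point3D = Point3D
sys.modules["hypothetical_shapes"] = shapes

# Pickling works: the class can be found by module and name
payload = pickle.dumps(Point3D(1, 2, 3))
print(b"hypothetical_shapes" in payload)  # the pickle stores a name, not code

# Simulate the receiving system, where the module is not installed
del sys.modules["hypothetical_shapes"]
try:
    pickle.loads(payload)
    missing_error = None
except (ImportError, AttributeError) as err:
    missing_error = err

print(type(missing_error).__name__, missing_error)
```

The pickle holds only the instance's data plus a reference to `hypothetical_shapes.Point3D`, so without that module there is nothing to rebuild the instance from.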

### Transient State

Pickling is not directly possible for objects that are inherently impermanent.  For example, objects may represent file descriptors to the local filesystem or connections to a database.  They store the state of one particular computer at one particular time, and cannot be serialized.

In [None]:
hello = "¡Hola Mundo!"
num = 999
fname = 'tmp/test.data'
fd = open(fname, 'w')
fd.write(hello)
data = {'fd': fd, 'num': num, 'hello': hello}

In [None]:
try:
    pickle.dumps(data)
except Exception as err:
    print(err)
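
Open files are not the only objects of this kind. As a short sketch, two other examples of process-bound state, chosen for illustration and not taken from the lesson, fail in the same way: a thread lock and a partially consumed generator.

```python
import pickle
import threading

transient = {
    'lock': threading.Lock(),             # belongs to this running process
    'generator': (n for n in range(3)),   # a paused execution frame
}

# Try each object in turn and record why pickling it was refused
failures = {}
for label, obj in transient.items():
    try:
        pickle.dumps(obj)
    except Exception as err:
        failures[label] = f"{type(err).__name__}: {err}"

for label, message in failures.items():
    print(f"{label:>9}: {message}")
```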

# Customizing Serialization

If you wish to serialize and deserialize classes you create yourself, you are free to specify which data is actually necessary and relevant for recreating *equivalent* instances.  This customization lets you initialize transient state in a manner that allows something close to round-tripping.  For example, a particular local file cannot be shared on a non-networked filesystem; however, unpickling might create a usable file local to the destination filesystem.

In [None]:
class HelloNum:
    "Plain class that holds file descriptor"
    def __init__(self, fname, hello, num):
        self.fd = open(fname, 'r+')
        self.fd.write(hello)
        self.num = num
        
    def __str__(self):
        return (f"<{self.__class__.__name__} holding file "
                f"{self.fd.name}({self.fd.fileno()}) and num {self.num}>")

We can add the capability of serializing the most important information in an instance to a simple tuple (another structure would work also; e.g. a list, a dict, a namedtuple, etc).

In [None]:
class HelloNum2(HelloNum):
    "Add the ability to pickle the essential data"
    def __getstate__(self):
        pos = self.fd.tell()
        self.fd.flush()
        self.fd.seek(0)
        hello = self.fd.read()
        self.fd.seek(pos)
        data = (self.fd.name, # fname
                hello,   # file content
                self.num)
        print("Pickling tuple only...")
        return data

The `.__init__()` of a class is not called during unpickling.  By default, the instance's `.__dict__` is simply restored.  We can make our class do something different from that.

In [None]:
class HelloNum3(HelloNum2):
    "Add the ability to reconstruct local state"
    def __setstate__(self, data):
        self.fd = open(data[0], 'r+')
        self.fd.write(data[1])
        self.num = data[2]

Let us create an instance then round-trip it.

In [None]:
hi = HelloNum3(fname, hello, num)
print(hi)

In [None]:
pkl = pickle.dumps(hi)
pprint(pkl, width=60)

We can unpickle and get a suitable instance.  The crucial detail to notice is that the file number of the file descriptor has changed.  De-serialization creates a new file on the target system and populates it with the same content.  But it is a distinct file that will not be synchronized with the sending system in any way.

In [None]:
new_hi = pickle.loads(pkl)
print(new_hi)
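
The earlier remark that `.__init__()` is not called during unpickling can be checked directly. The following is a minimal sketch with a hypothetical `Counter` class that involves no files. It counts calls to `.__init__()`, leaves a transient `cache` attribute out of the pickle, and rebuilds it in `.__setstate__()`. The module name `hypothetical_counters` is likewise an assumption of this example, registered only so that the sketch runs the same way however it is executed.

```python
import pickle
import sys
import types

class Counter:
    "Hypothetical class with one durable and one transient attribute"
    init_calls = 0

    def __init__(self, start):
        type(self).init_calls += 1
        self.value = start
        self.cache = {}          # transient scratch space, not worth saving

    def __getstate__(self):
        state = self.__dict__.copy()
        del state['cache']       # leave the transient part out of the pickle
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.cache = {}          # rebuild the transient part locally

# Pickle stores only a module and class name, so the class must be importable.
# Registering a hypothetical module keeps the sketch independent of how it runs.
Counter.__module__, Counter.__qualname__ = "hypothetical_counters", "Counter"
module = types.ModuleType("hypothetical_counters")
module.Counter = Counter
sys.modules["hypothetical_counters"] = module

original = Counter(10)
original.cache['expensive'] = 12345
clone = pickle.loads(pickle.dumps(original))

print("Calls to __init__:", Counter.init_calls)
print("Restored value:", clone.value)
print("Restored cache:", clone.cache)
```

Only one call to `.__init__()` is recorded, for `original`. The clone gets its attributes from `.__setstate__()` alone.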