##### Diving In

Serializing enables saving and reloading an in-memory data structure. The data is only meant to be used by the same program that created it, so the interoperability issues are limited to ensuring that later versions of the program can read data written by earlier versions.

For cases like this, the `pickle` module in Python is ideal.

What can the `pickle` module store?

* All the native data types: booleans, integers, floating point numbers, complex numbers, strings, bytes objects, byte arrays and `None`
* Lists, tuples, dictionaries and sets containing any combination of native data types
* Lists, tuples, dictionaries and sets containing any combination of lists, tuples, dictionaries and sets containing any combination of native data types (and so on, to the maximum nesting level that Python supports)
* Functions, classes and instances of classes (with caveats)
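
As a rough illustration of the list above, the sketch below round-trips a nested structure built from native types and then shows one of the caveats for functions: a lambda has no importable name, so pickling it fails. The variable names are illustrative only.

```python
import pickle

# A nested structure built only from native types.
nested = {
    'numbers': [1, 2.5, 3 + 4j],
    'flags': (True, False, None),
    'raw': b'\x00\x01',
    'inner': {'letters': {'a', 'b'}, 'pairs': [(1, 'one'), (2, 'two')]},
}

restored = pickle.loads(pickle.dumps(nested))
print(restored == nested)

# Functions are pickled by reference to their importable name,
# so an anonymous lambda cannot be pickled.
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError) as exc:
    lambda_pickled = False
    print('Could not pickle lambda:', type(exc).__name__)
```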

##### Saving data to a pickle file

The `pickle` module works with data structures, so let's build one.

In [42]:
entry = {}
entry['title'] = 'Dive into history, 2009 edition'
entry['article_link'] = 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'
entry['comments_link'] = None
entry['internal_id'] = b'\xDE\xD5\xB4\xF8'
entry['tags'] = ('diveintopython', 'docbook', 'html')
entry['published'] = True

from datetime import datetime
entry['published_date'] = datetime.strptime('21/11/06 16:30', "%d/%m/%y %H:%M")
entry

{'title': 'Dive into history, 2009 edition',
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'comments_link': None,
 'internal_id': b'\xde\xd5\xb4\xf8',
 'tags': ('diveintopython', 'docbook', 'html'),
 'published': True,
 'published_date': datetime.datetime(2006, 11, 21, 16, 30)}

In [43]:
import pickle

print("Saving entry dictionary to file")

with open('examples/entry.pickle', 'wb') as f:
    pickle.dump(entry, f)

Saving entry dictionary to file


The `dump()` function in the `pickle` module takes a serializable Python data structure, serializes it into a binary format using the default version of the pickle protocol, and saves it to an open file.

* The `pickle` module takes a Python data structure and saves it to a file.
* To do this, it serializes the data structure using a data format called the pickle protocol.
* The pickle protocol is Python-specific. There is no cross-language compatibility.
* Not every Python data structure can be serialized by the `pickle` module. The pickle protocol has changed several times as new data types have been added to the Python language, but there are still limitations.
* As a result of these changes, there is no guarantee of compatibility between versions. Newer versions of Python support the older serialization formats, but older versions of Python do not support the newer formats.
* Unless otherwise specified, the functions in the `pickle` module use the default protocol version (`pickle.DEFAULT_PROTOCOL`), which in recent Python releases can be older than the newest supported version (`pickle.HIGHEST_PROTOCOL`).
* The default and the latest versions of the protocol are binary formats.
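
The `pickle` module exposes the protocol numbers it supports as the constants `pickle.DEFAULT_PROTOCOL` and `pickle.HIGHEST_PROTOCOL`, and `dump()`/`dumps()` accept a `protocol` argument. A minimal sketch, using made-up sample data, that pickles the same value with every protocol the running interpreter supports:

```python
import pickle

data = {'title': 'Dive into history', 'tags': ('docbook', 'html')}

print('default protocol:', pickle.DEFAULT_PROTOCOL)
print('highest protocol:', pickle.HIGHEST_PROTOCOL)

# Pickle the same data with every protocol this interpreter supports
# and record how many bytes each one needs.
sizes = {}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(data, protocol=proto)
    assert pickle.loads(blob) == data
    sizes[proto] = len(blob)
print(sizes)

# Protocol 0 is the original text format. For this ASCII-only sample
# its output can be decoded and printed as ordinary text.
text_blob = pickle.dumps(data, protocol=0)
print(text_blob.decode('ascii'))
```

The exact sizes depend on the Python version, so none are listed here.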

##### Loading data from a pickle file

Let's load the data back from the pickle file. Start by deleting `entry`.

In [44]:
del entry

In [45]:
import pickle

with open('examples/entry.pickle', 'rb') as f:
    entry = pickle.load(f)

In [46]:
entry

{'title': 'Dive into history, 2009 edition',
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'comments_link': None,
 'internal_id': b'\xde\xd5\xb4\xf8',
 'tags': ('diveintopython', 'docbook', 'html'),
 'published': True,
 'published_date': datetime.datetime(2006, 11, 21, 16, 30)}

The `pickle.dump()`/`pickle.load()` cycle results in a new data structure that is equal to the original data structure.
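
"Equal" here means equal in value, not the same object. The sketch below makes that distinction explicit. It writes to a temporary directory (an assumption made so the example does not depend on an `examples/` folder) and uses a smaller stand-in dictionary.

```python
import os
import pickle
import tempfile

original = {'tags': ('diveintopython', 'docbook'), 'published': True}

# Write to a throwaway file, then read it back.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'entry.pickle')
    with open(path, 'wb') as f:
        pickle.dump(original, f)
    with open(path, 'rb') as f:
        copy = pickle.load(f)

print(copy == original)  # compares contents
print(copy is original)  # compares identity
```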

##### Pickling without a file

The earlier examples showed how to serialize a Python object directly to a file on disk. But what if you don't want or need a file? You can also serialize to a bytes object in memory.

In [47]:
b = pickle.dumps(entry)
type(b)

bytes

In [48]:
entry2 = pickle.loads(b)
entry2 == entry

True
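
`dump()` and `load()` are not tied to files on disk either. As a sketch, an in-memory `io.BytesIO` stream can stand in for the file, and several objects can be written to the same stream one after another. The sample values are made up.

```python
import io
import pickle

buffer = io.BytesIO()

# Each dump() call appends one complete pickle to the stream.
pickle.dump({'id': 1}, buffer)
pickle.dump(['a', 'b', 'c'], buffer)

# Rewind and read the objects back in the order they were written.
buffer.seek(0)
first = pickle.load(buffer)
second = pickle.load(buffer)
print(first, second)
```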

##### Bytes and String when pickling

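The pickle protocol has been around for many years and has been revised several times. As of Python 3.8 there are six versions of the protocol, numbered 0 through 5.

* Python 1.x had two versions of the protocol: a text format (version 0) and a binary format (version 1).
* Python 2.3 introduced version 2 to handle new functionality in class objects. It is a binary format.
* Python 3.0 introduced version 3, with explicit support for `bytes` objects. It is a binary format.
* Python 3.4 introduced version 4 and Python 3.8 introduced version 5. Both are binary formats. Version 4 is the one that appears in the disassembly in the next section.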
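
The dedicated support for `bytes` that arrived with protocol 3 can be observed in the opcodes a pickle uses. The sketch below is illustrative only; `opcode_names` is a hypothetical helper built on `pickletools.genops()`.

```python
import pickle
import pickletools

payload = b'\xde\xd5\xb4\xf8'

def opcode_names(blob):
    """Return the names of the opcodes used in a pickle."""
    return [opcode.name for opcode, arg, pos in pickletools.genops(blob)]

v2_names = opcode_names(pickle.dumps(payload, protocol=2))
v3_names = opcode_names(pickle.dumps(payload, protocol=3))

# Protocol 3 has a dedicated bytes opcode; protocol 2 has to describe
# the same value indirectly.
print('SHORT_BINBYTES' in v2_names)
print('SHORT_BINBYTES' in v3_names)
```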

##### Debugging pickle files

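The pickle protocol is binary, so attempting to `cat` a pickle file produces mostly unprintable characters, which is not helpful. The `pickletools` module can disassemble a pickle file into a readable listing instead.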

In [49]:
import pickletools

with open('examples/entry.pickle', 'rb') as f:
    pickletools.dis(f)

    0: \x80 PROTO      4
    2: \x95 FRAME      291
   11: }    EMPTY_DICT
   12: \x94 MEMOIZE    (as 0)
   13: (    MARK
   14: \x8c     SHORT_BINUNICODE 'title'
   21: \x94     MEMOIZE    (as 1)
   22: \x8c     SHORT_BINUNICODE 'Dive into history, 2009 edition'
   55: \x94     MEMOIZE    (as 2)
   56: \x8c     SHORT_BINUNICODE 'article_link'
   70: \x94     MEMOIZE    (as 3)
   71: \x8c     SHORT_BINUNICODE 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'
  147: \x94     MEMOIZE    (as 4)
  148: \x8c     SHORT_BINUNICODE 'comments_link'
  163: \x94     MEMOIZE    (as 5)
  164: N        NONE
  165: \x8c     SHORT_BINUNICODE 'internal_id'
  178: \x94     MEMOIZE    (as 6)
  179: C        SHORT_BINBYTES b'\xde\xd5\xb4\xf8'
  185: \x94     MEMOIZE    (as 7)
  186: \x8c     SHORT_BINUNICODE 'tags'
  192: \x94     MEMOIZE    (as 8)
  193: \x8c     SHORT_BINUNICODE 'diveintopython'
  209: \x94     MEMOIZE    (as 9)
  210: \x8c     SHORT_BINUNICODE 'docbook'
  219

The most interesting information in a `pickletools.dis()` disassembly is on the last line (the listing above is cut off before it), because it includes the version of the pickle protocol with which the file was saved. To determine the protocol version used to store a pickle file, you need to look at the markers ("opcodes") within the pickled data and use hard-coded knowledge of which opcodes were introduced with each version of the protocol.

The `pickletools.dis()` function does exactly that, and it prints the result in the last line of the output. The function below computes the same value without printing the disassembly.

In [50]:
import pickletools

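def protocol_version(file_object):
    # Start below protocol 0 so that a protocol 0 pickle is reported correctly.
    maxproto = -1
    for opcode, arg, pos in pickletools.genops(file_object):
        # Each opcode records the protocol version that introduced it.
        maxproto = max(maxproto, opcode.proto)

    return maxproto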

with open('examples/entry.pickle', 'rb') as f:
    v = protocol_version(f)
    
v

4
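
The highest protocol among the opcodes can be lower than the protocol that was requested when writing, because a pickle only uses the opcodes its data needs. Pickles written with protocol 2 or later also begin with a `PROTO` opcode whose argument is the declared protocol. The sketch below reads that argument. `declared_protocol` is a hypothetical helper name, not part of `pickletools`.

```python
import pickle
import pickletools

def declared_protocol(pickled):
    """Return the PROTO opcode's argument, or None for protocols 0 and 1."""
    for opcode, arg, pos in pickletools.genops(pickled):
        # PROTO, when present, is the first opcode, so one look is enough.
        return arg if opcode.name == 'PROTO' else None
    return None

sample = {'id': 256}
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    print(proto, declared_protocol(pickle.dumps(sample, protocol=proto)))
```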

##### Serializing Python Objects to be read by other languages

The data format used by the `pickle` module is Python-specific. If cross-language compatibility is one of the requirements, you need to look at other formats. One such format is JSON.

Python includes a `json` module in the standard library. Like the `pickle` module, the `json` module has functions for serializing data structures, but with some important differences.

First of all, the JSON data format is text-based, not binary. For example, a boolean value is stored as either the character string `false` or `true`. All JSON values are case-sensitive.

JSON allows arbitrary amounts of whitespace between values. This whitespace is insignificant, which means JSON encoders can add as much or as little whitespace as they like, and JSON decoders are required to ignore it.

There is also the perennial problem of character encoding. JSON is text, so it must be stored in a Unicode encoding, typically UTF-8.
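
These three properties can be checked directly with `json.dumps()` and `json.loads()`. A small sketch with made-up sample values:

```python
import json

# 'caf\u00e9' is the word "cafe" with an accented final letter.
value = {'published': True, 'comments_link': None, 'title': 'caf\u00e9'}

compact = json.dumps(value, separators=(',', ':'))
pretty = json.dumps(value, indent=4)

# Different whitespace, same data once decoded.
print(json.loads(compact) == json.loads(pretty))

# Booleans and None become the lowercase literals true, false and null.
print(json.dumps([True, False, None]))

# By default non-ASCII characters are escaped. With ensure_ascii=False they
# are kept as-is, so the output file needs an explicit Unicode encoding.
escaped = json.dumps('caf\u00e9')
literal = json.dumps('caf\u00e9', ensure_ascii=False)
print(escaped, literal)
```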

##### Saving data to JSON file

JSON looks remarkably similar to a Python dictionary literal.

In [51]:
entry

{'title': 'Dive into history, 2009 edition',
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'comments_link': None,
 'internal_id': b'\xde\xd5\xb4\xf8',
 'tags': ('diveintopython', 'docbook', 'html'),
 'published': True,
 'published_date': datetime.datetime(2006, 11, 21, 16, 30)}

In [52]:
import json

basic_entry = {}
basic_entry['id'] = 256
basic_entry['title'] = 'Dive into history, 2009 edition'
basic_entry['tags'] = ('diveintopython', 'docbook', 'html')
basic_entry['published'] = True
basic_entry['comments_link'] = None

with open('examples/basic.json', 'w') as f:
    json.dump(basic_entry, f)

In [53]:
with open('examples/basic.json', 'r') as f:
    for line in f:
        print(line)

{"id": 256, "title": "Dive into history, 2009 edition", "tags": ["diveintopython", "docbook", "html"], "published": true, "comments_link": null}


JSON is more readable than the `pickle` format. Let's indent the JSON file to make it easier to read.

In [54]:
import json

with open('examples/basic-pretty.json', 'w') as f:
    json.dump(basic_entry, f, indent=2)

In [55]:
with open('examples/basic-pretty.json', 'r') as f:
    for line in f:
        print(line.rstrip())

{
  "id": 256,
  "title": "Dive into history, 2009 edition",
  "tags": [
    "diveintopython",
    "docbook",
    "html"
  ],
  "published": true,
  "comments_link": null
}


##### Mapping of Python DataTypes to JSON

Since JSON is not Python-specific, there are some mismatches in its coverage of Python data types. Some of them are simply naming differences, but two important data types are missing:

* bytes
* tuples

JSON has an array type, which the `json` module maps to a Python list, but it does not have a separate type for tuples. And while JSON has support for strings, it has no support for `bytes` objects or byte arrays.
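
Both gaps are easy to observe. In the sketch below (sample values borrowed from the `entry` dictionary), a tuple comes back as a list and a `bytes` value is rejected with a `TypeError`.

```python
import json

# A tuple is written as a JSON array and comes back as a list.
tags = ('diveintopython', 'docbook', 'html')
decoded_tags = json.loads(json.dumps(tags))
print(type(decoded_tags).__name__, decoded_tags)

# bytes has no JSON equivalent, so the encoder refuses it.
try:
    json.dumps({'internal_id': b'\xde\xd5\xb4\xf8'})
    bytes_error = None
except TypeError as exc:
    bytes_error = str(exc)
    print('TypeError:', bytes_error)
```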

##### Serializing data types unsupported by JSON

The `json` module provides extensibility hooks for encoding and decoding unknown data types. Let's take a look at some examples.

In [56]:
entry

{'title': 'Dive into history, 2009 edition',
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'comments_link': None,
 'internal_id': b'\xde\xd5\xb4\xf8',
 'tags': ('diveintopython', 'docbook', 'html'),
 'published': True,
 'published_date': datetime.datetime(2006, 11, 21, 16, 30)}

In [57]:
from datetime import datetime

def to_json(pobject):
    if isinstance(pobject, bytes):
        print('Serializing bytes')
        return { '__class__':  'bytes', '__value__': list(pobject)}
    
    if isinstance(pobject, datetime):
        print('Serializing datetime')
        return {'__class__' : 'datetime.asctime',
                '__value__': pobject.strftime('%d/%m/%y %H:%M')}
    
    print(f'No custom for type {type(pobject)}')

    raise TypeError(repr(pobject) + ' is not JSON serializable')

In [58]:
with open('examples/entry.json', 'w') as f:
    json.dump(entry, f, default=to_json, indent=2)

Serializing bytes
Serializing datetime


In [59]:
with open('examples/entry.json', 'r') as f:
    for line in f:
        print(line.rstrip())

{
  "title": "Dive into history, 2009 edition",
  "article_link": "http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition",
  "comments_link": null,
  "internal_id": {
    "__class__": "bytes",
    "__value__": [
      222,
      213,
      180,
      248
    ]
  },
  "tags": [
    "diveintopython",
    "docbook",
    "html"
  ],
  "published": true,
  "published_date": {
    "__class__": "datetime.asctime",
    "__value__": "21/11/06 16:30"
  }
}
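
The same conventions could also be packaged as a `json.JSONEncoder` subclass and passed through the `cls` argument instead of `default`. This is only a sketch of that alternative. `EntryEncoder` is a hypothetical name and the sample data is shortened.

```python
import json
from datetime import datetime

class EntryEncoder(json.JSONEncoder):
    """Hypothetical encoder that mirrors the to_json() conventions."""

    def default(self, o):
        if isinstance(o, bytes):
            return {'__class__': 'bytes', '__value__': list(o)}
        if isinstance(o, datetime):
            return {'__class__': 'datetime.asctime',
                    '__value__': o.strftime('%d/%m/%y %H:%M')}
        # Let the base class raise TypeError for anything else.
        return super().default(o)

sample = {'internal_id': b'\xde\xd5',
          'published_date': datetime(2006, 11, 21, 16, 30)}
encoded = json.dumps(sample, cls=EntryEncoder)
print(encoded)
```

A subclass keeps the conversion rules in one reusable place, while a plain `default` function is shorter for one-off use.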


##### Loading data from a json file

Like the `pickle` module, the `json` module has a `load()` function that takes a stream object, reads JSON-encoded data from it, and creates a new Python object that mirrors the JSON data.

In [60]:
del entry

In [92]:
import json

with open('examples/entry.json', 'r', encoding='utf-8') as f:
    entry = json.load(f)

In [93]:
entry

{'title': 'Dive into history, 2009 edition',
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'comments_link': None,
 'internal_id': {'__class__': 'bytes', '__value__': [222, 213, 180, 248]},
 'tags': ['diveintopython', 'docbook', 'html'],
 'published': True,
 'published_date': {'__class__': 'datetime.asctime',
  '__value__': '21/11/06 16:30'}}

`json.load()` does not know anything about the conversion function we passed to `json.dump()`, so the custom values come back as plain dictionaries. We need a function that is the opposite of `to_json()`: one that takes a custom-converted JSON object and converts it back to the original Python data type.

In [97]:
from datetime import datetime

def from_json(json_object):
    if '__class__' in json_object:
        if json_object['__class__'] == 'bytes':
            return bytes(json_object['__value__'])
        
        if json_object['__class__'] == 'datetime.asctime':
            return datetime.strptime(json_object['__value__'], 
                                     "%d/%m/%y %H:%M")
        
    return json_object

In [98]:
del entry

In [100]:
import json

with open('examples/entry.json', 'r', encoding='utf-8') as f:
    entry = json.load(f, object_hook=from_json)
    
entry

{'title': 'Dive into history, 2009 edition',
 'article_link': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
 'comments_link': None,
 'internal_id': b'\xde\xd5\xb4\xf8',
 'tags': ['diveintopython', 'docbook', 'html'],
 'published': True,
 'published_date': datetime.datetime(2006, 11, 21, 16, 30)}
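
Note that `tags` is still a list: tuples are written as ordinary JSON arrays, so the `default` hook is never consulted for them and `object_hook` has nothing to recognize. The self-contained sketch below repeats the round trip in memory to make that visible. `encode_extra` and `decode_extra` are hypothetical names that mirror `to_json` and `from_json`.

```python
import json
from datetime import datetime

def encode_extra(obj):
    """default= hook: wrap types JSON cannot represent."""
    if isinstance(obj, bytes):
        return {'__class__': 'bytes', '__value__': list(obj)}
    if isinstance(obj, datetime):
        return {'__class__': 'datetime.asctime',
                '__value__': obj.strftime('%d/%m/%y %H:%M')}
    raise TypeError(repr(obj) + ' is not JSON serializable')

def decode_extra(dct):
    """object_hook: called for every decoded JSON object, innermost first."""
    if dct.get('__class__') == 'bytes':
        return bytes(dct['__value__'])
    if dct.get('__class__') == 'datetime.asctime':
        return datetime.strptime(dct['__value__'], '%d/%m/%y %H:%M')
    return dct

original = {'internal_id': b'\xde\xd5\xb4\xf8',
            'tags': ('docbook', 'html'),
            'published_date': datetime(2006, 11, 21, 16, 30)}

text = json.dumps(original, default=encode_extra)
restored = json.loads(text, object_hook=decode_extra)

print(restored['internal_id'] == original['internal_id'])
print(restored['published_date'] == original['published_date'])
print(restored['tags'])  # a list: the tuple was not restored
```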

In [80]:
entry