
*Note:* You can run this from your computer (Jupyter or terminal), or use one of the
hosted options:

[![binder-logo](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ploomber/binder-env/main?urlpath=git-pull%3Frepo%3Dhttps%253A%252F%252Fgithub.com%252Fploomber%252Fprojects%26urlpath%3Dlab%252Ftree%252Fprojects%252Fserialize-unserialize%252FREADME.ipynb%26branch%3Dmaster)

[![deepnote-logo](https://deepnote.com/buttons/launch-in-deepnote-small.svg)](https://deepnote.com/launch?template=deepnote&url=https://github.com/ploomber/projects/blob/master/serialize-unserialize/README.ipynb)


# Serialization

Incremental builds allow Ploomber to skip tasks whose source code hasn't changed; to enable this feature, each task must save its products to disk. However, in some cases we don't want our pipeline to perform disk operations. For example, when deploying a pipeline, eliminating disk operations reduces runtime considerably.

To enable a pipeline to work in both disk-based and in-memory scenarios, we can declare a `serializer` and an `unserializer` in our pipeline declaration, effectively separating our task's logic from the read/write logic.

Note that this only applies to function tasks; other tasks are unaffected by the `serializer`/`unserializer` configuration.

In [1]:
from ploomberutils import display_file

## Built-in pickle serialization

The easiest way to get started is to use the built-in serializer and unserializer, which use the `pickle` module.

Let's see an example. The following pipeline has two tasks: the first one generates a dictionary and the second one generates two dictionaries. Since we are using pickle-based serialization, each dictionary is saved in the pickle binary format:

In [2]:
display_file('simple.yaml')


```yaml
serializer: ploomber.io.serializer_pickle
unserializer: ploomber.io.unserializer_pickle

tasks:
  - source: tasks.first
    product: output/one_dict
  
  - source: tasks.second
    product:
        another: output/another_dict
        final: output/final_dict
```
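
The `tasks.py` module referenced by the pipeline is not shown here. As a rough sketch, function tasks that rely on a serializer return their output instead of writing it to disk. The function bodies below are assumptions, chosen only so that the values line up with the JSON output shown later in this document; the convention that the returned keys match the product keys (`another`, `final`) is likewise assumed:

```python
# Hypothetical contents of tasks.py (not taken from the original project)


def first():
    # Return the object; the configured serializer decides how to store it
    return dict(a=1, b=2)


def second(upstream):
    # The unserializer has already loaded the upstream product for us
    first = upstream['first']
    another = dict(a=first['b'] + 1, b=first['a'] + 1)
    final = dict(a=100, b=200)
    # One entry per declared product: 'another' and 'final'
    return dict(another=another, final=final)


# Calling the functions directly, with no disk operations involved
one = first()
outputs = second({'first': one})
```

Because the functions never touch the file system, they can be called directly, which is what makes the same code usable in an in-memory setting.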


In [3]:
%%sh
ploomber build --entry-point simple.yaml --force

name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
first   True         0.000624       26.8156
second  True         0.001703       73.1844


Building task 'second': 100%|██████████| 2/2 [00:06<00:00,  3.42s/it]


The pickle format has important [security concerns](https://docs.python.org/3/library/pickle.html), **remember to only unpickle data you trust**.
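
To make the warning concrete, here is a minimal, harmless sketch using only the standard library. The `Payload` class is hypothetical; it shows that `pickle.loads` calls whatever callable the pickled data names, so a malicious file could name something far less innocent than `sorted`:

```python
import pickle


class Payload:
    """Hypothetical class whose pickled form names a callable to run."""

    def __reduce__(self):
        # On load, pickle calls sorted([3, 1, 2]) instead of rebuilding Payload
        return (sorted, ([3, 1, 2],))


data = pickle.dumps(Payload())

# Unpickling executes the callable chosen by whoever produced the bytes
result = pickle.loads(data)
```

Here `result` is the sorted list rather than a `Payload` instance, which shows that the data, not the reader, decided what code ran.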

## Custom serialization logic

We can also define our own serialization logic by using the `@serializer` and `@unserializer` decorators. Let's replicate what the pickle-based serializer/unserializer does as an example:

In [4]:
display_file('custom.py', symbols=['my_pickle_serializer', 'my_pickle_unserializer'])


```py
from pathlib import Path
import pickle

from ploomber.io import serializer, unserializer


@serializer()
def my_pickle_serializer(obj, product):
    Path(product).write_bytes(pickle.dumps(obj))


@unserializer()
def my_pickle_unserializer(product):
    return pickle.loads(Path(product).read_bytes())
```


A `@serializer` function must take two arguments: the object to serialize and the product object (taken from the task declaration). An `@unserializer` function must take a single argument (the product to unserialize) and return the unserialized object.
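
The two signatures can be exercised in isolation. The sketch below uses undecorated copies of the functions (so it runs without Ploomber) and a temporary path standing in for the product; the names `write_pickle` and `read_pickle` are illustrative, not part of any API:

```python
import pickle
import tempfile
from pathlib import Path


def write_pickle(obj, product):
    # Same shape as a serializer: (object, product) -> writes to disk
    Path(product).write_bytes(pickle.dumps(obj))


def read_pickle(product):
    # Same shape as an unserializer: (product) -> returns the object
    return pickle.loads(Path(product).read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    product = Path(tmp, 'one_dict')
    original = {'a': 1, 'b': 2}
    write_pickle(original, product)
    restored = read_pickle(product)
```

A round trip like this is a cheap way to check that a serializer/unserializer pair agree on the format before wiring them into a pipeline.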

Let's modify our original pipeline to use this serializer/unserializer:

In [5]:
display_file('custom.yaml')


```yaml
serializer: custom.my_pickle_serializer
unserializer: custom.my_pickle_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict
  
  - source: tasks.second
    product:
        another: output/another_dict
        final: output/final_dict
```


In [6]:
%%sh
ploomber build --entry-point custom.yaml --force

name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
first   True         0.000533       21.0008
second  True         0.002005       78.9992


Building task 'second': 100%|██████████| 2/2 [00:06<00:00,  3.39s/it]


## Custom serialization logic based on the product's extension

In many cases, there are more suitable formats than pickle. For example, we may want to store lists or dicts as JSON files and everything else using pickle. The `@serializer`/`@unserializer` decorators take a mapping as their first argument to dispatch to different functions depending on the product's extension. Let's see an example:

In [7]:
display_file('custom.py', symbols=['write_json', 'read_json', 'my_serializer', 'my_unserializer'])


```py
from pathlib import Path
import pickle
import json

from ploomber.io import serializer, unserializer


def write_json(obj, product):
    Path(product).write_text(json.dumps(obj))


def read_json(product):
    return json.loads(Path(product).read_text())


@serializer({'.json': write_json})
def my_serializer(obj, product):
    Path(product).write_bytes(pickle.dumps(obj))


@unserializer({'.json': read_json})
def my_unserializer(product):
    return pickle.loads(Path(product).read_bytes())
```
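
Conceptually, the dispatch amounts to a lookup on the product's suffix. The following is a simplified illustration of that idea in plain Python, not Ploomber's actual implementation; `WRITERS` and `serialize` are hypothetical names:

```python
import json
import pickle
import tempfile
from pathlib import Path


def write_json(obj, product):
    Path(product).write_text(json.dumps(obj))


def write_pickle(obj, product):
    Path(product).write_bytes(pickle.dumps(obj))


# Hypothetical stand-in for the mapping passed to the decorator
WRITERS = {'.json': write_json}


def serialize(obj, product):
    # Use the writer registered for the suffix, otherwise fall back to pickle
    writer = WRITERS.get(Path(product).suffix, write_pickle)
    writer(obj, product)


with tempfile.TemporaryDirectory() as tmp:
    json_product = Path(tmp, 'another_dict.json')
    plain_product = Path(tmp, 'one_dict')
    serialize({'a': 3, 'b': 2}, json_product)
    serialize({'a': 1, 'b': 2}, plain_product)
    json_text = json_product.read_text()
    plain_obj = pickle.loads(plain_product.read_bytes())
```

The product with a `.json` suffix ends up as readable text, while the product without an extension is stored as pickle bytes.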


Let's modify our example pipeline. The product in the first task does not have an extension (`output/one_dict`), so it will use the pickle-based logic. However, the products in the second task have a `.json` extension, so they will be saved as JSON files.

In [8]:
display_file('with-json.yaml')


```yaml
serializer: custom.my_serializer
unserializer: custom.my_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict
  
  - source: tasks.second
    product:
        another: output/another_dict.json
        final: output/final_dict.json
```


In [9]:
%%sh
ploomber build --entry-point with-json.yaml --force

name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
first   True         0.00055        24.7525
second  True         0.001672       75.2475


Building task 'second': 100%|██████████| 2/2 [00:06<00:00,  3.31s/it]


Let's print the `.json` files to verify they're not pickle files:

In [10]:
display_file('output/another_dict.json')


```json
{"a": 3, "b": 2}
```


In [11]:
display_file('output/final_dict.json')


```json
{"a": 100, "b": 200}
```


## Using a fallback format

Since it's common to have a fallback serialization format, the decorators have a `fallback` argument that, when enabled, uses the `pickle` module when the product's extension does not match any of the ones registered in the first argument.

The example works the same as the previous one, except we don't have to write our own pickle-based logic.

`fallback` can also take the [joblib](https://github.com/joblib/joblib) or [cloudpickle](https://github.com/cloudpipe/cloudpickle) values. They're similar to the pickle format but have some advantages. For example, `joblib` produces smaller files when the serialized object contains many NumPy arrays, whereas `cloudpickle` supports serialization of some objects that the `pickle` module doesn't. To use `joblib` or `cloudpickle`, the corresponding module must be installed.
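
As a small illustration of the kind of object the standard `pickle` module rejects, the sketch below tries to pickle a lambda. The `cloudpickle` part only runs if that package happens to be installed; the variable names are illustrative:

```python
import pickle

# A lambda has no importable name, so pickle cannot store it by reference
square = lambda x: x ** 2

try:
    pickle.dumps(square)
    pickle_failed = False
except (pickle.PicklingError, AttributeError):
    pickle_failed = True

# cloudpickle is optional: skip this part when it is not installed
try:
    import cloudpickle
except ImportError:
    cloudpickle = None

if cloudpickle is not None:
    # cloudpickle stores the function by value; plain pickle can load it back
    restored = pickle.loads(cloudpickle.dumps(square))
    restored_value = restored(4)
```

If your tasks return objects like this one, a `cloudpickle` fallback may be worth considering.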

In [12]:
display_file('custom.py', symbols=['my_fallback_serializer', 'my_fallback_unserializer'])


```py
from ploomber.io import serializer, unserializer


@serializer({'.json': write_json}, fallback=True)
def my_fallback_serializer(obj, product):
    pass


@unserializer({'.json': read_json}, fallback=True)
def my_fallback_unserializer(product):
    pass
```


In [13]:
display_file('fallback.yaml')


```yaml
serializer: custom.my_fallback_serializer
unserializer: custom.my_fallback_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict
  
  - source: tasks.second
    product:
        another: output/another_dict.json
        final: output/final_dict.json
```


In [14]:
%%sh
ploomber build --entry-point fallback.yaml --force

name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
first   True         0.000541       24.2275
second  True         0.001692       75.7725


Building task 'second': 100%|██████████| 2/2 [00:06<00:00,  3.22s/it]


Let's print the JSON files to verify their contents:

In [15]:
display_file('output/another_dict.json')


```json
{"a": 3, "b": 2}
```


In [16]:
display_file('output/final_dict.json')


```json
{"a": 100, "b": 200}
```


## Using default serializers

Ploomber comes with a few convenient serialization functions for writing more succinct serializers. We can request the use of such default serializers using the `defaults` argument, which takes a list of extensions:

In [17]:
display_file('custom.py', symbols=['my_defaults_serializer', 'my_defaults_unserializer'])


```py
from ploomber.io import serializer, unserializer


@serializer(fallback=True, defaults=['.json'])
def my_defaults_serializer(obj, product):
    pass


@unserializer(fallback=True, defaults=['.json'])
def my_defaults_unserializer(product):
    pass
```


Here we're asking the decorators to handle `.json` products with the built-in JSON logic and to use `pickle` for all other extensions, the same as in the previous examples, except that this time we don't have to pass the mapping argument to the decorators.

`defaults` supports:

1. `.json`: the returned object must be JSON-serializable (e.g., a list or a dictionary)
2. `.txt`: the returned object must be a string
3. `.csv`: the returned object must be a `pandas.DataFrame`
4. `.parquet`: the returned object must be a `pandas.DataFrame` and there should be a parquet library installed (such as `pyarrow`).
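
The requirements for the first two extensions can be checked with the standard library alone. The helper below is a hypothetical sketch (`fits_default` is not a Ploomber function) and leaves out `.csv` and `.parquet`, since those depend on `pandas`:

```python
import json


def fits_default(obj, extension):
    """Return True if obj meets the requirement listed for the extension."""
    if extension == '.json':
        # JSON-serializable means json.dumps accepts the object
        try:
            json.dumps(obj)
        except (TypeError, ValueError):
            return False
        return True
    if extension == '.txt':
        return isinstance(obj, str)
    raise ValueError(f'Extension not covered by this sketch: {extension!r}')


dict_ok = fits_default({'a': 1}, '.json')
set_ok = fits_default({1, 2}, '.json')    # sets are not JSON-serializable
text_ok = fits_default('hello', '.txt')
bytes_ok = fits_default(b'hello', '.txt')  # bytes are not str
```

A check like this can be handy in a unit test for a task whose product uses one of the default extensions.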

In [18]:
display_file('defaults.yaml')


```yaml
serializer: custom.my_defaults_serializer
unserializer: custom.my_defaults_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict
  
  - source: tasks.second
    product:
        another: output/another_dict.json
        final: output/final_dict.json
```


In [19]:
%%sh
ploomber build --entry-point defaults.yaml --force

name    Ran?      Elapsed (s)    Percentage
------  ------  -------------  ------------
first   True         0.000658       27.9524
second  True         0.001696       72.0476


Building task 'second': 100%|██████████| 2/2 [00:06<00:00,  3.28s/it]


Let's print the JSON files to verify their contents:

In [20]:
display_file('output/another_dict.json')


```json
{"a": 3, "b": 2}
```


In [21]:
display_file('output/final_dict.json')


```json
{"a": 100, "b": 200}
```
