<a href="https://colab.research.google.com/github/munich-ml/file_IO/blob/master/JSON_datatypes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Motivation

**JSON** is a very popular format for storing structured data in (.json) files. IO libraries for JSON handling exist for most programming languages. In Python, `json` is the standard module, providing write (`json.dump()`) and read (`json.load()`) functionality.

A limitation that I regularly encounter is that only basic Python datatypes are supported: ``int``, ``float``, ``str``, ``bool``, ``None``, ``list``, (``tuple``), ``dict``. I put ``tuple`` in parentheses because tuples are converted to ``list``. Thus I consider them only partially supported, but that doesn't bother me too much. More painful is the lack of powerful but common datatypes like `datetime.datetime` or `pandas.DataFrame`.
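
Both limitations can be reproduced with the standard library alone. The short sketch below (variable names are illustrative only) shows a `tuple` coming back as a `list` and a `datetime.datetime` being rejected with a `TypeError`:

```python
import datetime as dt
import json

# A tuple survives the round trip only as a list
restored = json.loads(json.dumps({"point": (1, 2)}))
print(restored)  # {'point': [1, 2]}

# A datetime is not serializable by the default encoder
try:
    json.dumps({"now": dt.datetime(2020, 6, 18)})
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)  # TypeError
```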

Therefore, a custom JSON encoder and decoder pair is implemented here that supports:
- ``pandas.DataFrame`` and ``pandas.Series``,
- ``numpy.ndarray`` (NumPy arrays),
- ``datetime.datetime`` and ``datetime.timedelta``

More datatypes can easily be added.
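
As a rough sketch of what adding a datatype could look like, the self-contained example below supports Python's built-in `set`. It uses the same convention as the implementation that follows: the object is wrapped in a single-key `dict`. The names `SetEnc`, `set_hook` and the key `@set` are hypothetical and not part of this notebook's encoder/decoder.

```python
import json

class SetEnc(json.JSONEncoder):
    """Hypothetical encoder that adds support for the built-in set."""
    def default(self, obj):
        if isinstance(obj, set):
            # Wrap the elements in a single-key dict so the decoder can spot them
            return {"@set": list(obj)}
        return super().default(obj)

def set_hook(dct):
    """Hypothetical object_hook that restores sets from their '@set' wrapper."""
    if len(dct) == 1 and "@set" in dct:
        return set(dct["@set"])
    return dct

original = {"tags": {"a", "b", "c"}}
restored = json.loads(json.dumps(original, cls=SetEnc), object_hook=set_hook)
print(restored == original)  # True
```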

# Implementation

We import `json`, of course, as well as `numpy`, `pandas` and `datetime`, because those modules contain the datatypes that we want to support.

In [1]:
import numpy as np
import pandas as pd
import datetime as dt
import json

### Custom JSON encoder

According to https://docs.python.org/3/library/json.html we implement a **custom encoder `JsonEnc`** by subclassing the **default encoder `json.JSONEncoder`**. 

Each **additional datatype** is represented using JSON's **standard datatypes**. Unique keys allow the decoder to reconstruct the original datatype. The unique keys are:
- ``"@DataFrame"``
- ``"@Series"``
- ``"@np.array"``
- ``"@datetime"``
- ``"@timedelta"`` 

Let's look at ``datetime.timedelta`` encoding as example:
```python
if isinstance(obj, dt.timedelta):
    return {"@timedelta": obj.total_seconds()}
```
A `datetime.timedelta` object is encoded into a `dict` holding exactly 1 item with 
- the keyword `str` `@timedelta` as dictionary key, and 
- the `float` `total_seconds` as dictionary value.  
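
As a minimal, self-contained sketch of this single-key convention, the encoder below handles only `datetime.timedelta`. The class name `TimedeltaEnc` is hypothetical and stands in for the full `JsonEnc`:

```python
import datetime as dt
import json

class TimedeltaEnc(json.JSONEncoder):
    """Hypothetical encoder that supports only datetime.timedelta."""
    def default(self, obj):
        if isinstance(obj, dt.timedelta):
            return {"@timedelta": obj.total_seconds()}
        # Let the base class raise TypeError for unsupported types
        return super().default(obj)

encoded = json.dumps({"delay": dt.timedelta(days=1, seconds=100)},
                     cls=TimedeltaEnc)
print(encoded)  # {"delay": {"@timedelta": 86500.0}}
```

Note that `default()` is only called for objects the standard encoder cannot serialize, so the regular JSON datatypes are unaffected.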

In [2]:
class JsonEnc(json.JSONEncoder):
    """
    Extends the standard JSONEncoder to support additional datatypes.
    
    Keyword strings as dict keys are used to identify instances of the 
    additional types.
    
    Additional datatype  | keyword
    ---------------------|------------
    pandas DataFrame     | @DataFrame
    pandas Series        | @Series
    numpy array          | @np.array
    datetime.datetime    | @datetime
    datetime.timedelta   | @timedelta
    
    Of course, the regular JSON datatypes are supported, too:
        int, float, str, bool, None, list, (tuple), dict
        
    Example usage:
        # Encode data object to json_str
        json_str = json.dumps(data, cls=JsonEnc)
        
        # Decode json_str to a data object
        data_copy = json.loads(json_str, cls=JsonDec)
        
    """
    def default(self, obj):
        if isinstance(obj, pd.DataFrame):
            return {"@DataFrame": {"columns": list(obj.columns),
                                   "index": list(obj.index),
                                   "data": obj.values.tolist()}}
        
        if isinstance(obj, pd.Series):
            return {"@Series": {"name": obj.name,
                                "index": list(obj.index),
                                "data": obj.values.tolist()}}
        
        if isinstance(obj, np.ndarray):
            return {"@np.array": obj.tolist()}
        
        if isinstance(obj, dt.datetime):
            return {"@datetime": obj.strftime('%Y-%m-%d %H:%M:%S.%f')}

        if isinstance(obj, dt.timedelta):
            return {"@timedelta": obj.total_seconds()}

        return json.JSONEncoder.default(self, obj)

### Custom JSON decoder

The **custom decoder `JsonDec`** is implemented by subclassing the **default decoder `json.JSONDecoder`**. 

The custom part of the decoder **JsonDec** is triggered by the **keywords** injected by the custom encoder **JsonEnc**.

Again, let's look at the `datetime.timedelta` example:
```python
if len(dct) == 1:
    if "@timedelta" in dct:
        return dt.timedelta(seconds=dct["@timedelta"])
return dct
```
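A `dict` holding exactly 1 item is decoded back into a `datetime.timedelta` object if 
- its dictionary key is the keyword `str` `@timedelta`, in which case 
- its dictionary value, the `float` `total_seconds`, is passed to `dt.timedelta(seconds=...)`.

Any other `dict` is returned unchanged.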
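
A self-contained sketch of the decoding direction for the `@timedelta` case is shown below. The function name `timedelta_hook` is hypothetical, and for brevity the hook is passed directly via `object_hook` instead of through a decoder subclass:

```python
import datetime as dt
import json

def timedelta_hook(dct):
    """Hypothetical object_hook: turn {"@timedelta": seconds} into a timedelta."""
    if len(dct) == 1 and "@timedelta" in dct:
        return dt.timedelta(seconds=dct["@timedelta"])
    return dct  # every other dict is passed through unchanged

json_str = '{"delay": {"@timedelta": 86500.0}, "plain": {"a": 1}}'
decoded = json.loads(json_str, object_hook=timedelta_hook)
print(decoded["delay"])  # 1 day, 0:01:40
print(decoded["plain"])  # {'a': 1}
```

The hook is called for every decoded JSON object, innermost first, so nested keyword dicts are already converted when the enclosing `dict` is processed.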

In [3]:
class JsonDec(json.JSONDecoder):
    """
    Extends the standard JSONDecoder to support additional datatypes.
    
    Additional types are recognized by dict key keywords, which are injected 
    by the JsonEnc.
    
    Additional datatype  | keyword
    ---------------------|------------
    pandas DataFrame     | @DataFrame
    pandas Series        | @Series
    numpy array          | @np.array
    datetime.datetime    | @datetime
    datetime.timedelta   | @timedelta
    
    Of course, the regular JSON datatypes are supported, too:
        int, float, str, bool, None, list, (tuple), dict
        
    Example usage:
        # Encode data object to json_str
        json_str = json.dumps(data, cls=JsonEnc)
        
        # Decode json_str to a data object
        data_copy = json.loads(json_str, cls=JsonDec)
        
    """
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=JsonDec.custom_hook, *args, **kwargs)
    
    @staticmethod
    def custom_hook(dct):
        if len(dct) == 1:  # additional datatypes are encoded as dicts of len 1
            if "@np.array" in dct:
                return np.array(dct["@np.array"])
            
            if "@DataFrame" in dct:
                return pd.DataFrame(data=dct["@DataFrame"]["data"],
                                    columns=dct["@DataFrame"]["columns"],
                                    index=dct["@DataFrame"]["index"])
            
            if "@Series" in dct:
                return pd.Series(data=dct["@Series"]["data"],
                                 name=dct["@Series"]["name"],
                                 index=dct["@Series"]["index"])
            
            if "@datetime" in dct:
                return dt.datetime.strptime(dct["@datetime"],
                                            '%Y-%m-%d %H:%M:%S.%f')
            
            if "@timedelta" in dct:
                return dt.timedelta(seconds=dct["@timedelta"])
            
        return dct

# Verification

### Create test data

First, we write a function `create_example_container` that returns a test dictionary containing all additional datatypes supported by the custom JSON encoder/decoder.

In [4]:
def create_example_container():    
    """
    Returns an example container as dict with all supported additional
    datatypes.
    """
    nCols, nRows = 3, 4
    df1 = pd.DataFrame(np.random.randint(0, high=10, size=(nRows, nCols)),
                        columns=["col"+str(i) for i in range(nCols)],
                        index=["idx"+str(i) for i in range(nRows)])
    
    df2 = pd.DataFrame({"dates": [dt.datetime(2020, 6, 18), 
                                  dt.datetime(2020, 6, 22, 1, 2, 3)],
                        "values": [42, True]})
    df2["timedeltas"] = dt.datetime.now() - df2["dates"]
    
    return {"regular_json": ["string", 1, 2.33, None, False],
            "some_datetime": dt.datetime.now(),
            "some_timedelta": dt.timedelta(days=1, seconds=100),
            "some_np_array": np.eye(3),
            "some_DateFrame": df1,
            "DataFrame_with_dt": df2,
            "some_Series": df2["values"]}

# Example usage
# 1. create an example dict with all additional datatypes
data = create_example_container()
data

{'DataFrame_with_dt':                 dates values               timedeltas
 0 2020-06-18 00:00:00     42   0 days 18:17:49.123795
 1 2020-06-22 01:02:03   True -4 days +17:15:46.123795,
 'regular_json': ['string', 1, 2.33, None, False],
 'some_DateFrame':       col0  col1  col2
 idx0     6     2     4
 idx1     4     1     9
 idx2     0     6     5
 idx3     1     8     6,
 'some_Series': 0      42
 1    True
 Name: values, dtype: object,
 'some_datetime': datetime.datetime(2020, 6, 18, 18, 17, 49, 129285),
 'some_np_array': array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]]),
 'some_timedelta': datetime.timedelta(1, 100)}

OK, the `data` object contains at least one instance of each additionally supported datatype.

### Save JSON file

For saving the `data` object to a file, we can use the regular `json.dump` / `json.dumps` functions.

The **custom encoder `JsonEnc`** is handed over to the `cls` keyword argument. The docstring says:

> *To use a custom ``JSONEncoder`` subclass (e.g. one that overrides the ``.default()`` method to serialize additional types), specify it with the ``cls`` kwarg; otherwise ``JSONEncoder`` is used.*


In [5]:
with open("data.json", "w") as f:
    json.dump(data, f, cls=JsonEnc)

Let's double-check what the actual JSON string looks like, using `json.dumps`.

For the sake of pretty-printing I use `indent=4`, which I don't recommend when dumping to files. The file size would be significantly larger (up to factor 10 for large integer tables) compared to the *one-line JSON* produced by `indent=None`.

In [6]:
print(json.dumps(data, cls=JsonEnc, indent=4))

{
    "regular_json": [
        "string",
        1,
        2.33,
        null,
        false
    ],
    "some_datetime": {
        "@datetime": "2020-06-18 18:17:49.129285"
    },
    "some_timedelta": {
        "@timedelta": 86500.0
    },
    "some_np_array": {
        "@np.array": [
            [
                1.0,
                0.0,
                0.0
            ],
            [
                0.0,
                1.0,
                0.0
            ],
            [
                0.0,
                0.0,
                1.0
            ]
        ]
    },
    "some_DateFrame": {
        "@DataFrame": {
            "columns": [
                "col0",
                "col1",
                "col2"
            ],
            "index": [
                "idx0",
                "idx1",
                "idx2",
                "idx3"
            ],
            "data": [
                [
                    6,
                    2,
                    4
                ],
   

One can see the custom dict keys (e.g. `@datetime`, `@DataFrame`).
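
To get a feeling for the size penalty of `indent=4` mentioned above, the sketch below compares both variants for a made-up integer table. The table and variable names are assumptions for illustration, and the actual ratio depends on the data:

```python
import json

# Hypothetical 100 x 10 integer table
table = [[i * j for j in range(10)] for i in range(100)]

compact = json.dumps(table)           # indent=None: everything on one line
pretty = json.dumps(table, indent=4)  # one element per line, indented

print(len(compact), len(pretty))
print(len(pretty) > len(compact))  # True
```

Both strings decode to the same data; only the whitespace differs.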

### Load JSON file

In [7]:
json.load?

Again, the regular `json.load` / `json.loads` functions can be used, but with the **custom decoder `JsonDec`** handed over via the `cls` keyword argument. The docstring says:

> *To use a custom ``JSONDecoder`` subclass, specify it with the ``cls`` kwarg; otherwise ``JSONDecoder`` is used.*


In [8]:
with open("data.json", "r") as f:
    data_copy = json.load(f, cls=JsonDec)

### Both containers equal?

Unfortunately, we can't test for equality the simple way: `data == data_copy`.
This is because comparing arrays or DataFrames with `==` works element-wise, and the truth value of the result is ambiguous unless it is reduced, e.g. with `.all()`.

For the sake of simplicity, let's check the printouts visually ...

In [9]:
data

{'DataFrame_with_dt':                 dates values               timedeltas
 0 2020-06-18 00:00:00     42   0 days 18:17:49.123795
 1 2020-06-22 01:02:03   True -4 days +17:15:46.123795,
 'regular_json': ['string', 1, 2.33, None, False],
 'some_DateFrame':       col0  col1  col2
 idx0     6     2     4
 idx1     4     1     9
 idx2     0     6     5
 idx3     1     8     6,
 'some_Series': 0      42
 1    True
 Name: values, dtype: object,
 'some_datetime': datetime.datetime(2020, 6, 18, 18, 17, 49, 129285),
 'some_np_array': array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]]),
 'some_timedelta': datetime.timedelta(1, 100)}

In [10]:
data_copy

{'DataFrame_with_dt':                 dates values               timedeltas
 0 2020-06-18 00:00:00     42   0 days 18:17:49.123795
 1 2020-06-22 01:02:03   True -4 days +17:15:46.123795,
 'regular_json': ['string', 1, 2.33, None, False],
 'some_DateFrame':       col0  col1  col2
 idx0     6     2     4
 idx1     4     1     9
 idx2     0     6     5
 idx3     1     8     6,
 'some_Series': 0      42
 1    True
 Name: values, dtype: object,
 'some_datetime': datetime.datetime(2020, 6, 18, 18, 17, 49, 129285),
 'some_np_array': array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]]),
 'some_timedelta': datetime.timedelta(1, 100)}

Although the printout of those `data` and `data_copy` dictionaries isn't pretty, one can see that both are identical.
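
A programmatic check could complement the visual comparison. The helper below is a hypothetical sketch (the name `containers_equal` is not part of this notebook) for flat dictionaries like the one used here. It relies on `DataFrame.equals` / `Series.equals` and `numpy.array_equal`. Since `equals` is strict about dtypes, whether it passes for a given round trip depends on the dtypes pandas infers when rebuilding the objects.

```python
import numpy as np
import pandas as pd

def containers_equal(a, b):
    """Hypothetical helper: compare two flat dicts that may hold arrays,
    DataFrames or Series as values."""
    if a.keys() != b.keys():
        return False
    for key in a:
        x, y = a[key], b[key]
        if type(x) is not type(y):
            return False
        if isinstance(x, (pd.DataFrame, pd.Series)):
            same = x.equals(y)           # same shape, values and dtypes
        elif isinstance(x, np.ndarray):
            same = np.array_equal(x, y)  # same shape and values
        else:
            same = x == y                # plain Python comparison
        if not same:
            return False
    return True

first = {"arr": np.eye(2), "df": pd.DataFrame({"c": [1, 2]}), "n": 1}
second = {"arr": np.eye(2), "df": pd.DataFrame({"c": [1, 2]}), "n": 1}
third = {"arr": np.zeros((2, 2)), "df": pd.DataFrame({"c": [1, 2]}), "n": 1}

print(containers_equal(first, second))  # True
print(containers_equal(first, third))   # False
```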

# Conclusion

``JsonEnc`` and ``JsonDec`` work just fine.