<a href="https://colab.research.google.com/github/munich-ml/file_IO/blob/master/JSON_datatypes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Motivation

# Implementation

In [31]:
import numpy as np
import pandas as pd
import datetime as dt
import json

In [3]:
class JsonEnc(json.JSONEncoder):
    """
    Extends the standard JSONEncoder to support additional datatypes.
    
    Keyword strings used as dict keys identify instances of the
    additional types.
    
    Additional datatype  | keyword
    ---------------------|------------
    pandas DataFrame     | @DataFrame
    pandas Series        | @Series
    numpy array          | @np.array
    datetime.datetime    | @datetime
    datetime.timedelta   | @timedelta
    
    Of course, the regular JSON datatypes are supported, too:
        int, float, str, bool, None, list, (tuple), dict
        
    Example usage:
        # Encode data object to json_str
        json_str = json.dumps(data, cls=JsonEnc)
        
        # Decode json_str to a data object
        data_copy = json.loads(json_str, cls=JsonDec)
        
    """
    def default(self, obj):
        if isinstance(obj, pd.DataFrame):
            return {"@DataFrame": {"columns": list(obj.columns),
                                   "index": list(obj.index),
                                   "data": obj.values.tolist()}}
        
        if isinstance(obj, pd.Series):
            return {"@Series": {"name": obj.name,
                                "index": list(obj.index),
                                "data": obj.values.tolist()}}
        
        if isinstance(obj, np.ndarray):
            return {"@np.array": obj.tolist()}
        
        if isinstance(obj, dt.datetime):
            return {"@datetime": obj.strftime('%Y-%m-%d %H:%M:%S.%f')}

        if isinstance(obj, dt.timedelta):
            return {"@timedelta": obj.total_seconds()}

        return json.JSONEncoder.default(self, obj)
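
The tagging mechanism can also be tried in isolation. The following minimal sketch uses only the standard library and a hypothetical `TinyEnc` class (not part of this notebook, reduced to `timedelta` for brevity). It illustrates that `default()` is reached only for objects the standard encoder cannot serialize on its own:

```python
import datetime as dt
import json


class TinyEnc(json.JSONEncoder):
    """Hypothetical, reduced encoder: tags timedelta objects only."""

    def default(self, obj):
        # default() is only called for objects json cannot serialize itself
        if isinstance(obj, dt.timedelta):
            return {"@timedelta": obj.total_seconds()}
        # Anything else is passed on, which raises a TypeError
        return super().default(obj)


payload = {"count": 3, "delay": dt.timedelta(days=1, seconds=100)}
json_str = json.dumps(payload, cls=TinyEnc)
print(json_str)  # {"count": 3, "delay": {"@timedelta": 86500.0}}
```

The regular `int` value passes through untouched, while the `timedelta` is wrapped in a single-key dict that a decoder can recognize later.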

In [4]:
class JsonDec(json.JSONDecoder):
    """
    Extends the standard JSONDecoder to support additional datatypes.
    
    Additional types are recognized by dict key keywords, which are injected 
    by the JsonEnc.
    
    Additional datatype  | keyword
    ---------------------|------------
    pandas DataFrame     | @DataFrame
    pandas Series        | @Series
    numpy array          | @np.array
    datetime.datetime    | @datetime
    datetime.timedelta   | @timedelta
    
    Of course, the regular JSON datatypes are supported, too:
        int, float, str, bool, None, list, (tuple), dict
        
    Example usage:
        # Encode data object to json_str
        json_str = json.dumps(data, cls=JsonEnc)
        
        # Decode json_str to a data object
        data_copy = json.loads(json_str, cls=JsonDec)
        
    """
    def __init__(self, *args, **kwargs):
        super().__init__(object_hook=JsonDec.custom_hook, *args, **kwargs)
    
    @staticmethod
    def custom_hook(dct):
        # Additional datatypes are encoded as dicts with exactly one key
        if len(dct) == 1:
            if "@np.array" in dct:
                return np.array(dct["@np.array"])
            
            if "@DataFrame" in dct:
                return pd.DataFrame(data=dct["@DataFrame"]["data"],
                                    columns=dct["@DataFrame"]["columns"],
                                    index=dct["@DataFrame"]["index"])
            
            if "@Series" in dct:
                return pd.Series(data=dct["@Series"]["data"],
                                 name=dct["@Series"]["name"],
                                 index=dct["@Series"]["index"])
            
            if "@datetime" in dct:
                return dt.datetime.strptime(dct["@datetime"],
                                            '%Y-%m-%d %H:%M:%S.%f')
            
            if "@timedelta" in dct:
                return dt.timedelta(seconds=dct["@timedelta"])
            
        return dct
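
The `object_hook` is called for every decoded JSON object, innermost first, so tagged dicts nested inside lists or other dicts are converted as well. A minimal standard-library sketch, assuming a hypothetical `tiny_hook` function (not part of this notebook) that handles only the two `datetime` keywords:

```python
import datetime as dt
import json

DT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def tiny_hook(dct):
    """Hypothetical, reduced hook: restores datetime and timedelta only."""
    if len(dct) == 1:
        if "@datetime" in dct:
            return dt.datetime.strptime(dct["@datetime"], DT_FORMAT)
        if "@timedelta" in dct:
            return dt.timedelta(seconds=dct["@timedelta"])
    return dct  # ordinary dicts are returned unchanged


json_str = ('{"start": {"@datetime": "2020-06-18 15:05:16.233703"}, '
            '"events": [{"delay": {"@timedelta": 86500.0}}]}')
decoded = json.loads(json_str, object_hook=tiny_hook)

print(repr(decoded["start"]))
print(repr(decoded["events"][0]["delay"]))
```

One consequence of this design is that any ordinary single-key dict whose key happens to be one of the keywords is converted too, so the keywords should not be used as regular dict keys in the data.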

# Verification

### Create test data

First, we write a function `create_example_container` that returns a test dictionary containing all additional datatypes supported by the custom JSON encoder/decoder.

In [32]:
def create_example_container():    
    """
    Returns an example container as dict with all supported additional
    datatypes.
    """
    nCols, nRows = 3, 4
    df1 = pd.DataFrame(np.random.randint(0, high=10, size=(nRows, nCols)),
                        columns=["col"+str(i) for i in range(nCols)],
                        index=["idx"+str(i) for i in range(nRows)])
    
    df2 = pd.DataFrame({"dates": [dt.datetime(2020, 6, 18), 
                                  dt.datetime(2020, 6, 22, 1, 2, 3)],
                        "values": [42, True]})
    df2["timedeltas"] = dt.datetime.now() - df2["dates"]
    
    return {"regular_json": ["string", 1, 2.33, None, False],
            "some_datetime": dt.datetime.now(),
            "some_timedelta": dt.timedelta(days=1, seconds=100),
            "some_np_array": np.eye(3),
            "some_DateFrame": df1,
            "DataFrame_with_dt": df2,
            "some_Series": df2["values"]}

# Example usage
# 1. create an example dict with all additional datatypes
data = create_example_container()
data

{'DataFrame_with_dt':                 dates values               timedeltas
 0 2020-06-18 00:00:00     42   0 days 15:05:16.231744
 1 2020-06-22 01:02:03   True -4 days +14:03:13.231744,
 'regular_json': ['string', 1, 2.33, None, False],
 'some_DateFrame':       col0  col1  col2
 idx0     6     6     5
 idx1     6     6     8
 idx2     8     8     0
 idx3     1     6     9,
 'some_Series': 0      42
 1    True
 Name: values, dtype: object,
 'some_datetime': datetime.datetime(2020, 6, 18, 15, 5, 16, 233703),
 'some_np_array': array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]]),
 'some_timedelta': datetime.timedelta(1, 100)}

### Save JSON file

To save the `data` object to a file, we can use the regular `json.dump` / `json.dumps` functions.

The **custom encoder `JsonEnc`** is passed via the `cls` keyword argument. The `json.dump` docstring says:

> *To use a custom ``JSONEncoder`` subclass (e.g. one that overrides the ``.default()`` method to serialize additional types), specify it with the ``cls`` kwarg; otherwise ``JSONEncoder`` is used.*


In [35]:
with open("data.json", "w") as f:
    json.dump(data, f, cls=JsonEnc)

Let's double-check what the actual JSON string looks like, using `json.dumps`.

For the sake of pretty-printing I use `indent=4`, which I don't recommend when dumping to files: the file would be significantly larger (up to a factor of 10 for large integer tables) than the *one-liner JSON* produced with `indent=None`.

In [36]:
print(json.dumps(data, cls=JsonEnc, indent=4))

{
    "regular_json": [
        "string",
        1,
        2.33,
        null,
        false
    ],
    "some_datetime": {
        "@datetime": "2020-06-18 15:05:16.233703"
    },
    "some_timedelta": {
        "@timedelta": 86500.0
    },
    "some_np_array": {
        "@np.array": [
            [
                1.0,
                0.0,
                0.0
            ],
            [
                0.0,
                1.0,
                0.0
            ],
            [
                0.0,
                0.0,
                1.0
            ]
        ]
    },
    "some_DateFrame": {
        "@DataFrame": {
            "columns": [
                "col0",
                "col1",
                "col2"
            ],
            "index": [
                "idx0",
                "idx1",
                "idx2",
                "idx3"
            ],
            "data": [
                [
                    6,
                    6,
                    5
                ],
   

One can see the custom dict keys (e.g. `@datetime`, `@DataFrame`) that mark the additional datatypes.
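
To get a feel for the size difference between indented and compact output mentioned above, the two can be measured directly. This is a rough sketch on a made-up integer table (the `table` variable is an assumption, not data from this notebook); the actual ratio depends on the data and its nesting depth:

```python
import json

# Hypothetical test table: 100 rows x 10 columns of small integers
table = [[(row * col) % 10 for col in range(10)] for row in range(100)]

compact = json.dumps(table)                         # indent=None is the default
pretty = json.dumps(table, indent=4)                # one value per line
minimal = json.dumps(table, separators=(",", ":"))  # no spaces at all

print("compact:", len(compact), "characters")
print("pretty: ", len(pretty), "characters")
print("minimal:", len(minimal), "characters")
```

All three strings decode to the same list, so the choice only affects readability and size.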

### Load JSON file

In [None]:
# 3. Read json file and decode using the custom JsonDec
with open("data.json", "r") as f:
    data_copy = json.load(f, cls=JsonDec)

### Compare the 2 data containers

In [25]:
data

{'DataFrame_with_dt':                 dates values               timedeltas
 0 2020-06-18 00:00:00     42   0 days 14:49:14.301799
 1 2020-06-22 01:02:03   True -4 days +13:47:11.301799,
 'regular_json': ['string', 1, 2.33, None, False],
 'some_DateFrame':       col0  col1  col2
 idx0     2     1     2
 idx1     0     8     5
 idx2     6     3     5
 idx3     9     0     8,
 'some_Series': 0      42
 1    True
 Name: values, dtype: object,
 'some_datetime': datetime.datetime(2020, 6, 18, 14, 49, 14, 303031),
 'some_np_array': array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]]),
 'some_timedelta': datetime.timedelta(1, 100)}

In [26]:
data_copy

{'DataFrame_with_dt':                 dates values               timedeltas
 0 2020-06-18 00:00:00     42   0 days 14:49:14.301799
 1 2020-06-22 01:02:03   True -4 days +13:47:11.301799,
 'regular_json': ['string', 1, 2.33, None, False],
 'some_DateFrame':       col0  col1  col2
 idx0     2     1     2
 idx1     0     8     5
 idx2     6     3     5
 idx3     9     0     8,
 'some_Series': 0      42
 1    True
 Name: values, dtype: object,
 'some_datetime': datetime.datetime(2020, 6, 18, 14, 49, 14, 303031),
 'some_np_array': array([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]]),
 'some_timedelta': datetime.timedelta(1, 100)}

Although the printouts of the `data` and `data_copy` dictionaries aren't pretty, one can see that both are identical.
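
A visual check does not scale to larger containers, so a programmatic comparison may be preferable. The sketch below assumes a hypothetical helper `containers_equal` (not part of this notebook) for flat dicts like the example container. It relies on `DataFrame.equals` / `Series.equals` and `np.array_equal`, and it is deliberately strict about types:

```python
import datetime as dt

import numpy as np
import pandas as pd


def containers_equal(a, b):
    """Hypothetical helper: compare two flat dicts value by value."""
    if a.keys() != b.keys():
        return False
    for key, left in a.items():
        right = b[key]
        if type(left) is not type(right):
            return False
        if isinstance(left, (pd.DataFrame, pd.Series)):
            # .equals() requires equal values, labels and column dtypes
            if not left.equals(right):
                return False
        elif isinstance(left, np.ndarray):
            if not np.array_equal(left, right):
                return False
        elif left != right:
            return False
    return True


# Small self-made containers (not the notebook's data) to exercise the helper
first = {"when": dt.datetime(2020, 6, 18),
         "arr": np.eye(2),
         "frame": pd.DataFrame({"a": [1, 2]})}
second = {"when": dt.datetime(2020, 6, 18),
          "arr": np.eye(2),
          "frame": pd.DataFrame({"a": [1, 2]})}
third = dict(second, arr=np.zeros((2, 2)))

print(containers_equal(first, second))  # True
print(containers_equal(first, third))   # False
```

With the objects from this notebook the call would be `containers_equal(data, data_copy)`. Because `.equals()` is strict about dtypes, a round trip that changes a column dtype would be reported as a difference even if the printed values look the same.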

# Conclusion