# Keeping clean JSON in Python with colander

Despite being derived from JavaScript, the JSON data format has quickly spread to other languages, and Python is no exception. In Python a JSON object maps naturally onto a dictionary, and this simple key-value mapping covers a wide range of use cases: data storage, configuration and, of course, data transfer!

Let’s assume that one of your components needs to send the details of a person’s order to other components.

In [1]:
message = {
    "name": "Paul O'Brady",
    "email": "paul.obrady@example.com",
    "gender": "M",
    "date_of_birth": "1990-01-01",
    "coffee_club_member": True,
    "order": {
        "order_id": 1111,
        "items": [
            {
                "product_id": 12345,
                "quantity": 1,
                "unit_price": 4.65,
            },
            {
                "product_id": 54321,
                "quantity": 2,
                "unit_price": 8.19,
            },
            {
                "product_id": 112233,
                "quantity": 1,
                "unit_price": None,
            },
        ],
    }
}

## 1. Default serialization

To make sure that the information is understood by the receiver, we need to serialize the dictionary to the JSON open standard format. This operation converts each value of the dictionary to a JSON data type and places the whole object into a string. This way it can be decoded by any receiver, regardless of the language it is written in.

In [2]:
import json

message_serialized = json.dumps(message)
message_serialized

'{"name": "Paul O\'Brady", "email": "paul.obrady@example.com", "gender": "M", "date_of_birth": "1990-01-01", "coffee_club_member": true, "order": {"order_id": 1111, "items": [{"product_id": 12345, "quantity": 1, "unit_price": 4.65}, {"product_id": 54321, "quantity": 2, "unit_price": 8.19}, {"product_id": 112233, "quantity": 1, "unit_price": null}]}}'

A few things are worth noting here:
* the apostrophe in the `name` appears escaped, but only because Python displays the string in single quotes; the JSON text itself contains a plain apostrophe
* the boolean `coffee_club_member` switched from `True` (Python) to `true` (JSON)
* the last `unit_price` switched from `None` (Python) to `null` (JSON)
* the whole object is now a string

In [3]:
message_deserialized = json.loads(message_serialized)
message_deserialized

{'name': "Paul O'Brady",
 'email': 'paul.obrady@example.com',
 'gender': 'M',
 'date_of_birth': '1990-01-01',
 'coffee_club_member': True,
 'order': {'order_id': 1111,
  'items': [{'product_id': 12345, 'quantity': 1, 'unit_price': 4.65},
   {'product_id': 54321, 'quantity': 2, 'unit_price': 8.19},
   {'product_id': 112233, 'quantity': 1, 'unit_price': None}]}}

And just like that we get back what we originally had.

Regardless of what you want to do with this serialized message (save it in a database, send it via a Pub/Sub messaging service…), it is clear that the sender and receiver components need to agree on a shared data schema.

This default serialization neither defines nor validates the data contained inside the JSON.

**What would happen if some keys were missing, or if some values were incorrect?**
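To make the question concrete, here is a minimal sketch using only the standard library. The `bad_message` payload is a made-up example (it is not part of the data above): the `name` key is missing, the email is not an address and the quantity is text.

```python
import json

# Hypothetical malformed payload, for illustration only.
bad_message = {
    "email": "not-an-email",
    "order": {
        "order_id": 1111,
        "items": [{"product_id": 12345, "quantity": "two"}],
    },
}

# json only cares that the values are JSON-compatible types,
# so the round trip succeeds and the bad data reaches the receiver.
round_tripped = json.loads(json.dumps(bad_message))

print(round_tripped == bad_message)
print("name" in round_tripped)
```

Nothing in this round trip raises an error, so any check on required keys or value ranges has to come from somewhere else.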

## 2. Using [colander](https://docs.pylonsproject.org/projects/colander/en/latest/)

In the case where the data transfer occurs between two Python processes, [colander](https://docs.pylonsproject.org/projects/colander/en/latest/) is a better serialization library, as it checks the data against a schema by performing a key-by-key validation and then converts each value into a string. Similarly, colander deserialization checks each serialized string and converts it back to the defined type.

We start by defining each structure as a child class of `MappingSchema`. Each key inside the structure is defined as a `SchemaNode`, which can be of different types, all documented [here](https://docs.pylonsproject.org/projects/colander/en/latest/api.html#types). On top of defining the data types, we can add [validators](https://docs.pylonsproject.org/projects/colander/en/latest/api.html#validators) and preparers. For lists, we inherit from `SequenceSchema`.
Additionally, if a key is missing, we can give it a default value during serialization by specifying `default` or a default value during deserialization with the `missing` argument.

![](colander.png)

In [4]:
%%capture
!pip install colander

In [5]:
# convert date_of_birth from a string to a datetime
# (%m is the month; %M would be parsed as minutes)
from datetime import datetime
message.update({"date_of_birth": datetime.strptime(message['date_of_birth'], '%Y-%m-%d')})

In [6]:
import colander


class Item(colander.MappingSchema):
    product_id = colander.SchemaNode(colander.Int())
    quantity = colander.SchemaNode(colander.Int())
    unit_price = colander.SchemaNode(colander.Float(), missing=None)

    
class Items(colander.SequenceSchema):
    item = Item()

    
class Order(colander.MappingSchema):
    order_id = colander.SchemaNode(colander.Int())
    items = Items()

    
class Message(colander.MappingSchema):
    name = colander.SchemaNode(colander.Str())
    email = colander.SchemaNode(colander.Str(), validator=colander.Email())    
    gender = colander.SchemaNode(colander.Str(), validator=colander.OneOf(['M', 'F', '']), default='', missing='')
    date_of_birth = colander.SchemaNode(colander.Date(format='%d %b %Y'))
    coffee_club_member = colander.SchemaNode(colander.Bool(), default=False, missing=False)
    percent_discount = colander.SchemaNode(colander.Float(), validator=colander.Range(min=0, max=1), missing=0)
    order = Order()


json_serialized = Message().serialize(message)
json_serialized

{'name': "Paul O'Brady",
 'email': 'paul.obrady@example.com',
 'gender': 'M',
 'date_of_birth': '01 Jan 1990',
 'coffee_club_member': 'true',
 'percent_discount': <colander.null>,
 'order': {'order_id': '1111',
  'items': [{'product_id': '12345', 'quantity': '1', 'unit_price': '4.65'},
   {'product_id': '54321', 'quantity': '2', 'unit_price': '8.19'},
   {'product_id': '112233', 'quantity': '1', 'unit_price': <colander.null>}]}}

In [7]:
json_deserialized = Message().deserialize(json_serialized)
json_deserialized

{'name': "Paul O'Brady",
 'email': 'paul.obrady@example.com',
 'gender': 'M',
 'date_of_birth': datetime.date(1990, 1, 1),
 'coffee_club_member': True,
 'percent_discount': 0,
 'order': {'order_id': 1111,
  'items': [{'product_id': 12345, 'quantity': 1, 'unit_price': 4.65},
   {'product_id': 54321, 'quantity': 2, 'unit_price': 8.19},
   {'product_id': 112233, 'quantity': 1, 'unit_price': None}]}}

## 3. Colander advanced features

Let's try out a scenario where your JSON structure is more complex, for example it contains your model predictions.

In [8]:
import numpy as np

json_predictions = {
    'proba': np.array([0.56, 0.83, 0.23, 0.76, 0.92]),
    'categ': ['A', 'A', 'B', 'B', 'C']
}

As before, we define the nodes, sequences and mapping.

### 3.1 Custom `SchemaType`

In [9]:
# Define the individual nodes
class Probability(colander.SchemaNode):
    schema_type = colander.Float
    validator = colander.Range(min=0.00, max=1.00)

class Category(colander.SchemaNode):
    schema_type = colander.Str
    validator = colander.OneOf(['A', 'B', 'C'])
    
# Define the sequences
class Probabilities(colander.SequenceSchema):
    proba = Probability()

class Categories(colander.SequenceSchema):
    categ = Category()
    
# Define the MappingSchema
class ModelResults(colander.MappingSchema):
    proba = Probabilities()
    categ = Categories()

In [10]:
# serialized
serialized_results = ModelResults().serialize(json_predictions)
serialized_results

{'proba': ['0.56', '0.83', '0.23', '0.76', '0.92'],
 'categ': ['A', 'A', 'B', 'B', 'C']}

In [11]:
# deserialized
deserialized_results = ModelResults().deserialize(serialized_results)
deserialized_results

{'proba': [0.56, 0.83, 0.23, 0.76, 0.92], 'categ': ['A', 'A', 'B', 'B', 'C']}

Note that we have lost our original numpy array! This is because the `SchemaType` of our probability is defined as a `colander.Float`. In order to get back a numpy array, we'll need to define our own `SchemaType`.

In [12]:
from colander import SchemaType, Invalid, null


class NumpyArray(SchemaType):
    def serialize(self, node, appstruct):
        # appstruct is the Python-side value: we expect a numpy array
        if appstruct is null:
            return null
        if not isinstance(appstruct, np.ndarray):
            raise Invalid(node, '%r is not a np.array' % (appstruct,))
        return appstruct.tolist()

    def deserialize(self, node, cstruct):
        # cstruct is the serialized value: we expect a plain list
        if cstruct is null:
            return null
        if not isinstance(cstruct, list):
            raise Invalid(node, '%r is not a list' % (cstruct,))
        return np.array(cstruct, dtype=float)

# Define the MappingSchema
class ModelResults(colander.MappingSchema):
    proba = Probabilities(NumpyArray())
    categ = Categories()

In [13]:
# serialized
serialized_results = ModelResults().serialize(json_predictions)
serialized_results 

{'proba': [0.56, 0.83, 0.23, 0.76, 0.92], 'categ': ['A', 'A', 'B', 'B', 'C']}

In [14]:
# deserialized

deserialized_results = ModelResults().deserialize(serialized_results)
deserialized_results

{'proba': array([0.56, 0.83, 0.23, 0.76, 0.92]),
 'categ': ['A', 'A', 'B', 'B', 'C']}

### 3.2 Deferred functions and schema binding

If we need an extra indicator in our deserialized JSON that is not always provided before serialization and should default to `False`, we can do:

In [15]:
class Indicators(colander.SequenceSchema):
    ind = colander.SchemaNode(colander.Bool(), default=False, missing=False)
    missing = [False, False, False, False]
    
    
class ModelResults(colander.MappingSchema):
    proba = Probabilities(NumpyArray())
    categ = Categories()
    indic = Indicators()

# deserialized
deserialized_results = ModelResults().deserialize(serialized_results)
deserialized_results

{'proba': array([0.56, 0.83, 0.23, 0.76, 0.92]),
 'categ': ['A', 'A', 'B', 'B', 'C'],
 'indic': [False, False, False, False]}

However, here we hardcoded a default `indic` list of length 4, regardless of the length of the input `proba` and `categ` lists (5 in this example).

A way around this issue is to:
* define a deferred function that we'll use in place of the hardcoded missing values
* bind the schema, passing the relevant parameter, before deserialization

In [16]:
@colander.deferred
def missing_indicators(node, kw):
    return [False] * kw.get('n_predictions')

class Indicators(colander.SequenceSchema):
    ind = colander.SchemaNode(colander.Bool(), default=False, missing=False)
    missing = missing_indicators

class ModelResults(colander.MappingSchema):
    proba = Probabilities(NumpyArray())
    categ = Categories()
    indic = Indicators()
    
# deserialized
deserialized_results = ModelResults().bind(n_predictions=len(json_predictions['proba'])).deserialize(serialized_results)
deserialized_results

{'proba': array([0.56, 0.83, 0.23, 0.76, 0.92]),
 'categ': ['A', 'A', 'B', 'B', 'C'],
 'indic': [False, False, False, False, False]}

In [17]:
json_predictions = {
    'proba': np.array([0.56, 0.83]),
    'categ': ['A', 'A']
}

In [18]:
serialized_results = ModelResults().bind(n_predictions=len(json_predictions['proba'])).serialize(json_predictions)
serialized_results

{'proba': [0.56, 0.83], 'categ': ['A', 'A'], 'indic': <colander.null>}

In [19]:
deserialized_results = ModelResults().bind(n_predictions=len(json_predictions['proba'])).deserialize(serialized_results)
deserialized_results

{'proba': array([0.56, 0.83]), 'categ': ['A', 'A'], 'indic': [False, False]}