#### Introduction

Serialising and deserialising objects is useful for persistence of data (even after a program has terminated) and/or transmission.

# 01 - Pickling

#### Lecture

This is a Python-specific mechanism to serialise/deserialise objects using a **binary** representation (by default).

While pickling applies to *more* than just python dictionaries, we will focus on dictionaries here because of JSON - it's easy to serialise/deserialise them into JSON.

But not all data types are serialisable to JSON; `datetime`s, for example, have no JSON equivalent and don't serialise without loss of data. There are 3rd party libraries that solve these problems (e.g. marshmallow).

**Object/Data Marshalling** is the process of serialising **and** deserialising objects/data:

`obj -- serialise --> 0101001110011... -- deserialise --> obj`

Unpickling data can be **dangerous** because the pickled data can **execute code** while it is being loaded. Only unpickle data from sources you trust.
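
To make the warning concrete, here is a minimal sketch of how loading a pickle can run code. The class name `Sneaky` is made up for illustration, and the expression passed to `eval` is deliberately harmless; a malicious payload could call anything.

```python
import pickle


class Sneaky:
    # __reduce__ tells pickle how to rebuild the object:
    # "call this callable with these arguments".
    # Here the callable is the built-in eval.
    def __reduce__(self):
        return (eval, ("1 + 1",))


payload = pickle.dumps(Sneaky())

# Loading the payload calls eval("1 + 1") and returns its result
# instead of a Sneaky instance.
result = pickle.loads(payload)
print(result)  # 2
```

Nothing in `payload` looks like Python source code, yet `pickle.loads` ended up evaluating an expression chosen by whoever created the pickle.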

##### Usage

```import pickle```

`dump` -> pickle to file

`load` -> unpickle from file

`dumps` -> returns the pickled representation as a `bytes` object that can be stored in a variable

`loads` -> unpickles from a `bytes` object
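
As a quick sketch of the four functions side by side (the file name `data.pickle` and the temporary directory are arbitrary choices for this example):

```python
import pickle
import tempfile
from pathlib import Path

data = {"name": "pickle", "values": [1, 2, 3]}

# dumps/loads: round-trip through an in-memory bytes object
blob = pickle.dumps(data)
from_bytes = pickle.loads(blob)

# dump/load: round-trip through a file opened in binary mode
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.pickle"
    with open(path, "wb") as f:
        pickle.dump(data, f)
    with open(path, "rb") as f:
        from_file = pickle.load(f)

print(type(blob).__name__)                     # bytes
print(from_bytes == data, from_file == data)   # True True
```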

##### Equality and Identity

A pickled object does not contain information of its ID. Therefore, if a dictionary `dict_1` is pickled and then unpickled, the final dictionary `dict_2` will have a different ID to the original.

`dict_1 == dict_2` but `dict_1 is not dict_2`

Serialising/deserialising data behaves very similarly to making deepcopies. If we deepcopy an object which contains two references to the same object, then the copy will ensure that the relationship is maintained. To elaborate with an example:

```python
from copy import deepcopy

my_list = [1, 2]
l1 = ['a', 'b', my_list, my_list]

print(l1[2] == l1[3])  # True
print(l1[2] is l1[3])  # True

l2 = deepcopy(l1)
print(l2)  # ['a', 'b', [1, 2], [1, 2]]

print(l2[2] == l2[3])  # True
print(l2[2] is l2[3])  # True: the shared reference survives the copy
```

So Python sees the shared reference of `l1[2]` and `l1[3]` pointing to `my_list` and it replicates that relationship in the copy.

#### Coding

##### `.dumps()` and `.loads()`

We can pickle **strings**:

In [1]:
import pickle

In [2]:
ser = pickle.dumps('Python Pickle Peppers')
ser

b'\x80\x04\x95\x19\x00\x00\x00\x00\x00\x00\x00\x8c\x15Python Pickle Peppers\x94.'

In [3]:
deser = pickle.loads(ser)
deser

'Python Pickle Peppers'

And **floats/integers**:

In [4]:
ser = pickle.dumps(3.14)
ser

b'\x80\x04\x95\n\x00\x00\x00\x00\x00\x00\x00G@\t\x1e\xb8Q\xeb\x85\x1f.'

In [5]:
deser = pickle.loads(ser)
deser

3.14

And **sets**:

In [13]:
ser = pickle.dumps({'a', 'b', 10})
ser

b'\x80\x04\x95\x0f\x00\x00\x00\x00\x00\x00\x00\x8f\x94(\x8c\x01a\x94K\n\x8c\x01b\x94\x90.'

In [14]:
deser = pickle.loads(ser)
deser

{10, 'a', 'b'}

And **lists/tuples**:

In [10]:
l1 = [10, 20, ('a', 'b', 30)]
ser = pickle.dumps(l1)
ser

b'\x80\x04\x95\x15\x00\x00\x00\x00\x00\x00\x00]\x94(K\nK\x14\x8c\x01a\x94\x8c\x01b\x94K\x1e\x87\x94e.'

In [11]:
l2 = pickle.loads(ser)
l2

[10, 20, ('a', 'b', 30)]

But remember that the IDs will **change**. They are **equal** but not **identical**.

In [12]:
print(f"{l1 == l2 = }")
print(f"{l1 is l2 = }")

l1 == l2 = True
l1 is l2 = False


And **dictionaries**:

In [15]:
from datetime import datetime

d = {
    'a': 100,
    'b': [1, 2, 3],
    'c': (1, 2, 3),
    'd': {'x': 1 + 1j, 'y': datetime.utcnow()}
}

ser = pickle.dumps(d)
ser

b'\x80\x04\x95\x8b\x00\x00\x00\x00\x00\x00\x00}\x94(\x8c\x01a\x94Kd\x8c\x01b\x94]\x94(K\x01K\x02K\x03e\x8c\x01c\x94K\x01K\x02K\x03\x87\x94\x8c\x01d\x94}\x94(\x8c\x01x\x94\x8c\x08builtins\x94\x8c\x07complex\x94\x93\x94G?\xf0\x00\x00\x00\x00\x00\x00G?\xf0\x00\x00\x00\x00\x00\x00\x86\x94R\x94\x8c\x01y\x94\x8c\x08datetime\x94\x8c\x08datetime\x94\x93\x94C\n\x07\xe8\x01\x1c\x13: \x07\x94\xda\x94\x85\x94R\x94uu.'

In [16]:
deser = pickle.loads(ser)
deser

{'a': 100,
 'b': [1, 2, 3],
 'c': (1, 2, 3),
 'd': {'x': (1+1j), 'y': datetime.datetime(2024, 1, 28, 19, 58, 32, 496858)}}

As mentioned in the lecture, shared reference relationships are maintained with serialising/deserialising just like with deepcopies:

In [18]:
my_dict = {'a': 10, 'b': 20}
d = {'x': 100, 'y': my_dict, 'z': my_dict}

print(d['y'] == d['z'])
print(d['y'] is d['z'])

True
True


In [19]:
ser = pickle.dumps(d)
d2 = pickle.loads(ser)

print(d2['y'] == d2['z'])
print(d2['y'] is d2['z'])

True
True


# 02 - JSON Serialization

#### Lecture

JSON has just a few data types it supports:

* **Strings**: must be delimited by double quotes
* **Booleans**: the values `true` and `false`
* **Numbers**: can be integers, or floats (including exponential notation, `1.3E2` for example), but are all considered **floats** in the standard
* **Arrays**: an **ordered** collection of zero or more items of any valid JSON type
* **Objects**: an **unordered** collection of `key:value` pairs - **the keys must be strings (so delimited by double quotes)**, and the values can be any valid JSON type.
* **NULL**: a null object, denoted by `null` and equivalent to `None` in Python.

Python dictionaries are **objects** while JSON is essentially a **string**.
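
A small sketch of how the types above map between Python and JSON (the dictionary contents are made up for illustration):

```python
import json

py_obj = {"s": "text", "flag": True, "nothing": None, "num": 1.3e2, "arr": [1, 2]}

# The serialised form is just a Python str containing JSON text
as_json = json.dumps(py_obj)
print(type(as_json).__name__)  # str
print(as_json)
# {"s": "text", "flag": true, "nothing": null, "num": 130.0, "arr": [1, 2]}

# Because every value here has a direct JSON equivalent, the round trip is lossless
back = json.loads(as_json)
print(back == py_obj)  # True
```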

##### Problems

- JSON keys must be strings, but Python dictionary keys just need to be hashable. So if you had an `integer` as a key in your Python dictionary, how would you serialise it?
- JSON value types are limited to those above. So we can't have tuples, datetime objects, `Decimal`s, `Fraction`s or custom classes. How do we serialise these, and then deserialise them back to their original objects?

Solution? **Custom Serialisation**.

#### Coding

Serialisation and deserialisation is very similar to pickling.

```import json```

`dump` -> dump to file

`load` -> load from file

`dumps` -> dumps python object to a string containing JSON

`loads` -> loads a string containing JSON into a python object

`dump` and `dumps` have additional arguments for controlling serialisation:
- `skipkeys: bool = False`: when dumping a Python dictionary, every key must be one of the basic types (`str`, `int`, `float`, `bool` or `None`); any other key raises a `TypeError`. If set to `True`, unserialisable keys are skipped instead.
- `indent: int = None`: useful for human readability
- `separators: tuple = (", ", ": ")`: the first element customises how items (key-value pairs and array elements) are separated and the second customises how keys are separated from their values.

  We can use this to compact the JSON output for small performance improvements (less data to store or transmit).

  Note, we still need valid JSON, so the most compact form is having `indent = None` and `separators = (",", ":")`
- `sort_keys: bool = False`: if `True` the JSON string will have sorted keys. Since the keys will be strings, they will be alphanumerically sorted.

and more..
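
A brief sketch of these arguments in action (the dictionaries are invented for the example):

```python
import json

d = {"b": 2, "a": [1, 2]}

# Sorted keys plus the most compact separators that still give valid JSON
compact = json.dumps(d, sort_keys=True, separators=(",", ":"))
print(compact)  # {"a":[1,2],"b":2}

# A tuple is hashable, so it is a valid dict key, but it is not a valid JSON key
with_bad_key = {"a": 1, (1, 2): "tuple key"}

try:
    json.dumps(with_bad_key)
except TypeError as ex:
    print("TypeError:", ex)

# With skipkeys=True the offending key is silently dropped
skipped = json.dumps(with_bad_key, skipkeys=True)
print(skipped)  # {"a": 1}
```

Note that `skipkeys=True` discards data without warning, so it is worth using only when losing those entries is acceptable.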


We have a `pprint` equivalent for JSON which is achieved using the `indent` parameter of `json.dumps()`:

In [50]:
import json

d1 = {'a': 100, 'b': 200, 'c': [1, 2, 3]}
d1_json = json.dumps(d1)

print(json.dumps(d1), "\n")
print(json.dumps(d1, indent=2), "\n")
print(json.dumps(d1, indent='___'))

{"a": 100, "b": 200, "c": [1, 2, 3]} 

{
  "a": 100,
  "b": 200,
  "c": [
    1,
    2,
    3
  ]
} 

{
___"a": 100,
___"b": 200,
___"c": [
______1,
______2,
______3
___]
}


##### Problem 1

As we said, JSON keys must be strings. So what if we serialise and deserialise objects with non-string keys? Do we get the same object back?

In [22]:
d1 = {1: 100, 2: 200}
d1_json = json.dumps(d1, indent=2)
d2 = json.loads(d1_json)
print(d2)

{'1': 100, '2': 200}


The keys are **different**.

In [23]:
d1 == d2

False
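
One possible workaround, sketched here under the assumption that *every* key that looks like an integer really was an integer originally, is to convert the keys back while loading. The helper name `restore_int_keys` is hypothetical, and it relies on the `object_hook` argument of `json.loads()`, which is a callable applied to each decoded dictionary.

```python
import json


def restore_int_keys(obj: dict) -> dict:
    """Hypothetical helper: turn keys that parse as integers back into ints."""
    restored = {}
    for key, value in obj.items():
        try:
            restored[int(key)] = value
        except ValueError:
            # Not an integer-like string, keep the key as it is
            restored[key] = value
    return restored


d1 = {1: 100, 2: 200}
d1_json = json.dumps(d1)
d2 = json.loads(d1_json, object_hook=restore_int_keys)

print(d2)        # {1: 100, 2: 200}
print(d1 == d2)  # True
```

The limitation is that a key that was genuinely the string `'1'` before serialising would also come back as the integer `1`, so this only works when both sides agree on the convention.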

##### Problem 2

How will Python handle unsupported types such as tuples for the dictionary values?

In [35]:
d = {'a': (1, 2, 3)}
ser = json.dumps(d)
print(ser)

{"a": [1, 2, 3]}


In [37]:
deser = json.loads(ser)
print(deser)

{'a': [1, 2, 3]}


The tuple was coerced into a list. Therefore, the deserialised value is **different** from the original:

In [38]:
d == deser

False

But, python at least tried to find the most similar thing to a tuple. With more complex objects such as datetime objects, `Decimal`s and class instances, it will raise `TypeError`s:

In [49]:
from decimal import Decimal
from datetime import datetime

try:    
    json.dumps({'a': datetime(2024, 12, 23, 13, 37), 'b': Decimal(0.5)})
except TypeError as ex:
    print(ex)

Object of type datetime is not JSON serializable


# 03 - Custom JSON Encoding

##### `datetime` objects

Python cannot serialise certain data types by itself, but we can tell it how to. `json.dump()` and `json.dumps()` have a parameter called `default`. This takes a callable with a single argument, which is called on any object that Python cannot serialise by itself. For example, passing `str` will convert any unserialisable object to its string representation, which can be serialised.
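
A minimal sketch of `default=str` (the record below is made up for illustration):

```python
import json
from datetime import datetime

record = {"time": datetime(2024, 2, 17, 17, 45, 31), "tags": {"a"}}

# str() is called on the datetime and on the set, because neither is
# serialisable by default
as_json = json.dumps(record, default=str)
print(as_json)
# {"time": "2024-02-17 17:45:31", "tags": "{'a'}"}
```

This always produces valid JSON, but note how the set ends up as the string `"{'a'}"` rather than an array, which is why a more specific callable is usually preferable.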

Often we will come up with our own custom format or use an industry standard. For example, datetime objects are commonly serialised using **ISO 8601** (https://en.wikipedia.org/wiki/ISO_8601).

This has the format of:
*YYYY-MM-DD***T***HH:MM:SS*

The **T** is a character to separate the date from the time.

Note: the `str` representation of a time is not the same as the ISO format; it's similar but more human readable. As you can see it does not contain the required **T** character.

In [52]:
from datetime import datetime
str(datetime.now())

'2024-02-17 17:45:31.612621'

If you want to format the year, month, day, time etc. in a particular way, use the **`.strftime()` method** of datetime objects.

The reverse of this method is `strptime()` which takes a string in a particular format and converts it to a datetime object.

In the format string, you indicate the different time components with directives. For example, `'%Y'` gives a 4-digit year, e.g. 2012, whereas `'%y'` gives the year without century, zero-padded, e.g. 2012 -> 12.
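
A short sketch of the two directives and of the `strftime()`/`strptime()` round trip (the date is arbitrary):

```python
from datetime import datetime

dt = datetime(2012, 3, 4, 5, 6, 7)

print(dt.strftime("%Y"))  # 2012
print(dt.strftime("%y"))  # 12

# Format to a string, then parse the string back with the same format
fmt = "%Y-%m-%dT%H:%M:%S"
text = dt.strftime(fmt)
parsed = datetime.strptime(text, fmt)

print(text)          # 2012-03-04T05:06:07
print(parsed == dt)  # True
```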

Here's our custom, simple ISO format (without including timezones etc.)

In [55]:
def format_iso(dt):
    return dt.strftime('%Y-%m-%dT%H:%M:%S')

format_iso(datetime.now())

'2024-02-17T17:52:25'

Alternatively, we can use the inbuilt ISO format in `datetime` objects:

In [56]:
datetime.now().isoformat()

'2024-02-17T17:53:46.213212'

Now let's serialise it:

In [57]:
log_record = {'time': datetime.utcnow()}
print(json.dumps(log_record, default=format_iso))

{"time": "2024-02-17T17:55:55"}


##### `singledispatch`

If our `log_record` dictionary contained an unserialisable type that was *not* a datetime object, we're going to run into issues because the same callable will be used on it.

We can get around this by checking the type of argument in the callable and handling it that way. For example, if we only had `datetime`s and `set`, this could work:

In [None]:
def custom_json_formatter(arg):
    if isinstance(arg, datetime):
        return arg.isoformat()
    elif isinstance(arg, set):
        return list(arg)

Obviously this doesn't scale well, so we should move over to using a `singledispatch`:

Let's write a single dispatcher that registers specific implementations for `datetime`, `set`, `Decimal` - and for anything else, use their string representation. 

To do that, we first need to write the default behaviour which is stringifying the object. Then, we register our individual types:

In [60]:
from functools import singledispatch
from decimal import Decimal

@singledispatch
def json_format(arg):
    return str(arg)

@json_format.register(datetime)
def _(arg):
    return arg.isoformat()

@json_format.register(set)
def _(arg):
    return list(arg)

@json_format.register(Decimal)
def _(arg):
    return f"Decimal({str(arg)})"

In [61]:
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age
        self.create_dt = datetime.utcnow()
        
    def __repr__(self):
        return f'Person(name={self.name}, age={self.age})'

log_record = dict(
    time=datetime.utcnow(),
    message='Created new person',
    person=Person('John', 24),
    complex_number = 1 + 1j,
)

print(json.dumps(log_record, indent=2, default=json_format))

{
  "time": "2024-02-17T18:26:58.562067",
  "message": "Created new person",
  "person": "Person(name=John, age=24)",
  "complex_number": "(1+1j)"
}
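
The log record above only exercises the `datetime` implementation and the `str()` fallback. As a complementary sketch, here is a self-contained version of the same dispatcher showing the `set` and `Decimal` branches as well (the `record` contents are made up; a single-element set is used so the output order is predictable):

```python
import json
from datetime import datetime
from decimal import Decimal
from functools import singledispatch


@singledispatch
def json_format(arg):
    # Fallback for any type without a registered implementation
    return str(arg)


@json_format.register(datetime)
def _(arg):
    return arg.isoformat()


@json_format.register(set)
def _(arg):
    return list(arg)


@json_format.register(Decimal)
def _(arg):
    return f"Decimal({arg})"


record = {
    "when": datetime(2024, 2, 17, 18, 26),
    "tags": {"x"},
    "price": Decimal("0.5"),
}
as_json = json.dumps(record, default=json_format)
print(as_json)
# {"when": "2024-02-17T18:26:00", "tags": ["x"], "price": "Decimal(0.5)"}
```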


# 04 - Custom Encoding using JSONEncoder

#### Lecture

Python already uses the `JSONEncoder` class in the `json` module when we use `json.dumps()` for serialisation.

The `JSONEncoder` shares many arguments with `dump`/`dumps`: `default`, `skipkeys`, `sort_keys`, `indent`, `separators`, ...

But `dump`/`dumps` has one extra: `cls`: this allows us to specify our **own** version of `JSONEncoder` to be used when `dump`/`dumps` runs.

The reason why we would want to make our own `JSONEncoder` class is so that we can define our arguments (`default`, `skipkeys`, `indent`, etc.) once and then apply it to all of our `json.dumps()` calls. 

**Procedure**

1. Subclass `JSONEncoder`.
2. Customise initialiser of the **parent** class with the specific arguments that we want. (`super().__init__(skipkeys=True, allow_nan=False)`.)
3. Override the `default` method if we want. For anything we do not handle ourselves, we can delegate back to the parent class: `else: return super().default(arg)`

#### Coding

To get a handle on the default encoder, we just create an instance of `JSONEncoder`:

In [1]:
import json
default_encoder = json.JSONEncoder()
print(default_encoder.encode(True))
print(default_encoder.encode(None))
print(default_encoder.encode((1, 2, 3)))

true
null
[1, 2, 3]


Now for our custom one. Note that our `CustomJSONEncoder` inherits from `json.JSONEncoder`. `json.dumps()` passes its keyword arguments to the class given in the `cls` argument, which is `CustomJSONEncoder` in our case, where they arrive as `*args` and `**kwargs`.

We don't do anything with these `*args` and `**kwargs`, but we do define which arguments we want and explicitly pass them to the superclass. When `encode` is called on our `CustomJSONEncoder`, it will call the `encode` method of the superclass (`json.JSONEncoder`), which will make use of the explicitly passed parameters that we initialised earlier in the superclass.

The entire serialisation implementation is done in the superclass `json.JSONEncoder`, except those methods that we've overridden such as `default`.

So, when `json.dumps()` needs `default()` to encode a particular value, it will use our overridden implementation, not the one from the superclass.

In [7]:
from datetime import datetime

class CustomJSONEncoder(json.JSONEncoder):
    def __init__(self, *args, **kwargs):
        super().__init__(
            skipkeys=True,
            allow_nan=False,
            indent='___',
            separators=('',' = ')
        )

    # overriding original default with ours
    def default(self, arg):
        if isinstance(arg, datetime):
            datetime_dict = dict(
                datatype="datetime",
                iso=arg.isoformat(),
                date=arg.date().isoformat(),
                time=arg.time().isoformat(),
                year = arg.year
            )
            return datetime_dict
        else:
            return super().default(arg)

custom_encoder = CustomJSONEncoder()

In [8]:
print(custom_encoder.encode(datetime.now()))

{
___"datatype" = "datetime"
___"iso" = "2024-02-24T14:44:30.597643"
___"date" = "2024-02-24"
___"time" = "14:44:30.597643"
___"year" = 2024
}


In [74]:
try:
    print(custom_encoder.encode({1, 2, 3}))
except TypeError as ex:
    print(ex)

Object of type set is not JSON serializable


Now for dumping. Note that we pass the *class* itself and not the class instance.

In [75]:
my_dict = dict(name='test', time=datetime.now())
print(json.dumps(my_dict, cls=CustomJSONEncoder))

{
___"name" = "test"
___"time" = {
______"datatype" = "datetime"
______"iso" = "2024-02-18T16:34:19.151692"
______"date" = "2024-02-18"
______"time" = "16:34:19.151692"
______"year" = 2024
___}
}


**Note:** 

- You might've noticed the `*args, **kwargs` in our custom class' `__init__`. This is because `json.dumps()` will pass the default arguments of `json.dumps()` to the class in its `cls` argument. In other words, `json.dumps(my_dict)` is identical to `json.dumps(my_dict, skipkeys=False, allow_nan=True, ...)` because those are the default arguments.
- `default` does not need to return a single value. In our case, we've serialised the datetime object into a dictionary.
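
As a variation on the first note, an encoder can also forward the received `**kwargs` to the parent class and only force the options it cares about, so that callers can still choose the rest (e.g. `indent`). This is a sketch of that alternative design, not the approach used above; the class name `SortedDatetimeEncoder` is hypothetical.

```python
import json
from datetime import datetime


class SortedDatetimeEncoder(json.JSONEncoder):
    def __init__(self, *args, **kwargs):
        # json.dumps() always passes sort_keys (False by default),
        # so we overwrite it rather than rely on a missing key
        kwargs["sort_keys"] = True
        # Everything else (indent, separators, ...) is forwarded unchanged
        super().__init__(*args, **kwargs)

    def default(self, arg):
        if isinstance(arg, datetime):
            return arg.isoformat()
        return super().default(arg)


data = {"b": 1, "a": datetime(2024, 1, 1)}

flat = json.dumps(data, cls=SortedDatetimeEncoder)
print(flat)  # {"a": "2024-01-01T00:00:00", "b": 1}

# The caller's indent is still honoured because it was forwarded
print(json.dumps(data, cls=SortedDatetimeEncoder, indent=2))
```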

# 05 - Custom JSON Decoding

#### Lecture

##### Introduction

When we use `json.load()` / `json.loads()`, the simple standard types i.e. strings, booleans, numbers (int/float), arrays and objects (dictionaries) will work out of the box.

**`load()`/`loads()` has some optional arguments:**
- `object_hook`: **Callable** - This is called on every object (dictionary) including the root object (dictionary) that encloses the entire json object. This callable returns another dictionary which replaces the original object. This is very similar to the `default` argument we saw in the `dump`/`dumps` functions - but works for decoding instead of encoding.

    For example, if we had:
  ```python
  data = {'username': 'John',
          'createdAt': {'objecttype': 'datetime', 'value': '2018-10-21T09:14:15'}}
  ```
  We could write a function `custom_decoder` that takes a dictionary, looks to see whether it has the key `objecttype` with the value `'datetime'` and, if so, converts `value` to a datetime object and returns the new decoded dictionary.

  So, `custom_decoder(data) ->`
  ```python
  data = {'username': 'John',
          'createdAt': {'objecttype': 'datetime', 'value': datetime(2018, 10, 21, 9, 14, 15)}}
  ```


    Note: when `object_hook` is provided, it is only called *after* the initial deserialisation (the one that occurs when no `object_hook` is provided); that is, `object_hook` and `object_pairs_hook` receive a **parsed** object, which may have been parsed using one of the `parse_...` arguments **first** (see below).

    This initial deserialisation converts e.g. `j = '''{"a": [1, 2]}'''` into a dictionary where the keys and values are the appropriate types, e.g. '[1, 2]' into a list object.
  ```python
  # the outermost braces delimit the root dictionary
  j = '''
      {
          "a": 1, 
          "b": {
             "sub1": [1, 2, 3],
             "sub2": {
                   "x": 100,
                   "y": 200
              }
          }
      }     
      '''
  ```
  `sub2`, `b` and `root` are all dictionaries that will be passed to the `object_hook` callable in that order - from the innermost to the outermost. The `root` is always passed last.<br/><br/>
- `object_pairs_hook`: Related to `object_hook`. We can't use both at the same time; if both are specified, `object_hook` is ignored. First note that JSON objects are unordered, so the order of their keys should not be relied upon, whereas arrays preserve order: deserialising '[1, 2, 3]' will always give `[1, 2, 3]`. `object_pairs_hook` behaves identically to `object_hook` except that, instead of passing a dictionary to the callable, a list of key-value tuples is passed:
    - `object_hook` receives: `{"a": 1, "b": 2}`
    - `object_pairs_hook` receives: `[ ("a", 1), ("b", 2) ]`

    With this, we are guaranteed order of keys seen in the initial json string.<br/><br/> 
- `parse_float`, `parse_int`, `parse_constant`: **Callable** with single str argument - Remember that we could custom serialise `Fraction(3, 2)` to the number 1.5. Well, what if we got a JSON containing the number 1.5 and we wanted to deserialise to `Fraction(3, 2)`? 

    This would require overriding the **initial deserialisation** that occurs before `object_hook`. Depending on what the value is in the JSON, we would use either `parse_float`, `parse_int` or `parse_constant` (relevant for Infinity, NaN, etc.) to override how it is deserialised. Note that we cannot override strings. Here's an example:

    ```python
    import json
    from decimal import Decimal

    def make_decimal(arg: str):
        return Decimal(arg)

    j = '''{"a": 100.5}'''
    res = json.loads(j, parse_float=make_decimal)

    print(res)  # {'a': Decimal('100.5')}
    ```
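
To see the two hooks at work, here is a small sketch that records what each one receives. The function names `recording_hook` and `keep_pairs` are made up for this example.

```python
import json

j = '{"a": 1, "b": {"sub1": [1, 2, 3], "sub2": {"x": 100, "y": 200}}}'

seen = []


def recording_hook(obj: dict) -> dict:
    # Record the keys of every dictionary we are handed, in call order
    seen.append(sorted(obj))
    return obj


json.loads(j, object_hook=recording_hook)
print(seen)
# [['x', 'y'], ['sub1', 'sub2'], ['a', 'b']]  -> sub2, then b, then root


def keep_pairs(pairs: list) -> list:
    # Return the list of (key, value) tuples untouched instead of a dict
    return pairs


pairs = json.loads('{"a": 1, "b": 2}', object_pairs_hook=keep_pairs)
print(pairs)  # [('a', 1), ('b', 2)]
```

Whatever the hook returns replaces the object, so returning the pairs unchanged means the JSON object is decoded as a list of tuples rather than a dictionary.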

##### Schemas

We need to know the structure of the JSON data in order to perform custom **deserialisation**. This structure is called a **schema** - a pre-defined agreement on how the JSON is going to be **serialised**.

The schema does not need to be for the entire JSON; it can be for subcomponents only.

This is an example JSON schema for dealing with fractions:
```python
j = '''
    {
        "cake": "yummy chocolate cake",
        "myShare": {
            "objecttype": "fraction",
            "numerator": 1,
            "denominator": 8
        }
    }
'''
```

We can use this to create a `Fraction` object with the appropriate numerator and denominator that will replace the original value of the `myShare` key. 

As you can imagine, we can define numerous dictionaries with different `objecttype` and handle each of them accordingly. 

The way to do this in a *less clunky* way is using `object_hook`. This is like the decode equivalent of the `default` argument which was used for encoding.

#### Coding

Let's make some JSON data and write our custom decoder which takes a dictionary and **replaces it** with an appropriate Python type. We will take datetime strings and parse them into datetime objects, and take fraction descriptions (numerator and denominator) and turn them into `Fraction` objects:

In [5]:
from datetime import datetime
import json
from pprint import pprint
from fractions import Fraction

j = '''
    {
        "times": {
            "created": {
                "objecttype": "datetime",
                "value": "2018-10-21T09:14:15"
                },
            "updated": {
                "objecttype": "datetime",
                "value": "2018-10-22T10:00:05"
                }
            },

        "cake": "yummy chocolate cake",
        "myShare": {
            "objecttype": "fraction",
            "numerator": 1,
            "denominator": 8
        }
    }
'''

def custom_decoder(arg: dict):
    if "objecttype" in arg:
        if arg["objecttype"] == "datetime":
            return datetime.strptime(arg["value"], "%Y-%m-%dT%H:%M:%S")
        
        elif arg['objecttype'] == 'fraction':
            return Fraction(arg['numerator'], arg['denominator'])

        return arg

    else:
        return arg

pprint(json.loads(j, object_hook=custom_decoder))

{'cake': 'yummy chocolate cake',
 'myShare': Fraction(1, 8),
 'times': {'created': datetime.datetime(2018, 10, 21, 9, 14, 15),
           'updated': datetime.datetime(2018, 10, 22, 10, 0, 5)}}


We **need** the `return arg` blocks because the root dictionary of the JSON will always be passed to the `object_hook` callable. Without the block, when the root dictionary is passed in, the decoder will return `None` and it'll be replaced with `None`. Therefore, `json.loads()` returns `None` overall.
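
A minimal sketch of that failure mode, using a cut-down version of the JSON above and two hypothetical hooks that differ only in the fallback `return arg`:

```python
import json
from fractions import Fraction

j = '''
{
    "cake": "yummy chocolate cake",
    "myShare": {"objecttype": "fraction", "numerator": 1, "denominator": 8}
}
'''


def hook_without_fallback(arg: dict):
    if arg.get("objecttype") == "fraction":
        return Fraction(arg["numerator"], arg["denominator"])
    # No `return arg` here, so every other dictionary is replaced by None


def hook_with_fallback(arg: dict):
    if arg.get("objecttype") == "fraction":
        return Fraction(arg["numerator"], arg["denominator"])
    return arg


broken = json.loads(j, object_hook=hook_without_fallback)
working = json.loads(j, object_hook=hook_with_fallback)

print(broken)   # None
print(working)  # {'cake': 'yummy chocolate cake', 'myShare': Fraction(1, 8)}
```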

An example of `object_pairs_hook` can be found in the original notebook.

# 06 - Using JSONDecoder

Instead of using an `object_hook`, we can create a custom decoder to override the `json.JSONDecoder` that is called with every `json.loads()`. Both approaches are fine but the reason why we'd want to do it this way is for more consistent usage of our decoder as well as hardcoding in some default arguments in the `json.loads()` function.

Just like we can use a subclass of `JSONEncoder` to customize our json encodings, we can use a subclass of the default `JSONDecoder` class to customize decoding our json strings.

It works quite differently from the `JSONEncoder` subclassing though.

When we subclass `JSONEncoder` we override the `default` method which then allows us to intercept encoding of specific types of objects, and delegate back to the parent class what we don't want to handle specifically.

With the `JSONDecoder` class we override the `decode` function which passes us the **entire** JSON as a **string** and we have to return whatever Python object we want, which is usually a dictionary. There's no delegating anything back to the parent class unless we want to completely skip customizing the output.

In [37]:
import json

In [38]:
j = '''
    {
        "a": 100,
        "b": [1, 2, 3],
        "c": "python",
        "d": {
            "e": 4,
            "f": 5.5
        }
    }
'''

In [39]:
class CustomDecoder(json.JSONDecoder):
    def decode(self, arg: str):
        print("decode:", type(arg), arg)
        return "a simple string object"

In [40]:
res = json.loads(j, cls=CustomDecoder)

decode: <class 'str'> 
    {
        "a": 100,
        "b": [1, 2, 3],
        "c": "python",
        "d": {
            "e": 4,
            "f": 5.5
        }
    }



In [41]:
res

'a simple string object'

-----------------

**Note**:

If we only want to override the defaults of `json.JSONDecoder`, we don't need to create a new custom class; instead, we can just create an instance of `json.JSONDecoder` with those hardcoded arguments and use its `decode` method:

In [58]:
from decimal import Decimal
from pprint import pprint
CustomDecoder = json.JSONDecoder(parse_float=Decimal)
res = CustomDecoder.decode(j)
pprint(res, width=1)

{'a': 100,
 'b': [1,
       2,
       3],
 'c': 'python',
 'd': {'e': 4,
       'f': Decimal('5.5')}}


-----------

It might seem like a lot of work to parse the received string into a dictionary, but there's a better approach which is typically taken.

What we're going to do is, in our `CustomDecoder`, we will call `json.loads()` on the received string - this deserialises the JSON string into a python dictionary with all default deserialisations applied e.g. `'[1, 2, 3]'` will be converted into a Python list `[1, 2, 3]`.  

We can now convert those deserialised objects into whatever specific types we want.

In the example below, we'll convert our list of `points`, each represented as a list of two values, into a point object:

In [8]:
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __repr__(self):
        return f'Point(x={self.x}, y={self.y})'

In [14]:
j_points = '''
{
    "points": [
        [10, 20],
        [-1, -2],
        [0.5, 0.5]
    ]
}
'''

j_other = '''
{
    "a": 1,
    "b": 2
}
'''

In [15]:
import json

class CustomDecoder(json.JSONDecoder):

    def decode(self, arg: str):  # remember that `decode` takes the original string
        obj: dict = json.loads(arg)  # we delegate the rough business of string parsing to `json.loads`
        if "points" in obj:
            obj["points"] = [Point(x, y) for x, y in obj["points"]]
        return obj

In [16]:
json.loads(j_points, cls=CustomDecoder)

{'points': [Point(x=10, y=20), Point(x=-1, y=-2), Point(x=0.5, y=0.5)]}

In [17]:
json.loads(j_other, cls=CustomDecoder)

{'a': 1, 'b': 2}

One limitation of this approach is that, even with `j_other`, we had to `json.loads()` the entire thing despite it not having the `"points"` key. An alternative approach which gets around this is by writing a regex pattern to match on the received string if it contains `"_type": "point"`. But there's a problem. Any amount of whitespace before and after the colon in a JSON string is still valid JSON, so if the received string contains `"_type"     :    "point"`, we would need to match for that too.

I won't go into too many details, but in regex you can do this with a particular pattern: `pattern = r'"_type"\s*:\s*"point"'`

In [30]:
import re
pattern = r'"_type"\s*:\s*"point"'
regexp = re.compile(pattern)  # compiling the pattern is useful if the pattern is used frequently throughout your program

In [31]:
print(regexp.search('"a": 1'))

None


In [32]:
print(regexp.search('"_type"   : "point"'))

<re.Match object; span=(0, 19), match='"_type"   : "point"'>


In [33]:
re.search(pattern, '"_type"  :  "point"')  # this is how you do it if you don't want to compile

<re.Match object; span=(0, 19), match='"_type"  :  "point"'>

See below for an implementation - go to Section 7, Video 56 if you want more explanation:

In [34]:
class CustomDecoder(json.JSONDecoder):
    def decode(self, arg):
        obj = json.loads(arg)
        pattern = r'"_type"\s*:\s*"point"'
        if re.search(pattern, arg):
            # we have at least one `Point'
            obj = self.make_pts(obj)
        return obj
    
    def make_pts(self, obj):
        # recursive function to find and replace points
        # received object could be a dictionary, a list, or a simple type
        if isinstance(obj, dict):
            # first see if this dictionary is a point itself
            if '_type' in obj and obj['_type'] == 'point':
                # could have used: if obj.get('_type', None) == 'point'
                obj = Point(obj['x'], obj['y'])
            else:
                # root object is not a point
                # but it could contain a sub-object which itself 
                # is or contains a Point object
                for key, value in obj.items():
                    obj[key] = self.make_pts(value)
        elif isinstance(obj, list):
            for index, item in enumerate(obj):
                obj[index] = self.make_pts(item)
        return obj

In [3]:
j = '''
{
    "a": 100,
    "b": 0.5,
    "rectangle": {
        "corners": {
            "b_left": {"_type": "point", "x": -1, "y": -1},
            "b_right": {"_type": "point", "x": 1, "y": -1},
            "t_left": {"_type": "point", "x": -1, "y": 1},
            "t_right": {"_type": "point", "x": 1, "y": 1}
        },
        "rotate": {"_type" : "point", "x": 0, "y": 0},
        "interior_pts": [
            {"_type": "point", "x": 0, "y": 0},
            {"_type": "point", "x": 0.5, "y": 0.5}
        ]
    }
}
'''

In [62]:
from pprint import pprint
pprint(json.loads(j, cls=CustomDecoder))

{'a': 100,
 'b': 0.5,
 'rectangle': {'corners': {'b_left': Point(x=-1, y=-1),
                           'b_right': Point(x=1, y=-1),
                           't_left': Point(x=-1, y=1),
                           't_right': Point(x=1, y=1)},
               'interior_pts': [Point(x=0, y=0), Point(x=0.5, y=0.5)],
               'rotate': Point(x=0, y=0)}}


Let's say we also want to convert floats to `Decimal` objects. We're going to need to hardcode `parse_float` in our initialisation: (see original notebook for other, more-readable way of doing this.)

In [31]:
import json
from decimal import Decimal
import re

class CustomDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        super().__init__(parse_float=Decimal)
    
    def decode(self, arg):
        obj = super().decode(arg) 
        pattern = r'"_type"\s*:\s*"point"'
        if re.search(pattern, arg):
            # we have at least one `Point'
            obj = self.make_pts(obj)
        return obj
    
    def make_pts(self, obj):
        # recursive function to find and replace points
        # received object could be a dictionary, a list, or a simple type
        if isinstance(obj, dict):
            # first see if this dictionary is a point itself
            if '_type' in obj and obj['_type'] == 'point':
                # could have used: if obj.get('_type', None) == 'point'
                obj = Point(obj['x'], obj['y'])
            else:
                # root object is not a point
                # but it could contain a sub-object which itself 
                # is or contains a Point object
                for key, value in obj.items():
                    obj[key] = self.make_pts(value)
        elif isinstance(obj, list):
            for index, item in enumerate(obj):
                obj[index] = self.make_pts(item)
        return obj

In [32]:
from pprint import pprint
res = json.loads(j, cls=CustomDecoder)
pprint(res)

{'a': 100,
 'b': Decimal('0.5'),
 'rectangle': {'corners': {'b_left': Point(x=-1, y=-1),
                           'b_right': Point(x=1, y=-1),
                           't_left': Point(x=-1, y=1),
                           't_right': Point(x=1, y=1)},
               'interior_pts': [Point(x=0, y=0), Point(x=0.5, y=0.5)],
               'rotate': Point(x=0, y=0)}}


In [33]:
type(res['rectangle']['interior_pts'][1].x)

decimal.Decimal

# 07 - JSONSchema

Often when we work with JSON data, the way the data is formatted is not haphazard - it often conforms to some very precise specification.

For example, REST API's will conform to some specific format for JSON input and output. 

This is called conforming to a **schema**. It is very similar to how relational databases work - we have a schema that precisely defines the columns in tables, the relationships between tables and so on.

One of the main reasons for having these schemas for JSON data is that it allows us to serialize and deserialize the data more easily - we know in advance what the JSON structure will look like, and we can therefore write code that will leverage our understanding of the JSON structure.

There are many ways in which we can define a JSON schema - it could be as simple as creating a Word document that explains how the JSON needs to be structured. That works, but there are better, standards-based approaches.

One of these is the JSON Schema standard:
https://json-schema.org/

Some terminology:

- instance: Any JSON value (e.g., a number, string, object, array, etc.) that is being validated against the schema
- schema: The JSON document that describes the structure and constraints a valid instance must satisfy
- [validation keywords](https://json-schema.org/draft/2020-12/json-schema-validation#section-6.1.1): Define the constraints and requirements of the instance. Examples of how these are used follow. Some common keywords are:
    - `"type"`: Data type of the instance.
    - `"minimum"`: Smallest value allowed for a numeric instance
    - `"maxLength"`: Maximum length allowed for a string instance
    - `"properties"`: Defines a schema for each named key of an object; it does not by itself make a key mandatory (that is what `"required"` is for)
    - `"additionalProperties"`: Indicates whether keys not listed in `"properties"` are allowed in the instance
    - `"enum"`: Value must be one of the specified values in an array
- [schema annotations](https://json-schema.org/draft/2020-12/json-schema-validation#section-9.1): Provide metadata and descriptive information about the schema or its parts. These annotations do not affect the validation of the JSON data; they only offer additional context and documentation. Some common ones are:
    - `"title"`: A short, descriptive title for the schema or a specific property
    - `"description"`: A detailed description of the schema or property for context
    - `"default"`: Specifies a default value for a property when no value is provided
- [schema keywords](https://json-schema.org/draft/2020-12/json-schema-core#section-8.1.1): Keywords that provide meta information about the JSON Schema itself. Some common ones include:
    - `"$schema"`: Specifies the version of the JSON Schema specification that the schema conforms to. For example, `"$schema": "http://json-schema.org/draft-07/schema#"` indicates that the schema follows the draft-07 version of the JSON Schema specification
    - `"$id"`: Provides a unique identifier for the schema. This can be a URI that allows the schema to be referenced from other schemas.
    - `"$vocabulary"`: Used in meta-schemas to declare which vocabularies (sets of keywords) a schema may use, allowing vocabularies to be composed and extended. The standard vocabularies include keywords such as `"minimum"` and `"maxLength"`, but custom vocabularies can be defined too (https://json-schema.org/draft/2020-12/json-schema-core#name-the-vocabulary-keyword).
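
To make the three groups of keywords concrete, here is a small sketch of a schema written as a Python dictionary, with each top-level keyword labelled by group. The schema, the `$id` URI and the `classify` helper are all made up for this example, and the helper is a rough rule of thumb for this schema only, not part of any library.

```python
import json

# Hypothetical schema; the "$id" URI is invented for illustration
product_schema = {
    # schema keywords: meta information about the schema itself
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/schemas/product.json",
    # schema annotations: documentation only, no effect on validation
    "title": "Product",
    "description": "A product in a hypothetical catalogue",
    # validation keywords: constraints on the instance
    "type": "object",
    "properties": {
        "name": {"type": "string", "maxLength": 50},
        "price": {"type": "number", "minimum": 0},
        "currency": {"enum": ["EUR", "GBP", "USD"], "default": "EUR"},
    },
    "additionalProperties": False,
}

ANNOTATIONS = {"title", "description", "default"}


def classify(keyword):
    # rough grouping that is good enough for the keywords used above
    if keyword.startswith("$"):
        return "schema keyword"
    if keyword in ANNOTATIONS:
        return "annotation"
    return "validation keyword"


for keyword in product_schema:
    print(f"{keyword:22} {classify(keyword)}")

# a schema is itself a JSON document, so it serialises like any other
schema_text = json.dumps(product_schema, indent=2)
```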
 

Without going into much more detail, we'll show how it can be used with some examples. Many libraries, in many different languages, can validate an instance against a schema, but here we'll use [jsonschema](https://github.com/Julian/jsonschema). You will need to `pip install jsonschema`.

Here's a schema:

In [11]:
person_schema = {
    "type": "object",
    "properties": {
        "firstName": {
            "type": "string",
            "minLength": 1
        },
        "middleInitial": {
            "type": "string",
            "minLength": 1,
            "maxLength": 1
        },
        "lastName": {
            "type": "string",
            "minLength": 1
        },
        "age": {
            "type": "integer", 
            "minimum": 0
        },
        "eyeColor": {
            "type": "string",
            "enum": ["amber", "blue", "brown", "gray", 
                     "green", "hazel", "red", "violet"]
        }
    },
    "required": ["firstName", "lastName"]
}

We can use the `validate` function, but it will not work with a string - it needs to be deserialized into a Python dictionary first (which means it will have to be a valid JSON structure first).

In [12]:
import json
import jsonschema
from jsonschema.exceptions import ValidationError

p1 = '''
    {
        "firstName": "John",
        "middleInitial": "M",
        "lastName": "Cleese",
        "age": 79
    }
'''

try:
    jsonschema.validate(json.loads(p1), person_schema)
except json.JSONDecodeError as ex:
    print(f'Invalid JSON: {ex}')
except ValidationError as ex:
    print(f'Validation error: {ex}')
else:
    print('JSON is valid')

JSON is valid


In [14]:
p2 = '''
    {
        "firstName": "John",
        "middleInitial": 100,
        "lastName": "Cleese",
        "age": "Unknown"
    }
'''

try:
    jsonschema.validate(json.loads(p2), person_schema)
except json.JSONDecodeError as ex:
    print(f'Invalid JSON: {ex}')
except ValidationError as ex:
    print(f'Validation error: {ex}')
else:
    print('JSON is valid')

Validation error: 100 is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    100


You'll notice that `validate` reports only a single validation error, even though this instance has two problems. We can instead run the entire validation and collect all the validation errors (if any), which uses a slightly different way of performing validation:

In [15]:
from jsonschema import Draft4Validator

validator = Draft4Validator(person_schema)

In [18]:
for error in validator.iter_errors(json.loads(p2)):
    print(error, end='\n-----------\n')

100 is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    100
-----------
'Unknown' is not of type 'integer'

Failed validating 'type' in schema['properties']['age']:
    {'minimum': 0, 'type': 'integer'}

On instance['age']:
    'Unknown'
-----------


The errors returned are `ValidationError` objects. Printing the list below shows each error's `__repr__`, because printing a list displays the `repr` of each of its elements.

In [29]:
errors = list(validator.iter_errors(json.loads(p2)))
print(errors)

[<ValidationError: "100 is not of type 'string'">, <ValidationError: "'Unknown' is not of type 'integer'">]
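
Each `ValidationError` also exposes its details as attributes, which is handy when we want to build our own error report instead of printing the whole object. The sketch below uses the `message`, `path` and `validator` attributes on a cut-down version of the person schema; the variable names and the sorting step (used only to get a stable order) are choices made for this example.

```python
import json
from jsonschema import Draft4Validator

# reduced version of the person schema, just for this illustration
schema = {
    "type": "object",
    "properties": {
        "middleInitial": {"type": "string", "minLength": 1, "maxLength": 1},
        "age": {"type": "integer", "minimum": 0},
    },
}
instance = json.loads('{"middleInitial": 100, "age": "Unknown"}')

validator = Draft4Validator(schema)
# sort by the path of the offending value so the order is deterministic
errors = sorted(validator.iter_errors(instance), key=lambda e: list(e.path))

for e in errors:
    # path: where in the instance; validator: which keyword failed
    print(list(e.path), e.validator, e.message, sep=' | ')
```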


When we print an individual error, `print` uses its `__str__`, which gives more information:

In [35]:
print(errors[0])
print('-'*80)
print(errors[0])

100 is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    100
--------------------------------------------------------------------------------
100 is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    100


To look at another example, we can see in the schema that `eyeColor` is not required, but if it's passed in, then it must be one of the listed string options. In this example, `eyeColor` is the **enumerated type** and "amber", "blue", "brown", etc. are the **enumerators**.

Also, `"middleInitial"` must be a single character string.

In [38]:
p3 = '''
    {
        "firstName": "John",
        "middleInitial": null,
        "lastName": "Cleese",
        "eyeColor": "blue-gray"
    }
'''

for error in validator.iter_errors(json.loads(p3)):
    print(error, end='\n-----------\n')    

None is not of type 'string'

Failed validating 'type' in schema['properties']['middleInitial']:
    {'maxLength': 1, 'minLength': 1, 'type': 'string'}

On instance['middleInitial']:
    None
-----------
'blue-gray' is not one of ['amber', 'blue', 'brown', 'gray', 'green', 'hazel', 'red', 'violet']

Failed validating 'enum' in schema['properties']['eyeColor']:
    {'enum': ['amber',
              'blue',
              'brown',
              'gray',
              'green',
              'hazel',
              'red',
              'violet'],
     'type': 'string'}

On instance['eyeColor']:
    'blue-gray'
-----------
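
The examples so far only trigger `type` and `enum` failures. As a complementary sketch, the snippet below shows `required` and `additionalProperties` failing as well. The `strict_schema` and `candidate` names and the `nickname` key are made up for this example; they are not part of the person schema above.

```python
from jsonschema import Draft4Validator

# hypothetical stricter schema: both names are mandatory, no extra keys
strict_schema = {
    "type": "object",
    "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
    },
    "required": ["firstName", "lastName"],
    "additionalProperties": False,
}

validator = Draft4Validator(strict_schema)

# lastName is missing and nickname is not declared in "properties"
candidate = {"firstName": "John", "nickname": "JC"}

failed_keywords = sorted(e.validator for e in validator.iter_errors(candidate))
print(failed_keywords)
print(validator.is_valid(candidate))
```

Here `is_valid` is a convenient shortcut when we only need a yes/no answer rather than the individual errors.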


So JSON Schema paired with this library is a great way to ensure a JSON document conforms to some specific schema. It is useful even when you create your own JSON serializer to make sure you are conforming to your own pre-determined schema - especially useful in unit testing to make sure you did not miss something when serializing your objects to JSON.
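
As a rough sketch of the unit-testing idea, a test can serialize an object, load the result back and assert that the validator finds no errors. The `serialize_person` function and the reduced schema below are hypothetical stand-ins for your own serializer and schema, and the test is written as a plain pytest-style function.

```python
import json
from jsonschema import Draft4Validator

# reduced person schema, assumed for this example
PERSON_SCHEMA = {
    "type": "object",
    "properties": {
        "firstName": {"type": "string", "minLength": 1},
        "lastName": {"type": "string", "minLength": 1},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["firstName", "lastName"],
}


def serialize_person(first_name, last_name, age=None):
    # hypothetical custom serializer whose output we want to check
    data = {"firstName": first_name, "lastName": last_name}
    if age is not None:
        data["age"] = age
    return json.dumps(data)


def test_serializer_output_conforms_to_schema():
    payload = json.loads(serialize_person("John", "Cleese", 79))
    errors = list(Draft4Validator(PERSON_SCHEMA).iter_errors(payload))
    # an empty list means the serialized JSON matches the schema
    assert errors == [], [e.message for e in errors]
```

If the serializer later forgets a required key or emits the wrong type, the test fails and the assertion message lists what went wrong.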

But all this does not address the other issue we have - serializing and deserializing Python objects to and from JSON strings (marshalling). This is addressed in the next section.

# 08 - Marshmallow

The original videos showed how this could be done using the Marshmallow library; however, Pydantic is now recommended. A full summary of Fred Baptiste's Pydantic course can also be found in this repo.

# 09 - YAML

# 10 - Serpy